Regex stumper

General questions about using TextPad

Moderators: AmigoJack, bbadmin, helios, Bob Hansen, MudGuard

webmasta
Posts: 169
Joined: Mon Jul 28, 2003 8:16 pm
Location: Toronto

Post by webmasta »

John "Harry" "Timothy Jessica Mel" "Pat" Ginger

How do I split this string so that the output is
John
"Harry"
"Timothy Jessica Mel"
"Pat"
Ginger

In other words, split on a space only if it's NOT between "quotes".
Also, some strings could very well have a "quoted" value at the end, or could very well start with a quoted value.

e.g. "John" Harry "Timothy Jessica Mel" Pat "Ginger"
Weird data, I know...

UNIX please,
Thnx.

webM
Bob Hansen
Posts: 1516
Joined: Sun Mar 02, 2003 8:15 pm
Location: Salem, NH

Post by Bob Hansen »

Got it.........

How about this?

Three passes (Using underscore "_" to represent space character):

Start at beginning
First pass, replace "_" with "\n"

Start at beginning
Second pass, replace _" with \n"

Start at beginning
Third pass, replace "_ with "\n
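
For anyone scripting this rather than running it by hand, the three passes above can be sketched in Python as plain literal replacements, with no regex needed. The function name three_pass_split is made up for illustration; this is only a sketch of the passes as described, not a TextPad macro.

```python
def three_pass_split(line):
    """Apply the three literal replacements in order (hypothetical helper)."""
    # Pass 1: closing quote, space, opening quote -> put the break between the quotes
    line = line.replace('" "', '"\n"')
    # Pass 2: a space before a quote -> break before the quote
    line = line.replace(' "', '\n"')
    # Pass 3: a space after a quote -> break after the quote
    line = line.replace('" ', '"\n')
    return line


print(three_pass_split('John "Harry" "Timothy Jessica Mel" "Pat" Ginger'))
print(three_pass_split('"John" Harry "Timothy Jessica Mel" Pat "Ginger"'))
```

One limitation to keep in mind: two unquoted words next to each other (for example John Harry) have no quote between them, so none of the three passes splits them.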

Works for me.......should work for you too.........good luck,
Bob
==================================
I'm a humble man..............and PROUD of it!
webmasta
Posts: 169
Joined: Mon Jul 28, 2003 8:16 pm
Location: Toronto

Post by webmasta »

That was smart, using the three passes. I was trying to do it in one pass with a Unix regex... That is not easy, dude. :lol:

However .. the crunch comes when there are lines that start with a quote
and end with a quote.

Thnx for the pointer .. i'll figure it out.

webM
Bob Hansen
Posts: 1516
Joined: Sun Mar 02, 2003 8:15 pm
Location: Salem, NH

Post by Bob Hansen »

You noted:
when there are lines that start with the quote and end with the quote.
Yeah, easy for you to say with no example! :D :D

I suspect that just means a 4th/5th pass replacing combinations of \n, spaces, and "
webmasta
Posts: 169
Joined: Mon Jul 28, 2003 8:16 pm
Location: Toronto

Post by webmasta »

LOL Bob .. the example is in the first post....

And actually, I didn't want to replace the quotes; I wanted to split the string on whitespace ONLY if that space is NOT between "quotes".
In other words, split on a space only if it's NOT between "quotes".
Also, some strings could very well have a "quoted" value at the end, or could very well start with a quoted value.

e.g. "John" Harry "Timothy Jessica Mel" Pat "Ginger"
Weird data, I know...
Still crunching at it here .. this one is a back breaker/head banger using regex.

Right now I feel like giving this data file its first and only flying lesson >>> through the window.

webM
Bob Hansen
Posts: 1516
Joined: Sun Mar 02, 2003 8:15 pm
Location: Salem, NH

Post by Bob Hansen »

You mentioned that:
the example is in the first post
The solution I used was tested on your example, and gave the exact results that you showed.

If it's a problem to go through three passes, how about making a macro to do that for you:
CTRL-HOME
Find and Replace PASS1
CTRL-HOME
Find and Replace PASS2
CTRL-HOME
Find and Replace PASS3
============================

By demanding a single REGEX you are going to really test me. I look at these as challenges, and it forces me to learn new things. But I'm not in the mood right now, and time is tight. GIMME A BREAK, will you? :cry: :D :cry:
webmasta
Posts: 169
Joined: Mon Jul 28, 2003 8:16 pm
Location: Toronto

Post by webmasta »

GIMME A BREAK, will you?
Take two or even three. Then come back and read my first post.... :wink:

Your instructions replace the quotes with \n... not right; they should replace SPACES with \n if the space DOES NOT fall within quotes.

No sweat... but the reason why I was trying to do it with regex is that I have to write a Perl script to retrieve the file from the web, push the lines into @array, massage the data and write to a new file. I was only testing the outcome in TP before setting the script loose.

I was trying to employ the same 3/4 pass technique with regex in the script, but I'm having a hard time keeping track of when a " has been passed, so as not to split on the next space but continue until the next " and then split.
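
The "remembering when the quote is passed" part can be done without regex by walking the line one character at a time and flipping a flag on every quote. Below is a minimal sketch in Python rather than Perl (the same logic carries over); the function name split_outside_quotes is hypothetical, and it assumes quotes are never escaped or nested.

```python
def split_outside_quotes(line):
    """Split on spaces that are not inside double quotes; the quotes are kept."""
    tokens = []
    current = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes  # toggle the flag on every quote character
            current.append(ch)
        elif ch == ' ' and not in_quotes:
            if current:  # ignore empty tokens caused by repeated spaces
                tokens.append(''.join(current))
                current = []
        else:
            current.append(ch)
    if current:  # flush the last token
        tokens.append(''.join(current))
    return tokens


for token in split_outside_quotes('"John" Harry "Timothy Jessica Mel" Pat "Ginger"'):
    print(token)
```

Because the flag is the only state, an unbalanced quote simply makes the rest of the line one token instead of raising an error.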

The dope who came up with that data structure is job hunting now.... so much for that bright spark.

webM
s_reynisson
Posts: 939
Joined: Tue May 06, 2003 1:59 pm

Post by s_reynisson »

Give this a "flying lesson" ;)

Pass 1:
_("[a-zA-Z]+_*[a-zA-Z]*_*[a-zA-Z]*") -> \n\1

Pass 2:
"_([a-zA-Z]) -> "\n\1

Using POSIX and _=space

"John" Harry "Timothy Jessica Mel" Pat "Ginger" 

becomes

"John"
Harry 
"Timothy Jessica Mel"
Pat 
"Ginger"
Add as many middle names as you need, i.e. _*[a-zA-Z]* in pass 1.
Also beware that I'm only using a-z to grab a name...
Hmm, on my n-th edit here, but is there any way to grab all visible chars?
Might be handy to be sure you're getting all names.
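
To try the two passes outside TextPad, here is a rough Python sketch with the underscores written as real spaces. The first pair of patterns is taken from the post above; the second pair is an assumption on my part, swapping the letter classes for "anything that is not a quote" as one possible answer to the question about grabbing all characters. The names two_pass and the pattern constants are made up for illustration.

```python
import re

# Patterns as posted (letters only, up to three words inside the quotes)
PASS1_LETTERS = r' ("[a-zA-Z]+ *[a-zA-Z]* *[a-zA-Z]*")'
PASS2_LETTERS = r'" ([a-zA-Z])'

# Assumed variant: accept any characters except the quote itself
PASS1_ANY = r' ("[^"]+")'
PASS2_ANY = r'" ([^ "])'


def two_pass(line, pass1=PASS1_LETTERS, pass2=PASS2_LETTERS):
    """Run both substitutions in order and return the multi-line result."""
    line = re.sub(pass1, r'\n\1', line)   # break before a quoted value
    line = re.sub(pass2, r'"\n\1', line)  # break after a closing quote
    return line


print(two_pass('"John" Harry "Timothy Jessica Mel" Pat "Ginger"'))
print(two_pass('Pat "Mary-Jane O\'Neil" Bob', PASS1_ANY, PASS2_ANY))
```

Neither variant splits two unquoted words that sit next to each other, since both passes key on a quote character.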
Bob Hansen
Posts: 1516
Joined: Sun Mar 02, 2003 8:15 pm
Location: Salem, NH

Post by Bob Hansen »

Hello s_reynisson. It looks like you fell into the same trap that I did. He (webmasta) submitted another example in his first posting:
"John" Harry "Timothy Jessica Mel" Pat "Ginger"

This does not come out correctly with your solution, though it works well on the first sample.

I got this result using your solution on the second example:
"John"Harry
"Timothy Jessica Mel"Pat
"Ginger"
============================
My earlier version still works, but it ends up with a different result from what was displayed, since that display was for the first sample. This result looks like what would be expected:
"John"
Harry
"Timothy Jessica Mel"
Pat
"Ginger"

Which looks good to me.

I would add one more pass replacing all " with nothing.
==============================
So I would like to resubmit:

Start at beginning
First pass, replace "_" with "\n"

Start at beginning
Second pass, replace _" with \n"

Start at beginning
Third pass, replace "_ with "\n

Start at beginning
Fourth pass, replace " with nothing, delete them all.

Final result for both examples:
John
Harry
Timothy Jessica Mel
Pat
Ginger
=====================================
Thanks for letting me take a break, but enough for tonight. good luck.
s_reynisson
Posts: 939
Joined: Tue May 06, 2003 1:59 pm

Post by s_reynisson »

hmm, I get

"John"
Harry
"Timothy Jessica Mel"
Pat
"Ginger"

from

"John" Harry "Timothy Jessica Mel" Pat "Ginger"
I think the " are supposed to be left in.

Using POSIX and _=space
p1 _("[a-zA-Z]+_*[a-zA-Z]*_*[a-zA-Z]*") -> \n\1
p2 "_([a-zA-Z]) -> "\n\1
webmasta
Posts: 169
Joined: Mon Jul 28, 2003 8:16 pm
Location: Toronto

Post by webmasta »

Rey .. I gotta sleep and take a break from this.. tomorrow is another day..

First .. I keep getting invalid regex... phew...
POSIX is checked ...all underscores were replaced with spaces.. that means that the first regex starts with a space.

Next and MOST IMPORTANT .. TP is driving me over the wall .. I am already up it.

Been at this since 8 am this morn.. midnight now..
[rant]The darn s/r dialog is so small even at my 800x600 res .. can't drag the box to expand it .. the s/r fields cut off the search terms, the Arial text is hard to read in the search field, letters are so close together you can't select properly, eyes are sh*t right now... rave rave more rave[/rant]

Bob.. you need a vacation .. Rey is right.. the quotes are supposed to be left in...(split on space only when the space is not between quotes)

Don't sweat this... I won't get back to it for a couple of days at least..

Got another headache to deal with... Norton Internet Security's firewall popup blocker.

Thnx guys... will be back... :?
Milonguero
Posts: 3
Joined: Tue Sep 16, 2003 7:08 am

Post by Milonguero »

I think this works (1 pass only :wink: ):

_*\(\([^"][^_"]*\)\|\("[^"]*"\)\)_+ => \1\n

where "_" should be read as space.

However, it cannot cope with syntax errors (such as an unbalanced quote) in the text to parse.
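
For readers who want to check the one-pass expression in a script, here is a small Python sketch. Python's re module writes groups and alternation without backslashes, so the pattern below is the same expression with the escapes dropped and the underscores turned into spaces. The function name one_pass is hypothetical, and Python's backtracking engine is not guaranteed to behave like TextPad's on every input, so treat this as an approximation.

```python
import re

# Optional leading spaces, then either an unquoted word or a quoted value,
# then at least one trailing space that becomes the line break.
ONE_PASS = re.compile(r' *(([^"][^ "]*)|("[^"]*")) +')


def one_pass(line):
    """Replace each token plus its trailing spaces with the token and a newline."""
    return ONE_PASS.sub(r'\1\n', line)


print(one_pass('John "Harry" "Timothy Jessica Mel" "Pat" Ginger'))
print(one_pass('"John" Harry "Timothy Jessica Mel" Pat "Ginger"'))

# An unbalanced quote shows the syntax-error weakness: the quoted value
# is split as if the quote were not there.
print(one_pass('John "Harry Pat'))
```

The last token on a line has no trailing space, so it is left untouched, which is why the output does not end with an extra blank line.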

Time to work - :cry:
bbadmin
Site Admin
Posts: 854
Joined: Mon Feb 17, 2003 8:54 pm

Post by bbadmin »

Nice one Milonguero! I'll just point out that you've used non-POSIX syntax, in case anybody comes across this thread in years to come. With POSIX syntax you lose all the backslashes:

Find what:  *(([^"][^ "]*)|("[^"]*")) +
Replace with: \1\n
(The "Find what" expression begins with a space, so it reads: space, asterisk, then the group.)
Keith MacDonald
Helios Software Solutions
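
If the end goal is a script that pushes the pieces into an array, the same two alternatives (a quoted value, or a run of characters with no space or quote) can be used to collect tokens directly instead of inserting newlines. This is a different technique from the search-and-replace above, sketched in Python with made-up names (TOKEN, tokens), and it assumes quotes are never escaped.

```python
import re

# Either a complete quoted value, or a run of characters containing no space or quote
TOKEN = re.compile(r'"[^"]*"|[^ "]+')


def tokens(line):
    """Return the tokens of one line as a list, with quotes left in place."""
    return TOKEN.findall(line)


for line in ('John "Harry" "Timothy Jessica Mel" "Pat" Ginger',
             '"John" Harry "Timothy Jessica Mel" Pat "Ginger"'):
    print(tokens(line))
```

With this approach a stray, unbalanced quote is silently dropped and the words after it come back as separate tokens, so malformed lines may still need a separate check.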
Milonguero
Posts: 3
Joined: Tue Sep 16, 2003 7:08 am

Post by Milonguero »

Keith! You are so right. Sorry.

I had forgotten TextPad could handle POSIX format.
webmasta
Posts: 169
Joined: Mon Jul 28, 2003 8:16 pm
Location: Toronto

Post by webmasta »

Well WTF .. It works on both examples in one fell swoop .. Keith, where were ya all day yesterday?? love ya.... thanx heaps... :lol: Suddenly my day seems like it's gonna be a good one.

John "Harry" "Timothy Jessica Mel" "Pat" Ginger
"John" Harry "Timothy Jessica Mel" Pat "Ginger"

John
"Harry"
"Timothy Jessica Mel"
"Pat"
Ginger
"John"
Harry
"Timothy Jessica Mel"
Pat
"Ginger"

Still can't understand why Rey's regex was returning invalid regex; I didn't change anything this morning, and Keith's regex works from the word go.

Bob, take that vacation, you need it. :wink:

Thnx again guys,
Last edited by webmasta on Wed Sep 17, 2003 12:08 am, edited 1 time in total.