Regex stumper

webmasta · Post by **webmasta** » Mon Sep 15, 2003 8:25 pm

John "Harry" "Timothy Jessica Mel" "Pat" Ginger

How do I split this string so that the output is
John
"Harry"
"Timothy Jessica Mel"
"Pat"
Ginger

In other words, split on a space only if it's NOT between "quotes".
Also, some strings could very well end with a "quoted" value, or start with one.

e.g. "John" Harry "Timothy Jessica Mel" Pat "Ginger"
Weird data, I know...

UNIX please,
Thnx.

webM
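
For readers who arrive here from a scripting language rather than an editor: one way to get this output is to match the fields themselves instead of hunting for the spaces between them. The following is a minimal Python sketch, assuming quotes are always balanced and never nested. The name `split_outside_quotes` is made up for illustration, and this is a different technique from the search-and-replace passes discussed in the replies.

```python
import re

# A field is either a complete "quoted run" (quotes kept) or a run of
# non-whitespace characters. The quoted alternative is tried first.
FIELD = re.compile(r'"[^"]*"|\S+')

def split_outside_quotes(line):
    """Return the fields of `line`, treating quoted runs as single fields."""
    return FIELD.findall(line)

for field in split_outside_quotes('John "Harry" "Timothy Jessica Mel" "Pat" Ginger'):
    print(field)
```

Because the pattern describes fields rather than separators, it does not matter whether a line starts or ends with a quoted value.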

Bob Hansen · Post by **Bob Hansen** » Mon Sep 15, 2003 9:10 pm

Got it.........

How about this?

Three passes (Using underscore "_" to represent space character):

Start at beginning
First pass, replace "_" with "\n"

Start at beginning
Second pass replace _" with \n"

Start at beginning
Third pass, replace "_ with "\n

Works for me.......should work for you too.........good luck,
Bob
==================================
I'm a humble man..............and PROUD of it!
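
The three passes above are plain literal replacements, so they can be reproduced outside an editor. A rough Python sketch follows, assuming the same order of passes; the function name `three_pass_split` is hypothetical.

```python
def three_pass_split(line):
    """Apply the three literal replacements in order, then split on newlines."""
    line = line.replace('" "', '"\n"')  # pass 1: quote, space, quote
    line = line.replace(' "', '\n"')    # pass 2: space before a quote
    line = line.replace('" ', '"\n')    # pass 3: space after a quote
    return line.split('\n')

print(three_pass_split('John "Harry" "Timothy Jessica Mel" "Pat" Ginger'))
```

Every pass only touches a space that sits next to a quote, so two unquoted words in a row (for example `John Harry "Pat"`) stay joined as `John Harry`. Neither sample line contains that case.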

webmasta · Post by **webmasta** » Mon Sep 15, 2003 9:52 pm

That was smart, using the three passes. I was trying to do it in one pass with UNIX regex... that is not easy, dude.

However, the cruncher comes when there are lines that start with a quote
and end with a quote.

Thnx for the pointer .. i'll figure it out.

webM

Bob Hansen · Post by **Bob Hansen** » Mon Sep 15, 2003 11:34 pm

You noted:

when there are lines that start with a quote and end with a quote.

Yeah, easy for you to say with no example!

I suspect that just means a 4th/5th pass replacing combinations of \n, spaces, and "

webmasta · Post by **webmasta** » Mon Sep 15, 2003 11:46 pm

LOL Bob .. the example is in the first post....

And actually, I didn't want to replace the quotes. I wanted to split the string on whitespace ONLY if that space is NOT between "quotes".

In other words, split on a space only if it's NOT between "quotes".
Also, some strings could very well end with a "quoted" value, or start with one.

e.g. "John" Harry "Timothy Jessica Mel" Pat "Ginger"
Weird data, I know...

Still crunching at it here .. this one is a back breaker/head banger/ using regex.

Right now I feel like giving this data file its first and only flying lesson >>> through the window.

webM

Bob Hansen · Post by **Bob Hansen** » Mon Sep 15, 2003 11:54 pm

You mentioned that:

the example is in the first post

The solution I used was tested on your example, and gave the exact results that you showed.

If it's a problem to go through three passes, how about making a macro to do that for you:
CTRL-HOME
Find and Replace PASS1
CTRL-HOME
Find and Replace PASS2
CTRL-HOME
Find and Replace PASS3
============================

By demanding a single REGEX you are going to really test me. I look at these as challenges, and it forces me to learn new things. But I'm not in the mood right now, and time is tight. GIMME A BREAK, will you?

webmasta · Post by **webmasta** » Tue Sep 16, 2003 12:13 am

GIMME A BREAK, will you?

Take two or even three. Then come back and read my first post....

Your instructions replace the quotes with \n... not right. They should replace SPACES with \n if the space DOES NOT fall within quotes.

No sweat... but the reason I was trying to do it with regex is that I have to write a Perl script to retrieve the file from the web, push the lines into @array, massage the data and write to a new file... I was only testing the outcome in TP before setting the script loose.

I was trying to employ the same 3/4 pass technique with regex in the script, but I'm having a hard time keeping track of when an opening " has been passed, so that the script does not split on the next space but continues to the closing " and splits after it.

The dope who came up with that data structure is job hunting now.... so much for that bright spark.

webM
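
The bookkeeping described above, knowing whether a " has already been passed, can be written as a small character scanner with a single flag instead of a regex. This is only a sketch in Python for illustration (the script in question is Perl, and the loop should port over directly); `split_tracking_quotes` is a made-up name, and balanced quotes are assumed.

```python
def split_tracking_quotes(line):
    """Split on spaces, except while inside a pair of double quotes."""
    fields = []
    current = []
    in_quotes = False  # True after an opening quote, until its closing quote

    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes  # entering or leaving a quoted run
            current.append(ch)         # the quote stays in the field
        elif ch == ' ' and not in_quotes:
            if current:                # a space outside quotes ends the field
                fields.append(''.join(current))
                current = []
        else:
            current.append(ch)         # includes spaces inside quotes

    if current:                        # flush the last field
        fields.append(''.join(current))
    return fields

print(split_tracking_quotes('"John" Harry "Timothy Jessica Mel" Pat "Ginger"'))
```

Runs of several spaces between fields collapse, because a field is only emitted when it is non-empty.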

s_reynisson · Post by **s_reynisson** » Tue Sep 16, 2003 1:22 am

Give this a "flying lesson"

Pass 1:
_("[a-zA-Z]+_*[a-zA-Z]*_*[a-zA-Z]*") -> \n\1

Pass 2:
"_([a-zA-Z]) -> "\n\1

Using POSIX and _=space

"John" Harry "Timothy Jessica Mel" Pat "Ginger" 

becomes

"John"
Harry 
"Timothy Jessica Mel"
Pat 
"Ginger"

Add as many middle names as you need, i.e. one more _*[a-zA-Z]* in pass 1 for each.
Also beware that I'm only using a-z to grab a name...
Hmm, on my n-th edit here, but is there any way to grab all visible chars?
Might be handy to be sure you're getting all names.
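
For anyone who wants to try these two passes outside TextPad, here is a hedged Python translation with the underscores written as real spaces. The function name `two_pass_split` is hypothetical; Python's `re` syntax is assumed to be close enough to the POSIX form for these particular patterns.

```python
import re

def two_pass_split(line):
    """Apply the two search-and-replace passes, then split on newlines."""
    # Pass 1: a space followed by a quoted value of up to three words
    # becomes a newline followed by that quoted value.
    line = re.sub(r' ("[a-zA-Z]+ *[a-zA-Z]* *[a-zA-Z]*")', r'\n\1', line)
    # Pass 2: a closing quote, a space, then a letter: break after the quote.
    line = re.sub(r'" ([a-zA-Z])', r'"\n\1', line)
    return line.split('\n')

print(two_pass_split('"John" Harry "Timothy Jessica Mel" Pat "Ginger"'))
```

On the question of grabbing more than a-z: replacing the letter classes inside the quotes with `[^"]*` (anything that is not a quote) should accept any quoted text, though that widening is only a suggestion and was not tried in this thread.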

Bob Hansen · Post by **Bob Hansen** » Tue Sep 16, 2003 2:18 am

Hello s_reynisson. It looks like you fell into the same trap that I did. He (webmasta) gave a second example in his first post:
"John" Harry "Timothy Jessica Mel" Pat "Ginger"

This does not come out correctly with your solution, but it works well on the first sample.

I got this result using your solution on the second example:
"John"Harry
"Timothy Jessica Mel"Pat
"Ginger"
============================
My earlier version still works on it. The result differs from the output shown in the first post, but that output was for the first sample. This result looks like what would be expected:
"John"
Harry
"Timothy Jessica Mel"
Pat
"Ginger"

Which looks good to me.

I would add one more pass replacing all " with nothing.
==============================
So I would like to resubmit:

Start at beginning
First pass, replace "_" with "\n"

Start at beginning
Second pass replace _" with \n"

Start at beginning
Third pass, replace "_ with "\n

Start at beginning
Fourth pass, replace " with nothing, delete them all.

Final result for both models:
John
Harry
Timothy Jessica Mel
Pat
Ginger
=====================================
Thanks for letting me take a break, but enough for tonight. good luck.

s_reynisson · Post by **s_reynisson** » Tue Sep 16, 2003 2:50 am

hmm, I get

"John"
Harry
"Timothy Jessica Mel"
Pat
"Ginger"

from

"John" Harry "Timothy Jessica Mel" Pat "Ginger"

I think the " are supposed to be left in.

Using POSIX and _=space
p1 _("[a-zA-Z]+_*[a-zA-Z]*_*[a-zA-Z]*") -> \n\1
p2 "_([a-zA-Z]) -> "\n\1

webmasta · Post by **webmasta** » Tue Sep 16, 2003 4:31 am

Rey .. I gotta sleep and take a break from this.. tomorrow is another day..

First .. I keep getting invalid regex... phew...
POSIX is checked ...all underscores were replaced with spaces.. that means that the first regex starts with a space.

Next and MOST IMPORTANT .. TP is driving me over the wall .. I am already up it.

Been at this since 8 am this morn.. midnight now..
[rant]The darn s/r dialog is so small even on my 800x600 res .. cannot pull the box to expand it .. the s/r fields cut off the search terms, the arial text is hard to read in the search field, letters are so close together you cant select properly, eyes are sh*t right now... rave rave more rave[/rant]

Bob.. you need a vacation .. Rey is right.. the quotes are supposed to be left in...(split on space only when the space is not between quotes)

Don't sweat this... I won't get back to it for a couple of days at least..

Got another headache to deal with...Norton Internet Securities firewall popup blocker.

Thnx guys... will be back...

Milonguero · Post by **Milonguero** » Tue Sep 16, 2003 7:22 am

I think this works (1 pass only):

_*\(\([^"][^_"]*\)\|\("[^"]*"\)\)_+ => \1\n

where "_" should be read as space.

However, it cannot cope with syntax errors in the input text.

Time to work -

bbadmin · Post by **bbadmin** » Tue Sep 16, 2003 8:23 am

Nice one Milonguero! I'll just point out that you've used non-POSIX syntax, in case anybody comes across this thread in years to come. With POSIX syntax you lose all the backslashes:

Find what: *(([^"][^ "]*)|("[^"]*")) +
Replace with:\1\n

Keith MacDonald
Helios Software Solutions
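
The one-pass expression also carries over to a script more or less unchanged. Below is a small Python sketch, assuming Python's `re` module, which uses the same unescaped parentheses and `|` as the POSIX form above. The names `ONE_PASS` and `one_pass_split` are made up for illustration.

```python
import re

# Optional leading spaces, then either an unquoted word or a complete
# quoted value (captured as group 1), then at least one trailing space.
ONE_PASS = re.compile(r' *(([^"][^ "]*)|("[^"]*")) +')

def one_pass_split(line):
    """Replace each 'field + spaces' with 'field + newline', then split."""
    replaced = ONE_PASS.sub(lambda m: m.group(1) + '\n', line)
    return replaced.split('\n')

print(one_pass_split('John "Harry" "Timothy Jessica Mel" "Pat" Ginger'))
print(one_pass_split('"John" Harry "Timothy Jessica Mel" Pat "Ginger"'))
```

The last field on a line is never matched, since no space follows it, and is simply left in place. A line that ends in trailing spaces would therefore produce an empty final field in this sketch.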

Milonguero · Post by **Milonguero** » Tue Sep 16, 2003 12:38 pm

Keith! You are so right. Sorry.

I had forgotten TextPad could handle POSIX format.

webmasta · Post by **webmasta** » Tue Sep 16, 2003 2:15 pm

Well WTF .. It works on both examples in one fell swoop .. Keith, where were ya all day yesterday?? love ya.... thanx heaps...

Suddenly my day seems like it's gonna be a good one.

John "Harry" "Timothy Jessica Mel" "Pat" Ginger
"John" Harry "Timothy Jessica Mel" Pat "Ginger"

John
"Harry"
"Timothy Jessica Mel"
"Pat"
Ginger
"John"
Harry
"Timothy Jessica Mel"
Pat
"Ginger"

Still can't understand why Rey's regex was returning "invalid regex". I didn't change anything this morning, and Keith's regex works from the word go.

Bob, take that vacation, you need it.

Thnx again guys,