RegEx

trespasser · Post by **trespasser** » Thu Feb 02, 2006 12:29 pm

Hi,

I am wondering whether TextPad/RegEx can do the following. I have a large number of files that I want to extract certain pieces of information from. The files look like this (only a small sample of the total contents):

<Paragrap>This is going to be a large piece of text<Paragraph/><HouseStyle>Detached<HouseStyle/><Price>£300,000<Price/>
<Beds>3<Beds/>

For each file in a directory I want a row that will be formatted like this:

Paragraph HouseStyle Price Beds
This is going.... Detached £300,000 3

Is this at all possible or is it just wishful thinking?

Thanks PD

s_reynisson · Post by **s_reynisson** » Thu Feb 02, 2006 2:33 pm

The formatting can be done. Is the data on a single line or on multiple lines? Please post your sample using the Code tag. TextPad can handle single lines, but for multiple lines you need a regex engine that can match across lines; WildEdit is one such tool.

trespasser · Post by **trespasser** » Sat Feb 04, 2006 8:51 pm

Hi,

Thanks for the reply to my posting. The actual data is XML. I am afraid that I can't post the exact file for security reasons, but it is in the same format as my example:

There is Data above the TransformedXml Tag

<TransformedXml>
<Paragrap>This is going to be a large piece of text<Paragraph/><HouseStyle>Detached<HouseStyle/><Price>£300,000<Price/>
<Beds>3<Beds/><TransformedXml/>

There is Data below the TransformedXml Tag

So the pieces of data that I need to strip out are halfway through the XML file.

The data is on multiple lines and all files are in one folder with the same format.

Sorry I can't be any more specific. I hope you can help.

Regards PD 8)

s_reynisson · Post by **s_reynisson** » Sat Feb 04, 2006 10:10 pm

OK, WildEdit it is. To use it on files larger than 10KB you'll need to register.

First clean your files of newlines within the Paragraph tags.
Something like
Find
(<Paragrap>.*?)\r\n(.*?<Paragraph/>)
Replace
$1$2

Before you do that, you need to tick "'.' does not match a newline
character" in the options. Narrow the search in this first step on the
Paragraph tag as needed; I'm just grabbing them all.
Repeat this until WildEdit reports zero changes made (check the log tab).
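
If you would rather script this cleanup than run it in WildEdit, something along these lines might do. It is only a rough Python sketch and not part of the WildEdit procedure: it takes a different route from the repeated Find/Replace and strips every line break between the Paragraph tags in one pass. The function name join_paragraph_lines is made up for illustration, and the sketch assumes the tags are spelled exactly as in the sample.

```python
import re

# Assumption: the tags are spelled as in the sample in this thread
# (<Paragrap> ... <Paragraph/>) and line endings are \r\n or \n.
# re.DOTALL lets '.' run across line breaks inside the paragraph.
PARAGRAPH = re.compile(r"<Paragrap>.*?<Paragraph/>", re.DOTALL)


def join_paragraph_lines(text):
    """Remove line breaks that fall between the Paragraph tags."""
    return PARAGRAPH.sub(
        lambda m: re.sub(r"\r?\n", "", m.group(0)),
        text,
    )
```

Like the $1$2 replacement, this joins the lines without inserting a space, and it leaves line breaks outside the Paragraph tags alone.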

Next clear the tick for "'.' does not match a newline character" in the
options.
Find - this is all on one line
<TransformedXml>.*?<Paragrap>(.*?)<Paragraph/>.*?<HouseStyle>(.*?)<HouseStyle/>.*?<Price>(.*?)<Price/>.*?<Beds>(.*?)<Beds/>.*?<TransformedXml/>
Replace - four lines

<TransformedXml>
Paragraph HouseStyle Price Beds
$1 $2 $3 $4
<TransformedXml/>

A word of warning to cover my royal beh*: I've only tried this on a very
small sample of data, so take care, back up, etc.
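
For comparison, the second Find/Replace can be approximated in a script. This is a hedged Python sketch, assuming the tag spellings from the sample (including <Paragrap> and the trailing-slash end tags); the names RECORD, to_row and rewrite are invented for the example. re.DOTALL plays the role of clearing the "'.' does not match a newline character" option.

```python
import re

# Assumption: tag spellings follow the sample in this thread.
# re.DOTALL makes '.' match newlines, so the pattern can span lines.
RECORD = re.compile(
    r"<TransformedXml>.*?<Paragrap>(.*?)<Paragraph/>"
    r".*?<HouseStyle>(.*?)<HouseStyle/>"
    r".*?<Price>(.*?)<Price/>"
    r".*?<Beds>(.*?)<Beds/>.*?<TransformedXml/>",
    re.DOTALL,
)


def to_row(match):
    """Build the four-line replacement from the four captured groups."""
    return "\n".join([
        "<TransformedXml>",
        "Paragraph HouseStyle Price Beds",
        " ".join(match.groups()),
        "<TransformedXml/>",
    ])


def rewrite(text):
    """Replace each TransformedXml block; text outside it is untouched."""
    return RECORD.sub(to_row, text)
```

As with the WildEdit version, try it on copies of the files first.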

trespasser · Post by **trespasser** » Wed Feb 08, 2006 12:12 pm

Hi there,

Thanks again for the assistance, and my apologies for not getting back to you sooner. I have tried what you suggested, and my new data layout is the same as my old one.

I might be doing something wrong, but these are the steps that I carried out:

I ticked "'.' does not match a newline character".

Then I ran the

Find
(<Paragrap>.*?)\r\n(.*?<Paragraph/>)
Replace
$1$2

Then I un-ticked the box that I had previously ticked

Then I did a find

TransformedXml>.*?<Paragrap>(.*?)<Paragraph/>.*?<HouseStyle>(.*?)<HouseStyle/>.*?<Price>(.*?)<Price/>.*?<Beds>(.*?)<Beds/>.*?<TransformedXml/>

And Replaced it with

<TransformedXml>
Paragraph HouseStyle Price Beds
$1 $2 $3 $4
<TransformedXml/>

Have I misunderstood your answer? I may be being a bit dim; I have never used WildEdit before, so all replies have to be very, very, very simple.

ben_josephs · Post by **ben_josephs** » Wed Feb 08, 2006 1:04 pm

Your examples are not XML. The forward slash in an end tag is in front of the tag name, not after it. I have corrected this in my example below. I have also corrected the misspelling of Paragraph.

You have not made your requirements clear and you have not explained what doesn't work.

Do you want the items laid out in columns? How do you want the large piece of text arranged?

You can't output fixed-width columns if the items in one column are of different widths. But you can approximate them with tabs.

I would try something like this as a starting point, with '.' does not match a newline character not selected:

Find what:
<Paragraph>(.*?)</Paragraph>\s*<HouseStyle>(.*?)</HouseStyle>\s*<Price>(.*?)</Price>\s*<Beds>(.*?)</Beds>

Replace with:
Paragraph\tHouseStyle\tPrice\t\tBeds
$1
\t\t$2\t$3\t$4
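
If a script is an option, the original request (one row per file in a folder) could be sketched roughly as below. This is only an illustration in Python under several assumptions: the tags are well-formed as in the corrected pattern above, the files are UTF-8, and they match *.xml. The names extract_row and rows_for_folder are hypothetical. Unlike the replacement above, it puts all four fields on a single tab-separated line.

```python
import re
from pathlib import Path

# Assumption: well-formed end tags (</Paragraph> etc.).
# re.DOTALL corresponds to leaving "'.' does not match a newline
# character" unselected.
RECORD = re.compile(
    r"<Paragraph>(.*?)</Paragraph>\s*"
    r"<HouseStyle>(.*?)</HouseStyle>\s*"
    r"<Price>(.*?)</Price>\s*"
    r"<Beds>(.*?)</Beds>",
    re.DOTALL,
)


def extract_row(text):
    """Return one tab-separated row, or None if the tags are not found."""
    match = RECORD.search(text)
    if match is None:
        return None
    # Collapse runs of whitespace so each record stays on one line.
    fields = [" ".join(field.split()) for field in match.groups()]
    return "\t".join(fields)


def rows_for_folder(folder, pattern="*.xml"):
    """Yield one row per matching file; the *.xml pattern is a guess."""
    for path in sorted(Path(folder).glob(pattern)):
        row = extract_row(path.read_text(encoding="utf-8"))
        if row is not None:
            yield row
```

Printing "Paragraph\tHouseStyle\tPrice\tBeds" once, followed by each row from rows_for_folder, gives a table that a spreadsheet can open as tab-delimited text.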
