Isolate target string or number

tcebob · Post by **tcebob** » Tue Mar 06, 2012 10:10 pm

Here's a sample line from an .xml file:

<Data ss:Type="String">Robert Shaw Atlanta Symphony</Data></Cell><Cell ss:StyleID="ce2"><Data ss:Type="Number">1986<

This RE will find the "String">Robert Shaw or "Number">1986 . . . :

">[[:alnum:] ]+

I'd like to isolate the [[:alnum:] ]+ so it can be copied and inserted into a spreadsheet. (Maybe even make a macro to do it.) How can I highlight only the string or number?

tcebob

sosimple · Post by **sosimple** » Wed Mar 07, 2012 3:28 am

Assuming all the lines in the file look like the line you have shown in the example, then this should work:

If the data (file content) looks like this:

Code: Select all

<Data ss:Type="String">Abc</Data></Cell><Cell ss:StyleID="ce2"><Data ss:Type="Number">1975<.../...>
<Data ss:Type="String">Robert Shaw Atlanta Symphony</Data></Cell><Cell ss:StyleID="ce2"><Data ss:Type="Number">1986<.../...>
<Data ss:Type="String">Def ghi jkl</Data></Cell><Cell ss:StyleID="ce2"><Data ss:Type="Number">2012<.../...>

First, select to use "POSIX" regular expression syntax:

Configure -> Preferences -> Editor

[X] Use POSIX regular expression syntax

Choose Search-Replace... (<F8> key):

Find what: ^<[^>]+>([[:alnum:] ]+)<.+>([[:alnum:] ]+)<.+$
Replace with: \1\n\2\n

[X] Regular expression

[Replace All]

The resulting file content will be:

Code: Select all

Abc
1975

Robert Shaw Atlanta Symphony
1986

Def ghi jkl
2012

Then you should be able to select the entire file and [Copy] it, and [Paste] it into your spreadsheet.

If you need to add "quotes" around the string data, that can be added using the search/replace parameters:

Find what: ^<[^>]+>([[:alnum:] ]+)<.+>([[:alnum:] ]+)<.+$
Replace with: "\1"\n\2\n

[X] Regular expression

[Replace All]

In this case, the resulting file content will be:

Code: Select all

"Abc"
1975

"Robert Shaw Atlanta Symphony"
1986

"Def ghi jkl"
2012

If there are other lines in the file that need to be handled (or deleted), or if there are other requirements, just let me know and I'll have another look.

Kevin

tcebob · Post by **tcebob** » Wed Mar 07, 2012 8:27 pm

Kevin, thanks much for your quick reply. Your macro is well thought out - but unfortunately I didn't give you enough info. The data that I want to isolate appear here and there throughout a very large file, with no eols. If I break up the file into separate lines the specified data can appear anywhere in the line, but most lines do not contain it at all. I enclose a larger sample so's you can get a better view of what's going on. Because some lines contain more than one entry, we can't simply discard the remainder of the line after finding one.

Maybe this needs to be done in stages: place eols before and after each target then throw out all barren lines. I'll work on this. Can you suggest a way to remove a line if it doesn't contain a target?

ss:StyleID="ce1"/><Cell ss:StyleID="ce1"/></Row><Row ss:Height="12.8376"><Cell><Data ss:Type="String">C</Data></Cell>
<Cell><Data ss:Type="String">Mozart K.626</Data></Cell><Cell><Data ss:Type="String">Requiem</Data></Cell><Cell
ss:StyleID="ce2"><Data ss:Type="String">Telarc 80128</Data></Cell><Cell><Data ss:Type="String">Robert Shaw</Data>
</Cell><Cell><Data ss:Type="String">Atlanta Symphony</Data></Cell><Cell ss:StyleID="ce2"><Data ss:Type="Number">
1986</Data></Cell><Cell ss:StyleID="ce2"><Data ss:Type="Number">1<

ps: I think the search needs to include the ". We are looking for ">.

tcebob

sosimple · Post by **sosimple** » Sun Mar 11, 2012 5:21 am

tcebob wrote:ps: I think the search needs to include the ". We are looking for ">.

The searches I have defined below include the quote-mark (") along with the data-type as in:
Search: "(>)(<Data ss:Type="String">[^<]*</Data>)(</Cell>)"

So, to be clear:
1) You are looking to capture all data associated with the "Tag" <Data ss:Type="String"> or <Data ss:Type="Number">.
(or: are you actually looking for any data following (">) ?)
2) These data elements can be located anywhere within the file
3) Multiple of these data elements can and do occur on the same line (actually file is one continuous line)
4) You wish to capture all occurrences of these data items regardless of their order of appearance... for example, you are NOT looking to capture data ONLY in certain cases such as cases when "Number" follows "String".
5) Other data items and/or other text may be present in the file but all of this other data/text is not important.
6) Desired data will contain only letters, numbers, and spaces (specifically, no "<>" characters).

From looking at the data you provided it looks like the "Data" tags are always surrounded by "<Cell ...></Cell>" tags (or at least surrounded by "><"). This may not be entirely important, but it may help in isolating the desired data elements from the non desired data elements.

You mentioned that the file does not contain any eol's (newlines), but in case you find that there are some cases where you do find newlines present, it would be important to know if all of the individual cases of:
<Cell...><Data ss:Type="String">String Data</Data></Cell> and
<Cell...><Data ss:Type="Number">Number Data</Data></Cell>
always begin and end on the same line... in other words, it would be important to know if (or if-not), it would be possible to find any individual desired data elements where the beginning data tag, and the data, and the ending data tag could be split up in such a way that they would not all be on the same line.

Assuming all of that is correct, I would approach this by:
1) Search-replace to find all occurrences of "<Cell...><Data ss:Type="String">String Data</Data></Cell>" and insert a newline between those tags like:
"<Cell...>\n<Data ss:Type="String">String Data</Data>\n</Cell>"
2) Search-replace to find all occurrences of "<Cell...><Data ss:Type="Number">Number Data</Data></Cell>" and insert a newline between those tags like:
"<Cell...>\n<Data ss:Type="Number">Number Data</Data>\n</Cell>"
3) Compose a unique string that either contains one or more characters that are guaranteed to be not present elsewhere in the file, or a "Random" string that is of sufficient length to be unique. Examples of this could be something like "~" or "~~" or "@#b5m345mbhoiuy3kjhg4mnb5hkpoiu34h5kj43h5#@"
4) Using search-replace, add the unique-string to the beginning of ALL lines.
5) For all cases of lines containing: <unique-string><Data ss:Type="String">String Data</Data>, delete "<unique-string>", "<Data ss:Type="String">", and "</Data>" leaving only "String Data".
6) For all cases of lines containing: <unique-string><Data ss:Type="Number">Number Data</Data>, delete "<unique-string>", "<Data ss:Type="Number">", and "</Data>" leaving only "Number Data".
7) Next, for all lines that begin with "<unique-string>", delete the entire line including the terminating newline.

This should leave only lines that contain the "String" and "Number" data, with one value per line.

If you need to format the data in any way, such as putting "quotes" around string data, you can do that in step 5 and 6.

If there are other data "Types" like: <Data ss:Type="Other">More Data</Data> then add additional search-replace steps like 1-2 and 5-6.

If all data will be treated the same, then you can combine steps 1 and 2, and combine steps 5 and 6, using "alteration" like:
1) Search for: "(>)(<Data ss:Type=")(String|Number)(">[^<]*</Data>)(</Cell>)", Replace with: \1\n\2\3\4\n\5, Replace: All
1) Search for: "(>)(<Data ss:Type=")(String|Number|Other)(">[^<]*</Data>)(</Cell>)", Replace with: \1\n\2\3\4\n\5, Replace: All
5) Search for: "^<Unique><Data ss:Type="(String|Number)">([^<]*)</Data>", Replace with: \2, Replace: All
5) Search for: "^<Unique><Data ss:Type="(String|Number|Other)">([^<]*)</Data>", Replace with: \2, Replace: All

If this all looks good, then here are the required search-replace parameters:

First, select to use "POSIX" regular expression syntax:

Configure -> Preferences -> Editor

[X] Use POSIX regular expression syntax

1) Search for: "(>)(<Data ss:Type="String">[^<]*</Data>)(</Cell>)", Replace with: \1\n\2\n\3, Replace: All
Choose Search-Replace... (<F8> key):

Find what: (>)(<Data ss:Type="String">[^<]*</Data>)(</Cell>)
Replace with: \1\n\2\n\3

[X] Regular expression

[Replace All]

2) Search for: "(>)(<Data ss:Type="Number">[^<]*</Data>)(</Cell>)", Replace with: \1\n\2\n\3, Replace: All
Choose Search-Replace... (<F8> key):

Find what: (>)(<Data ss:Type="Number">[^<]*</Data>)(</Cell>)
Replace with: \1\n\2\n\3

[X] Regular expression

[Replace All]

3-4) Create a "Unique" string like: "@#b5m345mbhoiuy3kjhg4mnb5hkpoiu34h5kj43h5#@" or whatever works for you. For this example, I have used: "<Unique>" as the string which may work just fine for you. Insert the "<Unique>" string before the first character on EVERY line...
Search: "^(.?)" Replace with: "<Unique>\1", Replace: All
Choose Search-Replace... (<F8> key):

Find what: ^(.?)
Replace with: <Unique>\1

[X] Regular expression

[Replace All]

5) Search: "^<Unique><Data ss:Type="String">([^<]*)</Data>", Replace with: \1, Replace: All
Choose Search-Replace... (<F8> key):

Find what: ^<Unique><Data ss:Type="String">([^<]*)</Data>
Replace with: \1

[X] Regular expression

[Replace All]

6) Search: "^<Unique><Data ss:Type="Number">([^<]*)</Data>", Replace with: \1, Replace: All
Choose Search-Replace... (<F8> key):

Find what: ^<Unique><Data ss:Type="Number">([^<]*)</Data>
Replace with: \1

[X] Regular expression

[Replace All]

7) Search: "^<Unique>.*\n", Replace with: <nothing>, Replace: All
Choose Search-Replace... (<F8> key):

Find what: ^<Unique>.*\n
Replace with: <nothing> ... Replace with: should be empty to delete these lines

[X] Regular expression

[Replace All]

You can record all these actions in one macro.

If the assumptions I have made about the file content are not correct and the procedure doesn't work, just let me know and I will have another look.

Kevin

Community

Isolate target string or number

Isolate target string or number

Re: Isolate target string or number