Padding a page range

Posted: Fri Jul 13, 2012 8:15 pm
by eressner
I've found several similar questions on the forum, but nothing quite like what I need.

I'm screen-scraping data from a website, splitting it into fields, and entering it in a database. One field is PAGE_RANGE: the starting and ending page numbers of a journal article. The website presents the range in a truncated form, dropping from the second number any leading ("most significant") digits it shares with the first, but I have to store the full starting and ending page numbers.

So we'll see data like the left-hand side below, and I need to turn it into the right-hand side:

156-9 ==> 156-159
45-55 ==> 45-55 (no change needed here because all digits are there)
989-1004 ==> 989-1004 (ditto ... no change needed)
10487-92 ==> 10487-10492

Another way to state the problem is this: if the second number has fewer digits than the first, copy enough of the most significant digits from the first number and prepend them to the second one.

Any help would be very gratefully welcomed ... as would an authoritative "sorry, regex can't do that" so I can stop trying. I know I can do this in Excel, so my Plan B is to use that for a clean-up pass after the data is in the database table.

--Eric Ressner
--St Louis MO USA

Re: Padding a page range

Posted: Fri Jul 13, 2012 8:58 pm
by eressner
eressner wrote: I know I can do this in Excel, so my Plan B is to use that for a clean-up pass after the data is in the database table.
Of course, I can use SQL to do that post-processing, too (DUH), but still, is there a way I can get the right data in on the first pass?

--Eric Ressner again

Posted: Fri Jul 13, 2012 10:07 pm
by ben_josephs
You can't do it in a single step. Doing so would require a far more powerful regex engine than TextPad's (Perl's, for example) or a script (in Perl, for example).

But you can do it rather inconveniently in TextPad in many steps. For example, if the maximum length of a page number is 5 digits, then using "Posix" regular expression syntax:
Configure | Preferences | Editor

[X] Use POSIX regular expression syntax
you can search in turn for each of the following 10 regexes:

\<([0-9]{1})([0-9]{4})-([0-9]{4})\>
\<([0-9]{2})([0-9]{3})-([0-9]{3})\>
\<([0-9]{3})([0-9]{2})-([0-9]{2})\>
\<([0-9]{4})([0-9]{1})-([0-9]{1})\>

\<([0-9]{1})([0-9]{3})-([0-9]{3})\>
\<([0-9]{2})([0-9]{2})-([0-9]{2})\>
\<([0-9]{3})([0-9]{1})-([0-9]{1})\>

\<([0-9]{1})([0-9]{2})-([0-9]{2})\>
\<([0-9]{2})([0-9]{1})-([0-9]{1})\>

\<([0-9]{1})([0-9]{1})-([0-9]{1})\>

using in each case the replacement expression:

\1\2-\1\3
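
For comparison, the script route mentioned above can be sketched in Python rather than Perl. This is only a minimal sketch and not something posted in the thread: the names PAGE_RANGE and expand_page_range are made up for illustration, and it assumes each range is two runs of digits joined by a plain hyphen. It passes a function as the replacement, so a single pass covers every combination of lengths instead of one regex per combination.

```python
import re

# Hypothetical pattern: two runs of digits joined by a hyphen,
# with word boundaries so we do not match inside a longer token.
PAGE_RANGE = re.compile(r"\b(\d+)-(\d+)\b")


def expand_page_range(text):
    """Return text with truncated end pages padded from the start page."""

    def fix(match):
        start, end = match.group(1), match.group(2)
        missing = len(start) - len(end)
        if missing > 0:
            # Copy the leading digits of the start page onto the end page.
            end = start[:missing] + end
        return start + "-" + end

    return PAGE_RANGE.sub(fix, text)


# The four sample ranges from the original question.
for sample in ("156-9", "45-55", "989-1004", "10487-92"):
    print(sample, "==>", expand_page_range(sample))
```

Ranges whose second number is already as long as, or longer than, the first are returned unchanged, which matches the "no change needed" cases in the question.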

Posted: Fri Jul 13, 2012 10:33 pm
by eressner
Thanks a lot, Ben. Yes, I had a feeling it could only be done with explicit string lengths, and quite a lot of different combinations of them. I'm inclined to think that post-processing the entries inside the database is going to be my best bet.

This forum is wonderful, by the way. Excellent response time AND quality! and lots to learn even from others' questions.

--Eric