I've found several similar questions on the forum, but nothing quite like what I need.
I'm screen-scraping data from a website, fielding it, and entering it in a database. One field is PAGE_RANGE, basically, the starting and ending page numbers of a journal article. The website presents the data in a truncated form, by dropping repeated "most significant digits" from the second number, but I have to store the full starting and ending page numbers.
So I'll see data like the left-hand side below, and I need to turn it into the right-hand side:
156-9 ==> 156-159
45-55 ==> 45-55 (no change needed here because all digits are there)
989-1004 ==> 989-1004 (ditto ... no change needed)
10487-92 ==> 10487-10492
Another way to state the problem is this: if the second number has fewer digits than the first, copy enough of the most significant digits from the first number and prepend them to the second one.
Any help very gratefully welcome ... as would an authoritative "sorry, regex can't do that" so I can stop trying. I know I can do this in Excel, so my Plan B is to use that for a clean-up pass after the data is in the database table.
--Eric Ressner
--St Louis MO USA
Padding a page range
Re: Padding a page range
eressner wrote:
I know I can do this in Excel, so my Plan B is to use that for a clean-up pass after the data is in the database table.
Of course, I can use SQL to do that post-processing, too (DUH), but still, is there a way I can get the right data in on the first pass?
--Eric Ressner again
ben_josephs
You can't do it in a single step. That would require a far more powerful regex engine than TextPad's (for example, Perl's) or a script (in, for example, Perl).
But you can do it rather inconveniently in TextPad in many steps. For example, if the maximum length of a page number is 5 digits, then using "Posix" regular expression syntax:
Configure | Preferences | Editor
[X] Use POSIX regular expression syntax
you can search in turn for each of the following 10 regexes:
\<([0-9]{1})([0-9]{4})-([0-9]{4})\>
\<([0-9]{2})([0-9]{3})-([0-9]{3})\>
\<([0-9]{3})([0-9]{2})-([0-9]{2})\>
\<([0-9]{4})([0-9]{1})-([0-9]{1})\>
\<([0-9]{1})([0-9]{3})-([0-9]{3})\>
\<([0-9]{2})([0-9]{2})-([0-9]{2})\>
\<([0-9]{3})([0-9]{1})-([0-9]{1})\>
\<([0-9]{1})([0-9]{2})-([0-9]{2})\>
\<([0-9]{2})([0-9]{1})-([0-9]{1})\>
\<([0-9]{1})([0-9]{1})-([0-9]{1})\>
using in each case the replacement expression:
\1\2-\1\3
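For anyone who takes the script route mentioned above instead, here is a minimal sketch of a single-pass version. It is in Python rather than Perl (my substitution, not something proposed in this thread), and the names PAGE_RANGE and expand_page_range are made up for illustration. It assumes each range is two runs of plain digits separated by a single hyphen.

```python
import re

# Two digit runs separated by a hyphen. The lookarounds keep the match
# from starting or ending in the middle of a longer number.
# (PAGE_RANGE and expand_page_range are hypothetical names.)
PAGE_RANGE = re.compile(r"(?<!\d)(\d+)-(\d+)(?!\d)")


def expand_page_range(text: str) -> str:
    """Restore leading digits dropped from the second page number."""

    def fix(match: re.Match) -> str:
        first, last = match.group(1), match.group(2)
        missing = len(first) - len(last)
        if missing > 0:
            # Borrow the most significant digits from the first number.
            last = first[:missing] + last
        return f"{first}-{last}"

    return PAGE_RANGE.sub(fix, text)


if __name__ == "__main__":
    for sample in ["156-9", "45-55", "989-1004", "10487-92"]:
        print(sample, "==>", expand_page_range(sample))
    # 156-9 ==> 156-159
    # 45-55 ==> 45-55
    # 989-1004 ==> 989-1004
    # 10487-92 ==> 10487-10492
```

Because the replacement is computed by a function rather than a fixed expression like \1\2-\1\3, one pattern covers every combination of lengths, so there is no need to assume a maximum of 5 digits.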
Thanks a lot, Ben. Yes, I had a feeling it could only be done with explicit string lengths, and quite a lot of different combinations thereof. I'm tending to think that post-processing the entry inside the database is going to be my best bet.
This forum is wonderful, by the way. Excellent response time AND quality! and lots to learn even from others' questions.
--Eric