RegEx and non-English Characters

geoffreykidd
Posts: 35
Joined: Thu Aug 02, 2007 8:50 pm

RegEx and non-English Characters

Post by geoffreykidd »

I'm trying to generate a list of all the unique words in a text. The basic approach is to change anything that's NOT a hyphen, a single-quote mark (straight or curly), a "word character" or a digit into a newline character, and then sort the list with case-sensitivity turned on and duplicates deleted.

I've run into a problem with this approach regarding accented characters such as e-grave, o-umlaut etc. etc.

Using [^-'’\w\d\r\n] as my search expression replaces the accented characters as well as the ones I DO want replaced. I've tried putting the accented characters into the search string:

[^-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzßŠŽšžŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'’\r\n]

with the same result.

Is there any setting or workaround for this, please?
geoffreykidd
Posts: 35
Joined: Thu Aug 02, 2007 8:50 pm

Update: my bad

Post by geoffreykidd »

I re-checked the source text I was using and discovered that all the accented characters had already been changed to "?" because that's what TextPad does when the default display font doesn't support accented characters. I changed the default font to Courier New and the accented/non-English characters (like the German capital-B) showed up for the party.

A quick run with [^-'’\w\d\r\n] -> \n followed by case-sensitive sort with duplicate deletion got me my wordlist, including words like café.
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

I was going to say that I could reproduce this, but only if the script (View | Document Properties | Font | Script) was not Western, but you discovered essentially the same thing.

(The German letter ß is not a capital B. It's an Eszett or scharfes S, used in some cases instead of a double s.)
geoffreykidd
Posts: 35
Joined: Thu Aug 02, 2007 8:50 pm

Post by geoffreykidd »

I didn't know that. [grin] It looked like a capital-B.

Thank you.

I must say, IMHO, the new RegEx engine supersedes sliced bread for utility.