RegEx and non-English Characters

geoffreykidd · Post by **geoffreykidd** » Tue Apr 23, 2013 8:51 pm

I'm trying to generate a list of all the unique words in a text. The basic approach is to change anything that's NOT a hyphen, a single-quote mark (straight or curly), a "word character" or a digit into a newline character, and then sort the list with case sensitivity turned on and duplicates deleted.

I've run into a problem with this approach regarding accented characters such as e-grave, o-umlaut, etc.

Using [^-'’\w\d\r\n] as my search function replaces the accented characters, as well as the ones I DO want replaced. I've tried putting the accented characters into the search string:

[^-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzßŠŽšžŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'’\r\n]

with the same result.

Is there any setting or workaround for this, please?

geoffreykidd · Post by **geoffreykidd** » Tue Apr 23, 2013 10:01 pm

I re-checked the source text I was using and discovered that all the accented characters had already been changed to "?" because that's what TextPad does when the default display font doesn't support accented characters. I changed the default font to Courier New and the accented/non-English characters (like the German capital-B) showed up for the party.
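A small Python illustration of why the search could never have matched the accented letters: once a lossy conversion has put "?" in their place, the text the regex sees no longer contains them. This is only an analogy under an assumption, namely that the substitution behaves like encoding to a character set that lacks those letters. It is not TextPad's actual code.

```python
import re

text = "café Straße"

# Assumption for illustration: a lossy conversion that substitutes "?"
# for every character the target character set cannot represent.
lossy = text.encode("ascii", errors="replace").decode("ascii")
print(lossy)

# The "?" marks are ordinary punctuation, so a "not a word character"
# class matches them just like any other separator.
separators_hit = re.findall(r"[^-'’\w\s]", lossy)
print(separators_hit)
```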

A quick run with [^-'’\w\d\r\n] -> \n followed by case-sensitive sort with duplicate deletion got me my wordlist, including words like café.
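For comparison, the same replace, sort and de-duplicate steps can be sketched in Python. This is a rough equivalent, not what TextPad does internally: the function name `unique_words` and the sample sentence are made up for illustration, and the sort here is by Unicode code point, which may order accented words differently from TextPad's case-sensitive sort.

```python
import re

# Same idea as the search pattern in the post: anything that is not a hyphen,
# a straight or curly apostrophe, or a word character counts as a separator.
# In Python 3, \w on str patterns is Unicode-aware and already covers digits,
# so accented letters such as é and ß are kept.
SEPARATORS = re.compile(r"[^-'’\w]+")

def unique_words(text):
    """Return the distinct words in text, sorted case-sensitively."""
    words = (w for w in SEPARATORS.split(text) if w)  # drop empty pieces
    return sorted(set(words))                         # de-duplicate, then sort

sample = "The café’s owner said: Straße, café, the well-known Café!"
for word in unique_words(sample):
    print(word)
```

Because the set is built before sorting, "Café" and "café" remain separate entries, which matches a case-sensitive duplicate deletion.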

ben_josephs · Post by **ben_josephs** » Tue Apr 23, 2013 10:25 pm

I was going to say that I could reproduce this, but only if the script (View | Document Properties | Font | Script) was not Western, but you discovered essentially the same thing.

(The German letter ß is not a capital B. It's an Eszett or scharfes S, used in some cases instead of a double s.)

geoffreykidd · Post by **geoffreykidd** » Tue Apr 23, 2013 10:29 pm

I didn't know that. [grin] It looked like a capital-B.

Thank you.

I must say, IMHO, the new RegEx engine supersedes sliced bread for utility.
