I'm trying to generate a list of all the unique words in a text. The basic approach is to change anything that's NOT a hyphen, a single-quote mark (straight or curly), a "word character" or a digit into a newline character, and then sort the list case-sensitively with duplicates deleted.
I've run into a problem with this approach regarding accented characters such as e-grave, o-umlaut and so on.
Using [^-'’\w\d\r\n] as my search expression replaces the accented characters as well as the ones I DO want replaced. I've tried putting the accented characters into the search string:
[^-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzßŠŽšžŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'’\r\n]
with the same result.
Is there any setting or workaround for this, please?
RegEx and non-English Characters
Update: my bad
I re-checked the source text I was using and discovered that all the accented characters had already been changed to "?", because that's what TextPad does when the default display font doesn't support accented characters. I changed the default font to Courier New, and the accented/non-English characters (like the German ß) showed up for the party.
A quick run with [^-'’\w\d\r\n] -> \n followed by case-sensitive sort with duplicate deletion got me my wordlist, including words like café.
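For anyone who wants to reproduce the same steps outside TextPad, here is a minimal sketch in Python. It assumes Python 3, where `\w` on a text string also matches accented letters; the function name `unique_words` and the sample sentence are made up for illustration, and the sort is a plain code-point sort, which may order some characters differently from TextPad's case-sensitive sort.

```python
import re

# Same character class as in the post: anything that is NOT a hyphen,
# a straight or curly apostrophe, a word character, a digit, CR or LF.
NON_WORD = re.compile(r"[^-'’\w\d\r\n]")

def unique_words(text):
    """Hypothetical helper: return the unique words of text, sorted case-sensitively."""
    # Step 1: replace every unwanted character with a newline.
    lines = NON_WORD.sub("\n", text).splitlines()
    # Step 2: drop empty lines, delete duplicates, sort by code point.
    return sorted({line for line in lines if line})

sample = "The café's well-known café, the Café."
print(unique_words(sample))
# prints ['Café', 'The', 'café', "café's", 'the', 'well-known']
```

Note that this only works if the accented characters are really present in the input, which was the actual problem above.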