RegEx and non-English Characters
Posted: Tue Apr 23, 2013 8:51 pm
I'm trying to generate a list of all the unique words in a text. The basic approach is to change anything that's NOT a hyphen, a single-quote mark (straight or curly), a "word character" or a digit into a newline character, and then sort the list with case sensitivity turned on and duplicates deleted.
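To make the approach concrete, here is a minimal sketch of it in Python rather than in the editor itself. The function name unique_words is made up for illustration, and it assumes Python 3, where \w is Unicode-aware for text patterns; that may well not be true of the editor's regex engine.

```python
import re

# Characters to keep: hyphen, straight and curly single quotes, and word
# characters. In Python 3, \w on str patterns is Unicode-aware and already
# covers digits, so \d is not needed here.
NON_WORD = re.compile(r"[^-'’\w]+")

def unique_words(text):
    """Return the distinct words in text, sorted case-sensitively."""
    # Turn every run of unwanted characters into a newline, then split.
    lines = NON_WORD.sub("\n", text).split("\n")
    # The set removes duplicates; sorted() compares code points, so
    # uppercase letters sort before lowercase ones.
    return sorted({word for word in lines if word})
```

Calling unique_words on a short French sentence should keep words such as "crème" intact, which is the behaviour I am after in the editor.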
I've run into a problem with this approach regarding accented characters such as e-grave, o-umlaut, etc.
Using [^-'’\w\d\r\n] as my search expression replaces the accented characters as well as the ones I DO want replaced. I've tried putting the accented characters into the search string explicitly:
[^-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzßŠŽšžŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'’\r\n]
with the same result.
Is there any setting or workaround for this, please?
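For reference, the symptom with \w can be mimicked outside the editor. The sketch below uses Python's re.ASCII flag to force an ASCII-only \w; it is only an assumption that the editor's engine behaves in a similar way, and it does not account for the explicit character list failing as well.

```python
import re

sample = "crème brûlée"
pattern = r"[^-'’\w\r\n]"

# Unicode-aware \w (Python 3 default for str): accented letters are kept,
# so only the space is replaced.
unicode_result = re.sub(pattern, "\n", sample)

# ASCII-only \w: accented letters count as non-word characters and are
# replaced along with the space, which splits the words apart.
ascii_result = re.sub(pattern, "\n", sample, flags=re.ASCII)
```

With the ASCII-only flag, each accented letter becomes a newline, which matches what I see in the editor.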