Finding Non-ASCII characters

Posted: Thu Dec 03, 2015 3:19 pm
by redcairo
Hi. Search did not turn up anything helpful on this topic.

I work with a system that chokes on any file containing a non-ASCII character in the text, an MS Word "smart quote" being one example. But these are fiendishly difficult to spot in a large body of text.

I'm looking for a regular expression that will find any character beyond the standard keyboard characters, so I can track down whatever is buried in these files and causing the error.

Would be SUPER appreciative if anyone could give me a clue on how to go about this. I've worked hard on regex stuff, but so far it's always been on regular chars.

RC (PJ)

Posted: Thu Dec 03, 2015 6:45 pm
by ben_josephs
Is this what you need:
[\x80-\xFF]
?

Posted: Fri Dec 04, 2015 3:17 am
by redcairo
Thank you! I was just coming back here to paste in:

[^\x00-\x7F]

Which I found elsewhere and believed was the answer. I note they're a bit diff...

Posted: Fri Dec 04, 2015 6:52 am
by ben_josephs
[\x80-\xFF] means every character in the range hex 80 (128) to FF (255).
[^\x00-\x7F] means every character not in the range hex 00 (0) to 7F (127).

They are equivalent if the text consists entirely of 8-bit characters. Yours is better because it works with characters of arbitrary width.
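
For anyone who wants to check the difference outside the editor, here is a rough sketch in Python. Python's re module is an assumption on my part (the thread does not say which regex engine is in use), and the find_non_ascii helper and the sample text are made up for illustration. It reports the line and column of each non-ASCII character and shows that a curly quote (U+201C) is caught by the negated class but not by the 80-FF range.

```python
import re

# The two character classes discussed above.
HIGH_8BIT = re.compile(r"[\x80-\xFF]")   # code points 128..255 only
NON_ASCII = re.compile(r"[^\x00-\x7F]")  # anything above code point 127

def find_non_ascii(text):
    """Return (line, column, character) for each non-ASCII character.

    Hypothetical helper; lines and columns are 1-based.
    """
    hits = []
    # Split on "\n" only, so that non-ASCII line separators are
    # reported as hits rather than silently treated as line breaks.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in NON_ASCII.finditer(line):
            hits.append((line_no, match.start() + 1, match.group()))
    return hits

# Illustrative sample: curly quotes (U+201C, U+201D) and e-acute (U+00E9).
sample = "plain text\nshe said \u201chello\u201d\ncaf\u00e9"

for line_no, col, ch in find_non_ascii(sample):
    print(f"line {line_no}, column {col}: U+{ord(ch):04X}")

# The 80-FF range only sees the e-acute; the curly quotes lie beyond FF.
print(HIGH_8BIT.findall(sample))
```

On text that is already decoded to Unicode, the 80-FF range misses the smart quotes entirely, which is why the negated class is the safer choice.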