I need to search all open files OR a folder of files (OR, if all else fails, at least a single file) to find any char that is not, basically, what you would see on the keyboard, and either output a list or mark them.
[Reason: I need to replace any non-keyboard character with its unicode equivalent. I can do that, but sometimes in big text files, frankly some little unicode character gets missed, like an o with an umlaut or a dash that turns out to be a microsoft endash or something like that. When I zip my files to import them into my corp software, it barfs if there is even one binary char.]
So I want a way to search and find "any non-keyboard char" basically.
I searched on this and found a lot about binary files and some about chars but nothing about searching for chars based on that classification.
Best,
PJ
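For reference, one way to express "any non-keyboard char" is as a regular expression that matches everything outside printable ASCII plus tab and line breaks. The sketch below is only an illustration in Python, not a TextPad feature: the names `find_non_keyboard` and `scan_folder`, the `*.txt` pattern, and the UTF-8 encoding are all assumptions chosen for the example.

```python
import re
from pathlib import Path

# Assumption: "keyboard" means printable ASCII (space through ~) plus tab, CR, LF.
NON_KEYBOARD = re.compile(r"[^\x20-\x7E\t\r\n]")

def find_non_keyboard(text):
    """Return (line, column, char, 'U+XXXX') for each offending char, 1-based."""
    hits = []
    # Split on "\n" only, so unusual separators such as U+2028 are still reported.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in NON_KEYBOARD.finditer(line):
            ch = match.group()
            hits.append((line_no, match.start() + 1, ch, f"U+{ord(ch):04X}"))
    return hits

def scan_folder(folder, pattern="*.txt", encoding="utf-8"):
    """Print every offending char found in files under `folder` (hypothetical layout)."""
    for path in sorted(Path(folder).rglob(pattern)):
        text = path.read_text(encoding=encoding, errors="replace")
        for line_no, col, ch, code in find_non_keyboard(text):
            print(f"{path}:{line_no}:{col}: {ch!r} {code}")
```

Calling `scan_folder("some_folder")` would list each hit as file, line, column, the character, and its code point, which makes it harder to miss one in the middle of a long text.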
searching for binary chars
ben_josephs
Hmmn, maybe I don't know the right term.
That script seemed to mark every line that had a ><.: but didn't find the 'sample variable' chars I was looking for just as a test (degree, an a with umlaut, 1/2, all chars copied from MS Word (copy/paste) to text).
We take Word output from an editorial group (like Q&A stuff, no code in it) which tends to have symbols, math chars, foreign chars, endashes, stuff like that. We have to reformat it for parsing and make it viable in an XML unicode environ. So everything that isn't 'plain text' I guess you could say, has to get its unicode numeric entity replaced. Which is easy except first you have to FIND all those chars. (Certain ones get 'lost' (invisible) in the copy from Word to TextPad but we know to look for those. All the rest do copy over and are visible in TextPad, but in the midst of a lot of text, too easy to miss one here and there.)
Non-plain-text? I thought if it wasn't plain text it meant it was binary, but I'm sorry, I must have the wrong term. Aside from ampersand (ignore that, it's its own issue), everything that I can see on my keyboard is fine in the content. Everything ELSE I need to replace with unicode entities.
Does that help??
PJ
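The replacement step described above (everything not on the keyboard becomes its numeric entity, ampersand left alone) could be sketched roughly as follows. This is a Python illustration under assumptions, not something TextPad does by itself: the function name `to_numeric_entities` is made up for the example, and "keyboard" is taken to mean printable ASCII plus tab and line breaks.

```python
import re

# Assumption: printable ASCII, tab, CR and LF are "plain text"; all else is replaced.
NON_KEYBOARD = re.compile(r"[^\x20-\x7E\t\r\n]")

def to_numeric_entities(text):
    """Replace each non-keyboard char with a decimal numeric entity like &#176;."""
    # Ampersands are ASCII, so they pass through untouched (handled separately).
    return NON_KEYBOARD.sub(lambda m: f"&#{ord(m.group())};", text)

if __name__ == "__main__":
    sample = "45\u00b0 \u2013 \u00e4"  # degree sign, en dash, a with umlaut
    print(to_numeric_entities(sample))
```

One caveat with this sketch: stray control characters would also be turned into entities, and XML 1.0 does not accept references to most of them, so those are better reported and removed by hand. Python's built-in `text.encode("ascii", "xmlcharrefreplace")` does a similar conversion for everything above ASCII, but it leaves control characters as they are.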