Unicode Conformance

PCPete · Post by **PCPete** » Fri Sep 07, 2007 7:42 am

I'm currently thrashing various software companies to include "basic" (i.e. non-corrupting) unicode support in their programs.

The companies are Adobe (Audition and Premiere Pro), Microvision Development (Surething CD Labeler), Yahoo (Musicmatch Jukebox), MySQL AB (MySQL), the Firebird development teams, and now Helios.

Dealing with international (or what we used to call NLS, National Language Support) character sets is just not optional anymore. It's not a matter of pandering to noisy rabblerousers (shhh! Sit down! They'll see us!), it's a critical facet of real world work.

Jeebers, I'm an audio engineer, and I'm having to drag database developers, audio and video editors, and ancillary developers kicking and screaming into the 21st century, and with ONE exception (Microvision), they're really kicking and screaming like babies. It's embarrassing to deal with these people sometimes, I can tell you that for nothing. Idiots at the wheel, every time.

So Helios, please extract your creative digit and get full (and I do mean FULL) unicode support into Textpad as quickly and calmly as possible. I don't have any argument for introducing the changes step by step, and I'd be delighted to be of some assistance debugging or trying stuff out; but you've gotta change.

I'm currently manually editing a 23 megabyte XML output file from Microsoft Access in Notepad on an x64 system. It takes 26 minutes to load the file. It takes more than 50 seconds to jump down 10 pages (~90 lines per display page), and it takes 11 (eleven) minutes to find a simple 6-character unique text string in this file.

So you really don't have any competition. Quadruple the memory? I'll double that and raise you triple. In the time it takes you guys to come up with an answer, I might almost be finished editing my database file. Maybe.

OK, so this is a bit of an extreme case, but seriously, you can make a killing, AND keep us rabble happy at the same time! What a win-win situation!

gpuk · Post by **gpuk** » Mon Sep 17, 2007 2:39 pm

I am a long-term user and fan of TextPad and swear by it. I would like to add my vote to this request, as currently I am simply unable to use this editor for any multilingual work. We are in the process of adding Russian support to a website and I have been reduced to working with Notepad.

PCPete · Post by **PCPete** » Mon Sep 17, 2007 11:36 pm

I've just found that PFE32 (Programmer's file editor, written by Alan Philips in the UK), a freeware text editor I last used in Windows95 (1988 to early 1996) handles the 23MB unicode text file on my x64 system without any major problems. Unicode characters are displayed incorrectly, but "straight" search-and-replace text happens just as fast as Textpad, and when the file is saved, all the unicode strings are preserved. Plus, it's MUCH nicer and MUCH MUCH faster than Notepad.

If anyone would like a copy of PFE32 to try out on unicode files in the interim, I'll make it available on my website for download (it will be a single ZIP archive with all documentation and help; this is in full compliance with its original distribution permissions). Reply to this message and I'll post a download link.

It's not much, but it's better than Notepad or Boxer (and, unfortunately, Textpad V5) with NLS/MBCS/Unicode text, and it beats waiting. I'll provide some (limited) support with configuration and so on, but "out of the box" it should allow editing of any source textfile format without corruption. But if you do wish to use it, please test it out first!
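For anyone wondering how an editor can show Unicode text as garbage and still save it intact, here is a minimal Python sketch of one way that can work. It is not a claim about how PFE32 is implemented. It assumes the file is UTF-8 and that the search and replace strings are plain ASCII; the sample text is made up for illustration.

```python
# Sketch: edit UTF-8 bytes through a Latin-1 "view" without corrupting them.
# Assumptions (not from the thread): the file is UTF-8, and the strings being
# searched for and inserted are plain ASCII.

original = "Привет, <b>мир</b>".encode("utf-8")  # the bytes as stored on disk

# Latin-1 maps every byte 0x00-0xFF to exactly one character, so decoding
# never fails and never loses information. The Cyrillic displays as mojibake.
view = original.decode("latin-1")

# A "straight" ASCII search-and-replace only ever touches ASCII bytes, because
# UTF-8 multi-byte sequences consist solely of bytes >= 0x80.
edited_view = view.replace("<b>", "<strong>").replace("</b>", "</strong>")

# Encoding back through Latin-1 restores every untouched byte exactly.
saved = edited_view.encode("latin-1")

print(view)                   # mojibake on screen
print(saved.decode("utf-8"))  # the Cyrillic is intact after saving
```

The same trick does not carry over to UTF-16 files, where ASCII characters are themselves two bytes wide.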

Nicholas Jordan · Post by **Nicholas Jordan** » Wed Sep 19, 2007 2:22 am

90+% of my concerns right now have to do with an os or editor making decisions for me about Unicode, UTF-8 and character conversions.

Currently my Notepad will display glyphs, really really wild - but of no use in keeping on top of what is going on. Many people use, and will have to use, some sort of MBCS, but let's not make the ISO Latin-1 culture an obsolete discard: make some user-selectables that allow use of the editor without having to write a file analyzer that preps the file according to which editor one is using.

Double byte character sets are the purview of the Unice .... Textpad is for Whino the OS .... leave the mandated switch to the Unice that is so pre-emptive about their Elvis and Emacs and Grep and many powerful tools they have.

gpuk · Post by **gpuk** » Tue Sep 25, 2007 6:08 pm

Hi PCPete,

Thanks for the tip. I might yet have a look at this but currently have started using Notepad2 which so far seems to be sufficient for what I need (at least it's better than Notepad).

In an ideal world I'd be using Textpad! One feature I am really missing is tabbed document editing. Sadly, Notepad2 can only handle one file per instance which is irritating.

gpuk

David Haslam · Post by **David Haslam** » Wed Mar 19, 2008 8:36 am

If you search the internet for the phrase

Bush hid the facts

you will find amusing tales of how Notepad guesses wrong about whether the text file it is opening is Unicode or not.

See http://www.datamystic.com/forums/viewtopic.php?p=1610
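A short Python sketch of why that particular phrase is ambiguous. It only shows that the same 18 bytes are valid both as ASCII and as UTF-16 little-endian text; it does not reproduce whatever heuristic Notepad actually uses to pick between them.

```python
# Sketch: the same bytes are valid ASCII *and* valid UTF-16LE.
data = b"Bush hid the facts"         # 18 ASCII bytes, no byte order mark

as_ascii = data.decode("ascii")
as_utf16 = data.decode("utf-16-le")  # same bytes read as 9 two-byte code units

print(as_ascii)
print(as_utf16)
print([hex(ord(ch)) for ch in as_utf16])

# Every resulting code point lands in the CJK Unified Ideographs block
# (U+4E00..U+9FFF), so the misreading looks like plausible Chinese text.
all_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in as_utf16)
print(all_cjk)
```

Without a byte order mark or some outside hint, an editor has to guess, which is why user-selectable encodings matter.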

David Haslam · Post by **David Haslam** » Wed Mar 19, 2008 8:41 am

During the past 12 months I have worked a lot with Unicode files. Anyone needing further tips on Unicode compatible editors could find useful information by searching the Go Bible Forums. See http://jolon.org/vanillaforum/

I currently use Notepad++ and SC Unipad for most needs, but EditPad Lite still comes in useful, as does MS Wordpad.

PFSchaffner · Post by **PFSchaffner** » Wed Mar 19, 2008 7:11 pm

The only point I would quibble with in the original post is this:

"In reality, while quadrupling the size of data would be a serious problem for a DVD video, it's simply no big deal for a text file. How many text files do you encounter that won't fit on a floppy disk, for example?"

I run a TextPad-based shop producing full-text SGML-encoded transcriptions of books, routinely deal with files in the 50+ MB range, and often edit dozens or even hundreds of files simultaneously. So memory use is an issue.

But minimal Unicode conformance is an even bigger issue. For editing the XML versions of our files (numbering in the tens of thousands; hundreds of thousands if you count their MARC records and TEI headers), we have to abandon TextPad in favor of EditPad Pro, BabelPad (for simple tasks), Notepad++, Notepad2, jEdit, or a dedicated XML editor like oXygen, XML-copy-editor, XMetaL, or EPCedit. Minimal conformance would allow us to use TextPad for at least some of this work, and I'd strongly support it if it were possible to implement it without disruption to the most Unicode-sensitive areas of the program, especially the sort and regex functions, on which I absolutely rely.
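To put rough numbers on the "quadrupling" worry, here is a small Python sketch comparing how many bytes the same text takes in different encodings. The sample strings are made up for illustration, and how much memory an editor really uses depends on its internal representation, which this does not model.

```python
# Sketch: bytes needed for the same text in different Unicode encodings.
samples = {
    "english": "Hello",   # 5 ASCII characters
    "russian": "Привет",  # 6 Cyrillic characters
}

sizes = {}
for name, text in samples.items():
    sizes[name] = {
        "chars": len(text),
        "utf-8": len(text.encode("utf-8")),       # 1 to 4 bytes per character
        "utf-16": len(text.encode("utf-16-le")),  # 2 bytes each (4 outside the BMP)
        "utf-32": len(text.encode("utf-32-le")),  # always 4 bytes per character
    }

for name, row in sizes.items():
    print(name, row)
```

Only a fixed four-byte representation quadruples an ASCII file; for a 50 MB ASCII file that is the difference between 50 MB and 200 MB per open copy.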

Nicholas Jordan · Post by **Nicholas Jordan** » Thu Mar 20, 2008 8:50 pm

It is inevitable, you provide what should be modeled as the canonical framing for tp transition - as well as regexes are moving toward MBCS implementations. A few hundred thousand memory locations when terabyte persistent storage is available is not of deep consequence. Hand-held devices have megabytes of transient writeable space and any reasonable text file is small compared to graphics that are thrown around and discarded with minimal remark by most end observers.

Block or sliding window handling of 50+ MB files is not beyond minimal coding skills of undergrad CS; I question, for clarity, whether anyone can utilize 50,000,000 visual renderings of textual representation. I throw around large code chunks, and I have found 10-k to be an upper limit of reasonable metrics. Handy also in that it seems to match the unspoken de facto upper bound of 10-k over squeaking dialup. Some remote regions who would wish for unicode still have dialup speeds, so would display of a maximum of 10-k in any given edit operation hamper minimal Unicode conformance for you?

Again, not to countervail.
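A rough Python sketch of the block/sliding-window idea applied to searching: read the file in fixed-size blocks and carry a small overlap between blocks so a match straddling a boundary is not missed. The function name `find_bytes` and its parameters are hypothetical, chosen for illustration, and this says nothing about how TextPad handles large files.

```python
import io

def find_bytes(stream, needle, chunk_size=1 << 20):
    """Return the byte offset of the first `needle` in `stream`, or -1.

    Hypothetical helper: scans a binary stream block by block, so memory
    use stays near chunk_size no matter how large the file is.
    """
    if not needle:
        raise ValueError("needle must not be empty")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    overlap = len(needle) - 1  # longest tail that could start a match
    offset = 0                 # file offset of the first byte in `buf`
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1          # end of stream, nothing found
        buf += chunk
        pos = buf.find(needle)
        if pos != -1:
            return offset + pos
        # Keep only the tail that might be the start of a boundary match.
        keep = buf[-overlap:] if overlap else b""
        offset += len(buf) - len(keep)
        buf = keep

# Usage: a tiny block size forces the match to straddle block boundaries.
demo = io.BytesIO(b"a" * 10 + b"needle" + b"b" * 10)
print(find_bytes(demo, b"needle", chunk_size=4))
```

For a real file, open it with `open(path, "rb")` and encode the search string in the file's own encoding first. With UTF-16 data a raw byte search can also match at an odd offset, so the result would need an alignment check.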

smjg · Post by **smjg** » Sat Nov 01, 2008 11:00 pm

For those who haven't found it already, there's a recent thread on how to approach it, with a poll on when it should be done. Your contribution would be appreciated!

k9dog · Post by **k9dog** » Fri Sep 14, 2012 11:58 pm

I find myself editing more and more files with international characters that just break the old 8 bit boundaries because they were edited by programs that support Windows widestrings. They are usually called unicode in MSDN context, but I believe we are talking UTF-16 encoded files. I guess it means that they can always be seen 1 byte + 1 ASCII. I am guessing this is the kind of files we want to be able to edit without destroying the content of the original file. Maybe keeping the characters beyond ASCII as a ? (maybe inverted) would be okay, but it would sure be nice with some support of the format.
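A small Python sketch of what such files look like at the byte level, assuming they are UTF-16 little-endian as guessed above. It suggests that the "one byte plus one ASCII byte" picture holds only for characters in the ASCII range, and that a `?` placeholder is safe for display but lossy if it is ever written back to disk. The sample strings are made up for illustration.

```python
import codecs

ascii_text = "abc"
russian = "Я"  # U+042F, CYRILLIC CAPITAL LETTER YA

ascii_bytes = ascii_text.encode("utf-16-le")
russian_bytes = russian.encode("utf-16-le")

# ASCII characters become the ASCII byte followed by a zero byte...
print(ascii_bytes.hex())    # 610062006300
# ...but other characters use both bytes, so neither half is ASCII text.
print(russian_bytes.hex())  # 2f04

# Files written by Windows programs often start with a byte order mark.
file_bytes = codecs.BOM_UTF16_LE + ascii_bytes + russian_bytes
has_bom = file_bytes.startswith(codecs.BOM_UTF16_LE)

# Substituting '?' is fine on screen but destroys data if saved.
lossy = (ascii_text + russian).encode("ascii", errors="replace")
print(lossy)                # b'abc?'

# Decoding and re-encoding with the right codec preserves every byte.
text = file_bytes[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
round_trip = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

So an editor could show placeholders for characters it cannot render while keeping the original code units in memory for saving.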

