Umlauts - never ending story

haeb · Post by **haeb** » Sun Jan 03, 2016 7:54 pm

Hi all,

When 7.0.0 was introduced, "search in files" for an 'umlaut' did not work at all. Some time later, in version 7.4.0 or 7.5.0, it worked correctly. Thank you for that!

Now there is another problem with umlauts in 8.0.0, and it is a much bigger problem for me.

Some UTF-8 files which were displayed and saved correctly in 7.6.0 suddenly appeared in 8.0.0 as ANSI files. So the umlauts were destroyed, and with them the files.

I do not know which files are recognized correctly as UTF-8 and which are wrongly recognized as ANSI, because only some of them are affected. All of these files are UTF-8 files in 7.6.0 and all were saved without BOM.

At the moment TP 8.0.0 is unusable for me because it destroyed some of my files before I found this bug. So I switched back to 7.0.6.

Regards
Horst ... hoping a solution will be found soon

haeb · Post by **haeb** » Mon Jan 04, 2016 8:33 pm

Hi all,

It is different and more complicated. After some intensive research I discovered the following:

All the files I am talking about were saved as UTF-8 files by TP7.

#First
There is a limit of about 4010 chars within which TP looks for umlauts. So I have two test files: after4010.txt has its first umlaut chars after char 4010, and before4010.txt has its first umlaut chars before char 4010.
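A pair of test files like these could be generated with a short script. The following is only a minimal sketch in Python; the helper names `build_test_bytes` and `first_non_ascii_offset` and the padding sizes (100 and 5000 ASCII chars) are assumptions chosen for illustration, only the two file names come from the description above.

```python
from pathlib import Path

UMLAUT_LINE = "ä ö ü Ä Ö Ü ß\n"


def build_test_bytes(ascii_chars_before_umlaut: int) -> bytes:
    """Return BOM-less UTF-8 bytes whose first umlaut follows the given number of ASCII chars."""
    padding = "x" * ascii_chars_before_umlaut
    # The "utf-8" codec (unlike "utf-8-sig") does not write a BOM.
    return (padding + UMLAUT_LINE).encode("utf-8")


def first_non_ascii_offset(data: bytes) -> int:
    """Return the byte offset of the first non-ASCII byte, or -1 if there is none."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return -1


if __name__ == "__main__":
    # Assumed padding sizes: well below and well above the observed limit of about 4010 chars.
    for name, count in (("before4010.txt", 100), ("after4010.txt", 5000)):
        data = build_test_bytes(count)
        Path(name).write_bytes(data)
        print(name, "first non-ASCII byte at offset", first_non_ascii_offset(data))
```

Running the script writes both files into the current directory, so the different opening types can be compared on identical, known content.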

#Second
How a file with umlauts is displayed depends on how the file is opened. There are at least 4 types of opening a file:

1. "drag and drop" from explorer to TP
2. opening by clicking on a result of "search in files"
3. "menu > file > open"
4. "menu > file

"

#Third
When after4010.txt is opened by opening type 1 or 2, TP8 displays the file as ANSI, replaces the umlauts with the HEX value of the umlaut, and saves it as ANSI. When after4010.txt is opened by opening type 3 or 4, umlauts are displayed correctly and TP saves it correctly as UTF-8.

#Fourth
When after4010.txt is opened by opening type 2, then closed and opened again by type 4, the umlauts are also displayed wrongly. Even if TP8 is closed and started again, reopening with type 4 shows wrongly displayed umlauts. I am always talking about the same file, with no saving, just opening and closing. Only after opening it with type 3, closing it, and then opening it with type 4 are the correct umlauts displayed.

#Sixth
The before4010.txt file does not have any problems displaying umlauts, whichever method is used to open it.

#Seventh
Unlike TP8, TP7 does not make any difference between the 4 opening types for after4010.txt files.

So in TP8 it depends on how I open an after4010.txt file whether umlauts are displayed correctly or not.

I have many files like after4010.txt, because they are code files which contain German comments or German interface text in some parts. So it can easily happen that there is no comment in the first 4010 chars, followed by several comments with umlauts.

Knowing about the opening differences helps a little. But TP8 is still unusable for me until types 1 and 2 behave like types 3 and 4, especially type 2, which I use very heavily.

Btw.
I wrote that "search in files" with umlauts has worked since TP 7.4.0. This was completely wrong. It does not work in 7.x.x and it works 'a little' in 8.0.0. TP 7.x.x does not find any word which contains umlauts. TP 8.0.0 finds words containing umlauts in files of the before4010.txt type but not in files of the after4010.txt type.

Thank you for reading
Horst

Post by **bbadmin** » Mon Jan 04, 2016 8:51 pm

If a file does not start with a BOM, TextPad reads the first 4 KB and uses heuristics to determine if it contains any UTF-8, UTF-16 or UTF-16BE characters. If none are found, the file is assumed to be in the default system code page. This behaviour can be overridden by selecting the encoding on the Open File dialog box, or by setting the default encoding for the corresponding document class.
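To illustrate why a file whose first umlaut comes after the sampled region ends up treated as ANSI, here is a rough sketch in Python. It is not TextPad's actual code: the function name `guess_encoding`, the sample size of 4096 bytes, and the `cp1252` default code page are assumptions, and the UTF-16 heuristics for BOM-less files are left out entirely.

```python
import codecs

SAMPLE_SIZE = 4096  # assumption: "the first 4 KB" taken as 4096 bytes


def guess_encoding(data: bytes, default: str = "cp1252") -> str:
    """Guess an encoding from a BOM, or else from the first SAMPLE_SIZE bytes."""
    # A BOM settles the question immediately.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"

    sample = data[:SAMPLE_SIZE]
    if all(byte < 0x80 for byte in sample):
        # Pure ASCII sample: nothing points to UTF-8, so the default wins,
        # even if UTF-8 umlauts follow later in the file.
        return default

    # final=False tolerates a multi-byte sequence cut off at the sample boundary.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return default  # non-ASCII bytes that are not valid UTF-8
    return "utf-8"
```

With this sketch, `guess_encoding(b"x" * 100 + "ä".encode("utf-8"))` gives `"utf-8"`, while the same umlaut placed after 5000 ASCII bytes gives the default code page, which matches the before4010.txt and after4010.txt behaviour reported in this thread.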

I hope this helps.

haeb · Post by **haeb** » Mon Jan 04, 2016 8:59 pm

Yes, this is what I found. Of course I have set TP to open the files (standard, txt, php, ...) as UTF-8.

BUT when you open the file in any other way than "menu > file > open", TP8 behaves differently. TP8 does NOT use the settings when opening a file from "search in files".

Horst

Post by **bbadmin** » Thu Jan 14, 2016 6:25 pm

In the next release, Find in Files will use the corresponding document class encoding for files with indeterminate encodings.

haeb · Post by **haeb** » Thu Jan 14, 2016 10:27 pm

I am looking forward to getting this release!

haeb · Post by **haeb** » Thu Jan 21, 2016 11:59 am

It works!

TP 8.0.1 now does the following:

1. opens files from 'find in files' => the encoding defined by the class is used
2. finds files when there is an umlaut in the search string in 'find in files' AND displays all umlauts correctly in the result list
3. opens files by double click => the encoding defined by the class is used

TP 7.6.1 still has the same problems with umlauts. But I have now switched to TP8.

Regards
Horst