named character entities
named character entities
I edit web pages. In them, I use named entities. These are usually highlighted in dark green. However, today I used &mdash; and &ndash; for the first time. Instead of being highlighted dark green, they were left white. I take this to mean that TextPad doesn't recognize them. What file do I edit to add these names to the list? I looked at the html.syn file and it doesn't contain them. I looked at the colors for the HTML document class, and that color is not one of the colors I set when I customized the settings.
Re: named character entities
WayneCa wrote: I looked at the html.syn file and it doesn't contain them.
But it contains the following 3 interesting lines:
Code:
HTML=1
...
CharStart = &
CharEnd = ;
Why would anyone need entities nowadays anyway? Unicode has solved this for decades - only the 4 XML entities are needed (for the angle brackets, the ampersand and the quotation mark) so that they don't interfere with the syntax; anything else can be used directly, as per the text's encoding.
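That claim can be illustrated with Python's standard html module (a sketch of my own; the sample string is made up): only the markup-significant characters need to become entities, while dashes and other typographic characters can stay literal in a correctly encoded document.

```python
import html

# Only the markup-significant characters need to become entities;
# typographic characters such as the em dash can stay literal in a
# correctly encoded (e.g. UTF-8) document.
text = 'Dashes — and – plus "quotes", & and <tags>'
escaped = html.escape(text)  # escapes &, <, > and (by default) quotes

print(escaped)
# The dashes survive untouched; only the syntax characters are escaped:
assert "—" in escaped and "–" in escaped
assert "&lt;tags&gt;" in escaped and "&amp;" in escaped
```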
Re: named character entities
AmigoJack wrote: That means that HTML entities are stored in TextPad and those don't need to be defined elsewhere. As per https://en.wikipedia.org/wiki/List_of_X ... es_in_HTML TextPad might only support pre 4.0 entities.
I had not looked at Wikipedia. That TextPad might only support the 8-bit entities had not occurred to me.
AmigoJack wrote: Why would anyone need entities nowadays anyway? Unicode has solved this for decades - only the 4 XML entities are needed so that they don't interfere with the syntax; anything else can be used directly, as per the text's encoding.
I know that the browsers display the characters without an issue, but apparently the vnu validator doesn't agree. According to it every character that I have replaced with a named entity was seen by it as "malformed". Replacing it with an entity corrected the error. Since vnu is the validator designed by the W3C, I assumed it was current. I use the standalone version as an external tool in TextPad to validate the pages I am editing. The only things that don't validate are the characters that I have to replace with a named entity.
Thank you for replying as quickly as you did. It helps me get on with other things more quickly.
Re: named character entities
WayneCa wrote: the vnu validator
Care to link?
WayneCa wrote: According to it every character that I have replaced with a named entity was seen by it as "malformed". Replacing it with an entity corrected the error. ... I use the standalone version as an external tool in TextPad to validate the pages I am editing. The only things that don't validate are the characters that I have to replace with a named entity.
Sounds a bit contradicting. Most likely the error is not in any validator, but in how you feed it: if it's direct I/O then take care of which encodings are used and expected. It's rarely clear who assumes/provides UTF-8, 8859 or even 850. Otherwise save your file with a UTF BOM and open it directly in "the tool".
Just as with https://validator.w3.org/ you better upload a file to make sure copy/paste text encodings don't interfere.
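The mismatch described here is easy to reproduce; a minimal Python sketch (my illustration, not part of the thread):

```python
# The same em dash is a different byte sequence in each encoding.
text = "an em dash — here"

utf8_bytes = text.encode("utf-8")      # the dash becomes b'\xe2\x80\x94'
cp1252_bytes = text.encode("cp1252")   # the dash becomes a single b'\x97'

assert b"\xe2\x80\x94" in utf8_bytes
assert b"\x97" in cp1252_bytes

# A consumer expecting UTF-8 chokes on the Windows-1252 bytes:
try:
    cp1252_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("malformed byte sequence:", exc)
```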
Re: named character entities
AmigoJack wrote: Care to link?
https://validator.github.io/validator/ This is where I downloaded it. It runs standalone from the command line or can be run from TextPad as a Tool.
AmigoJack wrote: Sounds a bit contradicting. Most likely the error is not in any validator, but in how you feed it: if it's direct I/O then take care of which encodings are used and expected. It's rarely clear who assumes/provides UTF-8, 8859 or even 850. Otherwise save your file with a UTF BOM and open it directly in "the tool". Just as with https://validator.w3.org/ you better upload a file to make sure copy/paste text encodings don't interfere.
How is it contradicting? I put the em-dash character back into the document and ran the validator just so I could show you the output. In TextPad I have it set up under the Tool category with the following options:
Parameters: --html $File
Initial Folder: $FileDir
Capture Output and Sound alert when completed are checked, everything else is unchecked.
Regular expression to match output is the default: ^([^(]+)\((\d+)\):
File is blank, Line is 1, Column is 1
To invoke it, I have the document I am checking open and in the foreground in TextPad, then type <CTRL>-1 (control-one). It does the rest automatically. (I put the ~ in the file path myself. The validator supplied the entire path.)
Code:
"file:/C:/Users/~/Source/errors.htm":120.30-120.30: error: Malformed byte sequence: "97".
Tool completed with exit code 1
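Incidentally, the "97" in that message is consistent with the encoding explanation: 0x97 is the em dash in Windows-1252, and it is not a valid byte in that position in UTF-8. Note also that the default tool regex ^([^(]+)\((\d+)\): expects file(line): and cannot match v.Nu's "file":line.col-line.col: output shape (the v.Nu line contains no parenthesis at all). A hypothetical pattern that does match it, shown here in Python purely as a sketch:

```python
import re

# v.Nu emits: "file:/...":line.col-line.col: error: message
# A pattern that captures file, line and column from that shape:
vnu_line = ('"file:/C:/Users/~/Source/errors.htm":120.30-120.30: '
            'error: Malformed byte sequence.')
m = re.match(r'^"([^"]+)":(\d+)\.(\d+)-\d+\.\d+:', vnu_line)

assert m is not None
print(m.group(1), "line", m.group(2), "column", m.group(3))
```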
Edit: I went to the online validator and pasted the document into the validate-by-direct-input pane. The validator didn't have a problem with the em dash being there. I guess it's just the standalone vnu validator that has the issue.
Re: named character entities
WayneCa wrote: How is it contradicting?
Re-read what you wrote - one of your sentences is missing an important word, which inverts its meaning. As it stands now, you describe doing the same action twice.
WayneCa wrote: I put the em-dash character back into the document
Save the document in UTF-8 encoding with a BOM before running the tool.
WayneCa wrote: this type of error when there is a character it doesn't like
No, the error clearly says "byte sequence", not "character". Those things happen when the consumer (the tool) expects UTF-8 and you're feeding it 8859 - didn't you stumble upon all the details I wrote? It's not enough to let the HTML declare the encoding - the actual file must be encoded the same way.
WayneCa wrote: Edit: I went to the online validator and pasted the document into the validate-by-direct-input pane. The validator didn't have a problem with the em dash being there. I guess it's just the standalone vnu validator that has the issue.
As I wrote before: the tool might also not be the problem, but how you save your file. And as I wrote before: upload your actual file instead of copy/pasting text - if results differ then it's obvious you're not saving your file correctly to begin with.
Also consider using the parameter nu.validator.client.charset of v.Nu to provide the encoding of the document you feed it with, as per the manual.
Re: named character entities
AmigoJack wrote: Re-read what you wrote - one of your sentences is missing an important word, which inverts its meaning. As it stands now, you describe doing the same action twice.
I'm not certain what you mean here, but I will restate what I said. Hopefully that will remove any contradiction you see.
When I run the validator on a document that has, say, the è character in it, the validator reports a malformed byte sequence. If I replace the è with &egrave; the error is resolved.
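A sketch of why such a swap "fixes" the error (Python, my illustration): è saved in Windows-1252 is the single byte 0xE8, which is not valid UTF-8 on its own, while the named entity &egrave; is plain ASCII and survives any of these encodings.

```python
import html

assert "è".encode("cp1252") == b"\xe8"   # one byte in Windows-1252

try:
    b"\xe8".decode("utf-8")              # what a UTF-8 consumer sees
except UnicodeDecodeError:
    print("malformed byte sequence")     # the validator's complaint

entity = "&egrave;"
entity.encode("ascii")                   # pure ASCII: no decoding hazard
assert html.unescape(entity) == "è"      # browsers render it as è
```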
AmigoJack wrote: Save the document in UTF-8 encoding with a BOM before running the tool.
I have been saving all my text documents the same way for over 30 years: plain text. For the purposes of this discussion, my TextPad save parameters are:
Line endings: PC
Encoding: Default
UNICODE BOM: unchecked
This is the first time I have ever been told that I may be saving the files wrong.
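As an aside, those two save modes can be compared at the byte level with a small Python sketch (my illustration; "Encoding: Default" on a Western-locale Windows is assumed here to mean the ANSI code page cp1252):

```python
import codecs

text = "em dash — here"

# "Encoding: Default" → the ANSI code page (assumed cp1252 here):
ansi = text.encode("cp1252")
# "UTF-8" with "UNICODE BOM" checked → BOM followed by UTF-8 bytes:
utf8_bom = codecs.BOM_UTF8 + text.encode("utf-8")

assert b"\x97" in ansi                       # the byte the validator rejects
assert utf8_bom.startswith(b"\xef\xbb\xbf")  # the BOM announces UTF-8
```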
AmigoJack wrote: No, the error clearly says "byte sequence", not "character". Those things happen when the consumer (the tool) expects UTF-8 and you're feeding it 8859. It's not enough to let the HTML declare the encoding - the actual file must be encoded the same way.
I understand that the error says byte sequence. I refer to characters because it is those characters that are causing the error. Replacing them with a named entity has been correcting the error.
AmigoJack wrote: As I wrote before: the tool might also not be the problem, but how you save your file. And as I wrote before: upload your actual file instead of copy/pasting text - if results differ then it's obvious you're not saving your file correctly to begin with.
I will look into that too. Much of the manual was unclear to me as to whether or not it related to my use of the validator in TextPad as an external tool.
AmigoJack wrote: Also consider using the parameter nu.validator.client.charset of v.Nu to provide the encoding of the document you feed it with, as per the manual.
I will try that; I used copy/paste as it was more expedient at the time.
Edit: I used the file upload to validate the page. To my utter amazement the validator kept telling me that the <!DOCTYPE html> statement contained characters that it couldn't interpret as UTF-8! I resaved the document with UTF-8 encoding and UNICODE BOM checked and it validated. Well, after I learned that I must include a <META charset="utf-8"/> statement in the document as well. I had not run across this requirement before either.
Also, I validated the page using the vnu I have installed as a tool, but without updating the parameter. It validated, even with the character still there and not replacing it with the named entity. I guess this means I need to go back and resave all of the documents using UTF-8 encoding with UNICODE BOM checked, and adding the META statement to each one.
Thanks for the help, and for your patience with me learning to wrap my head around this. It is very much appreciated.
I'm happy if you can now say to yourself "heck, I never thought about that context - all the time I was fixing symptoms instead of finding the actual culprit - good bye entities".
Text files can have different encodings, which should now be obvious to you. In the internet context (i.e. your internet browser or any *ML validator) the consuming software is most likely able to recognize Unicode encodings if the file has a BOM in it. So if you "just" save your files in UTF-8 encoding with a BOM then chances are you don't even need the <META> HTML tag anymore (in the worst case it's redundant and in the best case it confirms the recognized encoding).
Try it once and it should also validate.
AmigoJack wrote: I'm happy if you can now say to yourself "heck, I never thought about that context - all the time I was fixing symptoms instead of finding the actual culprit - good bye entities".
Yes, yes I can, and I did! I changed all of the differing entities back to their actual characters in every file that contained them. Most of the occurrences were no-break spaces. It's nice to not have to fall back on them anymore, but it's also nice that there's a complete list of all HTML entities I can refer to when I need one. It's not easy trying to find them in Character Map.
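The cleanup described here can be sketched in Python (my illustration; protecting the four XML entities mirrors the earlier point that only those must stay escaped):

```python
import html

# Sketch: turn named entities back into literal characters, but keep the
# four markup-significant XML entities escaped. Placeholders shield them
# from html.unescape().
KEEP = {"&amp;": "\0amp;", "&lt;": "\0lt;", "&gt;": "\0gt;", "&quot;": "\0quot;"}

def entities_to_chars(text: str) -> str:
    for entity, placeholder in KEEP.items():
        text = text.replace(entity, placeholder)
    text = html.unescape(text)            # &nbsp; -> NBSP, &mdash; -> em dash
    for entity, placeholder in KEEP.items():
        text = text.replace(placeholder, entity)
    return text

sample = "A&nbsp;B &amp; C&mdash;D"
assert entities_to_chars(sample) == "A\u00a0B &amp; C\u2014D"
```

Applied over whole files, this would be paired with a UTF-8 (BOM) save, since the literal characters only survive in a suitable encoding.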
AmigoJack wrote: <snip> Try it once and it should also validate.
I will do that. It's just too bad I didn't see this before I added that META statement to about 30 documents.
Agree with MudGuard
Hi all,
I have the same opinion as MudGuard. I'm using TextPad to edit files for a huge HTML documentation set, written in raw HTML with some tools generating parts of the pages. This documentation is still "old-fashioned": the source code is human-friendly and at the same time typographically correct (a minus sign is not a replacement for an ndash or an mdash). The documentation set contains 10,000+ files and it would not be an easy task to switch the text encoding.
In the syntax file, CharStart, CharEnd and HTML=1 are correctly defined (see the post from AmigoJack, June 10, 2020).
O.K., it looks like I have to live with the restriction (and in my opinion it is a restriction).
pbaumann
But different dashes can be used directly: what your internet browser renders are actual characters, which can be copied and used instead of the entities. And in the editor those characters will (of course) remain as they are, and not be converted to hyphens for magical reasons.
That is the whole point - if the editor does not give you a good way to display zero-width joiners, then look for a better editor. That is the same argument as not using tab characters because they always look like spaces.
The only real need for entities is to express text that must not collide with the markup (i.e. &lt; for <).