
Time: 2022-09-13 16:02:44

I wrote a guestbook using PHP 4 and HTML 4.01 (with the charset ISO-8859-15, i.e. Latin-9). The data is saved in a MySQL database with the charset ISO-8859-1 (i.e. Latin-1).


When somebody enters characters from a different charset, it seems that the browsers send the data encoded (actually I have not checked where it gets encoded, ...).


Anyway, in some cases it seems that characters are not saved encoded in the database. As a result, the validator returns an error message when I show the data within an HTML 4.01 document:


non SGML character number 146


You have used an illegal character in your text. HTML uses the standard UNICODE Consortium character repertoire, and it leaves undefined (among others) 65 character codes (0 to 31 inclusive and 127 to 159 inclusive) that are sometimes used for typographical quote marks and similar in proprietary character sets. The validator has found one of these undefined characters in your document. The character may appear on your browser as a curly quote, or a trademark symbol, or some other fancy glyph; on a different computer, however, it will likely appear as a completely different character, or nothing at all.


Your best bet is to replace the character with the nearest equivalent ASCII character, or to use an appropriate character entity. For more information on Character Encoding on the web, see Alan Flavell's excellent HTML Character Set Issues reference.


This error can also be triggered by formatting characters embedded in documents by some word processors. If you use a word processor to edit your HTML documents, be sure to use the "Save as ASCII" or similar command to save the document without formatting information.


I'm now using PHP 5.2.17 and played a bit with htmlspecialchars, but nothing worked. How can I encode those characters so that there are no more validation errors?


2 answers



In both ISO-8859-1 and ISO-8859-15, character number 146 (0x92) is the control character PU2 (Private Use Two) from the C1 range.


SGML refers to ISO 8859-1 (mind the space between ISO and 8859-1, which is not a hyphen as in the character sets you use). It allows only three control characters (here: SGML in HTML):


In the HTML document character set only three control characters are allowed: Horizontal Tab, Carriage Return, and Line Feed (code positions 9, 13, and 10).


You therefore did pass an illegal character. No SGML/HTML entity exists that you could replace it with.


I suggest you validate the input that comes into your application so that it does not contain control characters. If you believe those characters originally represented something useful, like a letter that can actually be read (i.e. not a control character), it's likely that the encoding gets broken at some point while you process the data.
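The validation suggested above could look roughly like the following. This is only a sketch: the function name hasForbiddenControlChars is made up for illustration, and it assumes the input is a single-byte (ISO-8859-x style) string, so it works on raw bytes rather than on UTF-8.

```php
<?php
// Hypothetical helper (name is an assumption, not from the question).
// Returns true if the byte string contains a control character other
// than Horizontal Tab (9), Line Feed (10) and Carriage Return (13).
// Assumes a single-byte encoding, so C1 controls are the bytes 0x80-0x9F.
function hasForbiddenControlChars($input)
{
    // C0 controls except 9, 10, 13, plus DEL (0x7F) and the C1 range.
    $pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/';
    return preg_match($pattern, $input) === 1;
}
```

A guestbook entry containing the byte 0x92 would be flagged, while ordinary Latin-1 letters such as 0xE9 (é) and line breaks pass. Whether to reject such input or to convert it is a separate decision.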


From the information given in your question it's hard to say where, because you only specify the input encoding and the encoding of the database field, and those two already don't match (which should not produce the issue you're asking about, but it can produce other issues). Besides those two places, there is also the database client connection charset (unspecified in your question), the output encoding (unspecified in your question) and the response content encoding (unspecified in your question).


It might make sense that you change your overall encoding to UTF-8 to support a wider range of characters, but that's really a might.


Edit: The part above takes a rather strict view. It occurred to me that the input you receive may not actually be ISO-8859-1(5) but something else, such as a Windows code page. My guess is Windows-1252 (cp1252; see Wikipedia). Where ISO-8859-1 has the C1 control range (128-159), Windows-1252 has several non-control characters.


The Wikipedia page also notes that most browsers treat ISO-8859-1 as Windows-1252 (also written CP1252 or CP-1252). The PHP htmlentities() function cannot deal with these characters: its translation table for HTML entities does not cover these code points (PHP 5.3; not tested against 5.4). You need to create your own translation table and use it with strtr to replace the Windows-1252 characters that are not available in ISO 8859-15:


/**
 * Mappings of Windows-1252 (cp1252) characters 128 (0x80) - 159 (0x9F)
 * to HTML 4.01 entities.
 *
 * @link http://en.wikipedia.org/wiki/Windows-1252
 * @link http://www.w3.org/TR/html4/sgml/entities.html
 */
$cp1252HTML401Entities = array(
    "\x80" => '&euro;',    # 128 -> euro sign, U+20AC NEW
    "\x82" => '&sbquo;',   # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&fnof;',    # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&bdquo;',   # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&hellip;',  # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&dagger;',  # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&Dagger;',  # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&circ;',    # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&permil;',  # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&Scaron;',  # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&lsaquo;',  # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&OElig;',   # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',    # 142 -> U+017D (no named entity in HTML 4.01)
    "\x91" => '&lsquo;',   # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&rsquo;',   # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&ldquo;',   # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&rdquo;',   # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&bull;',    # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&ndash;',   # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&mdash;',   # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&tilde;',   # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&trade;',   # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&scaron;',  # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&rsaquo;',  # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&oelig;',   # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',    # 158 -> U+017E (no named entity in HTML 4.01)
    "\x9F" => '&Yuml;',    # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);

$outputWithEntities = strtr($output, $cp1252HTML401Entities);
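One detail the strtr call leaves open is the order relative to ordinary HTML escaping. The following sketch assumes the guestbook text is escaped with htmlspecialchars before the table is applied; the helper name escapeGuestbookText and the two-entry excerpt table are made up for illustration and are not part of the answer.

```php
<?php
// Two-entry excerpt of a cp1252 table, for illustration only.
$cp1252Excerpt = array(
    "\x92" => '&#8217;', // 146 -> right single quotation mark
    "\x96" => '&#8211;', // 150 -> en dash
);

// Hypothetical helper: escape markup first, then map the cp1252 bytes.
// Doing it the other way round would turn the "&" of each inserted
// reference into "&amp;".
function escapeGuestbookText($raw, array $table)
{
    $escaped = htmlspecialchars($raw, ENT_QUOTES, 'ISO-8859-15');
    return strtr($escaped, $table);
}

$html = escapeGuestbookText("it\x92s <b>1\x962</b>", $cp1252Excerpt);
```

Here $html contains only ASCII: the markup characters are escaped and the two Windows-1252 bytes have become numeric character references.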

If you want to be even safer, you can skip the named entities and use only numeric ones, which should work in very old browsers as well:


$cp1252HTMLNumericEntities = array(
    "\x80" => '&#8364;',   # 128 -> euro sign, U+20AC NEW
    "\x82" => '&#8218;',   # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&#402;',    # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&#8222;',   # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&#8230;',   # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&#8224;',   # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&#8225;',   # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&#710;',    # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&#8240;',   # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&#352;',    # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&#8249;',   # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&#338;',    # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',    # 142 -> U+017D
    "\x91" => '&#8216;',   # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&#8217;',   # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&#8220;',   # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&#8221;',   # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&#8226;',   # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&#8211;',   # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&#8212;',   # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&#732;',    # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&#8482;',   # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&#353;',    # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&#8250;',   # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&#339;',    # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',    # 158 -> U+017E
    "\x9F" => '&#376;',    # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);

Hope this is more helpful now. See also the Wikipedia page linked above for some characters that exist in both Windows-1252 and ISO 8859-15 but at different code points. You should probably consider using UTF-8 on your website.

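If the site does move to UTF-8, the existing Windows-1252 data would need converting once. A minimal sketch, assuming the stored bytes really are Windows-1252 and that the iconv extension is available; the function name cp1252ToUtf8 is hypothetical:

```php
<?php
// Hypothetical helper: convert a Windows-1252 byte string to UTF-8.
// Returns null when iconv cannot convert the input, for example when it
// contains one of the byte values that Windows-1252 leaves unassigned.
function cp1252ToUtf8($bytes)
{
    $utf8 = iconv('Windows-1252', 'UTF-8', $bytes);
    return $utf8 === false ? null : $utf8;
}
```

With this, the byte 0x92 becomes the three-byte UTF-8 sequence for U+2019, which is a legal character in a page served as UTF-8.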



A web page that has a text input field should be UTF-8 encoded, because this is the only way to ensure that all characters entered by the user will be correctly transmitted. How you deal with them server-side (e.g., rejecting characters outside some specific range) is a different issue.


If you use some other encoding and the user enters a character that has no representation in that encoding, this is an error condition that browsers may handle in any way they like. Modern browsers do something that is very odd in principle though useful in practice: they represent the characters as character references, like &#8217; for the right single quote (’). In this case, the data received is the same as if the user had actually typed the characters &#8217; (but this is so theoretical that browser vendors apparently ignore the problem).
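This may also explain why htmlspecialchars alone did not help the asker. The sketch below uses an assumed sample string standing in for what the server might receive after the browser has substituted a character reference; the variable names are illustrative only.

```php
<?php
// Assumed sample: form data in which the browser replaced an
// unrepresentable right single quote with a character reference.
$received = 'it&#8217;s <b>bold</b>';

// By default the ampersand is escaped as well, so the reference would be
// displayed literally on the page.
$escapedTwice = htmlspecialchars($received, ENT_QUOTES, 'ISO-8859-15');

// With the fourth argument ($double_encode) set to false, which exists
// since PHP 5.2.3, existing references are left alone.
$escapedOnce = htmlspecialchars($received, ENT_QUOTES, 'ISO-8859-15', false);
```

Note that the second form cannot tell a browser-generated reference from one the user typed by hand, which is exactly the ambiguity described above.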


What happens server-side in your case is unclear, but it may involve many types of processing. In any case, you cannot in general store ISO-8859-15 in ISO-8859-1 encoding (ISO-8859-15 was designed to replace some characters in ISO-8859-1 by other characters). It is unclear what your software does with character references like &#8217;. It would be slightly odd, though surely possible, for software to replace them by character references like &#146; (which are based on using windows-1252 as the document character set, contrary to HTML rules; they are technically undefined, not illegal, in HTML but so widely supported by browsers that HTML5 turns this into a rule).




