Detecting the correct character encoding in PHP?

Date: 2022-05-21 21:37:29

I'm trying to detect the character encoding of a string but I can't get the right result.
For example:

$str = "&euro; &sbquo; &fnof; &bdquo; &hellip;" ;
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
// Now $str should be a Windows-1252-encoded string.
// Let's detect its encoding:
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;

That code outputs ISO-8859-1 but it should be Windows-1252.

What's wrong with this?

EDIT:
Updated example, in response to @raina77ow.

$str = "&euro;&sbquo;&fnof;&bdquo;&hellip;" ; // no white-spaces
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
$str = "Hello $str" ; // let's add some ascii characters
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;

I get the wrong result again.

2 Answers

#1


1  

The problem with Windows-1252 in PHP is that it will almost never be detected: as soon as your text contains any character outside the range 0x80 to 0x9F, it is no longer detected as Windows-1252.

This means that if your string contains a normal ASCII letter like "A", or even a space character, PHP will say that this is not valid Windows-1252 and, in your case, fall back to the next possible encoding, which is ISO 8859-1. This is a PHP bug, see https://bugs.php.net/bug.php?id=64667.
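
One possible workaround is to skip mb_detect_encoding for this case and look at the raw bytes yourself. The following is only a minimal sketch of that idea: the function name guess_single_byte_encoding is made up for illustration, and the rule "any byte in 0x80 to 0x9F means Windows-1252" is a heuristic assumption, not something mbstring guarantees. It only distinguishes the encodings discussed in this question.

```php
<?php
// Hypothetical helper, not part of mbstring: guesses among ASCII, UTF-8,
// Windows-1252 and ISO-8859-1 by inspecting the raw bytes of the input.
function guess_single_byte_encoding(string $bytes): string
{
    // Pure 7-bit input is byte-for-byte identical in all four encodings.
    if (preg_match('/^[\x00-\x7F]*$/D', $bytes) === 1) {
        return 'ASCII';
    }
    // Non-ASCII text that happens to be well-formed UTF-8 is most likely UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // Most bytes in 0x80-0x9F are printable characters in Windows-1252 but
    // C1 control codes in ISO-8859-1, so their presence hints at Windows-1252.
    // This is a heuristic assumption, not a proof.
    if (preg_match('/[\x80-\x9F]/', $bytes) === 1) {
        return 'Windows-1252';
    }
    // Everything else is treated as ISO-8859-1 (the two encodings agree
    // on 0xA0-0xFF, so the label makes no practical difference here).
    return 'ISO-8859-1';
}

echo guess_single_byte_encoding("Hello \x80\x82\x83\x84\x85"), PHP_EOL;
```

The checks run from the most restrictive encoding to the least restrictive one, so a string is only labelled ISO-8859-1 when nothing more specific fits.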

#2


0  

Although strings encoded with ISO-8859-1 and CP-1252 have different byte representations:

<?php
$str = "&euro; &sbquo; &fnof; &bdquo; &hellip;" ;
foreach (array('Windows-1252', 'ISO-8859-1') as $encoding)
{
    $new = mb_convert_encoding($str, $encoding, 'HTML-ENTITIES');
    printf('%15s: %s detected: %10s explicitly: %10s',
        $encoding,
        implode('', array_map(function($x) { return dechex(ord($x)); }, str_split($new))),
        mb_detect_encoding($new),
        mb_detect_encoding($new, array('ISO-8859-1', 'Windows-1252'))
    );
    echo PHP_EOL;
}

Results:

Windows-1252: 802082208320842085 detected:            explicitly: ISO-8859-1
  ISO-8859-1: 3f203f203f203f203f detected:      ASCII explicitly: ISO-8859-1

...from what we can see here, it looks like there is a problem with the second parameter of mb_detect_encoding. Using mb_detect_order instead of the parameter yields very similar results.
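
To look at the byte difference on its own, without involving detection at all, something like the following might help. It is a sketch under a few assumptions: hex_dump is a made-up helper name, and the entities are decoded with html_entity_decode and then converted from UTF-8, instead of using the 'HTML-ENTITIES' pseudo-encoding (which, as far as I know, newer PHP versions flag as deprecated).

```php
<?php
// Hypothetical helper: renders a byte string as space-separated hex pairs.
function hex_dump(string $bytes): string
{
    return implode(' ', str_split(bin2hex($bytes), 2));
}

// Decode the entities to UTF-8 first, then convert to each target encoding.
$utf8 = html_entity_decode(
    '&euro; &sbquo; &fnof; &bdquo; &hellip;',
    ENT_QUOTES | ENT_HTML5,
    'UTF-8'
);

foreach (['Windows-1252', 'ISO-8859-1'] as $encoding) {
    $converted = mb_convert_encoding($utf8, $encoding, 'UTF-8');
    // Characters the target encoding cannot represent are replaced by
    // mbstring's substitute character.
    printf("%15s: %s\n", $encoding, hex_dump($converted));
}
```

Separating the hex pairs makes it easier to spot which bytes fall in the 0x80 to 0x9F range than the run-together output above.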
