How can I detect malformed UTF-8 strings in PHP?

Time: 2021-04-18 08:40:52

The iconv function sometimes gives me this notice:

Notice:
iconv() [function.iconv]:
Detected an incomplete multibyte character in input string in [...]

Is there a way to detect illegal characters in a UTF-8 string before passing the data to iconv?

4 solutions

#1


47  

First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.

You can use the UTF-8 validity check built into preg_match [PHP Manual], available since PHP 4.3.5. With the u modifier the subject is validated before matching: a valid string returns 1, while an invalid one returns a falsy value with no further detail (false in current PHP versions, where preg_last_error() reports PREG_BAD_UTF8_ERROR), so compare the result against 1:

$isUTF8 = preg_match('//u', $string) === 1;
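The preg_match check above can be wrapped in a small helper. This is only a minimal sketch: the function name isValidUtf8 and the sample byte strings are made up for illustration, and it assumes nothing beyond the PCRE extension that ships with PHP.

```php
<?php
// Hypothetical helper name; not part of PHP itself.
function isValidUtf8(string $s): bool
{
    // With the /u modifier PCRE validates the subject as UTF-8 before matching.
    // The empty pattern matches any valid string, so a result other than 1
    // means the subject was rejected.
    return preg_match('//u', $s) === 1;
}

var_dump(isValidUtf8("caf\xC3\xA9")); // bool(true):  valid two-byte sequence for "é"
var_dump(isValidUtf8("caf\xE9"));     // bool(false): lone ISO-8859-1 byte
var_dump(isValidUtf8("abc\xC3"));     // bool(false): truncated multibyte sequence
```

The strict comparison with 1 keeps the helper independent of whether a given PHP version reports invalid input as 0 or false.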

Another possibility is mb_check_encoding [PHP Manual]:

$validUTF8 = mb_check_encoding($string, 'UTF-8');
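To tie this back to the question, one possible way to use mb_check_encoding is as a guard in front of iconv, so the "incomplete multibyte character" notice is never triggered by malformed input. The wrapper name convertIfValidUtf8 and the default target encoding are assumptions for illustration, and the sketch needs the mbstring and iconv extensions.

```php
<?php
// Hypothetical wrapper; name and default target encoding are illustrative only.
function convertIfValidUtf8(string $s, string $to = 'ISO-8859-1'): ?string
{
    // Reject malformed input up front instead of letting iconv raise a notice.
    if (!mb_check_encoding($s, 'UTF-8')) {
        return null;
    }

    // iconv can still fail (and raise a notice) when a valid character
    // has no representation in the target encoding.
    $out = iconv('UTF-8', $to, $s);

    return $out === false ? null : $out;
}

var_dump(convertIfValidUtf8("caf\xC3\xA9")); // converted to the single byte 0xE9 for "é"
var_dump(convertIfValidUtf8("abc\xC3"));     // NULL: truncated sequence rejected
```

Returning null leaves it to the caller to decide whether to skip, log, or repair the bad input.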

Another function you can use is mb_detect_encoding [PHP Manual]:

$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));

It's important to set the strict parameter to true.

Additionally, iconv [PHP Manual] lets you replace (//TRANSLIT) or drop (//IGNORE) problematic sequences on the fly. (However, if iconv encounters such a sequence, it raises a notice; this behavior cannot be changed.)

echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE   : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;

You can suppress the notice with @ and compare the length of the result with the original; if anything was dropped, the lengths differ:

$validUTF8 = strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));

Check the examples on the iconv manual page as well.

You have not shared the code that triggers the notice. Add it if you want a more concrete suggestion.

#2


0  

You could try using mb_detect_encoding to check whether the input is in a character set other than UTF-8, and then mb_convert_encoding to convert it to UTF-8 if required. It's more likely that people are giving you valid content in a different character set than invalid UTF-8.
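A rough sketch of that convert-if-needed idea follows. Because encoding detection is heuristic, this version swaps mb_detect_encoding for a plain validity check and a fixed fallback guess; the helper name ensureUtf8 and the ISO-8859-1 fallback are assumptions about the data source, not something the answer specifies. It requires the mbstring extension.

```php
<?php
// Hypothetical helper; the fallback encoding is an assumption about where the data comes from.
function ensureUtf8(string $s, string $fallback = 'ISO-8859-1'): string
{
    // Already valid UTF-8: leave the bytes untouched.
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }

    // Otherwise treat the bytes as the fallback encoding and re-encode them.
    return mb_convert_encoding($s, 'UTF-8', $fallback);
}

var_dump(ensureUtf8("caf\xE9") === "caf\xC3\xA9");     // Latin-1 "é" re-encoded as UTF-8
var_dump(ensureUtf8("caf\xC3\xA9") === "caf\xC3\xA9"); // valid UTF-8 passes through unchanged
```

If the input can come from several legacy encodings, a single fallback is not enough and the guess has to come from metadata such as an HTTP header or a form's declared charset.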

#3


0  

The specification is pretty clear about which characters are invalid in UTF-8. You probably want to strip those out before trying to parse the document. They shouldn't be there, so it would be even better to avoid them before generating the XML.

See here for a reference:

http://www.w3.org/TR/xml/#charsets

That isn't a complete list; many parsers also disallow some low-numbered control characters, but I can't find a comprehensive list right now.
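As an illustration of stripping such characters, the sketch below removes every code point outside the Char production of the XML 1.0 specification linked above. The helper name stripXmlInvalidChars is hypothetical, and the input is assumed to be valid UTF-8 already; for malformed input preg_replace returns null, which the function passes through.

```php
<?php
// Hypothetical helper: drop code points outside the XML 1.0 "Char" production.
function stripXmlInvalidChars(string $s): ?string
{
    // Allowed by XML 1.0:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    // With /u, preg_replace returns null when the subject is not valid UTF-8.
    return preg_replace($pattern, '', $s);
}

var_dump(stripXmlInvalidChars("a\x00b\x08c\td")); // NUL and backspace removed, tab kept
var_dump(stripXmlInvalidChars("abc\xC3"));        // NULL: input is not valid UTF-8
```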

However, iconv might have built-in support for this:

http://www.zeitoun.net/articles/clear-invalid-utf8/start

#4


0  

Put an @ in front of iconv() to suppress the notice, and append //IGNORE to the encoding id to drop invalid characters. The PHP manual documents //IGNORE as a suffix of the destination encoding, so it goes on the second argument:

@iconv('UTF-8', $destinationEncoding . '//IGNORE', $yourString);
