How to keep json_encode() from dropping strings with invalid characters

Date: 2020-12-14 00:28:29

Is there a way to keep json_encode() from returning null for a string that contains an invalid (non-UTF-8) character?

It can be a pain in the ass to debug in a complex system. It would be much more fitting to actually see the invalid character, or at least have it omitted. As it stands, json_encode() will silently drop the entire string.

Example (in UTF-8):

$string = 
  array(utf8_decode("Düsseldorf"), // Deliberately produce broken string
        "Washington",
        "Nairobi"); 

print_r(json_encode($string));

Results in

[null,"Washington","Nairobi"]

Desired result:

["D�sseldorf","Washington","Nairobi"]

Note: I am not looking to make broken strings work in json_encode(). I am looking for ways to make it easier to diagnose encoding errors. A null string isn't helpful for that.

5 Answers

#1


39  

PHP does try to raise an error, but only if you turn display_errors off. This is odd because the display_errors setting is only meant to control whether errors are printed to standard output, not whether an error is triggered. I want to emphasize that when display_errors is on, even though you may see all kinds of other PHP errors, PHP doesn't just hide this error, it never triggers it at all. That means it will not show up in any error logs, nor will any custom error handlers get called. The error just never occurs.

Here's some code that demonstrates this:

error_reporting(-1);//report all errors
$invalid_utf8_char = chr(193);

ini_set('display_errors', 1);//display errors to standard output
var_dump(json_encode($invalid_utf8_char));
var_dump(error_get_last());//nothing

ini_set('display_errors', 0);//do not display errors to standard output
var_dump(json_encode($invalid_utf8_char));
var_dump(error_get_last());// json_encode(): Invalid UTF-8 sequence in argument

That bizarre and unfortunate behavior is related to this bug https://bugs.php.net/bug.php?id=47494 and a few others, and doesn't look like it will ever be fixed.
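On newer PHP versions the failure can also be inspected directly instead of relying on warnings. The following is a minimal sketch, assuming PHP 5.5 or later (where json_last_error_msg() exists and json_encode() returns false for malformed UTF-8); the variable names are only illustrative:

```php
<?php
// Assumption: PHP 5.5+, where json_last_error_msg() is available.
$invalid_utf8_char = chr(193); // byte 0xC1 is never valid in UTF-8

$json = json_encode($invalid_utf8_char);

if ($json === false) {
    // These calls report the failure regardless of the display_errors setting.
    $error_code = json_last_error();     // JSON_ERROR_UTF8 for malformed UTF-8
    $error_text = json_last_error_msg(); // human-readable description
    echo "json_encode() failed ($error_code): $error_text\n";
}
```

Checking the return value against false (with ===) matters, because json_encode() can legitimately return strings such as "0" or "null".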

Workaround:

Cleaning the string before passing it to json_encode may be a workable solution.

$stripped_of_invalid_utf8_chars_string = iconv('UTF-8', 'UTF-8//IGNORE', $orig_string);
if ($stripped_of_invalid_utf8_chars_string !== $orig_string) {
    // one or more chars were invalid, and so they were stripped out.
    // if you need to know where in the string the first stripped character was, 
    // then see http://*.com/questions/7475437/find-first-character-that-is-different-between-two-strings
}
$json = json_encode($stripped_of_invalid_utf8_chars_string);
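For locating the first stripped byte, one common trick is to XOR the original and the cleaned string and count the leading zero bytes. This is a sketch only: firstDifferenceOffset() is a hypothetical helper name, and the cleaned string is hard-coded here rather than produced by iconv so that the example does not depend on a particular iconv build:

```php
<?php
/**
 * Returns the byte offset of the first difference between two strings,
 * or null if they are identical. Hypothetical helper, not part of PHP.
 */
function firstDifferenceOffset($a, $b)
{
    if ($a === $b) {
        return null;
    }
    // XOR-ing two strings yields "\0" wherever the bytes match (the result
    // is as long as the shorter string); strspn() counts the matching prefix.
    return strspn($a ^ $b, "\0");
}

$orig_string = "D\xFCsseldorf"; // ISO-8859-1 bytes, invalid as UTF-8
$stripped    = "Dsseldorf";     // what stripping the invalid byte would leave

$offset = firstDifferenceOffset($orig_string, $stripped);
printf("first invalid byte at offset %d: 0x%s\n", $offset, bin2hex($orig_string[$offset]));
```

The offset is a byte offset, not a character offset, which is usually what you want when hunting for a broken byte in a log or hex dump.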

http://php.net/manual/en/function.iconv.php

The manual says

//IGNORE silently discards characters that are illegal in the target charset.

So by first removing the problematic characters, in theory json_encode() shouldn't get anything it will choke on. I haven't verified that the output of iconv with the //IGNORE flag is perfectly compatible with json_encode()'s notion of what valid UTF-8 is, so buyer beware: there may be edge cases where it still fails. Ugh, I hate character set issues.

Edit: PHP 7.2+ adds two new flags for json_encode(): JSON_INVALID_UTF8_IGNORE and JSON_INVALID_UTF8_SUBSTITUTE.
There wasn't much documentation when this was written, but this test should help you understand the expected behavior: https://github.com/php/php-src/blob/master/ext/json/tests/json_encode_invalid_utf8.phpt
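Applied to the example from the question, the two flags behave roughly as follows. This is a small sketch assuming PHP 7.2 or later; the array is the question's data with the broken bytes written out explicitly:

```php
<?php
// Assumption: PHP 7.2+, where both flags are defined.
$cities = array("D\xFCsseldorf", "Washington", "Nairobi"); // first entry is not valid UTF-8

// Replace each invalid byte with U+FFFD (emitted as the escape \ufffd).
$substituted = json_encode($cities, JSON_INVALID_UTF8_SUBSTITUTE);

// Drop invalid bytes entirely.
$ignored = json_encode($cities, JSON_INVALID_UTF8_IGNORE);

echo $substituted, "\n"; // ["D\ufffdsseldorf","Washington","Nairobi"]
echo $ignored, "\n";     // ["Dsseldorf","Washington","Nairobi"]
```

JSON_INVALID_UTF8_SUBSTITUTE is the closer match to the "desired result" in the question, since the position of the broken character stays visible.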

A JSON_THROW_ON_ERROR flag has since been added as well (PHP 7.3) :)
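With JSON_THROW_ON_ERROR, a failed encode raises a JsonException instead of returning false, which makes the problem much harder to miss. A minimal sketch, assuming PHP 7.3 or later (where the flag and the exception class are available):

```php
<?php
// Assumption: PHP 7.3+, where JSON_THROW_ON_ERROR and JsonException exist.
$code = null;
$message = null;

try {
    json_encode("D\xFCsseldorf", JSON_THROW_ON_ERROR); // input is not valid UTF-8
} catch (JsonException $e) {
    // The exception code is the same value json_last_error() would report.
    $code = $e->getCode();
    $message = $e->getMessage();
}

echo "json_encode() failed ($code): $message\n";
```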

#2


6  

$s = iconv('UTF-8', 'UTF-8//IGNORE', $s);

This solved the problem. I am not sure why the PHP developers haven't made life easier by fixing json_encode().

Anyway, using the above lets json_encode() produce output even if the data contains special characters (Swedish letters, for example).

You can then use the result in JavaScript without having to decode the data back to its original encoding (with escape(), unescape(), encodeURIComponent(), decodeURIComponent()).

I am using it like this in PHP (Smarty):

$template = iconv('UTF-8', 'UTF-8//IGNORE', $screen->fetch("my_template.tpl"));

Then I send the result to JavaScript and just set the ready template (an HTML piece) as innerHTML in my document.

Simply put, the above line should somehow be implemented in json_encode() itself so that it can work with any encoding.

#3


4  

This function will remove all invalid UTF8 chars from a string:

function removeInvalidChars( $text) {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace($regex, '$1', $text);
}

I use it after converting an Excel document to json, as Excel docs aren't guaranteed to be in UTF8.

I don't think there's a particularly sensible way of converting invalid chars to a visible but valid character. You could replace invalid chars with U+FFFD, the Unicode replacement character, by turning the regex above around, but that really doesn't provide a better user experience than just dropping invalid chars.
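For diagnosis, though, a visible marker can be useful. Here is a sketch of the "turned around" variant: replaceInvalidChars() is a hypothetical name, and the lead-byte ranges are tightened slightly compared to the regex above (\xC2-\xDF and \xF0-\xF4). It is still an approximation and does not reject every ill-formed sequence, such as overlong three-byte forms or encoded surrogates:

```php
<?php
/**
 * Replaces bytes that are not part of a plausible UTF-8 sequence with
 * U+FFFD. Hypothetical helper; the pattern is an approximation only.
 */
function replaceInvalidChars($text)
{
    $regex = '/( [\x00-\x7F] | [\xC2-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF4][\x80-\xBF]{3} ) | ./x';

    return preg_replace_callback($regex, function ($match) {
        // Group 1 is only set when a well-formed sequence matched;
        // otherwise the lone "." matched a single stray byte.
        return isset($match[1]) ? $match[1] : "\xEF\xBF\xBD"; // U+FFFD in UTF-8
    }, $text);
}

$fixed = replaceInvalidChars("D\xFCsseldorf");
echo json_encode($fixed), "\n"; // "D\ufffdsseldorf"
```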

#4


3  

You need to know the encoding of all strings you're dealing with, or you're entering a world of pain.

UTF-8 is an easy encoding to use. Also, JSON is defined to use UTF-8 (http://www.json.org/JSONRequest.html). So why not use it?

Short answer: the way to avoid json_encode() dropping your strings is to make sure they are valid UTF-8.
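One way to check this before encoding, and to see which values are broken, is sketched below. It assumes the mbstring extension is loaded; findInvalidUtf8() is a hypothetical helper name and only inspects the top level of the array:

```php
<?php
/**
 * Returns the entries of $values that are not valid UTF-8, keyed by their
 * original key, as hex dumps so the offending bytes are visible.
 * Hypothetical helper; requires the mbstring extension.
 */
function findInvalidUtf8(array $values)
{
    $bad = array();
    foreach ($values as $key => $value) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[$key] = bin2hex($value);
        }
    }
    return $bad;
}

$cities = array("D\xFCsseldorf", "Washington", "Nairobi");
print_r(findInvalidUtf8($cities)); // reports key 0 with its raw bytes in hex
```

The hex dump shows the stray byte (fc here) without depending on how the terminal or log viewer renders it.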

#5


1  

Instead of using the iconv function, you can directly use json_encode() with the JSON_UNESCAPED_UNICODE option (PHP >= 5.4.0).

Make sure you send "charset=utf-8" in the Content-Type header from your PHP file:

header('Content-Type: application/json; charset=utf-8');
