How to handle user input of invalid UTF-8 characters?

I'm looking for a general strategy or advice on how to handle invalid UTF-8 input from users.

Even though my webapp uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode() and overall seems like a bad idea to have around.

W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back."

How exactly should this be done in practice, throughout a site with dozens of different places where data can be input?
How do you present the error to the user in a helpful way?
How do you temporarily store and display bad form data so the user doesn't lose all their text? Do you strip bad characters? Use a replacement character, and if so, how?
For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave it as-is in the database but do something (what?) before json_encode()?

EDIT: I'm very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP". I'd like advice from people with real-world experience on how they've handled this.

EDIT2: As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.

8 Solutions

#1

The accept-charset="UTF-8" attribute is only a guideline for browsers to follow; they are not forced to submit data that way. Crappy form submission bots are a good example...

What I usually do is ignore bad chars, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv() you also have the option to transliterate bad chars.

Here is an example using iconv():

$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
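
Since the question also asks for a fast way to turn invalid sequences into U+FFFD rather than dropping them, here is a minimal sketch of one way to do that with mbstring instead of iconv(). The function name utf8_replace_invalid is hypothetical; it relies on mb_substitute_character() and on the common idiom of converting from UTF-8 to UTF-8, and it has not been benchmarked here.

```php
<?php
// Hypothetical helper (not from the answer): replace invalid UTF-8
// sequences with U+FFFD instead of silently dropping them.
function utf8_replace_invalid(string $str): string
{
    // The substitute character is global mbstring state, so remember it.
    $previous = mb_substitute_character();
    mb_substitute_character(0xFFFD);

    // Converting UTF-8 to UTF-8 forces mbstring to re-validate the bytes
    // and emit the substitute character for anything malformed.
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');

    // Restore whatever was configured before.
    mb_substitute_character($previous);

    return $clean;
}

$fixed = utf8_replace_invalid("hello\xFF world");
var_dump(mb_check_encoding($fixed, 'UTF-8'));
```

How many replacement characters are produced for a longer malformed sequence can differ between PHP versions, so it is safer to rely only on the result being valid UTF-8.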

If you want to display an error message to your users, I'd probably do this globally rather than per received value. Something like this would probably do just fine:

function utf8_clean($str)
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}

$clean_GET = array_map('utf8_clean', $_GET);

if (serialize($_GET) != serialize($clean_GET))
{
    $_GET = $clean_GET;
    $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}

// $_GET is clean!

You may also want to normalize newlines and strip control chars (visible or not), like this:

function Clean($string, $control = true)
{
    $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);

    if ($control === true)
    {
        return preg_replace('~\p{C}+~u', '', $string);
    }

    return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}

Code to convert from UTF-8 to Unicode codepoints:

function Codepoint($char)
{
    $result = null;
    $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

    if (is_array($codepoint) && array_key_exists(1, $codepoint))
    {
        $result = sprintf('U+%04X', $codepoint[1]);
    }

    return $result;
}

echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072

Probably faster than any other alternative, though I haven't tested it extensively.

Example:

$string = 'hello world�';

// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);

function Bad_Codepoint($string)
{
    $result = array();

    foreach ((array) $string as $char)
    {
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        if (is_array($codepoint) && array_key_exists(1, $codepoint))
        {
            $result[] = sprintf('U+%04X', $codepoint[1]);
        }
    }

    return implode('', $result);
}

Is this what you were looking for?

#2

Receiving invalid characters from your web app might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:

<form action="..." accept-charset="UTF-8">

You might also want to take a look at similar questions on this site for pointers on how to handle invalid characters. Still, I think that signaling an error to the user is better than trying to clean up the invalid characters, which can cause unexpected loss of significant data or unexpected changes to your user's input.
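
To signal an error instead of cleaning up, the request data first has to be checked field by field. The following is a rough sketch of that idea, assuming the mbstring extension is available; find_invalid_utf8 is a hypothetical helper name, and how the resulting field list is shown to the user is left to the application.

```php
<?php
// Hypothetical helper: collect the key paths of every string value
// that is not valid UTF-8, descending into nested arrays.
function find_invalid_utf8(array $input, string $prefix = ''): array
{
    $bad = [];

    foreach ($input as $key => $value) {
        // Build a path such as "tags[1]" so the error can name the field.
        $path = $prefix === '' ? (string) $key : $prefix . '[' . $key . ']';

        if (is_array($value)) {
            $bad = array_merge($bad, find_invalid_utf8($value, $path));
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[] = $path;
        }
    }

    return $bad;
}

// In a real request handler this would be $_POST or $_GET.
$invalid = find_invalid_utf8(['name' => "Zo\xEB", 'tags' => ['ok', "\xFF"]]);
// $invalid is ['name', 'tags[1]']
```

An empty result means the submission can be processed as usual; otherwise the form can be re-rendered with an error next to each listed field.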

#3

I put together a fairly simple class to check whether input is in UTF-8 and to run it through utf8_encode() as needed:

class utf8
{

    /**
     * @param array $data
     * @return array
     */
    public static function encode(array $data)
    {
        foreach ($data as $key=>$val) {
            if (is_array($val)) {
                $data[$key] = self::encode($val);
            } else {
                if (false === self::check($val)) {
                    $data[$key] = utf8_encode($val);
                }
            }
        }

        return $data;
    }

    /**
     * Regular expression to test a string is UTF8 encoded
     * 
     * RFC3629
     * 
     * @param string $string The string to be tested
     * @return bool
     * 
     * @link http://www.w3.org/International/questions/qa-forms-utf-8.en.php
     */
    public static function check($string)
    {
        // preg_match() returns 1 or 0, so compare explicitly to get a real bool
        return 1 === preg_match('%^(?:
            [\x09\x0A\x0D\x20-\x7E]              # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
            )*$%xs',
            $string);
    }
}

// For example
$data = utf8::encode($_POST);
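
Note that utf8_encode() always treats its input as ISO-8859-1, and the function is deprecated as of PHP 8.2. A minimal alternative sketch using mbstring is shown below. The helper name to_utf8_assuming_cp1252 is hypothetical, and treating non-UTF-8 input as Windows-1252 is an assumption about where the data comes from, not something that can be detected reliably.

```php
<?php
// Hypothetical helper: keep valid UTF-8 untouched, otherwise reinterpret
// the bytes as Windows-1252 (an assumption about the data's origin).
function to_utf8_assuming_cp1252(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }

    return mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
}

// 0xE9 is "é" in Windows-1252 but is not valid UTF-8 on its own.
echo bin2hex(to_utf8_assuming_cp1252("caf\xE9")), "\n"; // 636166c3a9
```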

#4

There is a multibyte extension for PHP, check it out: http://www.php.net/manual/en/book.mbstring.php

You should try the mb_check_encoding() function.
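
As a quick illustration of what that check reports, here is a small sketch with hand-picked byte strings. The sample data is made up for this example.

```php
<?php
// Made-up samples: one well-formed string and two malformed ones.
$samples = [
    'valid'     => "na\u{EF}ve", // U+00EF encoded as proper UTF-8
    'latin1'    => "na\xEFve",   // a lone ISO-8859-1 byte, not UTF-8
    'truncated' => "abc\xC3",    // lead byte with no continuation byte
];

foreach ($samples as $label => $bytes) {
    $ok = mb_check_encoding($bytes, 'UTF-8');
    echo $label, ': ', $ok ? 'valid' : 'invalid', "\n";
}
```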

Good luck!

#5

For completeness to this question (not necessarily the best answer)...

function as_utf8($s) {
    return mb_convert_encoding($s, "UTF-8", mb_detect_encoding($s));
}
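
This is a different technique from the function above, but if the immediate goal is only to keep json_encode() from failing on stored data, PHP 7.2 and later provide the JSON_INVALID_UTF8_SUBSTITUTE and JSON_INVALID_UTF8_IGNORE flags. A minimal sketch, with a made-up $row standing in for a database record:

```php
<?php
// Made-up record: the lone 0xE9 byte makes the title invalid UTF-8.
$row = ['id' => 7, 'title' => "Caf\xE9"];

// Without a flag, json_encode() gives up and returns false.
$plain = json_encode($row);
var_dump($plain); // bool(false)

// With the flag, each invalid sequence becomes U+FFFD in the output.
$json = json_encode($row, JSON_INVALID_UTF8_SUBSTITUTE);
$decoded = json_decode($json, true);
var_dump($decoded['title'] === "Caf\u{FFFD}"); // bool(true)
```

This leaves the stored bytes untouched, so it fits the "leave as-is in the database" option from the question rather than the "convert and save back" one.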

#6

I recommend merely not allowing garbage to get in. Don't rely on custom functions, which can bog your system down. Simply walk the submitted data against an alphabet you design. Create an acceptable alphabet string and walk the submitted data, byte by byte, as if it were an array. Push acceptable characters to a new string, and omit unacceptable characters. The data you store in your database then is data triggered by the user, but not actually user-supplied data.

EDIT #4: Replacing bad characters with an entity: &#65533; (rendered as �)

EDIT #3: Updated Sept 22 2010 @ 1:32pm. Reason: the string returned is now UTF-8, plus I used the test file you provided as proof.

<?php
// build alphabet
// optionally you can remove characters from this array

$alpha[]= chr(0); // null
$alpha[]= chr(9); // tab
$alpha[]= chr(10); // new line
$alpha[]= chr(11); // vertical tab
$alpha[]= chr(13); // carriage return

// printable ASCII: space (32) through tilde (126)
for ($i = 32; $i <= 126; $i++) {
    $alpha[]= chr($i);
}

/* remove comment to check ascii ordinals */

// /*
// foreach ($alpha as $key=>$val){
//  print ord($val);
//  print '<br/>';
// }
// print '<hr/>';
//*/
// 
// //test case #1
// 
// $str = 'afsjdfhasjhdgljhasdlfy42we875y342q8957y2wkjrgSAHKDJgfcv kzXnxbnSXbcv   '.chr(160).chr(127).chr(126);
// 
// $string = teststr($alpha,$str);
// print $string;
// print '<hr/>';
// 
// //test case #2
// 
// $str = ''.'©?™???';
// $string = teststr($alpha,$str);
// print $string;
// print '<hr/>';
// 
// $str = '©';
// $string = teststr($alpha,$str);
// print $string;
// print '<hr/>';

$file = 'http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt';
$testfile = implode(chr(10),file($file));

$string = teststr($alpha,$testfile);
print $string;
print '<hr/>';


function teststr(&$alpha, &$str){
    $strlen = strlen($str);
    $newstr = chr(0); //null
    $x = 0;
    if($strlen >= 2){

        for ($i = 0; $i < $strlen; $i++) {
            $x++;
            if(in_array($str[$i],$alpha)){
                // passed
                $newstr .= $str[$i];
            }else{
                // failed
                print 'Found out of scope character. (ASCII: '.ord($str[$i]).')';
                print '<br/>';
                $newstr .= '&#65533;';
            }
        }
    }elseif($strlen <= 0){
        // failed to qualify for test
        print 'Non-existent.';

    }elseif($strlen === 1){
        $x++;
        if(in_array($str,$alpha)){
            // passed

            $newstr = $str;
        }else{
            // failed
            print 'Total character failed to qualify.';
            $newstr = '&#65533;';
        }
    }else{
        print 'Non-existent (scope).';
    }

    if(mb_detect_encoding($newstr, "UTF-8") != "UTF-8"){
        $newstr = utf8_encode($newstr);
    }

    // test encoding:
    if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8"){
        print 'UTF-8 :D<br/>';
    }else{
        print 'ENCODED: '.mb_detect_encoding($newstr, "UTF-8").'<br/>';
    }

    return $newstr.' (scope: '.$x.', '.$strlen.')';
}

#7

How about stripping all chars outside a given subset? At least in some parts of my application, usernames for example, I would not allow chars outside the [a-zA-Z] and [0-9] sets. You can build a filter function that silently strips all chars outside this range, or one that returns an error when it detects them and pushes the decision to the user.
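
Both variants described above can be sketched in a few lines. The function names are hypothetical, and the 3 to 20 character length limit is an assumption added for illustration. Because the patterns are used without the /u modifier they work byte by byte, so stray non-ASCII bytes are rejected or stripped like any other disallowed character.

```php
<?php
// Hypothetical validator: accept only ASCII letters and digits.
// The 3..20 length limit is an assumption for illustration.
function is_valid_username(string $name): bool
{
    return preg_match('/\A[A-Za-z0-9]{3,20}\z/', $name) === 1;
}

// Hypothetical silent filter: drop every byte outside the allowed set.
function strip_to_alnum(string $value): string
{
    return preg_replace('/[^A-Za-z0-9]/', '', $value);
}

var_dump(is_valid_username('alice42'));   // bool(true)
var_dump(is_valid_username("al\xFFice")); // bool(false)
echo strip_to_alnum("al\xFFi-ce"), "\n";  // alice
```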

#8

Try doing what Rails does to force all browsers always to post UTF-8 data:

<form accept-charset="UTF-8" action="#{action}" method="post"><div
    style="margin:0;padding:0;display:inline">
    <input name="utf8" type="hidden" value="&#x2713;" />
  </div>
  <!-- form fields -->
</form>

See railssnowman.info or the initial patch for an explanation.

To have the browser send form-submission data in the UTF-8 encoding, just render the page with a Content-Type header of "text/html; charset=utf-8" (or use a meta http-equiv tag).
To have the browser send form-submission data in the UTF-8 encoding even if the user fiddles with the page encoding (browsers let users do that), use accept-charset="UTF-8" in the form.
To have the browser send form-submission data in the UTF-8 encoding even if the user fiddles with the page encoding (browsers let users do that), and even if the browser is IE and the user switched the page encoding to Korean and entered Korean characters in the form fields, add a hidden input to the form with a value such as ✓ (&#x2713;), which can only be from the Unicode charset (and, in this example, not the Korean charset).
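
The hidden field can also double as a server-side sanity check. This is one possible use, not something the answer or Rails prescribes, and submitted_as_utf8 is a hypothetical helper name. The idea is that if the form was submitted in some other encoding, the check mark would arrive as something other than its three UTF-8 bytes.

```php
<?php
// Hypothetical check for the hidden "utf8" field from the form above.
function submitted_as_utf8(array $post): bool
{
    // "\u{2713}" is U+2713 CHECK MARK, bytes E2 9C 93 in UTF-8.
    return isset($post['utf8']) && $post['utf8'] === "\u{2713}";
}

// In a real handler the argument would be $_POST.
var_dump(submitted_as_utf8(['utf8' => "\xE2\x9C\x93"])); // bool(true)
var_dump(submitted_as_utf8(['utf8' => '&#10003;']));     // bool(false)
```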
