Remove non-UTF-8 characters from a string

Time: 2021-10-20 00:10:26

I'm having a problem removing non-UTF-8 characters from a string; they are not displayed properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation).

What is the best way to remove them? A regular expression, or something else?

15 answers

#1


69  

Using a regex approach:

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3 
    ){1,100}                        # ...one or more times
  )
| .                                 # anything else
/x
END;
$text = preg_replace($regex, '$1', $text); // preg_replace returns the result, so assign it

It searches for UTF-8 sequences, and captures those into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. Replacement is whatever was captured into group 1. This effectively removes all invalid bytes.
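
As a quick check of the behaviour described above, here is the same pattern applied to the bytes from the question (0x97 0x61 0x6C 0x6F). The variable names $dirty and $clean are only illustrative.

```php
<?php
// Same pattern as in this answer: valid sequences go into group 1,
// any other single byte is matched by "." and left uncaptured.
$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]
    |   [\xC0-\xDF][\x80-\xBF]
    |   [\xE0-\xEF][\x80-\xBF]{2}
    |   [\xF0-\xF7][\x80-\xBF]{3}
    ){1,100}
  )
| .
/x
END;

$dirty = "\x97alo";                          // 0x97 0x61 0x6C 0x6F
$clean = preg_replace($regex, '$1', $dirty); // keep only what group 1 captured

echo bin2hex($dirty), "\n"; // 97616c6f
echo bin2hex($clean), "\n"; // 616c6f, i.e. "alo"
```

Note that the pattern checks only the bit layout of each sequence, so it is looser than strict UTF-8: for instance, the overlong pair 0xC0 0x80 passes as a valid sequence.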

It is possible to repair the string, by encoding the invalid bytes as UTF-8 characters. But if the errors are random, this could leave some strange symbols.

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3 
    ){1,100}                      # ...one or more times
  )
| ( [\x80-\xBF] )                 # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] )                 # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
  if ($captures[1] != "") {
    // Valid byte sequence. Return unmodified.
    return $captures[1];
  }
  elseif ($captures[2] != "") {
    // Invalid byte of the form 10xxxxxx.
    // Encode as 11000010 10xxxxxx.
    return "\xC2".$captures[2];
  }
  else {
    // Invalid byte of the form 11xxxxxx.
    // Encode as 11000011 10xxxxxx.
    return "\xC3".chr(ord($captures[3])-64);
  }
}
$text = preg_replace_callback($regex, "utf8replacer", $text); // assign the returned string
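
To make the byte arithmetic in utf8replacer() concrete, here is a small worked example. latin1_byte_as_utf8() is a hypothetical helper written for this illustration; it treats a stray byte as an ISO 8859-1 character, which is what the two invalid-byte branches above amount to.

```php
<?php
// Hypothetical helper: re-encode one byte as the UTF-8 form of the
// ISO 8859-1 character with the same number.
function latin1_byte_as_utf8(string $byte): string
{
    $n = ord($byte);
    if ($n < 0x80) {
        return $byte;                 // ASCII is already valid UTF-8
    }
    // 0x80-0xBF -> C2 xx, 0xC0-0xFF -> C3 (xx - 64)
    return $n < 0xC0 ? "\xC2" . $byte : "\xC3" . chr($n - 64);
}

echo bin2hex(latin1_byte_as_utf8("\xE9")), "\n"; // c3a9, the UTF-8 form of U+00E9
echo bin2hex(latin1_byte_as_utf8("\x97")), "\n"; // c297, the UTF-8 form of U+0097
```

For the byte 0x97 from the question this yields U+0097, a control character. If the data was really Windows-1252, 0x97 was meant as an em dash, which is one way the strange symbols mentioned above can arise.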

EDIT:

  • !empty(x) will match non-empty values ("0" is considered empty).
  • x != "" will match non-empty values, including "0".
  • x !== "" will match anything except "".

x != "" seems the best one to use in this case.

I have also sped up the match a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.

#2


102  

If you apply utf8_encode() to a string that is already UTF-8, it will return garbled UTF-8 output.

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a data feed that was all messed up, mixing those encodings in the same string.

Usage:

require_once('Encoding.php'); 
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($mixed_string);

$latin1_string = Encoding::toLatin1($mixed_string);

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled as a result of having been encoded into UTF-8 multiple times.

Usage:

require_once('Encoding.php'); 
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
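
The garbling mentioned at the start of this answer is easy to reproduce. The sketch below is not taken from the ForceUTF8 library; it uses mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode() (which is deprecated in recent PHP versions) and assumes the mbstring extension is available.

```php
<?php
$utf8 = "F\xC3\xA9d\xC3\xA9ration";   // "Fédération", already UTF-8

// Treating the UTF-8 bytes as Latin-1 and encoding them again garbles the text:
// each 2-byte "é" turns into two separate characters.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double), "\n";          // 46c383c2a964c383c2a9726174696f6e

// One round of double encoding can be undone by converting back.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $utf8);        // bool(true)
```

This only undoes a single, known round of double encoding; strings with mixed or repeated damage are the case the library functions above are aimed at.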

Download:

https://github.com/neitanod/forceutf8

#3


41  

You can use mbstring:

$text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');

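...will replace invalid byte sequences with the configured substitute character ("?" by default), or remove them if mbstring.substitute_character is set to "none".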

See: Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored
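
A small sketch of the substitute-character setting that the linked question is about, assuming the mbstring extension is loaded. By default mbstring replaces an invalid byte with a substitute character rather than dropping it; calling mb_substitute_character('none') asks it to drop such bytes instead. The input is the byte sequence from the question.

```php
<?php
$dirty = "\x97alo";   // 0x97 is not valid UTF-8 on its own

// Default behaviour: the invalid byte becomes the substitute character.
$substituted = mb_convert_encoding($dirty, 'UTF-8', 'UTF-8');

// With substitution disabled, the invalid byte is dropped.
// Note that this setting stays in effect for later mbstring calls.
mb_substitute_character('none');
$stripped = mb_convert_encoding($dirty, 'UTF-8', 'UTF-8');

echo $substituted, "\n";
echo $stripped, "\n";
var_dump(mb_check_encoding($stripped, 'UTF-8'));
```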

#4


14  

This function removes all non-ASCII characters. It's useful, but it doesn't solve the question.
This is my function that always works, regardless of encoding:

function remove_bs($Str) {  
  $StrArr = str_split($Str); $NewStr = '';
  foreach ($StrArr as $Char) {    
    $CharNo = ord($Char);
    if ($CharNo == 163) { $NewStr .= $Char; continue; } // keep £ 
    if ($CharNo > 31 && $CharNo < 127) {
      $NewStr .= $Char;    
    }
  }  
  return $NewStr;
}

How it works:

echo remove_bs('Hello õhowå åare youÆ?'); // Hello how are you?

#5


10  

$text = iconv("UTF-8", "UTF-8//IGNORE", $text);

This is what I am using. It seems to work pretty well. Taken from http://planetozh.com/blog/2005/01/remove-invalid-characters-in-utf-8/

#6


6  

UConverter has been available since PHP 5.5. It is the better choice if you use the intl extension and don't use mbstring.

function replace_invalid_byte_sequence($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence2($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}

htmlspecialchars can be used to replace invalid byte sequences since PHP 5.4. It is better than preg_match both for handling large inputs and for accuracy. Many incorrect implementations based on regular expressions can be found.

function replace_invalid_byte_sequence3($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
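
To illustrate what the htmlspecialchars() round trip does with a stray byte, here is a short check using the bytes from the question. With ENT_SUBSTITUTE the invalid byte is replaced by U+FFFD (the Unicode replacement character) rather than removed. ENT_IGNORE would drop it instead, though the PHP manual discourages that flag.

```php
<?php
$dirty = "\x97alo <b>";   // stray byte followed by ordinary text and markup

// Encode with substitution, then decode so that "<" and ">" come back unchanged.
$fixed = htmlspecialchars_decode(
    htmlspecialchars($dirty, ENT_SUBSTITUTE, 'UTF-8')
);

echo bin2hex($fixed), "\n";             // efbfbd616c6f203c623e
var_dump($fixed === "\u{FFFD}alo <b>"); // bool(true)
```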

#7


6  

Try this:

$string = iconv("UTF-8","UTF-8//IGNORE",$string);

According to the iconv manual, the function will take the first parameter as the input charset, second parameter as the output charset, and the third as the actual input string.

If you set both the input and output charset to UTF-8 and append the //IGNORE flag to the output charset, the function will drop (strip) all characters in the input string that can't be represented in the output charset, which in effect filters the input string.
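
A hedged sketch of that call with a check on the return value. The behaviour of //IGNORE depends on the underlying iconv implementation; on some builds iconv() raises a notice and returns false when it meets an invalid byte, so the result is worth checking rather than assuming. strip_with_iconv() is a hypothetical wrapper name.

```php
<?php
// Returns the filtered string, or null if this iconv build refuses the input.
function strip_with_iconv(string $s): ?string
{
    // "@" suppresses the notice that some builds raise on invalid input.
    $out = @iconv('UTF-8', 'UTF-8//IGNORE', $s);
    return $out === false ? null : $out;
}

$result = strip_with_iconv("\x97alo");   // the bytes from the question
if ($result === null) {
    echo "iconv rejected the input on this system\n";
} else {
    echo bin2hex($result), "\n";
}
```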

#8


5  

The text may contain non-UTF-8 characters. Try doing this first:

$nonutf8 = mb_convert_encoding($nonutf8 , 'UTF-8', 'UTF-8');

You can read more about it here: http://php.net/manual/en/function.mb-convert-encoding.php

#9


3  

$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));

#10


3  

I have made a function that deletes invalid UTF-8 characters from a string. I'm using it to clean the descriptions of 27000 products before generating the XML export file.

function stripInvalidXml($value) {
    $ret = "";
    if ($value === null || $value === '') {
        return $ret;
    }
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++) {
        // ord() returns a single byte value (0-255), so the checks above 0xFF
        // can never apply: in practice this drops ASCII control characters
        // other than tab, LF and CR, and keeps every byte >= 0x20.
        $current = ord($value[$i]); // the $value{$i} syntax was removed in PHP 8
        if (($current == 0x9) || ($current == 0xA) || ($current == 0xD) || (($current >= 0x20) && ($current <= 0xD7FF)) || (($current >= 0xE000) && ($current <= 0xFFFD)) || (($current >= 0x10000) && ($current <= 0x10FFFF))) {
            $ret .= chr($current);
        }
    }
    return $ret;
}
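
The range checks in that function (0xD7FF, 0xFFFD, 0x10FFFF) are about Unicode code points, while ord() only ever sees single bytes. A code-point-aware variant can be sketched with a /u pattern, assuming the input is already valid UTF-8 (for example after one of the clean-up steps from the other answers). strip_invalid_xml_chars() is a hypothetical name, and the ranges are the ones from the function above.

```php
<?php
// Keeps only the code points listed in the original condition.
// Returns null if $value is not valid UTF-8 (preg_replace fails under /u).
function strip_invalid_xml_chars(string $value): ?string
{
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $value
    );
}

// NUL, vertical tab and U+FFFE are outside the allowed ranges; "é" is inside.
$in = "a\x00b\x0B" . "c\u{FFFE}d\u{E9}";
echo bin2hex(strip_invalid_xml_chars($in)), "\n"; // 61626364c3a9
```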

#11


2  

From a recent patch to Drupal's Feeds JSON parser module:

//remove everything except valid letters (from any language)
$raw = preg_replace('/(?:\\\\u[\pL\p{Zs}])+/', '', $raw);

If you're concerned: yes, it retains spaces as valid characters.

It did what I needed. It removes the now-widespread emoji characters that don't fit into MySQL's 'utf8' character set and that gave me errors like "SQLSTATE[HY000]: General error: 1366 Incorrect string value".

For details see https://www.drupal.org/node/1824506#comment-6881382

#12


1  

So the rules are that the first octet of a multi-octet UTF-8 character has the high bit set as a marker, followed by 1 to 3 more set bits (under RFC 3629) that indicate how many additional octets follow; each of the additional octets must then have its high two bits set to 10.

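In Python, this could be written as follows (malformed sequences are simply skipped here, which is only one possible choice):

def strip_invalid_utf8(data: bytes) -> bytes:
    out = bytearray()      # validated output
    pending = bytearray()  # octets of the multi-octet character in progress
    cont = 0               # continuation octets still expected
    for ch in data:
        if cont:
            if (ch >> 6) == 0b10:   # high 2 bits are 10: valid continuation
                pending.append(ch)
                cont -= 1
                if cont == 0:
                    out += pending  # character complete, keep it
                continue
            cont = 0                # malformed: drop the partial character
            # fall through and re-examine ch as a possible first octet
        if ch < 0x80:               # high bit clear: plain ASCII
            out.append(ch)
        elif 0xC0 <= ch <= 0xF7:    # lead octet: 110, 1110 or 11110 prefix
            cont = 1 if ch < 0xE0 else 2 if ch < 0xF0 else 3
            pending = bytearray([ch])
        # else: stray continuation octet or illegal lead (0xF8-0xFF), skipped
    # a sequence still pending here was never terminated, so it is dropped
    return bytes(out)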

The same logic should be translatable to PHP. However, it's not clear what kind of stripping should be done once you encounter a malformed character.
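
One possible PHP translation is sketched below. It settles the open question by simply dropping any malformed or unterminated sequence, which is only one of several reasonable policies; strip_malformed_utf8() is a hypothetical name. Like the scan described above, it checks only the bit structure, so overlong forms and surrogates are not rejected.

```php
<?php
// Hypothetical helper: byte-wise scan that keeps ASCII and complete
// multi-octet sequences, and drops everything else.
function strip_malformed_utf8(string $bytes): string
{
    $out = '';      // validated output
    $pending = '';  // octets of the character currently being read
    $cont = 0;      // continuation octets still expected
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);
        if ($cont > 0) {
            if (($b >> 6) === 0b10) {      // high two bits are 10
                $pending .= $bytes[$i];
                if (--$cont === 0) {
                    $out .= $pending;      // character complete
                }
                continue;
            }
            $cont = 0; // malformed: drop the partial character, re-examine $b
        }
        if ($b < 0x80) {
            $out .= $bytes[$i];            // plain ASCII
        } elseif ($b >= 0xC0 && $b <= 0xF7) {
            // lead octet: 110xxxxx, 1110xxxx or 11110xxx
            $cont = $b < 0xE0 ? 1 : ($b < 0xF0 ? 2 : 3);
            $pending = $bytes[$i];
        }
        // anything else (stray continuation, 0xF8-0xFF) is skipped
    }
    return $out; // an unterminated trailing sequence is dropped
}

echo bin2hex(strip_malformed_utf8("\x97alo")), "\n"; // 616c6f
```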

#13


1  

To remove all Unicode characters outside of the Unicode Basic Multilingual Plane:

$str = preg_replace('/[^\x{0000}-\x{FFFF}]/u', '', $str); // \x{...} with /u compares code points, not bytes
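
A short check of a code-point-based version of this idea, written as a sketch: with the /u modifier and \x{...} escapes, PCRE compares whole code points, so the class below matches exactly the characters above U+FFFF (such as most emoji). It assumes the input is valid UTF-8; otherwise preg_replace() returns null.

```php
<?php
$str = "ok \u{1F600}!";   // contains U+1F600, which lies outside the BMP

// Remove every code point that is not in U+0000..U+FFFF.
$bmpOnly = preg_replace('/[^\x{0000}-\x{FFFF}]/u', '', $str);

echo $bmpOnly, "\n";      // ok !
```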

#14


0  

Slightly different from the question, but what I do is use HtmlEncode(string).

Pseudocode:

var encoded = HtmlEncode(string);
encoded = Regex.Replace(encoded, "&#\d+?;", "");
var result = HtmlDecode(encoded);

Input and output:

"Headlight\x007E Bracket, &#123; Cafe Racer<> Style, Stainless Steel 中文呢?"
"Headlight~ Bracket, &#123; Cafe Racer<> Style, Stainless Steel 中文呢?"

I know it's not perfect, but it does the job for me.

#15


-1  

How about iconv:

http://php.net/manual/en/function.iconv.php

I haven't used it inside PHP itself, but it has always performed well for me on the command line. You can get it to substitute invalid characters.
