I am trying to use preg_replace to remove the Japanese full-width space "　" (U+3000, IDEOGRAPHIC SPACE) from a string input, but I end up with a corrupted multi-byte string.
I would prefer to use preg_replace rather than str_replace. Here is some sample code:
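$keywords = '　ラメ単色'; // starts with U+3000 (IDEOGRAPHIC SPACE)
// With str_replace (full-width space and ASCII space):
$keywords = str_replace(array('　', ' '), ' ', urldecode($keywords));
// outputs: 'ラメ単色'
// With preg_replace (character class holds an ASCII space and U+3000):
$keywords = preg_replace("@[ 　]@", ' ', urldecode($keywords));
// outputs: '�� ��単色'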
Does anyone have an idea why this happens and how to fix it?
4 Answers
#1
7
Add the u flag to your regex. This makes the regex engine treat the pattern and the input string as UTF-8.
$keywords = preg_replace("@[ 　]@u", ' ', urldecode($keywords)); // class: ASCII space and U+3000
// outputs: 'ラメ単色'
The string gets mangled because, without the u flag, the regex engine does not treat the characters in your character class, 20 (space) and e3 80 80 (IDEOGRAPHIC SPACE), as two characters, but as the separate bytes 20, e3 and 80.
When you look at the byte sequence of the string being scanned, you get e3 80 80 e3 83 a9 e3 83 a1 e5 8d 98 e8 89 b2. We know the first character is an IDEOGRAPHIC SPACE, but because the engine treats the string as a sequence of bytes, it replaces each of the first four bytes individually: each one matches one of the bytes in the character class.
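To see those bytes for yourself, you can dump the string as hex. This is a minimal sketch; it assumes the source file is saved as UTF-8, and the variable names are only illustrative.

```php
<?php
// "\xE3\x80\x80" is the UTF-8 encoding of U+3000 (IDEOGRAPHIC SPACE).
// Assumes this file itself is saved as UTF-8.
$keywords = "\xE3\x80\x80ラメ単色";

// bin2hex() returns one long hex string; split it into byte pairs.
$hex = implode(' ', str_split(bin2hex($keywords), 2));

echo $hex, "\n";
// e3 80 80 e3 83 a9 e3 83 a1 e5 8d 98 e8 89 b2
```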
As for the mangling that produces the � (REPLACEMENT CHARACTER): it happens because the byte e3 also appears further along in the string. The e3 byte is the lead byte of a three-byte Japanese character, such as e3 83 a9 (KATAKANA LETTER RA). When that leading e3 is replaced with 20 (space), the bytes that remain no longer form a valid UTF-8 sequence.
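The byte-wise replacement can be reproduced roughly like this. It is only a sketch: the character class is spelled out with \x escapes so the bytes are visible, and preg_match('//u', ...) is used here as one way to check whether a string is valid UTF-8.

```php
<?php
$keywords = "\xE3\x80\x80ラメ単色"; // U+3000 followed by the katakana and kanji

// No u flag: the class effectively contains the bytes 20, e3 and 80.
$mangled = preg_replace("@[ \xE3\x80\x80]@", ' ', $keywords);

$hex = implode(' ', str_split(bin2hex($mangled), 2));
echo $hex, "\n";
// 20 20 20 20 83 a9 20 83 a1 e5 8d 98 e8 89 b2

// The orphaned continuation bytes 83 a9 and 83 a1 are what display as �.
// preg_match() with the u flag returns false for a subject that is not valid UTF-8.
$isValid = preg_match('//u', $mangled) === 1;
var_dump($isValid); // bool(false)
```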
When you enable the u flag, the regex engine treats the pattern and the string as UTF-8, so the characters in your character class are matched as whole characters rather than byte by byte.
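With the u flag you can also name the character by code point instead of pasting it in, which keeps the pattern readable. A small sketch, assuming UTF-8 input:

```php
<?php
$keywords = "\xE3\x80\x80ラメ単色";

// \x{3000} names IDEOGRAPHIC SPACE by code point; this form needs the u flag.
$clean = preg_replace('/[ \x{3000}]/u', ' ', $keywords);

echo bin2hex($clean), "\n"; // 20e383a9e383a1e58d98e889b2
var_dump(preg_match('//u', $clean) === 1); // bool(true)
```

The leading three bytes e3 80 80 have become a single 20, and the rest of the string is untouched.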
#2
2
To avoid additional problems, also consider setting the internal encoding explicitly if you go with the mb_* functions:
mb_internal_encoding("UTF-8");
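For example, once the internal encoding is set, mb_* functions that are called without an explicit encoding argument work on characters rather than bytes. A sketch, assuming the mbstring extension is loaded:

```php
<?php
mb_internal_encoding('UTF-8');

$keywords = "\xE3\x80\x80ラメ単色"; // U+3000 plus four Japanese characters

$bytes = strlen($keywords);    // counts bytes
$chars = mb_strlen($keywords); // counts characters, using the internal encoding

echo $bytes, ' ', $chars, "\n"; // 15 5
```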
#3
1
It is always good to dig into the documentation. I found out that the preg_* functions are not optimized for multibyte characters; the mb_ereg_* and mb_* functions are supposed to be used instead. I solved this little issue by refactoring the code to something like:
$keywords = '　ラメ単色';
$pattern = " " /* ASCII whitespace */ . "　" /* multi-byte whitespace (U+3000) */;
$keywords = trim(mb_ereg_replace("[{$pattern}]+", ' ', urldecode($keywords)));
// outputs: 'ラメ単色'
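A variant of the same idea, written so that the two kinds of whitespace are visible in the source. This is a sketch that assumes the mbstring extension with regex support; the mb_regex_encoding() call and the variable names are additions for illustration, not part of the code above, and urldecode() is left out because the sample string is not URL-encoded.

```php
<?php
mb_regex_encoding('UTF-8'); // encoding used by the mb_ereg_* functions

$keywords = "\xE3\x80\x80ラメ単色";

$ascii = ' ';            // ASCII space
$wide  = "\xE3\x80\x80"; // U+3000 IDEOGRAPHIC SPACE, as UTF-8 bytes

// Collapse runs of either space into one ASCII space, then trim the ends.
$result = trim(mb_ereg_replace("[{$ascii}{$wide}]+", ' ', $keywords));

echo $result, "\n"; // ラメ単色
```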
Thanks all the same!
#4
-1
Use this:
$keywords = preg_replace('/\s+/', ' ',urldecode($keywords));
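Note that without the u flag, \s in the default setup only covers ASCII whitespace, so this pattern leaves the full-width space alone. A sketch of the difference, assuming UTF-8 input; \x{3000} is listed explicitly so the second pattern does not depend on how \s is interpreted in UTF-8 mode:

```php
<?php
$keywords = "\xE3\x80\x80ラメ単色";

// Byte mode: none of the bytes e3 80 80 is ASCII whitespace, so nothing changes.
$plain = preg_replace('/\s+/', ' ', $keywords);
var_dump($plain === $keywords); // bool(true)

// UTF-8 mode, with IDEOGRAPHIC SPACE named explicitly in the class.
$fixed = preg_replace('/[\s\x{3000}]+/u', ' ', $keywords);
echo bin2hex($fixed), "\n"; // 20e383a9e383a1e58d98e889b2
```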