In PHP < 6, what is the best way to split a string into an array of Unicode characters? What if the input is not necessarily UTF-8?
I want to know whether the set of Unicode characters in an input string is a subset of another set of Unicode characters.
Why not go straight for the mb_ family of functions, which the first couple of answers avoided?
6 answers
#1
15
You could use the 'u' modifier with a PCRE regex; see Pattern Modifiers (quoting):
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
For instance, consider this code:
header('Content-type: text/html; charset=UTF-8'); // So the browser doesn't make our lives harder
$str = "abc 文字化け, efg";
$results = array();
preg_match_all('/./', $str, $results);
var_dump($results[0]);
You'll get an unusable result:
array
0 => string 'a' (length=1)
1 => string 'b' (length=1)
2 => string 'c' (length=1)
3 => string ' ' (length=1)
4 => string '�' (length=1)
5 => string '�' (length=1)
6 => string '�' (length=1)
7 => string '�' (length=1)
8 => string '�' (length=1)
9 => string '�' (length=1)
10 => string '�' (length=1)
11 => string '�' (length=1)
12 => string '�' (length=1)
13 => string '�' (length=1)
14 => string '�' (length=1)
15 => string '�' (length=1)
16 => string ',' (length=1)
17 => string ' ' (length=1)
18 => string 'e' (length=1)
19 => string 'f' (length=1)
20 => string 'g' (length=1)
But with this code:
header('Content-type: text/html; charset=UTF-8'); // So the browser doesn't make our lives harder
$str = "abc 文字化け, efg";
$results = array();
preg_match_all('/./u', $str, $results);
var_dump($results[0]);
(Notice the 'u' at the end of the regex.)
You get what you want:
array
0 => string 'a' (length=1)
1 => string 'b' (length=1)
2 => string 'c' (length=1)
3 => string ' ' (length=1)
4 => string '文' (length=3)
5 => string '字' (length=3)
6 => string '化' (length=3)
7 => string 'け' (length=3)
8 => string ',' (length=1)
9 => string ' ' (length=1)
10 => string 'e' (length=1)
11 => string 'f' (length=1)
12 => string 'g' (length=1)
Hope this helps :-)
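On the "not necessarily UTF-8" part of the question: with the 'u' modifier, preg_match_all() returns false when the subject is not valid UTF-8, so the input has to be converted first. Here is a minimal sketch of that idea. The function name split_chars() and the ISO-8859-1 fallback are assumptions for illustration only; PHP cannot reliably guess the real source encoding, so you need to know it from elsewhere.

```php
<?php
// Hypothetical helper: split a string into characters, converting it to
// UTF-8 first when it is not already valid UTF-8.
// $fallbackEncoding is an assumption; replace it with the encoding you
// actually expect from your input source.
function split_chars($str, $fallbackEncoding = 'ISO-8859-1')
{
    if (!mb_check_encoding($str, 'UTF-8')) {
        $str = mb_convert_encoding($str, 'UTF-8', $fallbackEncoding);
    }

    $matches = array();
    // With /u, preg_match_all() returns false on invalid UTF-8 input.
    if (preg_match_all('/./u', $str, $matches) === false) {
        return array();
    }
    return $matches[0];
}

// "café" encoded as ISO-8859-1: the lone 0xE9 byte is not valid UTF-8.
$latin1 = "caf\xE9";
var_dump(split_chars($latin1));
```

The last element comes back as the two-byte UTF-8 form of "é".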
#2
6
Slightly simpler than preg_match_all:
preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)
This gives you back a 1-dimensional array of characters. No need for a matches object.
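Since the question is really about checking whether the characters of one string are a subset of another set, the split can be combined with array_diff(). This is only a sketch: uses_only_chars() is a made-up name, and both arguments are assumed to be UTF-8 (invalid input is treated as "not a subset").

```php
<?php
// Hypothetical helper: true when every character of $input also occurs
// in $allowed. Both strings are assumed to be UTF-8.
function uses_only_chars($input, $allowed)
{
    $inputChars   = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);
    $allowedChars = preg_split('//u', $allowed, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false when a subject is not valid UTF-8.
    if ($inputChars === false || $allowedChars === false) {
        return false;
    }

    // array_diff() keeps the characters of $input that are missing from $allowed.
    return count(array_diff($inputChars, $allowedChars)) === 0;
}

var_dump(uses_only_chars("化け", "文字化け"));  // bool(true)
var_dump(uses_only_chars("化けx", "文字化け")); // bool(false)
```

Note that this compares code points one by one, so a precomposed "é" and an "e" followed by a combining accent count as different characters.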
#3
5
Try this:
preg_match_all('/./u', $text, $array);
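One caveat with '/./u': by default the dot does not match a newline, so line breaks silently disappear from the result. Adding the 's' modifier keeps them. A small sketch to show the difference (the variable names are just for illustration):

```php
<?php
$text = "文\n字";

$withoutS = array();
$withS    = array();
preg_match_all('/./u', $text, $withoutS);  // '.' skips the newline
preg_match_all('/./us', $text, $withS);    // 's' makes '.' match it too

var_dump(count($withoutS[0])); // int(2)
var_dump(count($withS[0]));    // int(3)
```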
#4
1
If for some reason the regex way isn't enough for you: I once wrote Zend_Locale_UTF8, which is abandoned but might help you if you decide to do it on your own.
In particular, have a look at the class Zend_Locale_UTF8_PHP5_String, which reads in Unicode strings and, to work with them, splits them up into single characters (which may obviously consist of multiple bytes).
EDIT: I just realized that ZF's svn-browser is down, so I copied the important methods for convenience:
/**
 * Returns the UTF-8 code sequence as an array for any given $string.
 *
 * @access protected
 * @param string|integer $string
 * @return array
 */
protected function _decode( $string ) {
    $string = (string) $string;
    $length = strlen($string);
    $sequence = array();
    for ( $i=0; $i<$length; ) {
        $bytes = $this->_characterBytes($string, $i);
        $ord = $this->_ord($string, $bytes, $i);
        if ( $ord !== false )
            $sequence[] = $ord;
        // An invalid lead byte is skipped one byte at a time.
        if ( $bytes === false )
            $i++;
        else
            $i += $bytes;
    }
    return $sequence;
}
/**
 * Returns the code point of the UTF-8 character starting at $pos.
 *
 * @see http://en.wikipedia.org/wiki/UTF-8#Description
 * @access protected
 * @param string $string
 * @param integer $bytes
 * @param integer $pos
 * @return integer|false
 */
protected function _ord( &$string, $bytes = null, $pos=0 )
{
    if ( is_null($bytes) )
        $bytes = $this->_characterBytes($string, $pos);
    // Make sure the whole sequence lies inside the string.
    if ( strlen($string) >= $pos + $bytes ) {
        switch ( $bytes ) {
            case 1:
                return ord($string[$pos]);
                break;
            case 2:
                return ( (ord($string[$pos]) & 0x1f) << 6 ) +
                       ( (ord($string[$pos+1]) & 0x3f) );
                break;
            case 3:
                return ( (ord($string[$pos]) & 0xf) << 12 ) +
                       ( (ord($string[$pos+1]) & 0x3f) << 6 ) +
                       ( (ord($string[$pos+2]) & 0x3f) );
                break;
            case 4:
                // Four bytes: lead byte plus three continuation bytes.
                return ( (ord($string[$pos]) & 0x7) << 18 ) +
                       ( (ord($string[$pos+1]) & 0x3f) << 12 ) +
                       ( (ord($string[$pos+2]) & 0x3f) << 6 ) +
                       ( (ord($string[$pos+3]) & 0x3f) );
                break;
            case 0:
            default:
                return false;
        }
    }
    return false;
}
/**
 * Returns the number of bytes of the character starting at byte offset $position.
 *
 * @see http://en.wikipedia.org/wiki/UTF-8#Description
 * @access protected
 * @param string $string
 * @param integer $position
 * @return integer|false
 */
protected function _characterBytes( &$string, $position = 0 ) {
    $char = $string[$position];
    $charVal = ord($char);
    if ( ($charVal & 0x80) === 0 )
        return 1;
    elseif ( ($charVal & 0xe0) === 0xc0 )
        return 2;
    elseif ( ($charVal & 0xf0) === 0xe0 )
        return 3;
    elseif ( ($charVal & 0xf8) === 0xf0)
        return 4;
    /*
    elseif ( ($charVal & 0xfe) === 0xf8 )
        return 5;
    */
    return false;
}
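If mbstring is available, a handy way to cross-check a hand-written decoder like the one above is to convert to UCS-4BE and unpack the 32-bit integers. This is not part of Zend_Locale_UTF8; utf8_code_points() is a made-up name, and the sketch assumes valid UTF-8 input and a source file saved as UTF-8.

```php
<?php
// Hypothetical helper: list of Unicode code points for a UTF-8 string.
// Assumes the mbstring extension is loaded and $str is valid UTF-8.
function utf8_code_points($str)
{
    if ($str === '') {
        return array();
    }
    // UCS-4BE stores every character as one 32-bit big-endian integer.
    $ucs4 = mb_convert_encoding($str, 'UCS-4BE', 'UTF-8');
    // unpack() returns a 1-based array; array_values() re-indexes it from 0.
    return array_values(unpack('N*', $ucs4));
}

print_r(utf8_code_points("aé文")); // 97, 233, 25991
```

The three values correspond to U+0061, U+00E9 and U+6587, covering the 1-, 2- and 3-byte cases of the switch.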
#5
0
I was able to write a solution using mb_*, including a trip to UTF-16 and back in a probably silly attempt to speed up string indexing:
$japanese2 = mb_convert_encoding($japanese, "UTF-16", "UTF-8");
$length = mb_strlen($japanese2, "UTF-16");
for($i=0; $i<$length; $i++) {
$char = mb_substr($japanese2, $i, 1, "UTF-16");
$utf8 = mb_convert_encoding($char, "UTF-8", "UTF-16");
print $utf8 . "\n";
}
I had better luck avoiding mb_internal_encoding and just specifying the encoding explicitly in each mb_* call. I'm sure I'll wind up using the preg solution.
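For comparison, the same loop also works directly on UTF-8 without the round trip, as long as the encoding is passed to every call. A minimal sketch; mb_chars() is a hypothetical name, and I have not measured whether it is faster or slower than the UTF-16 version.

```php
<?php
// Hypothetical helper: split a string into characters using only mb_*
// functions, passing the encoding explicitly on every call.
function mb_chars($str, $encoding = 'UTF-8')
{
    $chars  = array();
    $length = mb_strlen($str, $encoding);
    for ($i = 0; $i < $length; $i++) {
        // Take exactly one character starting at character offset $i.
        $chars[] = mb_substr($str, $i, 1, $encoding);
    }
    return $chars;
}

print_r(mb_chars("abc 文字"));
```

Each mb_substr() call works with a character offset rather than a byte offset, so for long strings the preg approach is probably the better choice.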
#6
0
The best way to split by length: I just adapted Laravel's str_limit() function:
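public static function split_text($text, $limit = 100, $end = '')
{
    $width = mb_strwidth($text, 'UTF-8');
    if ($width <= $limit) {
        // Return an array here too, so callers always get the same type.
        return [$text];
    }
    $res = [];
    // Note: mb_strimwidth() takes its start as a character offset but its
    // limit as a display width, so chunks only line up for single-width text.
    for ($i = 0; $i < $width; $i = $i + $limit) {
        $res[] = rtrim(mb_strimwidth($text, $i, $limit, '', 'UTF-8')) . $end;
    }
    return $res;
}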
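If what you want is chunks of a fixed number of characters rather than a fixed display width, splitting first and regrouping with array_chunk() sidesteps the width question entirely. A sketch under that assumption; split_by_chars() is a made-up name, the input is assumed to be UTF-8, and $limit must be at least 1.

```php
<?php
// Hypothetical helper: split UTF-8 text into pieces of at most $limit
// characters each (characters, not bytes and not display columns).
function split_by_chars($text, $limit = 100)
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return array(); // input was not valid UTF-8
    }

    $chunks = array();
    // array_chunk() groups the characters; implode() glues each group back.
    foreach (array_chunk($chars, $limit) as $chunk) {
        $chunks[] = implode('', $chunk);
    }
    return $chunks;
}

print_r(split_by_chars("abc 文字化け, efg", 5));
```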