I want to split a large string by a series of words.
E.g.
$splitby = array('these','are','the','words','to','split','by');
$text = 'This is the string which needs to be split by the above words.';
Then the results would be:
$text[0]='This is';
$text[1]='string which needs';
$text[2]='be';
$text[3]='above';
$text[4]='.';
How can I do this? Is preg_split the best way, or is there a more efficient method? I'd like it to be as fast as possible, as I'll be splitting hundreds of MB of files.
4 solutions
#1
3
I don't think a PCRE regex is necessary if what you really need is to split on whole words.
You could do something like this and benchmark it to see whether it's faster or better:
$splitby = array('these','are','the','words','to','split','by');
$text = 'This is the string which needs to be split by the above words.';
$split = explode(' ', $text);
$result = array();
$temp = array();
foreach ($split as $s) {
    if (in_array($s, $splitby)) {
        if (sizeof($temp) > 0) {
            $result[] = implode(' ', $temp);
            $temp = array();
        }
    } else {
        $temp[] = $s;
    }
}
if (sizeof($temp) > 0) {
    $result[] = implode(' ', $temp);
}
var_dump($result);
/* output
array(4) {
  [0]=>
  string(7) "This is"
  [1]=>
  string(18) "string which needs"
  [2]=>
  string(2) "be"
  [3]=>
  string(12) "above words."
}
*/
The only difference from your expected output is the last element: because "words." != "words", the final token is not treated as a split word.
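Since speed matters here, one possible refinement of the loop above is to replace the linear in_array() scan with a hash-key lookup. This is only a sketch: the function name splitByWords is hypothetical (it is not part of the original answer), and whether it actually helps on your files is something to benchmark rather than assume.

```php
<?php
// Hypothetical helper, not from the original answer: same splitting logic,
// but the word list is flipped once so each check is a key lookup.
function splitByWords(string $text, array $splitby): array
{
    // Keys are the split words; the values are irrelevant.
    $lookup = array_flip($splitby);
    $result = [];
    $temp = [];
    foreach (explode(' ', $text) as $word) {
        if (isset($lookup[$word])) {
            // A split word closes the current group, if there is one.
            if ($temp !== []) {
                $result[] = implode(' ', $temp);
                $temp = [];
            }
        } else {
            $temp[] = $word;
        }
    }
    // Flush whatever is left after the last split word.
    if ($temp !== []) {
        $result[] = implode(' ', $temp);
    }
    return $result;
}

$splitby = ['these', 'are', 'the', 'words', 'to', 'split', 'by'];
$text = 'This is the string which needs to be split by the above words.';
print_r(splitByWords($text, $splitby));
```

As with the loop above, "words." is kept because the comparison is on whole space-separated tokens, so the last element is still "above words.".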
#2
7
This should be reasonably efficient. However, you may want to test it with some files and report back on the performance.
$splitby = array('these','are','the','words','to','split','by');
$text = 'This is the string which needs to be split by the above words.';
// implode() takes the separator first; the reversed order is no longer accepted in PHP 8.
$pattern = '/\s?'.implode('\s?|\s?', $splitby).'\s?/';
$result = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
- Regular Expression Demo: http://rubular.com/r/jNUO1KvrXg
- PHP Code Demo: http://www.ideone.com/ov3Wl
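One caveat with a plain alternation is that a split word can also match inside a longer word (for example "the" inside "other"). A possible variant, sketched here under the assumption that you want whole-word matches only, wraps the alternation in \b word boundaries and runs each word through preg_quote() in case the list ever contains regex metacharacters. The variable name $quoted is illustrative.

```php
<?php
$splitby = ['these', 'are', 'the', 'words', 'to', 'split', 'by'];
$text = 'This is the string which needs to be split by the above words.';

// Escape each word so it is matched literally, including the '/' delimiter.
$quoted = array_map(function ($w) {
    return preg_quote($w, '/');
}, $splitby);

// \b restricts matches to whole words; \s* swallows the surrounding spaces.
$pattern = '/\s*\b(?:' . implode('|', $quoted) . ')\b\s*/';

$result = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
```

For the sample text this yields "This is", "string which needs", "be", "above" and ".", which matches the output asked for in the question.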
#3
4
preg_split can be used as:
$pieces = preg_split('/'.implode('\s*|\s*',$splitby).'/',$text,-1,PREG_SPLIT_NO_EMPTY);
#4
-1
Since the words in your $splitby array are not regular expressions, maybe you can use