Suppose I have a string such as "Let's split this string into many small ones" and I want to split it on the words this, into and ones
such that the output looks something like this:
["Let's split", "this string", "into many small", "ones"]
What is the most efficient way to do it?
3 answers
#1 (score: 11)
With a lookahead:
>>> re.split(r'\s(?=(?:this|into|ones)\b)', "Let's split this string into many small ones")
["Let's split", 'this string', 'into many small', 'ones']
#2 (score: 3)
By using re.split():
>>> re.split(r'(this|into|ones)', "Let's split this string into many small ones")
["Let's split ", 'this', ' string ', 'into', ' many small ', 'ones', '']
By putting the words to split on in a capturing group, the output includes the words we split on.
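To make the effect of the capturing group concrete, here is a small sketch comparing it with a non-capturing group on the same input. The variable names are illustrative only.

```python
import re

text = "Let's split this string into many small ones"

# Capturing group: the matched delimiters are kept in the result.
with_delims = re.split(r'(this|into|ones)', text)

# Non-capturing group: the matched delimiters are dropped.
without_delims = re.split(r'(?:this|into|ones)', text)

print(with_delims)
print(without_delims)
```

In both cases the trailing empty string appears because the input ends with a delimiter.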
If you need the spaces removed, use map(str.strip, result) on the re.split() output:
>>> list(map(str.strip, re.split(r'(this|into|ones)', "Let's split this string into many small ones")))
["Let's split", 'this', 'string', 'into', 'many small', 'ones', '']
You could also use filter(None, result) to remove any empty strings if need be:
>>> list(filter(None, map(str.strip, re.split(r'(this|into|ones)', "Let's split this string into many small ones"))))
["Let's split", 'this', 'string', 'into', 'many small', 'ones']
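In Python 3, map() and filter() return lazy iterators rather than lists, so the strip-and-filter steps can also be written as a single list comprehension. A minimal sketch, with illustrative variable names:

```python
import re

text = "Let's split this string into many small ones"

parts = re.split(r'(this|into|ones)', text)

# Strip each piece and keep only the ones that are non-empty afterwards.
cleaned = [p.strip() for p in parts if p.strip()]

print(cleaned)
```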
To split on words but keep them attached to the following group, you need to use a lookahead assertion instead:
>>> re.split(r'\s(?=(?:this|into|ones)\b)', "Let's split this string into many small ones")
["Let's split", 'this string', 'into many small', 'ones']
Now we are really splitting on whitespace, but only on whitespace that is followed by a whole word from the set this, into and ones.
#3 (score: 0)
Here's a fairly lazy way to do it:
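import re

def resplit(regex, s):
    # Yield the pieces of s, starting a new piece wherever regex begins to match.
    current = 0
    for x in regex.finditer(s):
        start = x.start()
        yield s[current:start]
        current = start
    # Emit whatever follows the last match (the whole string if nothing matched).
    yield s[current:]

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
print(list(resplit(regex, s)))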
I don't know for sure if this is the most efficient, but it's pretty clean.
Basically, we iterate through the matches, taking one piece at a time. The pieces are determined by the index in the string (s) where the regex starts to match. We chop the string up to that point and save that index as the start of the next slice.
As for performance, ignacio clearly wins this round (times in seconds for timeit's default of 1,000,000 calls):
9.1412050724 -- Me
3.09771895409 -- ignacio
Code:
import re
import timeit

def resplit(regex, s):
    # Yield the pieces of s, starting a new piece wherever regex begins to match.
    current = 0
    for x in regex.finditer(s):
        start = x.start()
        yield s[current:start]
        current = start
    yield s[current:]

def me(regex, s):
    return list(resplit(regex, s))

def ignacio(regex, s):
    # Split on whitespace that precedes a keyword (lookahead pattern).
    return regex.split(s)

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
regex2 = re.compile(r'\s(?=(?:this|into|ones)\b)')

print(timeit.timeit("me(regex,s)", "from __main__ import me,regex,s"))
print(timeit.timeit("ignacio(regex2,s)", "from __main__ import ignacio,regex2,s"))
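Before reading too much into the timings, it may be worth checking that the two approaches return the same thing. The sketch below suggests they do not quite: the finditer-based version keeps trailing whitespace on each piece. The function name pieces_by_finditer is a stand-in for the generator approach, not a name from the answer.

```python
import re

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
regex2 = re.compile(r'\s(?=(?:this|into|ones)\b)')

def pieces_by_finditer(regex, s):
    """Cut s at every offset where regex starts to match."""
    out = []
    current = 0
    for m in regex.finditer(s):
        out.append(s[current:m.start()])
        current = m.start()
    out.append(s[current:])
    return out

a = pieces_by_finditer(regex, s)
b = regex2.split(s)

print(a == b)                          # the raw outputs differ
print([p.strip() for p in a] == b)     # they agree once whitespace is stripped
```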