Efficiently splitting a Python string on multiple string delimiters

Time: 2022-05-19 21:44:08

Suppose I have a string such as "Let's split this string into many small ones" and I want to split it on the words this, into and ones, so that the output looks something like this:

["Let's split", "this string", "into many small", "ones"]

What is the most efficient way to do it?

3 solutions

#1


11  

With a lookahead:

>>> import re
>>> re.split(r'\s(?=(?:this|into|ones)\b)', "Let's split this string into many small ones")
["Let's split", 'this string', 'into many small', 'ones']
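
If the delimiter words are not known in advance, the same lookahead pattern can be built from a list. The following is only a sketch: the helper name split_before_words is made up for illustration, and it assumes a non-empty list of delimiters that each end in a word character (so that the trailing \b behaves as expected).

```python
import re

def split_before_words(text, words):
    """Split text at whitespace that precedes any whole word in words.

    Hypothetical helper, not part of the original answer.
    """
    # re.escape guards against delimiters that contain regex metacharacters.
    alternation = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\s(?=(?:%s)\b)" % alternation)
    return pattern.split(text)

print(split_before_words("Let's split this string into many small ones",
                         ["this", "into", "ones"]))
# ["Let's split", 'this string', 'into many small', 'ones']
```

Compiling the pattern once inside the helper keeps the call site short; if the same delimiter list is reused many times, compiling it outside and reusing the compiled object would avoid repeated work.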

#2


3  

By using re.split():

>>> import re
>>> re.split(r'(this|into|ones)', "Let's split this string into many small ones")
["Let's split ", 'this', ' string ', 'into', ' many small ', 'ones', '']

Because the words to split on are inside a capturing group, the output also includes the words we split on.

If you need the spaces removed, use map(str.strip, result) on the re.split() output (wrapped in list() on Python 3, where map() returns an iterator):

>>> list(map(str.strip, re.split(r'(this|into|ones)', "Let's split this string into many small ones")))
["Let's split", 'this', 'string', 'into', 'many small', 'ones', '']

You can also use filter(None, result) to remove any empty strings if need be:

>>> list(filter(None, map(str.strip, re.split(r'(this|into|ones)', "Let's split this string into many small ones"))))
["Let's split", 'this', 'string', 'into', 'many small', 'ones']
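
The strip and filter steps can also be combined in a single list comprehension, which some readers find easier to follow than nested map() and filter() calls. This is an alternative sketch, not what the answer above uses; the variable names parts and cleaned are chosen only for illustration.

```python
import re

parts = re.split(r"(this|into|ones)",
                 "Let's split this string into many small ones")

# Strip each piece and drop the ones that end up empty.
cleaned = [p.strip() for p in parts if p.strip()]

print(cleaned)
# ["Let's split", 'this', 'string', 'into', 'many small', 'ones']
```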

To split on words but keep them attached to the following group, you need to use a lookahead assertion instead:

>>> re.split(r'\s(?=(?:this|into|ones)\b)', "Let's split this string into many small ones")
["Let's split", 'this string', 'into many small', 'ones']

Now we are really splitting on whitespace, but only on whitespace that is followed by one of the whole words this, into or ones.
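
To see what the \b in the lookahead contributes, here is a small comparison on a made-up sentence (the input text is an assumption for illustration, not from the question). The preceding whitespace already guarantees that the match starts at the beginning of a word; \b makes sure it also ends at the end of one, so a longer word such as "thistle" does not trigger a split.

```python
import re

text = "we split thistle into pieces"

# With \b: "this" must be a complete word, so "thistle" is left alone.
with_boundary = re.split(r"\s(?=(?:this|into|ones)\b)", text)

# Without \b: any word that merely starts with "this" also causes a split.
without_boundary = re.split(r"\s(?=(?:this|into|ones))", text)

print(with_boundary)     # ['we split thistle', 'into pieces']
print(without_boundary)  # ['we split', 'thistle', 'into pieces']
```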

#3


0  

Here's a fairly lazy way to do it:

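import re

def resplit(regex, s):
    # Yield the pieces of s, starting a new piece wherever regex matches.
    current = 0
    for x in regex.finditer(s):
        start = x.start()
        yield s[current:start]
        current = start
    # Emit the tail. Slicing from current (not start) also works when
    # the regex never matches and start was never assigned.
    yield s[current:]

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
print(list(resplit(regex, s)))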

I don't know for sure if this is the most efficient, but it's pretty clean.

Basically, we just iterate through the matches, taking one piece at a time. Each piece is determined by the index in the string (s) where the regex starts to match. We slice the string up to that point and save that index as the start of the next slice.


As for performance, ignacio clearly wins this round (times in seconds for timeit's default of 1,000,000 calls):

9.1412050724  -- Me
3.09771895409  -- ignacio

Code:

import re
import timeit

def resplit(regex, s):
    # Yield the pieces of s, starting a new piece wherever regex matches.
    current = 0
    for x in regex.finditer(s):
        start = x.start()
        yield s[current:start]
        current = start
    # Slice from current so this also works when there are no matches.
    yield s[current:]


def me(regex, s):
    return list(resplit(regex, s))

def ignacio(regex, s):
    # Split the string that was passed in rather than a hard-coded copy.
    return regex.split(s)

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
regex2 = re.compile(r'\s(?=(?:this|into|ones)\b)')

print(timeit.timeit("me(regex,s)", "from __main__ import me,regex,s"))
print(timeit.timeit("ignacio(regex2,s)", "from __main__ import ignacio,regex2,s"))
