How do I correctly split this list of strings?

Date: 2022-10-26 20:56:58

I have a list of strings such as this:

['z+2-44', '4+55+z+88']

How can I split the strings in this list so that the result looks something like this:

[['z','+','2','-','44'],['4','+','55','+','z','+','88']]

I have already tried using the split method, but that splits the 44 into 4 and 4, and I am not sure what else to try.

5 answers

#1


26  

You can use regex:

import re

lst = ['z+2-44', '4+55+z+88']
# Raw string so that \w and \W reach the regex engine unchanged
[re.findall(r'\w+|\W+', s) for s in lst]
# [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]

The pattern \w+|\W+ matches either a run of word characters (the alphanumeric values in your case) or a run of non-word characters (the + and - signs in your case).
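To see what each alternative of the pattern contributes, here is a minimal sketch. The helper name split_tokens is made up for illustration. It also shows two edge cases that follow from how \w and \W are defined: \w includes the underscore, and \W+ swallows a whole run of non-word characters, spaces included.

```python
import re

def split_tokens(s):
    """Split s into runs of word characters and runs of non-word characters."""
    return re.findall(r'\w+|\W+', s)

print(split_tokens('z+2-44'))   # ['z', '+', '2', '-', '44']
print(split_tokens('a_1 + b'))  # ['a_1', ' + ', 'b']
```

If the input may contain spaces, it is probably worth stripping them first or filtering them out of the result.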

#2


14  

This works using itertools.groupby:

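import itertools

z = ['z+2-44', '4+55+z+88']

# Group consecutive characters by whether they are alphanumeric, then join each group
print([["".join(x) for k, x in itertools.groupby(i, str.isalnum)] for i in z])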

Output:

[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]

It groups consecutive characters by whether they are alphanumeric or not, then joins each group back into a string inside a list comprehension.
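The grouping step can be made visible with a small sketch that prints the key next to each joined group (the variable names are only illustrative):

```python
import itertools

s = 'z+2-44'
# Each group is a run of consecutive characters sharing the same key,
# where the key is the result of str.isalnum for that character.
groups = [(key, "".join(chars)) for key, chars in itertools.groupby(s, str.isalnum)]
print(groups)
# [(True, 'z'), (False, '+'), (True, '2'), (False, '-'), (True, '44')]
```

Because groupby only merges adjacent items with equal keys, the two digits of 44 stay together while each operator forms its own group.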

EDIT: the general case of a calculator with parentheses has been asked as a follow-up question here. If z is as follows:

z = ['z+2-44', '4+55+((z+88))']

then with the previous grouping we get:

[['z', '+', '2', '-', '44'], ['4', '+', '55', '+((', 'z', '+', '88', '))']]

This is not easy to parse as tokens, because consecutive symbols such as +(( end up in a single string. So a change would be to join a group only if it is alphanumeric and to leave it as individual characters otherwise, flattening at the end with chain.from_iterable:

print([list(itertools.chain.from_iterable(["".join(x)] if k else x for k,x in itertools.groupby(i,str.isalnum))) for i in z])

which yields:

[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', '(', '(', 'z', '+', '88', ')', ')']]
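For readability, the same idea can be unrolled into a plain loop. This is only a sketch of the one-liner above, and the function name tokenize_expr is an assumption made for the example:

```python
import itertools

def tokenize_expr(s):
    """Join alphanumeric runs into one token; emit every other character on its own."""
    tokens = []
    for is_alnum, chars in itertools.groupby(s, str.isalnum):
        if is_alnum:
            tokens.append("".join(chars))
        else:
            tokens.extend(chars)
    return tokens

print(tokenize_expr('4+55+((z+88))'))
# ['4', '+', '55', '+', '(', '(', 'z', '+', '88', ')', ')']
```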

(Note that the alternative regex answer can be adapted in the same way: [re.findall(r'\w+|\W', s) for s in lst]. Note the lack of + after \W.)

Also, "".join(list(x)) is slightly faster than "".join(x), but I'll let you add it yourself to avoid hurting the readability of that already complex expression.
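If you want to check the speed claim on your own machine, a rough timeit sketch could look like this. The test string and repeat count are arbitrary choices, and the timings will vary by interpreter version, so no expected numbers are shown:

```python
import itertools
import timeit

s = '4+55+((z+88))' * 50

def join_iter():
    return ["".join(g) for _, g in itertools.groupby(s, str.isalnum)]

def join_list():
    return ["".join(list(g)) for _, g in itertools.groupby(s, str.isalnum)]

# Both variants must agree before their timings are worth comparing.
assert join_iter() == join_list()

for fn in (join_iter, join_list):
    print(fn.__name__, timeit.timeit(fn, number=2000))
```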

#3


6  

An alternative solution using the re.split function:

import re

l = ['z+2-44', '4+55+z+88']
# The capturing group keeps the operands; filter(None, ...) drops the empty strings
print([list(filter(None, re.split(r'(\w+)', i))) for i in l])

The output:

[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]
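To see why both the capturing group and the filter are needed, it may help to look at what re.split returns in each case for a single string:

```python
import re

s = 'z+2-44'
# Without a capturing group, the matched operands are discarded.
print(re.split(r'\w+', s))    # ['', '+', '-', '']
# With a capturing group, they are kept, but empty strings appear at both ends.
print(re.split(r'(\w+)', s))  # ['', 'z', '+', '2', '-', '44', '']
```

The empty strings come from the zero-length pieces before the first match and after the last one, which is what filter(None, ...) removes.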

#4


5  

You could use only the built-in str.replace() and str.split() methods within a list comprehension:

In [34]: lst = ['z+2-44', '4+55+z+88']

In [35]: [s.replace('+', ' + ').replace('-', ' - ').split() for s in lst]
Out[35]: [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]

But note that this is not an efficient approach for longer strings, since each replace() call builds a new copy of the string. In that case the best way to go is to use a regex.
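If more operators are involved, the chain of replace() calls can be written as a loop. The helper below is a hypothetical sketch, not part of the original answer, and it assumes the input contains no meaningful whitespace:

```python
def split_with_replace(s, operators='+-'):
    """Pad each operator with spaces, then split on whitespace."""
    for op in operators:
        s = s.replace(op, f' {op} ')
    return s.split()

print(split_with_replace('z+2-44'))           # ['z', '+', '2', '-', '44']
print(split_with_replace('a*(b+1)', '+*()'))  # ['a', '*', '(', 'b', '+', '1', ')']
```

Each operator costs one full pass over the string, which is the inefficiency mentioned above.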

As another Pythonic option, you can also use the tokenize module:

In [56]: from io import StringIO

In [57]: import tokenize

In [59]: [[t.string for t in tokenize.generate_tokens(StringIO(i).readline) if t.string] for i in lst]
Out[59]: [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]

From the Python documentation: The tokenize module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens as well, making it useful for implementing "pretty-printers", including colorizers for on-screen displays.
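Since the scanner also reports a type for every token, a small sketch like the following can tell names, numbers and operators apart. The function name python_tokens is made up for the example. Tokens with empty text are skipped, because the end marker (and, depending on the Python version, an implicit newline token) carries an empty string:

```python
import tokenize
from io import StringIO

def python_tokens(s):
    """Return (token type name, token text) pairs, skipping empty-text tokens."""
    return [
        (tokenize.tok_name[t.type], t.string)
        for t in tokenize.generate_tokens(StringIO(s).readline)
        if t.string  # drops ENDMARKER and any implicit NEWLINE token
    ]

print(python_tokens('z+2-44'))
# [('NAME', 'z'), ('OP', '+'), ('NUMBER', '2'), ('OP', '-'), ('NUMBER', '44')]
```

Keep in mind that this follows Python's own lexical rules, so input that is not tokenizable as Python (unbalanced parentheses, for instance) can raise tokenize.TokenError.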

#5


-1  

If you want to stick with split (hence avoiding regex), you can provide it with an optional character to split on:

>>> testing = 'z+2-44'
>>> testing.split('+')
['z', '2-44']
>>> testing.split('-')
['z+2', '44']

So, you could whip something up by chaining the split calls.
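One way such chaining might look, as a rough sketch: split on + first, then split every resulting piece on -. Note that, like the calls above, this drops the operators themselves.

```python
testing = 'z+2-44'
# Split on '+', then split every resulting piece on '-'.
parts = [piece for chunk in testing.split('+') for piece in chunk.split('-')]
print(parts)  # ['z', '2', '44']
```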

However, using regular expressions would probably be more readable:

>>> import re
>>> re.split(r'\+|\-', testing)
['z', '2', '44']

This is just saying to "split the string at any + or - character". The backslash before + is required because + is a quantifier in a regex; the backslash before - is harmless but not strictly needed, since - is only special inside a character class.
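As a side note, the same split can be written with a character class, and wrapping that class in a capturing group makes re.split keep the operators, which is closer to the output asked for in the question. A minimal sketch:

```python
import re

testing = 'z+2-44'
# Inside [...] the + needs no escape, and - is literal when it comes last.
print(re.split(r'[+-]', testing))    # ['z', '2', '44']
# Wrapping the class in a capturing group keeps the operators in the result.
print(re.split(r'([+-])', testing))  # ['z', '+', '2', '-', '44']
```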

Lastly, in this particular case, I imagine the goal is something along the lines of "split at every non-alphanumeric character", in which case regex can still save the day:

>>> re.split('[^a-zA-Z0-9]', testing)
['z', '2', '44']

It is of course worth noting that there are a million other solutions, as discussed in some other SO questions:

Python: Split string with multiple delimiters

Split Strings with Multiple Delimiters?

My answers here are targeted towards simple, readable code rather than performance, in honor of Donald Knuth.
