Splitting a string on multiple characters in Python

Time: 2021-05-06 21:38:50

I am trying to split a string on multiple characters in Python, just as I do in Java:

private static final String SPECIAL_CHARACTERS_REGEX = "[ :;'?=()!\\[\\]-]+|(?<=\\d)(?=\\D)";
String rawMessage = "let's meet tomorrow at 9:30p? 7-8pm? i=you go (no Go!) [to do !]";
String[] tokens = rawMessage.split(SPECIAL_CHARACTERS_REGEX);
System.out.println(Arrays.toString(tokens));

Here is the working demo with the correct output: Working Demo

I am trying to do exactly the same in Python, but as soon as I add the single-quote character to the regex, it does not tokenize at all. How do I get the same parse result in Python as from the Java program above?

This:

import re
tokens = re.split(r' \.', line)
print(tokens)

For line:

"let's meet tomorrow at 9:30p? 7-8pm? i=you go (no Go!) [to do !]"

Gives:

["let's meet tomorrow at 9:30p? 7-8pm? i=you go (no Go!) [to do !]"]

When I want it to give this:

[let, s, meet, tomorrow, at, 9, 30, p, 7, 8, pm, i, you, go, no, Go, to, do]

4 solutions

#1


2  

Here's an alternative that finds rather than splits:

>>> import re
>>> s = "let's meet tomorrow at 9:30p? 7-8pm? i=you go (no Go!) [to do !]"
>>> re.findall(r'\d+|[A-Za-z]+', s)
['let', 's', 'meet', 'tomorrow', 'at', '9', '30', 'p', '7', '8', 'pm', 'i', 'you', 'go', 'no', 'Go', 'to', 'do']

If it is OK to keep letters and digits together, use r'[0-9A-Za-z]+'. For letters, digits, and underscores, use r'\w+'.
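To make the difference between these patterns concrete, here is a small sketch that runs each one over sample text. The helper name `tokenize` and the second sample string are made up for illustration.

```python
import re

s = "let's meet tomorrow at 9:30p? 7-8pm? i=you go (no Go!) [to do !]"

def tokenize(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

# Digits and letters as separate tokens.
digits_and_letters_apart = tokenize(r'\d+|[A-Za-z]+', s)
# Digits and letters kept together in one token.
alnum_together = tokenize(r'[0-9A-Za-z]+', s)
# \w also accepts the underscore.
word_chars = tokenize(r'\w+', "snake_case 9pm")

print(digits_and_letters_apart[5:9])  # ['9', '30', 'p', '7']
print(alnum_together[5:8])            # ['9', '30p', '7']
print(word_chars)                     # ['snake_case', '9pm']
```

Only the first pattern reproduces the Java output from the question, because it never lets a digit and a letter share a token.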

#2


1  

Use the same regular expression you used in Java:

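import re
line = "let's meet tomorrow at 9:30p? 7-8pm? i=you go (no Go!) [to do !]"
tokens = re.split(r"[ :;'?=()!\[\]-]+|(?<=\d)(?=\D)", line)
tokens = [token for token in tokens if token]  # remove empty strings
print(tokens)
# Python 3.7+ (re.split also splits on empty matches):
# ['let', 's', 'meet', 'tomorrow', 'at', '9', '30', 'p', '7', '8', 'pm', 'i', 'you', 'go', 'no', 'Go', 'to', 'do']
# Python < 3.7 (re.split ignores empty matches):
# ['let', 's', 'meet', 'tomorrow', 'at', '9', '30p', '7', '8pm', 'i', 'you', 'go', 'no', 'Go', 'to', 'do']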
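
The '30p' and '8pm' tokens come from how older interpreters treat the zero-width alternative: before Python 3.7, re.split did not split on empty matches, so (?<=\d)(?=\D) had no effect there. A minimal sketch that should behave the same on old and new versions is to mark each digit-to-non-digit boundary with a space using re.sub first, then split on separator characters only. The names SEPARATORS and split_like_java are made up for this illustration.

```python
import re

# Same separator class as in the question, without the zero-width alternative.
SEPARATORS = r"[ :;'?=()!\[\]-]+"

def split_like_java(text):
    # Insert a space at every digit -> non-digit boundary. re.sub replaces
    # zero-width matches, so this step does not depend on re.split's behavior.
    marked = re.sub(r"(?<=\d)(?=\D)", " ", text)
    # Split on runs of separators and drop the empty strings at the edges.
    return [tok for tok in re.split(SEPARATORS, marked) if tok]

line = "let's meet tomorrow at 9:30p? 7-8pm? i=you go (no Go!) [to do !]"
print(split_like_java(line))
# ['let', 's', 'meet', 'tomorrow', 'at', '9', '30', 'p', '7', '8', 'pm', 'i', 'you', 'go', 'no', 'Go', 'to', 'do']
```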

#3


0  

Use the following code:

>>> chars = ":;'?=()![]-"  # characters to replace with a space
>>> sentence = "let's meet tomorrow at 9:30p? 7-8pm? i=you go (no Go!) [to do !]"
>>> for k in chars:  # loop over each character we want to remove
...     sentence = sentence.replace(k, ' ')  # replace every occurrence with a space
...
>>> # Add a space before each 'p' so '30p' and '8pm' come apart, then split on whitespace.
>>> # This works only because no other word in this sentence contains a 'p'.
>>> sentence = sentence.replace('p', ' p').split()
>>> sentence
['let', 's', 'meet', 'tomorrow', 'at', '9', '30', 'p', '7', '8', 'pm', 'i', 'you', 'go', 'no', 'Go', 'to', 'do']
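
The replace-based approach depends on knowing which letters follow a digit. As a hedged alternative that still avoids regex, the sketch below groups consecutive characters by kind (digit, letter, or separator) with itertools.groupby. This is a different technique from the one above, and the names `char_kind` and `tokenize_no_regex` are hypothetical.

```python
from itertools import groupby

def char_kind(ch):
    """Classify a character as 'digit', 'alpha', or 'sep' (everything else)."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "sep"

def tokenize_no_regex(text):
    # groupby yields one run per stretch of same-kind characters, so a digit
    # run and the letter run right after it become separate tokens.
    return ["".join(run) for kind, run in groupby(text, key=char_kind) if kind != "sep"]

sentence = "let's meet tomorrow at 9:30p? 7-8pm? i=you go (no Go!) [to do !]"
print(tokenize_no_regex(sentence))
# ['let', 's', 'meet', 'tomorrow', 'at', '9', '30', 'p', '7', '8', 'pm', 'i', 'you', 'go', 'no', 'Go', 'to', 'do']
```

Note that this treats every non-alphanumeric character as a separator, which is broader than the character list in the question.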

If you want to use regex:

import re
line = "let's meet tomorrow at 9:30p? 7-8pm? i=you go (no Go!) [to do !]"
tokens = re.split(r"[ :;'?=()!\[\]-]+|(?<=\d)(?=\D)", line)
tokens = [token for token in tokens if token]  # drop empty strings
# tokens is a list, so apply the 'p' fix to each token instead of calling .replace() on the list
tokens = [part for token in tokens for part in token.replace('p', ' p').split()]
print(tokens)
#['let', 's', 'meet', 'tomorrow', 'at', '9', '30', 'p', '7', '8', 'pm', 'i', 'you', 'go', 'no', 'Go', 'to', 'do']

#4


0  

That Java split regex should have worked the same in Python.
If it doesn't, it's probably a bug. The confusion is probably the overlap
between \D and [ :;'?=()!\[\]-], and how the engine handles it.

You could try to solve it by putting (?<=\d)(?=\D) first, but it
has to be coerced to do that.

This regex forces it to do that. Is this a workaround?
I don't know; I don't have Python to test with. But it works in Perl.

Coerced regex -

 #  (?<=\d)(?:[ :;'?=()!\[\]-]+|(?=\D))|(?<!\d|[ :;'?=()!\[\]-])[ :;'?=()!\[\]-]+

    (?<= \d )
    (?:
         [ :;'?=()!\[\]-]+ 
      |  (?= \D )
    )
 |  
    (?<! \d | [ :;'?=()!\[\]-] )
    [ :;'?=()!\[\]-]+ 
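
For anyone who wants to try the expanded form in Python, it can be compiled with re.VERBOSE, which ignores whitespace outside character classes. This is only a sketch: it assumes Python 3.7 or later, where re.split also splits on empty matches, and the names COERCED and tokens are illustrative.

```python
import re

# The expanded regex from above, unchanged; re.VERBOSE skips the layout
# whitespace but keeps the space inside each character class.
COERCED = re.compile(r"""
    (?<= \d )
    (?:
         [ :;'?=()!\[\]-]+
      |  (?= \D )
    )
 |
    (?<! \d | [ :;'?=()!\[\]-] )
    [ :;'?=()!\[\]-]+
""", re.VERBOSE)

line = "let's meet tomorrow at 9:30p? 7-8pm? i=you go (no Go!) [to do !]"
# Drop the empty string that split leaves after the trailing separators.
tokens = [tok for tok in COERCED.split(line) if tok]
print(tokens)
# ['let', 's', 'meet', 'tomorrow', 'at', '9', '30', 'p', '7', '8', 'pm', 'i', 'you', 'go', 'no', 'Go', 'to', 'do']
```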
