Splitting a string into words and punctuation

Date: 2021-01-22 15:45:15

I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.

For instance:

>>> c = "help, me"
>>> print c.split()
['help,', 'me']

What I really want the list to look like is:

['help', ',', 'me']

So, I want the string split at whitespace with the punctuation split from the words.

I've tried to parse the string first and then run the split:

>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...             outputCharacter = " %s" % character
...     else:
...             outputCharacter = character
...     separatedPunctuation += outputCharacter
>>> print separatedPunctuation
help , me
>>> print separatedPunctuation.split()
['help', ',', 'me']

This produces the result I want, but is painfully slow on large files.

Is there a way to do this more efficiently?

9 Answers

#1


58  

This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is, not to think about where to split the string, but what to include in the tokens.

Caveats:

  • The underscore (_) is considered an inner-word character. Replace \w if you don't want that.
  • This will not work with (single) quotes in the string.
  • Put any additional punctuation marks you want to use in the right half of the regular expression.
  • Anything not explicitly mentioned in the re is silently dropped.
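For instance, to act on the last two caveats you can simply add more marks to the right-hand character class. The extra marks chosen here (double quote, dash, colon) are my own illustration, not part of the original answer:

```python
import re

# Same idea as the pattern above, with a few extra punctuation marks
# appended to the character class (the trailing '-' is a literal dash).
pattern = r"[\w']+|[.,!?;:\"-]"

tokens = re.findall(pattern, '"Hi" - I\'m here!')
# tokens == ['"', 'Hi', '"', '-', "I'm", 'here', '!']
```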

#2


23  

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
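As a quick check (my own example, not from the answer), running the pattern above on a string with an accented word and an apostrophe shows both behaviors at once:

```python
import re

# Unicode-aware tokenization: word characters stay grouped, every other
# non-space character becomes its own token.
text = "Hello, I'm résumé!"
tokens = re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
# tokens == ['Hello', ',', 'I', "'", 'm', 'résumé', '!']
```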

#3


4  

In Perl-style regular expression syntax, \b matches a word boundary. This should come in handy for doing a regex-based split.

Edit: I have been informed by hop that "empty matches" do not work in the split function of Python's re module. I will leave this here as information for anyone else getting stumped by this "feature".
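A later note (mine, not the answer's): since Python 3.7, re.split does accept patterns that can match the empty string, so a \b-based split works there, although it leaves whitespace and empty edge strings to clean up:

```python
import re

# Requires Python 3.7+, where re.split allows zero-width matches like \b.
parts = re.split(r'\b', 'help, me')
# parts == ['', 'help', ', ', 'me', '']

# Strip whitespace and drop the empty strings to get the desired tokens.
tokens = [p.strip() for p in parts if p.strip()]
# tokens == ['help', ',', 'me']
```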

#4


3  

Here's my entry.

I have my doubts as to how well this will hold up in terms of efficiency, or whether it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

>>> import re
>>> import string
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item for item in map(string.strip, re.split(r"(\W+)", s)) if len(item) > 0]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
>>>

One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.
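A sketch of that optimization (the helper name and sample input are illustrative, not from the answer; the pattern and strip-and-filter pipeline are the same as above):

```python
import re

# Compile once so repeated per-line splits avoid re-parsing the pattern.
splitter = re.compile(r"(\W+)")

def tokenize_line(line):
    # Split on non-word runs, keeping them via the capturing group,
    # then strip whitespace and drop empty items.
    return [item.strip() for item in splitter.split(line) if item.strip()]

tokenize_line("Helo, my name is Joe!")
# -> ['Helo', ',', 'my', 'name', 'is', 'Joe', '!']
```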

#5


1  

Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.

This might be a little faster, since ''.join() is used in place of +=, which is known to be slow for repeated string concatenation.

import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        if word:
            result.append(word)
            word = ''

if word:  # don't drop a trailing word when the string ends without whitespace
    result.append(word)

print result
['Hello', ',', "I'm", 'a', 'string', '!']

#6


0  

I think you can find all the help you can imagine in the NLTK, especially since you are using Python. There's a good, comprehensive discussion of this issue in the tutorial.

#7


0  

I came up with a way to tokenize all words and \W+ patterns using \b which doesn't need hardcoding:

>>> import re
>>> sentence = 'Hello, world!'
>>> tokens = [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', sentence)]
>>> tokens
['Hello', ',', 'world', '!']

Here .*?\S.*? is a pattern matching anything that contains at least one non-space character, and $ is added to match the last token in the string if it's a punctuation symbol.

Note the following though -- this will group punctuation that consists of more than one symbol:

>>> print [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"Oh no", she said')]
['Oh', 'no', '",', 'she', 'said']

Of course, you can find and split such groups with:

>>> for token in [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')]:
...     print re.findall(r'(?:\w+|\W)', token)

['You']
['can']
['"', ',']
['she']
['said']

#8


0  

Try this:

string_big = "One of Python's coolest features is the string format operator  This operator is unique to strings"
my_list = []
x = len(string_big)
position_of_space = 0
while position_of_space < x:
    for i in range(position_of_space, x):
        if string_big[i] == ' ':
            break
    word = string_big[position_of_space:i + 1].strip()
    if word:  # skip empty slices produced by consecutive spaces
        print word
        my_list.append(word)
    position_of_space = i + 1

print my_list

#9


-1  

Have you tried using a regex?

http://docs.python.org/library/re.html#re-syntax


By the way, why do you need the "," in the list at all? You already know that one follows each piece of text, i.e.:

[0] ","
[1] ","

So if you want to add the "," you can just do it after each iteration when you use the array.
