Splitting a string into words and punctuation

Date: 2021-01-22 15:45:15

I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.

For instance:

>>> c = "help, me"
>>> print c.split()
['help,', 'me']

What I really want the list to look like is:

['help', ',', 'me']

So, I want the string split at whitespace with the punctuation split from the words.

I've tried to parse the string first and then run the split:

>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...             outputCharacter = " %s" % character
...     else:
...             outputCharacter = character
...     separatedPunctuation += outputCharacter
>>> print separatedPunctuation
help , me
>>> print separatedPunctuation.split()
['help', ',', 'me']

This produces the result I want, but is painfully slow on large files.

Is there a way to do this more efficiently?

9 Answers

#1


58  

This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is, not to think about where to split the string, but what to include in the tokens.

Caveats:

  • The underscore (_) is considered an inner-word character. Replace \w if you don't want that.
  • This will not work with (single) quotes in the string.
  • Put any additional punctuation marks you want to use in the right half of the regular expression.
  • Anything not explicitly mentioned in the re is silently dropped.
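For instance, to act on the last two caveats you can simply add more marks to the right-hand character class. The extra marks chosen here (double quote, dash, colon) are my own illustration, not part of the original answer:

```python
import re

# Same idea as the pattern above, with a few extra punctuation marks
# appended to the character class (the trailing '-' is a literal dash).
pattern = r"[\w']+|[.,!?;:\"-]"

tokens = re.findall(pattern, '"Hi" - I\'m here!')
# tokens == ['"', 'Hi', '"', '-', "I'm", 'here', '!']
```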

#2


23  

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
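As a quick check (my own example, not from the answer), running the pattern above on a string with an accented word and an apostrophe shows both behaviors at once:

```python
import re

# Unicode-aware tokenization: word characters stay grouped, every other
# non-space character becomes its own token.
text = "Hello, I'm résumé!"
tokens = re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
# tokens == ['Hello', ',', 'I', "'", 'm', 'résumé', '!']
```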

#3


4  

In Perl-style regular expression syntax, \b matches a word boundary. This should come in handy for doing a regex-based split.

Edit: I have been informed by hop that "empty matches" do not work in the split function of Python's re module. I will leave this here as information for anyone else getting stumped by this "feature".
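A later note (mine, not the answer's): since Python 3.7, re.split does accept patterns that can match the empty string, so a \b-based split works there, although it leaves whitespace and empty edge strings to clean up:

```python
import re

# Requires Python 3.7+, where re.split allows zero-width matches like \b.
parts = re.split(r'\b', 'help, me')
# parts == ['', 'help', ', ', 'me', '']

# Strip whitespace and drop the empty strings to get the desired tokens.
tokens = [p.strip() for p in parts if p.strip()]
# tokens == ['help', ',', 'me']
```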

#4


3  

Here's my entry.

I have my doubts as to how well this will hold up in terms of efficiency, or whether it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

>>> import re
>>> import string
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item for item in map(string.strip, re.split(r"(\W+)", s)) if len(item) > 0]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
>>>

One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.
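A sketch of that optimization (the helper name and sample input are illustrative, not from the answer; the pattern and strip-and-filter pipeline are the same as above):

```python
import re

# Compile once so repeated per-line splits avoid re-parsing the pattern.
splitter = re.compile(r"(\W+)")

def tokenize_line(line):
    # Split on non-word runs, keeping them via the capturing group,
    # then strip whitespace and drop empty items.
    return [item.strip() for item in splitter.split(line) if item.strip()]

tokenize_line("Helo, my name is Joe!")
# -> ['Helo', ',', 'my', 'name', 'is', 'Joe', '!']
```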

#5


1  

Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.

This might be a little faster, since ''.join() is used in place of +=, which is known to be slow for repeated string concatenation.

import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        if word:
            result.append(word)
            word = ''

if word:  # don't drop a trailing word when the string ends without whitespace
    result.append(word)

print result
['Hello', ',', "I'm", 'a', 'string', '!']

#6


0  

I think you can find all the help you can imagine in the NLTK, especially since you are using Python. There's a good, comprehensive discussion of this issue in the tutorial.

#7


0  

I came up with a way to tokenize all words and \W+ patterns using \b which doesn't need hardcoding:

>>> import re
>>> sentence = 'Hello, world!'
>>> tokens = [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', sentence)]
>>> tokens
['Hello', ',', 'world', '!']

Here .*?\S.*? is a pattern matching anything that contains at least one non-space character, and $ is added to match the last token in the string if it's a punctuation symbol.

Note the following though -- this will group punctuation that consists of more than one symbol:

>>> print [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"Oh no", she said')]
['Oh', 'no', '",', 'she', 'said']

Of course, you can find and split such groups with:

>>> for token in [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')]:
...     print re.findall(r'(?:\w+|\W)', token)

['You']
['can']
['"', ',']
['she']
['said']

#8


0  

Try this:

string_big = "One of Python's coolest features is the string format operator  This operator is unique to strings"
my_list = []
x = len(string_big)
position_of_space = 0
while position_of_space < x:
    for i in range(position_of_space, x):
        if string_big[i] == ' ':
            break
    word = string_big[position_of_space:i + 1].strip()
    if word:  # skip empty slices produced by consecutive spaces
        print word
        my_list.append(word)
    position_of_space = i + 1

print my_list

#9


-1  

Have you tried using a regex?

http://docs.python.org/library/re.html#re-syntax


By the way, why do you need the "," in the list at all? You already know that one follows each piece of text, i.e.:

[0] ","
[1] ","

So if you want to add the "," you can just do it after each iteration when you use the array.
