How to split a text file into words in Python?

Date: 2021-04-12 15:44:19

I am very new to Python and haven't worked with text before... I have 100 text files, each with around 100 to 150 lines of unstructured text describing a patient's condition. I read one file in Python using:

with open("C:\\...\\...\\...\\record-13.txt") as f:
    content = f.readlines()
    print(content)

Now I can split each line of this file into its words using, for example:

a = content[0].split()
print(a)

But I don't know how to split the whole file into words. Do loops (while or for) help with that?


Thank you for your help, guys. Your answers helped me write this (in my file, words are separated by spaces, so I think space is the delimiter!):

with open("C:\\...\\...\\...\\record-13.txt") as f:
    lines = f.readlines()
    for line in lines:
        words = line.split()
        for word in words:
            print(word)

which simply prints one word per line.
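
If you want all of the words in one list instead of printing them one at a time, a small variation of the same loop (just a sketch; the path is elided as in the snippet above) collects them:

with open("C:\\...\\...\\...\\record-13.txt") as f:
    all_words = []
    for line in f:
        all_words.extend(line.split())  # add this line's words to the list
print(all_words)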

5 Solutions

#1 (4 votes)

Nobody has suggested a generator; I'm surprised. Here's how I would do it:

def words(stringIterable):
    #upcast the argument to an iterator, if it's an iterator already, it stays the same
    lineStream = iter(stringIterable)
    for line in lineStream: #enumerate the lines
        for word in line.split(): #further break them down
            yield word

This can be used on simple lists of sentences that you might already have in memory:

listOfLines = ['hi there', 'how are you']
for word in words(listOfLines):
    print(word)

But it will work just as well on a file, without needing to read the whole file into memory:

with open('words.py', 'r') as myself:
    for word in words(myself):
        print(word)
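
Because words() is lazy, it also composes with other consumers without materializing the file. A small follow-up sketch (the filename is hypothetical), counting word frequencies with collections.Counter:

from collections import Counter

with open('record-13.txt') as f:  # hypothetical filename
    counts = Counter(words(f))
print(counts.most_common(5))  # the five most frequent words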

#2 (8 votes)

It depends on how you define words, or what you regard as the delimiters.
Note that str.split in Python takes an optional separator argument, so you could pass it like this:

for chunk in content[0].split():   # split the first line on whitespace
    for word in chunk.split(','):  # then split each chunk on commas
        print(word)

Unfortunately, str.split accepts only a single separator, so you may need multi-level splitting like this:

for lines in content[0].split():
    for split0 in lines.split(' '):
        for split1 in split0.split(','):
            for split2 in split1.split('.'):
                for split3 in split2.split('?'):
                    for split4 in split3.split('!'):
                        for word in split4.split(':'): 
                            if word != "":
                                print(word)

Looks ugly, right? Luckily, we can iterate over the delimiters instead:

delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
words = content  # start from the raw lines (e.g. from f.readlines())
for delimiter in delimiters:
    new_words = []
    for word in words:
        new_words += word.split(delimiter)
    words = new_words
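
One caveat: successive splits leave empty strings wherever two delimiters are adjacent (for example, a period followed by a newline), so a final filter is useful. A self-contained sketch of the loop above on sample input:

content = ['Patient is stable, alert.\n', 'No fever today!\n']  # sample lines
delimiters = ['\n', ' ', ',', '.', '?', '!', ':']
words = content
for delimiter in delimiters:
    new_words = []
    for word in words:
        new_words += word.split(delimiter)
    words = new_words
words = [w for w in words if w]  # drop empties left by adjacent delimiters
print(words)  # ['Patient', 'is', 'stable', 'alert', 'No', 'fever', 'today']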

EDITED: Or we could simply use the re (regular expressions) module:

import re
delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
# re.escape keeps metacharacters like '.' and '?' literal; join the lines first,
# since re.split works on a single string, not a list of lines
pattern = '|'.join(map(re.escape, delimiters))
words = [w for w in re.split(pattern, ''.join(content)) if w]
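
A quick sanity check of the escaped pattern on a sample string (a sketch):

import re

pattern = '|'.join(map(re.escape, ['\n', ' ', ',', '.', '?', '!', ':']))
sample = 'Patient is stable, alert.\nNo fever!'
print([w for w in re.split(pattern, sample) if w])
# ['Patient', 'is', 'stable', 'alert', 'No', 'fever']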

#3 (5 votes)

with open("C:\...\...\...\record-13.txt") as f:
    for line in f:
        for word in line.split():
            print word

Or, this gives you a flat list of all the words:

with open("C:\...\...\...\record-13.txt") as f:
    words = [word for line in f for word in line.split()]

Or, this gives you a list of lines, with each line as a list of words:

with open("C:\...\...\...\record-13.txt") as f:
    words = [line.split() for line in f]
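
If you start from the nested form and later need one flat list, a one-line sketch:

flat_words = [word for line_words in words for word in line_words]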

#4 (1 vote)

The most flexible approach is to use a list comprehension to generate a list of words:

with open("C:\...\...\...\record-13.txt") as f:
    words = [word
             for line in f
             for word in line.split()]

# Do what you want with the words list

You can then iterate over it, feed it to a collections.Counter, or do anything else you please.
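
For instance, a short sketch of the Counter idea, assuming the words list built above:

from collections import Counter

counts = Counter(words)
print(counts.most_common(10))  # the ten most common words in the file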

#5 (0 votes)

I would use the Natural Language Toolkit (NLTK), as the plain split() approach does not deal well with punctuation.

import nltk

words = []
with open("C:\\...\\...\\...\\record-13.txt") as f:
    for line in f:
        words.extend(nltk.word_tokenize(line))  # the tokenizer also separates punctuation
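
A self-contained sketch, assuming NLTK is installed (word_tokenize needs its tokenizer data downloaded once, and exact output can vary by NLTK version):

import nltk
nltk.download('punkt')  # one-time download of the tokenizer data

print(nltk.word_tokenize("Patient's temperature is 98.6, stable."))
# ['Patient', "'s", 'temperature', 'is', '98.6', ',', 'stable', '.']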
