Python 2.7 - find and replace from a text file, using a dictionary, to a new text file

Time: 2022-03-23 16:50:01

I am a newbie to programming and have been studying Python in my spare time for the past few months. I decided to try to create a little script that converts American spellings to English spellings in a text file.

I have been trying all sorts of things for the past 5 hours, but eventually came up with something that got me somewhat closer to my goal, but not quite there!

#imported dictionary contains 1800 english:american spelling key:value pairs. 
from english_american_dictionary import dict


def replace_all(text, dict):
    for english, american in dict.iteritems():
        text = text.replace(american, english)
    return text


my_text = open('test_file.txt', 'r')

for line in my_text:
    new_line = replace_all(line, dict)
    output = open('output_test_file.txt', 'a')
    print >> output, new_line

output.close()

I am sure there is a considerably better way to go about things, but for this script, here are the issues I am having:

  • In the output file, the lines are written on every other line, with a blank line between them, but the original test_file.txt does not have this. The contents of test_file.txt are shown at the bottom of this page.

  • Only the first instance of an American spelling in a line gets converted to English.

  • I didn't really want to open output file in append mode, but couldn't figure out 'r' in this code structure.

Any help appreciated for this eager newb!

The contents of the test_file.txt are:

I am sample file.
I contain an english spelling: colour.
3 american spellings on 1 line: color, analyze, utilize.
1 american spelling on 1 line: familiarize.

3 solutions

#1


8  

The extra blank line you are seeing is because you are using print to write out a line that already includes a newline character at the end. Since print writes its own newline too, your output becomes double spaced. An easy fix is to use output.write(new_line) instead.
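To see the doubling in isolation, here is a minimal sketch that writes the same line both ways into in-memory buffers. It uses Python 3's print() function; the `print >> output, new_line` statement in 2.7 appends a newline in the same way. The names buf_print and buf_write are made up for this illustration.

```python
import io

line = "I am sample file.\n"    # a line as read from a file, newline included

buf_print = io.StringIO()
print(line, file=buf_print)     # print appends its own "\n" after the line

buf_write = io.StringIO()
buf_write.write(line)           # write emits the string exactly as given

print(repr(buf_print.getvalue()))   # 'I am sample file.\n\n'
print(repr(buf_write.getvalue()))   # 'I am sample file.\n'
```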

As for the file modes, the issue is that you're opening the output file over and over. You should just open it once, at the start. It's usually a good idea to use with statements to handle opening files, since they'll take care of closing them for you when you're done with them.
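If you do want to keep the line-by-line loop, a rough sketch of "open once, with a with statement" might look like this. The convert_file helper, the sample_dict contents and the temporary demo files are assumptions for the sake of a runnable example; only replace_all comes from the question (with .items() swapped in so it also runs on Python 3).

```python
import os
import tempfile

def replace_all(text, spelling_dict):
    # spelling_dict maps english -> american, as in the question
    for english, american in spelling_dict.items():
        text = text.replace(american, english)
    return text

def convert_file(in_path, out_path, spelling_dict):
    # Both files are opened exactly once, so 'w' mode is safe here:
    # the output is truncated a single time, before the loop starts.
    with open(in_path, 'r') as in_file, open(out_path, 'w') as out_file:
        for line in in_file:
            out_file.write(replace_all(line, spelling_dict))

# Throwaway demo files so the sketch can be run as-is.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'test_file.txt')
out_path = os.path.join(tmp_dir, 'output_test_file.txt')
with open(in_path, 'w') as f:
    f.write("color, analyze, utilize.\nfamiliarize.\n")

sample_dict = {'colour': 'color', 'analyse': 'analyze',
               'utilise': 'utilize', 'familiarise': 'familiarize'}
convert_file(in_path, out_path, sample_dict)

with open(out_path) as f:
    result = f.read()
print(result)
```

Both files are closed automatically when the with block ends, even if an exception is raised inside the loop.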

I don't understand your other issue, where only some of the replacements happen. Is your dictionary missing the spellings for 'analyze' and 'utilize'?

One suggestion I'd make is to not do your replacements line by line. You can read the whole file in at once with file.read() and then work on it as a single unit. This will probably be faster, since it won't need to loop as often over the items in your spelling dictionary (just once, rather than once per line):

with open('test_file.txt', 'r') as in_file:
    text = in_file.read()

with open('output_test_file.txt', 'w') as out_file:
    out_file.write(replace_all(text, spelling_dict))

Edit:

To make your code correctly handle words that contain other words (like "entire" containing "tire"), you probably need to abandon the simple str.replace approach in favor of regular expressions.
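A quick illustration of the problem, using a made-up sentence: str.replace has no notion of word boundaries, so it rewrites the "tire" inside "entire" as well.

```python
text = 'The entire tire was flat.'

# Plain substring replacement also hits the "tire" inside "entire".
broken = text.replace('tire', 'tyre')
print(broken)   # The entyre tyre was flat.
```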

Here's a quickly thrown together solution that uses re.sub, given a dictionary of spelling changes from American to British English (that is, in the reverse order of your current dictionary):

import re

#from english_american_dictionary import ame_to_bre_spellings
ame_to_bre_spellings = {'tire':'tyre', 'color':'colour', 'utilize':'utilise'}

def replacer_factory(spelling_dict):
    def replacer(match):
        word = match.group()
        return spelling_dict.get(word, word)
    return replacer

def ame_to_bre(text):
    pattern = r'\b\w+\b'  # this pattern matches whole words only
    replacer = replacer_factory(ame_to_bre_spellings)
    return re.sub(pattern, replacer, text)

def main():
    #with open('test_file.txt') as in_file:
    #    text = in_file.read()
    text = 'foo color, entire, utilize'

    #with open('output_test_file.txt', 'w') as out_file:
    #    out_file.write(ame_to_bre(text))
    print(ame_to_bre(text))

if __name__ == '__main__':
    main()

One nice thing about this code structure is that you can easily convert from British English spellings back to American English ones, if you pass a dictionary in the other order to the replacer_factory function.
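As a sketch of that idea, the reverse mapping can be built by swapping keys and values. The bre_to_ame_spellings name and the convert helper are hypothetical additions, and the swap assumes every American spelling maps to a distinct British one (otherwise entries would overwrite each other).

```python
import re

ame_to_bre_spellings = {'tire': 'tyre', 'color': 'colour', 'utilize': 'utilise'}

# Swap keys and values to get the British -> American mapping.
bre_to_ame_spellings = {bre: ame for ame, bre in ame_to_bre_spellings.items()}

def replacer_factory(spelling_dict):
    def replacer(match):
        word = match.group()
        # Words that are not in the dictionary are returned unchanged.
        return spelling_dict.get(word, word)
    return replacer

def convert(text, spelling_dict):
    # \b\w+\b matches whole words only, so "entire" is left alone.
    return re.sub(r'\b\w+\b', replacer_factory(spelling_dict), text)

british = convert('foo color, entire, utilize', ame_to_bre_spellings)
american = convert(british, bre_to_ame_spellings)
print(british)    # foo colour, entire, utilise
print(american)   # foo color, entire, utilize
```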

#2


3  

The print statement adds a newline of its own, but your lines already have their own newlines. You can either strip the newline from your new_line, or use the lower-level

output.write(new_line)

instead (which writes exactly what you pass to it).
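For the first option, stripping the newline, a small sketch could look like the following. It uses Python 3's print() function and an in-memory buffer as a stand-in for the output file; in 2.7 the equivalent would be `print >> output, new_line.rstrip('\n')`.

```python
import io

new_line = "1 american spelling on 1 line: familiarise.\n"

output = io.StringIO()   # stand-in for the real output file

# Drop the line's own trailing newline; print then supplies the only one.
print(new_line.rstrip('\n'), file=output)

print(repr(output.getvalue()))   # '1 american spelling on 1 line: familiarise.\n'
```

Using rstrip('\n') rather than a bare strip() keeps any leading or trailing spaces on the line intact.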

For your second question, I think we need an actual example. replace() should indeed replace all occurrences.

>>> "abc abc abcd ab".replace("abc", "def")
'def def defd ab'

I'm not sure what your third question is asking. If you want to replace the output file, do

output = open('output_test_file.txt', 'w')

'w' means you're opening the file for writing.
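To make the difference from append mode concrete, here is a small sketch that writes to a throwaway file. The write_line and contents helpers and the temporary path are made up for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'output_test_file.txt')

def write_line(mode):
    # Open the file in the given mode, write one line, and close it again.
    with open(path, mode) as output:
        output.write('colour\n')

def contents():
    with open(path) as f:
        return f.read()

write_line('w')
write_line('w')              # 'w' truncates, so the first line is discarded
after_two_w = contents()

write_line('a')
write_line('a')              # 'a' keeps what is there and adds to the end
after_two_a = contents()

print(repr(after_two_w))     # 'colour\n'
print(repr(after_two_a))     # 'colour\ncolour\ncolour\n'
```

This is also why reopening the file with 'w' inside the loop would leave only the last line: each open throws away what the previous iteration wrote.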

#3


2  

Building on all the good answers above, I wrote a new version that I think is more Pythonic. I hope this helps:

# Stand-in for the imported dictionary, which contains 1800
# english:american spelling key:value pairs.
mydict = {
    'colour': 'color',
}


def replace_all(text, mydict):
    for english, american in mydict.iteritems():
        text = text.replace(american, english)
    return text

try:
    with open('new_output.txt', 'w') as new_file:
        with open('test_file.txt', 'r') as f:
            for line in f:
                new_line = replace_all(line, mydict)
                new_file.write(new_line)
except IOError:  # catch only file errors instead of hiding every exception
    print "Can't open file!"

You can also look at a question I asked earlier; its answers contain a lot of best-practice advice: Loading large file (25k entries) into dict is slow in Python?

Here are a few other tips on how to write more Pythonic Python :) http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html

Good luck:)
