I am a newbie to programming and have been studying Python in my spare time for the past few months. I decided to try to write a little script that converts American spellings to English spellings in a text file.
I have been trying all sorts of things for the past 5 hours and eventually came up with something that gets me closer to my goal, but not quite there!
    #imported dictionary contains 1800 english:american spelling key:value pairs.
    from english_american_dictionary import dict

    def replace_all(text, dict):
        for english, american in dict.iteritems():
            text = text.replace(american, english)
        return text

    my_text = open('test_file.txt', 'r')

    for line in my_text:
        new_line = replace_all(line, dict)
        output = open('output_test_file.txt', 'a')
        print >> output, new_line

    output.close()
I am sure there is a considerably better way to go about things, but for this script, here are the issues I am having:
- In the output file, every line is followed by a blank line, which the original test_file.txt does not have. The contents of test_file.txt are shown at the bottom of this post.
- Only the first instance of an American spelling in a line gets converted to English.
- I didn't really want to open the output file in append mode, but couldn't figure out 'r' in this code structure.
Any help appreciated for this eager newb!
The contents of the test_file.txt are:
I am sample file.
I contain an english spelling: colour.
3 american spellings on 1 line: color, analyze, utilize.
1 american spelling on 1 line: familiarize.
3 solutions
#1
8
The extra blank line you are seeing is because you are using print to write out a line that already includes a newline character at the end. Since print writes its own newline too, your output becomes double spaced. An easy fix is to use outfile.write(new_line) instead.
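To see the doubled newline in isolation, here is a minimal sketch. It assumes Python 3 syntax and uses in-memory io.StringIO buffers as stand-ins for the output file; the variable names are only illustrative.

```python
import io

line = "I am sample file.\n"  # a line read from a file keeps its trailing newline

# print() appends its own newline, so this buffer ends with two of them.
printed = io.StringIO()
print(line, file=printed)

# write() emits exactly the string it is given, nothing more.
written = io.StringIO()
written.write(line)

print(repr(printed.getvalue()))
print(repr(written.getvalue()))
```

Comparing the two repr() outputs shows where the extra blank line comes from.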
As for the file modes, the issue is that you're opening the output file over and over. You should just open it once, at the start. It's usually a good idea to use with statements to handle opening files, since they'll take care of closing them for you when you're done with them.
I don't understand your other issue, with only some of the replacements happening. Is your dictionary missing the spellings for 'analyze' and 'utilize'?
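One way to check this is to look the words up directly. The sketch below is only an illustration: english_to_american is a hypothetical stand-in for the imported dictionary (english: american pairs, as described in the question), and missing_american_spellings is a helper name made up for this example.

```python
# Hypothetical stand-in for the imported dictionary (english: american pairs).
english_to_american = {"colour": "color", "familiarise": "familiarize"}


def missing_american_spellings(words, spelling_dict):
    """Return the words that never appear as an American spelling (a value)."""
    known = set(spelling_dict.values())
    return [word for word in words if word not in known]


# Words from the third line of test_file.txt.
print(missing_american_spellings(["color", "analyze", "utilize"], english_to_american))
```

Any word this reports has no entry to replace it, so it would be left untouched in the output.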
One suggestion I'd make is to not do your replacements line by line. You can read the whole file in at once with file.read() and then work on it as a single unit. This will probably be faster, since it won't need to loop as often over the items in your spelling dictionary (just once, rather than once per line):
    with open('test_file.txt', 'r') as in_file:
        text = in_file.read()

    with open('output_test_file.txt', 'w') as out_file:
        out_file.write(replace_all(text, spelling_dict))
Edit:
To make your code correctly handle words that contain other words (like "entire" containing "tire"), you probably need to abandon the simple str.replace approach in favor of regular expressions.
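The pitfall is easy to reproduce on its own. A small sketch, using a made-up sample sentence, that contrasts plain substring replacement with a word-boundary regular expression:

```python
import re

text = "the entire tire"

# Plain substring replacement also rewrites the "tire" inside "entire".
naive = text.replace("tire", "tyre")

# \b matches at word boundaries, so only the standalone word is changed.
whole_word = re.sub(r"\btire\b", "tyre", text)

print(naive)
print(whole_word)
```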
Here's a quickly thrown together solution that uses re.sub, given a dictionary of spelling changes from American to British English (that is, in the reverse order of your current dictionary):
    import re

    #from english_american_dictionary import ame_to_bre_spellings
    ame_to_bre_spellings = {'tire':'tyre', 'color':'colour', 'utilize':'utilise'}

    def replacer_factory(spelling_dict):
        def replacer(match):
            word = match.group()
            return spelling_dict.get(word, word)
        return replacer

    def ame_to_bre(text):
        pattern = r'\b\w+\b'  # this pattern matches whole words only
        replacer = replacer_factory(ame_to_bre_spellings)
        return re.sub(pattern, replacer, text)

    def main():
        #with open('test_file.txt') as in_file:
        #    text = in_file.read()
        text = 'foo color, entire, utilize'

        #with open('output_test_file.txt', 'w') as out_file:
        #    out_file.write(ame_to_bre(text))
        print(ame_to_bre(text))

    if __name__ == '__main__':
        main()
One nice thing about this code structure is that you can easily convert from British English spellings back to American English ones, if you pass a dictionary in the other order to the replacer_factory function.
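As a rough sketch of that reverse direction: the inverted dictionary can be built by swapping keys and values. The names bre_to_ame_spellings and bre_to_ame are hypothetical counterparts introduced here, and the inversion assumes each British spelling appears only once as a value.

```python
import re

ame_to_bre_spellings = {'tire': 'tyre', 'color': 'colour', 'utilize': 'utilise'}

# Swap keys and values; this assumes every British spelling is unique.
bre_to_ame_spellings = {bre: ame for ame, bre in ame_to_bre_spellings.items()}


def replacer_factory(spelling_dict):
    def replacer(match):
        word = match.group()
        # Fall back to the original word when it has no entry.
        return spelling_dict.get(word, word)
    return replacer


def bre_to_ame(text):
    # Hypothetical counterpart to ame_to_bre, using the inverted dictionary.
    return re.sub(r'\b\w+\b', replacer_factory(bre_to_ame_spellings), text)


print(bre_to_ame('foo colour, entire, utilise'))
```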
#2
3
The print statement adds a newline of its own, but your lines already have their own newlines. You can either strip the newline from your new_line, or use the lower-level

    output.write(new_line)

instead (which writes exactly what you pass to it).
For your second question, I think we need an actual example. replace() should indeed replace all occurrences.
    >>> "abc abc abcd ab".replace("abc", "def")
    'def def defd ab'
I'm not sure what your third question is asking. If you want to replace the output file, do

    output = open('output_test_file.txt', 'w')

'w' means you're opening the file for writing.
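The difference between the two modes shows up when the script is run more than once. A minimal sketch, assuming a throwaway temporary directory and a made-up helper named write_twice:

```python
import os
import tempfile

# Throwaway location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "output_test_file.txt")


def write_twice(mode):
    """Simulate two runs of a script that opens the file with the given mode."""
    if os.path.exists(path):
        os.remove(path)  # start each experiment from a clean slate
    for _ in range(2):
        with open(path, mode) as output:
            output.write("colour\n")
    with open(path) as result:
        return result.read()


appended = write_twice("a")     # 'a' keeps the earlier contents and adds to them
overwritten = write_twice("w")  # 'w' truncates the file each time it is opened

print(repr(appended))
print(repr(overwritten))
```

With 'a' the output of every run accumulates in the file, while 'w' leaves only the output of the latest run.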
#3
2
Building on the good answers above, I wrote a new version which I think is more Pythonic. I hope this helps:
    # The imported dictionary contains 1800 english:american spelling key:value pairs;
    # a one-entry stand-in is used here.
    mydict = {
        'colour': 'color',
    }

    def replace_all(text, mydict):
        # Replace every American spelling (value) with its English one (key).
        for english, american in mydict.items():
            text = text.replace(american, english)
        return text

    try:
        with open('new_output.txt', 'w') as new_file:
            with open('test_file.txt', 'r') as f:
                for line in f:
                    new_line = replace_all(line, mydict)
                    new_file.write(new_line)
    except IOError:
        print("Can't open file!")
You can also look at the answers to a question I asked earlier, which contain a lot of best-practice advice: Loading large file (25k entries) into dict is slow in Python?
Here are a few other tips on how to write more Pythonic Python :) http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html
Good luck:)