I have a file that contains just over 100,000 words. What I have to do is run through every 5 letter combination of the alphabet to work out the 5 letters that are used by the least number of words.
我有一个包含超过100,000个单词的文件。我要做的是通过字母表的每5个字母组合来计算出由最少数量的单词使用的5个字母。
I've worked out a python based program that will eventually get the answer, but at the speed it is going it would probably take around 48 hours if not longer. Part the problem is the sheer number of calculations. I haven't worked out how to limit permutations so that only distinct strings are compared - so that is 265 calculations just on the combinations alone, and then each of these is compared against 100,000 words, which works out at around 10*1011 calculations at a bare minimum.
我已经制定了一个基于python的程序,最终会得到答案,但是按照它的速度,它可能需要大约48小时,如果不是更长的话。部分问题是计算的绝对数量。我还没有弄清楚如何限制排列,以便只比较不同的字符串 - 因此仅仅对组合进行265次计算,然后将每一项与100,000个单词进行比较,计算结果大约为10 * 1011最低限度的。
Is there any way to speed this process up substantially, be that through more efficient algorithms, or multi-threading or something like that?
有没有办法大幅加快这个过程,通过更有效的算法,或多线程或类似的东西?
Any recommendations for books/articles on algorithm efficiency would also be greatly appreciated.
任何关于算法效率的书籍/文章的建议也将受到高度赞赏。
My current program is as follows:
我目前的计划如下:
Imports permutation function from itertools module:
从itertools模块导入排列函数:
from itertools import permutations
Asks if the word contains the forbidden letters:
询问该单词是否包含禁用字母:
def avoids (word, letters):
for characters in letters:
for character in word:
if character == characters:
return False
return True
Calculates the number of words from the file that do not contain the forbidden characters:
计算文件中不包含禁用字符的单词数:
def number_of_words(file, letters):
open_file = open(file)
x = 0 #counter
for line in open_file:
word = line.strip()
if avoids(word, letters) == True:
x += 1
return x
Runs through every variation of five letters that exists in the alphabet and calculates the combination that excludes the fewest words:
运行字母表中存在的五个字母的每个变体,并计算排除最少单词的组合:
def find_smallest():
y = 0
#every combination of letters
for letters in permutations("abcdefghijklmnopqrstuvwxyz", 5):
x = number_of_words("words.txt", letters)
#sets y to the initial value of x
if y == 0:
y = x
print "Start point, combination: %s, amount: %d" % (letters, y)
#If current combination is greater than all previous combinations set y to x
elif x > y:
y = x
combination = letters
duplication = 0
print "New highest, combination: %s, amount: %d" % (letters, y)
print "%s excludes the smallest number of words (%d)" % (combination, y)
Runs the program:
运行程序:
find_smallest()
2 个解决方案
#1
3
This is not a question about increasing permutation efficiency. This is really a question about how to make a smarter algorithm, it is a data structure question.
这不是关于提高排列效率的问题。这实际上是一个关于如何制作更智能算法的问题,它是一个数据结构问题。
I have a file that contains just over 100,000 words. What I have to do is run through every 5 letter combination of the alphabet to work out the 5 letters that are used by the least number of words.
我有一个包含超过100,000个单词的文件。我要做的是通过字母表的每5个字母组合来计算出由最少数量的单词使用的5个字母。
Loop through the 26 letters of the alphabet and count the number of words in your list which use each letter:
循环显示字母表中的26个字母,并计算列表中使用每个字母的单词数:
import string
alphabet = string.ascii_lowercase
counts = {k: sum(k in word.lower() for word in words) for k in alphabet}
This should be pretty fast, and should give you enough information to trivially pick out the five least popular letters.
这应该是相当快的,并且应该给你足够的信息来轻易地挑选出五个最不受欢迎的字母。
Equivalent approach, probably more efficient but maybe less clear than the above.
等效方法,可能更有效但可能不如上述清晰。
from itertools import chain
from collections import Counter
counter = Counter({k: 0 for k in string.ascii_lowercase})
counter.update(Counter(c for w in words for c in set(w.lower())))
#2
4
- you can use combinations instead of permutations
- 您可以使用组合而不是排列
- why not just scan all words once, count the number of appearances of each letter, then select the 5 with minimum number of appearances ?
- 为什么不只扫描所有单词一次,计算每个字母的出现次数,然后选择具有最少出现次数的5?
#1
3
This is not a question about increasing permutation efficiency. This is really a question about how to make a smarter algorithm, it is a data structure question.
这不是关于提高排列效率的问题。这实际上是一个关于如何制作更智能算法的问题,它是一个数据结构问题。
I have a file that contains just over 100,000 words. What I have to do is run through every 5 letter combination of the alphabet to work out the 5 letters that are used by the least number of words.
我有一个包含超过100,000个单词的文件。我要做的是通过字母表的每5个字母组合来计算出由最少数量的单词使用的5个字母。
Loop through the 26 letters of the alphabet and count the number of words in your list which use each letter:
循环显示字母表中的26个字母,并计算列表中使用每个字母的单词数:
import string
alphabet = string.ascii_lowercase
counts = {k: sum(k in word.lower() for word in words) for k in alphabet}
This should be pretty fast, and should give you enough information to trivially pick out the five least popular letters.
这应该是相当快的,并且应该给你足够的信息来轻易地挑选出五个最不受欢迎的字母。
Equivalent approach, probably more efficient but maybe less clear than the above.
等效方法,可能更有效但可能不如上述清晰。
from itertools import chain
from collections import Counter
counter = Counter({k: 0 for k in string.ascii_lowercase})
counter.update(Counter(c for w in words for c in set(w.lower())))
#2
4
- you can use combinations instead of permutations
- 您可以使用组合而不是排列
- why not just scan all words once, count the number of appearances of each letter, then select the 5 with minimum number of appearances ?
- 为什么不只扫描所有单词一次,计算每个字母的出现次数,然后选择具有最少出现次数的5?