I'm wondering because I need to have have a function that is disgustingly fast at checking if a word is in a dictionary list - I'm considering leaving the dictionary as a large string and running regex against instead. This needs to be absurdly fast. So I just need a basic overview of how python handles checking if a string is in a list of strings and if its beyond-reasonable fast.
我想知道因为我需要有一个令人厌恶的功能,检查一个单词是否在字典列表中 - 我正在考虑将字典保留为一个大字符串并反而运行正则表达式。这需要非常快。所以我只需要一个关于python如何处理检查字符串是否在字符串列表中以及它是否超出合理速度的基本概述。
8 个解决方案
#1
10
If you want a blazingly fast membership test, then a list is the wrong data structure. Take a look at the implementation of list_contains
in listobject.c
, line 437. It iterates over the list in order, comparing the item with each element in turn. The later the item appears in the list, the longer it will take to find it, and if the item is missing, then the whole list must be scanned.
如果你想要一个非常快速的成员资格测试,那么列表就是错误的数据结构。看看listobject.c中第437行的list_contains的实现。它按顺序遍历列表,依次将项目与每个元素进行比较。项目出现在列表中的时间越晚,查找项目所需的时间越长,如果项目丢失,则必须扫描整个列表。
Use a set instead. Sets are implemented internally by a hash table, so looking up an object involves computing its hash and then scanning a few table entries (usually just one). For the particular case of looking up a string, see set_lookkey_string
in setobject.c
, line 156.
请改用套装。集合由散列表在内部实现,因此查找对象涉及计算其散列,然后扫描一些表条目(通常只有一个)。有关查找字符串的特殊情况,请参阅setobject.c第156行中的set_lookkey_string。
#2
4
A set of strings will have O(1) lookup time: effectively constant regardless of the size of the set. Making a set from your list of strings is easy:
一组字符串将具有O(1)查找时间:无论集合的大小如何,都有效。从字符串列表中创建一个集很容易:
my_set = set(my_list)
if my_word in my_set:
print "it's there!"
#3
2
If you need real fast checking, use a set
:
如果您需要真正的快速检查,请使用一组:
words = set(words_list)
if "hello" in words:
print("hello found!"")
A set is faster because it uses a hash-algorithm, instead of a direct search approach.
集合更快,因为它使用散列算法,而不是直接搜索方法。
#4
2
According to this site, x in s
is O(n). Therefore, it checks each entry (in the worst case).
根据该站点,s中的x是O(n)。因此,它会检查每个条目(在最坏的情况下)。
At any rate, do not use a regex. Using sets or lists is a much more intuitive way to represent the data and regexes will not perform better than O(n).
无论如何,不要使用正则表达式。使用集合或列表是一种更直观的方式来表示数据,并且正则表达式的性能不会比O(n)更好。
#5
1
If you're using a regular list, consider a set
instead.
如果您使用的是常规列表,请考虑使用一组。
If you want to implement your own fine-tuned membership test for your container object, override __contains__
.
如果要为容器对象实现自己的微调成员资格测试,请覆盖__contains__。
#6
0
You probably want to use a Set if you're worried about time. A Set is much like a list, but it checks for membership based on hashing.
如果你担心时间,你可能想要使用Set。集合很像列表,但它会根据散列检查成员资格。
#7
0
Use a set. If you need case-insensitive checking, just store the words into the set downcased. Then when checking if a certain word is in the set, downcase the word before checking membership.
使用一套。如果您需要不区分大小写的检查,只需将单词存储到下集中。然后,当检查集合中是否存在某个单词时,在检查成员资格之前将该单词缩写。
The general rule is: normalize entries when building the set, and normalize an item before checking against the set. Another example of normalization is collapsing consecutive whitespace chars into a single space and stripping leading/trailing whitespace.
一般规则是:在构建集合时规范化条目,并在检查集合之前规范化项目。标准化的另一个例子是将连续的空格字符折叠成单个空格并剥离前导/尾随空格。
#8
0
Running a regex against your word list is a very bad idea; it scales very badly. Using dict()
, set()
or frozenset()
will scale a lot better:
对你的单词列表运行正则表达式是一个非常糟糕的主意;它的鳞片非常严重。使用dict(),set()或frozenset()会更好地扩展:
s = set(['one','two','three'])
'two' in s ## true
b='four'
b in s ## false
s.add('four')
b in s ## true
#1
10
If you want a blazingly fast membership test, then a list is the wrong data structure. Take a look at the implementation of list_contains
in listobject.c
, line 437. It iterates over the list in order, comparing the item with each element in turn. The later the item appears in the list, the longer it will take to find it, and if the item is missing, then the whole list must be scanned.
如果你想要一个非常快速的成员资格测试,那么列表就是错误的数据结构。看看listobject.c中第437行的list_contains的实现。它按顺序遍历列表,依次将项目与每个元素进行比较。项目出现在列表中的时间越晚,查找项目所需的时间越长,如果项目丢失,则必须扫描整个列表。
Use a set instead. Sets are implemented internally by a hash table, so looking up an object involves computing its hash and then scanning a few table entries (usually just one). For the particular case of looking up a string, see set_lookkey_string
in setobject.c
, line 156.
请改用套装。集合由散列表在内部实现,因此查找对象涉及计算其散列,然后扫描一些表条目(通常只有一个)。有关查找字符串的特殊情况,请参阅setobject.c第156行中的set_lookkey_string。
#2
4
A set of strings will have O(1) lookup time: effectively constant regardless of the size of the set. Making a set from your list of strings is easy:
一组字符串将具有O(1)查找时间:无论集合的大小如何,都有效。从字符串列表中创建一个集很容易:
my_set = set(my_list)
if my_word in my_set:
print "it's there!"
#3
2
If you need real fast checking, use a set
:
如果您需要真正的快速检查,请使用一组:
words = set(words_list)
if "hello" in words:
print("hello found!"")
A set is faster because it uses a hash-algorithm, instead of a direct search approach.
集合更快,因为它使用散列算法,而不是直接搜索方法。
#4
2
According to this site, x in s
is O(n). Therefore, it checks each entry (in the worst case).
根据该站点,s中的x是O(n)。因此,它会检查每个条目(在最坏的情况下)。
At any rate, do not use a regex. Using sets or lists is a much more intuitive way to represent the data and regexes will not perform better than O(n).
无论如何,不要使用正则表达式。使用集合或列表是一种更直观的方式来表示数据,并且正则表达式的性能不会比O(n)更好。
#5
1
If you're using a regular list, consider a set
instead.
如果您使用的是常规列表,请考虑使用一组。
If you want to implement your own fine-tuned membership test for your container object, override __contains__
.
如果要为容器对象实现自己的微调成员资格测试,请覆盖__contains__。
#6
0
You probably want to use a Set if you're worried about time. A Set is much like a list, but it checks for membership based on hashing.
如果你担心时间,你可能想要使用Set。集合很像列表,但它会根据散列检查成员资格。
#7
0
Use a set. If you need case-insensitive checking, just store the words into the set downcased. Then when checking if a certain word is in the set, downcase the word before checking membership.
使用一套。如果您需要不区分大小写的检查,只需将单词存储到下集中。然后,当检查集合中是否存在某个单词时,在检查成员资格之前将该单词缩写。
The general rule is: normalize entries when building the set, and normalize an item before checking against the set. Another example of normalization is collapsing consecutive whitespace chars into a single space and stripping leading/trailing whitespace.
一般规则是:在构建集合时规范化条目,并在检查集合之前规范化项目。标准化的另一个例子是将连续的空格字符折叠成单个空格并剥离前导/尾随空格。
#8
0
Running a regex against your word list is a very bad idea; it scales very badly. Using dict()
, set()
or frozenset()
will scale a lot better:
对你的单词列表运行正则表达式是一个非常糟糕的主意;它的鳞片非常严重。使用dict(),set()或frozenset()会更好地扩展:
s = set(['one','two','three'])
'two' in s ## true
b='four'
b in s ## false
s.add('four')
b in s ## true