I have about half a million items that need to be placed in a list. I can't have duplicates, and if an item is already there I need to get its index. So far I have:
    if Item in List:
        ItemNumber=List.index(Item)
    else:
        List.append(Item)
        ItemNumber=List.index(Item)
The problem is that as the list grows it gets progressively slower, until at some point it just isn't worth doing. I am limited to Python 2.5 because it is an embedded system.
4 answers
#1
15
You can use a set (in CPython since version 2.4) to efficiently look up duplicate values. If you really need an indexed system as well, you can use both a set and list.
Doing your lookups using a set will remove the overhead of the "if Item in List" test, but not that of List.index(Item).
Please note that ItemNumber=List.index(Item) is very inefficient to do right after List.append(Item). The item was just added at the end of the list, so its index is simply ItemNumber = len(List)-1.
To completely remove the overhead of List.index (that method searches through the list one element at a time, which is very inefficient on larger lists), you can use a dict mapping items back to their index.
I might rewrite it as follows:
    # earlier in the program, NOT inside the loop
    Dup = {}

    # inside your loop to add items:
    if Item in Dup:
        ItemNumber = Dup[Item]
    else:
        List.append(Item)
        Dup[Item] = ItemNumber = len(List)-1
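For reference, the same idea can be wrapped in a small helper so the list and the dict can't drift out of sync. This is only a sketch: the function name get_or_add and the variable names items and index_of are made up for illustration, and it sticks to syntax that should also run on Python 2.5.

```python
def get_or_add(item, items, index_of):
    """Return the index of item in items, appending it first if it is new.

    items    -- the list that stores each distinct item once
    index_of -- dict mapping every item in items to its position
    """
    if item in index_of:            # hash lookup, no scan of the list
        return index_of[item]
    index_of[item] = len(items)     # the new item will land at the end
    items.append(item)
    return index_of[item]


items = []
index_of = {}
numbers = [get_or_add(x, items, index_of) for x in ["a", "b", "a", "c", "b"]]
# numbers is [0, 1, 0, 2, 1] and items is ["a", "b", "c"]
```

The cost per item stays roughly constant as the list grows, because neither branch ever walks the list.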
#2
7
If you really need to keep the data in an array, I'd use a separate dictionary to keep track of duplicates. This requires roughly twice as much memory, but lookups won't slow down significantly as the list grows.
    existing = dict()  # create once, outside the loop

    # inside the loop:
    if Item in existing:
        ItemNumber = existing[Item]
    else:
        ItemNumber = existing[Item] = len(List)
        List.append(Item)
However, if you don't need to save the order of items you should just use a set instead. A set takes more space than a plain list because it is a hash table, but less than keeping both a list and a dictionary, and membership tests are as fast as with a dictionary.
Items = set()
# ...
Items.add(Item) # will do nothing if Item is already added
Both of these require that your objects are hashable. In Python, most types are hashable unless they are containers whose contents can be modified. For example, lists are not hashable because you can modify their contents, but tuples are hashable because you cannot (as long as their elements are hashable too).
If you are trying to store values that aren't hashable, there isn't a fast general solution.
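One partial workaround, when the unhashable values happen to be flat lists of hashable elements, is to convert each one to a tuple and use that as the dictionary key. A sketch under that assumption (to_key is a made-up helper name, and this does not help with nested lists, dicts or arbitrary mutable objects):

```python
def to_key(value):
    """Return a hashable stand-in for value (flat lists only)."""
    if isinstance(value, list):
        return tuple(value)     # a tuple of hashable elements is hashable
    return value


existing = {}
items = []
for value in [[1, 2], [3, 4], [1, 2]]:
    key = to_key(value)
    if key not in existing:
        existing[key] = len(items)
        items.append(value)
    item_number = existing[key]
# items is [[1, 2], [3, 4]]; the last item_number is 0
```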
#3
5
You can improve the check a lot:
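    check = set(List)
    for Item in NewList:
        if Item in check:
            ItemNumber = List.index(Item)  # still a linear scan of the list
        else:
            ItemNumber = len(List)
            List.append(Item)
            check.add(Item)  # keep the set in sync so repeats within NewList are caught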
Or, even better, if order is not important you can do this:
oldlist = set(List)
addlist = set(AddList)
newlist = list(oldlist | addlist)
And if you need to loop over the items that were duplicated:
    for item in (oldlist & addlist):
        pass # do stuff
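A tiny worked example of the set operations above, using made-up sample data. Converting to a set throws away the original order, so the results are sorted here only to make them predictable.

```python
List = ["a", "b", "c"]
AddList = ["b", "c", "d"]

oldlist = set(List)
addlist = set(AddList)

newlist = sorted(oldlist | addlist)      # union: every distinct item once
duplicated = sorted(oldlist & addlist)   # intersection: items in both lists
# newlist is ['a', 'b', 'c', 'd'] and duplicated is ['b', 'c']
```

Since the indices of existing items are not preserved, this only fits the case where the item numbers don't matter.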
#4
0
What is the range of your half a million items? If you can make a few guarantees about that range, you might be able to trade memory for speed by deliberately using memory very inefficiently. I believe an approach along these lines would be the fastest possible, but it might not be practical for an embedded application unless you can make some very strict guarantees.
Does this answer help point you towards the time/memory trade-off I am alluding to? I can clarify more if you'd like.
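One way to read this suggestion, as an assumption rather than something the answer spells out: if every item is a non-negative integer below a known bound, a preallocated list can act as the lookup table, with the item value itself as the position. MAX_VALUE, NOT_SEEN, slot and get_or_add are all hypothetical names for this sketch.

```python
MAX_VALUE = 1000            # assumed exclusive upper bound on item values
NOT_SEEN = -1               # marker for "this value has no index yet"

slot = [NOT_SEEN] * MAX_VALUE   # slot[v] holds the index of value v
items = []


def get_or_add(value):
    """Return the index of value in items, appending it if it is new.

    Only valid for integers in range(MAX_VALUE); a negative value would
    silently index from the end of slot, so callers must guarantee the range.
    """
    if slot[value] == NOT_SEEN:
        slot[value] = len(items)
        items.append(value)
    return slot[value]


numbers = [get_or_add(v) for v in [7, 42, 7, 999, 42]]
# numbers is [0, 1, 0, 2, 1] and items is [7, 42, 999]
```

The table costs one entry per possible value whether or not that value ever shows up, which is the memory inefficiency being traded for speed.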