How can I most optimally remove identical items from a list and sort it in Python?
Say I have a list:
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
I could iterate over a copy of the list (since you should not mutate a list while iterating over it), item by item, and remove all but one occurrence of each item:
for item in my_list[:]:  # iterate over a copy, because the loop mutates my_list
    count = my_list.count(item)  # see how many are in the list
    if count > 1:
        for _ in range(count - 1):  # remove all but one of the element
            my_list.remove(item)
Which removes the redundant items:
['b', 'c', 'a', 'd', 'f', 'e']
and then sort the list:
my_list.sort()
so my_list is now:
['a', 'b', 'c', 'd', 'e', 'f']
But what's the most efficient and direct (i.e. performant) way to remove the identical elements and sort this list?
*This question came up at work (I wanted so badly to answer it, but one of our senior-most Python developers got to it before me), and I also brought it up at my local Python Meetup group, and few people had a good answer for it, so I'm answering it Q&A style, as suggested by *.
4 Answers
#1
15
The best way to remove redundant elements from a list is to convert it to a set. Since sorted accepts any iterable and returns a list, passing the set straight to sorted is far more efficient than removing the duplicates piecewise.
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
def sorted_set(a_list):
    return sorted(set(a_list))
new_list = sorted_set(my_list)
and new_list is:
['a', 'b', 'c', 'd', 'e', 'f']
The downside of this approach is that elements given to set must be hashable, so if the elements are unhashable, you'll get an error:
>>> my_list = [['a'], ['a'], ['b'], ['c']]
>>> sorted(set(my_list))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
This trivial case could be addressed by converting the sublists to tuples, which may still be more performant than the solution in the question, since that solution could mean more expensive tests for equality:
>>> my_list = [tuple(i) for i in my_list]
>>> sorted(set(my_list))
[('a',), ('b',), ('c',)]
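If the caller still needs lists rather than tuples, the conversion can be undone after sorting. The following is only a minimal sketch: the helper name sorted_unique_sublists is made up for illustration, and it assumes every sublist contains hashable, mutually comparable items.

```python
def sorted_unique_sublists(list_of_lists):
    """Deduplicate and sort a list of lists via temporary tuple copies.

    Hypothetical helper: assumes the items inside each sublist are
    hashable and can be compared with each other.
    """
    # Tuples are hashable, so they can go into a set.
    unique_tuples = set(tuple(sub) for sub in list_of_lists)
    # Convert back so the caller receives lists, as in the original data.
    return [list(t) for t in sorted(unique_tuples)]


result = sorted_unique_sublists([['b'], ['a'], ['a'], ['c']])
print(result)  # [['a'], ['b'], ['c']]
```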
But other cases would need different workarounds. None of this is necessary with the solution from the question (though, again, it may be much more computationally expensive):
def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list
Which works for sublists:
>>> my_list = [['a'], ['a'], ['b'], ['c']]
>>> remove_extras_and_sort(my_list)
[['a'], ['b'], ['c']]
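One way to get the fast path when it is available and still handle unhashable items is to try the set first and fall back only on a TypeError. This is a sketch rather than a recommendation from the answer: unique_sorted is a hypothetical name, and it assumes the argument is a list (or another re-iterable sequence) whose items can be ordered.

```python
from itertools import groupby


def unique_sorted(items):
    """Return the sorted unique items, hashable or not (hypothetical helper)."""
    try:
        # Fast path: hashing removes duplicates before sorting.
        return sorted(set(items))
    except TypeError:
        # Unhashable items: sort first so equal items become adjacent,
        # then keep one representative from each run of equal items.
        return [key for key, _ in groupby(sorted(items))]


print(unique_sorted(['b', 'a', 'b']))                # ['a', 'b']
print(unique_sorted([['a'], ['a'], ['b'], ['c']]))   # [['a'], ['b'], ['c']]
```

If the items cannot be ordered either, the fallback's call to sorted raises TypeError again, so that case still needs its own workaround.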
To compare performance:
import timeit
setup = '''
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list
def sorted_set(a_list):
    return sorted(set(a_list))
'''
timeit.timeit('sorted_set(my_list[:])', setup=setup)
timeit.timeit('remove_extras_and_sort(my_list[:])', setup=setup)
Which returns times as I measure them on my system, respectively:
1.5562372207641602
4.558010101318359
This means the method given in the question likely takes more than 3 times as long to compute: the measured ratio is a little under 3, but both timings include the overhead of copying the list on every run. The copy is necessary because the setup only runs once; without it, every run after the first would just be sorting a list that has already been deduplicated and sorted.
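To get a feel for how much of each measurement is just the copy, the slice can be timed on its own as a baseline. This is a rough sketch, not part of the original benchmark: the number argument is lowered from timeit's default of 1,000,000 purely to keep the run short, and the absolute times will differ from machine to machine.

```python
import timeit

setup = '''
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']

def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list

def sorted_set(a_list):
    return sorted(set(a_list))
'''

number = 20000  # assumption: a smaller run count than the default, for speed

# Baseline: the cost of the copy alone, which both measurements include.
copy_time = timeit.timeit('my_list[:]', setup=setup, number=number)
set_time = timeit.timeit('sorted_set(my_list[:])', setup=setup, number=number)
loop_time = timeit.timeit('remove_extras_and_sort(my_list[:])',
                          setup=setup, number=number)

print('copy only:             ', copy_time)
print('sorted_set:            ', set_time)
print('remove_extras_and_sort:', loop_time)

# Rough ratio with the shared copy overhead subtracted from both sides.
if set_time > copy_time:
    print('ratio without copy:', (loop_time - copy_time) / (set_time - copy_time))
```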
We can disassemble each function:
import dis
def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list
def sorted_set(a_list):
    return sorted(set(a_list))
And just by looking at the output, we see that the bytecode for the first function is more than six times as long:
>>> dis.dis(remove_extras_and_sort)
2 0 SETUP_LOOP 85 (to 88)
3 LOAD_FAST 0 (my_list)
6 SLICE+0
7 GET_ITER
>> 8 FOR_ITER 76 (to 87)
11 STORE_FAST 1 (item)
3 14 LOAD_FAST 0 (my_list)
17 LOAD_ATTR 0 (count)
20 LOAD_FAST 1 (item)
23 CALL_FUNCTION 1
26 STORE_FAST 2 (count)
4 29 LOAD_FAST 2 (count)
32 LOAD_CONST 1 (1)
35 COMPARE_OP 4 (>)
38 POP_JUMP_IF_FALSE 8
5 41 SETUP_LOOP 40 (to 84)
44 LOAD_GLOBAL 1 (range)
47 LOAD_FAST 2 (count)
50 LOAD_CONST 1 (1)
53 BINARY_SUBTRACT
54 CALL_FUNCTION 1
57 GET_ITER
>> 58 FOR_ITER 19 (to 80)
61 STORE_FAST 3 (_)
6 64 LOAD_FAST 0 (my_list)
67 LOAD_ATTR 2 (remove)
70 LOAD_FAST 1 (item)
73 CALL_FUNCTION 1
76 POP_TOP
77 JUMP_ABSOLUTE 58
>> 80 POP_BLOCK
81 JUMP_ABSOLUTE 8
>> 84 JUMP_ABSOLUTE 8
>> 87 POP_BLOCK
7 >> 88 LOAD_FAST 0 (my_list)
91 LOAD_ATTR 3 (sort)
94 CALL_FUNCTION 0
97 POP_TOP
8 98 LOAD_FAST 0 (my_list)
101 RETURN_VALUE
And the recommended way has much shorter bytecode:
>>> dis.dis(sorted_set)
2 0 LOAD_GLOBAL 0 (sorted)
3 LOAD_GLOBAL 1 (set)
6 LOAD_FAST 0 (a_list)
9 CALL_FUNCTION 1
12 CALL_FUNCTION 1
15 RETURN_VALUE
So we see that using the builtin functionality of Python is much more effective and efficient than trying to reinvent the wheel.
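The comparison of bytecode lengths can also be done programmatically instead of by eye. The sketch below assumes Python 3.4 or later, where dis.get_instructions is available; instruction_count is a hypothetical helper, and the exact counts depend on the interpreter version, so none are shown. Instruction count is only a rough proxy, since a single call instruction can hide a lot of work; the timings are the stronger evidence.

```python
import dis


def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list


def sorted_set(a_list):
    return sorted(set(a_list))


def instruction_count(func):
    """Count the bytecode instructions of func (hypothetical helper)."""
    return sum(1 for _ in dis.get_instructions(func))


loop_count = instruction_count(remove_extras_and_sort)
set_count = instruction_count(sorted_set)
print('remove_extras_and_sort:', loop_count, 'instructions')
print('sorted_set:            ', set_count, 'instructions')
```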
Addendum: other options that need to be more fully explored:
def groupby_sorted(my_list):
    """if items in my_list are unhashable"""
    from itertools import groupby
    return [e for e, g in groupby(sorted(my_list))]
def preserve_order_encountered(my_list):
    """elements in argument must be hashable - preserves order encountered"""
    from collections import OrderedDict
    return list(OrderedDict.fromkeys(my_list))
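As a quick illustration of how the two addendum functions behave, here is a small usage sketch. The sample inputs are made up for the example, and the imports are moved to module level only for brevity.

```python
from collections import OrderedDict
from itertools import groupby


def groupby_sorted(my_list):
    """if items in my_list are unhashable"""
    # groupby only merges adjacent equal items, hence the sort first.
    return [e for e, g in groupby(sorted(my_list))]


def preserve_order_encountered(my_list):
    """elements in argument must be hashable - preserves order encountered"""
    return list(OrderedDict.fromkeys(my_list))


# Unhashable items (lists) are fine for groupby_sorted.
print(groupby_sorted([['b'], ['a'], ['a'], ['c']]))  # [['a'], ['b'], ['c']]

# The result is not sorted; it keeps the order of first appearance.
print(preserve_order_encountered(['d', 'a', 'd', 'b', 'a']))  # ['d', 'a', 'b']
```

On Python 3.7 and later, a plain dict also preserves insertion order, so list(dict.fromkeys(my_list)) gives the same result as the OrderedDict version.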
#2
2
Placing the items into a set and then sorting is going to be efficient, but it does rely on the items being hashable:
def sorted_set(a_list):
    return sorted(set(a_list))
timeit sorted_set(my_list)
100000 loops, best of 3: 3.19 µs per loop
The classic way to get a sorted list of unique elements is first to sort, then to perform a second pass over the list, eliminating identical elements (which are guaranteed to be adjacent after the sort):
def sorted_unique(a_list):
    l = sorted(a_list)
    return l[:1] + [b for a, b in zip(l, l[1:]) if a != b]
This is not too bad compared to using set:
timeit sorted_unique(my_list)
100000 loops, best of 3: 6.6 µs per loop
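The second pass in sorted_unique can also be written as an explicit loop, which some readers may find easier to follow than the zip over neighbouring pairs. This is only an illustrative sketch (sorted_unique_loop is a made-up name) and it has not been timed against the versions above.

```python
def sorted_unique_loop(a_list):
    """Sort, then keep each item only if it differs from the last one kept."""
    result = []
    for item in sorted(a_list):
        # After sorting, duplicates are adjacent, so comparing with the
        # most recently kept item is enough to detect them.
        if not result or item != result[-1]:
            result.append(item)
    return result


my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
print(sorted_unique_loop(my_list))  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Like sorted_unique, this relies only on ordering and equality, so it also works for unhashable items such as lists.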
We can actually do better using itertools.groupby:
from itertools import groupby
def sorted_group(a_list):
    return [k for k, _ in groupby(sorted(a_list))]
timeit sorted_group(my_list)
100000 loops, best of 3: 5.3 µs per loop
Finally, if the items are primitive values it's worth considering numpy; in this case (on a small list) the overheads outweigh any benefit, but it performs well on larger problem sets:
import numpy as np
def sorted_np(a_list):
    return np.unique(np.sort(a_list))
timeit sorted_np(my_list)
10000 loops, best of 3: 42 µs per loop
import random
my_list = [random.randint(0, 10**6) for _ in range(10**6)]
timeit sorted_set(my_list)
1 loops, best of 3: 454 ms per loop
timeit sorted_np(my_list)
1 loops, best of 3: 333 ms per loop
#3
1
It takes just two simple functions in Python:
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
print(sorted(set(my_list)))
and you get what you want ;)
If you want more information about sets, look here; for sorting in Python, have a look here.
Hope this helps.
#4
-1
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
b = []
for x in my_list:
    try:
        b.index(x)  # raises ValueError if x is not in b yet
    except ValueError:
        b.append(x)  # first time we see x, so keep it
b.sort()
Output:
['a', 'b', 'c', 'd', 'e', 'f']