Is there a built-in that removes duplicates from a list in Python, whilst preserving order? I know that I can use a set to remove duplicates, but that destroys the original order. I also know that I can roll my own like this:
def uniq(input):
    output = []
    for x in input:
        if x not in output:
            output.append(x)
    return output
(Thanks to unwind for that code sample.)
But I'd like to avail myself of a built-in or a more Pythonic idiom if possible.
Related question: In Python, what is the fastest algorithm for removing duplicates from a list so that all elements are unique while preserving order?
28 Answers
#1
633
Here you have some alternatives: http://www.peterbe.com/plog/uniqifiers-benchmark
Fastest one:
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn't smart enough to rule that out. To play it safe, it has to check the object each time.
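For reference, a quick sanity check of f7 on a small input (first occurrences win):

>>> f7([1, 2, 0, 1, 3, 2])
[1, 2, 0, 3]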
If you plan on using this function a lot on the same dataset, perhaps you would be better off with an ordered set: http://code.activestate.com/recipes/528878/
O(1) insertion, deletion and member-check per operation.
#2
285
Edit 2016
As Raymond pointed out, in Python 3.5+ where OrderedDict is implemented in C, the list comprehension approach will be slower than OrderedDict (unless you actually need the list at the end, and even then, only if the input is very short). So the best solution for 3.5+ is OrderedDict.
Important Edit 2015
As @abarnert notes, the more_itertools library (pip install more_itertools) contains a unique_everseen function that is built to solve this problem without any unreadable (not seen.add) mutations in list comprehensions. This is also the fastest solution:
>>> from more_itertools import unique_everseen
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(unique_everseen(items))
[1, 2, 0, 3]
Just one simple library import and no hacks. This comes from an implementation of the itertools recipe unique_everseen, which looks like:
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
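For instance, the key parameter makes case-insensitive deduplication a one-liner, matching the docstring example above:

>>> list(unique_everseen('ABBCcAD', key=str.lower))
['A', 'B', 'C', 'D']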
In Python 2.7+, the accepted common idiom (which works but isn't optimized for speed; I would now use unique_everseen) for this uses collections.OrderedDict:
Runtime: O(N)
>>> from collections import OrderedDict
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(OrderedDict.fromkeys(items))
[1, 2, 0, 3]
This looks much nicer than:
seen = set()
[x for x in seq if x not in seen and not seen.add(x)]
and doesn't utilize the ugly hack:
not seen.add(x)
which relies on the fact that set.add is an in-place method that always returns None, so not None evaluates to True.
Note however that the hack solution is faster in raw speed though it has the same runtime complexity O(N).
#3
45
In Python 2.7, the new way of removing duplicates from an iterable while keeping it in the original order is:
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
In Python 3.5, the OrderedDict has a C implementation. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.
In Python 3.6, the regular dict became both ordered and compact. (This feature holds for CPython and PyPy but may not be present in other implementations.) That gives us a new fastest way of deduping while retaining order:
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
In Python 3.7, the regular dict is guaranteed to be ordered across all implementations. So, the shortest and fastest solution is:
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
Response to @max: Once you move to 3.6 or 3.7 and use the regular dict instead of OrderedDict, you can't really beat the performance in any other way. The dictionary is dense and readily converts to a list with almost no overhead. The target list is pre-sized to len(d), which saves all the resizes that occur in a list comprehension. Also, since the internal key list is dense, copying the pointers is almost as fast as a list copy.
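A minimal way to check this on your own machine (a sketch; absolute numbers vary by interpreter and version, so none are quoted here):

from timeit import timeit

setup = "data = list(range(1000)) * 2"
print(timeit("list(dict.fromkeys(data))", setup, number=1000))
print(timeit("list(OrderedDict.fromkeys(data))",
             "from collections import OrderedDict; " + setup, number=1000))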
#4
39
sequence = ['1', '2', '3', '3', '6', '4', '5', '6']
unique = []
[unique.append(item) for item in sequence if item not in unique]
unique → ['1', '2', '3', '6', '4', '5']
#5
22
from itertools import groupby
[key for key, _ in groupby(sortedList)]
The list doesn't even have to be sorted; the sufficient condition is that equal values are grouped together.
Edit: I assumed that "preserving order" implies that the list is actually ordered. If this is not the case, then the solution from MizardX is the right one.
Community edit: This is however the most elegant way to "compress duplicate consecutive elements into a single element".
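To make the caveat concrete, here is a minimal sketch showing that groupby only collapses consecutive equal values, so separated re-occurrences survive:

>>> from itertools import groupby
>>> [key for key, _ in groupby([1, 1, 2, 2, 2, 3, 1, 1])]
[1, 2, 3, 1]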
#6
18
I think if you want to maintain the order, you can try this:
list1 = ['b','c','d','b','c','a','a']
list2 = list(set(list1))
list2.sort(key=list1.index)
print list2
OR similarly you can do this:
list1 = ['b','c','d','b','c','a','a']
list2 = sorted(set(list1),key=list1.index)
print list2
You can also do this:
list1 = ['b','c','d','b','c','a','a']
list2 = []
for i in list1:
    if i not in list2:
        list2.append(i)
print list2
It can also be written as this:
list1 = ['b','c','d','b','c','a','a']
list2 = []
[list2.append(i) for i in list1 if i not in list2]
print list2
#7
11
For another very late answer to another very old question:
The itertools recipes have a function that does this, using the seen set technique, but:
- Handles a standard key function.
- Uses no unseemly hacks.
- Optimizes the loop by pre-binding seen.add instead of looking it up N times. (f7 also does this, but some versions don't.)
- Optimizes the loop by using ifilterfalse, so you only have to loop over the unique elements in Python, instead of all of them. (You still iterate over all of them inside ifilterfalse, of course, but that's in C, and much faster.)
Is it actually faster than f7? It depends on your data, so you'll have to test it and see. If you want a list in the end, f7 uses a listcomp, and there's no way to do that here. (You can directly append instead of yielding, or you can feed the generator into the list function, but neither one can be as fast as the LIST_APPEND inside a listcomp.) At any rate, usually, squeezing out a few microseconds is not going to be as important as having an easily-understandable, reusable, already-written function that doesn't require DSU when you want to decorate.
As with all of the recipes, it's also available in more-itertools.
If you just want the no-key case, you can simplify it as:
import itertools

def unique(iterable):
    seen = set()
    seen_add = seen.add
    for element in itertools.ifilterfalse(seen.__contains__, iterable):
        seen_add(element)
        yield element
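Usage is plain generator consumption (Python 2, since ifilterfalse was renamed filterfalse in Python 3):

>>> list(unique('AAAABBBCCDAABBB'))
['A', 'B', 'C', 'D']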
#8
10
Just to add another (very performant) implementation of such a functionality from an external module1: iteration_utilities.unique_everseen:
>>> from iteration_utilities import unique_everseen
>>> lst = [1,1,1,2,3,2,2,2,1,3,4]
>>> list(unique_everseen(lst))
[1, 2, 3, 4]
Timings
I did some timings (Python 3.6) and these show that it's faster than all other alternatives I tested, including OrderedDict.fromkeys, f7 and more_itertools.unique_everseen:
%matplotlib notebook
from iteration_utilities import unique_everseen
from collections import OrderedDict
from more_itertools import unique_everseen as mi_unique_everseen

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

def iteration_utilities_unique_everseen(seq):
    return list(unique_everseen(seq))

def more_itertools_unique_everseen(seq):
    return list(mi_unique_everseen(seq))

def odict(seq):
    return list(OrderedDict.fromkeys(seq))

from simple_benchmark import benchmark

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: list(range(2**i)) for i in range(1, 20)},
              'list size (no duplicates)')
b.plot()
And just to make sure, I also ran a test with more duplicates, to check whether it makes a difference:
import random

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(1, 20)},
              'list size (lots of duplicates)')
b.plot()
And one containing only one value:
b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [1]*(2**i) for i in range(1, 20)},
              'list size (only duplicates)')
b.plot()
In all of these cases the iteration_utilities.unique_everseen function is the fastest (on my computer).
This iteration_utilities.unique_everseen function can also handle unhashable values in the input (however with O(n*n) performance instead of the O(n) performance when the values are hashable).
>>> lst = [{1}, {1}, {2}, {1}, {3}]
>>> list(unique_everseen(lst))
[{1}, {2}, {3}]
1 Disclaimer: I'm the author of that package.
#9
6
For non-hashable types (e.g. lists of lists), based on MizardX's:
def f7_noHash(seq):
    seen = set()
    return [x for x in seq if str(x) not in seen and not seen.add(str(x))]
#10
5
Not to kick a dead horse (this question is very old and already has lots of good answers), but here is a solution using pandas that is quite fast in many circumstances and is dead simple to use.
import pandas as pd
my_list = range(5) + range(5) # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
>>> pd.Series(my_list).drop_duplicates().tolist()
# Output:
# [0, 1, 2, 3, 4]
#11
3
Borrowing the recursive idea used in defining Haskell's nub function for lists, this would be a recursive approach:
def unique(lst):
    return [] if lst==[] else [lst[0]] + unique(filter(lambda x: x!= lst[0], lst[1:]))
e.g.:
In [118]: unique([1,5,1,1,4,3,4])
Out[118]: [1, 5, 4, 3]
I tried it for growing data sizes and saw sub-linear time-complexity (not definitive, but suggests this should be fine for normal data).
In [122]: %timeit unique(np.random.randint(5, size=(1)))
10000 loops, best of 3: 25.3 us per loop
In [123]: %timeit unique(np.random.randint(5, size=(10)))
10000 loops, best of 3: 42.9 us per loop
In [124]: %timeit unique(np.random.randint(5, size=(100)))
10000 loops, best of 3: 132 us per loop
In [125]: %timeit unique(np.random.randint(5, size=(1000)))
1000 loops, best of 3: 1.05 ms per loop
In [126]: %timeit unique(np.random.randint(5, size=(10000)))
100 loops, best of 3: 11 ms per loop
I also think it's interesting that this could be readily generalized to uniqueness by other operations. Like this:
import operator

def unique(lst, cmp_op=operator.ne):
    return [] if lst==[] else [lst[0]] + unique(filter(lambda x: cmp_op(x, lst[0]), lst[1:]), cmp_op)
For example, you could pass in a function that treats rounding to the same integer as "equality" for uniqueness purposes, like this:
def test_round(x,y):
    return round(x) != round(y)
then unique(some_list, test_round) would provide the unique elements of the list, where uniqueness no longer means traditional equality (which is implied by using any sort of set-based or dict-key-based approach to this problem) but instead keeps only the first element that rounds to K, for each possible integer K that the elements might round to, e.g.:
In [6]: unique([1.2, 5, 1.9, 1.1, 4.2, 3, 4.8], test_round)
Out[6]: [1.2, 5, 1.9, 4.2, 3]
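One hedged caveat: on Python 3, filter returns a lazy iterator and the lst==[] comparison in the recursion no longer works as written. An eager variant of the same nub idea would be:

def unique(lst):
    # build the filtered remainder as a list before recursing
    if not lst:
        return []
    return [lst[0]] + unique([x for x in lst if x != lst[0]])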
#12
3
A reduce variant that is 5x faster, but more sophisticated:
>>> l = [5, 6, 6, 1, 1, 2, 2, 3, 4]
>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]
Explanation:
from functools import reduce  # needed on Python 3; a builtin on Python 2

default = (list(), set())
# use list to keep order
# use set to make lookup faster

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

>>> reduce(reducer, l, default)[0]
[5, 6, 1, 2, 3, 4]
#13
3
You can reference a list comprehension as it is being built by the symbol '_[1]'. For example, the following function unique-ifies a list of elements without changing their order, by referencing its list comprehension. (Note that this relies on a CPython 2 implementation detail; it does not work in Python 3.)
def unique(my_list):
    return [x for x in my_list if x not in locals()['_[1]']]
Demo:
l1 = [1, 2, 3, 4, 1, 2, 3, 4, 5]
l2 = [x for x in l1 if x not in locals()['_[1]']]
print l2
Output:
[1, 2, 3, 4, 5]
#14
2
MizardX's answer gives a good collection of multiple approaches.
This is what I came up with while thinking aloud:
mylist = [x for i,x in enumerate(mylist) if x not in mylist[i+1:]]
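One subtlety worth noting: because it drops x whenever another copy appears later in the list, this keeps the last occurrence of each value rather than the first:

>>> mylist = [3, 4, 3, 6, 4, 1, 4, 8]
>>> [x for i, x in enumerate(mylist) if x not in mylist[i+1:]]
[3, 6, 1, 4, 8]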
#15
1
You could do a sort of ugly list comprehension hack.
[l[i] for i in range(len(l)) if l.index(l[i]) == i]
#16
1
A relatively effective approach with sorted numpy arrays:
import numpy as np

b = np.array([1, 3, 3, 8, 12, 12, 12])
np.hstack([b[0], [x[0] for x in zip(b[1:], b[:-1]) if x[0] != x[1]]])
Outputs:
array([ 1, 3, 8, 12])
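Since the input is sorted anyway, np.unique (which returns its result sorted) gives the same answer directly:

>>> np.unique(b)
array([ 1,  3,  8, 12])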
#17
1
l = [1,2,2,3,3,...]
n = []
n.extend(ele for ele in l if ele not in set(n))
A generator expression that uses a set's O(1) lookup to determine whether or not to include an element in the new list. Note, though, that set(n) is rebuilt on every iteration, so the overall cost is still quadratic.
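To actually get O(1) lookups, a variant could maintain a single set alongside the list instead of rebuilding it, reusing the not seen.add hack from the accepted answer (a sketch):

l = [1, 2, 2, 3, 3]
n = []
seen = set()  # maintained incrementally, so membership tests stay O(1)
n.extend(ele for ele in l if ele not in seen and not seen.add(ele))
# n == [1, 2, 3]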
#18
1
A simple recursive solution:
def uniquefy_list(a):
    return uniquefy_list(a[1:]) if a[0] in a[1:] else [a[0]]+uniquefy_list(a[1:]) if len(a)>1 else [a[0]]
#19
1
In Python 3.7 and above, dictionaries are guaranteed to remember their key insertion order. The answer to this question summarizes the current state of affairs.
The OrderedDict solution thus becomes obsolete, and without any import statements we can simply issue:
>>> list(dict.fromkeys([1, 2, 1, 3, 3, 2, 4]).keys())
[1, 2, 3, 4]
#20
0
If you need a one-liner then maybe this would help:
reduce(lambda x, y: x + y if y[0] not in x else x, map(lambda x: [x],lst))
... should work, but correct me if I'm wrong.
#21
0
Because I was looking at a dup and collected some related but different, useful information that isn't part of the other answers, here are two other possible solutions.
.get(True) XOR .setdefault(False)
The first is very much like the accepted seen_add solution but with explicit side effects, using the dictionary's get(x,<default>) and setdefault(x,<default>):
# Explanation of d.get(x,True) != d.setdefault(x,False)
#
# x in d | d[x]  | A = d.get(x,True) | x in d | B = d.setdefault(x,False) | x in d | d[x]  | A xor B
# False  | None  | True  (1)         | False  | False (2)                 | True   | False | True
# True   | False | False (3)         | True   | False (4)                 | True   | False | False
#
# Notes
# (1) x is not in the dictionary, so get(x,<default>) returns True but does __not__ add the value to the dictionary
# (2) x is not in the dictionary, so setdefault(x,<default>) adds {x: False} and returns False
# (3) since x is in the dictionary, the <default> argument is ignored, and the value of the key is returned, which was
#     set to False in (2)
# (4) since the key is already in the dictionary, its value is returned directly and the argument is ignored
#
# A != B is how to do boolean XOR in Python
#
def sort_with_order(s):
    d = dict()
    return [x for x in s if d.get(x,True) != d.setdefault(x,False)]
get(x,<default>) returns <default> if x is not in the dictionary, but does not add the key to the dictionary. setdefault(x,<default>) returns the value if the key is in the dictionary, otherwise sets it to and returns <default>.
Aside: a != b is how to do an XOR in Python.
Overriding __missing__ (inspired by this answer)
The second technique is overriding the __missing__ method, which gets called when the key doesn't exist in a dictionary, and which is only called when using d[k] notation:
class Tracker(dict):
    # returns True if missing, otherwise sets the value to False
    # so next time d[key] is called, the value False will be returned
    # and __missing__ will not be called again
    def __missing__(self, key):
        self[key] = False
        return True

t = Tracker()
unique_with_order = [x for x in samples if t[x]]
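For example, with a hypothetical samples list (a fresh Tracker is needed per run, since it remembers every key it has seen):

>>> samples = [1, 2, 1, 3, 2]
>>> t = Tracker()
>>> [x for x in samples if t[x]]
[1, 2, 3]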
From the docs:
New in version 2.5: If a subclass of dict defines a method __missing__(), if the key key is not present, the d[key] operation calls that method with the key key as argument. The d[key] operation then returns or raises whatever is returned or raised by the __missing__(key) call if the key is not present. No other operations or methods invoke __missing__(). If __missing__() is not defined, KeyError is raised. __missing__() must be a method; it cannot be an instance variable. For an example, see collections.defaultdict.
#22
0
If you routinely use pandas, and aesthetics is preferred over performance, then consider the built-in function pandas.Series.drop_duplicates:
import pandas as pd
import numpy as np

uniquifier = lambda alist: pd.Series(alist).drop_duplicates().tolist()

# from the chosen answer
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

alist = np.random.randint(low=0, high=1000, size=10000).tolist()
print uniquifier(alist) == f7(alist)  # True
Timing:
In [104]: %timeit f7(alist)
1000 loops, best of 3: 1.3 ms per loop
In [110]: %timeit uniquifier(alist)
100 loops, best of 3: 4.39 ms per loop
#23
0
This will preserve order and run in O(n) time. Basically, the idea is to create a hole wherever a duplicate is found and sink it down to the bottom. It makes use of a read pointer and a write pointer. Whenever a duplicate is found, only the read pointer advances, while the write pointer stays on the duplicate entry to overwrite it.
def deduplicate(l):
    count = {}
    (read, write) = (0, 0)
    while read < len(l):
        if l[read] in count:
            read += 1
            continue
        count[l[read]] = True
        l[write] = l[read]
        read += 1
        write += 1
    return l[0:write]
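A quick check (note that the function also mutates the prefix of its argument in place):

>>> deduplicate([5, 6, 6, 1, 1, 2])
[5, 6, 1, 2]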
#24
0
A solution without using imported modules or sets:
text = "ask not what your country can do for you ask what you can do for your country"
sentence = text.split(" ")
noduplicates = [sentence[i] for i in range(len(sentence)) if sentence[i] not in sentence[:i]]
print(noduplicates)
Gives output:
['ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you']
#25
0
Here is my 2 cents on this:
def unique(nums):
    unique = []
    for n in nums:
        if n not in unique:
            unique.append(n)
    return unique
Regards, Yuriy
#26
0
This is the smartest way to remove duplicates from a list in Python whilst preserving its order; you can even do it in one line of code:
a_list = ["a", "b", "a", "c"]
sorted_list = [x[0] for x in (sorted({x:a_list.index(x) for x in set(a_list)}.items(), key=lambda x: x[1]))]
print sorted_list
#27
0
My buddy Wes gave me this sweet answer using list comprehensions.
Example Code:
>>> l = [3, 4, 3, 6, 4, 1, 4, 8]
>>> l = [l[i] for i in range(len(l)) if i == l.index(l[i])]
>>> l
[3, 4, 6, 1, 8]
#28
0
Just to add another answer I've not seen listed:
>>> a = ['f', 'F', 'F', 'G', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'f']
>>> [a[i] for i in sorted(set([a.index(elem) for elem in a]))]
['f', 'F', 'G', 'a', 'b', 'c', 'd']
This uses .index to get the first index of every list element, getting rid of duplicate results (for repeating elements) with set, and then sorting, because there's no order in sets. Note that we do not lose order information, because the first index of every new element is always in ascending order. So sorted will always put it right.
I've just considered the easy syntax, not performance.