I have two numpy arrays of the same length that contain binary values
我有两个相同长度的numpy数组包含二进制值
import numpy as np
a=np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b=np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])
I want to compute the hamming distance between them as fast as possible since I have millions of such distance computations to make.
我想尽可能快地计算它们之间的汉明距离,因为我有数以百万计的这样的距离计算。
A simple but slow option is this (taken from wikipedia):
一个简单但缓慢的选择(取自*):
%timeit sum(ch1 != ch2 for ch1, ch2 in zip(a, b))
10000 loops, best of 3: 79 us per loop
I have come up with faster options, inspired by some answers here on stack overflow.
我已经提出了更快的选项,灵感来自堆栈溢出的一些答案。
%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 6.94 us per loop
%timeit len(np.bitwise_xor(a,b).nonzero()[0])
100000 loops, best of 3: 2.43 us per loop
I'm wondering if there are even faster ways to compute this, possibly using cython?
我想知道是否有更快的方法来计算这个,可能使用cython?
3 个解决方案
#1
13
There is a ready numpy function which beats len((a != b).nonzero()[0])
;)
有一个准备好的numpy函数击败len((a!= b).nonzero()[0]);)
np.count_nonzero(a!=b)
#2
4
Using pythran can bring extra benefit here:
使用pythran可以带来额外的好处:
$ cat hamm.py
#pythran export hamm(int[], int[])
from numpy import nonzero
def hamm(a,b):
return len(nonzero(a != b)[0])
As a reference (without pythran):
作为参考(没有pythran):
$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
100000 loops, best of 3: 4.66 usec per loop
While after pythran compilation:
在pythran编译之后:
$ python -m pythran.run hamm.py
$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
1000000 loops, best of 3: 0.745 usec per loop
That's roughly a 6x
speedup over the numpy implementation, as pythran skips the creation of an intermediate array when evaluating the element wise comparison.
这比numpy实现大约快了6倍,因为pythran在评估元素明智比较时会跳过创建中间数组。
I also measured:
我还测量过:
def hamm(a,b):
return count_nonzero(a != b)
And I get 3.11 usec per loop
for the Python version and 0.427 usec per loop
with the Pythran one.
对于Python版本,每循环获得3.11 usec,使用Pythran获得每循环0.427 usec。
Disclaimer: I'm one of the Pythran dev.
免责声明:我是Pythran开发者之一。
#3
3
Compared to 1.07µs for np.count_nonzero(a!=b) on my platform, gmpy2.hamdist gets its down to about 143ns after conversion of each array to an mpz (multiple-precison integer):
与我平台上的np.count_nonzero(a!= b)的1.07μs相比,gmpy2.hamdist在将每个数组转换为mpz(多精度整数)后将其降低到大约143ns:
import numpy as np
from gmpy2 import mpz, hamdist, pack
a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])
Based on a tip from @casevh, conversion from a 1D array of ones and zeros to a gmpy2 mpz object can be done reasonably efficiently with gmpy2.pack(list(reversed(list(array))),1).
基于来自@casevh的提示,可以使用gmpy2.pack(list(reversed(list(array)))1来合理有效地完成从1和0的数组到gmpy2 mpz对象的转换。
# gmpy2.pack reverses bit order but that does not affect
# hamdist since both its arguments are reversed
ampz = pack(list(a),1) # takes about 4.29µs
bmpz = pack(list(b),1)
hamdist(ampz,bmpz)
Out[8]: 7
%timeit hamdist(ampz,bmpz)
10000000 loops, best of 3: 143 ns per loop
for relative comparison, on my platform:
相对比较,在我的平台上:
%timeit np.count_nonzero(a!=b)
1000000 loops, best of 3: 1.07 µs per loop
%timeit len((a != b).nonzero()[0])
1000000 loops, best of 3: 1.55 µs per loop
%timeit len(np.bitwise_xor(a,b).nonzero()[0])
1000000 loops, best of 3: 1.7 µs per loop
%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 5.8 µs per loop
#1
13
There is a ready numpy function which beats len((a != b).nonzero()[0])
;)
有一个准备好的numpy函数击败len((a!= b).nonzero()[0]);)
np.count_nonzero(a!=b)
#2
4
Using pythran can bring extra benefit here:
使用pythran可以带来额外的好处:
$ cat hamm.py
#pythran export hamm(int[], int[])
from numpy import nonzero
def hamm(a,b):
return len(nonzero(a != b)[0])
As a reference (without pythran):
作为参考(没有pythran):
$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
100000 loops, best of 3: 4.66 usec per loop
While after pythran compilation:
在pythran编译之后:
$ python -m pythran.run hamm.py
$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
1000000 loops, best of 3: 0.745 usec per loop
That's roughly a 6x
speedup over the numpy implementation, as pythran skips the creation of an intermediate array when evaluating the element wise comparison.
这比numpy实现大约快了6倍,因为pythran在评估元素明智比较时会跳过创建中间数组。
I also measured:
我还测量过:
def hamm(a,b):
return count_nonzero(a != b)
And I get 3.11 usec per loop
for the Python version and 0.427 usec per loop
with the Pythran one.
对于Python版本,每循环获得3.11 usec,使用Pythran获得每循环0.427 usec。
Disclaimer: I'm one of the Pythran dev.
免责声明:我是Pythran开发者之一。
#3
3
Compared to 1.07µs for np.count_nonzero(a!=b) on my platform, gmpy2.hamdist gets its down to about 143ns after conversion of each array to an mpz (multiple-precison integer):
与我平台上的np.count_nonzero(a!= b)的1.07μs相比,gmpy2.hamdist在将每个数组转换为mpz(多精度整数)后将其降低到大约143ns:
import numpy as np
from gmpy2 import mpz, hamdist, pack
a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])
Based on a tip from @casevh, conversion from a 1D array of ones and zeros to a gmpy2 mpz object can be done reasonably efficiently with gmpy2.pack(list(reversed(list(array))),1).
基于来自@casevh的提示,可以使用gmpy2.pack(list(reversed(list(array)))1来合理有效地完成从1和0的数组到gmpy2 mpz对象的转换。
# gmpy2.pack reverses bit order but that does not affect
# hamdist since both its arguments are reversed
ampz = pack(list(a),1) # takes about 4.29µs
bmpz = pack(list(b),1)
hamdist(ampz,bmpz)
Out[8]: 7
%timeit hamdist(ampz,bmpz)
10000000 loops, best of 3: 143 ns per loop
for relative comparison, on my platform:
相对比较,在我的平台上:
%timeit np.count_nonzero(a!=b)
1000000 loops, best of 3: 1.07 µs per loop
%timeit len((a != b).nonzero()[0])
1000000 loops, best of 3: 1.55 µs per loop
%timeit len(np.bitwise_xor(a,b).nonzero()[0])
1000000 loops, best of 3: 1.7 µs per loop
%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 5.8 µs per loop