Say I have these 2D arrays A and B.
How can I remove the elements of A that are also in B? (Set difference, or relative complement, in set theory: A-B)
A=np.asarray([[1,1,1], [1,1,2], [1,1,3], [1,1,4]])
B=np.asarray([[0,0,0], [1,0,2], [1,0,3], [1,0,4], [1,1,0], [1,1,1], [1,1,4]])
#output = [[1,1,2], [1,1,3]]
To be more precise, I would like to do something like this.
data = some numpy array
label = some numpy array
A = np.argwhere(label==0) #[[1 1 1], [1 1 2], [1 1 3], [1 1 4]]
B = np.argwhere(data>1.5) #[[0 0 0], [1 0 2], [1 0 3], [1 0 4], [1 1 0], [1 1 1], [1 1 4]]
out = np.argwhere(label==0 and data>1.5) #[[1 1 2], [1 1 3]]
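If only the final index list is needed, one possible shortcut (a minimal sketch, not part of the original question) is to combine the boolean masks before calling np.argwhere, since A-B corresponds to "label is 0 and data is not above 1.5". The arrays below are made-up stand-ins for data and label; their shapes and values are assumptions.

```python
import numpy as np

# Hypothetical stand-ins for the asker's arrays (shapes and values are assumptions).
rng = np.random.default_rng(0)
data = rng.uniform(0.0, 3.0, size=(2, 2, 5))
label = rng.integers(0, 2, size=(2, 2, 5))

# Index lists as in the question.
A = np.argwhere(label == 0)
B = np.argwhere(data > 1.5)

# A - B expressed on the masks: label is 0 AND data is NOT above 1.5.
# Element-wise operators (&, ~) are needed; the Python keyword "and" raises on arrays.
out = np.argwhere((label == 0) & ~(data > 1.5))
```

This avoids the row-by-row comparison entirely, but it only applies when A and B come from masks over the same array shape.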
5 Solutions
#1
11
Based on this solution to "Find the row indexes of several values in a numpy array", here's a NumPy-based solution with a smaller memory footprint, which could be beneficial when working with large arrays -
dims = np.maximum(B.max(0),A.max(0))+1
out = A[~np.in1d(np.ravel_multi_index(A.T,dims),np.ravel_multi_index(B.T,dims))]
Sample run -
In [38]: A
Out[38]:
array([[1, 1, 1],
       [1, 1, 2],
       [1, 1, 3],
       [1, 1, 4]])
In [39]: B
Out[39]:
array([[0, 0, 0],
       [1, 0, 2],
       [1, 0, 3],
       [1, 0, 4],
       [1, 1, 0],
       [1, 1, 1],
       [1, 1, 4]])
In [40]: out
Out[40]:
array([[1, 1, 2],
       [1, 1, 3]])
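To make the trick above easier to follow, here is a small sketch of what np.ravel_multi_index does to the sample rows: each row is treated as a coordinate in a grid of shape dims and collapsed to a single flat index, so the row comparison becomes a 1D membership test. The names A_codes and B_codes are introduced only for illustration, and np.isin is used as the newer spelling of np.in1d. This assumes the rows hold non-negative integers.

```python
import numpy as np

A = np.asarray([[1, 1, 1], [1, 1, 2], [1, 1, 3], [1, 1, 4]])
B = np.asarray([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4],
                [1, 1, 0], [1, 1, 1], [1, 1, 4]])

# Per-column extent of a grid large enough to hold every row as a coordinate.
dims = np.maximum(B.max(0), A.max(0)) + 1      # [2, 2, 5] for the sample data

# Each row (i, j, k) becomes the flat index (i * dims[1] + j) * dims[2] + k.
A_codes = np.ravel_multi_index(A.T, dims)      # [16, 17, 18, 19]
B_codes = np.ravel_multi_index(B.T, dims)      # [0, 12, 13, 14, 15, 16, 19]

# Keep the rows of A whose code does not appear among B's codes.
out = A[~np.isin(A_codes, B_codes)]
```

Because distinct coordinates inside the grid always get distinct flat indices, this mapping cannot produce false matches.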
Runtime test on large arrays -
In [107]: def in1d_approach(A,B):
     ...:     dims = np.maximum(B.max(0),A.max(0))+1
     ...:     return A[~np.in1d(np.ravel_multi_index(A.T,dims),\
     ...:                       np.ravel_multi_index(B.T,dims))]
     ...:
In [108]: # Setup arrays with B as large array and A contains some of B's rows
     ...: B = np.random.randint(0,9,(1000,3))
     ...: A = np.random.randint(0,9,(100,3))
     ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
     ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
     ...: A[A_idx] = B[B_idx]
     ...:
Timings with broadcasting-based solutions -
In [109]: %timeit A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
100 loops, best of 3: 4.64 ms per loop # @Kasramvd's soln
In [110]: %timeit A[~((A[:,None,:] == B).all(-1)).any(1)]
100 loops, best of 3: 3.66 ms per loop
Timing with the smaller-memory-footprint solution -
In [111]: %timeit in1d_approach(A,B)
1000 loops, best of 3: 231 µs per loop
Further performance boost
in1d_approach reduces each row to a single number by treating the row as an indexing tuple. We can do the same a bit more efficiently by introducing matrix multiplication with np.dot, like so -
def in1d_dot_approach(A,B):
    cumdims = (np.maximum(A.max(),B.max())+1)**np.arange(B.shape[1])
    return A[~np.in1d(A.dot(cumdims),B.dot(cumdims))]
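The dot product above amounts to reading each row as the digits of a number written in base (max value + 1). The sketch below spells that out with a hypothetical helper, row_codes, which is not part of the answer; it also adds an overflow guard, since the powers can exceed int64 for wide rows or large values. It assumes non-negative integer rows.

```python
import numpy as np

def row_codes(A, B):
    """Encode each row as one integer in base (max value + 1).

    Hypothetical helper for illustration; assumes non-negative integer rows.
    """
    base = int(max(A.max(), B.max())) + 1
    ncols = A.shape[1]
    # Conservative guard, done with Python ints so the check itself cannot overflow.
    if base ** ncols > np.iinfo(np.int64).max:
        raise OverflowError("row codes would not fit in int64")
    weights = base ** np.arange(ncols, dtype=np.int64)   # [1, base, base**2, ...]
    return A.astype(np.int64).dot(weights), B.astype(np.int64).dot(weights)

A = np.asarray([[1, 1, 1], [1, 1, 2], [1, 1, 3], [1, 1, 4]])
B = np.asarray([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4],
                [1, 1, 0], [1, 1, 1], [1, 1, 4]])

A_codes, B_codes = row_codes(A, B)      # base is 5 here, weights are [1, 5, 25]
out = A[~np.isin(A_codes, B_codes)]     # np.isin is the newer spelling of np.in1d
```

Since every entry is smaller than the base, two different rows cannot share a code, which is what makes the 1D membership test safe.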
Let's test it against the previous approach on much larger arrays -
In [251]: # Setup arrays with B as large array and A contains some of B's rows
     ...: B = np.random.randint(0,9,(10000,3))
     ...: A = np.random.randint(0,9,(1000,3))
     ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
     ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
     ...: A[A_idx] = B[B_idx]
     ...:
In [252]: %timeit in1d_approach(A,B)
1000 loops, best of 3: 1.28 ms per loop
In [253]: %timeit in1d_dot_approach(A, B)
1000 loops, best of 3: 1.2 ms per loop
#2
9
Here is a Numpythonic approach with broadcasting:
In [83]: A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
Out[83]:
array([[1, 1, 2],
       [1, 1, 3]])
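The one-liner above packs several broadcasting steps together. As a rough sketch, the equality-based variant of the same idea can be unpacked into named intermediate arrays (the names pairwise_equal and row_matches are made up for this illustration), which also shows where the memory goes: the first intermediate has len(A) * len(B) * ncols entries.

```python
import numpy as np

A = np.asarray([[1, 1, 1], [1, 1, 2], [1, 1, 3], [1, 1, 4]])
B = np.asarray([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4],
                [1, 1, 0], [1, 1, 1], [1, 1, 4]])

# Pair every row of A with every row of B: shape (len(A), len(B), ncols).
pairwise_equal = A[:, None, :] == B[None, :, :]

# A pair of rows matches only if all of its columns are equal: shape (len(A), len(B)).
row_matches = pairwise_equal.all(axis=-1)

# Keep the rows of A that match no row of B.
out = A[~row_matches.any(axis=1)]
```

This comparison is exact for any dtype that supports ==, at the cost of the large temporary array.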
Here is a timeit comparison with the other answer:
In [90]: def cal_diff(A, B):
   ....:     A_rows = A.view([('', A.dtype)] * A.shape[1])
   ....:     B_rows = B.view([('', B.dtype)] * B.shape[1])
   ....:     return np.setdiff1d(A_rows, B_rows).view(A.dtype).reshape(-1, A.shape[1])
   ....:
In [93]: %timeit cal_diff(A, B)
10000 loops, best of 3: 54.1 µs per loop
In [94]: %timeit A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
100000 loops, best of 3: 9.41 µs per loop
# Even better with Divakar's suggestion
In [97]: %timeit A[~((A[:,None,:] == B).all(-1)).any(1)]
100000 loops, best of 3: 7.41 µs per loop
Well, if you are looking for a faster way, you should look for ways that reduce the number of comparisons. In this case (without considering the order) you can generate a single number from each row and compare those numbers instead, for example by summing the squares of the items. Note that such a number is not guaranteed to be unique: different rows can share the same sum of squares (for example [0, 3, 4] and [0, 0, 5] both give 25), so this shortcut can remove rows that are not actually in B.
Here is the benchmark with Divakar's in1d approach:
In [144]: def in1d_approach(A,B):
   .....:     dims = np.maximum(B.max(0),A.max(0))+1
   .....:     return A[~np.in1d(np.ravel_multi_index(A.T,dims),\
   .....:                       np.ravel_multi_index(B.T,dims))]
   .....:
In [146]: %timeit in1d_approach(A, B)
10000 loops, best of 3: 23.8 µs per loop
In [145]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
10000 loops, best of 3: 20.2 µs per loop
You can use np.diff to get an order-independent result:
In [194]: B=np.array([[0, 0, 0,], [1, 0, 2,], [1, 0, 3,], [1, 0, 4,], [1, 1, 0,], [1, 1, 1,], [1, 1, 4,], [4, 1, 1]])
In [195]: A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
Out[195]:
array([[1, 1, 2],
       [1, 1, 3]])
In [196]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
10000 loops, best of 3: 30.7 µs per loop
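Both keys used in this answer (the sum of squares and the double np.diff of squares) compress a row into one number without guaranteeing uniqueness. The sketch below, using made-up arrays and hypothetical helper names, shows each key dropping a row of A that is not in B, next to an exact broadcasting comparison for reference.

```python
import numpy as np

def sum_sq_key(X):
    # Sum of squared entries per row.
    return np.power(X, 2).sum(1)

def diff_sq_key(X):
    # For 3 columns this is a**2 - 2*b**2 + c**2 per row, flattened to 1D.
    return np.diff(np.diff(np.power(X, 2))).ravel()

# Made-up data: no row of A appears in B.
A = np.array([[0, 3, 4], [2, 2, 2]])
B = np.array([[0, 0, 5], [0, 0, 0]])

# [0, 3, 4] and [0, 0, 5] share the sum of squares 25, so the first row is lost.
kept_sum = A[~np.isin(sum_sq_key(A), sum_sq_key(B))]

# [2, 2, 2] and [0, 0, 0] both give 0 under the diff key, so the second row is lost.
kept_diff = A[~np.isin(diff_sq_key(A), diff_sq_key(B))]

# Exact row comparison keeps both rows, as expected.
exact = A[~(A[:, None, :] == B).all(-1).any(1)]
```

These keys are fast, but they are only safe when the data is known not to collide; the ravel_multi_index encoding does not have this problem for non-negative integers.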
Benchmark with Divakar's setup:
In [198]: B = np.random.randint(0,9,(1000,3))
In [199]: A = np.random.randint(0,9,(100,3))
In [200]: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
In [201]: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
In [202]: A[A_idx] = B[B_idx]
In [203]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
10000 loops, best of 3: 137 µs per loop
In [204]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
10000 loops, best of 3: 112 µs per loop
In [205]: %timeit in1d_approach(A, B)
10000 loops, best of 3: 115 µs per loop
Timing with larger arrays (Divakar's solution is slightly faster):
In [231]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
1000 loops, best of 3: 1.01 ms per loop
In [232]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
1000 loops, best of 3: 880 µs per loop
In [233]: %timeit in1d_approach(A, B)
1000 loops, best of 3: 807 µs per loop
#3
7
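There is an easy solution with a list comprehension (this assumes A and B are plain lists of lists; with NumPy arrays the in operator compares element-wise and gives the wrong answer, so convert with .tolist() first),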
A = [i for i in A if i not in B]
Result
[[1, 1, 2], [1, 1, 3]]
The list comprehension does not remove the elements from the original list; it builds a new list and rebinds the name A to it.
If you want to remove the elements in place, use this method:
for i in B:
    if i in A:
        A.remove(i)
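For longer lists, both versions above scan B (or A) once per row. A possible variation, sketched here under the assumption that A and B are plain lists of lists (use .tolist() on NumPy arrays first), is to put B's rows into a set of tuples and to use slice assignment so the existing list object is modified in place. The names B_set and alias are made up for the example.

```python
A = [[1, 1, 1], [1, 1, 2], [1, 1, 3], [1, 1, 4]]
B = [[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4],
     [1, 1, 0], [1, 1, 1], [1, 1, 4]]

# Lists are not hashable, so rows are stored as tuples for constant-time lookup.
B_set = {tuple(row) for row in B}

alias = A  # a second reference, to show that the same list object is updated

# Slice assignment replaces the contents of the existing list
# instead of rebinding the name A to a new list.
A[:] = [row for row in A if tuple(row) not in B_set]
```

After this runs, alias and A still refer to the same list, which now holds only the rows not found in B.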
#4
5
If you want to do it the numpy way,
import numpy as np
A = np.array([[1, 1, 1,], [1, 1, 2], [1, 1, 3], [1, 1, 4]])
B = np.array([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4], [1, 1, 0], [1, 1, 1], [1, 1, 4]])
A_rows = A.view([('', A.dtype)] * A.shape[1])
B_rows = B.view([('', B.dtype)] * B.shape[1])
diff_array = np.setdiff1d(A_rows, B_rows).view(A.dtype).reshape(-1, A.shape[1])
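The view trick above reinterprets each row as one structured element, so np.setdiff1d can compare whole rows. A hedged sketch of the same idea as a reusable function follows; setdiff_rows is a hypothetical name, and the np.ascontiguousarray calls are an added assumption to make the view safe for non-contiguous inputs. Note that np.setdiff1d returns sorted, de-duplicated values, so the original row order and any repeated rows of A are not preserved.

```python
import numpy as np

def setdiff_rows(A, B):
    """Rows of A that are not rows of B (sorted, duplicates removed).

    Hypothetical helper for illustration.
    """
    A = np.ascontiguousarray(A)
    B = np.ascontiguousarray(B, dtype=A.dtype)
    # One structured element per row: ncols fields of the array's own dtype.
    row_dtype = [('', A.dtype)] * A.shape[1]
    A_rows = A.view(row_dtype).ravel()
    B_rows = B.view(row_dtype).ravel()
    return np.setdiff1d(A_rows, B_rows).view(A.dtype).reshape(-1, A.shape[1])

# Made-up input with an unsorted, repeated row to show the sorting and de-duplication.
A = np.array([[1, 1, 3], [1, 1, 2], [1, 1, 2], [1, 1, 1]])
B = np.array([[1, 1, 1]])
result = setdiff_rows(A, B)
```

If the original order or multiplicity of rows matters, one of the mask-based approaches is a better fit.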
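As @Rahul suggested, for an easy non-numpy solution (convert the arrays to lists first, because the in operator on NumPy arrays compares element-wise rather than row by row),
B_list = B.tolist()
diff_array = [i for i in A.tolist() if i not in B_list]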
#5
4
Another non-numpy solution (with A and B as plain lists of lists):
[i for i in A if i not in B]