I am trying to find a vectorized way to accomplish the following:
Say I have an array of x and y values. Note that the x values are not always ints and CAN be negative:
import numpy as np
x = np.array([-1,-1,-1,3,2,2,2,5,4,4], dtype=float)
y = np.array([0,1,0,1,0,1,0,1,0,1])
I want to group the y array by the sorted, unique values of the x array and summarize the counts for each y class. So the example above would look like this:
array([[ 2., 1.],
[ 2., 1.],
[ 0., 1.],
[ 1., 1.],
[ 0., 1.]])
Where the first column represents the count of '0' values for each unique value of x and the second column represents the count of '1' values for each unique value of x.
My current implementation looks like this:
x_sorted, y_sorted = x[x.argsort()], y[x.argsort()]
def collapse(x_sorted, y_sorted):
    # Index of the first occurrence of each unique x in the sorted array
    uniq_ids = np.unique(x_sorted, return_index=True)[1]
    y_collapsed = np.zeros((len(uniq_ids), 2))
    x_collapsed = x_sorted[uniq_ids]
    # Split y at the group boundaries and count each class per group
    for idx, y in enumerate(np.split(y_sorted, uniq_ids[1:])):
        y_collapsed[idx,0] = (y == 0).sum()
        y_collapsed[idx,1] = (y == 1).sum()
    return (x_collapsed, y_collapsed)
collapse(x_sorted, y_sorted)
(array([-1, 2, 3, 4, 5]),
array([[ 2., 1.],
[ 2., 1.],
[ 0., 1.],
[ 1., 1.],
[ 0., 1.]]))
This doesn't seem very much in the spirit of numpy, however, and I'm hoping some vectorized method exists for this kind of operation. I am trying to do this without resorting to pandas. I know that library has a very convenient groupby operation.
5 solutions
#1
4
Since x is float, I would do this:
In [136]:
np.array([(x[y==0]==np.unique(x)[..., np.newaxis]).sum(axis=1),
(x[y==1]==np.unique(x)[..., np.newaxis]).sum(axis=1)]).T
Out[136]:
array([[2, 1],
[2, 1],
[0, 1],
[1, 1],
[0, 1]])
Speed:
In [152]:
%%timeit
ux=np.unique(x)[..., np.newaxis]
np.array([(x[y==0]==ux).sum(axis=1),
(x[y==1]==ux).sum(axis=1)]).T
10000 loops, best of 3: 92.7 µs per loop
Solution by @seikichi:
In [151]:
%%timeit
>>> x = np.array([1.1, 1.1, 1.1, 3.3, 2.2, 2.2, 2.2, 5.5, 4.4, 4.4])
>>> y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
>>> r = np.r_[np.unique(x), np.inf]
>>> np.concatenate([[np.histogram(x[y == v], r)[0]] for v in sorted(set(y))]).T
1000 loops, best of 3: 388 µs per loop
For more general cases when y is not just {0,1}, as @askewchan pointed out:
In [155]:
%%timeit
ux=np.unique(x)[..., np.newaxis]
uy=np.unique(y)
np.asanyarray([(x[y==v]==ux).sum(axis=1) for v in uy]).T
10000 loops, best of 3: 116 µs per loop
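As a rough sketch, the general case can be wrapped in a reusable function. The name group_counts is made up for illustration, and the matrix product used here is a variation on the list comprehension above, not the code that was timed. Note that it builds a boolean array of shape (number of unique x, len(x)), so memory use grows quickly for large inputs.

```python
import numpy as np

def group_counts(x, y):
    """Hypothetical helper: counts[i, j] is the number of positions k
    where x[k] == ux[i] and y[k] == uy[j]."""
    x = np.asarray(x)
    y = np.asarray(y)
    ux = np.unique(x)
    uy = np.unique(y)
    # Compare every element with every unique value via broadcasting.
    x_match = x == ux[:, np.newaxis]   # shape (len(ux), len(x))
    y_match = y == uy[:, np.newaxis]   # shape (len(uy), len(y))
    # Summing the products over k gives the joint counts.
    counts = x_match.astype(int) @ y_match.astype(int).T
    return ux, uy, counts

x = np.array([-1, -1, -1, 3, 2, 2, 2, 5, 4, 4], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
ux, uy, counts = group_counts(x, y)
print(ux)      # sorted unique x values (row labels)
print(counts)  # one row per unique x, one column per unique y
```

For the arrays from the question this should reproduce the 5x2 table shown at the top of the thread.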
To explain the broadcasting further, see this example:
In [5]:
np.unique(a)
Out[5]:
array([ 0. , 0.2, 0.4, 0.5, 0.6, 1.1, 1.5, 1.6, 1.7, 2. ])
In [8]:
np.unique(a)[...,np.newaxis] #what [..., np.newaxis] will do:
Out[8]:
array([[ 0. ],
[ 0.2],
[ 0.4],
[ 0.5],
[ 0.6],
[ 1.1],
[ 1.5],
[ 1.6],
[ 1.7],
[ 2. ]])
In [10]:
(a==np.unique(a)[...,np.newaxis]).astype('int') #then we can broadcast (converted to int for readability)
Out[10]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0]])
In [11]:
(a==np.unique(a)[...,np.newaxis]).sum(axis=1) #getting the count of each unique value becomes summing along the 2nd axis
Out[11]:
array([1, 3, 1, 1, 2, 1, 1, 1, 1, 3])
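The array a is not defined in the session above, so here is a small self-contained version of the same three steps. The sample values are made up purely for illustration; only the shapes and the counting logic matter.

```python
import numpy as np

# Hypothetical sample data (the original `a` is not shown in the answer).
a = np.array([0.5, 0.2, 0.5, 1.0, 0.2, 0.5])

ua = np.unique(a)            # sorted unique values, shape (3,)
col = ua[..., np.newaxis]    # column vector, shape (3, 1)
match = a == col             # (6,) against (3, 1) broadcasts to (3, 6)
counts = match.sum(axis=1)   # number of True entries in each row

print(col.shape, match.shape)  # (3, 1) (3, 6)
print(counts)                  # [2 3 1]
```

Each row of match marks the positions where a equals one unique value, so the row sums are the per-value counts (the same numbers np.unique(a, return_counts=True) would give).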
#2
4
How about the following code? (use numpy.bincount and numpy.concatenate)
>>> import numpy as np
>>> x = np.array([1,1,1,3,2,2,2,5,4,4])
>>> y = np.array([0,1,0,1,0,1,0,1,0,1])
>>> xmax = x.max()
>>> np.concatenate([[np.bincount(x[y == v], minlength=xmax + 1)] for v in sorted(set(y))], axis=0)[:, 1:].T
array([[2, 1],
[2, 1],
[0, 1],
[1, 1],
[0, 1]])
UPDATE: Thanks, @askewchan! For float x:
>>> import numpy as np
>>> x = np.array([1.1, 1.1, 1.1, 3.3, 2.2, 2.2, 2.2, 5.5, 4.4, 4.4])
>>> y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
>>> r = np.r_[np.unique(x), np.inf]
>>> np.array([np.histogram(x[y == v], r)[0] for v in sorted(set(y))]).T
array([[2, 1],
[2, 1],
[0, 1],
[1, 1],
[0, 1]])
#3
3
np.unique and np.bincount are your friends here. The following should work for any type of input, not necessarily small consecutive integers:
>>> x = np.array([1, 1, 1, 3, 2, 2, 2, 5, 4, 4])
>>> y = np.array([0, 1, 2, 2, 0, 1, 0, 2, 2, 1])
>>>
>>> x_unq, x_idx = np.unique(x, return_inverse=True)
>>> y_unq, y_idx = np.unique(y, return_inverse=True)
>>>
>>> np.column_stack([np.bincount(x_idx, y_idx == j) for j in range(len(y_unq))])
array([[ 1., 1., 1.],
[ 2., 1., 0.],
[ 0., 0., 1.],
[ 0., 1., 1.],
[ 0., 0., 1.]])
You can extract the row and column labels also:
>>> x_unq
array([1, 2, 3, 4, 5])
>>> y_unq
array([0, 1, 2])
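A possible variation on the same idea, sketched here rather than taken from the answer: combine the two inverse index arrays into one flat index so that a single np.bincount call builds the whole table with no Python loop over the y classes. The names flat and table are illustrative.

```python
import numpy as np

x = np.array([1, 1, 1, 3, 2, 2, 2, 5, 4, 4])
y = np.array([0, 1, 2, 2, 0, 1, 0, 2, 2, 1])

x_unq, x_idx = np.unique(x, return_inverse=True)
y_unq, y_idx = np.unique(y, return_inverse=True)
n_x, n_y = len(x_unq), len(y_unq)

# Encode each (x group, y class) pair as one integer in [0, n_x * n_y).
flat = x_idx * n_y + y_idx
# minlength guarantees a full table even if the last pairs never occur.
table = np.bincount(flat, minlength=n_x * n_y).reshape(n_x, n_y)
print(table)
```

The result has integer dtype instead of float, and since only the inverse indices are used it should also work for float or negative x.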
#4
2
I haven't tested this but I think it should work. Basically all I do is grab the values from y based on x being the value in question.
uniques = list(set(x))
uniques.sort()
lu = len(uniques)
res = np.zeros(lu * 2).reshape(lu, 2)
for i, v in enumerate(uniques):
    cur = y[x == v]            # y values belonging to this x
    s = cur.sum()              # number of 1s (assumes y is only 0/1)
    res[i, 0] = len(cur) - s   # number of 0s
    res[i, 1] = s
Another way is to use numpy MaskedArrays.
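The answer does not say how masked arrays would be used, so the following is only one possible reading: mask every y whose x differs from the current value, then count what is left. It still loops over the unique x values and assumes y contains only 0 and 1.

```python
import numpy as np

x = np.array([-1, -1, -1, 3, 2, 2, 2, 5, 4, 4], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

uniques = np.unique(x)
res = np.zeros((len(uniques), 2), dtype=int)
for i, v in enumerate(uniques):
    # Hide every y value whose x is not the current unique value.
    cur = np.ma.masked_array(y, mask=(x != v))
    ones = int(cur.sum())             # unmasked entries equal to 1
    res[i, 0] = cur.count() - ones    # remaining unmasked entries are 0
    res[i, 1] = ones
print(res)
```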
#5
2
Here is another solution:
y = y[np.argsort(x)]
b = np.bincount(x)
b = b[b!=0]
ans = np.array([[i.shape[0], i.sum()] for i in np.split(y, np.cumsum(b))[:-1]])
ans[:,0] -= ans[:,1]
print(ans)
#array([[2, 1],
# [2, 1],
# [0, 1],
# [1, 1],
# [0, 1]], dtype=int64)
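Note that np.bincount only accepts non-negative integers, so the snippet above would not run on the float, partly negative x from the question. A minimal sketch of the same split-based idea that gets the group sizes from np.unique instead (still assuming y is only 0/1):

```python
import numpy as np

x = np.array([-1, -1, -1, 3, 2, 2, 2, 5, 4, 4], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Reorder y so that equal x values sit next to each other.
y_sorted = y[np.argsort(x, kind="stable")]
# Size of each group of equal x, in sorted order of x.
_, b = np.unique(x, return_counts=True)
# Split at the group boundaries (the last cumulative sum is the array end).
groups = np.split(y_sorted, np.cumsum(b)[:-1])
ans = np.array([[len(g), g.sum()] for g in groups])
ans[:, 0] -= ans[:, 1]   # group size minus number of 1s = number of 0s
print(ans)
```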
Timing:
@seikichi solution:
10000 loops, best of 3: 37.2 µs per loop
@acushner solution:
10000 loops, best of 3: 65.4 µs per loop
@SaulloCastro solution:
10000 loops, best of 3: 154 µs per loop