Indexing a NumPy array row by row [duplicate]

Date: 2022-07-02 21:42:50

This question already has an answer here:

Say I have a NumPy array:

>>> X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
>>> X
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

and an array of indexes that I want to select for each row:

>>> ixs = np.array([[1, 3], [0, 1], [1, 2]])
>>> ixs
array([[1, 3],
       [0, 1],
       [1, 2]])

How do I index the array X so that for every row in X I select the two indices specified in ixs?

So for this case, I want to select elements 1 and 3 from the first row, elements 0 and 1 from the second row, and so on. The output should be:

array([[2, 4],
       [5, 6],
       [10, 11]])

A slow solution would be something like this:

output = np.array([row[ix] for row, ix in zip(X, ixs)])

However, this can get slow for extremely long arrays. Is there a faster way to do this in NumPy without a Python-level loop?

EDIT: Some very approximate speed tests on a 2.5K * 1M array with 2K wide ixs (10GB):

np.array([row[ix] for row, ix in zip(X, ixs)]) 0.16s

X[np.arange(len(ixs)), ixs.T].T 0.175s

X.take(idx+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None]) 33s

np.fromiter((X[i, j] for i, row in enumerate(ixs) for j in row), dtype=X.dtype).reshape(ixs.shape) 2.4s

4 solutions

#1


6  

You can use this:

X[np.arange(len(ixs)), ixs.T].T

See the NumPy reference on advanced (integer array) indexing.
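
To see why the one-liner works, it can help to look at the shapes involved. The sketch below uses the arrays from the question; the intermediate names `rows`, `cols` and `picked` are introduced here only for illustration.

```python
import numpy as np

X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
ixs = np.array([[1, 3], [0, 1], [1, 2]])

rows = np.arange(len(ixs))   # shape (3,): one row number per row of X
cols = ixs.T                 # shape (2, 3): column k holds the picks for row k

# The two index arrays broadcast to shape (2, 3), so the result is (2, 3);
# element [j, k] is X[rows[k], cols[j, k]], i.e. X[k, ixs[k, j]].
picked = X[rows, cols]

# Transposing restores the (3, 2) layout of ixs.
result = picked.T
print(result)
```

The final transpose is only needed because the row index has shape (n,) and therefore lines up with the last axis of `ixs.T`.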

#2


3  

I believe you can use .take thusly:

In [185]: X
Out[185]:
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [186]: idx
Out[186]:
array([[1, 3],
       [0, 1],
       [1, 2]])

In [187]: X.take(idx + (np.arange(X.shape[0]) * X.shape[1]).reshape(-1, 1))
Out[187]:
array([[ 2,  4],
       [ 5,  6],
       [10, 11]])

If your array dimensions are massive, it might be faster, albeit uglier, to do:

idx+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None]
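
The offset trick relies on `take` indexing into the flattened array when no axis is given: element (i, j) of an array with `ncols` columns sits at flat position i * ncols + j. A minimal sketch with the arrays above follows; `row_offsets` and `flat_idx` are names chosen here for illustration.

```python
import numpy as np

X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
idx = np.array([[1, 3], [0, 1], [1, 2]])

nrows, ncols = X.shape
# Flat position of the first element of each row, as a column vector.
row_offsets = np.arange(0, nrows * ncols, ncols)[:, None]
# Broadcasting adds each row's offset to that row's column indices.
flat_idx = idx + row_offsets
# take() without an axis argument indexes the flattened array.
result = X.take(flat_idx)
print(flat_idx)
print(result)
```

Here `flat_idx` comes out as [[1, 3], [4, 5], [9, 10]], and `result` has the same shape as `idx`.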

Just for fun, see how the following performs:

np.fromiter((X[i, j] for i, row in enumerate(ixs) for j in row), dtype=X.dtype, count=ixs.size).reshape(ixs.shape)

Edit to add timings

In [15]: X = np.arange(1000*10000, dtype=np.int32).reshape(1000,-1)

In [16]: ixs = np.random.randint(0, 10000, (1000, 2))

In [17]: ixs.sort(axis=1)

In [18]: ixs
Out[18]:
array([[2738, 3511],
       [3600, 7414],
       [7426, 9851],
       ...,
       [1654, 8252],
       [2194, 8200],
       [5497, 8900]])

In [19]: %timeit  np.array([row[ix] for row, ix in zip(X, ixs)])
928 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [20]: %timeit X[np.arange(len(ixs)), ixs.T].T
23.6 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [21]: %timeit X.take(idx+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None])
20.6 µs ± 530 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [22]: %timeit np.fromiter((X[i, j] for i, row in enumerate(ixs) for j in row), dtype=X.dtype, count=ixs.size).reshape(ixs.shape)
1.42 ms ± 9.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@mxbi I've added some timings, and my results aren't really consistent with yours; you should check them out.

Here's a larger array:

In [33]: X = np.arange(10000*100000, dtype=np.int32).reshape(10000,-1)

In [34]: ixs = np.random.randint(0, 100000, (10000, 2))

In [35]: ixs.sort(axis=1)

In [36]: X.shape
Out[36]: (10000, 100000)

In [37]: ixs.shape
Out[37]: (10000, 2)

With some results:

In [42]: %timeit  np.array([row[ix] for row, ix in zip(X, ixs)])
11.4 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [43]: %timeit X[np.arange(len(ixs)), ixs.T].T
596 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [44]: %timeit X.take(ixs+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None])
540 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now we use 500 column indices per row instead of two, and the list comprehension starts to beat the fancy-indexing version (though .take is still slightly faster here):

In [45]: ixs = np.random.randint(0, 100000, (10000, 500))

In [46]: ixs.sort(axis=1)

In [47]: %timeit  np.array([row[ix] for row, ix in zip(X, ixs)])
93 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [48]: %timeit X[np.arange(len(ixs)), ixs.T].T
133 ms ± 638 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [49]: %timeit X.take(ixs+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None])
87.5 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#3


1  

The usual suggestion for indexing items from rows is:

X[np.arange(X.shape[0])[:,None], ixs]

That is, make a row index of shape (n,1) (a column vector), which will broadcast against the (n,m) shape of ixs to give an (n,m) result.

This is basically the same as:

X[np.arange(len(ixs)), ixs.T].T

which broadcasts a (n,) index against a (m,n), and transposes.

Timings are essentially the same:

In [299]: X = np.ones((1000,2000))
In [300]: ixs = np.random.randint(0,2000,(1000,200))
In [301]: timeit X[np.arange(len(ixs)), ixs.T].T
6.58 ms ± 71.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [302]: timeit X[np.arange(X.shape[0])[:,None], ixs]
6.57 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

and for comparison:

In [307]: timeit np.array([row[ix] for row, ix in zip(X, ixs)])
6.63 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I'm a little surprised that this list comprehension does so well. I wonder how the relative advantages compare when the dimensions change, particularly in the relative shape of X and ixs (long, wide etc).
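
One way to explore that question is a small harness that first checks the approaches agree and then times them for a given shape. This is only a sketch: the function name `compare`, the chosen shapes and the use of `np.take_along_axis` (available in NumPy 1.15 and later, and not part of the answers here) are assumptions added for illustration, and the numbers will vary by machine.

```python
import timeit
import numpy as np

def compare(n_rows, n_cols, n_picks, repeat=5):
    """Verify the approaches agree, then return best-of-`repeat` times in seconds."""
    rng = np.random.default_rng(0)
    X = rng.random((n_rows, n_cols))
    ixs = rng.integers(0, n_cols, size=(n_rows, n_picks))

    candidates = {
        "list comprehension": lambda: np.array([row[ix] for row, ix in zip(X, ixs)]),
        "transposed index": lambda: X[np.arange(len(ixs)), ixs.T].T,
        "column-vector index": lambda: X[np.arange(X.shape[0])[:, None], ixs],
        # Assumption: an extra candidate that does not appear in the answers.
        "take_along_axis": lambda: np.take_along_axis(X, ixs, axis=1),
    }

    reference = candidates["list comprehension"]()
    timings = {}
    for name, func in candidates.items():
        # Every approach must produce exactly the same selection.
        assert np.array_equal(func(), reference), name
        timings[name] = min(timeit.repeat(func, number=1, repeat=repeat))
    return timings

# Narrow versus wide ixs on the same X shape.
for shape in [(1000, 2000, 2), (1000, 2000, 200)]:
    print(shape, compare(*shape))
```

Varying `n_rows` and `n_picks` independently should show where the per-row Python overhead of the list comprehension stops mattering.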

The first solution is the style of indexing produced by ix_:

In [303]: np.ix_(np.arange(3), np.arange(2))
Out[303]: 
(array([[0],
        [1],
        [2]]), array([[0, 1]]))
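
The difference between the two is worth spelling out. In the sketch below (variable names are illustrative), `np.ix_` pairs a (3, 1) row index with a (1, 2) column index and so takes the same columns from every row, whereas per-row selection keeps the (3, 1) row index but uses the full (3, 2) ixs as the column index.

```python
import numpy as np

X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
ixs = np.array([[1, 3], [0, 1], [1, 2]])

# np.ix_ builds an open mesh: a (3, 1) row index and a (1, 2) column index.
rows, cols = np.ix_(np.arange(3), np.arange(2))
block = X[rows, cols]     # columns 0 and 1 taken from every row

# Per-row selection reuses the (3, 1) row index with the (3, 2) ixs.
per_row = X[rows, ixs]

print(block)
print(per_row)
```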

#4


0  

This should work

[X[i][y] for i, y in enumerate(ixs)]

EDIT: I just noticed you wanted no loop solution.
