I have a two-dimensional array X of size (500, 10) and a one-dimensional index array Y of size 500. Each entry of Y is the index of the column that holds the correct value in the corresponding row of X. For example, if y(0) is 2, then column 2 of row 0 of X is correct; similarly, y(3) = 4 means that row 3, column 4 of X holds the correct value.
I want to get all the correct values from X using the index array Y without any loops, i.e. using vectorization. In this case the output should have shape (500, 1). But when I do X[:,y], the output has shape (500, 500). Can someone help me index X correctly using Y, please?
Thank you all for the help.
3 Solutions
#1
5
Another option is multidimensional list-of-locations indexing:
import numpy as np
ncol = 10 # 10 in your case
nrow = 500 # 500 in your case
# just creating some test data:
x = np.arange(ncol*nrow).reshape(nrow,ncol)
y = (ncol * np.random.random_sample((nrow, 1))).astype(int)
print(x)
print(y)
print(x[np.arange(nrow),y.T].T)
The syntax is explained here. You basically need an array of indices for each dimension. For the first dimension this is simply [0, ..., 499] in your case, and for the second dimension it is your y array. We need to transpose it (.T) because y has shape (500, 1) and must broadcast against the row indices; the transposed shape (1, 500) does. The second transposition is not strictly needed, but it gives you the (500, 1) shape you want.
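To make the shapes in that explanation concrete, here is a minimal sketch on a small array. The sizes and the column picks in y are made up for illustration; only the indexing expression comes from the answer.

```python
import numpy as np

nrow, ncol = 6, 4                               # small stand-ins for 500 and 10
x = np.arange(nrow * ncol).reshape(nrow, ncol)
y = np.array([[0], [2], [2], [0], [1], [3]])    # shape (nrow, 1), made-up column picks

rows = np.arange(nrow)                          # shape (nrow,)
picked = x[rows, y.T]                           # index arrays broadcast to (1, nrow)
result = picked.T                               # shape (nrow, 1)

print(picked.shape, result.shape)
print(result.ravel())
```

Each output element is x[i, y[i]], so the row index array and the column index array are paired element by element rather than combined as a grid.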
EDIT:
The question of performance came up, so I profiled the three methods mentioned so far. You'll need line_profiler to run the following with
kernprof -l -v tmp.py
where tmp.py is:
import numpy as np
# `profile` is injected by kernprof at run time, so it needs no import
@profile
def calc(x,y):
    z = np.arange(nrow)
    a = x[z,y.T].T # mine, with the suggested speed up
    b = x[:,y].diagonal().T # Christoph Terasa
    c = np.array([i[j] for i, j in zip(x, y)]) # tobias_k

    return (a,b,c)

ncol = 5 # 10 in your case
nrow = 10 # 500 in your case
x = np.arange(ncol*nrow).reshape(nrow,ncol)
y = (ncol * np.random.random_sample((nrow, 1))).astype(int)
a, b, c = calc(x,y)
# both comparisons should print all True
print(a==b)
print(b==c)
The output for my Python 2.7.6 (the 501 hits on line 8 suggest this run used nrow = 500):
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def calc(x,y):
     5         1            4      4.0      0.1      z = np.arange(nrow)
     6         1           35     35.0      0.8      a = x[z,y.T].T
     7         1         3409   3409.0     76.7      b = x[:,y].diagonal().T
     8       501          995      2.0     22.4      c = np.array([i[j] for i, j in zip(x, y)])
     9
    10         1            1      1.0      0.0      return (a,b,c)
Time and % Time are the relevant columns. I don't know how to profile memory consumption; someone else would have to do that. For now it looks like my solution is the fastest for the requested dimensions.
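If line_profiler is not available, a rough comparison can also be sketched with the standard-library timeit module. This is not the measurement from the answer: the function names, the seed and the repeat count are arbitrary choices for illustration, and absolute timings will vary by machine and NumPy version.

```python
import timeit
import numpy as np

nrow, ncol = 500, 10
rng = np.random.default_rng(0)          # seeded only so the run is repeatable
x = np.arange(nrow * ncol).reshape(nrow, ncol)
y = rng.integers(0, ncol, size=(nrow, 1))

def by_rows(x, y):
    # pair each row index with its column index
    return x[np.arange(x.shape[0]), y.T].T

def by_diagonal(x, y):
    # build the full (nrow, nrow, 1) selection, then keep its diagonal
    return x[:, y].diagonal().T

def by_loop(x, y):
    # plain Python iteration over the rows
    return np.array([row[col] for row, col in zip(x, y)])

# All three should agree before their timings are compared.
reference = by_rows(x, y)
assert np.array_equal(reference, by_diagonal(x, y))
assert np.array_equal(reference, by_loop(x, y))

for fn in (by_rows, by_diagonal, by_loop):
    seconds = timeit.timeit(lambda: fn(x, y), number=100)
    print(f"{fn.__name__}: {seconds / 100 * 1e6:.1f} us per call")
```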
#2
4
While not really intuitive from a syntactic perspective,
X[:,Y].diagonal()[0]
will give you the values you're looking for. The fancy indexing selects, from each row of X, all the values at the column indices in Y, and diagonal selects only those at positions where i == j. The trailing [0] just flattens the 2D array that results when Y has shape (500, 1); if Y is one-dimensional, diagonal() already returns a 1D array and the [0] should be dropped.
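The intermediate shapes are easier to see on a small example. The following sketch uses made-up sizes and index values; it also shows why this approach costs more, since the full nrow by nrow selection is built before the diagonal is taken.

```python
import numpy as np

nrow, ncol = 6, 4
X = np.arange(nrow * ncol).reshape(nrow, ncol)
Y = np.array([0, 2, 2, 0, 1, 0])        # 1D index array, one column per row

full = X[:, Y]                          # shape (nrow, nrow): every row paired with every index
diag = full.diagonal()                  # keeps only full[i, i] == X[i, Y[i]]
print(full.shape, diag)

Y_col = Y[:, None]                      # shape (nrow, 1), as in the profiled code
full_col = X[:, Y_col]                  # shape (nrow, nrow, 1)
diag_col = full_col.diagonal()          # shape (1, nrow): the diagonal axis is appended last
print(diag_col.shape, diag_col[0])
```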
#3
4
You need a helper vector R to index the rows:
In [50]: X = np.arange(24).reshape((6,4))
In [51]: Y = np.random.randint(0,4,6)
In [52]: R = np.arange(6)
In [53]: Y
Out[53]: array([0, 2, 2, 0, 1, 0])
In [54]: X[R,Y]
Out[54]: array([ 0, 6, 10, 12, 17, 20])
For your use case:
X_y = X[np.arange(500), Y]
Edit
I forgot to mention: if you want a 2D result, you can obtain it using a dummy index (None, i.e. np.newaxis):
X_y_2D = X[np.arange(500), Y, None]
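As a quick check of the resulting shape, the sketch below compares the dummy index with two equivalent spellings on the small array from the session above. np.take_along_axis is an alternative I am swapping in here, not something the answer uses, and it assumes NumPy 1.15 or later.

```python
import numpy as np

X = np.arange(24).reshape((6, 4))
Y = np.array([0, 2, 2, 0, 1, 0])
R = np.arange(X.shape[0])

a = X[R, Y, None]                               # dummy index from the answer
b = X[R, Y][:, None]                            # add the axis after indexing
c = np.take_along_axis(X, Y[:, None], axis=1)   # swapped-in alternative, not from the answer

print(a.shape, b.shape, c.shape)
print(np.array_equal(a, b) and np.array_equal(a, c))
```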