I am trying to translate every element of a numpy.array according to a given key. For example:
a = np.array([[1, 2, 3],
              [3, 2, 4]])
my_dict = {1: 23, 2: 34, 3: 36, 4: 45}
I want to get:
array([[ 23.,  34.,  36.],
       [ 36.,  34.,  45.]])
I can see how to do it with a loop:
def loop_translate(a, my_dict):
    new_a = np.empty(a.shape)
    for i, row in enumerate(a):
        new_a[i, :] = list(map(my_dict.get, row))  # list() needed on Python 3
    return new_a
Is there a more efficient and/or pure numpy way?
Edit:
I timed it, and the np.vectorize method proposed by DSM is considerably faster for larger arrays:
In [13]: def loop_translate(a, my_dict):
   ....:     new_a = np.empty(a.shape)
   ....:     for i, row in enumerate(a):
   ....:         new_a[i, :] = map(my_dict.get, row)
   ....:     return new_a
   ....:

In [14]: def vec_translate(a, my_dict):
   ....:     return np.vectorize(my_dict.__getitem__)(a)
   ....:
In [15]: a = np.random.randint(1,5, (4,5))
In [16]: a
Out[16]:
array([[2, 4, 3, 1, 1],
       [2, 4, 3, 2, 4],
       [4, 2, 1, 3, 1],
       [2, 4, 3, 4, 1]])
In [17]: %timeit loop_translate(a, my_dict)
10000 loops, best of 3: 77.9 us per loop
In [18]: %timeit vec_translate(a, my_dict)
10000 loops, best of 3: 70.5 us per loop
In [19]: a = np.random.randint(1, 5, (500,500))
In [20]: %timeit loop_translate(a, my_dict)
1 loops, best of 3: 298 ms per loop
In [21]: %timeit vec_translate(a, my_dict)
10 loops, best of 3: 37.6 ms per loop
6 Answers
#1
42
I don't know about efficient, but you could use np.vectorize on the .get method of dictionaries:
>>> a = np.array([[1, 2, 3],
...               [3, 2, 4]])
>>> my_dict = {1: 23, 2: 34, 3: 36, 4: 45}
>>> np.vectorize(my_dict.get)(a)
array([[23, 34, 36],
       [36, 34, 45]])
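One caveat worth noting (my addition, not part of the original answer): dict.get returns None for keys missing from the mapping, which can make np.vectorize raise or produce an object array. If missing keys are possible, a small wrapper supplying a default is a safe sketch (the -1 fallback here is an arbitrary choice):

```python
import numpy as np

a = np.array([[1, 2, 3],
              [3, 2, 9]])  # 9 has no entry in the mapping
my_dict = {1: 23, 2: 34, 3: 36, 4: 45}

# wrap dict.get so every element gets a well-defined integer result;
# 1, 2, 3 map via the dict, while 9 falls back to -1
translate = np.vectorize(lambda x: my_dict.get(x, -1))
print(translate(a))
```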
#2
5
Here's another approach, using numpy.unique:
>>> a = np.array([[1, 2, 3], [3, 2, 1]])
>>> a
array([[1, 2, 3],
       [3, 2, 1]])
>>> d = {1: 11, 2: 22, 3: 33}
>>> u, inv = np.unique(a, return_inverse=True)
>>> np.array([d[x] for x in u])[inv].reshape(a.shape)
array([[11, 22, 33],
       [33, 22, 11]])
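To unpack why this works (an illustrative aside, not from the answer itself): np.unique with return_inverse=True gives the sorted unique values u plus indices inv that reconstruct the flattened array from u, so translating just the handful of unique values and indexing with inv translates everything:

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 2, 1]])
u, inv = np.unique(a, return_inverse=True)

# u holds the sorted unique values; inv maps every element of a
# back to its position in u, so u[inv] rebuilds the original array
assert (u[inv].reshape(a.shape) == a).all()

# translating u (3 dict lookups) instead of a (6 lookups) gives the same result
d = {1: 11, 2: 22, 3: 33}
translated = np.array([d[x] for x in u])[inv].reshape(a.shape)
print(translated)
```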
#3
5
I think it'd be better to iterate over the dictionary, and set values in all the rows and columns "at once":
>>> a = np.array([[1, 2, 3], [3, 2, 1]])
>>> a
array([[1, 2, 3],
       [3, 2, 1]])
>>> d = {1: 11, 2: 22, 3: 33}
>>> for k, v in d.items():  # d.iteritems() on Python 2
...     a[a == k] = v
...
>>> a
array([[11, 22, 33],
       [33, 22, 11]])
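One caution worth adding here (my note, not the answerer's): because this loop substitutes in place, a dictionary value that is itself a later key gets remapped again on a subsequent pass. A minimal sketch of the pitfall:

```python
import numpy as np

a = np.array([1, 2, 3])
d = {1: 2, 2: 3}  # the value 2 is also a key

for k, v in d.items():
    a[a == k] = v  # the pass for key 1 writes 2s, the pass for key 2 then remaps them

print(a.tolist())  # the original 1 ends up as 3, not 2
```

Writing results into a fresh output array instead of a avoids the problem when keys and values overlap.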
Edit:
While it may not be as sexy as DSM's (really good) answer using numpy.vectorize, my tests of all the proposed methods show that this approach (using @jamylak's suggestion) is actually a bit faster:
from __future__ import division
import numpy as np

a = np.random.randint(1, 5, (500, 500))
d = {1: 11, 2: 22, 3: 33, 4: 44}

def unique_translate(a, d):
    u, inv = np.unique(a, return_inverse=True)
    return np.array([d[x] for x in u])[inv].reshape(a.shape)

def vec_translate(a, d):
    return np.vectorize(d.__getitem__)(a)

def loop_translate(a, d):
    n = np.empty(a.shape)
    for k in d:
        n[a == k] = d[k]
    return n

def orig_translate(a, d):
    new_a = np.empty(a.shape)
    for i, row in enumerate(a):
        new_a[i, :] = list(map(d.get, row))  # list() needed on Python 3
    return new_a

if __name__ == '__main__':
    import timeit
    n_exec = 100
    print('orig')
    print(timeit.timeit("orig_translate(a,d)",
                        setup="from __main__ import np,a,d,orig_translate",
                        number=n_exec) / n_exec)
    print('unique')
    print(timeit.timeit("unique_translate(a,d)",
                        setup="from __main__ import np,a,d,unique_translate",
                        number=n_exec) / n_exec)
    print('vec')
    print(timeit.timeit("vec_translate(a,d)",
                        setup="from __main__ import np,a,d,vec_translate",
                        number=n_exec) / n_exec)
    print('loop')
    print(timeit.timeit("loop_translate(a,d)",
                        setup="from __main__ import np,a,d,loop_translate",
                        number=n_exec) / n_exec)
Outputs:
orig
0.222067718506
unique
0.0472617006302
vec
0.0357889199257
loop
0.0285375618935
#4
3
The numpy_indexed package (disclaimer: I am its author) provides an elegant and efficient vectorized solution to this type of problem:
import numpy_indexed as npi
remapped_a = npi.remap(a, list(my_dict.keys()), list(my_dict.values()))
The method implemented is similar to the approach mentioned by John Vinyard, but even more general. For instance, the items of the array do not need to be ints, but can be any type, even nd-subarrays themselves.
If you set the optional missing kwarg to 'raise' (the default is 'ignore'), performance will be slightly better, and you will get a KeyError if not all elements of a are present in the keys.
#5
1
Assuming your dict keys are positive integers without huge gaps (similar to a range from 0 to N), you would be better off converting your translation dict to an array such that my_array[i] = my_dict[i], and using numpy indexing to do the translation.
Code using this approach:
def direct_translate(a, d):
    # dict views must be converted to lists to serve as numpy indices
    src, values = list(d.keys()), list(d.values())
    d_array = np.arange(a.max() + 1)  # lookup table indexed by source value
    d_array[src] = values
    return d_array[a]
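A quick self-contained check of the lookup-table idea on the question's example (the function is repeated here so the snippet runs standalone; note the list() conversions, since dict views are not valid numpy indices):

```python
import numpy as np

def direct_translate(a, d):
    src, values = list(d.keys()), list(d.values())
    d_array = np.arange(a.max() + 1)  # lookup table indexed by source value
    d_array[src] = values
    return d_array[a]

a = np.array([[1, 2, 3],
              [3, 2, 4]])
my_dict = {1: 23, 2: 34, 3: 36, 4: 45}
print(direct_translate(a, my_dict))  # every element replaced in one fancy-indexing pass
```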
Testing with random arrays:
N = 10000
shape = (5000, 5000)
a = np.random.randint(N, size=shape)
my_dict = dict(zip(np.arange(N), np.random.randint(N, size=N)))
For these sizes I get around 140 ms for this approach. The np.vectorize method takes around 5.8 s and unique_translate around 8 s.
Possible generalizations:
- If you have negative values to translate, you could shift the values in a and the keys of the dictionary by a constant to map them back to positive integers:
def direct_translate(a, d):  # handles negative source keys
    min_a = a.min()
    src = np.array(list(d.keys())) - min_a
    values = list(d.values())
    d_array = np.arange(a.max() - min_a + 1)
    d_array[src] = values
    return d_array[a - min_a]
- If the source keys have huge gaps, the initial array creation would waste memory. I would resort to Cython to speed up that function.
#6
0
If you don't really have to use a dictionary as the substitution table, a simple solution would be (for your example):
a = numpy.array([[1, 2, 3],
                 [3, 2, 4]])              # the array from the question
my_dict = numpy.array([0, 23, 34, 36, 45])  # your dictionary as an array

def sub(myarr, table):
    return table[myarr]

values = sub(a, my_dict)
This will of course only work if the indices of my_dict cover all possible values of a; in other words, only for a containing unsigned integers.