快速替换numpy数组中的值

时间:2022-03-08 01:41:24

I have a very large numpy array (containing up to a million elements) like the one below:

我有一个非常大的numpy数组(包含多达一百万个元素),如下所示:

[ 0  1  6  5  1  2  7  6  2  3  8  7  3  4  9  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  9 14 13 10 11 16 15 11 12 17 16 12 13 18 17 13 14 19 18 15 16
 21 20 16 17 22 21 17 18 23 22 18 19 24 23]

and a small dictionary map for replacing some of the elements in the above array

和一个小字典映射,用于替换上面数组中的一些元素

{4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

I would like to replace some of the elements according to the map above. The numpy array is really large, and only a small subset of the elements (occurring as keys in the dictionary) will be replaced with the corresponding values. What is the fastest way to do this?

我想根据上面的地图替换一些元素。 numpy数组非常大,只有一小部分元素(作为字典中的键出现)将被相应的值替换。最快的方法是什么?

10 个解决方案

#1


30  

I believe there's even more efficient method, but for now, try

我相信有更有效的方法,但是现在,试试吧

from numpy import copy

newArray = copy(theArray)
for k, v in d.iteritems(): newArray[theArray==k] = v

Microbenchmark and test for correctness:

Microbenchmark和测试正确性:

#!/usr/bin/env python2.7

from numpy import copy, random, arange

random.seed(0)
data = random.randint(30, size=10**5)

d = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
dk = d.keys()
dv = d.values()

def f1(a, d):
    b = copy(a)
    for k, v in d.iteritems():
        b[a==k] = v
    return b

def f2(a, d):
    for i in xrange(len(a)):
        a[i] = d.get(a[i], a[i])
    return a

def f3(a, dk, dv):
    mp = arange(0, max(a)+1)
    mp[dk] = dv
    return mp[a]


a = copy(data)
res = f2(a, d)

assert (f1(data, d) == res).all()
assert (f3(data, dk, dv) == res).all()

Result:

结果:

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f1(data,d)'
100 loops, best of 3: 6.15 msec per loop

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f3(data,dk,dv)'
100 loops, best of 3: 19.6 msec per loop

#2


17  

Assuming the values are between 0 and some maximum integer, one could implement a fast replace by using the numpy-array as int->int dict, like below

假设值介于0和某个最大整数之间,可以使用numpy-array作为int-> int dict实现快速替换,如下所示

mp = numpy.arange(0,max(data)+1)
mp[replace.keys()] = replace.values()
data = mp[data]

where first

哪里先

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  4  9  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  9 14 13 10 11 16 15 11 12 17 16 12 13 18 17 13 14 19 18 15 16
 21 20 16 17 22 21 17 18 23 22 18 19 24 23]

and replacing with

并替换为

replace = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

we obtain

我们获得

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  0  5  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  5 10 13 10 11 16 15 11 12 17 16 12 13 18 17 13 10 15 18 15 16
  1  0 16 17  2  1 17 18  3  2 18 15  0  3]

#3


4  

Another more general way to achieve this is function vectorization:

另一种更通用的方法是函数矢量化:

import numpy as np

data = np.array([0, 1, 6, 5, 1, 2, 7, 6, 2, 3, 8, 7, 3, 4, 9, 8, 5, 6, 11, 10, 6, 7, 12, 11, 7, 8, 13, 12, 8, 9, 14, 13, 10, 11, 16, 15, 11, 12, 17, 16, 12, 13, 18, 17, 13, 14, 19, 18, 15, 16, 21, 20, 16, 17, 22, 21, 17, 18, 23, 22, 18, 19, 24, 23])
mapper_dict = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

def mp(entry):
    return mapper_dict[entry] if entry in mapper_dict else entry
mp = np.vectorize(mp)

print mp(data)

#4


2  

No solution was posted still without a python loop on the array (except Celil's one, which however assume numbers are "small"), so here is an alternative:

没有解决方案仍然没有在阵列上没有python循环(除了Celil的一个,但是假设数字是“小”),所以这里有一个替代方案:

def replace(arr, rep_dict):
    """Assumes all elements of "arr" are keys of rep_dict"""

    # Removing the explicit "list" breaks python3
    rep_keys, rep_vals = array(list(zip(*sorted(rep_dict.items()))))

    idces = digitize(arr, rep_keys, right=True)
    # Notice rep_keys[digitize(arr, rep_keys, right=True)] == arr

    return rep_vals[idces]

the way "idces" is created comes from here.

“idces”的创建方式来自于此。

#5


2  

The numpy_indexed package (disclaimer: I am its author) provides an elegant and efficient vectorized solution to this type of problem:

numpy_indexed包(免责声明:我是它的作者)为这类问题提供了一个优雅而高效的矢量化解决方案:

import numpy_indexed as npi
remapped_array = npi.remap(theArray, list(dict.keys()), list(dict.values()))

The method implemented is similar to the searchsorted based approach mentioned by Jean Lescut, but even more general. For instance, the items of the array do not need to be ints, but can be any type, even nd-subarrays themselves; yet it should achieve the same kind of performance.

实施的方法类似于Jean Lescut提到的基于搜索排序的方法,但更为一般。例如,数组的项不需要是int,但可以是任何类型,甚至是nd-subarrays本身;但它应该达到同样的性能。

#6


2  

I benchmarked some solutions, and the result is without appeal :

我对一些解决方案进行了基准测试,结果没有吸引力:

import timeit
import numpy as np

array = 2 * np.round(np.random.uniform(0,10000,300000)).astype(int)
from_values = np.unique(array) # pair values from 0 to 2000
to_values = np.arange(from_values.size) # all values from 0 to 1000
d = dict(zip(from_values, to_values))

def method_for_loop():
    out = array.copy()
    for from_value, to_value in zip(from_values, to_values) :
        out[out == from_value] = to_value
    print('Check method_for_loop :', np.all(out == array/2)) # Just checking
print('Time method_for_loop :', timeit.timeit(method_for_loop, number = 1))

def method_list_comprehension():
    out = [d[i] for i in array]
    print('Check method_list_comprehension :', np.all(out == array/2)) # Just checking
print('Time method_list_comprehension :', timeit.timeit(method_list_comprehension, number = 1))

def method_bruteforce():
    idx = np.nonzero(from_values == array[:,None])[1]
    out = to_values[idx]
    print('Check method_bruteforce :', np.all(out == array/2)) # Just checking
print('Time method_bruteforce :', timeit.timeit(method_bruteforce, number = 1))

def method_searchsort():
    sort_idx = np.argsort(from_values)
    idx = np.searchsorted(from_values,array,sorter = sort_idx)
    out = to_values[sort_idx][idx]
    print('Check method_searchsort :', np.all(out == array/2)) # Just checking
print('Time method_searchsort :', timeit.timeit(method_searchsort, number = 1))

And I got the following results :

我得到了以下结果:

Check method_for_loop : True
Time method_for_loop : 2.6411612760275602

Check method_list_comprehension : True
Time method_list_comprehension : 0.07994363596662879

Check method_bruteforce : True
Time method_bruteforce : 11.960559037979692

Check method_searchsort : True
Time method_searchsort : 0.03770717792212963

The "searchsort" method is almost a hundred times faster than the "for" loop, and about 3600 times faster than the numpy bruteforce method. The list comprehension method is also a very good trade-off between code simplicity and speed.

“searchsort”方法比“for”循环快几百倍,比numpy bruteforce方法快大约3600倍。列表理解方法也是代码简单性和速度之间的非常好的折衷。

#7


0  

for i in xrange(len(the_array)):
    the_array[i] = the_dict.get(the_array[i], the_array[i])

#8


0  

Well, you need to make one pass through theArray, and for each element replace it if it is in the dictionary.

好吧,你需要通过theArray进行一次传递,并且对于每个元素,如果它在字典中,则替换它。

for i in xrange( len( theArray ) ):
    if foo[ i ] in dict:
        foo[ i ] = dict[ foo[ i ] ]

#9


0  

Pythonic way without the need for data to be integer, can be even strings:

Pythonic方式无需数据为整数,甚至可以是字符串:

from scipy.stats import rankdata
import numpy as np

data = np.random.rand(100000)
replace = {data[0]: 1, data[5]: 8, data[8]: 10}

arr = np.vstack((replace.keys(), replace.values())).transpose()
arr = arr[arr[:,1].argsort()]

unique = np.unique(data)
mp = np.vstack((unique, unique)).transpose()
mp[np.in1d(mp[:,0], arr),1] = arr[:,1]
data = mp[rankdata(data, 'dense')-1][:,1]

#10


0  

A fully vectorized solution using np.in1d and np.searchsorted:

使用np.in1d和np.searchsorted的完全矢量化解决方案:

replace = numpy.array([list(replace.keys()), list(replace.values())])    # Create 2D replacement matrix
mask = numpy.in1d(data, replace[0, :])                                   # Find elements that need replacement
data[mask] = replace[1, numpy.searchsorted(replace[0, :], data[mask])]   # Replace elements

#1


30  

I believe there's even more efficient method, but for now, try

我相信有更有效的方法,但是现在,试试吧

from numpy import copy

newArray = copy(theArray)
for k, v in d.iteritems(): newArray[theArray==k] = v

Microbenchmark and test for correctness:

Microbenchmark和测试正确性:

#!/usr/bin/env python2.7

from numpy import copy, random, arange

random.seed(0)
data = random.randint(30, size=10**5)

d = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
dk = d.keys()
dv = d.values()

def f1(a, d):
    b = copy(a)
    for k, v in d.iteritems():
        b[a==k] = v
    return b

def f2(a, d):
    for i in xrange(len(a)):
        a[i] = d.get(a[i], a[i])
    return a

def f3(a, dk, dv):
    mp = arange(0, max(a)+1)
    mp[dk] = dv
    return mp[a]


a = copy(data)
res = f2(a, d)

assert (f1(data, d) == res).all()
assert (f3(data, dk, dv) == res).all()

Result:

结果:

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f1(data,d)'
100 loops, best of 3: 6.15 msec per loop

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f3(data,dk,dv)'
100 loops, best of 3: 19.6 msec per loop

#2


17  

Assuming the values are between 0 and some maximum integer, one could implement a fast replace by using the numpy-array as int->int dict, like below

假设值介于0和某个最大整数之间,可以使用numpy-array作为int-> int dict实现快速替换,如下所示

mp = numpy.arange(0,max(data)+1)
mp[replace.keys()] = replace.values()
data = mp[data]

where first

哪里先

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  4  9  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  9 14 13 10 11 16 15 11 12 17 16 12 13 18 17 13 14 19 18 15 16
 21 20 16 17 22 21 17 18 23 22 18 19 24 23]

and replacing with

并替换为

replace = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

we obtain

我们获得

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  0  5  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  5 10 13 10 11 16 15 11 12 17 16 12 13 18 17 13 10 15 18 15 16
  1  0 16 17  2  1 17 18  3  2 18 15  0  3]

#3


4  

Another more general way to achieve this is function vectorization:

另一种更通用的方法是函数矢量化:

import numpy as np

data = np.array([0, 1, 6, 5, 1, 2, 7, 6, 2, 3, 8, 7, 3, 4, 9, 8, 5, 6, 11, 10, 6, 7, 12, 11, 7, 8, 13, 12, 8, 9, 14, 13, 10, 11, 16, 15, 11, 12, 17, 16, 12, 13, 18, 17, 13, 14, 19, 18, 15, 16, 21, 20, 16, 17, 22, 21, 17, 18, 23, 22, 18, 19, 24, 23])
mapper_dict = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

def mp(entry):
    return mapper_dict[entry] if entry in mapper_dict else entry
mp = np.vectorize(mp)

print mp(data)

#4


2  

No solution was posted still without a python loop on the array (except Celil's one, which however assume numbers are "small"), so here is an alternative:

没有解决方案仍然没有在阵列上没有python循环(除了Celil的一个,但是假设数字是“小”),所以这里有一个替代方案:

def replace(arr, rep_dict):
    """Assumes all elements of "arr" are keys of rep_dict"""

    # Removing the explicit "list" breaks python3
    rep_keys, rep_vals = array(list(zip(*sorted(rep_dict.items()))))

    idces = digitize(arr, rep_keys, right=True)
    # Notice rep_keys[digitize(arr, rep_keys, right=True)] == arr

    return rep_vals[idces]

the way "idces" is created comes from here.

“idces”的创建方式来自于此。

#5


2  

The numpy_indexed package (disclaimer: I am its author) provides an elegant and efficient vectorized solution to this type of problem:

numpy_indexed包(免责声明:我是它的作者)为这类问题提供了一个优雅而高效的矢量化解决方案:

import numpy_indexed as npi
remapped_array = npi.remap(theArray, list(dict.keys()), list(dict.values()))

The method implemented is similar to the searchsorted based approach mentioned by Jean Lescut, but even more general. For instance, the items of the array do not need to be ints, but can be any type, even nd-subarrays themselves; yet it should achieve the same kind of performance.

实施的方法类似于Jean Lescut提到的基于搜索排序的方法,但更为一般。例如,数组的项不需要是int,但可以是任何类型,甚至是nd-subarrays本身;但它应该达到同样的性能。

#6


2  

I benchmarked some solutions, and the result is without appeal :

我对一些解决方案进行了基准测试,结果没有吸引力:

import timeit
import numpy as np

array = 2 * np.round(np.random.uniform(0,10000,300000)).astype(int)
from_values = np.unique(array) # pair values from 0 to 2000
to_values = np.arange(from_values.size) # all values from 0 to 1000
d = dict(zip(from_values, to_values))

def method_for_loop():
    out = array.copy()
    for from_value, to_value in zip(from_values, to_values) :
        out[out == from_value] = to_value
    print('Check method_for_loop :', np.all(out == array/2)) # Just checking
print('Time method_for_loop :', timeit.timeit(method_for_loop, number = 1))

def method_list_comprehension():
    out = [d[i] for i in array]
    print('Check method_list_comprehension :', np.all(out == array/2)) # Just checking
print('Time method_list_comprehension :', timeit.timeit(method_list_comprehension, number = 1))

def method_bruteforce():
    idx = np.nonzero(from_values == array[:,None])[1]
    out = to_values[idx]
    print('Check method_bruteforce :', np.all(out == array/2)) # Just checking
print('Time method_bruteforce :', timeit.timeit(method_bruteforce, number = 1))

def method_searchsort():
    sort_idx = np.argsort(from_values)
    idx = np.searchsorted(from_values,array,sorter = sort_idx)
    out = to_values[sort_idx][idx]
    print('Check method_searchsort :', np.all(out == array/2)) # Just checking
print('Time method_searchsort :', timeit.timeit(method_searchsort, number = 1))

And I got the following results :

我得到了以下结果:

Check method_for_loop : True
Time method_for_loop : 2.6411612760275602

Check method_list_comprehension : True
Time method_list_comprehension : 0.07994363596662879

Check method_bruteforce : True
Time method_bruteforce : 11.960559037979692

Check method_searchsort : True
Time method_searchsort : 0.03770717792212963

The "searchsort" method is almost a hundred times faster than the "for" loop, and about 3600 times faster than the numpy bruteforce method. The list comprehension method is also a very good trade-off between code simplicity and speed.

“searchsort”方法比“for”循环快几百倍,比numpy bruteforce方法快大约3600倍。列表理解方法也是代码简单性和速度之间的非常好的折衷。

#7


0  

for i in xrange(len(the_array)):
    the_array[i] = the_dict.get(the_array[i], the_array[i])

#8


0  

Well, you need to make one pass through theArray, and for each element replace it if it is in the dictionary.

好吧,你需要通过theArray进行一次传递,并且对于每个元素,如果它在字典中,则替换它。

for i in xrange( len( theArray ) ):
    if foo[ i ] in dict:
        foo[ i ] = dict[ foo[ i ] ]

#9


0  

Pythonic way without the need for data to be integer, can be even strings:

Pythonic方式无需数据为整数,甚至可以是字符串:

from scipy.stats import rankdata
import numpy as np

data = np.random.rand(100000)
replace = {data[0]: 1, data[5]: 8, data[8]: 10}

arr = np.vstack((replace.keys(), replace.values())).transpose()
arr = arr[arr[:,1].argsort()]

unique = np.unique(data)
mp = np.vstack((unique, unique)).transpose()
mp[np.in1d(mp[:,0], arr),1] = arr[:,1]
data = mp[rankdata(data, 'dense')-1][:,1]

#10


0  

A fully vectorized solution using np.in1d and np.searchsorted:

使用np.in1d和np.searchsorted的完全矢量化解决方案:

replace = numpy.array([list(replace.keys()), list(replace.values())])    # Create 2D replacement matrix
mask = numpy.in1d(data, replace[0, :])                                   # Find elements that need replacement
data[mask] = replace[1, numpy.searchsorted(replace[0, :], data[mask])]   # Replace elements