Fast way to apply a function to each row of a numpy array

Suppose I have a nearest neighbor classifier. For a new observation, it computes the distance between that observation and every observation in the "known" data set. It returns the class label of the known observation that has the smallest distance to the new observation.

import numpy as np

known_obs = np.random.randint(0, 10, 40).reshape(8, 5)
new_obs = np.random.randint(0, 10, 80).reshape(16, 5)
labels = np.random.randint(0, 2, 8).reshape(8, )

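# Squared Euclidean norm of x1 - known_obs along the given axis
# (axis=1 gives one distance per known observation)
def my_dist(x1, known_obs, axis=0):
    return (np.square(np.linalg.norm(x1 - known_obs, axis=axis)))

# Label of the known observation closest to the single observation n
def nn_classifier(n, known_obs, labels, axis=1, distance=my_dist):
    return labels[np.argmin(distance(n, known_obs, axis=axis))]

# Classify each row of new_obs in turn (this is the loop I want to avoid)
def classify_batch(new_obs, known_obs, labels, classifier=nn_classifier, distance=my_dist):
    return [classifier(n, known_obs, labels, distance=distance) for n in new_obs]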

print(classify_batch(new_obs, known_obs, labels, nn_classifier, my_dist))

For performance reasons I would like to avoid the for loop in the classify_batch function. Is there a way to use numpy operations to apply the nn_classifier function to each row of new_obs? I already tried apply_along_axis but, as is often mentioned, it is convenient but not fast.
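
For reference, the apply_along_axis attempt might look roughly like the sketch below. The seeded generator and the names via_apply and via_loop are only there for illustration; the two functions are the ones defined above. It gives the same labels as the list comprehension, but it still calls nn_classifier once per row from Python, so it is not expected to be faster.

```python
import numpy as np

# Seeded data with the same shapes as above (the seed is an assumption)
rng = np.random.default_rng(0)
known_obs = rng.integers(0, 10, size=(8, 5))
new_obs = rng.integers(0, 10, size=(16, 5))
labels = rng.integers(0, 2, size=8)

def my_dist(x1, known_obs, axis=0):
    return np.square(np.linalg.norm(x1 - known_obs, axis=axis))

def nn_classifier(n, known_obs, labels, axis=1, distance=my_dist):
    return labels[np.argmin(distance(n, known_obs, axis=axis))]

# Each 1-D row of new_obs is passed as the first argument of nn_classifier;
# known_obs and labels are forwarded as extra positional arguments.
via_apply = np.apply_along_axis(nn_classifier, 1, new_obs, known_obs, labels)

# The explicit loop, for comparison
via_loop = np.array([nn_classifier(n, known_obs, labels) for n in new_obs])
```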

1 Answer

#1

The key to avoiding the loop is to express the action on the (16,8) array of 'distances'. The labels[] and argmin steps just cloud the issue.

If I set labels = np.arange(8), then this

arr = np.array([my_dist(n, known_obs, axis=1) for n in new_obs])
print(arr)
print(np.argmin(arr, axis=1))

produces the same thing, since each label is then just the row index of the nearest known observation. It still has a list comprehension, but we are closer to the underlying computation.

[[  32.  115.   22.  116.  162.   86.  161.  117.]
 [ 106.   31.  142.  164.   92.  106.   45.  103.]
 [  44.  135.   94.   18.   94.   50.   87.  135.]
 [  11.   92.   57.   67.   79.   43.  118.  106.]
 [  40.   67.  126.   98.   50.   74.   75.  175.]
 [  78.   61.  120.  148.  102.  128.   67.  191.]
 [  51.   48.   57.  133.  125.   35.  110.   14.]
 [  47.   28.   93.   91.   63.   49.   32.   88.]
 [  61.   86.   23.  141.  159.   85.  146.   22.]
 [ 131.   70.  155.  149.  129.  127.   44.  138.]
 [  97.  138.   87.  117.  223.   77.  130.  122.]
 [ 151.   78.  211.  161.  131.  115.   46.  164.]
 [  13.   50.   31.   69.   59.   43.   80.   40.]
 [ 131.  108.  157.  161.  207.   85.  102.  146.]
 [  39.  106.   67.   23.   61.   67.   70.   88.]
 [  54.   51.   74.   68.   42.   86.   35.   65.]]
[2 1 3 0 0 1 7 1 7 6 5 6 0 5 3 6]

With

print((new_obs[:,None,:] - known_obs[None,:,:]).shape)

I get a (16,8,5) array. So can I apply the linalg.norm on the last axis?

This seems to do the trick, with diff being that (16,8,5) difference array:

np.square(np.linalg.norm(diff, axis=-1))
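
Since squaring undoes the square root inside the norm, the same numbers can probably be had by summing the squared differences over the last axis directly. A minimal sketch, assuming arrays of the same shapes as in the question (the seeded generator and the names dist_norm, dist_sum and dist_einsum are mine, not from the question):

```python
import numpy as np

# Seeded data with the same shapes as in the question (seed is an assumption)
rng = np.random.default_rng(0)
known_obs = rng.integers(0, 10, size=(8, 5))
new_obs = rng.integers(0, 10, size=(16, 5))

# Broadcasting (16,1,5) against (1,8,5) gives every pairwise difference
diff = new_obs[:, None, :] - known_obs[None, :, :]      # shape (16, 8, 5)

# Norm along the last axis, then squared (as in the line above)
dist_norm = np.square(np.linalg.norm(diff, axis=-1))    # shape (16, 8), float

# Same quantity without the sqrt/square round trip
dist_sum = np.sum(diff**2, axis=-1)                     # shape (16, 8), int
dist_einsum = np.einsum('ijk,ijk->ij', diff, diff)      # same sum via einsum
```

With integer inputs the sum-of-squares variants stay in integer arithmetic, so exact ties between two known observations are not disturbed by floating point rounding.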

So together:

diff = (new_obs[:,None,:] - known_obs[None,:,:])
dist = np.square(np.linalg.norm(diff, axis=-1))
idx = np.argmin(dist, axis=1)
print(idx)
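
To get back to the original problem, idx only has to be used to index labels. Wrapped up as a drop-in replacement for classify_batch, it could look like the sketch below. The name classify_batch_vectorized and the seeded test data are my own additions; the loop version is the one from the question, kept only for the comparison.

```python
import numpy as np

def my_dist(x1, known_obs, axis=0):
    return np.square(np.linalg.norm(x1 - known_obs, axis=axis))

def nn_classifier(n, known_obs, labels, axis=1, distance=my_dist):
    return labels[np.argmin(distance(n, known_obs, axis=axis))]

def classify_batch(new_obs, known_obs, labels, classifier=nn_classifier, distance=my_dist):
    return [classifier(n, known_obs, labels, distance=distance) for n in new_obs]

def classify_batch_vectorized(new_obs, known_obs, labels):
    """Loop-free nearest neighbor labels (hypothetical helper name)."""
    # (n_new, 1, n_feat) - (1, n_known, n_feat) -> (n_new, n_known, n_feat)
    diff = new_obs[:, None, :] - known_obs[None, :, :]
    # Squared Euclidean distance for every (new, known) pair
    dist = np.square(np.linalg.norm(diff, axis=-1))
    # Column index of the closest known observation for each new one
    idx = np.argmin(dist, axis=1)
    # Fancy indexing maps those indices to class labels
    return labels[idx]

# Seeded data with the shapes from the question (seed is an assumption)
rng = np.random.default_rng(0)
known_obs = rng.integers(0, 10, size=(8, 5))
new_obs = rng.integers(0, 10, size=(16, 5))
labels = rng.integers(0, 2, size=8)

expected = np.array(classify_batch(new_obs, known_obs, labels))
predicted = classify_batch_vectorized(new_obs, known_obs, labels)
print(np.array_equal(expected, predicted))
```

One caveat: the intermediate diff array has n_new * n_known * n_features elements, so for large data sets it may be worth classifying new_obs in chunks, or looking at scipy.spatial.distance.cdist with the 'sqeuclidean' metric, which computes the distance matrix without that 3-D intermediate.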
