Join two NumPy arrays of unequal size and fill a third array based on a common column

I have two arrays of unequal size and dimensions:

a = [['50.561872473 25.047160868 0.0', '0'],
     ['50.561905852 25.047537575 0.0', '1'],
     ['50.562232967 25.048109789 0.0', '2'],
     ['50.561940185 25.047914282 1.0', '5']]

b = [['50.561872473 25.047160868 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.562232967 25.048109789 0.0'],
     ['50.562232967 25.048109789 0.0'],
     ['50.561940185 25.047914282 1.0'],
     ['50.561940185 25.047914282 1.0'],
     ['50.561940185 25.047914282 1.0']]

b contains multiple occurrences of the values in a's first column. That column is the join key between the two arrays.

In the desired output array, wherever a's first column matches b's first column, I want to add a's second column, so that:

c = [['50.561872473 25.047160868 0.0', '0'],
     ['50.561905852 25.047537575 0.0', '1'],
     ['50.561905852 25.047537575 0.0', '1'],
     ['50.561905852 25.047537575 0.0', '1'],
     ['50.562232967 25.048109789 0.0', '2'],
     ['50.562232967 25.048109789 0.0', '2'],
     ['50.561940185 25.047914282 1.0', '5'],
     ['50.561940185 25.047914282 1.0', '5'],
     ['50.561940185 25.047914282 1.0', '5']]

a and b each have a few million rows, and Python for loops are far too slow for this, so I am hoping to do it much more efficiently with NumPy methods.

3 Solutions

#1

You can do this with pandas:

import numpy as np
import pandas as pd

a = [['50.561872473 25.047160868 0.0', '0'],
['50.561905852 25.047537575 0.0', '1'],
['50.562232967 25.048109789 0.0', '2'],
['50.561940185 25.047914282 1.0', '5']]

b = [['50.561872473 25.047160868 0.0'],
['50.561905852 25.047537575 0.0'],
['50.561905852 25.047537575 0.0'],
['50.561905852 25.047537575 0.0'],
['50.562232967 25.048109789 0.0'],
['50.562232967 25.048109789 0.0'],
['50.561940185 25.047914282 1.0'],
['50.561940185 25.047914282 1.0'],
['50.561940185 25.047914282 1.0']]

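# DataFrame() assigns default integer column labels, so both frames share column 0.
df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)

# With no arguments, merge() does an inner join on the columns the frames have in common.
print(df_a.merge(df_b))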

Output

                               0  1
0  50.561872473 25.047160868 0.0  0
1  50.561905852 25.047537575 0.0  1
2  50.561905852 25.047537575 0.0  1
3  50.561905852 25.047537575 0.0  1
4  50.562232967 25.048109789 0.0  2
5  50.562232967 25.048109789 0.0  2
6  50.561940185 25.047914282 1.0  5
7  50.561940185 25.047914282 1.0  5
8  50.561940185 25.047914282 1.0  5
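
If the rows of c must follow the order of b rather than the order of a, one option is to merge from b's side with a left join. The following is only a sketch on made-up keys: the column names "key" and "value" and the sample data are assumptions for illustration, not part of the question.

```python
import pandas as pd

# Illustrative data: b is deliberately NOT in the same order as a.
a = [["k1", "0"], ["k2", "1"], ["k3", "2"]]
b = [["k2"], ["k1"], ["k2"], ["k3"]]

# Hypothetical column names, chosen so the join column is explicit.
df_a = pd.DataFrame(a, columns=["key", "value"])
df_b = pd.DataFrame(b, columns=["key"])

# A left join keeps every row of df_b, in df_b's order.
# validate="many_to_one" raises if a key appears more than once in df_a.
c = df_b.merge(df_a, on="key", how="left", validate="many_to_one")
print(c)
```

Keys in b that have no match in a would get NaN in the "value" column instead of being dropped.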

#2

Whether this works for your specific case depends on some details, but it works for the simple example you've given.

>>> import numpy
>>> a, b = numpy.array(a), numpy.array(b)  # a and b from the question, as arrays
>>> sorted_a = a[a.argsort(axis=0)[:,0]]
>>> insertion_points = numpy.searchsorted(sorted_a[:,0], b).ravel()
>>> sorted_a[insertion_points]
array([['50.561872473 25.047160868 0.0', '0'],
       ['50.561905852 25.047537575 0.0', '1'],
       ['50.561905852 25.047537575 0.0', '1'],
       ['50.561905852 25.047537575 0.0', '1'],
       ['50.562232967 25.048109789 0.0', '2'],
       ['50.562232967 25.048109789 0.0', '2'],
       ['50.561940185 25.047914282 1.0', '5'],
       ['50.561940185 25.047914282 1.0', '5'],
       ['50.561940185 25.047914282 1.0', '5']], 
      dtype='<S29')

This begins by sorting a on its first column. It then uses searchsorted to binary-search the first column of sorted_a for the insertion index of each value in b. Assuming every value in b has an exactly equal value in a's first column, the returned insertion indices have two nice properties. First, they point to the matching row in sorted_a. Second, they can be used directly as indices into sorted_a to build a new array with fancy indexing.

This makes creating the third array very easy. However, it draws all of its data from a, not from b. If some value in b has no exact match in a, the insertion index points at the wrong row (or one past the end of the array), so the solution would have to be more complex.
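
A rough sketch of what that more complex version might look like, assuming unmatched rows of b should get an empty string as their second column. The sample keys and the names order, pos, and found are made up for illustration.

```python
import numpy as np

# Illustrative data: "k9" in b has no counterpart in a.
a = np.array([["k1", "0"], ["k3", "2"], ["k2", "1"]])
b = np.array([["k2"], ["k9"], ["k1"], ["k2"]])

# Sort a by its key column once.
order = np.argsort(a[:, 0])
sorted_keys = a[order, 0]
sorted_vals = a[order, 1]

keys = b[:, 0]
pos = np.searchsorted(sorted_keys, keys)

# searchsorted can return len(sorted_keys) for keys larger than every
# key in a, so clip before indexing.
pos = np.clip(pos, 0, len(sorted_keys) - 1)

# An insertion index is a real match only if the key found there is equal.
found = sorted_keys[pos] == keys

# Assumption: unmatched rows get "" as their value.
values = np.where(found, sorted_vals[pos], "")
c = np.column_stack([keys, values])
print(c)
```

Because the first column of c is taken from b, the result keeps b's row order and row count even when some keys are missing from a.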

#3

import numpy as np

a = [['50.561872473 25.047160868 0.0', '0'],
     ['50.561905852 25.047537575 0.0', '1'],
     ['50.562232967 25.048109789 0.0', '2'],
     ['50.561940185 25.047914282 1.0', '5']]

b = [['50.561872473 25.047160868 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.562232967 25.048109789 0.0'],
     ['50.562232967 25.048109789 0.0'],
     ['50.561940185 25.047914282 1.0'],
     ['50.561940185 25.047914282 1.0'],
     ['50.561940185 25.047914282 1.0']]

a = np.array(a)
b = np.array(b)

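Find out where they match. Comparing b (shape (9, 1)) with a[:,0] (shape (4,)) broadcasts to a boolean array of shape (9, 4), one row per row of b and one column per row of a.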

x = b == a[:,0]

>>> x
array([[ True, False, False, False],
       [False,  True, False, False],
       [False,  True, False, False],
       [False,  True, False, False],
       [False, False,  True, False],
       [False, False,  True, False],
       [False, False, False,  True],
       [False, False, False,  True],
       [False, False, False,  True]], dtype=bool)

Get the column index of each match, that is, the row of a that each row of b matches.

v = np.where(x)[1]

>>> v
array([0, 1, 1, 1, 2, 2, 3, 3, 3])

Use the indices to create the result from a.

s = a[v]

>>> s
array([['50.561872473 25.047160868 0.0', '0'],
       ['50.561905852 25.047537575 0.0', '1'],
       ['50.561905852 25.047537575 0.0', '1'],
       ['50.561905852 25.047537575 0.0', '1'],
       ['50.562232967 25.048109789 0.0', '2'],
       ['50.562232967 25.048109789 0.0', '2'],
       ['50.561940185 25.047914282 1.0', '5'],
       ['50.561940185 25.047914282 1.0', '5'],
       ['50.561940185 25.047914282 1.0', '5']], 
      dtype='|S29')

If a contains duplicate keys, this might not produce what you want: np.where would return more than one index for the affected rows of b, so the result would have more rows than b.
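
A small sketch of the duplicate problem and one possible workaround, which keeps only the first matching row of a for each row of b. The keys below are made up for illustration. Note also that the comparison array has len(b) * len(a) elements, so this approach is unlikely to fit in memory for arrays with millions of rows.

```python
import numpy as np

# Illustrative data: the key "k2" appears twice in a.
a = np.array([["k1", "0"], ["k2", "1"], ["k2", "7"]])
b = np.array([["k1"], ["k2"]])

x = b == a[:, 0]          # shape (len(b), len(a))

# With duplicates, np.where reports two matches for the second row of b.
rows_b, rows_a = np.where(x)
print(rows_b, rows_a)

# Workaround (an assumption about the desired behaviour): take the first
# match per row of b. argmax returns the position of the first True.
first = x.argmax(axis=1)
has_match = x.any(axis=1)  # guards against rows with no match at all
c = a[first[has_match]]
print(c)
```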
