I have two arrays of unequal size and dimensions:
a = [['50.561872473 25.047160868 0.0', '0']
['50.561905852 25.047537575 0.0', '1']
['50.562232967 25.048109789 0.0', '2']
['50.561940185 25.047914282 1.0', '5']]
b = [['50.561872473 25.047160868 0.0']
['50.561905852 25.047537575 0.0']
['50.561905852 25.047537575 0.0']
['50.561905852 25.047537575 0.0']
['50.562232967 25.048109789 0.0']
['50.562232967 25.048109789 0.0']
['50.561940185 25.047914282 1.0']
['50.561940185 25.047914282 1.0']
['50.561940185 25.047914282 1.0']]
b contains multiple occurrences of each value in a's first column. That column is the join key between the arrays.
In the desired output array, wherever a's first column matches b's first column, I want to add a's second column, such that:
c = [['50.561872473 25.047160868 0.0', '0']
['50.561905852 25.047537575 0.0', '1']
['50.561905852 25.047537575 0.0', '1']
['50.561905852 25.047537575 0.0', '1']
['50.562232967 25.048109789 0.0', '2']
['50.562232967 25.048109789 0.0', '2']
['50.561940185 25.047914282 1.0', '5']
['50.561940185 25.047914282 1.0', '5']
['50.561940185 25.047914282 1.0', '5']]
a and b each have a few million rows, and Python for loops are far too slow for this, so I am hoping to do it much more efficiently with NumPy methods.
3 solutions
#1
You can do this with pandas:
import numpy as np
import pandas as pd
a = [['50.561872473 25.047160868 0.0', '0'],
['50.561905852 25.047537575 0.0', '1'],
['50.562232967 25.048109789 0.0', '2'],
['50.561940185 25.047914282 1.0', '5']]
b = [['50.561872473 25.047160868 0.0'],
['50.561905852 25.047537575 0.0'],
['50.561905852 25.047537575 0.0'],
['50.561905852 25.047537575 0.0'],
['50.562232967 25.048109789 0.0'],
['50.562232967 25.048109789 0.0'],
['50.561940185 25.047914282 1.0'],
['50.561940185 25.047914282 1.0'],
['50.561940185 25.047914282 1.0']]
df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)
print(df_a.merge(df_b))
Output:
                               0  1
0  50.561872473 25.047160868 0.0  0
1  50.561905852 25.047537575 0.0  1
2  50.561905852 25.047537575 0.0  1
3  50.561905852 25.047537575 0.0  1
4  50.562232967 25.048109789 0.0  2
5  50.562232967 25.048109789 0.0  2
6  50.561940185 25.047914282 1.0  5
7  50.561940185 25.047914282 1.0  5
8  50.561940185 25.047914282 1.0  5
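One detail worth checking before relying on the default merge: an inner merge orders its rows by the left frame (a here), so the result only lines up with b when b happens to be grouped in a's order, as in the example. A minimal sketch of a variant that follows b's order instead, assuming each key appears at most once in a. The column names coords and label, the reordered b, and the unmatched key are made up for illustration:

```python
import pandas as pd

a = [['50.561872473 25.047160868 0.0', '0'],
     ['50.561905852 25.047537575 0.0', '1'],
     ['50.562232967 25.048109789 0.0', '2'],
     ['50.561940185 25.047914282 1.0', '5']]
# b is deliberately NOT grouped in a's order here.
b = [['50.562232967 25.048109789 0.0'],
     ['50.561872473 25.047160868 0.0'],
     ['50.562232967 25.048109789 0.0'],
     ['9.9 9.9 9.9']]  # hypothetical key with no match in a

# 'coords' and 'label' are illustrative column names, not from the question.
df_a = pd.DataFrame(a, columns=['coords', 'label'])
df_b = pd.DataFrame(b, columns=['coords'])

# A left merge keeps every row of b, in b's order; unmatched keys get NaN.
# validate='many_to_one' raises if a key is duplicated in a.
c = df_b.merge(df_a, on='coords', how='left', validate='many_to_one')
print(c)
```

Rows of b without a partner in a show up with a missing label rather than silently disappearing, which makes mismatches easy to spot with c['label'].isna().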
#2
Whether this works for your specific case depends on some details, but it works for the simple example you've given.
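>>> import numpy as np
>>> a, b = np.array(a), np.array(b)  # a and b as defined in the question
>>> sorted_a = a[a[:, 0].argsort()]  # sort rows of a by the first column
>>> insertion_points = np.searchsorted(sorted_a[:, 0], b).ravel()
>>> sorted_a[insertion_points]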
array([['50.561872473 25.047160868 0.0', '0'],
['50.561905852 25.047537575 0.0', '1'],
['50.561905852 25.047537575 0.0', '1'],
['50.561905852 25.047537575 0.0', '1'],
['50.562232967 25.048109789 0.0', '2'],
['50.562232967 25.048109789 0.0', '2'],
['50.561940185 25.047914282 1.0', '5'],
['50.561940185 25.047914282 1.0', '5'],
['50.561940185 25.047914282 1.0', '5']],
dtype='<S29')
This begins by sorting a by its first column. Then it uses searchsorted to do a binary search in the sorted a for the correct insertion index for each value in b. Assuming every value in b also appears exactly in a's first column, the insertion indices returned have two nice properties. First, they point to the matching row in the sorted a. Second, they can be used as indices into the sorted a to create a new array using fancy indexing.
This makes creating the third array very easy. However, it draws all its data from a, not from b. If some values in b have no exact match in a, the insertion index will point at the wrong row (or past the end of the array), and the solution will have to be more complex.
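One way to handle the case where b contains values missing from a is to compare the key found at each insertion point with the key that was searched for. The following is only a sketch under assumptions: the unmatched key in b is hypothetical, and using an empty string as the label for unmatched rows is an arbitrary choice.

```python
import numpy as np

a = np.array([['50.561872473 25.047160868 0.0', '0'],
              ['50.561905852 25.047537575 0.0', '1'],
              ['50.562232967 25.048109789 0.0', '2'],
              ['50.561940185 25.047914282 1.0', '5']])
b = np.array([['50.561905852 25.047537575 0.0'],
              ['50.561940185 25.047914282 1.0'],
              ['99.0 99.0 9.0'],  # hypothetical key absent from a
              ['50.561872473 25.047160868 0.0']])

keys = b.ravel()
sorted_a = a[a[:, 0].argsort()]  # rows of a ordered by key
pos = np.searchsorted(sorted_a[:, 0], keys)

# An insertion point equal to len(sorted_a) is out of range, so clip it
# before indexing; the equality test below still rejects such rows.
pos_clipped = np.minimum(pos, len(sorted_a) - 1)
found = sorted_a[pos_clipped, 0] == keys

# Take the label from a where the key matched; '' marks "no match".
labels = np.where(found, sorted_a[pos_clipped, 1], '')
c = np.column_stack([keys, labels])
print(c)
```

Because the first column of c is taken from b rather than from a, rows without a match keep their original key and can be filtered out afterwards with the found mask.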
#3
a = [['50.561872473 25.047160868 0.0', '0'],
['50.561905852 25.047537575 0.0', '1'],
['50.562232967 25.048109789 0.0', '2'],
['50.561940185 25.047914282 1.0', '5']]
b = [['50.561872473 25.047160868 0.0'],
['50.561905852 25.047537575 0.0'],
['50.561905852 25.047537575 0.0'],
['50.561905852 25.047537575 0.0'],
['50.562232967 25.048109789 0.0'],
['50.562232967 25.048109789 0.0'],
['50.561940185 25.047914282 1.0'],
['50.561940185 25.047914282 1.0'],
['50.561940185 25.047914282 1.0']]
import numpy as np
a = np.array(a)
b = np.array(b)
Find out where they match. Broadcasting b (shape (9, 1)) against a's first column (shape (4,)) gives a 9 x 4 boolean matrix:
x = b == a[:,0]
>>> x
array([[ True, False, False, False],
[False, True, False, False],
[False, True, False, False],
[False, True, False, False],
[False, False, True, False],
[False, False, True, False],
[False, False, False, True],
[False, False, False, True],
[False, False, False, True]], dtype=bool)
Get the column indices of the matches, i.e. the row of a that each row of b matched:
v = np.where(x)[1]
>>> v
array([0, 1, 1, 1, 2, 2, 3, 3, 3])
Use the indices to create the result from a:
s = a[v]
>>> s
array([['50.561872473 25.047160868 0.0', '0'],
['50.561905852 25.047537575 0.0', '1'],
['50.561905852 25.047537575 0.0', '1'],
['50.561905852 25.047537575 0.0', '1'],
['50.562232967 25.048109789 0.0', '2'],
['50.562232967 25.048109789 0.0', '2'],
['50.561940185 25.047914282 1.0', '5'],
['50.561940185 25.047914282 1.0', '5'],
['50.561940185 25.047914282 1.0', '5']],
dtype='|S29')
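If there are duplicate keys in a, this might not produce what you want, because a row of b then matches more than one row of a. Note also that x holds len(b) x len(a) booleans, so with millions of rows on both sides it is unlikely to fit in memory.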
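To make the duplicate problem concrete, here is a small sketch with made-up keys (k1, k2) instead of the coordinate strings. It shows the index array growing longer than b, followed by one possible workaround: keep the first row of a for each key via np.unique and look the keys up with searchsorted. Keeping the first occurrence is an assumption about what is wanted, and the lookup assumes every key in b exists in a.

```python
import numpy as np

# Hypothetical data: the key 'k2' appears twice in a.
a = np.array([['k1', '0'],
              ['k2', '1'],
              ['k2', '7']])
b = np.array([['k1'], ['k2'], ['k2']])

x = b == a[:, 0]
v = np.where(x)[1]
print(len(b), len(v))  # v has more entries than b has rows

# Workaround sketch: keep the first row of a for each distinct key.
# np.unique returns the keys sorted, with the index of each first occurrence.
uniq_keys, first_idx = np.unique(a[:, 0], return_index=True)
a_dedup = a[first_idx]

# uniq_keys is sorted, so a binary search finds each key of b directly.
pos = np.searchsorted(uniq_keys, b.ravel())
c = a_dedup[pos]
print(c)
```

This variant also avoids building the len(b) x len(a) boolean matrix, since only one index per row of b is computed.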