Efficient way to slice a NumPy array based on unknown common values in the first column

I have a relatively large data file, on the order of 10 GB, with three columns that looks something like:

X             Y    Z
----          ---- ----
.10000E+05    100  35
.10000E+05    101  45
.             .    .
.             .    .
.             .    .
.10000E+05    400  45
.16730E+05    100  43
.16730E+05    101  25
.             .    .
.             .    .
.             .    .
.16730E+05    400  57
.             .    .
.             .    .
.             .    .
n             100  34
n             101  54
.             .    .
.             .    .
.             .    .
n             400  45

So basically, there are two independent variables, X and Y, and one dependent variable, Z. The data is loaded into a NumPy array via:

import numpy as np
data = np.loadtxt('datafile.txt', skiprows=2)

So the X, Y and Z columns correspond to data[:,0], data[:,1] and data[:,2] respectively. The X column consists of runs of identical floats. The values are not known ahead of time, but they appear in ascending order, as in the example (.10000E+05, .16730E+05, ..., n). I would like to slice the array on these runs, producing new arrays that each share a single X value.

What is an efficient way to slice this array as described?

I have tried a method that loops over the X column and checks whether neighboring elements are the same, but this takes way too long to run in Python.

3 Answers

#1

NumPy has some functions that help you accomplish this task:

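# X (column 0) is sorted, so searchsorted gives the first row index of each distinct X value
borders = data[:,0].searchsorted(np.unique(data[:,0]))
# rows that share the first X value
part0 = data[borders[0]:borders[1]]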

I wouldn't suggest breaking the big array apart, though. Instead, index into it with borders whenever needed.
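
As a minimal sketch of indexing with borders instead of copying, assuming the X column is sorted as in the question. The toy array and the helper name `group_view` are made up for illustration; appending the array length to borders is an addition so that the last group also has an end index:

```python
import numpy as np

# Toy stand-in for the real file: columns are X, Y, Z, with X sorted.
data = np.array([
    [10000.0, 100, 35],
    [10000.0, 101, 45],
    [16730.0, 100, 43],
    [16730.0, 101, 25],
    [16730.0, 102, 57],
])

x = data[:, 0]
# First row index of each distinct X value, plus the end of the array.
borders = np.append(x.searchsorted(np.unique(x)), len(x))

def group_view(i):
    """Return the rows of the i-th X group as a basic slice (a view, not a copy)."""
    return data[borders[i]:borders[i + 1]]

for i in range(len(borders) - 1):
    block = group_view(i)
    print(block[0, 0], block.shape)
```

Because each group is a basic slice, it shares memory with data, which matters for a file of this size.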

#2

I don't know why you need this, but slicing the big array and creating many smaller arrays may not be your best option.

If you really want to do it, you can get the unique values of the first column and split the array for each unique value.

uniq_vals = np.unique(data[:,0])  # X is the first column, index 0
for u in uniq_vals:
    splitted = data[data[:,0]==u]  # boolean mask: a copy of the rows with X == u
    # do whatever you want with `splitted`

And this will create a list of arrays in one line:

[data[data[:,0]==u] for u in np.unique(data[:,0])]
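
Note that the boolean mask rescans the whole X column once per unique value. If X is already sorted, as in the question, a single pass may be enough. Here is a sketch using `np.unique(..., return_index=True)` together with `np.split`; the toy array is made up for illustration:

```python
import numpy as np

# Toy stand-in for the real file: columns are X, Y, Z, with X sorted.
data = np.array([
    [10000.0, 100, 35],
    [10000.0, 101, 45],
    [16730.0, 100, 43],
    [16730.0, 101, 25],
    [16730.0, 102, 57],
])

# return_index gives the first row at which each distinct X value appears.
uniq_vals, first_idx = np.unique(data[:, 0], return_index=True)
# Split before every group start except the first; this relies on X being sorted.
groups = np.split(data, first_idx[1:])

for u, g in zip(uniq_vals, groups):
    print(u, g.shape)
```

If X were not sorted, the rows for one value would not be contiguous and this split would give wrong groups, so the mask version above would be the safer choice there.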

#3

This can be done efficiently using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
npi.group_by(data[:, 0]).split(data[:, 1:])
