LinearNDInterpolator hangs indefinitely on large data sets.

Time: 2021-04-03 17:53:06

I'm interpolating some data in Python to regrid it on a regular mesh so that I can partially integrate it. The data represents a function of a high-dimensional parameter space (presently 3 dimensions, to be extended to at least 5) and returns a multi-valued function of observables (presently 2, to be extended to 3 and then potentially dozens).

I'm performing the interpolation via scipy.interpolate.LinearNDInterpolator for lack of any other apparent options (and because I understand griddata just calls it anyway). On a smallish data set (15,000 lines of columned data) it works okay. On larger sets (60,000+), the command appears to run indefinitely: top indicates that IPython is using 100% CPU, and the terminal is completely unresponsive, including to C-c. So far I've left it running for a few hours to no avail, and ultimately I'd like to pass in several million entries.

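For reference, the call that hangs looks roughly like this (the file name, column layout, grid bounds, and resolution below are simplified placeholders, not my real setup):

import numpy as np
from scipy.interpolate import LinearNDInterpolator

# columned data: first 3 columns are the parameters, the rest are the observables
data = np.loadtxt('data.txt')                 # placeholder file, shape (npoints, 5)
points, values = data[:, :3], data[:, 3:]     # values may be multi-valued

# building the interpolator triangulates `points`; this is the step that never returns
interp = LinearNDInterpolator(points, values)

# evaluate on a regular grid (bounds and resolution are arbitrary here)
gx, gy, gz = np.mgrid[0:1:50j, 0:1:50j, 0:1:50j]
regridded = interp(gx, gy, gz)                # shape (50, 50, 50, 2)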

I suspect the issue is related to this ticket but that was supposedly patched in SciPy 0.10.0, to which I upgraded yesterday.

My question is basically: how do I perform multi-dimensional interpolation on large data sets? Based on what I've tried, there are a few possible places a solution could come from, but I haven't had any luck finding them. (My search isn't helped by the fact that several of scipy's subdomains seem to be down...)

  • What's going wrong with LinearNDInterpolator? Or, at least, how can I find out what the issue is and try to circumvent the hanging?
  • Is there a way to reformulate the interpolation so that LinearNDInterpolator will work? Perhaps by chunking up the data prudently to regrid it in parts?
  • Are there other high-dimension interpolators that are better suited to the problem? (I note that most of SciPy's alternatives are limited to <2D parameter space.)
  • Are there other ways to get multi-dimensional data onto a regular user-defined grid? That's all I'm trying to do by interpolating...

1 Answer

#1 (score 4)

The problem is most likely that your data set is simply too large, so that computing its Delaunay triangulation does not finish in a reasonable time. Check the time scaling of scipy.spatial.Delaunay using smaller data subsets randomly picked from your full data set, to estimate whether the computation on the full data set will finish before the universe ends.

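For example, a quick scaling check along these lines (the point counts and the random stand-in data are placeholders for subsets of your real parameter-space points):

import time
import numpy as np
from scipy.spatial import Delaunay

points = np.random.rand(100000, 3)   # stand-in for your full set of parameter points

for n in (1000, 5000, 15000, 30000, 60000):
    subset = points[np.random.choice(len(points), n, replace=False)]
    t0 = time.time()
    Delaunay(subset)                 # the same triangulation LinearNDInterpolator builds
    print(n, time.time() - t0)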

If your original data is on a rectangular grid, such as

v[i,j,k,l] = f(x[i], y[j], z[k], u[l])

then using a triangulation-based interpolation is very inefficient. It's better to use tensor-product interpolation, i.e., interpolate each dimension successively by a 1-D interpolation method:

import numpy as np
from scipy.interpolate import interp1d

def interp3(x, y, z, v, xi, yi, zi, method='cubic'):
    """Tensor-product interpolation in 3-D. x, y, z, xi, yi, zi should be 1-D
    and v.shape == (len(x), len(y), len(z))."""
    q = (x, y, z)
    qi = (xi, yi, zi)
    # interpolate one axis at a time with a 1-D interpolator
    for j in range(3):
        v = interp1d(q[j], v, axis=j, kind=method)(qi[j])
    return v

def somefunc(x, y, z):
    return x**2 + y**2 - z**2 + x*y*z

# some input data
x = np.linspace(0, 1, 5)
y = np.linspace(0, 2, 6)
z = np.linspace(0, 3, 7)
v = somefunc(x[:,None,None], y[None,:,None], z[None,None,:])

# interpolate
xi = np.linspace(0, 1, 45)
yi = np.linspace(0, 2, 46)
zi = np.linspace(0, 3, 47)
vi = interp3(x, y, z, v, xi, yi, zi)

import matplotlib.pyplot as plt
plt.subplot(121)
# pcolor expects C with shape (len(yi), len(xi)), so transpose the slice
plt.pcolor(xi, yi, vi[:,:,12].T)
plt.title('interpolated')
plt.subplot(122)
plt.pcolor(xi, yi, somefunc(xi[:,None], yi[None,:], zi[12]).T)
plt.title('exact')
plt.show()

If your data set is scattered and too large for triangulation-based methods, then you need to switch to a different method. Some options are interpolation methods that deal with a small number of nearest neighbors at a time (this information can be retrieved quickly with a k-d tree). Inverse distance weighting is one of these, but it may be one of the worse ones; there are possibly better options (which I don't know of without further research).

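A minimal sketch of that nearest-neighbor idea, using inverse distance weighting over the k nearest data points found with scipy.spatial.cKDTree (the choice of k and of the distance power below is arbitrary):

import numpy as np
from scipy.spatial import cKDTree

def idw_interp(points, values, xi, k=8, power=2.0, eps=1e-12):
    """Inverse-distance-weighted estimate at query points xi (shape (nqueries, ndim)),
    using the k nearest data points. values may be multi-valued,
    with shape (npoints,) or (npoints, nvals)."""
    tree = cKDTree(points)
    dist, idx = tree.query(xi, k=k)               # distances/indices of the k nearest neighbors
    weights = 1.0 / (dist + eps)**power           # eps avoids division by zero on exact hits
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum('ij,ij...->i...', weights, values[idx])

# usage: vi = idw_interp(points, values, grid_points), with grid_points of shape (nqueries, ndim)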
