Python tools for out-of-core computation / data mining

Time: 2021-08-18 00:12:52

I am interested in using Python to mine data sets that are too big to fit in RAM but that fit on a single hard drive.

I understand that I can export the data as HDF5 files using PyTables. numexpr also allows for some basic out-of-core computation.

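For reference, here is a minimal sketch of the kind of setup I mean (the file name "data.h5" and the array names are arbitrary placeholders; PyTables appends chunks to a disk-based array, and tables.Expr pushes an expression through numexpr block by block, so neither the operands nor the result need to fit in RAM):

    import numpy as np
    import tables

    # Append a large 1-D array to disk in chunks; an EArray grows along its
    # first axis, so only one chunk needs to be in RAM at a time.
    with tables.open_file("data.h5", mode="w") as f:
        x = f.create_earray(f.root, "x", tables.Float64Atom(), shape=(0,))
        for _ in range(10):
            x.append(np.random.rand(1000000))

    # Evaluate "2*x + 1" out of core: tables.Expr streams blocks through
    # numexpr and writes the result into another disk-based array.
    with tables.open_file("data.h5", mode="a") as f:
        x = f.root.x
        y = f.create_carray(f.root, "y", tables.Float64Atom(), shape=x.shape)
        expr = tables.Expr("2 * x + 1", uservars={"x": x})
        expr.set_output(y)
        expr.eval()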

What would come next? Mini-batching when possible, and relying on linear algebra results to decompose the computation when mini-batching cannot be used?

Or are there some higher level tools I have missed?

Thanks for your insights,

3 Answers

#1

In sklearn 0.14 (to be released in the coming days) there is a full-fledged example of out-of-core classification of text documents.

I think it could be a great example to start with:

http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html

In the next release we'll extend this example with more classifiers and add documentation in the user guide.

NB: you can reproduce this example with 0.13 too; all the building blocks were already there.

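The core pattern in that example is roughly the following sketch (the mini-batch generator and the two class labels here are placeholders; the linked example streams batches of Reuters news documents from disk). HashingVectorizer is stateless, so it needs no preliminary pass over the corpus, and any estimator with partial_fit can then consume one batch at a time:

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2 ** 18)  # stateless: no fit pass over the data
    clf = SGDClassifier()
    all_classes = [0, 1]  # partial_fit must see the full label set on its first call

    def iter_minibatches():
        """Placeholder generator: yield (texts, labels) batches read from disk."""
        yield ["spam spam spam", "ham and eggs"], [1, 0]

    for texts, labels in iter_minibatches():
        X = vectorizer.transform(texts)  # sparse features for this batch only
        clf.partial_fit(X, labels, classes=all_classes)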

#2

What exactly do you want to do? Can you give an example or two, please?

numpy.memmap is easy —

Create a memory-map to an array stored in a binary file on disk.
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. Numpy's memmap's are array-like objects ...

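A minimal sketch of the write/read round trip (the file name, dtype and shape below are placeholders; memmap does not record them, so you have to keep track of them yourself):

    import numpy as np

    # Create a disk-backed array and fill it piece by piece; only the touched pages use RAM.
    mm = np.memmap("big.dat", dtype="float64", mode="w+", shape=(1000000, 50))
    mm[:1000] = np.random.rand(1000, 50)
    mm.flush()
    del mm

    # Re-open the same file read-only and work on slices without loading the whole array.
    mm = np.memmap("big.dat", dtype="float64", mode="r", shape=(1000000, 50))
    col_means = mm[:100000].mean(axis=0)  # computed over a window of the file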

see also numpy+memmap on SO.

The scikit-learn people are very knowledgeable, but prefer specific questions.

#3

I have a similar need to work on sub-map-reduce-sized datasets. I posed this question on SO when I started to investigate Python pandas as a serious alternative to SAS: "Large data" work flows using pandas

The answer presented there suggests using the HDF5 interface from pandas to store pandas data structures directly on disk. Once stored, you could access the data in batches and train a model incrementally. For example, scikit-learn has several classes that can be trained on incremental pieces of a dataset. One such example is found here:

http://scikit-learn.org/0.13/modules/generated/sklearn.linear_model.SGDClassifier.html

Any class that implements the partial_fit method can be trained incrementally. I am still trying to get a viable workflow for these kinds of problems and would be interested in discussing possible solutions.

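Concretely, such a loop might look like the following sketch (the store name, the key and the "label" column are placeholders of mine, and chunked reads from pandas only work if the frame was written in 'table' format):

    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier()
    all_classes = [0, 1]  # the full label set must be given on the first partial_fit call

    # Iterate over an HDF5 store in chunks; this requires a table-format store, e.g.
    # one written with df.to_hdf("store.h5", key="df", format="table", append=True).
    for chunk in pd.read_hdf("store.h5", key="df", chunksize=100000):
        X = chunk.drop("label", axis=1).values
        y = chunk["label"].values
        clf.partial_fit(X, y, classes=all_classes)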
