How can I pass large arrays between numpy and R?

I'm using python and numpy/scipy to do regex and stemming for a text processing application. But I want to use some of R's statistical packages as well.

What's the best way to pass the data from python to R? (And back?)

Also, I need to back up the array to disk at some point, so I'm open to saving from python and loading in R if that's the best solution. The matrices are pretty big (e.g. 100,000 x 10,000), so using sparse matrices might also be nice.

Apologies if this is a repost. I haven't been able to find anything that puts all these pieces together.

3 Solutions

#1

Have you already looked into RPy? It's a python interface to R. I guess that would spare you the data handling.

To back up your NumPy arrays you can use pickle. However, since pickle seems to create a lot of overhead when saving huge data, NumPy arrays are best saved using the HDF standard. Here's an article covering that: http://www.shocksolution.com/2010/01/10/storing-large-numpy-arrays-on-disk-python-pickle-vs-hdf5adsf/
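As a rough illustration of the HDF route, here is a minimal sketch that writes a NumPy array to an HDF5 file and reads it back. It assumes the h5py package is installed (nothing in this thread says which HDF library to use), and the helper names `save_array_hdf5` and `load_array_hdf5`, the dataset name `matrix`, and the file name are all made up for the example.

```python
import os
import tempfile

import h5py  # assumption: the h5py package is available
import numpy as np


def save_array_hdf5(path, array, name="matrix"):
    """Write `array` to an HDF5 file as a single gzip-compressed dataset."""
    with h5py.File(path, "w") as f:
        f.create_dataset(name, data=array, compression="gzip")


def load_array_hdf5(path, name="matrix"):
    """Read the whole dataset back into memory as a NumPy array."""
    with h5py.File(path, "r") as f:
        return f[name][...]


# Small stand-in for the real 100,000 x 10,000 matrix.
demo = np.arange(12, dtype=np.float64).reshape(3, 4)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")

save_array_hdf5(demo_path, demo)
restored = load_array_hdf5(demo_path)
```

On the R side, an HDF5 reader package would be needed to load the same file. HDF5 stores data in row-major order while R is column-major, so the dimensions may come back transposed; it is worth checking with a small array like the one above first.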

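For the sparse case raised in the question, a file-based hand-off is also possible. A dense 100,000 x 10,000 matrix of 8-byte floats takes about 8 GB, so if most entries are zero a sparse format saves a lot of space. The sketch below assumes SciPy is available and uses the Matrix Market text format, which SciPy can write and R's Matrix package can read; the array contents and the file name are invented for the example.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from scipy.io import mmread, mmwrite

# Small stand-in for a large, mostly-zero matrix.
dense = np.array([[0.0, 2.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [1.0, 0.0, 3.0]])
counts = sparse.csr_matrix(dense)

# Write only the non-zero entries to a Matrix Market (.mtx) file.
mtx_path = os.path.join(tempfile.mkdtemp(), "counts.mtx")
mmwrite(mtx_path, counts)

# Read the file back in Python to confirm the round trip.
round_trip = sparse.csr_matrix(mmread(mtx_path))
```

In R, the same file should load as a sparse matrix with `Matrix::readMM("counts.mtx")`. Matrix Market is plain text, so for very large matrices the files can be slow to write and parse; that trade-off is not measured here.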

#2

Use Rpy, http://rpy.sourceforge.net/, to call R from Python.

The caveat is that the R and Python versions both need to be exactly the ones for which the Rpy binary was built. You thus need to be careful with the installation.

#3

I cannot comment on sharing "large data" between R and Python, but I have had a much easier time working with pyRserve than with RPy or RPy2.

That being said, I am curious about the text processing you are doing. Python obviously has a lot to offer on the text processing side, but there is also a lot on the statistical side in packages like NLTK and the Pattern package from CLiPS. Are you just more comfortable doing stats in R, or is there something specific missing in Python?
