I'm using Python with NumPy/SciPy to do regex matching and stemming for a text processing application, but I'd also like to use some of R's statistical packages.
What's the best way to pass the data from Python to R (and back)?
Also, I need to back up the array to disk at some point, so I'm open to saving from Python and loading in R if that's the best solution. The matrices are pretty big (e.g. 100,000 x 10,000), so using sparse matrices might also be nice.
Apologies if this is a repost. I haven't been able to find anything that puts all these pieces together.
3 solutions
#1
6
Have you already looked into RPy? It's a Python interface to R. I guess that would spare you the data handling.
To back up your NumPy arrays you can use pickle. However, pickle seems to add a lot of overhead when saving huge data, so large NumPy arrays are better saved in the HDF format. Here's an article covering that: http://www.shocksolution.com/2010/01/10/storing-large-numpy-arrays-on-disk-python-pickle-vs-hdf5adsf/
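As a rough illustration of the HDF route, here is a minimal sketch that assumes the h5py package is installed (the answer above doesn't name a specific HDF library, so that choice is mine). The helper names `save_matrix` and `load_matrix`, the dataset name `matrix`, and the small stand-in array are all made up for the example.

```python
import os
import tempfile

import h5py
import numpy as np


def save_matrix(path, matrix, name="matrix"):
    """Write a 2-D array to an HDF5 file as one gzip-compressed dataset."""
    with h5py.File(path, "w") as f:
        f.create_dataset(name, data=matrix, compression="gzip")


def load_matrix(path, name="matrix"):
    """Read the whole dataset back into memory as a NumPy array."""
    with h5py.File(path, "r") as f:
        return f[name][...]


# Small stand-in for the real data (assumption): mostly zeros,
# like a document-term count matrix.
counts = np.zeros((1000, 200), dtype=np.int32)
counts[::10, ::5] = 1

path = os.path.join(tempfile.mkdtemp(), "counts.h5")
save_matrix(path, counts)
restored = load_matrix(path)
```

Compression helps a lot when the matrix is mostly zeros. On the R side, a package such as rhdf5 should be able to open the same file, though the array may come out transposed because R stores arrays column-major.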
#2
2
Use Rpy, http://rpy.sourceforge.net/, to call R from Python.
The caveat is that your R and Python versions both need to match exactly the versions the Rpy binary was built against, so you need to be careful with the installation.
#3
0
I cannot comment on sharing "large data" between R and Python, but I have had a much easier time working with pyRserve than with RPy or RPy2.
That being said, I am curious about the text processing you are doing. Python obviously has a lot to offer on the text processing side, but there is also a lot on the statistical side in packages like NLTK and the Pattern package from CLiPS. Are you just more comfortable doing stats in R, or is there something specific missing in Python?