I am writing data-harvesting code in Python and would like it to produce a data frame file that is as easy as possible to import into R. I have full control over what the Python code produces, and I'd like to avoid unnecessary processing on the R side, such as converting columns to factor or numeric vectors. If possible, I'd also like the import to be a single function call that takes just the file name as its argument.
How should I store the data in a file to make this possible?
2 Answers
#1
4
You can write the data to CSV using Python's csv module (http://docs.python.org/2/library/csv.html); then it's a simple matter of calling read.csv in R (see ?read.csv).
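As a rough illustration of the Python side, here is a minimal sketch that writes a header row plus records with the csv module. The records, the column names, and the file name harvest.csv are made up for the example, and the code assumes Python 3 rather than the Python 2 documentation linked above.

```python
import csv
import os
import tempfile

# Hypothetical harvested records; the column names are illustrative only.
rows = [
    {"site": "alpha", "count": 12, "mean_temp": 18.5},
    {"site": "beta", "count": 7, "mean_temp": 21.25},
]


def write_csv(path, records, fieldnames):
    """Write a list of dicts to `path` as a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


# The temp directory is used only so the sketch runs anywhere.
csv_path = os.path.join(tempfile.gettempdir(), "harvest.csv")
write_csv(csv_path, rows, ["site", "count", "mean_temp"])
```

On the R side the resulting file can then be loaded with a single call of the form read.csv("harvest.csv").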
When you read data into R with read.csv, numeric fields are converted to numeric and, unless you specify otherwise, character strings are converted to factors (that was the default before R 4.0.0; from 4.0.0 on, strings stay character unless you pass stringsAsFactors = TRUE). Empty fields in numeric columns are converted to NA.
The first thing you should do after importing data is run str() on it (see ?str) to make sure the column classes meet your expectations. Many times I have mistakenly mixed a character value into a numeric field and ended up with a factor instead of a numeric.
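Since you control the Python side, you could catch that kind of mixed column before the file is written. The following is only a sketch: the helper name non_numeric_values and the set of missing-value markers are assumptions, and Python's float() is used as an approximation of what R accepts as a number (the two parsers do not agree on every edge case).

```python
def non_numeric_values(values, missing=("", "NA")):
    """Return the values that Python's float() cannot parse.

    Values listed in `missing` are skipped, on the assumption that they
    will be read as NA on the R side.
    """
    bad = []
    for value in values:
        text = str(value).strip()
        if text in missing:
            continue  # treated as a missing value, not an error
        try:
            float(text)
        except ValueError:
            bad.append(value)
    return bad


column = [3.2, "4.1", "n/a", 7, ""]
print(non_numeric_values(column))  # ['n/a']
```

An empty result suggests the column should come through as numeric; anything else is worth fixing before export.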
One thing to note is that you may have to set your own NA strings. For example, if you have "-", ".", or some other such character denoting a blank, you'll need to pass the na.strings argument to read.csv; it accepts a vector of strings, e.g. c("-", ".").
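Because the writer is under your control, one way to avoid needing na.strings at all is to emit the literal NA for missing values, which read.csv treats as missing by default. This is a minimal sketch; the helper to_cell and the sample records are hypothetical, and note that a genuine string value "NA" would also be read as missing.

```python
import csv
import io


def to_cell(value):
    """Map Python None to the literal NA that read.csv treats as missing."""
    return "NA" if value is None else value


records = [("alpha", 12, 18.5), ("beta", None, 21.25), (None, 3, None)]

# An in-memory buffer keeps the sketch self-contained; a real script
# would write to a file opened with newline="".
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["site", "count", "mean_temp"])
for record in records:
    writer.writerow([to_cell(v) for v in record])

csv_text = buffer.getvalue()
print(csv_text)
```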
If you have date fields, you will need to convert them properly. R does not necessarily recognize dates or times unless you tell it what they are (see ?as.Date).
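On the Python side, writing dates in ISO 8601 form (YYYY-MM-DD) keeps the R conversion simple, since as.Date parses that format without a format argument. The helper name iso_date below is an assumption made for this sketch.

```python
from datetime import date, datetime


def iso_date(value):
    """Format a date or datetime as YYYY-MM-DD, dropping any time part."""
    if isinstance(value, datetime):
        value = value.date()
    return value.isoformat()


print(iso_date(date(2013, 4, 9)))              # 2013-04-09
print(iso_date(datetime(2013, 4, 9, 17, 30)))  # 2013-04-09
```

The column still arrives in R as text, so it needs as.Date() afterwards or a "Date" entry in colClasses.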
If you know in advance what each column is going to be, you can specify its class using colClasses.
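If the Python script already knows the column types, it could also print the matching R call for you to paste. This is purely a convenience sketch: the R_CLASS mapping, the function name col_classes_call, and the schema are all assumptions made for illustration.

```python
# Assumed mapping from Python types to R class names accepted by colClasses.
R_CLASS = {int: "integer", float: "numeric", str: "character"}


def col_classes_call(filename, columns):
    """Build the text of an R read.csv call for (name, python_type) pairs."""
    classes = ", ".join(
        '{} = "{}"'.format(name, R_CLASS[py_type]) for name, py_type in columns
    )
    return 'read.csv("{}", colClasses = c({}))'.format(filename, classes)


schema = [("site", str), ("count", int), ("mean_temp", float)]
print(col_classes_call("harvest.csv", schema))
# read.csv("harvest.csv", colClasses = c(site = "character", count = "integer", mean_temp = "numeric"))
```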
A thorough read through ?read.csv will give you more detailed information, but I've outlined some common issues.
#2
4
Brandon's suggestion of using CSV is great if your data isn't enormous, and particularly if it doesn't contain a whole honking lot of floating point values; for those, the CSV format is extremely inefficient.
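To get a feel for the overhead, the sketch below compares the size of full-precision floats written as text against the same values packed as 8-byte binary doubles. The sample data is random and the comparison ignores compression, so treat the numbers as a rough indication only.

```python
import random
import struct

rng = random.Random(0)  # fixed seed so the sketch is repeatable
values = [rng.random() for _ in range(1000)]

# Text form: repr() gives the shortest string that round-trips exactly,
# plus one byte per value for the delimiter or newline.
text_bytes = sum(len(repr(v)) + 1 for v in values)

# Binary form: an IEEE 754 double always takes 8 bytes.
binary_bytes = len(struct.pack("<%dd" % len(values), *values))

print(text_bytes, binary_bytes)
```

Parsing the text back into numbers costs time as well, which is the other half of the inefficiency.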
An option that handles huge datasets a little better might be to construct an equivalent DataFrame in pandas, use its facilities to dump it to HDF5, and then open that file in R. See this question for an example.
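A minimal sketch of the pandas side might look like the following. It assumes pandas and PyTables are installed; the function name dump_to_hdf5, the key "harvest", and the file name are illustrative. Reading the result in R needs an HDF5 package there, and the layout pandas writes may take some unpacking, so check that round trip on a small file first.

```python
def dump_to_hdf5(records, path, key="harvest"):
    """Build a DataFrame from a list of dicts and write it to an HDF5 file."""
    # Imported lazily so this module loads even where pandas is missing.
    import pandas as pd

    frame = pd.DataFrame.from_records(records)
    # format="table" stores the frame as a PyTables Table node.
    frame.to_hdf(path, key=key, mode="w", format="table")
    return frame


# Example call (not run here):
# dump_to_hdf5([{"site": "alpha", "mean_temp": 18.5}], "harvest.h5")
```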
This other approach feels like overkill, but you could also transfer the dataframe in memory directly to R using pandas's experimental R interface and then save it from R.