
Time: 2022-04-17 22:51:19

I am writing data-harvesting code in Python. I'd like to produce a data frame file that is as easy to import into R as possible. I have full control over what my Python code produces, and I'd like to avoid unnecessary data processing on the R side, such as converting columns into factor or numeric vectors. I'd also like importing the data to be as simple as possible on the R side, preferably a call to a single function with the file name as its only argument.


How should I store data into a file to make this possible?


2 answers



You can write data to CSV using Python's csv module (http://docs.python.org/2/library/csv.html), then it's a simple matter of using read.csv in R. (See ?read.csv)
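
As a rough illustration of the Python side, here is a minimal sketch using only the standard library. The column names and records are made up for the example, and it writes to an in-memory buffer so that it can be run as-is; for a real file you would pass a handle from open(..., "w", newline="") instead.

```python
import csv
import io


def write_rows(handle, header, rows):
    """Write a header line followed by the data rows as CSV."""
    # lineterminator="\n" avoids the default "\r\n" line endings.
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Hypothetical harvested records: (site, count, score).
header = ["site", "count", "score"]
rows = [("alpha", 3, 0.25), ("beta", 7, 1.5)]

buffer = io.StringIO()
write_rows(buffer, header, rows)
csv_text = buffer.getvalue()
```

A file written this way has a header row and comma separators, which is what read.csv expects by default, so the R side can stay a one-argument call.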


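When you read data into R using read.csv, unless you specify otherwise, numeric fields become numeric vectors and character strings are converted to factors. (That factor conversion is the default only in R versions before 4.0.0; from 4.0.0 on, stringsAsFactors defaults to FALSE, so pass stringsAsFactors = TRUE if you want factors.) Empty values in numeric fields are converted to NA.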


The first thing you should do after importing data is to run str() on it (see ?str) to make sure the classes of the columns meet your expectations. Many times I have mistakenly mixed a character value into a numeric field and ended up with a factor instead of a numeric.
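
Since you control the Python side, you can catch that kind of mix-up before the file is ever written. The following is only a sketch under assumptions: the helper name non_numeric_cells is hypothetical, rows are assumed to be tuples, and None is assumed to stand for a missing value.

```python
def non_numeric_cells(rows, column):
    """Return (row_index, value) pairs in `column` that are not numbers."""
    bad = []
    for i, row in enumerate(rows):
        value = row[column]
        if value is None:
            # Assumed convention: None marks a missing value, which is fine.
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            bad.append((i, value))
    return bad


# Hypothetical records where one count was harvested as text.
rows = [("alpha", 3), ("beta", "7 approx"), ("gamma", None)]
problems = non_numeric_cells(rows, 1)
```

Here problems holds the offending cell from the second row, so the harvester can fix or reject it instead of letting one stray string change the class of the whole column in R.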


One thing to note is that you may have to set your own NA strings. For example, if "-", ".", or some other such character denotes a blank in your data, you'll need to pass the na.strings argument to read.csv (it accepts a vector of strings, e.g. c("-",".")).
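
Because the Python code decides how missing values are written, one way to avoid na.strings altogether is to emit the literal NA, which read.csv treats as missing by default. This is a sketch, assuming None is how the harvester represents a missing value; to_cell is a hypothetical helper name.

```python
import csv
import io


def to_cell(value):
    """Map None to the literal NA that read.csv treats as missing by default."""
    return "NA" if value is None else value


# Hypothetical records with missing values in different columns.
rows = [("alpha", 3, None), (None, 7, 1.5)]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["site", "count", "score"])
for row in rows:
    writer.writerow([to_cell(v) for v in row])
na_text = buffer.getvalue()
```

One caveat with this choice: a genuine string value "NA" in the data would also be read as missing, in which case a different marker plus na.strings on the R side is the safer route.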


If you have date fields, you will need to convert them explicitly. R does not necessarily recognize dates or times unless you specify what they are (see ?as.Date).
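
On the Python side, writing dates in ISO 8601 form (YYYY-MM-DD) keeps the later conversion simple, since as.Date accepts that format without a format argument. A minimal sketch, assuming naive (timezone-free) values; iso_cell is a hypothetical helper name.

```python
from datetime import date, datetime


def iso_cell(value):
    """Format dates and datetimes as ISO-style text; pass other values through."""
    # datetime is a subclass of date, so it has to be checked first.
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    if isinstance(value, date):
        return value.isoformat()
    return value


# Hypothetical record: (site, day harvested, exact fetch time).
record = ("alpha", date(2022, 4, 17), datetime(2022, 4, 17, 22, 51, 19))
cells = [iso_cell(v) for v in record]
```

The resulting strings still arrive in R as text, so the column needs an as.Date (or as.POSIXct) call after import, but no format string has to be guessed.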


If you know in advance what each column is going to be, you can specify its class with the colClasses argument.


A thorough read of ?read.csv will give you more detailed information, but I've outlined some common issues.




Brandon's suggestion of using CSV is great if your data isn't enormous, and particularly if it doesn't contain a whole honking lot of floating point values, in which case the CSV format is extremely inefficient.


An option that handles huge datasets a little better might be to construct an equivalent DataFrame in pandas, use its facilities to dump it to HDF5, and then open that file in R. See, for example, this question.


This other approach feels like overkill, but you could also directly transfer the dataframe in-memory to R using pandas's experimental R interface and then save it from R directly.



