The fread function from the data.table package reads large CSV files faster than the read.csv function. But as you can see from the output below, the data frames produced by the two routines differ in the "device_id" column (see the last 3 digits). Why? Is there a parameter in these functions to read them correctly? Or is this normal behavior for fread? (It reads this data file 10x faster, though.)
# Read file
p<-fread("C:\\User\\Documents\\Data\\device.csv",sep=",", integer64="character" )
> str(p)
Classes ‘data.table’ and 'data.frame': 187245 obs. of 3 variables:
$ device_id : Factor w/ 186716 levels "-1000025442746372936",..: 89025 96789 140102 123523 45208 118633 32423 22215 54410 81947 ...
$ phone_brand : Factor w/ 131 levels "E<U+4EBA>E<U+672C>""| __truncated__,"E<U+6D3E>""| __truncated__,..: 52 52 16 10 16 32 52 32 52 14 ...
$ device_model: Factor w/ 1598 levels "1100","1105",..: 1517 750 561 1503 537 775 753 433 759 983 ...
- attr(*, ".internal.selfref")=<externalptr>
> head(p)
device_id brand device_model
1: -8890648629457979026 <U+5C0F><U+7C73> <U+7EA2><U+7C73>
2: 1277779817574759137 <U+5C0F><U+7C73> MI 2
3: 5137427614288105724 <U+4E09><U+661F> Galaxy S4
4: 3669464369358936369 SUGAR <U+65F6><U+5C1A><U+624B><U+673A>
5: -5019277647504317457 <U+4E09><U+661F> Galaxy Note 2
6: 3238009352149731868 <U+534E><U+4E3A> Mate
# Read file
p<-read.csv("C:\\Users\\Documents\\Data\\device.csv",sep=",")
# Convert device_id to character
> p$device_id<-as.character(p$device_id)
> str(p)
'data.frame': 187245 obs. of 3 variables:
$ device_id : chr "-8890648629457979392" "1277779817574759168" "5137427614288105472" "3669464369358936576" ...
$ phone_brand : chr "<U+5C0F><U+7C73>""| __truncated__ "<U+5C0F><U+7C73>""| __truncated__ "<U+4E09><U+661F>""| __truncated__ "SUGAR" ...
$ device_model: chr "<U+7EA2><U+7C73>""| __truncated__ "MI 2" "Galaxy S4" "<U+65F6><U+5C1A><U+624B><U+673A>""| __truncated__ ...
> head(p)
device_id brand device_model
1 -8890648629457979392 <U+5C0F><U+7C73> <U+7EA2><U+7C73>
2 1277779817574759168 <U+5C0F><U+7C73> MI 2
3 5137427614288105472 <U+4E09><U+661F> Galaxy S4
4 3669464369358936576 SUGAR <U+65F6><U+5C1A><U+624B><U+673A>
5 -5019277647504317440 <U+4E09><U+661F> Galaxy Note 2
6 3238009352149731840 <U+534E><U+4E3A> Mate
2 Answers
#1
1
If the bit64 library is present, fread will automatically use it to correctly read integers that exceed R's 32-bit integer range (larger than 2^31 - 1).
read.csv does not do that. It reads such values as doubles, which cannot represent every integer beyond 2^53, so the trailing digits are lost.
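To see why the trailing digits change, here is a minimal sketch in base R. It relies on the fact that a double represents every integer exactly only up to 2^53. The sample ID is taken from the question; the variable names are illustrative.

```r
# R's native integer type is 32-bit, so its largest value is 2^31 - 1.
int_max <- .Machine$integer.max

# Doubles have a 53-bit significand: above 2^53 not every integer is representable.
collapses <- (2^53 + 1) == 2^53   # the +1 is rounded away

# One of the device IDs from the question, kept as text so nothing is lost yet.
id <- "1277779817574759137"

# Converting to double rounds to a nearby representable value.
lossy <- sprintf("%.0f", as.numeric(id))

print(int_max)
print(collapses)
print(lossy == id)   # the round trip through double does not return the same digits
```

Because the ID is an odd number far above 2^53, no double can hold it exactly, which is why converting to character after import cannot bring the original digits back.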
This is mentioned in the first paragraph at ?fread:
"Similar to read.table but faster and more convenient. All controls such as sep, colClasses and nrows are automatically detected. bit64::integer64 types are also detected and read directly without needing to read as character before converting."
You are using the integer64="character" option, so they will be detected and read as characters. With read.table, they will not be detected and will not be read as characters. If you want read.csv to behave similarly, you will need to use the colClasses argument to specify the column you want read as a character during import. By the time it has been read in, it is too late: the information has already been lost, and p$device_id<-as.character(p$device_id) cannot "undo" the problem.
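A minimal sketch of the colClasses approach, assuming a file shaped like the one in the question. The inline text= input and the two sample rows are illustrative stand-ins for the real device.csv.

```r
# Two rows shaped like the question's file; text= stands in for the real path.
csv_text <- "device_id,phone_brand,device_model
-8890648629457979026,SUGAR,MI 2
1277779817574759137,SUGAR,Galaxy S4"

# Naming the column in colClasses makes read.csv keep it as character on import,
# so the digits never pass through a double.
p <- read.csv(text = csv_text, colClasses = c(device_id = "character"))

str(p)
print(p$device_id)
```

Only the named column is forced to character; the other columns are still type-detected as usual.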
"Is there a parameter in these functions to read them correctly? Or this is a normal behavior for fread?"
Yes, fread is reading things correctly; this is normal behavior. read.csv will take a little more work to read things correctly: you will need to use the colClasses argument to read the long integer as a character. And it will still be slower.
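For comparison, a small sketch of the fread call on inline data. It assumes the data.table package is installed and recent enough to accept the text= argument; the sample rows are illustrative, not the real file.

```r
# Assumes the data.table package is installed; the block is skipped otherwise.
if (requireNamespace("data.table", quietly = TRUE)) {
  csv_text <- "device_id,device_model\n-8890648629457979026,MI 2\n1277779817574759137,Galaxy S4\n"

  # integer64 = "character" keeps columns of large integers as text.
  p <- data.table::fread(text = csv_text, integer64 = "character")

  print(class(p$device_id))
  print(p$device_id)
}
```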
#2
1
As the first answer explained, the read.csv function has a limitation in reading 64-bit numbers. Like fread, read.csv also preserves the digits if the numerals argument is set to "no.loss". Thanks to all the contributors to this question.
p<-read.csv("C:\\Users\\Documents\\Data\\device.csv",sep=",",encoding="UTF-8", numerals="no.loss" )
> head(p)
device_id phone_brand device_model
1: -8890648629457979026 <U+5C0F><U+7C73> <U+7EA2><U+7C73>
2: 1277779817574759137 <U+5C0F><U+7C73> MI 2
3: 5137427614288105724 <U+4E09><U+661F> Galaxy S4
4: 3669464369358936369 SUGAR <U+65F6><U+5C1A><U+624B><U+673A>
5: -5019277647504317457 <U+4E09><U+661F> Galaxy Note 2
6: 3238009352149731868 <U+534E><U+4E3A> Mate
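A hedged sketch of what numerals="no.loss" does, using inline sample rows rather than the real file. As far as the base R documentation for type.convert describes it, values that would lose accuracy as doubles are not converted to numbers at all, so the column comes back as character (or as a factor when stringsAsFactors is TRUE, which was the default before R 4.0).

```r
# Illustrative rows; text= stands in for the real device.csv path.
csv_text <- "device_id,device_model
-8890648629457979026,MI 2
1277779817574759137,Galaxy S4"

# With numerals = "no.loss", numbers that a double cannot hold exactly are left unconverted.
p <- read.csv(text = csv_text, numerals = "no.loss")

# The column is not numeric, but its digits are intact.
print(class(p$device_id))
print(as.character(p$device_id))
```

The IDs are preserved, but they cannot be used in arithmetic without a 64-bit integer type such as the one provided by bit64.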