read.csv and fread produce different results for the same data frame

The fread function from the data.table package reads large CSV files faster than read.csv. But as the output below shows, the data frames produced by the two routines differ in the "device_id" column (see the last 3 digits). Why? Is there a parameter in these functions to read the values correctly? Or is this normal behavior for fread? (It does read this data file 10x faster, though.)

# Read file
p<-fread("C:\\User\\Documents\\Data\\device.csv",sep=",", integer64="character" )
> str(p)
         Classes ‘data.table’ and 'data.frame': 187245 obs. of  3 variables:
         $ device_id   : Factor w/ 186716 levels "-1000025442746372936",..: 89025 96789 140102 123523 45208 118633 32423 22215 54410 81947 ...
         $ phone_brand : Factor w/ 131 levels "E<U+4EBA>E<U+672C>""| __truncated__,"E<U+6D3E>""| __truncated__,..: 52 52 16 10 16 32 52 32 52 14 ...
         $ device_model: Factor w/ 1598 levels "1100","1105",..: 1517 750 561 1503 537 775 753 433 759 983 ...
         - attr(*, ".internal.selfref")=<externalptr>

> head(p)
                          device_id            brand                     device_model
            1: -8890648629457979026 <U+5C0F><U+7C73>                 <U+7EA2><U+7C73>
            2:  1277779817574759137 <U+5C0F><U+7C73>                             MI 2
            3:  5137427614288105724 <U+4E09><U+661F>                        Galaxy S4
            4:  3669464369358936369            SUGAR <U+65F6><U+5C1A><U+624B><U+673A>
            5: -5019277647504317457 <U+4E09><U+661F>                    Galaxy Note 2
            6:  3238009352149731868 <U+534E><U+4E3A>                             Mate

# Read file
p<-read.csv("C:\\Users\\Documents\\Data\\device.csv",sep=",")

# Convert device_id to character
> p$device_id<-as.character(p$device_id)

> str(p)
    'data.frame':   187245 obs. of  3 variables:
 $ device_id   : chr  "-8890648629457979392" "1277779817574759168" "5137427614288105472" "3669464369358936576" ...
 $ phone_brand : chr  "<U+5C0F><U+7C73>""| __truncated__ "<U+5C0F><U+7C73>""| __truncated__ "<U+4E09><U+661F>""| __truncated__ "SUGAR" ...
 $ device_model: chr  "<U+7EA2><U+7C73>""| __truncated__ "MI 2" "Galaxy S4" "<U+65F6><U+5C1A><U+624B><U+673A>""| __truncated__ ...

    > head(p)
                     device_id            brand                     device_model
        1 -8890648629457979392 <U+5C0F><U+7C73>                 <U+7EA2><U+7C73>
        2  1277779817574759168 <U+5C0F><U+7C73>                             MI 2
        3  5137427614288105472 <U+4E09><U+661F>                        Galaxy S4
        4  3669464369358936576            SUGAR <U+65F6><U+5C1A><U+624B><U+673A>
        5 -5019277647504317440 <U+4E09><U+661F>                    Galaxy Note 2
        6  3238009352149731840 <U+534E><U+4E3A>                             Mate

2 solutions

#1

If the bit64 package is installed, fread will automatically use it to correctly read integers that are too large for R's 32-bit integer type (anything above 2^31 - 1).

read.csv does not do that. It parses such values as doubles, which cannot represent every integer beyond 2^53 exactly, so the low-order digits are lost.
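
To make the loss concrete, here is a minimal base-R sketch. It takes one device_id from the question and converts it to a double by hand; the variable names are illustrative only, and the exact digits printed for the double can vary by platform.

```r
# R's native integer type is 32-bit, so its largest value is 2^31 - 1.
.Machine$integer.max

# Doubles carry 53 bits of precision: above 2^53 not every integer is
# representable, so adding 1 can be silently dropped.
2^53 + 1 == 2^53

# One device_id from the question, kept as text (hypothetical variable names).
id_text <- "1277779817574759137"

# Converting to double rounds to the nearest representable value,
# so printing it back no longer reproduces the original digits.
id_double <- as.numeric(id_text)
id_printed <- sprintf("%.0f", id_double)
id_printed == id_text
```

The last comparison is FALSE because doubles of this magnitude are spaced 256 apart and the original value is odd, so it cannot be stored exactly.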

This is mentioned in the first paragraph at ?fread:

Similar to read.table but faster and more convenient. All controls such as sep, colClasses and nrows are automatically detected. bit64::integer64 types are also detected and read directly without needing to read as character before converting.

You are using the integer64="character" option, so these values are detected and read as characters. With read.table, they are neither detected nor read as characters. If you want read.csv to behave similarly, you need to use the colClasses argument to specify the column you want read as character during import. Once the data has been read in, it is too late: the conversion has already lost information, and p$device_id<-as.character(p$device_id) cannot "undo" the problem.
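
A minimal sketch of the colClasses approach, assuming a small inline sample in place of the real device.csv (the two rows and the csv_text variable are made up for illustration; the column names follow the question):

```r
# Inline stand-in for device.csv (hypothetical sample rows).
csv_text <- "device_id,device_model
-8890648629457979026,MI 2
1277779817574759137,Galaxy S4"

# Name the column in colClasses so it is kept as text at import time;
# columns that are not named are still converted automatically.
p <- read.csv(text = csv_text, colClasses = c(device_id = "character"))

str(p)
p$device_id
```

With a file on disk, the same colClasses argument can be passed alongside the file path instead of text =.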

Is there a parameter in these functions to read them correctly? Or this is a normal behavior for fread?

Yes, fread is reading things correctly; this is normal behavior. read.csv takes a little more work to read things correctly: you need to use the colClasses argument to read the long integer as a character. And it will still be slower.

#2

As teger elegantly discussed, the read.csv function has a limitation when reading 64-bit numbers. As with fread, read.csv also works if the numerals argument is set to "no.loss". Thanks to all the contributors to this question.

p<-read.csv("C:\\Users\\Documents\\Data\\device.csv",sep=",",encoding="UTF-8", numerals="no.loss" )

> head(p)
              device_id      phone_brand                     device_model
1: -8890648629457979026 <U+5C0F><U+7C73>                 <U+7EA2><U+7C73>
2:  1277779817574759137 <U+5C0F><U+7C73>                             MI 2
3:  5137427614288105724 <U+4E09><U+661F>                        Galaxy S4
4:  3669464369358936369            SUGAR <U+65F6><U+5C1A><U+624B><U+673A>
5: -5019277647504317457 <U+4E09><U+661F>                    Galaxy Note 2
6:  3238009352149731868 <U+534E><U+4E3A>                             Mate
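
For comparison, a small base-R sketch of what numerals="no.loss" changes, again on a made-up inline sample rather than the real file. As I understand the option, values that would lose accuracy as doubles are left unconverted (character, or factor in older R versions), so as.character() is used below to cover both cases.

```r
# Inline stand-in for device.csv (hypothetical sample rows).
csv_text <- "device_id,device_model
-8890648629457979026,MI 2
1277779817574759137,Galaxy S4"

# Default behaviour: the ids are converted to doubles and lose digits.
lossy <- read.csv(text = csv_text)
class(lossy$device_id)

# numerals = "no.loss": the column is not converted to numeric,
# so the original digits survive.
kept <- read.csv(text = csv_text, numerals = "no.loss")
as.character(kept$device_id)
```

Note that the resulting column is text rather than a number, so arithmetic on it would still need something like bit64.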
