How to read data when some numbers contain commas as thousand separator?

Date: 2021-02-17 03:39:05

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?


I can use read.csv(..., colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.


11 solutions

#1


116  

Not sure about how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:


y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1]  1200 20000   100 12111

This was also answered previously on R-Help (and in Q2 here).


Alternatively, you can pre-process the file, for instance with sed in unix.


#2


49  

You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:


setClass("num.with.commas")
setAs("character", "num.with.commas", 
        function(from) as.numeric(gsub(",", "", from) ) )

Then run read.csv like:


DF <- read.csv('your.file.here', 
   colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
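For anyone who wants to try this without a file on disk, here is a minimal self-contained sketch. The inline CSV text and the column names (amount, label) are made up for illustration; only the class name comes from the answer above.

```r
# Register a virtual class plus a character -> numeric coercion that strips commas.
setClass("num.with.commas")
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from, fixed = TRUE)))

# Hypothetical inline data standing in for a real file.
csv_text <- 'amount,label
"1,513",a
"20,000",b
100,c'

# read.csv coerces the first column through the method registered above.
DF <- read.csv(text = csv_text,
               colClasses = c("num.with.commas", "character"))
str(DF)
```

The values containing commas must be quoted in the file, otherwise read.csv splits them into separate fields before the conversion ever runs.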

#3


14  

I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:


x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
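The same idea can be sketched with columns picked by name rather than by position, which may be a little more robust when the file layout changes. The inline data and the column names (sales, cost) below are invented for the example.

```r
# Hypothetical inline data; in practice this would be read.csv("file.csv", ...).
csv_text <- 'id,region,sales,cost
1,north,"1,513","2,000,100"
2,south,"987","12,111"'

# Read everything as character so the commas survive the import.
x <- read.csv(text = csv_text, header = TRUE, colClasses = "character")

# Columns to convert, selected by name (assumed names, adjust to your data).
cols <- c("sales", "cost")
x[cols] <- lapply(x[cols], function(v) as.numeric(gsub(",", "", v, fixed = TRUE)))
str(x)
```

Using fixed = TRUE is optional here, since "," has no special meaning in a regular expression, but it makes the intent explicit.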

#4


10  

This question is several years old, but I stumbled upon it, which means maybe others will.


The readr library / package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.


library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
          col_types = list(col_numeric())
        )

This yields


Source: local data frame [4 x 1]


  numbers
    (dbl)
1   800.0
2  1800.0
3  3500.0
4     6.5
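A note for readers on a newer readr: as far as I know, col_numeric() has since been replaced by col_number(), so the call above may fail on a current install. The sketch below assumes readr 2.0 or later (where literal data is wrapped in I()); treat the version requirement as my assumption rather than something stated in this thread.

```r
library(readr)

# parse_number() drops grouping marks (a comma in the default locale).
parse_number(c("800", "1,800", "3500", "6.5"))

# The same conversion while reading, with the column type given explicitly.
df <- read_csv(I("numbers\n800\n\"1,800\"\n\"3500\"\n6.5"),
               col_types = cols(numbers = col_number()))
df
```

Declaring the column type up front keeps the conversion in the reading step, so the result does not depend on what the type guesser makes of the first rows.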

An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)


For instance, if I had not flagged the col_types, I would have gotten this:


> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]

  numbers
    (chr)
1     800
2   1,800
3    3500
4     6.5

(Notice that it is now a chr (character) instead of a numeric.)


Or, more dangerously, if it were long enough and most of the early elements did not contain commas:


> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")

(such that the last few elements look like:)


\"5\"\n\"9\"\n\"7\"\n\"1,003"

Then you'll find trouble reading that comma at all!


> tail(read_csv(tmp))
Source: local data frame [6 x 1]

     3"
  (dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details. 

#5


6  

"Preprocess" in R:


lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"

You can use readLines on a textConnection. Then remove only the commas that are between digits:

gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)

## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"
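Putting the pieces together, the whole round trip might look like the sketch below: read the raw lines, strip the commas that sit between digits, then hand the cleaned lines to read.csv. The strip.white and header settings are my choices for this toy input, not part of the answer.

```r
lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"

# Read the raw text line by line through a text connection.
con <- textConnection(lines)
raw <- readLines(con)
close(con)

# Remove only commas that have a digit on both sides.
cleaned <- gsub("([0-9]+),([0-9])", "\\1\\2", raw)

# Parse the cleaned lines; the third column is now numeric.
parsed <- read.csv(text = cleaned, header = FALSE, strip.white = TRUE)
str(parsed)
```

Be aware that this pattern cannot tell a thousands separator from a field separator between two numeric columns, so it is only safe when, as here, the real field separators are followed by a space or a non-digit.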

It's also useful to know, although not directly relevant to this question, that commas used as decimal separators can be handled by read.csv2 (automagically) or by read.table (by setting the 'dec' parameter).
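A quick sketch of the decimal-comma case, using a made-up semicolon-separated snippet (the column names item and price are hypothetical):

```r
# Hypothetical data in the "European" CSV layout: ';' between fields, ',' as decimal mark.
csv2_text <- "item;price\napple;1,5\npear;2,25"

# read.csv2 defaults to sep = ";" and dec = ",".
prices <- read.csv2(text = csv2_text)

# The equivalent read.table call with the separators spelled out.
tab <- read.table(text = csv2_text, header = TRUE, sep = ";", dec = ",")

prices$price
```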

Edit: Later I discovered how to use colClasses by designing a new class. See:


How to load df with 1000 separator in R as numeric class?


#6


4  

A dplyr solution using mutate_each and pipes.

Say you have the following:

> dft
Source: local data frame [11 x 5]

   Bureau.Name Account.Code   X2014   X2015   X2016
1       Senate          110 158,000 211,000 186,000
2       Senate          115       0       0       0
3       Senate          123  15,000  71,000  21,000
4       Senate          126   6,000  14,000   8,000
5       Senate          127 110,000 234,000 134,000
6       Senate          128 120,000 159,000 134,000
7       Senate          129       0       0       0
8       Senate          130 368,000 465,000 441,000
9       Senate          132       0       0       0
10      Senate          140       0       0       0
11      Senate          140       0       0       0

Suppose you want to remove the commas from the year variables X2014-X2016 and convert them to numeric. Also, let's say X2014-X2016 were read in as factors (the default).

dft %>%
    mutate_each(funs(as.character(.)), X2014:X2016) %>%
    mutate_each(funs(gsub(",", "", .)), X2014:X2016) %>%
    mutate_each(funs(as.numeric(.)), X2014:X2016)

mutate_each applies the function(s) inside funs to the specified columns.

I did it sequentially, one function at a time (if you use multiple functions inside a single funs call, you create additional, unnecessary columns).
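In recent dplyr releases mutate_each and funs are deprecated, so here is a hedged sketch of the same conversion with across(), assuming dplyr 1.0.0 or later. The three-row data frame is a cut-down stand-in for dft, built with stringsAsFactors = TRUE to mimic the factor columns described above.

```r
library(dplyr)

# Small stand-in for dft; the year columns are deliberately created as factors.
dft <- data.frame(
  Bureau.Name  = "Senate",
  Account.Code = c(110, 115, 123),
  X2014 = c("158,000", "0", "15,000"),
  X2015 = c("211,000", "0", "71,000"),
  X2016 = c("186,000", "0", "21,000"),
  stringsAsFactors = TRUE
)

# One pass: factor -> character -> strip commas -> numeric, for each year column.
out <- dft %>%
  mutate(across(X2014:X2016,
                ~ as.numeric(gsub(",", "", as.character(.x), fixed = TRUE))))
out
```

The as.character() step matters for factors: calling as.numeric() on a factor directly returns the internal level codes, not the printed values.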

#7


3  

I think preprocessing is the way to go. You could use Notepad++ which has a regular expression replace option.


For example, if your file were like this:


"1,234","123","1,234"
"234","123","1,234"
123,456,789

Then you could use the regular expression "([0-9]+),([0-9]+)" (the double quotes are part of the pattern, so only quoted values match) and replace it with \1\2, which gives:

1234,"123",1234
"234","123",1234
123,456,789

Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.
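If you would rather not leave R, the same search-and-replace can be sketched with gsub. The three input lines are the sample file from this answer, held in a character vector instead of x.csv.

```r
# The sample file from above, one element per line.
raw <- c('"1,234","123","1,234"',
         '"234","123","1,234"',
         '123,456,789')

# The quotes are part of the pattern, so unquoted fields such as 123,456,789 are untouched.
cleaned <- gsub('"([0-9]+),([0-9]+)"', "\\1\\2", raw)
cleaned

# Read the cleaned text exactly as the file would be read.
x <- read.csv(text = cleaned, header = FALSE)
x
```

Note that this pattern only handles a single comma per value, so a quoted "1,234,567" would be left unchanged and would need a more general pattern.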


#8


3  

If the thousands separator is "." and the decimal separator is "," (e.g. 1.200.000,00), you must set fixed=TRUE when calling gsub, because an unescaped "." is a regex wildcard that matches every character: gsub(".", "", y, fixed=TRUE). The decimal comma then still has to be replaced with "." before as.numeric can parse the result.
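A minimal sketch of both steps for this number format; the sample values are invented:

```r
y <- c("1.200.000,00", "3.500,75", "42")

# Step 1: drop the "." grouping marks; fixed = TRUE treats "." literally.
no_groups <- gsub(".", "", y, fixed = TRUE)

# Step 2: turn the decimal comma into a decimal point, then convert.
result <- as.numeric(sub(",", ".", no_groups, fixed = TRUE))
result
```

Without fixed = TRUE, the first gsub would delete every character and as.numeric would return NA for all elements.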

#9


1  

A very convenient way is the readr::read_delim family. Taking the example from "Importing csv with multiple separators into R", you can do it as follows:

txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'

require(readr)
read_csv(txt) # = read_delim(txt, delim = ",")

This gives the expected result:

# A tibble: 3 × 6
  OBJECTID District_N ZONE_CODE  COUNT        AREA      SUM
     <int>      <chr>     <int>  <dbl>       <dbl>    <dbl>
1        1   Bagamoyo         1 136227  8514187500 352678.8
2        2    Bariadi         2  88350  5521875000 526307.3
3        3     Chunya         3 483059 30191187500 352444.7

#10


0  

Another solution:


 y <- c("1,200","20,000","100","12,111") 

 as.numeric(unlist(lapply( strsplit(y,","),paste, collapse="")))

It will be considerably slower than gsub, though.

#11


0  

It is not that complicated. Try this: y <- as.numeric(gsub(",", "", as.character(y))). If only one column of a data frame needs converting, select just that column, for example the second one with y[[2]] (y$2 is not valid R syntax): y[[2]] <- as.numeric(gsub(",", "", as.character(y[[2]])))
