Replace \\N with NA using R and data.table or Hive

Time: 2020-12-16 22:47:44

I have some data in the form:

          V1   V2  V3         V4  V5         V6   V7  V8  V9 V10 V11
 1. 14342667 4336 1.5 2015-10-03 \\N 2015-10-03  804 2.0 2.0   0 3.0
 2. 14342667 4336 0.8 2015-06-13 \\N 2015-06-11 2912 2.0 2.0   0 \\N
 3. 14342667 4336 0.5 2016-01-02 \\N 2015-12-27 1618 0.0 0.0   1 \\N
 4. 14342667 4336 0.7 2015-08-22 \\N 2015-08-22 1780 2.0 2.0   0 \\N
 5. 14342667 4336 0.9 2015-02-21 1.2 2015-02-17 1548 0.0 0.0   1 \\N
 6. 14342667 4336 1.0 2015-08-08 \\N 2015-08-06 1538 2.0 2.0   0 2.25
 7. 14342667 4336 0.9 2015-03-28 \\N 2015-03-24 2129 7.0 7.0   0 \\N
 8. 14342667 4336 0.8 2015-04-11 \\N 2015-04-11 2316 1.0 2.0   0 \\N

I want to replace instances of \\N with NA using R and data.table. These values appeared in place of blanks when I Sqooped the data into Hive.

I have tried things like data1 <- data1[, lapply(.SD, recode, '"\\N"=NA')] and data1 <- data1[, lapply(.SD, recode, '"\N"=NA')] using the car package. This works well for other replacements but fails in the current case with the error:

Error in FUN(X[[1L]], ...) : in recode term: "\N"=NA message: Error : '\N' is an unrecognized escape in character string starting ""\N"

I have also read solutions like x[x=="\\N"] <- NA, but I couldn't apply them successfully with lapply in data.table. I have also looked into a Hive-side solution, but apparently regexp_replace works on only one column at a time, not on all instances in all columns.

3 Solutions

#1


4  

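We can find the columns that contain this pattern with grepl and then use as.numeric on those columns. Values that cannot be parsed as numbers, such as the \\N entries, become NA (R emits a coercion warning).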

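library(data.table)
# Names of the columns in which at least one value contains a literal backslash
nm1 <- names(df1)[colSums(sapply(df1, grepl, pattern = "\\\\")) != 0]
# Convert those columns to numeric by reference
setDT(df1)[, (nm1) := lapply(.SD, as.numeric), .SDcols = nm1]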
df1
#         V1   V2  V3         V4  V5         V6   V7 V8 V9 V10  V11
#1: 14342667 4336 1.5 2015-10-03  NA 2015-10-03  804  2  2   0 3.00
#2: 14342667 4336 0.8 2015-06-13  NA 2015-06-11 2912  2  2   0   NA
#3: 14342667 4336 0.5 2016-01-02  NA 2015-12-27 1618  0  0   1   NA
#4: 14342667 4336 0.7 2015-08-22  NA 2015-08-22 1780  2  2   0   NA
#5: 14342667 4336 0.9 2015-02-21 1.2 2015-02-17 1548  0  0   1   NA
#6: 14342667 4336 1.0 2015-08-08  NA 2015-08-06 1538  2  2   0 2.25
#7: 14342667 4336 0.9 2015-03-28  NA 2015-03-24 2129  7  7   0   NA
#8: 14342667 4336 0.8 2015-04-11  NA 2015-04-11 2316  1  2   0   NA
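
The backslash counts are easy to misread, so here is a minimal sketch of how R string escapes relate to the pattern used above. The variable names are illustrative only, and whether the real data hold one or two backslashes is an assumption that should be checked with nchar().

```r
# In R source code, "\\" is ONE literal backslash.
one <- "\\N"      # one backslash followed by N (2 characters)
two <- "\\\\N"    # two backslashes followed by N (3 characters), as in df1 below

len_one <- nchar(one)  # 2
len_two <- nchar(two)  # 3

# As a regular expression, "\\\\" is the regex \\ and matches one literal
# backslash, so it flags both strings but not an ordinary number.
flags_regex <- grepl("\\\\", c(one, two, "1.2"))

# With fixed = TRUE the pattern is taken literally, so one escaped
# backslash is enough.
flags_fixed <- grepl("\\", c(one, two, "1.2"), fixed = TRUE)

# Neither string parses as a number, which is why as.numeric yields NA.
converted <- suppressWarnings(as.numeric(c(one, two, "1.2")))
```

The literal "\N" in the question fails before recode even runs, because \N is not a valid escape sequence in an R string.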

data

df1 <- structure(list(V1 = c(14342667L, 14342667L, 14342667L, 14342667L, 
14342667L, 14342667L, 14342667L, 14342667L), V2 = c(4336L, 4336L, 
4336L, 4336L, 4336L, 4336L, 4336L, 4336L), V3 = c(1.5, 0.8, 0.5, 
0.7, 0.9, 1, 0.9, 0.8), V4 = c("2015-10-03", "2015-06-13", "2016-01-02", 
"2015-08-22", "2015-02-21", "2015-08-08", "2015-03-28", "2015-04-11"
), V5 = c("\\\\N", "\\\\N", "\\\\N", "\\\\N", "1.2", "\\\\N", 
"\\\\N", "\\\\N"), V6 = c("2015-10-03", "2015-06-11", "2015-12-27", 
"2015-08-22", "2015-02-17", "2015-08-06", "2015-03-24", "2015-04-11"
), V7 = c(804L, 2912L, 1618L, 1780L, 1548L, 1538L, 2129L, 2316L
), V8 = c(2, 2, 0, 2, 0, 2, 7, 1), V9 = c(2, 2, 0, 2, 0, 2, 7, 
2), V10 = c(0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L), V11 = c("3.0", "\\\\N", 
"\\\\N", "\\\\N", "\\\\N", "2.25", "\\\\N", "\\\\N")), .Names = c("V1", 
"V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11"), 
 class = "data.frame", row.names = c("1.", 
"2.", "3.", "4.", "5.", "6.", "7.", "8."))

#2


2  

At @akrun's suggestion, here is another simple way:

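# Positions of the character columns
ccols = which(sapply(DF, class) == "character")
# Re-infer each column's type, treating \\N as NA; as.is=TRUE keeps other text as character
DF[ccols] <- lapply(DF[ccols], type.convert, na.strings="\\\\N", as.is=TRUE)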

Use str(DF) to see that the columns are whatever they should be (numeric or integer).
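
As a small illustration of what type.convert does per column, here is a sketch on two hand-made vectors. The vectors and the sentinel variable are hypothetical, the sentinel is assumed to be the two-backslash string from the sample data, and as.is = TRUE is an explicit choice made here so that non-numeric text stays character instead of becoming a factor.

```r
sentinel <- "\\\\N"   # assumed: two literal backslashes followed by N

num_col  <- c("3.0", sentinel, "2.25")
date_col <- c("2015-10-03", "2015-06-13", sentinel)

# The sentinel is read as NA; the remaining values decide the result type.
num_out  <- type.convert(num_col,  na.strings = sentinel, as.is = TRUE)
date_out <- type.convert(date_col, na.strings = sentinel, as.is = TRUE)

str(num_out)   # a numeric vector with one NA
str(date_out)  # still character, with one NA
```

Date-like columns stay character with this approach, so they would still need as.Date or similar if real dates are wanted.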


With data.table, this would look like

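library(data.table)
setDT(DF)
# Positions of the character columns
ccols = which(sapply(DF, class) == "character")
# Re-infer each column's type by reference, treating \\N as NA
DF[, (ccols) := lapply(.SD, type.convert, na.strings="\\\\N", as.is=TRUE), .SDcols=ccols]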
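
If the goal is only to blank out the sentinel without changing any column types, the x[x == "\\N"] <- NA idea from the question can be applied column by column with data.table's set(). This is a minimal sketch on a made-up three-row table, not the answerer's method, and the sentinel value is again an assumption to be matched against the real data.

```r
library(data.table)

sentinel <- "\\\\N"   # assumed: two literal backslashes followed by N

# Hypothetical miniature of the table in the question
dt <- data.table(
  V1 = c(14342667L, 14342667L, 14342667L),
  V4 = c("2015-10-03", sentinel, "2016-01-02"),
  V5 = c(sentinel, "1.2", sentinel)
)

# Only character columns can hold the sentinel string.
char_cols <- names(dt)[vapply(dt, is.character, logical(1))]

for (col in char_cols) {
  hits <- which(dt[[col]] == sentinel)
  # set() assigns by reference, so the table is not copied
  if (length(hits)) set(dt, i = hits, j = col, value = NA_character_)
}

dt
```

The columns remain character afterwards, so a numeric conversion such as the type.convert call above would still be a separate step.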

#3


0  

In Hive, NULL is stored in HDFS as \N, so if you are looking for a Hive-side solution you can also use the nvl() function to substitute a replacement value.
