R data.table: impute missing values for multiple sets of columns

Time: 2021-03-06 22:51:46

I want to impute missing values for several sets of columns. For numeric variables I want to replace NA with the median, and for categorical variables I want to replace NA with the mode. I searched for a way to impute separately for different sets of columns but did not find one.

My data is large, with many columns, so I keep it in a data.table. Since I am not sure how to do this in data.table, I tried the base R code below, but I seem to be getting the column name identification wrong.

I store the names of the numeric variables in the vector var_num and the names of the categorical variables in the vector var_chr.

Please see my sample code below:

library(data.table)
set.seed(1200)
id <- 1:100
bills <- sample(c(1:20,NA),100,replace = T)
nos <- sample(c(1:80,NA),100,replace = T)
stru <- sample(c("A","B","C","D",NA),100,replace = T)
type <- sample(c(1:7,NA),100,replace = T)
value <- sample(c(100:1000,NA),100,replace = T)

df1 <- as.data.table(data.frame(id,bills,nos,stru,type,value))
class(df1)

var_num <- c("bills","nos","value")
var_chr <- c("stru","type")

impute <- function(x){
  #print(x)
  if(colnames(x) %in% var_num){
    x[is.na(x)] = median(x,na.rm = T)
  } else if (colnames(x) %in% var_chr){
    x[is.na(x)] = mode(x)
  } else {
    x #if not part of var_num and var_chr then nothing needs to be done and return the original value
  }
  return(x)
}


df1_imp_med <- data.frame(apply(df1,2,impute))

When I run the above, I get this error: Error in if (colnames(x) %in% var_num) { : argument is of length zero

Please help me understand how I can correct this and achieve my requirement.

4 solutions

#1


3  

As suggested in the comments, you can combine a for loop with set() in data.table for faster imputation:

for (k in names(df1)) {
  if (k %in% var_num) {
    # impute numeric variables with the median
    med <- median(df1[[k]], na.rm = TRUE)
    set(x = df1, i = which(is.na(df1[[k]])), j = k, value = med)
  } else if (k %in% var_chr) {
    # impute categorical variables with the mode (most frequent value);
    # names() returns character, so a non-character column may need converting
    most_common <- names(which.max(table(df1[[k]])))
    set(x = df1, i = which(is.na(df1[[k]])), j = k, value = most_common)
  }
}

#2


1  

It may or may not be worth your time to write a single function for both use cases. A direct (but specific) solution is below. Note that mode() may not behave as you expect: in base R it returns the storage mode of an object (for example "numeric" or "character"), not the most frequent value; see ?mode.

library(data.table)

set.seed(1200)
df1 <- data.table(
  id = 1:100,
  bills = sample(c(1:20, NA), 100, replace = TRUE),
  nos = sample(c(1:80, NA), 100, replace = TRUE),
  stru = sample(c("A", "B", "C", "D", NA), 100, replace = TRUE),
  type = sample(c(as.character(1:7), NA), 100, replace = TRUE),
  value = sample(c(100:1000, NA), 100, replace = TRUE)
)

# Function to calculate the most frequent object in a vector:
getMode <- function(myvector) {
    mytable <- table(myvector)
    return(names(mytable)[which.max(mytable)])
}

# replace na values by reference, with `:=`
df1[is.na(bills), bills := median(df1[,bills], na.rm=T)]
df1[is.na(nos), nos := median(df1[,nos], na.rm=T)]
df1[is.na(value), value := median(df1[,value], na.rm=T)]
df1[is.na(stru), stru := getMode(df1[,stru])]
df1[is.na(type), type := getMode(df1[,type])]
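
To make the note about mode concrete, here is a minimal sketch of the difference between base R's mode() and a "most frequent value" helper. The vector v is made up for illustration and is not part of the question's data.

```r
# Hypothetical example vector, not taken from the question
v <- c("A", "B", "B", NA, "B", "C")

# base::mode() describes how the object is stored, not its most common value
storage <- mode(v)

# Most frequent non-missing value; table() drops NA by default
getMode <- function(myvector) {
  mytable <- table(myvector)
  names(mytable)[which.max(mytable)]
}
most_common <- getMode(v)

print(storage)      # "character"
print(most_common)  # "B"
```

Because names() always returns character, a categorical column stored as numbers would need converting back, for example with as.integer().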

#3


0  

I managed to get a working solution. The key was to refer to the variables listed in var_num and var_chr for numeric and categorical imputation; variables not listed in these vectors are left alone.

The challenge I was facing was referring to them inside the function. I dropped the idea of writing a function and wrote a for loop instead:

df1 <- as.data.frame(df1)

for (var in 1:ncol(df1)) {
  if (names(df1[var]) %in% var_num) {
    df1[is.na(df1[,var]),var] <- median(df1[,var], na.rm = TRUE)
  } else if (names(df1[var]) %in% var_chr) {
    df1[is.na(df1[,var]),var] <- names(which.max(table(df1[,var])))
  }
}

This for loop does the needed imputation.
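
For context on why the apply() attempt in the question could not see the column names, here is a small sketch. The table toy is a made-up stand-in; only var_num follows the question.

```r
# Same lookup vector as in the question
var_num <- c("bills", "nos", "value")

# Hypothetical two-column table, just to show what apply() hands over
toy <- data.frame(bills = c(1, NA, 3), stru = c("A", NA, "B"))

# apply() converts its input to a matrix and passes each column as a bare vector,
# so colnames() inside the function finds nothing
seen <- apply(toy, 2, function(x) is.null(colnames(x)))
print(seen)  # TRUE for every column

# NULL %in% var_num is a zero-length logical, which if() cannot test
cond <- colnames(c(1, NA, 3)) %in% var_num
print(length(cond))  # 0
```

apply() also turns a table with mixed column types into a character matrix, so the numeric columns would arrive as text. Looping over names(df1), as above, avoids both problems.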

If there is a simpler and more concise way of achieving this, do let me know. Maybe one of the apply family of functions would do the trick.
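
One possible shape for an apply-style version, staying in data.table, is lapply() over .SD with .SDcols. This is only a sketch: the table dt and the helpers imp_median and imp_mode are made-up names for illustration, and the numeric column is stored as double.

```r
library(data.table)

# Small made-up table; column names follow the question
dt <- data.table(
  id    = 1:6,
  bills = c(2, NA, 4, 10, NA, 6),
  stru  = c("A", "B", NA, "B", "C", NA)
)
var_num <- "bills"
var_chr <- "stru"

# Hypothetical helper: fill NA with the median of the observed values
imp_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

# Hypothetical helper: fill NA with the most frequent observed value
imp_mode <- function(v) {
  v[is.na(v)] <- names(which.max(table(v)))
  v
}

# Update each group of columns by reference
dt[, (var_num) := lapply(.SD, imp_median), .SDcols = var_num]
dt[, (var_chr) := lapply(.SD, imp_mode), .SDcols = var_chr]

print(dt)
```

Each group of columns is handled in one statement, so var_num and var_chr can grow without any change to the code.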

#4


0  

Another option, using lapply:

invisible(lapply(c(var_num, var_chr), function(k) {
  # ifelse() cannot return a function, so pick the imputer with if/else
  imp.fun <- if (k %in% var_num) {
    function(v) median(v, na.rm = TRUE)
  } else {
    function(v) names(which.max(table(v)))
  }
  df1[is.na(df1[[k]]), (k) := imp.fun(df1[[k]])]
}))
