Assigning values to multiple columns row-wise

Time: 2021-06-30 13:17:19

PROBLEM STATEMENT: Generating dummy variables based on values in multiple columns.

The goal is to assign values (essentially dummy variables) to new columns based on whether each value is present in several other columns. The following code uses data frames.

Explanation:

  • Column V2 represents the value 2. If either of the variables A1 or A4 has the value 2, then V2=1 and V1, V3:V12=0.
  • Similarly, if A1=1 and A4=4, then V1=1, V4=1 and V2, V3, V5:V12=0.

The code below produces the desired output.

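set.seed(12345)
# Example data: the shorter vectors are recycled to 12 rows
df <- data.frame(A1 = c(1L, 2L), A2 = LETTERS[1:3], A3 = round(rnorm(4), 4), A4 = 1:12)
df
# Names of the indicator columns V1..V12 (renamed from 'names' to avoid masking base::names)
vnames <- paste0("V", 1:12)
df[, vnames] <- 0
# For each row, match() returns NA for every value that is absent from A1 and A4
for (i in 1:nrow(df)) { df[i, vnames] <- match(1:12, df[i, c("A1", "A4")]) }
# Recode: found -> 1, not found (NA) -> 0
df[, vnames][!is.na(df[, vnames])] <- 1
df[, vnames][is.na(df[, vnames])] <- 0
df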

I would like suggestions for code that uses the data.table := operator so that the process can be faster. Thanks.

1 solution

#1


3  

We can use lapply to loop over the columns 'A1' and 'A4' of df and compare each with the values 1:12 using sapply, which yields one logical matrix per column. Reduce with | then collapses the list output into a single matrix. The + converts the logical matrix to binary format. In the last step we cbind the result with the original dataset.

cbind(df, +(Reduce('|', lapply(df[c(1,4)], function(x) sapply(1:12, '==', x)))))
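To make the intermediate shapes visible, here is a minimal sketch of the same lapply/sapply/Reduce idea on a smaller data frame. The names toy, vals, per_col and ind are hypothetical and chosen only for illustration; the values are limited to 1:4 so the result is easy to read.

```r
# Hypothetical toy data: two value columns, values limited to 1:4
toy <- data.frame(A1 = c(1L, 2L, 1L), A4 = c(3L, 2L, 4L))
vals <- 1:4

# One logical matrix per column: rows are data rows, columns are candidate values
per_col <- lapply(toy, function(x) sapply(vals, `==`, x))

# Element-wise OR across the per-column matrices, then unary + to get 0/1 integers
ind <- +Reduce(`|`, per_col)
colnames(ind) <- paste0("V", vals)
ind
```

Each row of ind has a 1 in the column of every value found in A1 or A4 of that row, which is the same structure that the one-liner above binds to df.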

Another base R option, without an explicit loop, is table. We unlist the columns of interest, i.e. 'A1' and 'A4', tabulate their values against the row index (1:12, repeated once per column), double negate (!!) to make '0' counts FALSE and all others TRUE, use + to coerce the logical matrix to binary 1/0, and cbind with the original dataset.

subDF <- df[c('A1', 'A4')]
newdf <- cbind(df, +(!!table(rep(1:12, ncol(subDF)), unlist(subDF))))
colnames(newdf)[5:ncol(newdf)] <- paste0('V', 1:12)
newdf
#    A1 A2      A3 A4 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#1   1  A  0.5855  1  1  0  0  0  0  0  0  0  0   0   0   0
#2   2  B  0.7095  2  0  1  0  0  0  0  0  0  0   0   0   0
#3   1  C -0.1093  3  1  0  1  0  0  0  0  0  0   0   0   0
#4   2  A -0.4535  4  0  1  0  1  0  0  0  0  0   0   0   0
#5   1  B  0.5855  5  1  0  0  0  1  0  0  0  0   0   0   0
#6   2  C  0.7095  6  0  1  0  0  0  1  0  0  0   0   0   0
#7   1  A -0.1093  7  1  0  0  0  0  0  1  0  0   0   0   0
#8   2  B -0.4535  8  0  1  0  0  0  0  0  1  0   0   0   0
#9   1  C  0.5855  9  1  0  0  0  0  0  0  0  1   0   0   0
#10  2  A  0.7095 10  0  1  0  0  0  0  0  0  0   1   0   0
#11  1  B -0.1093 11  1  0  0  0  0  0  0  0  0   0   1   0
#12  2  C -0.4535 12  0  1  0  0  0  0  0  0  0   0   0   1
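One thing to keep in mind with the table approach is that table only creates a column for values that actually occur. In the example every value in 1:12 appears in 'A4', so all twelve columns are present. A possible variation, assuming the full set of candidate values is known in advance, is to convert the values to a factor with explicit levels first. The names toy, vals, row_id, value, counts and ind are hypothetical.

```r
# Hypothetical data in which the values 3 and 5 never occur
toy <- data.frame(A1 = c(1L, 2L, 1L), A4 = c(4L, 2L, 6L))
vals <- 1:6

# Row index repeated once per column, aligned with unlist(toy)
row_id <- rep(seq_len(nrow(toy)), ncol(toy))
# Explicit levels keep a column for every candidate value, even unused ones
value <- factor(unlist(toy), levels = vals)

counts <- table(row_id, value)
# Counts > 0 become 1, zero counts stay 0; rebuild a plain integer matrix
ind <- matrix(as.integer(counts > 0), nrow = nrow(toy),
              dimnames = list(NULL, paste0("V", vals)))
ind
```

Here the columns V3 and V5 are kept and contain only zeros, whereas a plain table call on the unlisted values would leave them out.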

We can also use data.table. I am not sure whether this is very efficient, as we call table inside the data.table. The approach is to first convert the 'data.frame' to a 'data.table' (setDT(df)), unlist the columns specified in .SDcols, take the seq_len of the number of rows (.N), i.e. 1:12 in the example, replicate (rep) it by the length of 'nm1', and get the table.

We then create a data.table from the table object (split(tbl, col(tbl))) and, looping through the columns with a for loop, set the values to binary 0/1. The set approach is efficient as it avoids the overhead of [.data.table. Finally, we cbind with the original dataset.

library(data.table)
nm1 <- c('A1', 'A4')
tbl <- setDT(df)[, table(rep(seq_len(.N),length(nm1)), unlist(.SD)), .SDcols=nm1]

dt1 <- setDT(split(tbl, col(tbl)))[]
for(j in seq_along(dt1)) {
       set(dt1, i=NULL, j=j, value=+(!!dt1[[j]]))
}

cbind(df, dt1)
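Since the question asks specifically about the := operator, the following is a minimal sketch of one way to add the indicator columns by reference, without table or cbind. It is not part of the answer above and has not been benchmarked against it. The names dt, src, vals and vnames are hypothetical, and the example assumes the candidate values are 1:12 as in the question.

```r
library(data.table)

# Hypothetical data mirroring the question's A1 and A4 columns
dt <- data.table(A1 = rep(1:2, 6), A4 = 1:12)
src <- c("A1", "A4")          # columns to search
vals <- 1:12                  # candidate values
vnames <- paste0("V", vals)   # indicator column names

# For each candidate value v, flag rows where any source column equals v,
# and assign all twelve indicator columns by reference in one call
dt[, (vnames) := lapply(vals, function(v) {
  as.integer(Reduce(`|`, lapply(.SD, `==`, v)))
}), .SDcols = src]
dt
```

The parentheses around vnames tell data.table to use the contents of the character vector as the new column names rather than creating a single column called vnames.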
