PROBLEM STATEMENT: Generating dummy variables based on values in multiple columns.
问题语句:基于多个列中的值生成虚拟变量。
To assign values (more like dummy variables) to columns based on their presence in other “multiple columns". The following code uses data frames.
根据列在其他“多列”中的存在为列赋值(更像虚拟变量)。下面的代码使用数据帧。
Explanation:
解释:
- V2 column represents value 2. If the variables A1 or A4, either has value 2 then V2=1 and V1, V3:V12=0
- V2列表示值2。如果变量A1或A4,那么V2=1和V1, V3:V12=0
- Similarly if A1=1 and A2 =4, then V1=1 ,V4=1 and V2,V3, V5:V12=0
- 同样的,如果A1=1和A2 =4,那么V1=1,V4=1, V2,V3, V5:V12=0。
Code is given to explain the output desired.
给出了解释所需输出的代码。
set.seed(12345)
df<- data.frame(A1=c(1L,2L),A2=LETTERS[1:3],A3=round(rnorm(4),4),A4=1:12)
df
names= paste0("V",c(1:12))
df[,c(names)]=0
for ( i in 1:nrow(df)){ df[i,c(names)]=match(c(1:12),df[i,c("A1","A4")])}
df[,c(names)][!is.na(df[,c(names)])]=1
df[,c(names)][is.na(df[,c(names)])]=0
df
I would like to have suggestions for code using data table : = operator so that process can be faster. Thanks
我想对使用data table: = operator的代码提出一些建议,这样可以更快地进行处理。谢谢
1 个解决方案
#1
3
We can use lapply
to loop the columns 'A1' and 'A4' of df
, compare with the values 1:12 with sapply
, Use Reduce
with |
and collapse the list
output to a single matrix. The +
is for converting logical matrix to binary format. In the last step we cbind
with the original dataset
我们可以使用lapply循环df的列'A1'和'A4',与sapply的值1:12相比,使用Reduce和|,将列表输出折叠成一个矩阵。+表示将逻辑矩阵转换为二进制格式。在最后一步中,我们使用原始数据集进行cbind
cbind(df, +(Reduce('|', lapply(df[c(1,4)], function(x) sapply(1:12, '==', x)))))
Another base R
option without looping will be table
. We unlist
the columns of interest i.e. 'A1', 'A4', get the table
with 1:12
values, double negate (!!
) to make '0' values FALSE and all other TRUE, use +
to coerce the logical matrix to binary 1/0
, and cbind
with the original dataset.
另一个没有循环的基本R选项是table。我们列出感兴趣的列。“A1”,“A4”,以1:12的值得到表格,双重否定(!!),使“0”值为假,所有其他为真,使用+将逻辑矩阵强制为二进制1/0,并与原始数据集绑定。
subDF <- df[c('A1', 'A4')]
newdf <- cbind(df, +(!!table(rep(1:12, ncol(subDF)), unlist(subDF))))
colnames(newdf)[5:ncol(newdf)] <- paste0('V', 1:12)
newdf
# A1 A2 A3 A4 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#1 1 A 0.5855 1 1 0 0 0 0 0 0 0 0 0 0 0
#2 2 B 0.7095 2 0 1 0 0 0 0 0 0 0 0 0 0
#3 1 C -0.1093 3 1 0 1 0 0 0 0 0 0 0 0 0
#4 2 A -0.4535 4 0 1 0 1 0 0 0 0 0 0 0 0
#5 1 B 0.5855 5 1 0 0 0 1 0 0 0 0 0 0 0
#6 2 C 0.7095 6 0 1 0 0 0 1 0 0 0 0 0 0
#7 1 A -0.1093 7 1 0 0 0 0 0 1 0 0 0 0 0
#8 2 B -0.4535 8 0 1 0 0 0 0 0 1 0 0 0 0
#9 1 C 0.5855 9 1 0 0 0 0 0 0 0 1 0 0 0
#10 2 A 0.7095 10 0 1 0 0 0 0 0 0 0 1 0 0
#11 1 B -0.1093 11 1 0 0 0 0 0 0 0 0 0 1 0
#12 2 C -0.4535 12 0 1 0 0 0 0 0 0 0 0 0 1
We can also use data.table
. I am not sure whether this is very efficient as we do table
inside the data.table
. The approach would be to first convert the 'data.frame' to 'data.table' (setDT(df)
), unlist
the columns specified in the .SDcols
, get the seq_len
of number of rows (.N
) i.e. 1:12 in the example, replicate (rep
) it by the length
of 'nm1', and get the table
.
我们也可以使用data.table。我不确定这是否非常有效,因为我们在data.table中做table。方法是首先将“data.frame”转换为“data”。表' (setDT(df)),列出. sdcols中指定的列,获取行数(. n)的seq_len (. n),例如,以“nm1”的长度复制(rep),并得到该表。
We create a data.table
from the table
class (split(tbl..
), by looping through the columns using a for
loop, we set
the values to binary 0/1
. The set
approach is efficient as it avoids the overhead of [.data.table
. Later, we can cbind
with the original dataset.
我们创建一个数据。表类(split(tbl.. .)中的表,通过使用for循环遍历列,我们将值设置为二进制0/1。set方法是有效的,因为它避免了[.data.table. .表的开销。稍后,我们可以使用原始数据集进行cbind。
library(data.table)
nm1 <- c('A1', 'A4')
tbl <- setDT(df)[, table(rep(seq_len(.N),length(nm1)), unlist(.SD)), .SDcols=nm1]
dt1 <- setDT(split(tbl, col(tbl)))[]
for(j in seq_along(dt1)) {
set(dt1, i=NULL, j=j, value=+(!!dt1[[j]]))
}
cbind(df, dt1)
#1
3
We can use lapply
to loop the columns 'A1' and 'A4' of df
, compare with the values 1:12 with sapply
, Use Reduce
with |
and collapse the list
output to a single matrix. The +
is for converting logical matrix to binary format. In the last step we cbind
with the original dataset
我们可以使用lapply循环df的列'A1'和'A4',与sapply的值1:12相比,使用Reduce和|,将列表输出折叠成一个矩阵。+表示将逻辑矩阵转换为二进制格式。在最后一步中,我们使用原始数据集进行cbind
cbind(df, +(Reduce('|', lapply(df[c(1,4)], function(x) sapply(1:12, '==', x)))))
Another base R
option without looping will be table
. We unlist
the columns of interest i.e. 'A1', 'A4', get the table
with 1:12
values, double negate (!!
) to make '0' values FALSE and all other TRUE, use +
to coerce the logical matrix to binary 1/0
, and cbind
with the original dataset.
另一个没有循环的基本R选项是table。我们列出感兴趣的列。“A1”,“A4”,以1:12的值得到表格,双重否定(!!),使“0”值为假,所有其他为真,使用+将逻辑矩阵强制为二进制1/0,并与原始数据集绑定。
subDF <- df[c('A1', 'A4')]
newdf <- cbind(df, +(!!table(rep(1:12, ncol(subDF)), unlist(subDF))))
colnames(newdf)[5:ncol(newdf)] <- paste0('V', 1:12)
newdf
# A1 A2 A3 A4 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#1 1 A 0.5855 1 1 0 0 0 0 0 0 0 0 0 0 0
#2 2 B 0.7095 2 0 1 0 0 0 0 0 0 0 0 0 0
#3 1 C -0.1093 3 1 0 1 0 0 0 0 0 0 0 0 0
#4 2 A -0.4535 4 0 1 0 1 0 0 0 0 0 0 0 0
#5 1 B 0.5855 5 1 0 0 0 1 0 0 0 0 0 0 0
#6 2 C 0.7095 6 0 1 0 0 0 1 0 0 0 0 0 0
#7 1 A -0.1093 7 1 0 0 0 0 0 1 0 0 0 0 0
#8 2 B -0.4535 8 0 1 0 0 0 0 0 1 0 0 0 0
#9 1 C 0.5855 9 1 0 0 0 0 0 0 0 1 0 0 0
#10 2 A 0.7095 10 0 1 0 0 0 0 0 0 0 1 0 0
#11 1 B -0.1093 11 1 0 0 0 0 0 0 0 0 0 1 0
#12 2 C -0.4535 12 0 1 0 0 0 0 0 0 0 0 0 1
We can also use data.table
. I am not sure whether this is very efficient as we do table
inside the data.table
. The approach would be to first convert the 'data.frame' to 'data.table' (setDT(df)
), unlist
the columns specified in the .SDcols
, get the seq_len
of number of rows (.N
) i.e. 1:12 in the example, replicate (rep
) it by the length
of 'nm1', and get the table
.
我们也可以使用data.table。我不确定这是否非常有效,因为我们在data.table中做table。方法是首先将“data.frame”转换为“data”。表' (setDT(df)),列出. sdcols中指定的列,获取行数(. n)的seq_len (. n),例如,以“nm1”的长度复制(rep),并得到该表。
We create a data.table
from the table
class (split(tbl..
), by looping through the columns using a for
loop, we set
the values to binary 0/1
. The set
approach is efficient as it avoids the overhead of [.data.table
. Later, we can cbind
with the original dataset.
我们创建一个数据。表类(split(tbl.. .)中的表,通过使用for循环遍历列,我们将值设置为二进制0/1。set方法是有效的,因为它避免了[.data.table. .表的开销。稍后,我们可以使用原始数据集进行cbind。
library(data.table)
nm1 <- c('A1', 'A4')
tbl <- setDT(df)[, table(rep(seq_len(.N),length(nm1)), unlist(.SD)), .SDcols=nm1]
dt1 <- setDT(split(tbl, col(tbl)))[]
for(j in seq_along(dt1)) {
set(dt1, i=NULL, j=j, value=+(!!dt1[[j]]))
}
cbind(df, dt1)