I have an R data frame containing a factor that I want to "expand" so that for each factor level, there is an associated column in a new data frame, which contains a 1/0 indicator. E.g., suppose I have:
我有一个R数据框架,其中包含一个我想要“展开”的因子,以便对于每个因子级别,在一个新的数据框架中有一个相关的列,它包含一个1/0指示器。例如,假设我有:
df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
I want:
我想要:
df.desired <- data.frame(foo = c(1,1,0,0), bar=c(0,0,1,1), ham=c(1,2,3,4))
Because for certain analyses for which you need to have a completely numeric data frame (e.g., principal component analysis), I thought this feature might be built in. Writing a function to do this shouldn't be too hard, but I can foresee some challenges relating to column names and if something exists already, I'd rather use that.
因为对于某些分析,您需要拥有一个完整的数字数据框架(例如,主成分分析),我认为这个特性可能是内置的。编写这样的函数应该不会太难,但我可以预见到与列名相关的一些挑战,如果已经存在某些东西,我宁愿使用它。
8 个解决方案
#1
109
Use the model.matrix
function:
使用模型。矩阵函数:
model.matrix( ~ Species - 1, data=iris )
#2
15
If your data frame is only made of factors (or you are working on a subset of variables which are all factors), you can also use the acm.disjonctif
function from the ade4
package :
如果您的数据框架仅由因子组成(或者您正在处理所有因子的变量子集),您也可以使用acm。disjonctif函数来自ade4包:
R> library(ade4)
R> df <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red"))
R> acm.disjonctif(df)
eggs.bar eggs.foo ham.blue ham.green ham.red
1 0 1 0 0 1
2 0 1 1 0 0
3 1 0 0 1 0
4 1 0 0 0 1
Not exactly the case you are describing, but it can be useful too...
不完全是你所描述的情况,但它也可能是有用的……
#3
8
A quick way using the reshape2
package:
使用reshape2包的快速方法:
require(reshape2)
> dcast(df.original, ham ~ eggs, length)
Using ham as value column: use value_var to override.
ham bar foo
1 1 0 1
2 2 0 1
3 3 1 0
4 4 1 0
Note that this produces precisely the column names you want.
注意,这会生成您想要的列名。
#4
6
probably dummy variable is similar to what you want. Then, model.matrix is useful:
可能哑变量与你想要的相似。然后,模型。矩阵是有用的:
> with(df.original, data.frame(model.matrix(~eggs+0), ham))
eggsbar eggsfoo ham
1 0 1 1
2 0 1 2
3 1 0 3
4 1 0 4
#5
6
A late entry class.ind
from the nnet
package
晚条目类。从nnet包中获取。
library(nnet)
with(df.original, data.frame(class.ind(eggs), ham))
bar foo ham
1 0 1 1
2 0 1 2
3 1 0 3
4 1 0 4
#6
4
Just came across this old thread and thought I'd add a function that utilizes ade4 to take a dataframe consisting of factors and/or numeric data and returns a dataframe with factors as dummy codes.
刚刚遇到了这个旧线程,并认为我将添加一个函数,该函数利用ade4来获取由factors和/或数值数据组成的dataframe,并将一个dataframe作为虚拟代码返回。
dummy <- function(df) {
NUM <- function(dataframe)dataframe[,sapply(dataframe,is.numeric)]
FAC <- function(dataframe)dataframe[,sapply(dataframe,is.factor)]
require(ade4)
if (is.null(ncol(NUM(df)))) {
DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
names(DF)[1] <- colnames(df)[which(sapply(df, is.numeric))]
} else {
DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
}
return(DF)
}
Let's try it.
让我们试一试。
df <-data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"), x=rnorm(4))
dummy(df)
df2 <-data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"))
dummy(df2)
#7
1
Here is a more clear way to do it. I use model.matrix to create the dummy boolean variables and then merge it back into the original dataframe.
这里有一个更清晰的方法。我使用模型。矩阵创建假的布尔变量,然后将其合并回原始的dataframe。
df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
df.original
# eggs ham
# 1 foo 1
# 2 foo 2
# 3 bar 3
# 4 bar 4
# Create the dummy boolean variables using the model.matrix() function.
> mm <- model.matrix(~eggs-1, df.original)
> mm
# eggsbar eggsfoo
# 1 0 1
# 2 0 1
# 3 1 0
# 4 1 0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"
# Remove the "eggs" prefix from the column names as the OP desired.
colnames(mm) <- gsub("eggs","",colnames(mm))
mm
# bar foo
# 1 0 1
# 2 0 1
# 3 1 0
# 4 1 0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"
# Combine the matrix back with the original dataframe.
result <- cbind(df.original, mm)
result
# eggs ham bar foo
# 1 foo 1 0 1
# 2 foo 2 0 1
# 3 bar 3 1 0
# 4 bar 4 1 0
# At this point, you can select out the columns that you want.
#8
0
I needed a function to 'explode' factors that is a bit more flexible, and made one based on the acm.disjonctif function from the ade4 package. This allows you to choose the exploded values, which are 0 and 1 in acm.disjonctif. It only explodes factors that have 'few' levels. Numeric columns are preserved.
我需要一个函数来“引爆”更灵活的因素,并基于acm制作一个。disjonctif函数来自ade4包。这允许您选择在ac .disjonctif中为0和1的爆炸值。它只会爆炸那些“很少”水平的因素。数字列。
# Function to explode factors that are considered to be categorical,
# i.e., they do not have too many levels.
# - data: The data.frame in which categorical variables will be exploded.
# - values: The exploded values for the value being unequal and equal to a level.
# - max_factor_level_fraction: Maximum number of levels as a fraction of column length. Set to 1 to explode all factors.
# Inspired by the acm.disjonctif function in the ade4 package.
explode_factors <- function(data, values = c(-0.8, 0.8), max_factor_level_fraction = 0.2) {
exploders <- colnames(data)[sapply(data, function(col){
is.factor(col) && nlevels(col) <= max_factor_level_fraction * length(col)
})]
if (length(exploders) > 0) {
exploded <- lapply(exploders, function(exp){
col <- data[, exp]
n <- length(col)
dummies <- matrix(values[1], n, length(levels(col)))
dummies[(1:n) + n * (unclass(col) - 1)] <- values[2]
colnames(dummies) <- paste(exp, levels(col), sep = '_')
dummies
})
# Only keep numeric data.
data <- data[sapply(data, is.numeric)]
# Add exploded values.
data <- cbind(data, exploded)
}
return(data)
}
#1
109
Use the model.matrix
function:
使用模型。矩阵函数:
model.matrix( ~ Species - 1, data=iris )
#2
15
If your data frame is only made of factors (or you are working on a subset of variables which are all factors), you can also use the acm.disjonctif
function from the ade4
package :
如果您的数据框架仅由因子组成(或者您正在处理所有因子的变量子集),您也可以使用acm。disjonctif函数来自ade4包:
R> library(ade4)
R> df <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red"))
R> acm.disjonctif(df)
eggs.bar eggs.foo ham.blue ham.green ham.red
1 0 1 0 0 1
2 0 1 1 0 0
3 1 0 0 1 0
4 1 0 0 0 1
Not exactly the case you are describing, but it can be useful too...
不完全是你所描述的情况,但它也可能是有用的……
#3
8
A quick way using the reshape2
package:
使用reshape2包的快速方法:
require(reshape2)
> dcast(df.original, ham ~ eggs, length)
Using ham as value column: use value_var to override.
ham bar foo
1 1 0 1
2 2 0 1
3 3 1 0
4 4 1 0
Note that this produces precisely the column names you want.
注意,这会生成您想要的列名。
#4
6
probably dummy variable is similar to what you want. Then, model.matrix is useful:
可能哑变量与你想要的相似。然后,模型。矩阵是有用的:
> with(df.original, data.frame(model.matrix(~eggs+0), ham))
eggsbar eggsfoo ham
1 0 1 1
2 0 1 2
3 1 0 3
4 1 0 4
#5
6
A late entry class.ind
from the nnet
package
晚条目类。从nnet包中获取。
library(nnet)
with(df.original, data.frame(class.ind(eggs), ham))
bar foo ham
1 0 1 1
2 0 1 2
3 1 0 3
4 1 0 4
#6
4
Just came across this old thread and thought I'd add a function that utilizes ade4 to take a dataframe consisting of factors and/or numeric data and returns a dataframe with factors as dummy codes.
刚刚遇到了这个旧线程,并认为我将添加一个函数,该函数利用ade4来获取由factors和/或数值数据组成的dataframe,并将一个dataframe作为虚拟代码返回。
dummy <- function(df) {
NUM <- function(dataframe)dataframe[,sapply(dataframe,is.numeric)]
FAC <- function(dataframe)dataframe[,sapply(dataframe,is.factor)]
require(ade4)
if (is.null(ncol(NUM(df)))) {
DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
names(DF)[1] <- colnames(df)[which(sapply(df, is.numeric))]
} else {
DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
}
return(DF)
}
Let's try it.
让我们试一试。
df <-data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"), x=rnorm(4))
dummy(df)
df2 <-data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"))
dummy(df2)
#7
1
Here is a more clear way to do it. I use model.matrix to create the dummy boolean variables and then merge it back into the original dataframe.
这里有一个更清晰的方法。我使用模型。矩阵创建假的布尔变量,然后将其合并回原始的dataframe。
df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
df.original
# eggs ham
# 1 foo 1
# 2 foo 2
# 3 bar 3
# 4 bar 4
# Create the dummy boolean variables using the model.matrix() function.
> mm <- model.matrix(~eggs-1, df.original)
> mm
# eggsbar eggsfoo
# 1 0 1
# 2 0 1
# 3 1 0
# 4 1 0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"
# Remove the "eggs" prefix from the column names as the OP desired.
colnames(mm) <- gsub("eggs","",colnames(mm))
mm
# bar foo
# 1 0 1
# 2 0 1
# 3 1 0
# 4 1 0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"
# Combine the matrix back with the original dataframe.
result <- cbind(df.original, mm)
result
# eggs ham bar foo
# 1 foo 1 0 1
# 2 foo 2 0 1
# 3 bar 3 1 0
# 4 bar 4 1 0
# At this point, you can select out the columns that you want.
#8
0
I needed a function to 'explode' factors that is a bit more flexible, and made one based on the acm.disjonctif function from the ade4 package. This allows you to choose the exploded values, which are 0 and 1 in acm.disjonctif. It only explodes factors that have 'few' levels. Numeric columns are preserved.
我需要一个函数来“引爆”更灵活的因素,并基于acm制作一个。disjonctif函数来自ade4包。这允许您选择在ac .disjonctif中为0和1的爆炸值。它只会爆炸那些“很少”水平的因素。数字列。
# Function to explode factors that are considered to be categorical,
# i.e., they do not have too many levels.
# - data: The data.frame in which categorical variables will be exploded.
# - values: The exploded values for the value being unequal and equal to a level.
# - max_factor_level_fraction: Maximum number of levels as a fraction of column length. Set to 1 to explode all factors.
# Inspired by the acm.disjonctif function in the ade4 package.
explode_factors <- function(data, values = c(-0.8, 0.8), max_factor_level_fraction = 0.2) {
exploders <- colnames(data)[sapply(data, function(col){
is.factor(col) && nlevels(col) <= max_factor_level_fraction * length(col)
})]
if (length(exploders) > 0) {
exploded <- lapply(exploders, function(exp){
col <- data[, exp]
n <- length(col)
dummies <- matrix(values[1], n, length(levels(col)))
dummies[(1:n) + n * (unclass(col) - 1)] <- values[2]
colnames(dummies) <- paste(exp, levels(col), sep = '_')
dummies
})
# Only keep numeric data.
data <- data[sapply(data, is.numeric)]
# Add exploded values.
data <- cbind(data, exploded)
}
return(data)
}