My question is: I have a data frame with some factor variables. I now want to add a new column to this data frame that holds a running index within each combination of those factor variables.
data <- data.frame(fac1 = factor(rep(1:2, 5)), fac2 = sample(letters[1:3], 10, replace = TRUE))
This gives me something like:
fac1 fac2
1 1 a
2 2 c
3 1 b
4 2 a
5 1 c
6 2 b
7 1 a
8 2 a
9 1 b
10 2 c
What I want is a combination counter that counts the occurrences of each factor combination, like this:
fac1 fac2 counter
1 1 a 1
2 2 c 1
3 1 b 1
4 2 a 1
5 1 c 1
6 2 b 1
7 1 a 2
8 2 a 2
9 1 b 2
10 2 c 2
So far I have thought about using tapply to get the counter over all factor combinations, which works fine:
counter <-tapply(data$fac1, list(data$fac1,data$fac2), function(x) 1:length(x))
But I do not know how to assign the resulting list of counters (e.g. after unlisting it) back to the matching rows of the data frame without resorting to inefficient looping :)
4 solutions
#1
6
This is a job for the ave() function:
# Use set.seed for reproducible examples
# when random number generation is involved
set.seed(1)
myDF <- data.frame(fac1 = factor(rep(1:2, 7)),
fac2 = sample(letters[1:3], 14, replace = TRUE),
stringsAsFactors=FALSE)
myDF$counter <- ave(myDF$fac2, myDF$fac1, myDF$fac2, FUN = seq_along)
myDF
# fac1 fac2 counter
# 1 1 a 1
# 2 2 b 1
# 3 1 b 1
# 4 2 c 1
# 5 1 a 2
# 6 2 c 2
# 7 1 c 1
# 8 2 b 2
# 9 1 b 2
# 10 2 a 1
# 11 1 a 3
# 12 2 a 2
# 13 1 c 2
# 14 2 b 3
Note the use of stringsAsFactors=FALSE in the data.frame() step. ave() writes the result of FUN back into a copy of its first argument, so that argument must be able to hold the counts: a character vector can (the counter then comes back as character), while a factor cannot. If fac2 is a factor, you can still get the output with: myDF$counter <- ave(as.character(myDF$fac2), myDF$fac1, myDF$fac2, FUN = seq_along).
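If an integer counter is preferred, one option (a sketch, not part of the original answer) is to pass a row index as the first argument to ave(), so the result no longer depends on the type of fac2. The data frame `df` below is an assumption for illustration: it hard-codes the ten rows from the question's example instead of calling sample().

```r
# Hard-coded copy of the question's example data (assumed for illustration)
df <- data.frame(
  fac1 = factor(c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2)),
  fac2 = factor(c("a", "c", "b", "a", "c", "b", "a", "a", "b", "c"))
)

# seq_along(df$fac1) is an integer vector, so ave() hands back integers,
# numbered within each fac1/fac2 combination in row order
df$counter <- ave(seq_along(df$fac1), df$fac1, df$fac2, FUN = seq_along)
df
```

Because the first argument is only a container for the result, this form should work the same whether fac2 is stored as a factor or as character.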
#2
4
A data.table solution
library(data.table)
DT <- data.table(data)
DT[, counter := seq_len(.N), by = list(fac1, fac2)]
#3
0
This is a base R way that avoids (explicit) looping.
data$counter <- with(data, {
inter <- as.character(interaction(fac1, fac2))
names(inter) <- seq_along(inter)
inter.ordered <- inter[order(inter)]
counter <- with(rle(inter.ordered), unlist(sapply(lengths, sequence)))
counter[match(names(inter), names(inter.ordered))]
})
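The key step in the code above is turning the run lengths of the sorted combinations into within-run counters. A minimal sketch of just that step, using a made-up vector `sorted_keys` (an assumption for illustration, standing in for inter.ordered):

```r
# Already-sorted combination keys (hypothetical example values)
sorted_keys <- c("1.a", "1.a", "1.a", "1.b", "2.c", "2.c")

# rle() reports how long each run of identical keys is
runs <- rle(sorted_keys)
runs$lengths

# sequence() expands each run length n into 1..n and concatenates the results
within_run <- sequence(runs$lengths)
within_run
```

Since sequence() accepts the whole vector of lengths at once, unlist(sapply(lengths, sequence)) could also be written as sequence(lengths).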
#4
0
Here is a variant with a little looping (I have renamed your variable to "x", since "data" is already the name of a base R function):
x <- data.frame(fac1 = rep(1:2, 5), fac2 = sample(letters[1:3], 10, replace = TRUE))
# Build one key per factor combination
x$fac3 <- paste(x$fac1, x$fac2, sep = "")
x$ctr <- 1
y <- table(x$fac3)
# Number the rows within each combination
for (i in seq_along(rownames(y))) {
  idx <- x$fac3 == rownames(y)[i]
  x$ctr[idx] <- seq_len(sum(idx))
}
x <- x[-3]  # drop the helper column fac3
I have no idea whether this is efficient on a large data.frame, but it works!