I have a roster of employees, and I need to know which department each of them is in most often. It is trivial to tabulate employee ID against department name, but it is trickier to return the department name, rather than the number of roster counts, from the frequency table. A simple example is below (column names = departments, row names = employee IDs).
DF <- matrix(sample(1:9,9),ncol=3,nrow=3)
DF <- as.data.frame.matrix(DF)
> DF
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
Now how do I get
> DF2
RE
1 V3
2 V1
3 V2
4 Solutions
#1
51
One option using your data (for future reference, use set.seed() to make examples using sample reproducible):
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
colnames(DF)[apply(DF,1,which.max)]
[1] "V3" "V1" "V2"
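Applied to the roster problem from the question, the same idea might look like the sketch below. The long-format `roster` data frame and its `id` and `dept` columns are hypothetical, invented here for illustration; only `table()`, `apply()` and `which.max()` come from the discussion above.

```r
# Hypothetical roster: one row per rostered shift (assumed layout, not from the question)
roster <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3),
  dept = c("A", "A", "B", "B", "B", "C", "A", "C")
)

# Frequency table: rows are employee ids, columns are departments
tab <- table(roster$id, roster$dept)

# For each row, take the position of the largest count and map it to a column name
top_dept <- colnames(tab)[apply(tab, 1, which.max)]
names(top_dept) <- rownames(tab)
top_dept
```

Here employee 1 is most often in "A", employee 2 in "B" and employee 3 in "C". Note that which.max() silently returns the first maximum when counts are tied.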
A faster solution than using apply might be max.col:
colnames(DF)[max.col(DF,ties.method="first")]
#[1] "V3" "V1" "V2"
...where ties.method can be any of "random", "first" or "last".
This of course causes issues if you happen to have two columns which are equal to the maximum. I'm not sure what you want to do in that instance as you will have more than one result for some rows. E.g.:
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(7,6,4))
apply(DF,1,function(x) which(x==max(x)))
[[1]]
V2 V3
2 3
[[2]]
V1
1
[[3]]
V2
2
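If one value per row is still wanted when there are ties, two possible conventions are sketched below: collapse all tied column names into a single string, or pick one deterministically with ties.method. The variable names (`m`, `tied`, `first`, `last`) are illustrative only.

```r
DF <- data.frame(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(7, 6, 4))
m <- as.matrix(DF)

# Option 1: keep every tied column, joined into one string per row
tied <- apply(m, 1, function(x) paste(colnames(m)[x == max(x)], collapse = ","))
tied   # "V2,V3" "V1" "V2"

# Option 2: pick a single column per row, resolving ties explicitly
first <- colnames(m)[max.col(m, ties.method = "first")]   # "V2" "V1" "V2"
last  <- colnames(m)[max.col(m, ties.method = "last")]    # "V3" "V1" "V2"
```

Which convention is appropriate depends on what a tie should mean for the roster; neither is implied by the question.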
#2
9
If you're interested in a data.table solution, here's one. It's a bit tricky since you prefer to get the id for the first maximum. It would be much easier if you wanted the last maximum instead. Nevertheless, it's not that complicated and it's fast!
Here I've generated data of your dimensions (26746 * 18).
Data
set.seed(45)
DF <- data.frame(matrix(sample(10, 26746*18, TRUE), ncol=18))
data.table answer:
require(data.table)
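# Long format: one row per cell of DF. Note the naming: colid holds the row
# index of DF and rowid holds the column name of DF.
DT <- data.table(value=unlist(DF, use.names=FALSE),
                 colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
# Inner join: largest value per colid. Outer join: first rowid with that value.
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]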
Benchmarking:
# data.table solution
system.time({
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
})
# user system elapsed
# 0.174 0.029 0.227
# apply solution from @thelatemail
system.time(t2 <- colnames(DF)[apply(DF,1,which.max)])
# user system elapsed
# 2.322 0.036 2.602
identical(t1, t2)
# [1] TRUE
It's about 11 times faster on data of these dimensions, and data.table scales pretty well too.
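For comparison, the max.col approach from the first answer can be timed on the same data with a sketch like the one below. No timings are shown because they depend on the machine and R version; the names `t_apply`, `t_maxcol`, `a` and `b` are illustrative.

```r
set.seed(45)
DF <- data.frame(matrix(sample(10, 26746 * 18, TRUE), ncol = 18))

# Row-wise apply with which.max (first maximum per row)
t_apply <- system.time(a <- colnames(DF)[apply(DF, 1, which.max)])

# Vectorised max.col, with ties resolved to the first maximum
t_maxcol <- system.time(b <- colnames(DF)[max.col(DF, ties.method = "first")])

# Both approaches should agree on every row
identical(a, b)

t_apply
t_maxcol
```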
Edit: if any of the max ids is okay, then:
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid)), rowid, mult="last"]
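The same long-format idea can also be sketched with melt() and a grouped which.max(), which some readers may find easier to follow than the keyed joins. This is an alternative written for illustration on the small example data, not the method benchmarked above, and the column names `id`, `dept` and `cnt` are assumptions.

```r
library(data.table)

DF <- data.frame(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6, 4))
DT <- as.data.table(DF)
DT[, id := .I]   # row number as the employee id

# Wide to long: one row per (id, department) pair
long <- melt(DT, id.vars = "id", variable.name = "dept", value.name = "cnt")

# Per id, keep the department with the first maximum count
res <- long[, .(dept = as.character(dept[which.max(cnt)])), by = id]
res$dept   # "V3" "V1" "V2"
```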
#3
1
Based on the above suggestions, the following data.table solution worked very fast for me:
library(data.table)
set.seed(45)
DT <- data.table(matrix(sample(10, 10^7, TRUE), ncol=10))
system.time( DT[, MAX := colnames(.SD)[max.col(.SD, ties.method="first")]] )
user system elapsed
0.10 0.02 0.21
DT
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 MAX
1: 7 4 1 2 3 7 6 6 6 1 V1
2: 4 6 9 10 6 2 7 7 1 3 V4
3: 3 4 9 8 9 9 8 8 6 7 V3
4: 4 8 8 9 7 5 9 2 7 1 V4
5: 4 3 9 10 2 7 9 6 6 9 V4
---
999996: 4 6 10 5 4 7 3 8 2 8 V3
999997: 8 7 6 6 3 10 2 3 10 1 V6
999998: 2 3 2 7 4 7 5 2 7 3 V4
999999: 8 10 3 2 3 4 5 1 1 4 V2
1000000: 10 4 2 6 6 2 8 4 7 4 V1
It also has the advantage that you can always specify which columns .SD should consider by listing them in .SDcols:
DT[, MAX2 := colnames(.SD)[max.col(.SD, ties.method="first")], .SDcols = c("V9", "V10")]
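A small sketch of the same two calls on the three-row example data makes the effect of .SDcols easier to check by eye. The column name `MAX12` is made up for this illustration.

```r
library(data.table)

DT <- data.table(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6, 4))

# Without .SDcols, .SD holds every column of DT (all numeric at this point)
DT[, MAX := colnames(.SD)[max.col(.SD, ties.method = "first")]]

# With .SDcols, only V1 and V2 are compared
DT[, MAX12 := colnames(.SD)[max.col(.SD, ties.method = "first")],
   .SDcols = c("V1", "V2")]

DT$MAX     # "V3" "V1" "V2"
DT$MAX12   # "V2" "V1" "V2"
```

Once MAX has been added, .SD without .SDcols would also include that character column, so naming the numeric columns explicitly is the safer habit for any later call.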
#4
0
One solution could be to reshape the data from wide to long, putting all the departments in one column and the counts in another, group by the employee id (in this case, the row number), and then filter to the department(s) with the max value. There are a couple of options for handling ties with this approach too.
library(tidyverse)
# sample data frame with a tie
df <- tibble(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,5)) # tibble() replaces the deprecated data_frame()
# If you aren't worried about ties:
df %>%
rownames_to_column('id') %>% # creates an ID number
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(cnt))
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
# If you're worried about keeping ties:
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
filter(cnt == max(cnt)) %>% # top_n(cnt, n = 1) also works
arrange(id)
# A tibble: 4 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
4 3 V3 5.
# If you're worried about ties, but only want a certain department, you could use rank() and choose 'first' or 'last'
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
mutate(dept_rank = rank(-cnt, ties.method = "first")) %>% # or 'last'
filter(dept_rank == 1) %>%
select(-dept_rank)
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 2 V1 8.
2 3 V2 5.
3 1 V3 9.
# if you wanted to keep the original wide data frame
df %>%
rownames_to_column('id') %>%
left_join(
df %>%
rownames_to_column('id') %>%
gather(max_dept, max_cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(max_cnt)),
by = 'id'
)
# A tibble: 3 x 6
id V1 V2 V3 max_dept max_cnt
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 2. 7. 9. V3 9.
2 2 8. 3. 6. V1 8.
3 3 1. 5. 5. V2 5.
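In more recent tidyverse releases, gather() and top_n() are superseded by pivot_longer() and slice_max(). A sketch of the same approach with those functions follows, assuming dplyr 1.0 or later and tidyr 1.0 or later; the object names `long`, `one_per_id` and `with_ties` are illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Same sample data as above, with a tie in row 3
df <- tibble(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6, 5))

# Wide to long: one row per (id, department) pair
long <- df %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = "dept", values_to = "cnt")

# Exactly one row per id, even when counts are tied
one_per_id <- long %>%
  group_by(id) %>%
  slice_max(cnt, n = 1, with_ties = FALSE) %>%
  ungroup()

# Keep every department that shares the maximum count
with_ties <- long %>%
  group_by(id) %>%
  slice_max(cnt, n = 1, with_ties = TRUE) %>%
  ungroup()
```

With this data, `one_per_id` has three rows and `with_ties` has four, because employee 3 has the same count in V2 and V3.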