R: Collapse all columns by ID column [duplicate]

Time: 2022-02-07 22:55:28

This question already has an answer here:

I'm trying to do something similar to what's answered here, which gets me 80% of the way. I have a data frame with one ID column and multiple information columns. I'd like to roll up all of the other columns so that there's only one row for each ID, and multiple entries are separated by, for instance, a semicolon. Here's an example of what I have and what I want.

HAVE:

     ID  info1          info2
1 id101    one          first
2 id102   twoA second alias A
3 id102   twoB second alias B
4 id103 threeA  third alias A
5 id103 threeB  third alias B
6 id104   four         fourth
7 id105   five          fifth

WANT:

     ID          info1                          info2
1 id101            one                          first
2 id102     twoA; twoB second alias A; second alias B
3 id103 threeA; threeB   third alias A; third alias B
4 id104           four                         fourth
5 id105           five                          fifth

Here's the code used to generate those:

have <- data.frame(ID=paste0("id", c(101, 102, 102, 103, 103, 104, 105)),
                   info1=c("one", "twoA", "twoB", "threeA", "threeB", "four", "five"), 
                   info2=c("first", "second alias A", "second alias B", "third alias A", "third alias B", "fourth", "fifth"),
                   stringsAsFactors=FALSE)
want <- data.frame(ID=paste0("id", c(101:105)),
                   info1=c("one", "twoA; twoB", "threeA; threeB", "four", "five"), 
                   info2=c("first", "second alias A; second alias B", "third alias A; third alias B", "fourth", "fifth"),
                   stringsAsFactors=FALSE)

This question asked basically the same thing, but with only a single "info" column. I have multiple other columns and would like to do this for all of them.

Bonus points for doing this using dplyr.

5 Solutions

#1


15  

Here's an option using summarise_each (which makes it easy to apply the changes to all columns except the grouping variables) and toString:

require(dplyr)

have %>%
  group_by(ID) %>%
  summarise_each(funs(toString))

#Source: local data frame [5 x 3]
#
#     ID          info1                          info2
#1 id101            one                          first
#2 id102     twoA, twoB second alias A, second alias B
#3 id103 threeA, threeB   third alias A, third alias B
#4 id104           four                         fourth
#5 id105           five                          fifth

Or, if you want it separated by semicolons, you can use:

have %>%
  group_by(ID) %>%
  summarise_each(funs(paste(., collapse = "; ")))
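Note for newer dplyr versions: summarise_each() and funs() are deprecated in current dplyr releases. A minimal sketch of the same idea, assuming dplyr 1.0.0 or later, uses across() inside summarise(). The result name collapsed is only illustrative and is not part of the original answer.

```r
library(dplyr)

# Example data from the question
have <- data.frame(
  ID    = paste0("id", c(101, 102, 102, 103, 103, 104, 105)),
  info1 = c("one", "twoA", "twoB", "threeA", "threeB", "four", "five"),
  info2 = c("first", "second alias A", "second alias B",
            "third alias A", "third alias B", "fourth", "fifth"),
  stringsAsFactors = FALSE
)

# across(everything(), ...) applies the function to every non-grouping column
collapsed <- have %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ paste(.x, collapse = "; ")))

print(collapsed)
```

Because across(everything()) skips the grouping variable, this should pick up any number of info columns without naming them.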

#2


10  

Good old aggregate does this just fine:

aggregate(have[,2:3], by=list(have$ID), paste, collapse=";")
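The call above labels the grouping column Group.1 and hard-codes columns 2:3. A small variation, sketched here under the assumption that ID is the first column, names the grouping list so the output keeps the ID header, and drops the first column by position so any number of info columns is covered. The name collapsed_base is illustrative.

```r
# Example data from the question
have <- data.frame(
  ID    = paste0("id", c(101, 102, 102, 103, 103, 104, 105)),
  info1 = c("one", "twoA", "twoB", "threeA", "threeB", "four", "five"),
  info2 = c("first", "second alias A", "second alias B",
            "third alias A", "third alias B", "fourth", "fifth"),
  stringsAsFactors = FALSE
)

# have[-1] keeps every column except the first (ID);
# naming the list element gives the grouping column the name "ID"
collapsed_base <- aggregate(have[-1], by = list(ID = have$ID),
                            FUN = paste, collapse = "; ")

print(collapsed_base)
```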

Question is: does it scale?

#3


8  

Here's a data.table solution.

library(data.table)
setDT(have)[, lapply(.SD, paste, collapse = "; "), by = ID]
#       ID          info1                          info2
# 1: id101            one                          first
# 2: id102     twoA; twoB second alias A; second alias B
# 3: id103 threeA; threeB   third alias A; third alias B
# 4: id104           four                         fourth
# 5: id105           five                          fifth

#4


4  

Here is an SQL solution^1:

library(sqldf)
#Static solution
sqldf("
SELECT ID,
       GROUP_CONCAT(info1,';') as info1,
       GROUP_CONCAT(info2,';') as info2
FROM have
GROUP BY ID")

#Dynamic solution
concat_cols <- colnames(have)[2:ncol(have)]
group_concat <-
  paste(paste0("GROUP_CONCAT(",concat_cols,",';') as ", concat_cols),
        collapse = ",")
sqldf(
  paste("
      SELECT ID,",
      group_concat,"
      FROM have
      GROUP BY ID"))

# Same output for both static and dynamic solutions
#      ID         info1                         info2
# 1 id101           one                         first
# 2 id102     twoA;twoB second alias A;second alias B
# 3 id103 threeA;threeB   third alias A;third alias B
# 4 id104          four                        fourth
# 5 id105          five                         fifth

^1 - The data.table solution would probably perform better with millions of rows; luckily we don't have that many genes yet :)

#5


1  

library(stringr)
library(dplyr)
have %>% tbl_df %>% group_by(ID) %>% summarise_each(funs(str_c(., collapse="; ")))

Edit 1: tbl_df may not be needed, and instead of str_c from the stringr package you could use paste (in base). The code above groups by the ID column and then applies str_c (or paste) to each remaining column within each group.

Edit 2: Another solution using the data.table package:

library(data.table)
dtbl <- as.data.table(have)
dtbl[,lapply(.SD, function(x) paste(x,collapse=";")), by=ID]

The above may be faster, especially if you set the key:

setkey(dtbl, ID)

"Hybrid" solution: You can use the dplyr syntax for data.tables! For example:

dtbl %>% tbl_dt %>%
     group_by(ID) %>% 
     summarise_each(funs(paste(., collapse="; ")))
