Most of the approaches I've come across use dplyr to apply a function when combining features. However, I would just like to restructure a single data frame without applying any function to each group.
I have a single data frame that looks like this:
gene_name chr nb_pos nb_ref nb_alt m_pos m_ref m_alt
ACAA1 3 38173733 C T 38144875 G T
ACAA1 3 38144875 G T 38144876 G A
I would like to combine the rows that share a common gene_name and chr, where each gene can have a variable number of rows, so that the result looks like this:
gene_name chr nb_pos1 nb_ref1 nb_alt1 nb_pos2 nb_ref2 nb_alt2 nb_alt2
ACAA1 3 38173733 C T 38144875 G T T
Does anyone know of a way to do this?
1 solution
#1
We can use dcast from the devel version of data.table, i.e. v1.9.5. Instructions to install it are here.
Create a sequence column ('ind') based on the grouping columns ('gene_name', 'chr'), and then use dcast, specifying the value.var columns.
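library(data.table)
# Add a within-group row counter 'ind', then cast every value column to wide format
dcast(setDT(df1)[, ind := seq_len(.N), by = .(gene_name, chr)],
      gene_name + chr ~ ind, value.var = names(df1)[3:8])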
# gene_name chr 1_nb_pos 2_nb_pos 1_nb_ref 2_nb_ref 1_nb_alt 2_nb_alt 1_m_pos
#1: ACAA1 3 38173733 38144875 C G TRUE TRUE 38144875
# 2_m_pos 1_m_ref 2_m_ref 1_m_alt 2_m_alt
#1: 38144876 G G T A
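As a rough sketch of the same idea on data where genes have different numbers of rows, the explicit 'ind' column can be replaced by rowid() inside the dcast formula. This assumes a data.table release that provides rowid() (1.9.8 or later, as far as I know); the table `dt` and its columns `pos` and `ref` are made up for illustration and are not from the question.

```r
library(data.table)

# Toy data (hypothetical): gene "A" has two rows, gene "B" has only one
dt <- data.table(gene_name = c("A", "A", "B"),
                 chr       = c(1L, 1L, 2L),
                 pos       = c(10L, 20L, 30L),
                 ref       = c("C", "G", "A"))

# rowid() numbers the rows within each gene_name/chr group,
# so no separate 'ind' column has to be created first
wide_dt <- dcast(dt, gene_name + chr ~ rowid(gene_name, chr),
                 value.var = c("pos", "ref"))
print(wide_dt)
```

Groups with fewer rows than the largest group are padded with NA in the extra columns, so a variable number of rows per gene should not be a problem.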
Or using reshape from base R after we create the sequence column using ave.
df2 <- transform(df1, ind=ave(seq_along(gene_name),
gene_name, chr, FUN=seq_along))
reshape(df2, idvar=c('gene_name', 'chr'), timevar='ind',
direction='wide')
# gene_name chr nb_pos.1 nb_ref.1 nb_alt.1 m_pos.1 m_ref.1 m_alt.1 nb_pos.2
#1 ACAA1 3 38173733 C TRUE 38144875 G T 38144875
# nb_ref.2 nb_alt.2 m_pos.2 m_ref.2 m_alt.2
#1 G TRUE 38144876 G A
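To check how the base R route behaves when genes have different numbers of rows, here is a minimal sketch on made-up data. The data frame `toy` and its columns `pos` and `ref` are hypothetical and only stand in for the real value columns.

```r
# Toy data (hypothetical): gene "A" has two rows, gene "B" has only one
toy <- data.frame(gene_name = c("A", "A", "B"),
                  chr       = c(1L, 1L, 2L),
                  pos       = c(10L, 20L, 30L),
                  ref       = c("C", "G", "A"),
                  stringsAsFactors = FALSE)

# Within-group row counter, the same idea as 'ind' above
toy$ind <- ave(seq_along(toy$gene_name), toy$gene_name, toy$chr,
               FUN = seq_along)

# One output row per gene_name/chr; one set of value columns per 'ind'
wide <- reshape(toy, idvar = c("gene_name", "chr"), timevar = "ind",
                direction = "wide")
print(wide)
```

Gene "B" has no second row, so its `pos.2` and `ref.2` entries come out as NA rather than causing an error.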
data
df1 <- structure(list(gene_name = c("ACAA1", "ACAA1"), chr = c(3L, 3L
), nb_pos = c(38173733L, 38144875L), nb_ref = c("C", "G"),
nb_alt = c(TRUE,
TRUE), m_pos = 38144875:38144876, m_ref = c("G", "G"), m_alt = c("T",
"A")), .Names = c("gene_name", "chr", "nb_pos", "nb_ref", "nb_alt",
"m_pos", "m_ref", "m_alt"), class = "data.frame",
row.names = c(NA, -2L))
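Note that nb_alt is logical (TRUE) in this data rather than the allele "T", presumably because the column was type-guessed on import. A minimal sketch of one way to avoid that, assuming the table is read with read.table (the inline text `txt` below is just a copy of the question's example, not a real file):

```r
# Inline copy of the example table from the question
txt <- "gene_name chr nb_pos nb_ref nb_alt m_pos m_ref m_alt
ACAA1 3 38173733 C T 38144875 G T
ACAA1 3 38144875 G T 38144876 G A"

# Default type guessing: a column containing only "T" is converted to logical
df_guess <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
class(df_guess$nb_alt)

# Forcing the allele columns to character keeps "T" as a string;
# columns not named in colClasses are still type-guessed
df_chr <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE,
                     colClasses = c(nb_ref = "character", nb_alt = "character",
                                    m_ref  = "character", m_alt  = "character"))
class(df_chr$nb_alt)
```

With the character version of the data, either reshaping approach above should carry "T" through unchanged.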