I have the following data frame from which I would like to extract rows based on matching strings.
> GEMA_EO5
gene_symbol fold_EO p_value RefSeq_ID BH_p_value
KNG1 3.433049 8.56e-28 NM_000893,NM_001102416 1.234245e-24
REXO4 3.245317 1.78e-27 NM_020385 2.281367e-24
VPS29 3.827665 2.22e-25 NM_057180,NM_016226 2.560770e-22
CYP51A1 3.363149 5.95e-25 NM_000786,NM_001146152 6.239386e-22
TNPO2 4.707600 1.60e-23 NM_001136195,NM_001136196,NM_013433 1.538000e-20
NSDHL 2.703922 6.74e-23 NM_001129765,NM_015922 5.980454e-20
DPYSL2 5.097382 1.29e-22 NM_001386 1.062868e-19
I would like to extract, for example, two rows based on matching strings in $RefSeq_ID. That works fine with the following:
> list<-c("NM_001386", "NM_020385")
> GEMA_EO6<-subset(GEMA_EO5, GEMA_EO5$RefSeq_ID %in% list, drop = TRUE)
> GEMA_EO6
gene_symbol fold_EO p_value RefSeq_ID BH_p_value
REXO4 3.245317 1.78e-27 NM_020385 2.281367e-24
DPYSL2 5.097382 1.29e-22 NM_001386 1.062868e-19
But some of the rows have several RefSeq_IDs separated with commas, so I am looking for a general way of telling if $RefSeq_ID contains a certain string pattern and then subset that row.
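To make the problem concrete, here is a minimal sketch of why `%in%` misses the multi-ID rows. The variable names `field`, `whole`, and `part` are made up for this illustration; the field value is copied from the TNPO2 row above.

```r
# The comma-separated field from the TNPO2 row
field <- "NM_001136195,NM_001136196,NM_013433"

# %in% compares whole strings, so a single ID never equals the combined field
whole <- "NM_013433" %in% field

# A substring test does find the ID inside the field
part <- grepl("NM_013433", field, fixed = TRUE)

c(whole = whole, part = part)
```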
2 solutions
#1
15
To do partial matching you'll need to use regular expressions (see ?grepl). Here's a solution to your particular problem:
##Notice that the first element appears in
##a row containing commas
l = c( "NM_013433", "NM_001386", "NM_020385")
To test one sequence at a time, we just select a particular seq id:
R> subset(GEMA_EO5, grepl(l[1], GEMA_EO5$RefSeq_ID))
gene_symbol fold_EO p_value RefSeq_ID BH_p_value
5 TNPO2 4.708 1.6e-23 NM_001136195,NM_001136196,NM_013433 1.538e-20
To test for multiple genes, we use the | operator:
R> paste(l, collapse="|")
[1] "NM_013433|NM_001386|NM_020385"
R> grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID)
[1] FALSE TRUE FALSE FALSE TRUE FALSE TRUE
So
subset(GEMA_EO5, grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID))
should give you what you want.
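One caveat: grepl matches substrings, so a shorter ID can also match a longer one that happens to contain it. If whole-ID matches are needed, a possible sketch is to split each field on commas and test membership. This assumes the IDs are separated by commas with no surrounding spaces; the data frame `df`, the helper `has_id`, and the ID NM_001386999 are invented for this illustration.

```r
# Toy data in the same shape as GEMA_EO5 (only two columns; last row invented)
df <- data.frame(
  gene_symbol = c("REXO4", "TNPO2", "DPYSL2", "FAKE1"),
  RefSeq_ID   = c("NM_020385",
                  "NM_001136195,NM_001136196,NM_013433",
                  "NM_001386",
                  "NM_001386999"),   # invented ID that merely contains NM_001386
  stringsAsFactors = FALSE
)

ids <- c("NM_013433", "NM_001386")

# Substring match: also picks up the invented NM_001386999 row
partial <- grepl(paste(ids, collapse = "|"), df$RefSeq_ID)

# Exact match: split each field on commas, then compare whole IDs
has_id <- function(field, wanted) {
  vapply(strsplit(field, ",", fixed = TRUE),
         function(x) any(x %in% wanted),
         logical(1))
}
exact <- has_id(df$RefSeq_ID, ids)

subset(df, exact)
```

Here `partial` flags the last three rows, while `exact` flags only the TNPO2 and DPYSL2 rows.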
#2
1
A different approach is to recognize the multiple comma-separated entries in RefSeq_ID as an attempt to represent two database tables in a single data frame. So if the original table is csv, then normalize the data into two tables
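# Annotation table: one row per gene, keyed by row number
Anno <- cbind(key = seq_len(nrow(csv)), csv[, names(csv) != "RefSeq_ID"])
# Split the comma-separated IDs (as.character() guards against a factor column)
key0 <- strsplit(as.character(csv$RefSeq_ID), ",")
# RefSeq table: one row per (key, ID) pair
RefSeq <- data.frame(key = rep(seq_along(key0), lengths(key0)),
                     ID = unlist(key0))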
and recognize that the query is a subset (select) on the RefSeq table, followed by a merge (join) with Anno:
l <- c( "NM_013433", "NM_001386", "NM_020385")
merge(Anno, subset(RefSeq, ID %in% l))[, -1]
leading to
> merge(Anno, subset(RefSeq, ID %in% l))[, -1]
gene_symbol fold_EO p_value BH_p_value ID
1 REXO4 3.245317 1.78e-27 2.281367e-24 NM_020385
2 TNPO2 4.707600 1.60e-23 1.538000e-20 NM_013433
3 DPYSL2 5.097382 1.29e-22 1.062868e-19 NM_001386
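For anyone who wants to run the normalized-table approach end to end, here is a minimal self-contained sketch. The three-row `csv` data frame is a cut-down stand-in for the table in the question (an assumption of this example, not the original data file), and `hit` is just a name chosen here for the result.

```r
# Cut-down stand-in for the original table (three of the seven rows)
csv <- data.frame(
  gene_symbol = c("REXO4", "TNPO2", "NSDHL"),
  fold_EO     = c(3.245317, 4.707600, 2.703922),
  RefSeq_ID   = c("NM_020385",
                  "NM_001136195,NM_001136196,NM_013433",
                  "NM_001129765,NM_015922"),
  stringsAsFactors = FALSE
)

# One row per gene, keyed by row number
Anno <- cbind(key = seq_len(nrow(csv)), csv[, names(csv) != "RefSeq_ID"])

# One row per (gene key, RefSeq ID) pair
key0   <- strsplit(csv$RefSeq_ID, ",")
RefSeq <- data.frame(key = rep(seq_along(key0), lengths(key0)),
                     ID  = unlist(key0),
                     stringsAsFactors = FALSE)

# Select the wanted IDs, then join back to the annotation by key
l   <- c("NM_013433", "NM_020385")
hit <- merge(Anno, subset(RefSeq, ID %in% l))[, -1]
hit
```

Because the join is on whole IDs, this avoids the substring matches that a regular expression can produce.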
Perhaps the goal is instead to merge with a 'Master' table; then
Master <- cbind(key = seq_len(nrow(csv)), csv)
merge(Master, subset(RefSeq, ID %in% l))[,-1]
or similar.