I have a matrix M with row names as below:
S003_T1_p555
S003_T2_p456
S004_T3_p785
S004_T4_p426
SuperSMART_27_T1_p112
SuperSMART_27_T2_p414
SuperSMART_42_T3_p155
SuperSMART_42_T5_p775
I would like to write a function that:
- substitutes SuperSMART_ with S in the rows where it occurs
- then extracts only the characters before the first _ as keys and assigns a unique name to each group of rows that share a key
So both S003_T1_p555 and S003_T2_p456 become "group1", S004_T3_p785 and S004_T4_p426 become "group2", and so on.
MWE
nms <- c("S003_T1_p555", "S003_T2_p456", "S004_T3_p785", "S004_T4_p426",
"SuperSMART_27_T1_p112", "SuperSMART_27_T2_p414",
"SuperSMART_42_T3_p155", "SuperSMART_42_T5_p775")
M <- matrix(
seq_along(nms),
dimnames = list(
nms,
'x'
)
)
1 solution
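library(tidyverse)
# Convert the matrix to a data frame and move the row names into an 'id' column
as.data.frame(M, stringsAsFactors = FALSE) %>%
rownames_to_column('id') %>%
mutate(
# Shorten the 'SuperSMART_' prefix to 'S'
id = gsub('SuperSMART_', 'S', id),
# Zero-pad two-digit numbers after the leading S to three digits
id = gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE)
) %>%
# Split the id on '_' into its three parts, keeping the original id column
separate(id, into = c('S', 'R', 'p'), sep = '_', remove = FALSE) %>%
# Number the groups by S. Note: newer dplyr releases deprecate passing grouping
# columns to group_indices(); group_by(S) with cur_group_id() is the replacement.
mutate(., group = group_indices(., S))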
## id S R p x group
## 1 S003_T1_p555 S003 T1 p555 1 1
## 2 S003_T2_p456 S003 T2 p456 2 1
## 3 S004_T3_p785 S004 T3 p785 3 2
## 4 S004_T4_p426 S004 T4 p426 4 2
## 5 S027_T1_p112 S027 T1 p112 5 3
## 6 S027_T2_p414 S027 T2 p414 6 3
## 7 S042_T3_p155 S042 T3 p155 7 4
## 8 S042_T5_p775 S042 T5 p775 8 4
## If you really want it as a function:
normalize_data <- function(m, ...) {
as.data.frame(m, stringsAsFactors = FALSE) %>%
tibble::rownames_to_column('id') %>%
dplyr::mutate(
id = gsub('SuperSMART_', 'S', id),
id = gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE)
) %>%
tidyr::separate(id, into = c('S', 'R', 'p'), sep = '_', remove = FALSE) %>%
dplyr::mutate(., group = dplyr::group_indices(., S))
}
The pattern '(^S)(\\d{2})(_)' is a grouped capture, with each group denoted by a pair of parentheses. There are 3 groups being captured: 1: (^S), 2: (\\d{2}), 3: (_). The first one matches an S at the beginning of the string (^). The second matches exactly 2 digits immediately after it (\\d{2}), and the third requires that those digits be followed by an underscore.
So S27_T2_p414 would be matched by this, but S004_T3_p785 would not, because its first two digits are followed by a third digit rather than an underscore.
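To check this claim directly, here is a minimal base R sketch that tests the pattern against the two example strings. The variable names (pattern, ids, matched) are only illustrative and are not part of the answer's code.

```r
# Pattern from the answer: S at the start, exactly two digits, then an underscore
pattern <- '(^S)(\\d{2})(_)'

# The two example strings discussed above
ids <- c('S27_T2_p414', 'S004_T3_p785')

# grepl() returns TRUE where the pattern matches
matched <- grepl(pattern, ids, perl = TRUE)
print(matched)
# [1]  TRUE FALSE
```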
As for the replacement '\\10\\2\\3': where a string matches '(^S)(\\d{2})(_)', the replacement string (used here with perl = TRUE) can refer back to the captured groups, i.e. the parenthesized parts above. \\1 corresponds to (^S), \\2 corresponds to (\\d{2}), and \\3 goes with (_). This technique is called a backreference, and it lets us insert things in between the capture groups. R only recognizes \\1 through \\9, so \\10 is read as \\1 followed by a literal 0. In this case I insert an extra zero between the first capture group and the second to ensure all numbers have 3 digits. This assumes that the number after S has at most 3 digits; since the pattern matches exactly two digits, a one-digit number would be left unpadded.
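For readers who want the "group1", "group2" labels from the question as a plain character vector, the following is a rough base R sketch of the same idea. It reuses the two gsub() calls from the answer; the key extraction with sub() and the labelling with match() are my own assumptions about one way to do it, not the answer's method, and the names ids, keys and groups are hypothetical.

```r
# Row names from the question's MWE
nms <- c("S003_T1_p555", "S003_T2_p456", "S004_T3_p785", "S004_T4_p426",
         "SuperSMART_27_T1_p112", "SuperSMART_27_T2_p414",
         "SuperSMART_42_T3_p155", "SuperSMART_42_T5_p775")

# Step 1: shorten the prefix, then zero-pad two-digit numbers via backreferences
ids <- gsub('SuperSMART_', 'S', nms)
ids <- gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', ids, perl = TRUE)

# Step 2: keep only the characters before the first underscore as the key
keys <- sub('_.*$', '', ids)

# Step 3: label each distinct key in order of first appearance (an assumption;
# group_indices() in the answer numbers groups in sorted key order instead)
groups <- paste0('group', match(keys, unique(keys)))

print(data.frame(id = ids, key = keys, group = groups))
```

With the example data the two numbering schemes agree, because the keys already appear in sorted order.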