I want to extract a substring from a more complex string, and I think a regex can do it. Basically, I want to keep only the text between the \" and \" in Function=\"SMAD5\". I also want to keep the entries where that value is empty: Function=\"\".
df=structure(1:6, .Label = c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
"ID=Gfo_R000002;Source=ENSTGUT00000017468;Function=\"CENPA\";",
"ID=Gfo_R000003;Source=ENSGALT00000028134;Function=\"C1QL4\";",
"ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";", "ID=Gfo_R000005;Source=ENSTGUT00000019268;Function=\"\";",
"ID=Gfo_R000006;Source=ENSTGUT00000019035;Function=\"\";"), class = "factor")
The result should look like this:
"SMAD5"
"CENPA"
"C1QL4"
NA
NA
NA
This is what I have managed so far:
gsub('.*Function=\"',"",df)
[1] "SMAD5\";" "CENPA\";" "C1QL4\";" "\";" "\";" "\";"
But I'm stuck with a trailing \"; on every element. How can I remove it in one line?
I tried this:
gsub('.*Function=\"' & '.\"*',"",test)
But it's giving me this error:
Error in ".*Function=\"" & ".\"*" :
operations are possible only for numeric, logical or complex types
3 Answers
#1
2
You may use
gsub(".*Function=\"([^\"]*).*","\\1",df)
See the regex demo
Details:
- .* : any 0+ chars, as many as possible, up to the last...
- Function=\" : a Function=" substring
- ([^\"]*) : capturing group 1, matching 0+ chars other than a "
- .* : and the rest of the string.
The \1 (written "\\1" inside an R string literal) is the backreference that restores the contents of Group 1 in the result.
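As a small sketch of this pattern applied to data shaped like the question's: the names x, out, and result are illustrative, and the last step, which turns empty matches into NA, is an assumption added here to match the desired output in the question. It is not part of the one-liner above, which returns "" for the empty entries.

```r
# Two entries from the question, as a plain character vector
x <- c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
       "ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";")

# Replace the whole string with capture group 1, i.e. the text
# between the quotes that follow Function=
out <- gsub(".*Function=\"([^\"]*).*", "\\1", x)

# Assumption: empty values should be reported as NA
result <- ifelse(out == "", NA_character_, out)
print(result)
# [1] "SMAD5" NA
```

Because gsub() coerces its input with as.character(), the same call should also work directly on the factor df from the question.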
#2
1
With stringr we can capture groups too:
library(stringr)
matches <- str_match(df, ".*\"(.*)\".*")[,2]
ifelse(matches=='', NA, matches)
# [1] "SMAD5" "CENPA" "C1QL4" NA NA NA
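Note that the greedy pattern above captures whatever sits between the last pair of quotes, which is fine for this data because Function is the only quoted field. As a hedged alternative, here is a base R sketch that anchors the capture to Function= instead. The input with an extra Note field is hypothetical, and the names x, m, and captured are illustrative.

```r
# Hypothetical input with an extra quoted field after Function
x <- c("ID=Gfo_R000001;Function=\"SMAD5\";Note=\"x\";",
       "ID=Gfo_R000004;Function=\"\";Note=\"y\";")

# regexec() returns the full match plus each capture group per element
m <- regmatches(x, regexec("Function=\"([^\"]*)\"", x))

# Take capture group 1 (second entry); NA if the pattern did not match
captured <- vapply(m, function(g) g[2], character(1))

# Assumption: empty Function values should be reported as NA
captured[captured == ""] <- NA
print(captured)
# [1] "SMAD5" NA
```

On this input the unanchored pattern would return the Note values instead, since those are the last quoted pair in each string.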
#3
0
The regular expression can be constructed more readably using rebus.
library(rebus)
# Literal Function=" followed by a captured run of non-quote characters
rx <- 'Function="' %R%
  capture(zero_or_more(negated_char_class('"')))
Matching then works as described in the answers by Wiktor and sandipan.
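library(stringr)  # for str_match
library(stringi)  # for stri_match_first_regex
rx <- 'Function="' %R% capture(zero_or_more(negated_char_class('"')))
str_match(df, rx)
stri_match_first_regex(df, rx)
# REF1 is the rebus backreference to capture group 1
gsub(any_char(0, Inf) %R% rx %R% any_char(0, Inf), REF1, df)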