How can I keep only certain information from a complex string in R?

Time: 2023-01-28 20:13:41

I want to keep a substring from inside a complex string, and I think a regex can extract the part I need. Basically, I want to keep only the text between the \" and \" in Function=\"SMAD5\". I also want to keep the entries where that text is empty: Function=\"\"

df=structure(1:6, .Label = c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";", 
"ID=Gfo_R000002;Source=ENSTGUT00000017468;Function=\"CENPA\";", 
"ID=Gfo_R000003;Source=ENSGALT00000028134;Function=\"C1QL4\";", 
"ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";", "ID=Gfo_R000005;Source=ENSTGUT00000019268;Function=\"\";", 
"ID=Gfo_R000006;Source=ENSTGUT00000019035;Function=\"\";"), class = "factor")

The result should look like this:

"SMAD5"
"CENPA"
"C1QL4"
NA
NA
NA

This is what I have been able to do so far:

gsub('.*Function=\"',"",df)

[1] "SMAD5\";" "CENPA\";" "C1QL4\";" "\";"      "\";"      "\";"     

But I'm left with a trailing \"; on every element. How can I remove these in one line?

I tried this:

gsub('.*Function=\"' & '.\"*',"",test)

But it's giving me this error:

Error in ".*Function=\"" & ".\"*" : 
  operations are possible only for numeric, logical or complex types
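
For context, the error arises because `&` is R's logical AND operator and cannot join two strings. The following is a minimal sketch, not taken from the answers, that joins the two pieces with `paste0()` and a regex alternation. The names `x`, `prefix`, `suffix`, `pattern` and `stripped` are illustrative, and `x` is a plain character vector holding two of the question's strings.

```r
# Two sample strings in the same shape as the question's data
x <- c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
       "ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";")

# paste0() concatenates strings; `&` only works on logical, numeric or complex values
prefix <- ".*Function=\""   # everything up to and including Function="
suffix <- "\";.*"           # the closing quote, the semicolon and anything after
pattern <- paste0(prefix, "|", suffix)   # alternation: match either piece

# Remove both pieces in a single gsub() call
stripped <- gsub(pattern, "", x)
stripped
```

Entries with an empty Function value come back as empty strings here, so a further step is still needed if NA is wanted.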

3 solutions

#1


2  

You may use

gsub(".*Function=\"([^\"]*).*","\\1",df)

Details:

  • .* - any zero or more characters, as many as possible, up to the last occurrence of...
  • Function=\" - a literal Function=" substring
  • ([^\"]*) - capturing group 1, matching zero or more characters other than "
  • .* - the rest of the string.

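In the replacement, \1 (written \\1 in an R string) is a backreference that inserts the contents of capturing group 1, so only the captured text remains. Note that entries with Function=\"\" yield an empty string, not NA.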
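
The desired output in the question shows NA where the Function value is empty, while gsub returns "" for those entries. A minimal sketch of one way to add that step, assuming a plain character vector `x` with two of the question's strings (`x` and `fn` are illustrative names):

```r
# Two sample strings shaped like the question's data
x <- c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
       "ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";")

# Replace each whole string with capture group 1;
# entries with Function="" come back as ""
fn <- gsub(".*Function=\"([^\"]*).*", "\\1", x)

# Recode empty strings as NA to match the desired output
fn[fn == ""] <- NA
fn
```

The same two steps apply to the factor `df` from the question, since gsub converts its input to character first.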

#2


1  

With stringr we can capture groups too:

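library(stringr)
# Column 2 of the match matrix holds capture group 1: the text between the quotes
matches <- str_match(df, ".*\"(.*)\".*")[, 2]
# Convert empty matches to NA
ifelse(matches == "", NA, matches)
# [1] "SMAD5" "CENPA" "C1QL4" NA      NA      NA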
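
The pattern above relies on each string containing exactly one quoted field. As a hedged alternative, here is a base R sketch that anchors on Function= and needs no extra package. The third sample string, which has no Function field, is hypothetical and added only to show the no-match case; `x`, `m` and `fn` are illustrative names.

```r
# Sample strings; the third one (no Function field) is a hypothetical edge case
x <- c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
       "ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";",
       "ID=Gfo_R000007;Source=ENSTGUT00000000000;")

# regexec() locates the full match and each capture group;
# regmatches() extracts them as one character vector per input string
m <- regmatches(x, regexec("Function=\"([^\"]*)\"", x))

# Element 2 is capture group 1; strings without a match give character(0)
fn <- vapply(m,
             function(parts) if (length(parts) >= 2) parts[2] else NA_character_,
             character(1))

# Recode empty captures as NA
fn[!is.na(fn) & fn == ""] <- NA
fn
```

Anchoring on Function= keeps the extraction correct even if another quoted field is later added to the strings.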

#3


0  

The regular expression can be constructed more readably using rebus.

library(rebus)
# Build a pattern equivalent to Function="([^"]*) from readable pieces
rx <- 'Function="' %R% 
  capture(zero_or_more(negated_char_class('"')))

Matching then works as in the answers by Wiktor and sandipan.

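library(stringr)
library(stringi)
rx <- 'Function="' %R% capture(zero_or_more(negated_char_class('"')))
# Column 2 of each match matrix holds the captured text
str_match(df, rx)
stri_match_first_regex(df, rx)

# Base R: replace the whole string with capture group 1 (REF1 is the rebus backreference)
gsub(any_char(0, Inf) %R% rx %R% any_char(0, Inf), REF1, df)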
