Conditional string splitting in R

I have the vector mystring below, which uses "_" as the delimiter. If a string contains two or more delimiters, I want to split it at the second delimiter; if it contains only one, I want to split it at ".ReCal". The result should look like the one shown below.

mystring<-c("MODY_60.2.ReCal.sort.bam","MODY_116.21_C4U.ReCal.sort.bam","MODY_116.3_C2RX-1-10.ReCal.sort.bam","MODY_116.4.ReCal.sort.bam")

Desired result:

"MODY_60.2"  "MODY_116.21" "MODY_116.3"  "MODY_116.4"

6 solutions

#1

You can do this with the gsubfn package:

library(gsubfn)
# x: whole match, y: first capture group, z: second capture group
f <- function(x, y, z) if (z == "_") y else strsplit(x, ".ReCal", fixed = TRUE)[[1]][[1]]
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref = 2)
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"

This also handles strings with more than two "_" that should still be split at the second one, for example:

mystring<-c("MODY_60.2.ReCal.sort.bam",
            "MODY_116.21_C4U.ReCal.sort.bam",
            "MODY_116.3_C2RX-1-10.ReCal.sort.bam",
            "MODY_116.4.ReCal.sort.bam",
            "MODY_116.4_asdfsadf_1212_asfsdf",
            "MODY_116.5.ReCal_asdfsadf_1212_asfsdf",  # split by second "_", leaving ".ReCal"
            "MODY")

gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2"        "MODY_116.21"      "MODY_116.3"       "MODY_116.4"      
# [5] "MODY_116.4"       "MODY_116.5.ReCal" "MODY"

In the function f, x is the whole match (here the entire string, because the pattern ends in ".*"), and y and z are the two capture groups. If z is not "_", the string has no second delimiter after the first two fields, so f falls back to splitting x at ".ReCal".
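To make those arguments concrete, here is a small base-R sketch that is not part of the original answer. It uses regexec() and regmatches() to show the whole match and the two capture groups for two of the inputs, then applies the same logic as f. It assumes that gsubfn with backref=2 passes exactly these three pieces, in this order.

```r
# Illustration only: inspect the whole match and the two capture groups
# that the pattern produces, using base R instead of gsubfn.
pattern <- "([^_]+_[^_]+)(.).*"
s <- c("MODY_60.2.ReCal.sort.bam", "MODY_116.21_C4U.ReCal.sort.bam")

# Each list element is c(whole match, group 1, group 2)
parts <- regmatches(s, regexec(pattern, s, perl = TRUE))

# Same logic as f: keep group 1 when group 2 is "_",
# otherwise cut the whole match at the literal ".ReCal".
f <- function(x, y, z) {
  if (z == "_") y else strsplit(x, ".ReCal", fixed = TRUE)[[1]][[1]]
}

res <- vapply(parts, function(p) f(p[1], p[2], p[3]), character(1))
print(parts[[2]])  # whole match, group 1, group 2 for the second string
print(res)
# [1] "MODY_60.2"   "MODY_116.21"
```

For the first string the second group is an ordinary character rather than "_", which is what triggers the ".ReCal" fallback.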

#2

With the stringr package:

library(stringr)
str_extract(mystring, '.*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)')
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"

It also works when there are more than two delimiters.
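As a rough check of that claim without installing stringr, the same pattern can be tried in base R. This is a sketch, not the answerer's code: it assumes regexpr() with perl = TRUE handles the lookaheads the same way, and the vector name ids is made up for the example.

```r
# Base-R sketch of the same lookahead pattern (perl = TRUE enables lookaheads).
ids <- c("MODY_60.2.ReCal.sort.bam",         # one delimiter
         "MODY_116.21_C4U.ReCal.sort.bam",   # two delimiters
         "MODY_116.4_asdfsadf_1212_asfsdf")  # four delimiters

pattern <- ".*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)"

# regexpr() finds the first match in each string; regmatches() extracts it
out <- regmatches(ids, regexpr(pattern, ids, perl = TRUE))
print(out)
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.4"
```

One difference to keep in mind: regmatches() silently drops elements that do not match at all, whereas str_extract() returns NA for them, so the stringr version keeps the output aligned with the input.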

#3

Perl/PCRE has a branch reset feature, (?|...), that lets capturing groups in different alternatives share the same group number, so they are treated as a single capturing group.

IMO, this feature is elegant when you want to supply different alternatives.

x <- c('MODY_60.2.ReCal.sort.bam', 'MODY_116.21_C4U.ReCal.sort.bam', 
       'MODY_116.3_C2RX-1-10.ReCal.sort.bam', 'MODY_116.4.ReCal.sort.bam',
       'MODY_116.4_asdfsadf_1212_asfsdf', 'MODY_116.5.ReCal_asdfsadf_1212_asfsdf', 'MODY')

sub('^(?|([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$', '\\1', x, perl=TRUE)
# [1] "MODY_60.2"        "MODY_116.21"      "MODY_116.3"       "MODY_116.4"      
# [5] "MODY_116.4"       "MODY_116.5.ReCal" "MODY"
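To see what branch reset buys you, the sketch below compares it with an ordinary non-capturing group, where the two alternatives get group numbers 1 and 2 and the replacement has to mention both. This comparison is an addition for illustration, and it relies on sub() substituting an empty string for a group that did not take part in the match.

```r
x <- c("MODY_60.2.ReCal.sort.bam", "MODY_116.21_C4U.ReCal.sort.bam")

# Branch reset: both alternatives capture into group 1
with_reset <- sub("^(?|([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$", "\\1", x, perl = TRUE)

# Plain non-capturing group: the alternatives are groups 1 and 2,
# so the replacement concatenates both (the unused one is empty)
without_reset <- sub("^(?:([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$", "\\1\\2", x, perl = TRUE)

print(with_reset)
print(identical(with_reset, without_reset))
```

With more alternatives the second form needs one backreference per branch, while the branch-reset form stays at a single \\1.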

#4

gsub('^(.*\\.\\d+).*','\\1',mystring)
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"

#5

^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$

You can do this with gsub and a fairly simple regex: match the whole string with the pattern above and replace it with \\1. See the demo:

https://regex101.com/r/wL4aB6/1
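The answer gives only the pattern, so here is a minimal sketch of how it might be applied in R. It assumes perl = TRUE and uses sub() rather than gsub(), since the anchored pattern can match each string at most once.

```r
mystring <- c("MODY_60.2.ReCal.sort.bam",
              "MODY_116.21_C4U.ReCal.sort.bam",
              "MODY_116.3_C2RX-1-10.ReCal.sort.bam",
              "MODY_116.4.ReCal.sort.bam")

# Group 1 holds the first two "_"-separated fields; the non-capturing group
# swallows either "_..." or ".ReCal..." up to the end of the string.
pattern <- "^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$"

result <- sub(pattern, "\\1", mystring, perl = TRUE)
print(result)
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"
```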

#6

A little longer, but it needs less regular-expression knowledge:

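library(stringr)

# Start/end positions of every "_" in each string (one matrix per element)
indx <- str_locate_all(mystring, "_")

for (i in seq_along(indx)) {
  if (nrow(indx[[i]]) < 2) {
    # Fewer than two delimiters: keep everything before the literal ".ReCal"
    mystring[i] <- strsplit(mystring[i], ".ReCal", fixed = TRUE)[[1]][1]
  } else {
    # Two or more delimiters: keep everything before the second "_"
    mystring[i] <- substr(mystring[i], start = 1, stop = indx[[i]][2, "start"] - 1)
  }
}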
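The same idea can also be written without stringr and without overwriting mystring in place. The following is one possible base-R sketch; the helper name split_id and the vector name ids are made up for this example.

```r
# Hypothetical helper: cut one string at the second "_" if there is one,
# otherwise at the literal ".ReCal".
split_id <- function(s) {
  pos <- gregexpr("_", s, fixed = TRUE)[[1]]  # a single -1 when there is no "_"
  if (length(pos) >= 2) {
    substr(s, 1, pos[2] - 1)
  } else {
    strsplit(s, ".ReCal", fixed = TRUE)[[1]][1]
  }
}

ids <- c("MODY_60.2.ReCal.sort.bam",
         "MODY_116.21_C4U.ReCal.sort.bam",
         "MODY_116.3_C2RX-1-10.ReCal.sort.bam",
         "MODY_116.4.ReCal.sort.bam")

out <- vapply(ids, split_id, character(1), USE.NAMES = FALSE)
print(out)
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"
```

Using fixed = TRUE in both calls means "_" and ".ReCal" are treated as literal text, so the "." cannot accidentally match some other character.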
