
Time: 2021-03-23 22:08:37

I am trying something I thought would be easy. I'm looking for a single regex solution (though others are welcome for completeness). I want to split on the nth occurrence of a delimiter.


Here is some data:


x <- "I like_to see_how_too"
pat <- "_"

Desired outcome


Say I want to split on first occurrence of _:


[1] "I like"  "to see_how_too"

Say I want to split on second occurrence of _:


[1] "I like_to see"   "how_too"

Ideally, the solution is a regex one-liner that generalizes to the nth occurrence; it should use strsplit with a single regex.


Here's a solution that doesn't fit my parameters (a single regex that works with strsplit):


x <- "I like_to see_how_too"
y <- "_"   # delimiter
n <- 1     # which occurrence to split on
loc <- gregexpr(y, x, fixed = TRUE)[[1]][n]   # position of the nth delimiter

c(substr(x, 1, loc - 1), substr(x, loc + 1, nchar(x)))
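
The snippet above can be wrapped in a small helper. This is only a sketch in base R; the function name split_at_nth and its arguments are made up for illustration, and it assumes a literal (non-regex) delimiter.

```r
# Hypothetical helper: split x at the nth occurrence of a literal delimiter.
split_at_nth <- function(x, delim, n) {
  # Positions of every literal occurrence of delim in x
  locs <- gregexpr(delim, x, fixed = TRUE)[[1]]
  # No match at all, or fewer than n matches: return x unchanged
  if (locs[1] == -1 || n > length(locs)) return(x)
  loc <- locs[n]
  c(substr(x, 1, loc - 1),
    substr(x, loc + nchar(delim), nchar(x)))
}

split_at_nth("I like_to see_how_too", "_", 1)
# [1] "I like"         "to see_how_too"
split_at_nth("I like_to see_how_too", "_", 2)
# [1] "I like_to see" "how_too"
```

If the string has fewer than n delimiters, this sketch returns the input unchanged, which is a design choice rather than anything the question requires.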

4 solutions



Here is another solution using the gsubfn package and some regex-fu. To split on a different occurrence of the delimiter, simply change the number inside the range quantifier, {n}.


library(gsubfn)  # provides strapply

x <- 'I like_to see_how_too'
strapply(x, '((?:[^_]*_){1})(.*)', c, simplify = ~ sub('_$', '', x))
# [1] "I like"  "to see_how_too"

If you would like the nth occurrence to be user defined, you could use the following:


n <- 2
re <- paste0('((?:[^_]*_){', n, '})(.*)')
strapply(x, re, c, simplify = ~ sub('_$', '', x))
# [1] "I like_to see" "how_too"




Since R supports PCRE (with perl=TRUE), you can use \K to drop everything matched before \K from the main match result.


Below is the regex to split the string at the 3rd _:

^[^_]*(?:_[^_]*){2}\K_

If you want to split at the nth occurrence of _, just change 2 to (n - 1).


Demo on regex101


That was the plan. However, strsplit seems to think differently.


Actual execution

Demo on ideone.com


x <- "I like_to see_how_too but_it_seems to_be_impossible"
strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
strsplit(x,  "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
strsplit(x,  "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)

# strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but"   "it_seems to"   "be_impossible"

# strsplit(x,  "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but"   "it_seems to"   "be_impossible"

# strsplit(x,  "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like"     "to see"     "how"        "too but"    "it"        
# [6] "seems to"   "be"         "impossible" 

It still fails even with the stronger assertion \A:


strsplit(x,  "\\A[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like"     "to see"     "how"        "too but"    "it"        
# [6] "seems to"   "be"         "impossible"


This behavior suggests that strsplit finds the first match, takes a substring to extract the first token and the remainder, and then looks for the next match in the remainder.


This discards all state from the previous matches and leaves us with a clean slate when it tries to match the regex on the remainder. That makes it impossible to stop strsplit at the first match and still achieve the task. strsplit does not even have a parameter to limit the number of splits.
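
The behavior described above can be imitated with a short loop. This is a simplified emulation written for illustration, not R's actual implementation; the name naive_strsplit is made up, and the sketch assumes the pattern never produces a zero-length match.

```r
# Hypothetical emulation: match, cut off the token and the match, then
# restart the search on the remainder as if it were a fresh string.
naive_strsplit <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    len <- attr(m, "match.length")
    if (start == -1) {            # no further match: keep the rest as is
      out <- c(out, x)
      break
    }
    if (len == 0) stop("zero-length matches are not handled in this sketch")
    out <- c(out, substr(x, 1, start - 1))   # token before the match
    x <- substr(x, start + len, nchar(x))    # remainder; ^ now anchors here
  }
  out
}

x <- "I like_to see_how_too but_it_seems to_be_impossible"
naive_strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_")
# [1] "I like_to see" "how_too but"   "it_seems to"   "be_impossible"
```

Because each pass starts from a fresh remainder, ^ and \A anchor at the start of that remainder, so the pattern finds "the 2nd underscore" again on every pass.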




Rather than splitting, you can match to get your split strings.


Try this regex:



Replace 1 with n-1 to split on the nth occurrence of the underscore.
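
A match-based approach might look like the following in base R. The regex from the answer is not shown here, so the pattern below is an assumption built to fit the "replace 1 by n-1" description; the helper name match_split_nth is also made up.

```r
# Hypothetical helper: capture the text before and after the nth underscore.
match_split_nth <- function(x, n) {
  # Assumed pattern: group 1 holds (n - 1) underscore-terminated chunks plus
  # one more chunk, then the nth underscore, then group 2 holds the rest.
  pat <- sprintf("^((?:[^_]*_){%d}[^_]*)_(.*)$", as.integer(n - 1))
  m <- regmatches(x, regexec(pat, x, perl = TRUE))[[1]]
  if (length(m) == 0) return(x)  # fewer than n underscores: nothing to split
  m[-1]                          # drop the full match, keep the two groups
}

match_split_nth("I like_to see_how_too", 1)
# [1] "I like"         "to see_how_too"
match_split_nth("I like_to see_how_too", 2)
# [1] "I like_to see" "how_too"
```

Since the whole string is matched exactly once, nothing is rescanned, so the restart behavior of strsplit does not come into play.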


RegEx Demo

Update: R also supports PCRE (with perl=TRUE), so you can split as well, using this PCRE regex:

^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_

Replace 1 with n-1 to split on the nth occurrence of the underscore.


  • (*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
  • (*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
  • Together, (*SKIP)(*FAIL) offer a nice way around the restriction that a lookbehind cannot have variable length in the regex above.

RegEx Demo2

x <- "I like_to see_how_too"

strsplit(x,  "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
strsplit(x,  "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)

## > strsplit(x,  "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like" "to see" "how"    "too"   

## > strsplit(x,  "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like_to see" "how_too" 
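
To avoid editing the quantifier by hand, the pattern can be built from n. This is a minimal sketch; the helper name split_skip_fail is made up, and the pattern is the one used in the calls above.

```r
# Hypothetical helper: build the (*SKIP)(*F) pattern for the nth underscore.
split_skip_fail <- function(x, n) {
  pat <- sprintf("^((?:[^_]*_){%d}[^_]*)(*SKIP)(*F)|_", as.integer(n - 1))
  strsplit(x, pat, perl = TRUE)[[1]]
}

split_skip_fail("I like_to see_how_too", 2)
# [1] "I like_to see" "how_too"
```

Note that, as the {0} output above shows, n = 1 splits at every underscore rather than only the first, so this is not a general nth-occurrence split.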



This uses gsubfn to preprocess the input string so that strsplit can handle it. The main advantage is that one can specify a vector of numbers, k, indicating which underscores to split on.


It replaces the occurrences of the underscore selected by k with a double underscore and then splits on the double underscore. In this example we split at the 2nd and 4th underscore:



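library(gsubfn)  # provides gsubfn; proto is loaded with it

k <- c(2, 4) # split at 2nd and 4th _

# gsubfn maintains `count`, the number of the current match, in the proto object
p <- proto(fun = function(., x) if (count %in% k) "__" else "_")
strsplit(gsubfn("_", p, "aa_bb_cc_dd_ee_ff"), "__")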



[1] "aa_bb" "cc_dd" "ee_ff"

If empty fields are allowed then use any other character sequence not in the string, e.g. "\01" in place of the double underscore.
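
The same idea can be sketched in base R without gsubfn, by rewriting the selected matches with regmatches<- and then splitting on the marker. This swaps in a different technique from the answer's proto object; the helper name split_at_kth and the default marker "\001" are assumptions, and the marker must not occur in the input.

```r
# Hypothetical helper: split x (a single string) at the k-th occurrences
# of a literal delimiter, e.g. k = c(2, 4).
split_at_kth <- function(x, k, delim = "_", marker = "\001") {
  m <- gregexpr(delim, x, fixed = TRUE)
  n_found <- sum(m[[1]] > 0)           # number of delimiters actually found
  if (n_found == 0) return(x)
  # Replace the selected occurrences with the marker, keep the others as is
  repl <- ifelse(seq_len(n_found) %in% k, marker, delim)
  regmatches(x, m) <- list(repl)
  strsplit(x, marker, fixed = TRUE)[[1]]
}

split_at_kth("aa_bb_cc_dd_ee_ff", c(2, 4))
# [1] "aa_bb" "cc_dd" "ee_ff"
```

Using a control character as the marker rather than a double underscore keeps adjacent underscores (empty fields) in the input from being mistaken for split points.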


See section 4 of the gsubfn vignette for more info on using gsubfn with proto objects to retain state between matches.



