Split on the first/nth occurrence of a delimiter

Date: 2021-03-23 22:08:37

I am trying something I thought would be easy. I'm looking for a single-regex solution (though others are welcome for completeness). I want to split on the nth occurrence of a delimiter.

Here is some data:

x <- "I like_to see_how_too"
pat <- "_"

Desired outcome

Say I want to split on first occurrence of _:

[1] "I like"  "to see_how_too"

Say I want to split on second occurrence of _:

[1] "I like_to see"   "how_too"

Ideally, the solution is a regex one-liner that generalizes to the nth occurrence and uses strsplit with a single regex.

Here's a solution that works but doesn't meet my requirement of a single regex used with strsplit:

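x <- "I like_to see_how_too"
y <- "_"   # delimiter
n <- 1     # which occurrence to split on
# position of the nth occurrence of the delimiter
loc <- gregexpr(y, x, fixed = TRUE)[[1]][n]

# everything before the delimiter, and everything after it
c(substr(x, 1, loc-1), substr(x, loc + 1, nchar(x)))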
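
The position-based approach above can be wrapped in a small helper. This is only a minimal sketch: the name split_at_nth and its arguments are hypothetical, and the delimiter is treated as a fixed string rather than a regex.

```r
# Hypothetical helper: split x at the nth occurrence of a literal delimiter.
split_at_nth <- function(x, delim, n) {
  # Start positions of every literal occurrence of delim (-1 if there is none).
  locs <- gregexpr(delim, x, fixed = TRUE)[[1]]
  # Fewer than n occurrences: return the string unsplit.
  if (locs[1] == -1 || n > length(locs)) return(x)
  loc <- locs[n]
  c(substr(x, 1, loc - 1),
    substr(x, loc + nchar(delim), nchar(x)))
}

x <- "I like_to see_how_too"
split_at_nth(x, "_", 1)
# [1] "I like"         "to see_how_too"
split_at_nth(x, "_", 2)
# [1] "I like_to see" "how_too"
```

Returning the input unchanged when there are fewer than n delimiters is a design choice of this sketch, not something the question specifies.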

4 Solutions

#1


4  

Here is another solution using the gsubfn package and some regex-fu. To split on a different occurrence of the delimiter, simply change the number inside the range quantifier, {n}.

library(gsubfn)
x <- 'I like_to see_how_too'
strapply(x, '((?:[^_]*_){1})(.*)', c, simplify =~ sub('_$', '', x))
# [1] "I like"  "to see_how_too"

If you would like the nth occurrence to be user defined, you could use the following:

n <- 2
re <- paste0('((?:[^_]*_){',n,'})(.*)')
strapply(x, re, c, simplify =~ sub('_$', '', x))
# [1] "I like_to see" "how_too" 

#2


3  

Non-Solution

Since R uses PCRE (when perl=TRUE is set), you can use \K to drop everything matched before \K from the reported match.

Below is the regex to split the string at the 3rd _:

^[^_]*(?:_[^_]*){2}\K_

If you want to split at the nth occurrence of _, just change 2 to (n - 1).

Demo on regex101

That was the plan. However, strsplit seems to think differently.

Actual execution

Demo on ideone.com

x <- "I like_to see_how_too but_it_seems to_be_impossible"
strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
strsplit(x,  "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
strsplit(x,  "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)

# strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but"   "it_seems to"   "be_impossible"

# strsplit(x,  "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but"   "it_seems to"   "be_impossible"

# strsplit(x,  "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like"     "to see"     "how"        "too but"    "it"        
# [6] "seems to"   "be"         "impossible" 

It still fails with the stronger assertion \A:

strsplit(x,  "\\A[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like"     "to see"     "how"        "too but"    "it"        
# [6] "seems to"   "be"         "impossible"

Explanation?

This behavior suggests that strsplit finds the first match, takes a substring to extract the first token and the remainder, and then looks for the next match in the remainder only.
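
The loop described above can be modelled in a few lines. This is a simplified sketch of the suspected behavior, not strsplit's actual implementation; the name emulate_split is hypothetical and empty matches are not handled.

```r
# Simplified model: match, cut off the first token, then match again
# on the remainder only, so anchors such as ^ apply to the remainder.
emulate_split <- function(x, pattern) {
  pieces <- character(0)
  rest <- x
  repeat {
    m <- regexpr(pattern, rest, perl = TRUE)
    len <- attr(m, "match.length")
    # Stop when nothing matches (or on an empty match, which this sketch skips).
    if (m == -1 || len == 0) break
    pieces <- c(pieces, substr(rest, 1, m - 1))
    rest <- substr(rest, m + len, nchar(rest))
  }
  c(pieces, rest)
}

x <- "I like_to see_how_too but_it_seems to_be_impossible"
emulate_split(x, "^[^_]*(?:_[^_]*){1}\\K_")
# [1] "I like_to see" "how_too but"   "it_seems to"   "be_impossible"
```

The model reproduces the strsplit output shown earlier in this answer, which is consistent with the explanation but does not prove it.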

This discards all state from the previous matches, so the regex is applied to the remainder from a clean state. That makes it impossible to have strsplit stop after the first match while still performing the split with a single regex. strsplit does not even have a parameter to limit the number of splits.
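
One base R way around the missing limit, which this answer does not mention, is to locate only the first match with regexpr and cut the string there with regmatches(..., invert = TRUE). The sketch below assumes perl = TRUE so that \K is honored; the helper name split_once is hypothetical.

```r
# Hypothetical helper: split x once, at the nth underscore.
split_once <- function(x, n) {
  # Same \K pattern as above, with the repeat count set to n - 1.
  pattern <- sprintf("^[^_]*(?:_[^_]*){%d}\\K_", n - 1)
  # regexpr() reports only the first match, so there is exactly one cut.
  m <- regexpr(pattern, x, perl = TRUE)
  # invert = TRUE returns the text before and after the match.
  regmatches(x, m, invert = TRUE)[[1]]
}

x <- "I like_to see_how_too but_it_seems to_be_impossible"
split_once(x, 2)
# [1] "I like_to see"                        "how_too but_it_seems to_be_impossible"
```

This no longer uses strsplit, so it sidesteps the original constraint rather than meeting it.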

#3


2  

Rather than splitting, you can match to get the two pieces.

Try this regex:

^((?:[^_]*_){1}[^_]*)_(.*)$

Replace 1 with n-1 to split on the nth occurrence of the underscore.
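
In R, the two capture groups of this regex can be pulled out with base functions. This is a sketch rather than the answerer's own code; it assumes R 3.3.0 or later, where regexec() accepts perl = TRUE.

```r
x <- "I like_to see_how_too"
n <- 2

# Build the matching regex with the repeat count set to n - 1.
re <- sprintf("^((?:[^_]*_){%d}[^_]*)_(.*)$", n - 1)

# regexec() returns the full match followed by the capture groups;
# drop the first element to keep only the two pieces.
parts <- regmatches(x, regexec(re, x, perl = TRUE))[[1]][-1]
parts
# [1] "I like_to see" "how_too"
```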

RegEx Demo

Update: It seems R also supports PCRE and in that case you can do split as well using this PCRE regex:

^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_

Again, replace 1 with n-1 to split on the nth occurrence of the underscore.

  • (*FAIL), abbreviated (*F) in the regex above, behaves like a failing negative assertion and is a synonym for (?!)
  • (*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
  • (*SKIP)(*FAIL) together offer a workaround for the restriction that the regex above cannot use a variable-length lookbehind

RegEx Demo2

x <- "I like_to see_how_too"

strsplit(x,  "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
strsplit(x,  "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)

## > strsplit(x,  "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like" "to see" "how"    "too"   

## > strsplit(x,  "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like_to see" "how_too" 
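
To try other values of n, the pattern can be built with sprintf. This is a sketch with a hypothetical helper name, split_every_nth. As the name suggests, on input with more underscores this pattern appears to split at every nth underscore rather than only at the nth one, because strsplit restarts matching on the remainder after each split.

```r
# Hypothetical helper: build the (*SKIP)(*F) pattern for a given n.
split_every_nth <- function(x, n) {
  pattern <- sprintf("^((?:[^_]*_){%d}[^_]*)(*SKIP)(*F)|_", n - 1)
  strsplit(x, pattern, perl = TRUE)[[1]]
}

split_every_nth("I like_to see_how_too", 2)
# [1] "I like_to see" "how_too"

# With more delimiters, the split repeats on each remainder:
split_every_nth("a_b_c_d_e", 2)
# [1] "a_b" "c_d" "e"
```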

#4


1  

This uses gsubfn to preprocess the input string so that strsplit can handle it. The main advantage is that one can specify a vector of numbers, k, indicating which underscores to split on.

It replaces the occurrences of underscore defined by k by a double underscore and then splits on double underscore. In this example we split at the 2nd and 4th underscore:

library(gsubfn)

k <- c(2, 4) # split at 2nd and 4th _

p <- proto(fun = function(., x) if (count %in% k) "__" else "_")
strsplit(gsubfn("_", p, "aa_bb_cc_dd_ee_ff"), "__")

giving:

[[1]]
[1] "aa_bb" "cc_dd" "ee_ff"

If empty fields are allowed then use any other character sequence not in the string, e.g. "\01" in place of the double underscore.
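
The same mark-then-split idea can also be sketched in base R, without gsubfn. The helper name split_at_k is hypothetical, and the marker "\001" is assumed not to occur in the input.

```r
# Hypothetical helper: split x at the underscores whose ordinal is in k.
split_at_k <- function(x, k, delim = "_", marker = "\001") {
  # Locate every literal occurrence of the delimiter.
  m <- gregexpr(delim, x, fixed = TRUE)
  hits <- regmatches(x, m)[[1]]
  # Replace only the occurrences listed in k with the marker.
  hits[seq_along(hits) %in% k] <- marker
  regmatches(x, m) <- list(hits)
  # Split on the marker; the remaining delimiters are left untouched.
  strsplit(x, marker, fixed = TRUE)[[1]]
}

split_at_k("aa_bb_cc_dd_ee_ff", c(2, 4))
# [1] "aa_bb" "cc_dd" "ee_ff"
```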

See section 4 of the gsubfn vignette for more info on using gsubfn with proto objects to retain state between matches.
