Splitting a character vector at mathematical comparison symbols in R

Date: 2021-04-25 21:41:10

I would like to split expressions at mathematical comparison operators, e.g.

unlist(strsplit("var<3", "(?=[=<>])", perl = TRUE))
unlist(strsplit("var==5", "(?=[=<>])", perl = TRUE))
unlist(strsplit("var>2", "(?=[=<>])", perl = TRUE))

The results are:

[1] "var" "<"   "3"  
[1] "var" "="   "="   "5"  
[1] "var" ">"   "2"  

For the 2nd example above, I would like to get [1] "var" "==" "5", so the two = should be returned as a single element. How do I need to change my regular expression to achieve this? (I already tried grouping and quantifiers for "==", but nothing worked - regular expressions are not my friends...)

3 answers

#1


9  

You may use a PCRE regex to match the substrings you need:

==|[<>]|(?:(?!==)[^<>])+

To also support !=, change it to

[!=]=|[<>]|(?:(?![=!]=)[^<>])+

See the regex demo.

Details:

  • == - two = signs
  • | - or
  • [<>] - a < or a >
  • | - or
  • (?:(?!==)[^<>])+ - one or more chars other than < and > ([^<>]) that do not start a == char sequence (a tempered greedy token).

NOTE: This is easily expandable by adding more alternatives and adjusting the tempered greedy token.

R test:

> text <- "Text1==text2<text3><More here"
> res <- regmatches(text, gregexpr("==|[<>]|(?:(?!==)[^<>])+", text, perl=TRUE))
> res
[[1]]
[1] "Text1"     "=="        "text2"     "<"         "text3"     ">"        
[7] "<"         "More here"
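
The note about extending the pattern can be tried out directly. What follows is a minimal sketch rather than part of the original answer: the helper name tokenize_comparison is hypothetical, and widening the first alternative from [!=]= to [!=<>]= so that <= and >= stay together is an assumption made for illustration.

```r
# Hypothetical helper: tokenize a comparison expression by matching
# operators and operands instead of splitting.
# Assumption: [!=<>]= is a widened form of the answer's [!=]= so that
# <= and >= are also returned as single elements.
tokenize_comparison <- function(x) {
  # Two-character operators come first so they win over the single < or >.
  pattern <- "[!=<>]=|[<>]|(?:(?![=!]=)[^<>])+"
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

tokenize_comparison("var==5")
tokenize_comparison("var!=5")
tokenize_comparison("var<=3")
tokenize_comparison("var>2")
```

The order of the alternatives matters here: if [<>] came before the two-character alternative, "<=" would be cut into "<" and a token starting with "=".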

#2


6  

Expanding from my idea in comments, just for the formatting:

tests <- c("var==5", "var<3", "var.name>5")
regmatches(tests, regexec("([a-zA-Z0-9_.]+)(\\W+)([a-zA-Z0-9_.]+)", tests))

\w is [a-zA-Z0-9_] and \W is its opposite, [^a-zA-Z0-9_]. After a comment, I expanded the class to include . and spelled it out in full, because R does not support \w inside a character class in base regex (that needs perl=TRUE).

So the regex searches for at least one character from \w or ., then at least one character not in \w (to match the operator), and then again at least one character from \w or .

Each part is captured, and this gives:

[[1]]
[1] "var==5" "var"    "=="     "5"     

[[2]]
[1] "var<3" "var"   "<"     "3"    

[[3]]
[1] "var.name>5" "var.name"   ">"          "5"       
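
If the pieces are needed as columns rather than as a list, the captured groups can be bound into a matrix. This is only a sketch built on the same pattern; the column names lhs, op and rhs are made up for illustration and do not come from the answer.

```r
tests <- c("var==5", "var<3", "var.name>5")
m <- regmatches(tests, regexec("([a-zA-Z0-9_.]+)(\\W+)([a-zA-Z0-9_.]+)", tests))

# Drop the first element of each result (the full match) and stack the
# three captured groups row by row into a character matrix.
parts <- do.call(rbind, lapply(m, function(p) p[-1]))
colnames(parts) <- c("lhs", "op", "rhs")  # hypothetical column names

as.data.frame(parts, stringsAsFactors = FALSE)
```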

You may add * between the capture groups if your entries could have spaces around the operator; otherwise the operator capture will include them.
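
One way to handle optional spaces is sketched below. Because \W also matches whitespace, this variant swaps it for an explicit operator class and allows optional whitespace between the groups; the class [<>=!] and the use of [[:space:]]* are assumptions made for this example, not the answer's own pattern.

```r
tests <- c("var == 5", "var <3", "var.name>= 5")

# Assumption: operators consist only of the characters <, >, = and !.
# [[:space:]]* absorbs optional whitespace so it stays out of the captures.
pattern <- "([a-zA-Z0-9_.]+)[[:space:]]*([<>=!]+)[[:space:]]*([a-zA-Z0-9_.]+)"

m <- regmatches(tests, regexec(pattern, tests))
parts <- lapply(m, function(p) p[-1])  # drop the full match
parts
```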

#3


5  

Using word boundaries (\\b) and specifying two possibilities for the lookaround:

unlist(strsplit("var==5", "(?=(\\b[^a-zA-Z0-9])|(\\b[a-zA-Z0-9]\\b))", perl = TRUE))
[1] "var" "=="  "5" 

unlist(strsplit("var<3", "(?=(\\b[^a-zA-Z0-9])|(\\b[a-zA-Z0-9]\\b))", perl = TRUE))
[1] "var" "<"   "3"
unlist(strsplit("var>2", "(?=(\\b[^a-zA-Z0-9])|(\\b[a-zA-Z0-9]\\b))", perl = TRUE))
[1] "var" ">"   "2"

Explanation:

Split where a word boundary is followed by a non-alphanumeric character (\\b[^a-zA-Z0-9]), that is, at the end of a "word", or where a word boundary is followed by a single alphanumeric character that is itself followed by a word boundary (\\b[a-zA-Z0-9]\\b).

EDIT:

Actually, the code above gives unexpected results if the number at the end is 10 or more.
Another option is to use a lookbehind and split wherever the preceding character is either a non-alphanumeric character followed by a word boundary or an alphanumeric character followed by a word boundary:

strsplit("var<20", "(?<=(([^a-zA-Z0-9]\\b)|([a-zA-Z0-9]\\b)))", perl = TRUE)[[1]]
#[1] "var" "<"   "20"
strsplit("var==20", "(?<=(([^a-zA-Z0-9]\\b)|([a-zA-Z0-9]\\b)))", perl = TRUE)[[1]]
#[1] "var" "=="  "20"
strsplit("var!=5", "(?<=(([^a-zA-Z0-9]\\b)|([a-zA-Z0-9]\\b)))", perl = TRUE)[[1]]
#[1] "var" "!="  "5"

EDIT2:

Borrowing @Tensibai's way of defining alphanumeric (plus underscore) and non-alphanumeric characters, the regex above can be simplified to: "(?<=((\\W\\b)|(\\w\\b)))"
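
The simplified lookbehind can be wrapped in a small function for reuse. This is a sketch only: the name split_comparison is hypothetical, and it assumes one expression per call.

```r
# Hypothetical wrapper around the simplified lookbehind regex.
# It splits after any character that is followed by a word boundary,
# i.e. wherever the text switches between \w and \W characters.
split_comparison <- function(x) {
  strsplit(x, "(?<=((\\W\\b)|(\\w\\b)))", perl = TRUE)[[1]]
}

split_comparison("var<20")
split_comparison("var==20")
split_comparison("var!=5")
```

Since . counts as a non-word character for \b, a name such as var.name would be cut at the dot with this pattern, so it suits inputs whose operands contain only letters, digits and underscores.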
