Error when extracting text with regex in R

Time: 2022-10-10 13:55:53

I have a text string as shown below:

txt = "(2) 1G–1G (0)"

And a data frame:

DF <- data.frame(txt = c('(2) 1G–1G (0)','(1) 1G–1G (4)','(2) 1G–1G (0)'))

I am trying to extract the numbers inside the parentheses, and I want the result in this format:

  2 - 0

This is what I am using:

gsub('.+\\(([0-9]+)\\) 1G–1G \\(([0-9]+)\\).*$', '\\1 \\2', txt)

But what I get from the above is:

 "(2) 1G–1G (0)"

I am not sure where the mistake is. Can someone please explain why this code does not work the way I want it to?

3 Answers

#1

You may use

DF$txt <- trimws(gsub("[^()–]*\\(([0-9]+)\\)[^()–]*"," \\1 ",DF$txt))
## => [1] "2 – 0" "1 – 4" "2 – 0"

Details

  • [^()–]* - zero or more chars other than (, ) and –
  • \\( - a ( char
  • ([0-9]+) - Group 1: one or more digits
  • \\) - a ) char
  • [^()–]* - zero or more chars other than (, ) and –
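A minimal sketch of the same pattern applied without overwriting the original column. Because the pattern leaves the en dash in place, the result uses "–" as the separator; the extra `sub()` call that turns it into a plain hyphen, and the column name `extracted`, are assumptions added here to match the format the question asks for.

```r
DF <- data.frame(txt = c('(2) 1G–1G (0)', '(1) 1G–1G (4)', '(2) 1G–1G (0)'))

# Replace each "(digits)" plus its non-dash, non-parenthesis surroundings
# with the digits padded by spaces, then trim the outer spaces.
res <- trimws(gsub("[^()–]*\\(([0-9]+)\\)[^()–]*", " \\1 ", DF$txt))

# Assumption: convert the surviving en dash to a plain hyphen.
DF$extracted <- sub("–", "-", res, fixed = TRUE)
DF$extracted
```

Writing to a new column keeps `DF$txt` available for checking the result against the input.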

#2

You could extract them using base R with regexec and regmatches like so:

df <- data.frame(txt = c('(2) 1G–1G (0)','(1) 1G–1G (4)','(2) 1G–1G (0)', 'somejunkhere'),
                 stringsAsFactors = FALSE)

getNumbers <- function(col) {
  # For each string, capture the two parenthesized numbers;
  # return NA when the string does not match.
  sapply(col, function(x) {
    m <- regexec("\\((\\d+)\\)[^()]*\\((\\d+)\\)", x, perl = TRUE)
    groups <- regmatches(x, m)[[1]]  # full match followed by the two groups
    if (length(groups) == 0) NA_character_
    else sprintf("%s - %s", groups[2], groups[3])
  }, USE.NAMES = FALSE)
}
df$extracted <- getNumbers(df$txt)
df

This yields

            txt extracted
1 (2) 1G–1G (0)     2 - 0
2 (1) 1G–1G (4)     1 - 4
3 (2) 1G–1G (0)     2 - 0
4  somejunkhere      <NA>
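Since `regexec` and `regmatches` both accept a whole character vector, the per-element loop can probably be avoided. The following is only a sketch of that variant; the variable names `m` and `extracted` are illustrative, and `[0-9]` is used instead of `\\d` so that the default regex engine suffices.

```r
txt <- c("(2) 1G–1G (0)", "(1) 1G–1G (4)", "somejunkhere")

# One list element per input string: full match plus the two capture groups,
# or character(0) when the string does not match.
m <- regmatches(txt, regexec("\\(([0-9]+)\\)[^()]*\\(([0-9]+)\\)", txt))

# Join the two captured numbers, or return NA for non-matching strings.
extracted <- vapply(m, function(g) {
  if (length(g) == 3) paste(g[2], g[3], sep = " - ") else NA_character_
}, character(1))
extracted
```

`vapply` with `character(1)` guarantees a character vector of the same length as the input, which makes it safe to assign as a data frame column.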

#3

I do not understand why you would say it does not work:

sub(".*\\((\\d+).*\\((\\d+).*","\\1-\\2",DF$txt)
 [1] "2-0" "1-4" "2-0"

or even:

 transform(DF,extracted=sub(".*\\((\\d+).*\\((\\d+).*","\\1 - \\2",txt))
            txt extracted
1 (2) 1G–1G (0)     2 - 0
2 (1) 1G–1G (4)     1 - 4
3 (2) 1G–1G (0)     2 - 0
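A likely explanation for the behaviour in the question, offered as a sketch rather than something the answers state: the pattern there starts with `.+`, which requires at least one character before the first "(", while the pattern in this answer starts with `.*`. Since the string begins with "(", the original pattern finds no match and `gsub` returns the input unchanged. The names `original` and `fixed` are illustrative.

```r
txt <- "(2) 1G–1G (0)"

# Pattern from the question: `.+` needs at least one character before "(2)",
# so there is no match and the input comes back as-is.
original <- gsub('.+\\(([0-9]+)\\) 1G–1G \\(([0-9]+)\\).*$', '\\1 \\2', txt)

# Assumed fix: `.*` also accepts zero leading characters.
fixed <- gsub('.*\\(([0-9]+)\\) 1G–1G \\(([0-9]+)\\).*$', '\\1 - \\2', txt)

original
fixed
```

The replacement string also needs the " - " between the two backreferences to produce the requested format.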
