I have a text string as shown below:
txt = "(2) 1G–1G (0)"
And a data frame:
DF <- data.frame(txt = c('(2) 1G–1G (0)','(1) 1G–1G (4)','(2) 1G–1G (0)'))
I am trying to extract the numbers inside the parentheses. I want the extracted result to be in this format:
2 - 0
This is what I am using:
gsub('.+\\(([0-9]+)\\) 1G–1G \\(([0-9]+)\\).*$', '\\1 \\2', txt)
But what I get from the above is:
"(2) 1G–1G (0)"
I am not sure where the mistake is. Can someone please explain why this code does not work the way I want it to?
3 Answers
#1
You may use
DF$txt <- trimws(gsub("[^()–]*\\(([0-9]+)\\)[^()–]*"," \\1 ",DF$txt))
## => [1] "2 – 0" "1 – 4" "2 – 0"
See the regex demo and the R demo online.
Details
- [^()–]* - any 0+ chars other than (, ) and –
- \\( - a ( char
- ([0-9]+) - Group 1: one or more digits
- \\) - a ) char
- [^()–]* - any 0+ chars other than (, ) and –
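The pattern described above can be traced on a single string. This is only a sketch: it assumes the separator in the data is an en dash (written here as the escape "\u2013"), and the variable names are made up for illustration.

```r
# Sample input from the question; "\u2013" is the en dash between the two "1G" parts.
txt <- "(2) 1G\u20131G (0)"

# Same pattern as in the answer, with the en dash written as an escape.
pattern <- "[^()\u2013]*\\(([0-9]+)\\)[^()\u2013]*"

# Each match is one bracketed number plus the neighbouring non-dash text;
# it is replaced by the number padded with a space on each side.
padded <- gsub(pattern, " \\1 ", txt)

# trimws() strips the leading and trailing padding.
result <- trimws(padded)
print(result)

# The separator is still the en dash from the input. If a plain hyphen
# is wanted, as in the question, it can be swapped afterwards.
hyphenated <- sub("\u2013", "-", result, fixed = TRUE)
print(hyphenated)  # "2 - 0"
```

Note that the en dash is never matched by the pattern, so it survives the replacement and ends up between the two numbers.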
#2
You could extract them using base R with regexec and regmatches, like so:
(df <- data.frame(txt = c('(2) 1G–1G (0)','(1) 1G–1G (4)','(2) 1G–1G (0)', 'somejunkhere')))
getNumbers <- function(col) {
  (result <- sapply(col, function(x) {
    # match two bracketed numbers with no other brackets in between
    m <- regexec("\\((\\d+)\\)[^()]*\\((\\d+)\\)", x, perl = TRUE)
    groups <- regmatches(x, m)
    # NA when nothing matched, otherwise "first - second"
    (out <- ifelse(identical(groups[[1]], character(0)),
                   NA,
                   sprintf("%s - %s", groups[[1]][2], groups[[1]][3])))
  }))
}
df$extracted <- getNumbers(df$txt)
df
This yields
            txt extracted
1 (2) 1G–1G (0)     2 - 0
2 (1) 1G–1G (4)     1 - 4
3 (2) 1G–1G (0)     2 - 0
4  somejunkhere      <NA>
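Since regexec and regmatches both accept whole vectors, the same idea can probably be written without calling them once per row. The following is a sketch of that variant, not the answer's own code; the names m and extracted are illustrative, and [0-9] is used in place of \\d so that perl = TRUE is not needed.

```r
# Sample data from the answer above, including one row that should not match.
txt <- c("(2) 1G\u20131G (0)", "(1) 1G\u20131G (4)",
         "(2) 1G\u20131G (0)", "somejunkhere")

# regexec() and regmatches() work on the whole vector at once.
m <- regmatches(txt, regexec("\\(([0-9]+)\\)[^()]*\\(([0-9]+)\\)", txt))

# Each list element holds the full match plus the two groups,
# or is an empty character vector when nothing matched.
extracted <- vapply(m, function(g) {
  if (length(g) == 3) paste(g[2], g[3], sep = " - ") else NA_character_
}, character(1), USE.NAMES = FALSE)

print(extracted)
```

vapply is used here instead of sapply so that the result is guaranteed to be a character vector with one entry per row, which is convenient when assigning it to a data frame column.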
#3
I do not understand why you say it does not work:
sub(".*\\((\\d+).*\\((\\d+).*","\\1-\\2",DF$txt)
[1] "2-0" "1-4" "2-0"
or even:
transform(DF,extracted=sub(".*\\((\\d+).*\\((\\d+).*","\\1 - \\2",txt))
            txt extracted
1 (2) 1G–1G (0)     2 - 0
2 (1) 1G–1G (4)     1 - 4
3 (2) 1G–1G (0)     2 - 0
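As for why the pattern in the question returns the input unchanged: a likely explanation is its leading .+, which needs at least one character before the first "(", while the string starts with "(". When the pattern does not match, gsub leaves the string as it is. The sketch below checks this, assuming the dash in the pattern is the same en dash as in the data (written as "\u2013"); the names original and relaxed are illustrative.

```r
txt <- "(2) 1G\u20131G (0)"

# Pattern from the question: ".+" must consume at least one character
# before the first "(", but the string begins with "(".
original <- ".+\\(([0-9]+)\\) 1G\u20131G \\(([0-9]+)\\).*$"
print(grepl(original, txt))  # FALSE: no match, so gsub() changes nothing

# Same pattern with ".*", which allows an empty prefix.
relaxed <- ".*\\(([0-9]+)\\) 1G\u20131G \\(([0-9]+)\\).*$"
print(sub(relaxed, "\\1 - \\2", txt))  # "2 - 0"
```

The pattern in this answer starts with .* for the same reason, which is why it matches where the question's pattern does not.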