I have the following data frame:
df <- data.frame(city=c("in London", "in Manchester city", "in Sao Paolo"))
I am using str_extract to return the word after 'in' in a separate column.
library(stringr)
str_extract(df$city, '(?<=in\\s)\\w+')
This works fine for me in 95% of cases. However, in cases like "Sao Paolo" above, my regex returns "Sao" rather than the full city name.
Can someone please help me amend it to capture either:
1) everything up to the end of the text string I am extracting from, or
2) every word after 'in' when there is more than one?
Many thanks.
4 Solutions
#1
1
To match the rest of the string after the first "in" followed by a whitespace character, you can use
(?<=in\\s).+
The lookbehind matches the preposition "in" with a whitespace character after it, but does not include it in the match, since lookbehinds are zero-width assertions.
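As a minimal sketch of how this pattern behaves on the data frame from the question (assuming stringr is installed; the variable name first_word is only illustrative), compare it with the original \\w+ version:

```r
library(stringr)

df <- data.frame(city = c("in London", "in Manchester city", "in Sao Paolo"))

# \\w+ stops at the first non-word character, so only one word comes back
first_word <- str_extract(df$city, "(?<=in\\s)\\w+")
first_word
# [1] "London"     "Manchester" "Sao"

# .+ keeps matching up to the end of the string
df$onlyCity <- str_extract(df$city, "(?<=in\\s).+")
df$onlyCity
# [1] "London"          "Manchester city" "Sao Paolo"
```

Because the lookbehind is not anchored, it matches the first "in" followed by whitespace anywhere in the string. That is fine when every value starts with "in ".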
#2
1
Does this one-liner do it for you?
unlist(lapply(strsplit(c("in London", "in Sao Paulo", "in Manchester City"), "^in "), function(x) x[2]))
[1] "London" "Sao Paulo" "Manchester City"
#3
0
You can try this:
library(stringr)
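# '[^in ](.)*' starts at the first character that is not "i", "n" or a space,
# so it would also cut into lowercase city names; stripping the prefix is safer
df$onlyCity <- str_replace(df$city, '^in\\s+', '')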
df
                city        onlyCity
1          in London          London
2 in Manchester city Manchester city
3       in Sao Paolo       Sao Paolo
#4
0
gsub("^in[ ]*(.*$)", "\\1", df$city)
[1] "London" "Manchester city" "Sao Paolo"
Assumes that your strings start with "in", followed by any number of spaces (more than one will not break it), followed by the text of interest, which is captured from the first non-space character up to the end of the string.
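One detail worth checking is that [ ]* also accepts zero spaces. The base R sketch below uses a made-up test vector (the value "india" and the name stripped are assumptions for illustration, not from the question) to show the difference. It also shows a variant that requires at least one whitespace character:

```r
city <- c("in London", "in   Sao Paolo", "india")

# Pattern from the answer: [ ]* allows zero spaces, so "india" loses its "in"
original <- gsub("^in[ ]*(.*$)", "\\1", city)
original
# [1] "London"    "Sao Paolo" "dia"

# Variant (an assumption, not part of the answer): require one or more
# whitespace characters after the leading "in"
stripped <- sub("^in\\s+", "", city)
stripped
# [1] "London"    "Sao Paolo" "india"
```

If every value is known to start with "in " this makes no difference. The stricter variant only matters for strings that merely begin with the letters "in".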