Extract one or more words from a text string in R using stringr

I have the following data frame:

df <- data.frame(city=c("in London", "in Manchester city", "in Sao Paolo"))

I am using str_extract to return the word after 'in' in a separate column.

library(stringr)
str_extract(df$city, '(?<=in\\s)\\w+')

This works fine for me in 95% of cases. However, there are cases like "Sao Paolo" above where my regex would return "Sao" rather than the city name.

Can someone please help me with amending it to capture either:

1) everything to the end of the text string I am extracting from? OR

2) where there is more than one word after 'in', then return that too

Many thanks.

4 solutions

#1

To match the rest of the string after the first "in" followed by a space, you can use

(?<=in\\s).+

The lookbehind requires the preposition "in" plus one whitespace character immediately before the match, but does not return them inside the match, since lookbehinds are zero-width assertions.
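
As a minimal sketch, assuming the data frame from the question, the pattern could be applied like this. The column name city_name is made up for illustration and is not from the original post.

```r
library(stringr)

df <- data.frame(city = c("in London", "in Manchester city", "in Sao Paolo"))

# Everything after the first "in" + whitespace, up to the end of the string.
# city_name is an illustrative column name (an assumption, not from the post).
df$city_name <- str_extract(df$city, "(?<=in\\s).+")
df$city_name
# [1] "London"          "Manchester city" "Sao Paolo"
```

Strings that contain no "in" followed by whitespace come back as NA, since str_extract returns NA when nothing matches.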

#2

Does this one-liner do it for you?

unlist(lapply(strsplit(c("in London", "in Sao Paulo", "in Manchester City"), "in "), function(x) x[2]))
[1] "London"          "Sao Paulo"       "Manchester City"
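
One thing to watch for with the split approach: "in " is treated as a separator everywhere in the string, not only at the start. The sketch below uses a made-up input, "in Berlin Mitte", to show the difference from simply removing a leading "in " with sub; the variable names are illustrative.

```r
# "in Berlin Mitte" is a hypothetical input chosen because it contains "in " twice
x <- c("in London", "in Sao Paulo", "in Berlin Mitte")

# Splitting on "in " also splits inside "Berlin Mitte", so the second piece is cut short
split_result <- unlist(lapply(strsplit(x, "in "), function(p) p[2]))
split_result
# [1] "London"    "Sao Paulo" "Berl"

# Removing only a leading "in " leaves the rest of the string intact
sub_result <- sub("^in ", "", x)
sub_result
# [1] "London"       "Sao Paulo"    "Berlin Mitte"
```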

#3

You can try this:

library(stringr)
df$onlyCity <- str_extract(df$city, '[^in ](.)*')
df
                city        onlyCity
1          in London          London
2 in Manchester city Manchester city
3       in Sao Paolo       Sao Paolo
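
A caveat worth checking before relying on this pattern: [^in ] is a negated character class, so it means "any single character other than i, n, or space" and not "skip the word in". It gives the right result above because none of the city names start with a lower-case "i" or "n". The sketch below uses made-up lower-case inputs to show the difference from a lookbehind pattern.

```r
library(stringr)

# Hypothetical inputs: lower-case names starting with "n" or "i"
x <- c("in London", "in naples", "in istanbul")

# The match starts at the first character that is not "i", "n" or a space,
# so leading letters of the city name can be dropped as well
class_result <- str_extract(x, "[^in ](.)*")
class_result
# [1] "London"  "aples"   "stanbul"

# With a lookbehind, only the literal "in" + whitespace prefix is excluded
lookbehind_result <- str_extract(x, "(?<=in\\s).+")
lookbehind_result
# [1] "London"   "naples"   "istanbul"
```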

#4

gsub("^in[ ]*(.*$)", "\\1", df$city)
[1] "London"          "Manchester city" "Sao Paolo"

This assumes that your strings start with "in", followed by any number of spaces (it won't fail with more than one), followed by the text of interest, which is captured from the first non-space character up to the end of the string.
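
To illustrate those assumptions, here is a small sketch with made-up inputs. It shows that extra spaces are handled, that strings not starting with "in" are returned unchanged (rather than as NA), and that [ ]* also accepts zero spaces. The [ ]+ variant is a suggested variation, not part of the original answer.

```r
# Hypothetical inputs: extra spaces, no "in" prefix, and a word that starts with "in"
x <- c("in   Sao Paolo", "London", "inverness")

# [ ]* allows zero or more spaces after the leading "in"
star_result <- gsub("^in[ ]*(.*$)", "\\1", x)
star_result
# [1] "Sao Paolo" "London"    "verness"

# [ ]+ requires at least one space, so "inverness" no longer matches
plus_result <- gsub("^in[ ]+(.*$)", "\\1", x)
plus_result
# [1] "Sao Paolo" "London"    "inverness"
```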
