regex: Get the text between two words (R)

Date: 2022-04-22 21:46:56

I have a text document and I'm trying to get the text between the words "abstract" and "keywords" (in R). This is the code I'm using:

gsub(".*abstract\\s*|keywords.*", "\\1", string)

However, this didn't work because the word "abstract" occurs somewhere else in the text, so I made the match non-greedy by adding ? after the leading .* (just before "abstract"):

gsub(".*?abstract\\s*|keywords.*", "\\1", string)

But for some reason it now takes the text between "abstract" and "keywords" (which is what I want), but ALSO the text starting from the second "abstract" appearing in the text, all the way to the end. Any ideas?

2 Solutions

#1


It doesn't look like your pattern captures anything. You need a pair of parentheses around the part you want, so that \\1 refers to it and returns your target:

words <- c("these are some different abstract words that might be between keywords or they might just be bounded by abstract ideas")
gsub(".* abstract (.*) keywords.*", "\\1", words)
[1] "words that might be between"
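
If "abstract" or "keywords" shows up more than once, a variation that may help is to make both quantifiers lazy and switch to perl = TRUE, so the capture runs from the first "abstract" to the first "keywords" after it. This is only a sketch: the sample string txt and the name first_span are made up for illustration, and it assumes the first occurrence is the one you want.

```r
# Hypothetical sample text: "abstract" and "keywords" each appear twice.
txt <- "title abstract first part keywords k1 k2 intro abstract ideas keywords again"

# Anchor at the start, reach the first "abstract" lazily, capture lazily up to
# the first "keywords" that follows, then consume the rest of the string so
# that only the captured group \\1 is left after the substitution.
first_span <- sub("^.*?abstract\\s*(.*?)\\s*keywords.*$", "\\1", txt, perl = TRUE)
first_span
```

Note that sub returns the input unchanged when the pattern does not match, so it is worth checking for that case. If the text spans several lines, prefixing the pattern with (?s) lets . match newlines as well.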

#2


I think this should give you what you are looking for:

regmatches(string, gregexpr("(?<=abstract).*(?=keywords)", string, perl = TRUE))

What it does:

  • (?<=abstract) is a lookbehind: the match must start right after the word "abstract"
  • .* matches any number of characters (greedily)
  • (?=keywords) is a lookahead: the match must end right before the word "keywords"
  • gregexpr looks for the given regular expression in string
  • perl = TRUE enables the lookahead and lookbehind functionality
  • regmatches pulls out the matching piece of the string using the regular expression.
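
One caveat: because .* is greedy, the pattern above runs up to the last "keywords" in the string. A possible variation, sketched below, uses a lazy .*? so each match stops at the nearest "keywords". The sample string txt and the names between and all_spans are hypothetical; trimws is only there to drop the spaces next to the delimiters.

```r
# Hypothetical sample text: "abstract" and "keywords" each appear twice.
txt <- "title abstract first part keywords k1 k2 intro abstract ideas keywords again"
pat <- "(?<=abstract).*?(?=keywords)"

# regexpr() returns only the first match: the text after the first "abstract"
# up to the first "keywords" that follows it.
between <- trimws(regmatches(txt, regexpr(pat, txt, perl = TRUE)))

# gregexpr() returns every non-overlapping match as a list with one element
# per input string; [[1]] picks the matches for txt.
all_spans <- trimws(regmatches(txt, gregexpr(pat, txt, perl = TRUE))[[1]])

between
all_spans
```

Use regexpr if you only need the first span, and gregexpr if the document may contain several abstract/keywords pairs.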
