How do I extract keywords from Google search results page URLs?

Time: 2022-08-22 11:25:08

One of the variables in my dataset contains URLs of Google search results pages. I want to extract the search keywords from those URLs.

An example dataset:

keyw <- structure(list(user = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1", "p2"), class = "factor"),
                   url = structure(c(3L, 5L, 4L, 1L, 2L, 6L), .Label = c("https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw", "https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw#safe=off&q=five+short+fingers+", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+a+chair", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+handshake", "https://www.youtube.com/watch?v=6HOallAdtDI"), class = "factor")), 
              .Names = c("user", "url"), class = "data.frame", row.names = c(NA, -6L))

So far I was able to extract the search keyword parts from the URLs with:

library(stringr)
keyw$words <- sapply(str_extract_all(keyw$url, 'q=([^&#]*)'), paste, collapse=",")

However, this still doesn't give me the desired result. The above code gives the following result:

> keyw$words
[1] "q=high+five"                           
[2] "q=high+five,q=high+five+with+handshake"
[3] "q=high+five,q=high+five+with+a+chair"  
[4] "q=five+fingers"                        
[5] "q=five+fingers,q=five+short+fingers+"  
[6] ""                                      

There are three problems with this output:

  1. I only need the words as a string. Instead of q=high+five, I need high,five.
  2. As rows 2, 3 and 5 show, the URL sometimes contains two parts with search keywords. As the first part is merely a reference to the previous search, I only need the second search query.
  3. When the URL is not a Google search page URL, it should return NA.

The desired result should be:

> keyw$words
[1] "high,five"                           
[2] "high,five,with,handshake"
[3] "high,five,with,a,chair"  
[4] "five,fingers"                        
[5] "five,short,fingers"
[6] NA

How do I solve this?

5 answers

#1


11  

Another update after comment (looks too complex but it's the best I can achieve at this point :)):

keyw$words <- sapply(str_extract_all(str_extract(keyw$url,"https?:[/]{2}[^/]*google.*[/].*"),'(?<=q=|[+])([^$+#&]+)(?!.*q=)'),function(x) if(!length(x)) NA else paste(x,collapse=","))
> keyw$words
[1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"   "five,fingers"            
[5] "five,short,fingers"       NA             

The change is in the input to str_extract_all: instead of the full vector, it now receives a vector filtered by str_extract against a regex, so URLs that do not match become NA. Any regex can go there, to match more or less precisely what you want.

Here the regex is:

  • http matches the literal text http
  • s? matches zero or one s (the : that follows is literal)
  • [/]{2} matches exactly two slashes (using a character class avoids the ugly \\/ construction and keeps the pattern readable)
  • [^/]* matches any number of non-slash characters
  • google.*[/] matches the literal text google, followed by anything up to the last /
  • .* finally matches whatever follows the last slash, if anything

Replace * with + wherever you want to require that something is present (+ requires the preceding element to occur at least once).
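To make the filtering step concrete, here is a minimal base-R sketch of the same idea. It is only an illustration: it uses grepl and ifelse rather than stringr's str_extract, and the vector name urls, the name host_pat and the two sample URLs are assumptions made for this example.

```r
# Hypothetical sample input: one Google results URL and one non-Google URL.
urls <- c("https://www.google.nl/search?q=high+five&ie=utf-8",
          "https://www.youtube.com/watch?v=6HOallAdtDI")

# Same host pattern as in the answer: scheme, two slashes, a host that
# contains "google", then a slash and the rest of the URL.
host_pat <- "https?:[/]{2}[^/]*google.*[/].*"

# Keep URLs that match the pattern; replace everything else with NA.
filtered <- ifelse(grepl(host_pat, urls), urls, NA_character_)
print(filtered)
```

Any later extraction step then sees NA for the non-Google rows, which is what lets the final result contain NA for them.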


Update, heavily inspired by @BrodieG: this returns NA if there is no match, but it still matches any site whose URL contains q= in its parameters.

Still with the same method:

> keyw$words <- sapply(str_extract_all(keyw$url,'(?:(?<=q=|\\+)([^$+#&]+)(?!.*q=))'),function(x) if(!length(x)) NA else paste(x,collapse=","))
> keyw$words
[1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
[4] "five,fingers"             "five,short,fingers"       NA         

The lookbehind (?<=) ensures that q= or + comes immediately before the word, and the negative lookahead (?!) ensures that no further q= can be found before the end of the line.

The character class excludes the + sign, so each match stops at the end of a word.
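The lookaround behaviour described above can be tried in isolation with base R. The following is a sketch under assumptions: it uses gregexpr with perl = TRUE instead of stringr, and the names urls, word_pat, matches and words, as well as the shortened sample URLs, are made up for the illustration.

```r
# Hypothetical sample input; the first URL repeats q= after the '#'.
urls <- c("https://www.google.nl/search?q=high+five&ie=utf-8#safe=off&q=high+five+with+a+chair",
          "https://www.youtube.com/watch?v=6HOallAdtDI")

# (?<=q=|\\+)  the word must start right after "q=" or "+"
# [^$+#&]+     the word itself: stops at "+", "#", "&" (and "$")
# (?!.*q=)     no further "q=" may follow, so only the last query is used
word_pat <- "(?<=q=|\\+)[^$+#&]+(?!.*q=)"

matches <- regmatches(urls, gregexpr(word_pat, urls, perl = TRUE))

# Join the words with commas; URLs without any match become NA.
words <- vapply(matches,
                function(m) if (length(m) == 0) NA_character_ else paste(m, collapse = ","),
                character(1))
print(words)
```

Words from the first q= are rejected because another q= still follows them, which is how the earlier search is skipped.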

#2


8  

Or maybe this:

gsub("\\+", ",", gsub(".*q=([^&#]*[^+&]).*", "\\1", keyw$url))
# [1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
# [4] "five,fingers"             "five,short,fingers"  

#3


5  

Update (borrowing part of the regex from David):

dat <- as.character(keyw$url)
pat <- "^https://www\\.google\\.nl/.*\\bq=([^&]*[^&+]).*"
sapply(
  regmatches(dat, regexec(pat, dat)),
  function(x) if(!length(x)) NA else gsub("\\+", ",", x[[2]])
)

Produces:

[1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
[4] "five,fingers"             "five,short,fingers"       NA   

Using:

pat <- "^https://www\\.google.(?:com?.)?[a-z]{2,3}/.*\\b?q=([^&]*[^&+]).*"

This takes into account all country-specific Google domains.
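To see which hosts such a pattern accepts, a small check like the one below may help. It is a sketch, not the answer's exact regex: the variant here escapes every literal dot and keeps only the host part, and the names urls, google_host and is_google plus the sample URLs are assumptions for illustration.

```r
# Hypothetical sample URLs covering a few host shapes.
urls <- c("https://www.google.nl/search?q=a",
          "https://www.google.co.uk/search?q=a",
          "https://www.google.com/search?q=a",
          "https://www.youtube.com/watch?v=6HOallAdtDI")

# Variant of the pattern above with every literal dot escaped (an assumption,
# not the answer's exact regex): "google.", an optional "co." or "com.",
# then a two- or three-letter suffix followed by a slash.
google_host <- "^https://www\\.google\\.(?:com?\\.)?[a-z]{2,3}/"

is_google <- grepl(google_host, urls, perl = TRUE)
print(is_google)
```

Escaping the dots keeps the pattern from accepting arbitrary characters in those positions.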


Or:

gsub("\\+", ",", sub("^.*\\bq=([^&]*).*", "\\1", keyw$url))

Produces:

[1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
[4] "five,fingers"             "five,short,fingers,"     

Here we use greediness to make sure we skip everything up to the last q=... part, and then use the standard sub / \\1 trick to capture what we want. Finally, replace + with ,.
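The greedy-match approach can be extended to handle the trailing + and the non-Google row as well. The following base-R sketch shows one way to do that; the names urls, last_q and words, the shortened sample URLs, and the explicit NA step are additions for illustration rather than part of the answer above.

```r
# Hypothetical sample input: the query appears before and after the '#'.
urls <- c("https://www.google.nl/search?q=five+fingers&ie=utf-8#safe=off&q=five+short+fingers+",
          "https://www.youtube.com/watch?v=6HOallAdtDI")

# The leading ".*" is greedy, so it swallows everything up to the LAST "q=";
# the capture group then holds only the final query string.
last_q <- sub("^.*\\bq=([^&]*).*", "\\1", urls, perl = TRUE)

# sub() returns its input unchanged when nothing matches, so non-matching
# URLs are set to NA explicitly.
last_q[!grepl("\\bq=", urls, perl = TRUE)] <- NA_character_

# Drop trailing "+" signs, then turn the remaining "+" into commas.
words <- gsub("+", ",", sub("\\++$", "", last_q), fixed = TRUE)
print(words)
```

Note that this variant treats any URL containing q= as a search URL, so a host check would still be needed to exclude other sites.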

#4


3  

I'd try with:

x<-as.character(keyw$url)
vapply(regmatches(x,gregexpr("(?<=q=)[^&]+",x,perl=TRUE)),
       function(y) paste(unique(unlist(strsplit(y,"\\+"))),collapse=","),"")
#[1] "high,five"                "high,five,with,handshake"
#[3] "high,five,with,a,chair"   "five,fingers"            
#[5] "five,fingers,short"

#5


3  

There's got to be a cleaner way, but maybe something like:

sapply(strsplit(keyw$words, "q="), function(x) {
  x <- if (length(x) == 2) x[2] else x[3]
  gsub("+", ",", gsub("\\+$", "", x), fixed = TRUE)
})
# [1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
# [4] "five,fingers"             "five,short,fingers" 

Everything in one go:

keyw$words <- sapply(str_extract_all(keyw$url, 'q=([^&#]*)'),function(x) {
  x <- if (length(x) == 2) x[2] else x[1]
  x <- gsub("+", ",", gsub("\\+$", "", x), fixed = TRUE)
  gsub("q=","",x, fixed = TRUE)
})
