I'm trying to scrape a website, and I'm new to regular expressions. I have a long character vector; this is the line I'm targeting:
<h3 class=\"title4\">Results: <span id=\"hitCount.top\">10,079</span></h3>\n
I want to extract the number between <span id=\"hitCount.top\"> and </span>, in this case 10,079. My approach so far is not really working:
x <- '<h3 class=\"title4\">Results: <span id=\"hitCount.top\">10,079</span>'
m <- gregexpr(pattern="[<span id=\"hitCount.top\">].+[</span>]", x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
regmatches(x, m)
Any help will be appreciated.
2 solutions
#1
1
Just to illustrate how easy this can be if you use the XML package:
> library("XML")
> url = "PATH_TO_HTML"
> parsed_doc = htmlParse(file=url, useInternalNodes = TRUE)
> h3title4 <- getNodeSet(doc = parsed_doc, path = "//h3[@class='title4']")
> plain_text <- sapply(h3title4, xmlValue)
> plain_text
[1] "Results: 10,079"
> sub("\\D*", "", plain_text)
[1] "10,079"
The sub("\\D*", "", plain_text) line removes the first run of zero or more non-digit characters in the input: \D* matches "Results: " and replaces it with an empty string.
The example HTML I used was:
<html>
<body>
<h3 class="title4">Results: <span id="hitCount.top">10,079</span></h3>
<img width="10%" height="10%" src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Green-Up-Arrow.svg/2000px-Green-Up-Arrow.svg.png"/>
</body>
</html>
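To see the sub() step in isolation, here is a minimal sketch that starts from the string printed above rather than from a parsed page. The conversion to a number at the end is an addition of mine and assumes the comma is a thousands separator; the name hit_count is only illustrative.

```r
# Value copied from the plain_text output shown above
plain_text <- "Results: 10,079"

# sub() replaces only the first match; \\D* consumes the leading "Results: "
digits <- sub("\\D*", "", plain_text)
print(digits)

# Assumption: the comma is a thousands separator, so drop it before converting
hit_count <- as.numeric(gsub(",", "", digits, fixed = TRUE))
print(hit_count)
```

If the text started with a digit, \D* would match an empty string and sub() would leave the input unchanged.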
#2
1
Using the stringr library:
> library(stringr)
> str_extract(x, "(?<=<span id=\"hitCount.top\">)(.*?)(?=</span>)")
[1] "10,079"
Using gsub (sub can also be used here instead of gsub):
> gsub(".*<span id=\"hitCount.top\">(.*?)</span>.*", "\\1", x)
[1] "10,079"
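For completeness, the same lookaround idea can probably be kept in base R, staying close to the regmatches() approach from the question. In the original attempt the square brackets define character classes, so [<span id=\"hitCount.top\">] matches one single character from that set rather than the literal tag. A minimal sketch, assuming the tag text is exactly as shown in the question (the names pattern and hit are only illustrative):

```r
x <- '<h3 class="title4">Results: <span id="hitCount.top">10,079</span>'

# perl = TRUE enables lookbehind (?<=...) and lookahead (?=...);
# .*? is non-greedy, so it stops at the first </span>
pattern <- '(?<=<span id="hitCount\\.top">).*?(?=</span>)'

m <- regexpr(pattern, x, perl = TRUE)
hit <- regmatches(x, m)
print(hit)
```

regexpr() returns only the first match; gregexpr() with the same pattern would return all of them if the tag could occur more than once.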