With R, httr, and XML you can scrape this site; the relevant HTML code is below.
doc <- htmlTreeParse("http://www.mehaffyweber.com/Firm/Offices/", useInternal = TRUE)
<div id="content">
<img id="printLogo" style="padding-bottom:30px" src="/images/logo_print.jpg">
<div id="contentTitle">
<div style="height: 30px;">
<h1>Offices</h1>
<h3>Beaumont Location:</h3>
<p>
<p>
<br>
<h3>
<strong>Houston Location:</strong>
</h3>
<p>
<p>
<h3>
<strong>Austin Location:</strong>
</h3>
To extract only the cities where this company has offices, this XPath 1.0 code works:
(string <- xpathSApply(doc, "//h3", function(x) {
gsub("Location:|\\W|Â", "", xmlValue(x, trim = TRUE))}))
I tried to paste the state to the city with a second anonymous function but failed:
> (string <- xpathSApply(doc, "//h3", function(x) {
+ gsub("Location:|\\W|Â", "", xmlValue(x, trim = TRUE))} &&
+ function(x) {paste0(xmlValue(x), " , TX")}))
Error in { : invalid 'x' type in 'x && y'
A simpler attempt, without repeating function(x), failed as well:
> (string <- xpathSApply(doc, "//h3", function(x) {
+ gsub("Location:|\\W|Â", "", xmlValue(x, trim = TRUE)) &&
+ paste0(xmlValue(x), " , TX")}))
Error in gsub("Location:|\\W|Â", "", xmlValue(x, trim = TRUE)) && paste0(xmlValue(x), :
invalid 'x' type in 'x && y'
DESIRED OUTPUT: How might I combine both anonymous functions and create this string?
[1] "Beaumont, TX" "Houston, TX" "Austin, TX"
3 Solutions
#1
1
A couple of things. htmlParse is shorthand for htmlTreeParse(..., useInternal = TRUE). You have encoding issues with this document, so the RCurl library will help remove the strange characters you are encountering.
library(XML)
library(RCurl)
appHTML <- getURL("http://www.mehaffyweber.com/Firm/Offices/"
, .encoding = "UTF-8")
doc <- htmlParse(appHTML, encoding = "UTF-8")
xpathSApply is shorthand for two operations. It applies the XPath expression to the doc and gets the relevant nodes. Then the function the user supplies is applied to each of these nodes. The x passed to the function is basically one element of the output from:
getNodeSet(doc, "//h3")
or in shorthand
doc["//h3"]
Each element of doc["//h3"] is an internal XML node:
> str(doc['//h3'])
List of 3
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr>
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr>
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr>
- attr(*, "class")= chr "XMLNodeSet"
So the x in your function is just like an element of doc["//h3"], and you can experiment with doc["//h3"][[1]]:
x<- doc['//h3'][[1]]
temp <- gsub("\\WLocation:", "", xmlValue(x))
paste0(temp, ", TX")
[1] "Beaumont, TX"
Then you can apply this logic in your function:
xpathSApply(doc, "//h3", function(x){
temp <- gsub("\\WLocation:", "", xmlValue(x))
paste0(temp, ", TX")
})
[1] "Beaumont, TX" "Houston, TX" "Austin, TX"
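The two-step description above can be tried offline. This is a minimal sketch, not the answer's exact code: page_src is a hypothetical inline stand-in for the live page, add_state is a name chosen for illustration, and the pattern "\\s*Location:.*$" is a substitute for the answer's "\\WLocation:". It also shows how the two steps from the question are combined: && is a logical operator and cannot join functions or strings, so both steps go into one function body.

```r
library(XML)

# Hypothetical inline HTML standing in for the live page (assumption)
page_src <- '<html><body><div id="content">
<h3>Beaumont Location:</h3>
<h3><strong>Houston Location:</strong></h3>
<h3><strong>Austin Location:</strong></h3>
</div></body></html>'

doc <- htmlParse(page_src, asText = TRUE)

# One function that performs both steps in sequence:
# strip the "Location:" suffix, then append the state.
add_state <- function(x) {
  city <- sub("\\s*Location:.*$", "", xmlValue(x, trim = TRUE))
  paste0(city, ", TX")
}

# xpathSApply in one call ...
one_step <- xpathSApply(doc, "//h3", add_state)

# ... versus fetching the nodes first and applying the function afterwards
two_step <- sapply(getNodeSet(doc, "//h3"), add_state)

print(one_step)
print(two_step)
```

Both calls should produce the same character vector, which is a quick way to confirm what x is inside the anonymous function.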
#2
1
If you're willing to use rvest and stringr, it's a pretty simple solution:
library(rvest)
library(stringr)
pg <- read_html("http://www.mehaffyweber.com/Firm/Offices/")  # read_html() replaces the deprecated html()
found <- pg %>%
html_nodes("#content") %>%
html_text() %>%
str_match_all("([[:alpha:]]+), Texas")
sprintf("%s, TX", found[[1]][,2])
## [1] "Beaumont, TX" "Houston, TX" "Austin, TX"
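To see why found[[1]][,2] is the right index, here is a small sketch that runs the same pattern on a plain string. page_text is a hypothetical stand-in for what html_text() returns; the real page text will differ.

```r
library(stringr)

# Hypothetical text standing in for the html_text() result (assumption)
page_text <- "Offices: Beaumont, Texas; Houston, Texas; Austin, Texas"

# str_match_all returns a list with one matrix per input string.
# Column 1 holds the full match, column 2 the first capture group.
found <- str_match_all(page_text, "([[:alpha:]]+), Texas")

print(found[[1]])

result <- sprintf("%s, TX", found[[1]][, 2])
print(result)
## [1] "Beaumont, TX" "Houston, TX"  "Austin, TX"
```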
#3
1
You can use the following to get your desired result.
string <- xpathSApply(doc, '//h3', function(x) {
paste0(sub('^([A-Z][a-z]+).*', '\\1', xmlValue(x)), ', TX')
})
# [1] "Beaumont, TX" "Houston, TX" "Austin, TX"
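The regular expression can be checked without the XML package. In this sketch, headings is a hypothetical vector standing in for the xmlValue() results; the actual values on the page may carry extra whitespace or encoding debris.

```r
# Hypothetical heading texts standing in for xmlValue(x) (assumption)
headings <- c("Beaumont Location:", "Houston Location:", "Austin Location:")

# Keep the leading capitalised word (one uppercase letter followed by
# lowercase letters) and drop the rest of the string.
cities <- sub("^([A-Z][a-z]+).*", "\\1", headings)

result <- paste0(cities, ", TX")
print(result)
# [1] "Beaumont, TX" "Houston, TX"  "Austin, TX"
```

Note that the pattern captures only the first word, so a multi-word city such as "San Antonio" would come out as "San"; that is fine for the three offices here but worth knowing before reusing it.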