Scraping a table from myneta using R

Time: 2022-11-19 21:01:08

I am trying to scrape a table from http://myneta.info/uttarpradesh2017/index.php?action=summary&subAction=candidates_analyzed&sort=candidate#summary into RStudio.

Here's the code:

library(rvest)

url <- 'http://myneta.info/uttarpradesh2017/index.php?action=summary&subAction=candidates_analyzed&sort=candidate#summary'
webpage <- read_html(url)
candidate_info <- html_nodes(webpage, xpath = '//*[@id="main"]/div/div[2]/div[2]/table')
candidate_info <- html_table(candidate_info)
head(candidate_info)

But I get no output. What am I doing wrong?

1 solution

#1


That site has some very broken HTML, but it's workable.

I find it better to target nodes in a slightly less fragile way than a long positional path. The XPath below finds the table by its content: it matches the table whose header contains "Liabilities".

html_table() croaks on this page (or takes forever, and I didn't want to wait), so I ended up building the table "manually", one column at a time.
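
Before pointing this at the live site, the approach can be tried offline on a tiny inline document. The sketch below is illustrative only: the HTML is made up (a decoy table, a colspan filler row, and two data rows borrowed from the output further down), and it assumes the real page has the same general shape. The full script for the real page follows it.

```r
library(rvest)

# Column-name cleaner with the same logic as the answer's helper:
# lower-case, turn punctuation/whitespace runs into "_", trim stray "_",
# then make the names unique.
mcga <- function(x) {
  x <- gsub("[[:punct:][:space:]]+", "_", tolower(x))
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  make.unique(x, sep = "_")
}

# Made-up stand-in for the real page (an assumption, not the site's markup).
html <- '
<html><body>
<table><thead><tr><th>Menu</th></tr></thead>
  <tr><td>Home</td></tr></table>
<table>
  <thead><tr><th>SNo</th><th>Candidate </th><th>Total Assets</th><th>Liabilities</th></tr></thead>
  <tr><td colspan="4">filler row</td></tr>
  <tr><td>1</td><td>A Hasiv</td><td>Rs 3,94,24,827 ~ 3 Crore+</td><td>Rs 58,46,335 ~ 58 Lacs+</td></tr>
  <tr><td>2</td><td>A Wahid</td><td>Rs 75,106 ~ 75 Thou+</td><td>Rs 0 ~</td></tr>
</table>
</body></html>'
pg <- read_html(html)

# Pick the table by what its header says, not by where it sits in the page.
tab <- html_node(pg, xpath = ".//table[contains(thead, 'Liabilities')]")

# Keep only real data rows: rows with at least one <td> lacking colspan.
rows <- html_nodes(tab, xpath = ".//tr[td[not(@colspan)]]")

# Read the header cells, then pull each column out of the data rows.
header <- html_text(html_nodes(tab, "th"))
cols <- lapply(seq_along(header), function(i) {
  html_text(html_nodes(rows, xpath = sprintf("./td[%d]", i)), trim = TRUE)
})

xdf <- as.data.frame(setNames(cols, mcga(header)), stringsAsFactors = FALSE)
str(xdf)
```

Selecting by header text keeps working if the site wraps the table in another div, which is exactly what breaks a positional path like the one in the question.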

library(rvest)

# helper to clean column names
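mcga <- function(x) {
  x <- gsub("[[:punct:][:space:]]+", "_", tolower(x))  # punctuation/whitespace runs -> "_"
  x <- gsub("_+", "_", x)                               # collapse repeated underscores
  x <- gsub("(^_|_$)", "", x)                           # trim leading/trailing underscore
  make.unique(x, sep = "_")                             # de-duplicate names
}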

pg <- read_html("http://myneta.info/uttarpradesh2017/index.php?action=summary&subAction=candidates_analyzed&sort=candidate#summary")

# target the table
tab <- html_node(pg, xpath=".//table[contains(thead, 'Liabilities')]")

# get the rows so we can target columns
rows <- html_nodes(tab, xpath=".//tr[td[not(@colspan)]]")

# make a data frame
xdf <- do.call(
  cbind.data.frame,
  c(lapply(1:8, function(i) {
    # i-th cell of every data row, with surrounding whitespace trimmed
    html_text(html_nodes(rows, xpath=sprintf(".//td[%s]", i)), trim=TRUE)
  }), list(stringsAsFactors=FALSE))
)

# use the cleaned-up header cells as column names
xdf <- setNames(xdf, mcga(html_text(html_nodes(tab, "th"))))

str(xdf)
## 'data.frame': 4823 obs. of  8 variables:
##  $ sno          : chr  "1" "2" "3" "4" ...
##  $ candidate    : chr  "A Hasiv" "A Wahid" "Aan Shikhar Shrivastava" "Aaptab Urf Aftab" ...
##  $ constituency : chr  "ARYA NAGAR" "GAINSARI" "GOSHAINGANJ" "MUBARAKPUR" ...
##  $ party        : chr  "BSP" "IND" "Satya Shikhar Party" "Islam Party Hind" ...
##  $ criminal_case: chr  "0" "0" "0" "0" ...
##  $ education    : chr  "12th Pass" "10th Pass" "Graduate" "Illiterate" ...
##  $ total_assets : chr  "Rs 3,94,24,827 ~ 3 Crore+" "Rs 75,106 ~ 75 Thou+" "Rs 41,000 ~ 41 Thou+" "Rs 20,000 ~ 20 Thou+" ...
##  $ liabilities  : chr  "Rs 58,46,335 ~ 58 Lacs+" "Rs 0 ~" "Rs 0 ~" "Rs 0 ~" ...
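
Every column comes back as character, so the money columns need one more step before any arithmetic. Here is a minimal sketch, assuming every value follows the "Rs <digits with commas> ~ <approximation>" pattern visible in the str() output above. `parse_rupees` is a hypothetical helper name, not part of rvest.

```r
# Hypothetical helper: turn "Rs 3,94,24,827 ~ 3 Crore+" into 39424827.
parse_rupees <- function(x) {
  exact  <- sub("~.*$", "", x)          # drop the "~ 3 Crore+" approximation
  digits <- gsub("[^0-9]", "", exact)   # drop "Rs", spaces and comma separators
  as.numeric(digits)
}

# Sample values taken from the output above.
assets <- c("Rs 3,94,24,827 ~ 3 Crore+", "Rs 75,106 ~ 75 Thou+", "Rs 0 ~")
parse_rupees(assets)  # 39424827, 75106, 0
```

With the scraped data frame this would be applied as `xdf$total_assets <- parse_rupees(xdf$total_assets)`, and likewise for `liabilities`. Values that do not fit the assumed pattern should be checked by hand.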
