I am trying to scrape the data, using R, from this site: http://www.soccer24.com/kosovo/superliga/results/#
I can do the following:
library(rvest)
doc <- read_html("http://www.soccer24.com/kosovo/superliga/results/")  # html() is deprecated in newer rvest releases
but I am stumped on how to actually get to the data, because the data on the website seems to be generated by JavaScript. What I can do is
html_text(doc)
but that gives a long blob of weird text. It does include the data, but interspersed with odd code, and it's not at all clear how I would parse that.
What I want to extract is the match data (date, time, teams, result) for all of the matches. No other data is needed from this site.
Can anyone provide some hints as to how to extract that data from this site?
1 solution
Using Selenium with phantomjs:
library(RSelenium)
library(XML)  # provides htmlParse, removeNodes and readHTMLTable used below
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
appURL <- "http://www.soccer24.com/kosovo/superliga/results/#"
remDr$open()
remDr$navigate(appURL)
If you want to keep pressing the "more data" button until it is no longer visible (at which point all matches are presumed to be showing):
webElem <- remDr$findElement("css", "#tournament-page-results-more a")
while (webElem$isElementDisplayed()[[1]]) {
  # click the link, wait for the extra rows to load, then look the link up again
  webElem$clickElement()
  Sys.sleep(5)
  webElem <- remDr$findElement("css", "#tournament-page-results-more a")
}
doc <- htmlParse(remDr$getPageSource()[[1]])
Remove the unwanted round-header rows and use XML::readHTMLTable for simplicity:
# remove unwanted rounds html. Sometimes there are end of season extra games.
# These are presented in a separate table.
invisible(doc["//table/*/tr[@class='event_round']", fun = removeNodes])
appData <- readHTMLTable(doc, which = seq(length(doc["//table"])-1), stringsAsFactors = FALSE, trim = TRUE)
if(!is.data.frame(appData)){appData <- do.call(rbind, appData)}
row.names(appData) <- NULL
names(appData) <- c("blank", "Date", "hteam", "ateam", "score")
pJS$stop()
> head(appData)
  blank         Date           hteam            ateam score
1       01.04. 18:00     Ferronikeli          Ferizaj 4 : 0
2       01.04. 18:00          Istogu         Hajvalia 2 : 1
3       01.04. 18:00 Kosova Vushtrri Trepca Mitrovice 1 : 0
4       01.04. 18:00       Prishtina          Drenica 3 : 0
5       31.03. 18:00       Besa Peje            Drita 1 : 0
6       31.03. 18:00       Trepca 89       Vellaznimi 2 : 0
> tail(appData)
    blank         Date            hteam     ateam score
115       17.08. 22:00        Besa Peje Trepca 89 3 : 3
116       17.08. 22:00      Ferronikeli  Hajvalia 2 : 5
117       17.08. 22:00 Trepca Mitrovice   Ferizaj 1 : 0
118       17.08. 22:00       Vellaznimi   Drenica 2 : 1
119       17.08. 22:00  Kosova Vushtrri     Drita 0 : 1
120       16.08. 22:00        Prishtina    Istogu 2 : 1
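The round-removal and table-reading step can be tried without a browser session. The following is only a sketch on a small hand-written HTML string: the markup is an assumption made for illustration (the real page source may be structured differently), and it uses getNodeSet plus removeNodes rather than the doc[...] shorthand shown earlier.

```r
library(XML)

# Hypothetical stand-in for the rendered page source; the real markup may differ.
html <- "
<table>
  <tbody>
    <tr class='event_round'><td colspan='4'>Round 2</td></tr>
    <tr><td>01.04. 18:00</td><td>Ferronikeli</td><td>Ferizaj</td><td>4 : 0</td></tr>
    <tr><td>01.04. 18:00</td><td>Istogu</td><td>Hajvalia</td><td>2 : 1</td></tr>
  </tbody>
</table>"

doc <- htmlParse(html, asText = TRUE)

# Drop the round-header rows so that only match rows remain in the table.
rounds <- getNodeSet(doc, "//tr[@class='event_round']")
invisible(removeNodes(rounds))

# Without `which`, readHTMLTable returns a list of tables; take the first one.
matches <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)[[1]]
names(matches) <- c("Date", "hteam", "ateam", "score")
```

If the round rows were left in place, their single wide cell would end up as an extra, mostly empty row in the data frame, which is why they are removed before parsing.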
Carry out further formatting as needed.
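As one possible example of such formatting, the sketch below drops the empty column and splits the Date and score strings into separate fields. It runs on two sample rows copied from the output above. The new column names (day_month, time, hgoals, agoals) are illustrative choices, and it assumes every score has the form "h : a", which may not hold for postponed or awarded matches.

```r
# Sample rows copied from the output above; in practice use appData itself.
appData <- data.frame(
  blank = c("", ""),
  Date  = c("01.04. 18:00", "17.08. 22:00"),
  hteam = c("Ferronikeli", "Besa Peje"),
  ateam = c("Ferizaj", "Trepca 89"),
  score = c("4 : 0", "3 : 3"),
  stringsAsFactors = FALSE
)

# Drop the empty first column.
appData$blank <- NULL

# Split "dd.mm. HH:MM" into the day-month part and the kick-off time.
date_parts <- strsplit(appData$Date, " ", fixed = TRUE)
appData$day_month <- vapply(date_parts, `[`, character(1), 1)
appData$time      <- vapply(date_parts, `[`, character(1), 2)

# Split "h : a" into integer goal counts for the home and away team.
goals <- strsplit(appData$score, " : ", fixed = TRUE)
appData$hgoals <- as.integer(vapply(goals, `[`, character(1), 1))
appData$agoals <- as.integer(vapply(goals, `[`, character(1), 2))
```

The scraped dates carry no year, so building a proper Date column would additionally need the season's years to be supplied by hand.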