Struggling to understand how to scrape data from this site (using R)

Time: 2022-11-25 09:01:26

I am trying to scrape the data, using R, from this site: http://www.soccer24.com/kosovo/superliga/results/#

I can do the following:

library(rvest)
# html() was deprecated in favour of read_html() in later rvest releases
doc <- read_html("http://www.soccer24.com/kosovo/superliga/results/")

but am stumped on how to actually get to the data. This is because the actual data on the website seems to be generated by JavaScript. What I can do is

html_text(doc)

but that gives a long blob of weird text (which does include the data, but interspersed with odd code), and it's not at all clear how I would parse that.

What I want to extract is the match data (date, time, teams, result) for all of the matches. No other data is needed from this site.

Can anyone provide some hints as to how to extract that data from this site?

1 solution

#1


8  

Using Selenium with phantomjs

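library(RSelenium)
library(XML)  # provides htmlParse, readHTMLTable and removeNodes used below
# phantom() ships with older RSelenium releases; newer ones start PhantomJS via wdman::phantomjs()
pJS <- phantom()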
remDr <- remoteDriver(browserName = "phantomjs")
appURL <- "http://www.soccer24.com/kosovo/superliga/results/#"
remDr$open()
remDr$navigate(appURL)

If you want to press the "more data" button until it is no longer visible (at which point all matches are presumed to be showing):

webElem <- remDr$findElement("css", "#tournament-page-results-more a")
while(webElem$isElementDisplayed()[[1]]){
  webElem$clickElement()
  Sys.sleep(5)
  webElem <- remDr$findElement("css", "#tournament-page-results-more a")
}
doc <- htmlParse(remDr$getPageSource()[[1]])
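The loop above keeps clicking for as long as the link is displayed. Below is a minimal sketch of a guarded variant, assuming you want a cap on the number of clicks and want a failed element lookup to end the loop instead of raising an error. The names `click_until_hidden`, `max_clicks`, `pause` and the stand-in `fake_button` are hypothetical and introduced only for illustration; they are not part of RSelenium.

```r
# Click a button repeatedly until it is hidden, cannot be found,
# or max_clicks is reached. find_button is a function returning an
# object with isElementDisplayed() and clickElement() methods.
click_until_hidden <- function(find_button, max_clicks = 50, pause = 5) {
  clicks <- 0L
  while (clicks < max_clicks) {
    # Treat a failed lookup as "button is gone" (an assumption)
    button <- tryCatch(find_button(), error = function(e) NULL)
    if (is.null(button) || !isTRUE(button$isElementDisplayed()[[1]])) break
    button$clickElement()
    clicks <- clicks + 1L
    Sys.sleep(pause)  # give the page time to load more rows
  }
  clicks
}

# Stand-in for a web element (hypothetical): stays "visible" for three clicks
remaining <- 3
fake_button <- list(
  isElementDisplayed = function() list(remaining > 0),
  clickElement = function() remaining <<- remaining - 1
)
clicks_made <- click_until_hidden(function() fake_button, pause = 0)
print(clicks_made)
```

With a live session you would pass `function() remDr$findElement("css", "#tournament-page-results-more a")` as `find_button` and keep the 5 second pause used in the loop above.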

Remove the unwanted round rows and use XML::readHTMLTable for simplicity:

# Remove the unwanted round-header rows. Sometimes there are extra end-of-season games.
# These are presented in a separate table.
invisible(doc["//table/*/tr[@class='event_round']", fun = removeNodes])
appData <- readHTMLTable(doc, which = seq(length(doc["//table"])-1), stringsAsFactors = FALSE, trim = TRUE)
if(!is.data.frame(appData)){appData <- do.call(rbind, appData)}
row.names(appData) <- NULL
names(appData) <- c("blank", "Date", "hteam", "ateam", "score")
pJS$stop()
> head(appData)
  blank         Date           hteam            ateam score
1       01.04. 18:00     Ferronikeli          Ferizaj 4 : 0
2       01.04. 18:00          Istogu         Hajvalia 2 : 1
3       01.04. 18:00 Kosova Vushtrri Trepca Mitrovice 1 : 0
4       01.04. 18:00       Prishtina          Drenica 3 : 0
5       31.03. 18:00       Besa Peje            Drita 1 : 0
6       31.03. 18:00       Trepca 89       Vellaznimi 2 : 0

> tail(appData)
    blank         Date            hteam     ateam score
115       17.08. 22:00        Besa Peje Trepca 89 3 : 3
116       17.08. 22:00      Ferronikeli  Hajvalia 2 : 5
117       17.08. 22:00 Trepca Mitrovice   Ferizaj 1 : 0
118       17.08. 22:00       Vellaznimi   Drenica 2 : 1
119       16.08. 22:00  Kosova Vushtrri     Drita 0 : 1
120       16.08. 22:00        Prishtina    Istogu 2 : 1

Carry out further formatting as needed.
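As one possible example of that further formatting, here is a minimal base R sketch that drops the empty column and splits `Date` and `score` into separate fields. The sample rows are copied from the output above; the new column names (`day_month`, `kickoff`, `hgoals`, `agoals`) are hypothetical. It assumes every `Date` has the form "dd.mm. hh:mm" and every `score` the form "h : a". The scraped dates carry no year, so no conversion to a Date class is attempted.

```r
# A few rows in the same shape as appData above
appData <- data.frame(
  blank = c("", "", ""),
  Date  = c("01.04. 18:00", "31.03. 18:00", "17.08. 22:00"),
  hteam = c("Ferronikeli", "Besa Peje", "Ferronikeli"),
  ateam = c("Ferizaj", "Drita", "Hajvalia"),
  score = c("4 : 0", "1 : 0", "2 : 5"),
  stringsAsFactors = FALSE
)

# Drop the empty first column
appData$blank <- NULL

# Split "01.04. 18:00" into day/month and kick-off time
date_time <- strsplit(appData$Date, " ", fixed = TRUE)
appData$day_month <- vapply(date_time, `[`, character(1), 1)
appData$kickoff   <- vapply(date_time, `[`, character(1), 2)

# Split "4 : 0" into integer home and away goals
goals <- strsplit(appData$score, ":", fixed = TRUE)
appData$hgoals <- as.integer(trimws(vapply(goals, `[`, character(1), 1)))
appData$agoals <- as.integer(trimws(vapply(goals, `[`, character(1), 2)))

print(appData)
```

Rows that do not follow these patterns (for instance a score with extra text) would need separate handling.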
