I am trying to analyze some Formula 1 data. Wikipedia has a table with the data I want. I am importing the data into R with the code below:
library(XML)
library(RCurl)
url <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
tabs <- getURL(url)
tabs <- readHTMLTable(tabs, stringsAsFactors=FALSE)
pilots <- tabs[[3]]
pilots <- pilots[-dim(pilots)[1], ]
head(pilots[, 1])
[1] "Abate, CarloCarlo Abate"
[2] "Abecassis, GeorgeGeorge Abecassis"
[3] "Acheson, KennyKenny Acheson"
[4] "Adamich, Andrea deAndrea de Adamich"
[5] "Adams, PhilippePhilippe Adams"
[6] "Ader, WaltWalt Ader"
However, the driver names come out garbled: each cell holds the table's sort key immediately followed by the displayed name. I'd like them to look like this:
head(pilots[, 1])
[1] "Carlo Abate"
[2] "George Abecassis"
[3] "Kenny Acheson"
[4] "Andrea de Adamich"
[5] "Philippe Adams"
[6] "Walt Ader"
However, I can't seem to write a regex that handles this, nor can I find an argument to readHTMLTable that ignores the sortkey value in the table I am interested in. How can I solve my problem?
1 solution
Use readHTMLTable with a bespoke elFun argument.
library(XML)
library(RCurl)
url <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
tabs <- getURL(url)
# Cell handler passed to readHTMLTable via elFun.
# If the cell contains links, keep only the link text, which skips the
# hidden sort key; otherwise fall back to the plain cell text.
myFun <- function(x) {
  if (length(getNodeSet(x, ".//a")) > 0) {
    value <- xpathSApply(x, ".//a", fun = xmlValue)
    # Several links in one cell (e.g. seasons) are joined with commas
    return(paste(value, collapse = ","))
  }
  xmlValue(x, encoding = "UTF-8")
}
tabs <- readHTMLTable(tabs, elFun = myFun, stringsAsFactors=FALSE)
pilots <- tabs[[3]]
pilots <- pilots[-dim(pilots)[1], ]
> head(pilots[, 1])
[1] "Carlo Abate" "George Abecassis" "Kenny Acheson" "Andrea de Adamich"
[5] "Philippe Adams" "Walt Ader"
> pilots[1,]
Name Country Seasons Championships Entries Starts Poles Wins Podiums Fastest laps Points[note]
1 Carlo Abate Italy 1962,1963 0 2 0 0 0 0 0 0
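If you have already imported the table and would rather fix the names afterwards, a regex with backreferences is one possible alternative. This is only a sketch: it assumes every garbled cell has exactly the form "Last, FirstFirst Last" as in the question's output, and strip_sortkey is a hypothetical helper name, not part of any package. Cells that do not match the pattern are returned unchanged.

```r
# Sample of the garbled values shown in the question
raw <- c("Abate, CarloCarlo Abate",
         "Adamich, Andrea deAndrea de Adamich",
         "Ader, WaltWalt Ader")

# Hypothetical helper: \1 captures the last name, \2 the first name(s).
# The pattern requires the sort key "Last, First" to be followed by
# "First Last", and keeps only that second part.
strip_sortkey <- function(x) {
  sub("^(.+), (.+)\\2 \\1$", "\\2 \\1", x, perl = TRUE)
}

clean <- strip_sortkey(raw)
print(clean)
# [1] "Carlo Abate"       "Andrea de Adamich" "Walt Ader"
```

With the data frame from the question this would be applied as pilots[, 1] <- strip_sortkey(pilots[, 1]). The elFun approach is likely more robust, since it does not depend on how the sort key happens to be written.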