I have an XML file that contains data like this:
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1"
AcceptedAnswerId="15" CreationDate="2010-07-19T19:12:12.510" Score="27"
ViewCount="1647" Body="some text;" OwnerUserId="8"
LastActivityDate="2010-09-15T21:08:26.077"
Title="title" AnswerCount="5" CommentCount="1" FavoriteCount="17" />
[...]
(The dataset is a dump from stats.stackexchange.com)
How can I get a data.frame with the attributes "Id" and "PostTypeId"?
I have been trying with the XML library, but I get to a point where I don't know how to unwrap the values:
library(XML)
xml <- xmlTreeParse("Posts.xml", useInternalNodes = TRUE)
types <- getNodeSet(xml, '//row/@PostTypeId')
> types[1]
[[1]]
PostTypeId
"1"
attr(,"class")
[1] "XMLAttributeValue"
What would be the proper R way of projecting those two attributes from the XML into a data.frame?
2 Answers
#1
2
Using xml2 (the package that rvest is built on) you can do it as follows. Attribute names in XML are case-sensitive, so pass them to xml_attr exactly as they appear in the file:
library(xml2)
library(magrittr)
doc <- read_xml('<posts>
<row Id="1" PostTypeId="1"
AcceptedAnswerId="15" CreationDate="2010-07-19T19:12:12.510" Score="27"
ViewCount="1647" Body="some text;" OwnerUserId="8"
LastActivityDate="2010-09-15T21:08:26.077"
Title="title" AnswerCount="5" CommentCount="1" FavoriteCount="17" />
</posts>')
# select every <row> element
rows <- doc %>% xml_find_all("//row")
data.frame(
  Id = rows %>% xml_attr("Id"),
  PostTypeId = rows %>% xml_attr("PostTypeId")
)
Resulting in:
  Id PostTypeId
1  1          1
If you apply the same approach to Comments.xml with
rows <- read_xml("Comments.xml") %>% xml_find_all("//row")
dat <- data.frame(
  Id = rows %>% xml_attr("Id"),
  PostId = rows %>% xml_attr("PostId"),
  score = rows %>% xml_attr("Score")
)
You receive:
> head(dat)
  Id PostId score
1  1      3     5
2  2      5     0
3  3      9     0
4  4      5    11
5  5      3     1
6  6     14     9
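Note that xml_attr returns character values, so the extracted columns are strings rather than numbers. A minimal sketch of converting them, assuming xml2 is called directly and using a small made-up sample rather than the real dump (the second row and its missing Score attribute are illustrative additions):

```r
library(xml2)

# Made-up sample; the second row deliberately lacks a Score attribute
doc <- read_xml('<posts>
  <row Id="1" PostTypeId="1" Score="27" />
  <row Id="2" PostTypeId="2" />
</posts>')

rows <- xml_find_all(doc, "//row")

# xml_attr() yields NA for a missing attribute, and as.integer() keeps it NA
posts <- data.frame(
  Id         = as.integer(xml_attr(rows, "Id")),
  PostTypeId = as.integer(xml_attr(rows, "PostTypeId")),
  Score      = as.integer(xml_attr(rows, "Score")),
  stringsAsFactors = FALSE
)

str(posts)
```

Rows that lack an attribute end up with NA in that column, so the data.frame still has one row per <row> element.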
#2
2
This is actually a great use case for the xmlEventParse function in the XML package. This is a 200+ MB file, and the last thing you want to do is waste memory needlessly (XML parsing is notoriously memory-intensive) or waste time going through the nodes multiple times.
With xmlEventParse you can also filter out what you do not need while the file streams past, and you can sneak in a progress bar so you can see what's going on.
library(XML)
# get the # of <rows> quickly; you can approximate if you don't know the
# number or can't run this and then chop down the size of the data.frame
# afterwards
system("grep -c '<row' Posts.xml")
## 128010
n <- 128010
# pre-populate a data.frame
# you could also just write this data out to a file and read it back in
# which would negate the need to use global variables or pre-allocate
# a data.frame
dat <- data.frame(id=rep(NA_character_, n),
post_type_id=rep(NA_character_, n),
stringsAsFactors=FALSE)
# set up a progress bar since there are a lot of nodes
pb <- txtProgressBar(min=0, max=n, style=3)
# this function will be called for each <row>
# again, you could write to a file/database/whatever vs do this
# data.frame population
idx <- 1
# name is the element's tag name, attrs its attributes as a named character vector
process_row <- function(name, attrs) {
  # update the progress bar
  setTxtProgressBar(pb, idx)
  # get our data (you can filter here)
  dat[idx, "id"] <<- attrs["Id"]
  dat[idx, "post_type_id"] <<- attrs["PostTypeId"]
  # update the index
  idx <<- idx + 1
}
# start the parser
info <- xmlEventParse("Posts.xml", list(row=process_row))
# close up the progress bar
close(pb)
head(dat)
## id post_type_id
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 2
## 6 6 1
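The handler above updates dat and idx in the global environment with <<-. As a hedged alternative, the sketch below keeps that state inside a closure and filters rows while streaming. make_collector and its keep_type argument are hypothetical names introduced for illustration, the sample file is made up, and the generic startElement handler is used here instead of a tag-named one.

```r
library(XML)

# Tiny made-up stand-in for Posts.xml, written to a temporary file
path <- tempfile(fileext = ".xml")
writeLines(c(
  '<?xml version="1.0" encoding="utf-8"?>',
  '<posts>',
  '  <row Id="1" PostTypeId="1" Score="27" />',
  '  <row Id="2" PostTypeId="2" Score="3" />',
  '  <row Id="3" PostTypeId="1" Score="5" />',
  '</posts>'
), path)

# Hypothetical helper: returns a SAX handler plus an accessor for the result.
# n is the (approximate) number of rows to pre-allocate for;
# keep_type, if given, keeps only rows with that PostTypeId.
make_collector <- function(n, keep_type = NULL) {
  ids   <- rep(NA_character_, n)
  types <- rep(NA_character_, n)
  i <- 0L

  start_element <- function(name, attrs, ...) {
    # ignore every element that is not a <row>
    if (name != "row") return(invisible(NULL))
    type <- unname(attrs["PostTypeId"])
    # filter while streaming: skip rows of other post types
    if (!is.null(keep_type) && !identical(type, keep_type)) {
      return(invisible(NULL))
    }
    i <<- i + 1L
    ids[i]   <<- unname(attrs["Id"])
    types[i] <<- type
    invisible(NULL)
  }

  result <- function() {
    # drop the unused part of the pre-allocated vectors
    keep <- seq_len(i)
    data.frame(id = ids[keep], post_type_id = types[keep],
               stringsAsFactors = FALSE)
  }

  list(startElement = start_element, result = result)
}

collector <- make_collector(n = 3, keep_type = "1")
invisible(xmlEventParse(path, handlers = list(startElement = collector$startElement)))
questions <- collector$result()
print(questions)
```

Because <<- inside start_element resolves to the variables created in make_collector, nothing leaks into the global environment, and several collectors can be used in the same session without interfering with each other.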