Reading a large XML file into a data frame in R

I want to read my XML into a data frame in R. My data file is 14 GB, so my first attempt to read it didn't work:

library(XML)
f <- xmlParse("Final.xml")   # builds the whole document tree in memory
df <- xmlToDataFrame(f)
r <- xmlRoot(f)

The problem is that it always runs out of memory.

I've also seen the question:

How to read large (~20 GB) xml file in R?

I tried to use the approach from Martin Morgan, which I didn't fully understand but tried to apply to my dataset.

library(XML)

branchFunction <- function() {
  store <- new.env()
  # Called once for each complete <ROW> branch
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = "//Sentiment")
    value <- xmlValue(ns[[1]])
    print(value)
    # if storing something ...
    # store[[some_key]] <- some_value
  }
  getStore <- function() { as.list(store) }
  list(ROW = func, getStore = getStore)
}

myfunctions <- branchFunction()

xmlEventParse(
  file = "Inputfile.xml",
  handlers = NULL,
  branches = myfunctions
)

myfunctions$getStore()

I would have to do that for every column separately, and the structure I'm getting from the output is not useful.

The structure of my data looks like this:

<ROWSET>
<ROW>
    <Field1>21706</Field1>
    <PostId>19203</PostId>
    <ThreadId>38</ThreadId>
    <UserId>1397</UserId>
    <TimeStamp>1407351854</TimeStamp>
    <Upvotes>0</Upvotes>
    <Downvotes>0</Downvotes>
    <Flagged>f</Flagged>
    <Approved>t</Approved>
    <Deleted>f</Deleted>
    <Replies>0</Replies>
    <ReplyTo>egergeg</ReplyTo>
    <Content>dsfg</Content>
    <Sentiment>Neutral</Sentiment>
</ROW>
<ROW>
    <Field1>217</Field1>
    <PostId>1903</PostId>
    <ThreadId>8</ThreadId>
    <UserId>197</UserId>
    <TimeStamp>1407351854</TimeStamp>
    <Upvotes>0</Upvotes>
    <Downvotes>0</Downvotes>
    <Flagged>f</Flagged>
    <Approved>t</Approved>
    <Deleted>f</Deleted>
    <Replies>0</Replies>
    <ReplyTo>sdrwer</ReplyTo>
    <Content>wer</Content>
    <Sentiment>Neutral</Sentiment>
</ROW>
<ROW>
    <Field1>21306</Field1>
    <PostId>19103</PostId>
    <ThreadId>78</ThreadId>
    <UserId>13497</UserId>
    <TimeStamp>1407321854</TimeStamp>
    <Upvotes>0</Upvotes>
    <Downvotes>0</Downvotes>
    <Flagged>f</Flagged>
    <Approved>t</Approved>
    <Deleted>f</Deleted>
    <Replies>0</Replies>
    <ReplyTo>tzjtj</ReplyTo>
    <Content>rtgr</Content>
    <Sentiment>Neutral</Sentiment>
</ROW>
</ROWSET>

1 solution

#1

In your case, since you are dealing with a big dataset, you should indeed use xmlEventParse, which relies on SAX, i.e. the Simple API for XML. The advantage over xmlParse is that the whole XML tree is never loaded into R, which is what exhausts memory when the data is really big.
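
To illustrate the event-driven idea, here is a minimal sketch that only counts ROW elements as the parser walks through the document. The inline sample string and the names `xml_text` and `n_rows` are made up for this illustration; on the real file you would pass the file name instead and drop `asText = TRUE`.

```r
library(XML)

# Hypothetical two-row sample, shaped like the data in the question
xml_text <- "<ROWSET><ROW><PostId>19203</PostId></ROW><ROW><PostId>1903</PostId></ROW></ROWSET>"

n_rows <- 0

# startElement is invoked for every opening tag; no tree is ever built
handlers <- list(
  startElement = function(name, attrs, ...) {
    if (name == "ROW") n_rows <<- n_rows + 1
  }
)

invisible(xmlEventParse(xml_text, handlers = handlers, asText = TRUE,
                        useTagName = FALSE, addContext = FALSE))

n_rows
```

Because the handler only keeps a counter, memory use does not depend on how many rows the document contains.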

I don't have a big dataset at hand, so I cannot test under real conditions, but you can try this code snippet:

library(XML)

xmlDoc <- "Final.xml"
result <- NULL

# Function to use with xmlEventParse: returns the branch handlers
row.sax <- function() {
    # Called once for each complete <ROW> element
    ROW <- function(node) {
        children <- xmlChildren(node)
        # Drop the whitespace text nodes between the child elements
        children[which(names(children) == "text")] <- NULL
        result <<- rbind(result, sapply(children, xmlValue))
    }
    branches <- list(ROW = ROW)
    return(branches)
}

# Call xmlEventParse
xmlEventParse(xmlDoc, handlers = list(), branches = row.sax(),
              saxVersion = 2, trim = FALSE)

# And here is your data.frame
result <- as.data.frame(result, stringsAsFactors = FALSE)
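
One thing that may matter at 14 GB: the snippet above calls rbind once per row, which copies the growing matrix every time. A possible variation, sketched here and not tested on a large file either, collects each row in an environment and binds everything once at the end. The names `make_row_collector`, `get_rows` and the inline sample `xml_text` are hypothetical and only serve this illustration; it also assumes every ROW has the same child elements in the same order.

```r
library(XML)

# Hypothetical small sample with the same shape as the question's data
xml_text <- "<ROWSET>
<ROW><PostId>19203</PostId><Sentiment>Neutral</Sentiment></ROW>
<ROW><PostId>1903</PostId><Sentiment>Neutral</Sentiment></ROW>
</ROWSET>"

make_row_collector <- function() {
  store <- new.env()
  n <- 0L
  # Branch handler: called once for each complete <ROW> element
  ROW <- function(node, ...) {
    children <- xmlChildren(node)
    # Drop whitespace text nodes between the child elements
    children <- children[names(children) != "text"]
    n <<- n + 1L
    store[[as.character(n)]] <- sapply(children, xmlValue)
  }
  # Bind all stored rows in one step, in the order they were read
  get_rows <- function() {
    rows <- mget(as.character(seq_len(n)), envir = store)
    as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  }
  list(branches = list(ROW = ROW), get_rows = get_rows)
}

collector <- make_row_collector()
invisible(xmlEventParse(xml_text, handlers = list(),
                        branches = collector$branches,
                        saxVersion = 2, trim = FALSE, asText = TRUE))
df <- collector$get_rows()
```

For the real file, replace `xml_text` with the file name and remove `asText = TRUE`. All columns come back as character, so numeric fields such as PostId still need converting afterwards.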

Let me know how it runs!
