I'm trying to convert an xml file to a dataframe, but the format seems to be off. I've looked at different tutorials and, while I've been moderately succesful at getting the information I need using a for loop and navigating the parsed file, I've been told that this solution is not very efficient.
我正在尝试将xml文件转换为数据帧,但格式似乎已关闭。我已经看了不同的教程,虽然我已经成功获得了我需要的信息,使用for循环并导航解析后的文件,但我被告知这个解决方案效率不高。
I tried this code then:
我试过这段代码:
require(XML)
parsed<-xmlParse("SEWL.xml")
xmlToDataFrame(parsed)
But it gives an error: Error in [<-.data.frame
(*tmp*
, i, names(nodes[[i]]), value = c("\"LL18179\"\"2016/08\"0.32485.43896.59801.2131\"OK\"", : duplicate subscripts for columns
但它给出了一个错误:[< - 。data.frame(* tmp *,i,names(nodes [[i]]),value = c(“\”LL18179 \“\”2016/08 \“0.32485)出错.43896.59801.2131 \“OK \”“,:列的重复下标
This other code works, but the formatting is not what I need:
这个其他代码可以工作,但格式不是我需要的:
require(XML)
require(plyr)
pldf<-ldply(xmlToList("SEWL.xml"),data.frame)
The resulting dataframe is as follows:
结果数据框如下:
.id X..i.. text .attrs test.code test.validuntil test.meas.text test.meas..attrs test.meas.text.1
1 technician "John" <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 location "CO" <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 temp <NA> 21.3 celsius <NA> <NA> <NA> <NA> <NA>
4 runtype "routine" <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 sample <NA> <NA> 2323 "LL18179" "2016/08" 0.3248 baseline 5.4389
6 sample <NA> <NA> 2323 "LL18179" "2016/08" 0.3248 baseline 5.4389
7 sample <NA> <NA> 8979237 "AA09453" "2016/03" 0.0117 baseline 5.6012
8 sample <NA> <NA> 8979237 "AA09453" "2016/03" 0.0117 baseline 5.6012
9 .attrs 2015_07_31_11_33_22 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
10 .attrs 20150731 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
11 .attrs 113322 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
test.meas..attrs.1 test.meas.text.2 test.meas..attrs.2 test.calc test.result test..attrs test.code.1 test.validuntil.1
1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 std 6.5980 data 1.2131 "OK" laslum "ATR150607" "2017/05"
6 std 6.5980 data 1.2131 "OK" 3 "ATR150607" "2017/05"
7 std 1.1431 data 0.2041 "FAIL" absat <NA> <NA>
8 std 1.1431 data 0.2041 "FAIL" 2 <NA> <NA>
9 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
10 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
11 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
test.meas.text.3 test.meas..attrs.3 test.meas.text.4 test.meas..attrs.4 test.meas.text.5 test.meas..attrs.5
1 <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA> <NA> <NA> <NA> <NA> <NA>
5 0.0673 baseline 4.9721 std 10.3851 data
6 0.0673 baseline 4.9721 std 10.3851 data
7 <NA> <NA> <NA> <NA> <NA> <NA>
8 <NA> <NA> <NA> <NA> <NA> <NA>
9 <NA> <NA> <NA> <NA> <NA> <NA>
10 <NA> <NA> <NA> <NA> <NA> <NA>
11 <NA> <NA> <NA> <NA> <NA> <NA>
test.calc.1 test.result.1 test..attrs.1
1 <NA> <NA> <NA>
2 <NA> <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> <NA> <NA>
5 2.0886 "Warning" atr
6 2.0886 "Warning" 1
7 <NA> <NA> <NA>
8 <NA> <NA> <NA>
9 <NA> <NA> <NA>
10 <NA> <NA> <NA>
11 <NA> <NA> <NA>
This is the example XML file that I'm using:
这是我正在使用的示例XML文件:
<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
<technician>"John"</technician>
<location>"CO"</location>
<temp scale="celsius">21.3</temp>
<runtype>"routine"</runtype>
<sample id="2323">
<test name="laslum" order="3">
<code>"LL18179"</code>
<validuntil>"2016/08"</validuntil>
<meas name="baseline">0.3248</meas>
<meas name="std">5.4389</meas>
<meas name="data">6.5980</meas>
<calc>1.2131</calc>
<result>"OK"</result>
</test>
<test name="atr" order="1">
<code>"ATR150607"</code>
<validuntil>"2017/05"</validuntil>
<meas name="baseline">0.0673</meas>
<meas name="std">4.9721</meas>
<meas name="data">10.3851</meas>
<calc>2.0886</calc>
<result>"Warning"</result>
</test>
</sample>
<sample id="8979237">
<test name="absat" order="2">
<code>"AA09453"</code>
<validuntil>"2016/03"</validuntil>
<meas name="baseline">0.0117</meas>
<meas name="std">5.6012</meas>
<meas name="data">1.1431</meas>
<calc>0.2041</calc>
<result>"FAIL"</result>
</test>
</sample>
</experiment>
And the dataframe that I'm hoping to get:
我希望获得的数据框:
experiment technician location temp runtype sample test order code validuntil baseline std data calc result date time
1 abc123 John CO 21.3 routine 2323 laslum 3 LL18179 2016/08 0.3248 5.4389 6.5980 1.2131 OK 20150731 113322
2 abc123 John CO 21.3 routine 2323 atr 1 ATR150607 2017/05 0.0673 4.9721 10.3851 2.0886 Warning 20150731 113322
3 abc123 John CO 21.3 routine 8979237 absat 2 AA09453 2016/03 0.0117 5.6012 1.1431 0.2041 FAIL 20150731 113322
I don't need the exact same format, just something close enough so I can transform it into the example.
我不需要完全相同的格式,只需要足够接近的东西,以便将其转换为示例。
1 个解决方案
#1
6
We provide two approaches to parsing the XML. The first (performing a triple iteration over experiment/sample/test) would likely run faster but the second (using a single loop over the test nodes and at each test node reaching back up through the tree to grab its ancestors) has simpler code.
我们提供了两种解析XML的方法。第一个(对实验/样本/测试执行三次迭代)可能会运行得更快,但第二个(在测试节点上使用单个循环并在每个测试节点上通过树返回以获取其祖先)具有更简单的代码。
1) Using Lines
in the Note at the end we implement a triple xpathApply/xpathSApply iteration over experiment/sample/test nodes. e
, s
and t
represent the current such node, respectively.
1)最后在Note中使用Lines,我们在实验/样本/测试节点上实现三次xpathApply / xpathSApply迭代。 e,s和t分别代表当前这样的节点。
library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
do.call("rbind", xpathApply(doc, "//experiment", function(e) {
data.frame(experiment = xmlAttrs(e)[["name"]],
technician = xmlValue(e[["technician"]]),
location = xmlValue(e[["location"]]),
temp = xmlValue(e[["temp"]]),
runtype = xmlValue(e[["runtype"]]),
t(do.call(cbind, xpathApply(e, "sample", function(s) {
sample <- xmlAttrs(s)[["id"]]
xpathSApply(s, "test", function(t) {
c(sample = sample,
test = xmlAttrs(t)[["name"]],
order = xmlAttrs(t)[["order"]],
code = xmlValue(t[["code"]]),
validuntil = xmlValue(t[["validuntil"]]),
baseline = xmlValue(t["meas"][[1]]),
std = xmlValue(t["meas"][[2]]),
data = xmlValue(t["meas"][[3]]),
calc = xmlValue(t[["calc"]]),
result = xmlValue(t[["result"]])
)})}))),
date = xmlAttrs(e)[["date"]],
time = xmlAttrs(e)[["time"]]
)}))
giving:
experiment technician location temp runtype sample test order
1 abc123 "John" "CO" 21.3 "routine" 2323 laslum 3
2 abc123 "John" "CO" 21.3 "routine" 2323 atr 1
3 abc123 "John" "CO" 21.3 "routine" 8979237 absat 2
code validuntil baseline std data calc result date
1 "LL18179" "2016/08" 0.3248 5.4389 6.5980 1.2131 "OK" 20150731
2 "ATR150607" "2017/05" 0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3 "AA09453" "2016/03" 0.0117 5.6012 1.1431 0.2041 "FAIL" 20150731
time
1 113322
2 113322
3 113322
2) This is an alternate approach in which we loop only over the test nodes and then reach upward into the parent and grandparent to get the corresponding sample and experiement info.
2)这是一种替代方法,我们只在测试节点上循环,然后向上到父母和祖父母,以获得相应的样本和经验信息。
library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
do.call("rbind", xpathApply(doc, "//test", function(t) { # t is test node
s <- xmlParent(t) # s is sample node
e <- xmlParent(s) # e is experiment node
data.frame(experiment = xmlAttrs(e)[["name"]],
technician = xmlValue(e[["technician"]]),
location = xmlValue(e[["location"]]),
temp = xmlValue(e[["temp"]]),
runtype = xmlValue(e[["runtype"]]),
sample = xmlAttrs(s)[["id"]],
test = xmlAttrs(t)[["name"]],
order = xmlAttrs(t)[["order"]],
code = xmlValue(t[["code"]]),
validuntil = xmlValue(t[["validuntil"]]),
baseline = xmlValue(t["meas"][[1]]),
std = xmlValue(t["meas"][[2]]),
data = xmlValue(t["meas"][[3]]),
calc = xmlValue(t[["calc"]]),
result = xmlValue(t[["result"]]),
date = xmlAttrs(e)[["date"]],
time = xmlAttrs(e)[["time"]]
)
}))
giving:
experiment technician location temp runtype sample test order
1 abc123 "John" "CO" 21.3 "routine" 2323 laslum 3
2 abc123 "John" "CO" 21.3 "routine" 2323 atr 1
3 abc123 "John" "CO" 21.3 "routine" 8979237 absat 2
code validuntil baseline std data calc result date
1 "LL18179" "2016/08" 0.3248 5.4389 6.5980 1.2131 "OK" 20150731
2 "ATR150607" "2017/05" 0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3 "AA09453" "2016/03" 0.0117 5.6012 1.1431 0.2041 "FAIL" 20150731
time
1 113322
2 113322
3 113322
Note 1:
As an aside if you read the input XML file, SEWL.xml, into Excel it will do a reasonable job of putting it into a tabular format although some further processing would be needed to get it into precisely into the form in the question.
顺便说一句,如果您将输入XML文件SEWL.xml读入Excel,它将合理地将其放入表格格式中,尽管需要进一步处理才能将其精确地放入问题中的表单中。
Note 2:
The input Lines
as an R object is:
作为R对象的输入行是:
Lines <- '<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
<technician>"John"</technician>
<location>"CO"</location>
<temp scale="celsius">21.3</temp>
<runtype>"routine"</runtype>
<sample id="2323">
<test name="laslum" order="3">
<code>"LL18179"</code>
<validuntil>"2016/08"</validuntil>
<meas name="baseline">0.3248</meas>
<meas name="std">5.4389</meas>
<meas name="data">6.5980</meas>
<calc>1.2131</calc>
<result>"OK"</result>
</test>
<test name="atr" order="1">
<code>"ATR150607"</code>
<validuntil>"2017/05"</validuntil>
<meas name="baseline">0.0673</meas>
<meas name="std">4.9721</meas>
<meas name="data">10.3851</meas>
<calc>2.0886</calc>
<result>"Warning"</result>
</test>
</sample>
<sample id="8979237">
<test name="absat" order="2">
<code>"AA09453"</code>
<validuntil>"2016/03"</validuntil>
<meas name="baseline">0.0117</meas>
<meas name="std">5.6012</meas>
<meas name="data">1.1431</meas>
<calc>0.2041</calc>
<result>"FAIL"</result>
</test>
</sample>
</experiment>'
#1
6
We provide two approaches to parsing the XML. The first (performing a triple iteration over experiment/sample/test) would likely run faster but the second (using a single loop over the test nodes and at each test node reaching back up through the tree to grab its ancestors) has simpler code.
我们提供了两种解析XML的方法。第一个(对实验/样本/测试执行三次迭代)可能会运行得更快,但第二个(在测试节点上使用单个循环并在每个测试节点上通过树返回以获取其祖先)具有更简单的代码。
1) Using Lines
in the Note at the end we implement a triple xpathApply/xpathSApply iteration over experiment/sample/test nodes. e
, s
and t
represent the current such node, respectively.
1)最后在Note中使用Lines,我们在实验/样本/测试节点上实现三次xpathApply / xpathSApply迭代。 e,s和t分别代表当前这样的节点。
library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
do.call("rbind", xpathApply(doc, "//experiment", function(e) {
data.frame(experiment = xmlAttrs(e)[["name"]],
technician = xmlValue(e[["technician"]]),
location = xmlValue(e[["location"]]),
temp = xmlValue(e[["temp"]]),
runtype = xmlValue(e[["runtype"]]),
t(do.call(cbind, xpathApply(e, "sample", function(s) {
sample <- xmlAttrs(s)[["id"]]
xpathSApply(s, "test", function(t) {
c(sample = sample,
test = xmlAttrs(t)[["name"]],
order = xmlAttrs(t)[["order"]],
code = xmlValue(t[["code"]]),
validuntil = xmlValue(t[["validuntil"]]),
baseline = xmlValue(t["meas"][[1]]),
std = xmlValue(t["meas"][[2]]),
data = xmlValue(t["meas"][[3]]),
calc = xmlValue(t[["calc"]]),
result = xmlValue(t[["result"]])
)})}))),
date = xmlAttrs(e)[["date"]],
time = xmlAttrs(e)[["time"]]
)}))
giving:
experiment technician location temp runtype sample test order
1 abc123 "John" "CO" 21.3 "routine" 2323 laslum 3
2 abc123 "John" "CO" 21.3 "routine" 2323 atr 1
3 abc123 "John" "CO" 21.3 "routine" 8979237 absat 2
code validuntil baseline std data calc result date
1 "LL18179" "2016/08" 0.3248 5.4389 6.5980 1.2131 "OK" 20150731
2 "ATR150607" "2017/05" 0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3 "AA09453" "2016/03" 0.0117 5.6012 1.1431 0.2041 "FAIL" 20150731
time
1 113322
2 113322
3 113322
2) This is an alternate approach in which we loop only over the test nodes and then reach upward into the parent and grandparent to get the corresponding sample and experiement info.
2)这是一种替代方法,我们只在测试节点上循环,然后向上到父母和祖父母,以获得相应的样本和经验信息。
library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
do.call("rbind", xpathApply(doc, "//test", function(t) { # t is test node
s <- xmlParent(t) # s is sample node
e <- xmlParent(s) # e is experiment node
data.frame(experiment = xmlAttrs(e)[["name"]],
technician = xmlValue(e[["technician"]]),
location = xmlValue(e[["location"]]),
temp = xmlValue(e[["temp"]]),
runtype = xmlValue(e[["runtype"]]),
sample = xmlAttrs(s)[["id"]],
test = xmlAttrs(t)[["name"]],
order = xmlAttrs(t)[["order"]],
code = xmlValue(t[["code"]]),
validuntil = xmlValue(t[["validuntil"]]),
baseline = xmlValue(t["meas"][[1]]),
std = xmlValue(t["meas"][[2]]),
data = xmlValue(t["meas"][[3]]),
calc = xmlValue(t[["calc"]]),
result = xmlValue(t[["result"]]),
date = xmlAttrs(e)[["date"]],
time = xmlAttrs(e)[["time"]]
)
}))
giving:
experiment technician location temp runtype sample test order
1 abc123 "John" "CO" 21.3 "routine" 2323 laslum 3
2 abc123 "John" "CO" 21.3 "routine" 2323 atr 1
3 abc123 "John" "CO" 21.3 "routine" 8979237 absat 2
code validuntil baseline std data calc result date
1 "LL18179" "2016/08" 0.3248 5.4389 6.5980 1.2131 "OK" 20150731
2 "ATR150607" "2017/05" 0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3 "AA09453" "2016/03" 0.0117 5.6012 1.1431 0.2041 "FAIL" 20150731
time
1 113322
2 113322
3 113322
Note 1:
As an aside if you read the input XML file, SEWL.xml, into Excel it will do a reasonable job of putting it into a tabular format although some further processing would be needed to get it into precisely into the form in the question.
顺便说一句,如果您将输入XML文件SEWL.xml读入Excel,它将合理地将其放入表格格式中,尽管需要进一步处理才能将其精确地放入问题中的表单中。
Note 2:
The input Lines
as an R object is:
作为R对象的输入行是:
Lines <- '<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
<technician>"John"</technician>
<location>"CO"</location>
<temp scale="celsius">21.3</temp>
<runtype>"routine"</runtype>
<sample id="2323">
<test name="laslum" order="3">
<code>"LL18179"</code>
<validuntil>"2016/08"</validuntil>
<meas name="baseline">0.3248</meas>
<meas name="std">5.4389</meas>
<meas name="data">6.5980</meas>
<calc>1.2131</calc>
<result>"OK"</result>
</test>
<test name="atr" order="1">
<code>"ATR150607"</code>
<validuntil>"2017/05"</validuntil>
<meas name="baseline">0.0673</meas>
<meas name="std">4.9721</meas>
<meas name="data">10.3851</meas>
<calc>2.0886</calc>
<result>"Warning"</result>
</test>
</sample>
<sample id="8979237">
<test name="absat" order="2">
<code>"AA09453"</code>
<validuntil>"2016/03"</validuntil>
<meas name="baseline">0.0117</meas>
<meas name="std">5.6012</meas>
<meas name="data">1.1431</meas>
<calc>0.2041</calc>
<result>"FAIL"</result>
</test>
</sample>
</experiment>'