
Time: 2022-06-26 04:19:57

Let's say I have this XML file:


<?xml version="1.0" encoding="UTF-8" ?>
    <event date="2009-09-30" time="10:00:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="10:15:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="10:30:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="10:45:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="11:00:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="11:15:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="08:00:00" value="1.0" flag="2"></event>
    <event date="2009-09-30" time="08:15:00" value="2.6" flag="2"></event>
    <event date="2009-09-30" time="09:00:00" value="6.3" flag="2"></event>
    <event date="2009-09-30" time="09:15:00" value="4.4" flag="2"></event>
    <event date="2009-09-30" time="09:30:00" value="3.9" flag="2"></event>
    <event date="2009-09-30" time="09:45:00" value="2.0" flag="2"></event>
    <event date="2009-09-30" time="10:00:00" value="1.7" flag="2"></event>
    <event date="2009-09-30" time="10:15:00" value="2.3" flag="2"></event>
    <event date="2009-09-30" time="10:30:00" value="2.0" flag="2"></event>
    <event date="2009-09-30" time="10:00:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="10:15:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="10:30:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="10:45:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="11:00:00" value="0.0" flag="2"></event>

Let's also say I want to do something with its series elements, and that I would like to put into practice the advice "vectorize the vectorizable". I load the XML package and do the following:


R> library("XML")
R> doc <- xmlTreeParse('/home/mario/Desktop/sample.xml')
R> TimeSeriesNode <- xmlRoot(doc)
R> seriesNodes <- xmlElementsByTagName(TimeSeriesNode, "series")
R> length(seriesNodes)
[1] 3
R> (function(x){length(xmlElementsByTagName(x[['series']], 'event'))}
+ )(seriesNodes)
[1] 6

I don't understand why I only get the result of applying the function to the first element. I had expected three values, matching the length of seriesNodes, something like this:


R> mapply(length, seriesNodes)
series series series 
     7     10      6 

Oops! I have already come up with the answer, "use mapply":


R> mapply(function(x){length(xmlElementsByTagName(x, 'event'))}, seriesNodes)
series series series 
     6      9      5 

But then I see the following problem: the R Inferno tells me that I'm "loop-hiding", not "vectorizing"! Can I avoid looping altogether?


2 solutions



You could also use xpathApply or xpathSApply. These functions extract a node set using an XPath expression and then execute a function on each node in the set. Both functions are provided by the XML package. To use them, the XML document must be parsed with xmlInternalTreeParse, or with xmlTreeParse with the useInternalNodes option set to TRUE:


require( XML )

countEvents <- function( series ){

  events <- xmlElementsByTagName( series, 'event' )
  return( length( events ) )

}

doc <- xmlTreeParse( "sample.xml", useInternalNodes = TRUE )

xpathSApply( doc, '/TimeSeries/series', countEvents )
[1] 6 9 5

I don't know if it is any "faster", but the code is definitely cleaner and very explicit to anyone who knows the XPath syntax and how an apply function operates.
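For anyone who wants to try this without the original file, here is a minimal sketch that parses a small document from a string. The miniature XML and the TimeSeries/series nesting are assumptions inferred from the XPath expression above, not the poster's actual file, and counting with getNodeSet is just one possible variant of countEvents.

```r
library(XML)

# Hypothetical miniature document; the <TimeSeries>/<series> nesting is assumed
# from the XPath expression '/TimeSeries/series' used in the answer.
xml_text <- '<TimeSeries>
  <series>
    <event date="2009-09-30" time="10:00:00" value="0.0" flag="2"/>
    <event date="2009-09-30" time="10:15:00" value="0.0" flag="2"/>
  </series>
  <series>
    <event date="2009-09-30" time="08:00:00" value="1.0" flag="2"/>
  </series>
</TimeSeries>'

# xmlParse returns internal nodes, which the xpath* functions require.
doc <- xmlParse(xml_text, asText = TRUE)

# One count per <series>: select its <event> children with a relative XPath.
counts <- xpathSApply(doc, "/TimeSeries/series",
                      function(s) length(getNodeSet(s, "./event")))

# Pull one attribute from every <event> into a plain numeric vector.
values <- as.numeric(xpathSApply(doc, "//event", xmlGetAttr, "value"))

print(counts)
print(values)
```

Once the attribute values are in an ordinary numeric vector, the usual vectorised arithmetic applies to them.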




Since seriesNodes is a list of nodes, there is no easy way to avoid the implicit looping. Simple operations like getting the length are not computationally intensive, so I wouldn't lose any sleep over not being able to vectorise.


Note that you can use sapply(seriesNodes, length), instead of mapply, since there is only one argument to the length function.
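As for why the original call returned a single value, a small sketch with plain lists may help. The lists below are only stand-ins for the parsed series nodes (an assumption for illustration); the point is that `[[` with a name returns the first match only, while sapply visits every element.

```r
# Plain integer vectors stand in for the parsed <series> nodes (illustrative only).
seriesNodes <- list(series = 1:7, series = 1:10, series = 1:6)

# `[[` with a name returns only the first matching element,
# so a single function call only ever sees one node.
first_only <- length(seriesNodes[["series"]])

# sapply calls the function once per element and simplifies to a vector.
per_series <- sapply(seriesNodes, length)

# vapply does the same but also checks that each result is a single integer.
per_series_checked <- vapply(seriesNodes, length, integer(1))

print(first_only)
print(per_series)
```

vapply is worth considering when the result type is known in advance, since it fails loudly instead of silently returning a list.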


The "proper R way" to do things is to use (s|m)apply calls to extract vectors of useful bits of data, then analyse those in the usual manner.


Finally, if you really are desperate to vectorise counting events, use names(unlist(seriesNodes)) and then count the occurrences of "series.children.event.name" between each occurrence of "series.name". This is undoubtedly uglier, and possibly slower, than the sapply call.
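The counting step could be sketched roughly as follows. The character vector here is a hypothetical, shortened stand-in for names(unlist(seriesNodes)); the real vector would contain many more entries, and the sketch assumes each "series.name" entry comes before the events that belong to it.

```r
# Hypothetical stand-in for names(unlist(seriesNodes)).
nm <- c("series.name",
        "series.children.event.name", "series.children.event.name",
        "series.name",
        "series.children.event.name",
        "series.name")

# Each "series.name" starts a new group; cumsum numbers the groups 1, 2, 3, ...
grp <- cumsum(nm == "series.name")

# Count event markers per group; nbins keeps series that have no events.
event_counts <- tabulate(grp[nm == "series.children.event.name"],
                         nbins = max(grp))

print(event_counts)
```

There is no explicit or hidden R-level loop here, only vectorised comparisons, cumsum and tabulate, but the code is harder to read than the sapply version.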




