When would I want to use the xmlParse function versus the xmlTreeParse function? Also, when are the parameter values useInternalNodes=TRUE or asText=TRUE useful?
For example:
library("XML")
nct_url <- "http://clinicaltrials.gov/ct2/show/NCT00112281?resultsxml=true"
xml_doc <- xmlParse(nct_url, useInternalNodes=TRUE)
vs.
library("RCurl")  # getURL() comes from the RCurl package
doc <- xmlTreeParse(getURL(nct_url), useInternalNodes=TRUE)
top <- xmlRoot(doc)
top[["keyword"]]
xmlValue(top[["start_date"]])
xmlValue(top[["location"]])
People seem to use the xmlTreeParse function for getting a non-repeating node via the $doc$children$... traversal, but I am not sure I understand when each approach is best. Parsing XML is one of the things that almost made me abandon R and learn Python: there is a lack of for-dummies examples unless you buy a book.
1 answer
I am not an XML specialist, so this answer is based on my own experience with the XML package.
- xmlParse is a version of xmlTreeParse where the argument useInternalNodes is set to TRUE.
- If you want to get an R object, use xmlTreeParse. This can be inefficient and unnecessary if you just want to extract part of the XML document.
- If you don't want an R object, just a C pointer, use xmlParse. But you should know some XPath basics to manipulate the result.
- Use asText=TRUE if your input is a text string rather than a file or a URL.
Here is an example where I show the difference between the two functions:
txt <- "<doc>
<el> aa </el>
</doc>"
library(XML)
res <- xmlParse(txt,asText=TRUE)
res.tree <- xmlTreeParse(txt,asText=TRUE)
Now inspect the two objects:
> class(res)
[1] "XMLInternalDocument" "XMLAbstractDocument"
> class(res.tree)
[1] "XMLDocument" "XMLAbstractDocument"
You can see that res is an internal document: it is a pointer to a C object. res.tree is an R object, so you can access its components like this:
res.tree$doc$children
$doc
<doc>
<el>aa</el>
</doc>
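As a small sketch of working with the R-level tree, the accessor functions xmlRoot, xmlName and xmlChildren from the XML package can be used instead of spelling out the $doc$children path by hand. The input string is the same toy document as above, written on one line; the variable name root is only illustrative.

```r
library(XML)

txt <- "<doc><el> aa </el></doc>"
res.tree <- xmlTreeParse(txt, asText = TRUE)

# For this document, xmlRoot() gives the same node as res.tree$doc$children$doc
root <- xmlRoot(res.tree)

xmlName(root)               # name of the top-level element
names(xmlChildren(root))    # names of its child elements
xmlValue(root[["el"]])      # text content of the <el> child
```

Using the accessors keeps the code independent of how the document object happens to be laid out internally.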
For res, you should use a valid XPath query and one of these functions (xpathApply, xpathSApply, getNodeSet) to inspect it. For example:
xpathApply(res,'//el')
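To make the difference between the three functions a bit more concrete, here is a minimal sketch on a toy document with two el nodes (the second node and the variable names are my own additions for illustration). getNodeSet returns the matching nodes themselves, xpathApply applies a function to each node and returns a list, and xpathSApply tries to simplify that list to a vector.

```r
library(XML)

txt <- "<doc><el> aa </el><el> bb </el></doc>"
res <- xmlParse(txt, asText = TRUE)

nodes   <- getNodeSet(res, "//el")              # the matching internal nodes
as_list <- xpathApply(res, "//el", xmlValue)    # list with one result per node
as_vec  <- xpathSApply(res, "//el", xmlValue)   # simplified to a character vector

length(nodes)
is.list(as_list)
is.character(as_vec)
```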
Once you have a valid XML node, you can apply xmlValue, xmlGetAttr, etc. to extract node information. So here these two statements are essentially equivalent:
## we have already an R object, just apply xmlValue to the right child
xmlValue(res.tree$doc$children$doc)
## xpathSApply creates an R object for each matching node and passes it to xmlValue
xpathSApply(res,'//el',xmlValue)
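Coming back to the question, here is a rough sketch of how repeating and non-repeating nodes could be pulled out with XPath. The XML string is a made-up stand-in (the study element, its id attribute and the values are assumptions, not the real clinicaltrials.gov layout), so that the example runs without network access. It also checks that xmlTreeParse with useInternalNodes=TRUE gives the same kind of object as xmlParse.

```r
library(XML)

# Hypothetical document, loosely modelled on the node names in the question
txt <- '<study id="demo">
  <start_date>January 2005</start_date>
  <keyword>alpha</keyword>
  <keyword>beta</keyword>
</study>'

doc_a <- xmlParse(txt, asText = TRUE)
doc_b <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
identical(class(doc_a), class(doc_b))   # both are internal (C-level) documents

# Repeating nodes: one value per <keyword>
keywords <- xpathSApply(doc_a, "//keyword", xmlValue)

# Non-repeating node and an attribute of the root element
start    <- xpathSApply(doc_a, "/study/start_date", xmlValue)
study_id <- xmlGetAttr(xmlRoot(doc_a), "id")
```

With a real file or URL you would drop asText=TRUE and pass the path or address as the first argument.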