How do I read an XML input file, manipulate some nodes (remove and rename some), and write the output to a new XML output file?

Date: 2021-02-27 14:05:32

I need to read an XML file from internet and re-shape it. Here is the XML file and the code I have so far.

library(XML)
url <- 'http://ClinicalTrials.gov/show/NCT00001400?displayxml=true'
doc <- xmlParse(url, useInternalNodes = TRUE)

I was able to use some functions within the XML package with success (e.g., getNodeSet), but I am not an expert. There are some examples on the internet, but I was not able to crack this problem myself. I also know some XPath, but that was 4 years ago, and I am not an expert on sapply and similar functions.

But my goal is this:

  1. I need to remove a whole set of XML child branches about location, for example: <location> ... anything </location>. There can be multiple nodes with location data; I simply don't need that detail in the output. The XML file above always conforms to an XSD schema. The root node is called <clinical_study>.

  2. The resulting simplified file should be written to a new XML file called "data-changed.xml".

  3. I also need to rename one branch and move it out of its old nested location:

    <eligibility> <criteria> <textblock> Inclusion criteria are xyz </textblock> ...

  4. In the new output ("data-changed.xml"), this content should appear under a different node name, directly under the root node:

    <eligibility_criteria> Inclusion criteria are xyz </eligibility_criteria>

So I need to:

  • read the XML into memory
  • manipulate the tree (prune it somewhere)
  • move some XML nodes to a new place and under a new name and
  • write the resulting XML output file.

Any ideas are greatly appreciated.

Also, if you know of a nice (recent!) tutorial on XML parsing within R, or a book chapter that tackles it, please share the reference. (I read the vignettes by Duncan, and they are too advanced and too concise for me.)

3 solutions

#1


5  

Code to remove all location nodes:

r <- xmlRoot(doc)
removeNodes(r[names(r) == "location"])
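
Extending this to the other steps in the question, the following is a minimal sketch rather than a definitive implementation. It assumes the XML package, uses a small made-up inline record in place of the ClinicalTrials.gov download, and writes to a temporary directory instead of the working directory. The variable names (xml_text, crit_path, out_file) are illustrative only.

```r
library(XML)

# Made-up stand-in for a ClinicalTrials.gov record (not real data).
xml_text <- '<clinical_study>
  <brief_title>Example study</brief_title>
  <eligibility>
    <criteria>
      <textblock>Inclusion criteria are xyz</textblock>
    </criteria>
    <gender>Both</gender>
  </eligibility>
  <location><facility><name>Site A</name></facility></location>
  <location><facility><name>Site B</name></facility></location>
</clinical_study>'

# Read the XML into memory as an internal (C-level) document.
doc <- xmlParse(xml_text, asText = TRUE)
root <- xmlRoot(doc)

# 1. Prune: remove every <location> child of the root.
invisible(removeNodes(getNodeSet(doc, "/clinical_study/location")))

# 2. Move and rename: copy the criteria text into a new
#    <eligibility_criteria> node under the root, then drop the old branch.
crit_path <- "/clinical_study/eligibility/criteria"
crit_text <- xpathSApply(doc, paste0(crit_path, "/textblock"), xmlValue)
if (length(crit_text) > 0) {
  newXMLNode("eligibility_criteria", crit_text[1], parent = root)
  invisible(removeNodes(getNodeSet(doc, crit_path)))
}

# 3. Write the result (assumption: a temp directory is used for safety;
#    replace out_file with "data-changed.xml" to match the question).
out_file <- file.path(tempdir(), "data-changed.xml")
invisible(saveXML(doc, file = out_file))
```

The length check guards against records that have no criteria text, in which case the tree is left as it is apart from the removed locations.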

#2


4  

The quick answer to your question on how to apply an xpath to an xml file is to use xpathSApply. This works for me:

library(XML)
nct_url <- "http://clinicaltrials.gov/ct2/show/NCT00112281?resultsxml=true"
xml_doc <- xmlParse(nct_url, useInternalNodes = TRUE)
elig_path <- "/clinical_study/eligibility/criteria/textblock"
elig_text <- xpathSApply(xml_doc, elig_path, xmlValue)

I'm doing some work with clinicaltrials.gov XML files, using R and its XML package. The package is tricky, and I only partially understand it. I've written a function to help deal with missing nodes in the XML:

findbyxpath <- function(xmlfile, xpath) {
  xmldoc <- xmlParse(xmlfile)
  result <- try(xpathSApply(xmldoc, xpath, xmlValue), silent = TRUE)
  # an empty list is returned if the node is not found; treat an XPath error the same way
  if (inherits(result, "try-error") || length(result) == 0) {
    return("")
  } else {
    return(result)
  }
}

I work with XML files downloaded from clinicaltrials.gov ahead of time, so file below names one of those. My example would then look like this:

file <- "NCT00112281.xml"
elig_text <- findbyxpath(file, elig_path)

Hope this helps.

#3


2  

This is what XSLT is designed for. It's a little bit of a learning curve, but once mastered, it's by far the most effective way of doing this kind of work. And you can translate your English rules directly into XSLT rules: for example, your first rule that says strip all location elements and their children is simply:

<xsl:template match="location"/>

and the rule about moving content to be under the new root node might be:

<xsl:template match="/">
  <new-root-node>
    <xsl:copy-of select="//eligibility/criteria"/>
    <xsl:apply-templates/>
  </new-root-node>
</xsl:template>
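
For readers who want to try this approach from R, here is a hedged sketch. It assumes the xml2 and xslt packages are installed (neither is mentioned in the original answers), uses a made-up inline record, and combines the location rule with a standard identity template. The <eligibility_criteria> name comes from the question; the rest of the stylesheet is only one possible reading of the requirements.

```r
library(xml2)
library(xslt)

# Made-up stand-in for a ClinicalTrials.gov record (not real data).
doc <- read_xml('<clinical_study>
  <brief_title>Example study</brief_title>
  <eligibility>
    <criteria><textblock>Inclusion criteria are xyz</textblock></criteria>
    <gender>Both</gender>
  </eligibility>
  <location><facility><name>Site A</name></facility></location>
</clinical_study>')

style <- read_xml('<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Identity template: copy everything not matched by a rule below. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <!-- Strip all location elements and their children. -->
  <xsl:template match="location"/>

  <!-- Drop the old nested criteria branch. -->
  <xsl:template match="eligibility/criteria"/>

  <!-- Copy the root and append the renamed node directly under it. -->
  <xsl:template match="/clinical_study">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
      <eligibility_criteria>
        <xsl:value-of select="normalize-space(eligibility/criteria/textblock)"/>
      </eligibility_criteria>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>')

# Apply the stylesheet and write the transformed document.
result <- xml_xslt(doc, style)
out_file <- file.path(tempdir(), "data-changed.xml")
write_xml(result, out_file)
```

The identity template is what keeps every other element intact, so each extra rule only has to describe what should change.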

This is just a flavour of course - you haven't specified your transformation rules precisely enough to translate into accurate code.

