How to extract specific elements from an XML node

Time: 2022-07-11 21:19:42

I have something like this:

<ValuesPeaks>
  <Peak Start="244" Stop="248" Max="245" XValue="149" YValue="100.0000"/>
  <Peak Start="361" Stop="368" Max="366" XValue="173.2" YValue="96.2713"/>
<ValuesPeaks>

Except they are a lot longer, and I have about 300 sets of <ValuesPeaks>. How can I extract only the XValue and YValue elements of everything? I thought I could do xpathSApply('//ValuesPeaks[XValue]',xmlValue), but it's not working. I then thought I could call toString.XMLNode() and then use regexpr() and substr() to obtain what I want, but that seems inefficient. I think I'm missing something. Please share your expertise. Thanks.

p<-list.files()[[1]]
library(XML)
x<-xmlParse(p)
getNodeSet(x,'//Data/RESULT/*/*/*/ValuesPeaks/Peak')
f<-xpathSApply(x,'//Data/RESULT/*/*/*/ValuesPeaks/Peak')
t<-toString.XMLNode(f)

2 Solutions

#1


There are a few ways to extract those attributes. It all depends on what you want the result to look like. Here are a couple of examples.

The first uses xmlAttrs() and subsets the results.

xpathApply(doc, "//ValuesPeaks//*", function(x) xmlAttrs(x)[c("XValue", "YValue")])
# [[1]]
#     XValue     YValue 
#      "149" "100.0000" 
#
# [[2]]
#    XValue    YValue 
#   "173.2" "96.2713" 
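
If a table is easier to work with than a list, the per-node vectors can be stacked row by row. This is only a minimal sketch built on the same sample data; the names `attrs`, `m`, and `peaks` are made up for illustration, and it assumes every Peak node carries both attributes.

```r
library(XML)

# Same sample data as in this answer (with a proper closing tag)
txt <- '<ValuesPeaks>
  <Peak Start="244" Stop="248" Max="245" XValue="149" YValue="100.0000"/>
  <Peak Start="361" Stop="368" Max="366" XValue="173.2" YValue="96.2713"/>
</ValuesPeaks>'
doc <- xmlParse(txt)

# One named character vector (XValue, YValue) per <Peak> node
attrs <- xpathApply(doc, "//ValuesPeaks/Peak",
                    function(x) xmlAttrs(x)[c("XValue", "YValue")])

# Stack the vectors into a character matrix, one row per peak
m <- do.call(rbind, attrs)

# Attribute values are always strings, so convert them explicitly
peaks <- data.frame(XValue = as.numeric(m[, "XValue"]),
                    YValue = as.numeric(m[, "YValue"]))
```

With 300 sets in one file, the same call should return one row per Peak across all of them, since the XPath matches every ValuesPeaks node in the document.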

The second is likely more efficient. It uses an XPath statement to get the two relevant attributes.

xpathSApply(doc, "//ValuesPeaks//@*[name()='XValue' or name()='YValue']")
#    XValue     YValue     XValue     YValue 
#     "149" "100.0000"    "173.2"  "96.2713" 
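
A variation on the same idea is to select each attribute by name with `@`, which is how XPath addresses attributes (a predicate such as `[XValue]` tests for a child element instead). The sketch below is an illustration rather than a definitive recipe: the names `xv`, `yv`, and `peaks` are hypothetical, and it assumes every Peak has both attributes, otherwise the two vectors would no longer line up.

```r
library(XML)

# Same sample data as in this answer
txt <- '<ValuesPeaks>
  <Peak Start="244" Stop="248" Max="245" XValue="149" YValue="100.0000"/>
  <Peak Start="361" Stop="368" Max="366" XValue="173.2" YValue="96.2713"/>
</ValuesPeaks>'
doc <- xmlParse(txt)

# Each query returns a character vector with one value per <Peak>
xv <- as.numeric(xpathSApply(doc, "//ValuesPeaks/Peak/@XValue"))
yv <- as.numeric(xpathSApply(doc, "//ValuesPeaks/Peak/@YValue"))

# Pair the two columns; this relies on both vectors having equal length
peaks <- data.frame(XValue = xv, YValue = yv)
```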

You could even do

sapply(unname(xmlToList(doc)), "[", c("XValue", "YValue"))
#        [,1]       [,2]     
# XValue "149"      "173.2"  
# YValue "100.0000" "96.2713"

Data:

txt <- '<ValuesPeaks>
  <Peak Start="244" Stop="248" Max="245" XValue="149" YValue="100.0000"/>
  <Peak Start="361" Stop="368" Max="366" XValue="173.2" YValue="96.2713"/>
</ValuesPeaks>'
library(XML)
doc <- xmlParse(txt)

#2


Your XML is malformed (the second ValuesPeaks tag needs a / to make it a closing tag), which causes xml2::read_xml to complain. read_html automatically repairs it, though, so you can do

library(xml2)
library(tidyverse)

x <- '<ValuesPeaks>
  <Peak Start="244" Stop="248" Max="245" XValue="149" YValue="100.0000"/>
  <Peak Start="361" Stop="368" Max="366" XValue="173.2" YValue="96.2713"/>
<ValuesPeaks>' 

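df <- x %>% 
    read_html() %>%    # the HTML parser repairs the unclosed tag and lowercases names
    xml_find_all('//peak') %>% {
        # tibble() replaces data_frame(), which is deprecated in current tibble releases
        tibble(xvalue = xml_attr(., 'xvalue'), 
               yvalue = xml_attr(., 'yvalue'))
    } %>% 
    type_convert()    # parse the character columns as numbers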

df
#> # A tibble: 2 x 2
#>   xvalue   yvalue
#>    <dbl>    <dbl>
#> 1  149.0 100.0000
#> 2  173.2  96.2713
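
If the real files are well-formed (that is, the missing / is only a typo in the question), read_xml should work directly and keeps the original capitalisation of tag and attribute names. A minimal sketch under that assumption, using only xml2 and base R; the names `txt`, `doc`, `peaks`, and `df2` are chosen here for illustration.

```r
library(xml2)

# Well-formed version of the sample, with a proper closing tag
txt <- '<ValuesPeaks>
  <Peak Start="244" Stop="248" Max="245" XValue="149" YValue="100.0000"/>
  <Peak Start="361" Stop="368" Max="366" XValue="173.2" YValue="96.2713"/>
</ValuesPeaks>'

doc <- read_xml(txt)

# XML parsing is case-sensitive, so the names match the file exactly
peaks <- xml_find_all(doc, "//Peak")

# xml_attr() is vectorised over the node set and returns character values
df2 <- data.frame(XValue = as.numeric(xml_attr(peaks, "XValue")),
                  YValue = as.numeric(xml_attr(peaks, "YValue")))
```

A Peak without one of the attributes would yield NA in that column rather than shifting the rows, which may be preferable when the 300 sets are not perfectly uniform.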
