Import all fields (and subfields) of XML as a dataframe

Time: 2021-12-30 18:15:39

To do some analysis, I want to import an XML file into a data frame using R and the XML package. Example of the XML file:

<watchers shop_name="TEST" created_at="September 14, 2012 05:44">
<watcher channel="Site Name">
    <code>123456</code>
    <search_key>TestKey</search_key>
    <date>September 14, 2012 04:15</date>
    <result>Found</result>
    <link>http://www.test.com/fakeurl</link>
    <price>100.0</price>
    <shipping>0.0</shipping>
    <origposition>0</origposition>
    <name>Name Test</name>
    <results>
        <result position="1">
            <c_name>CTest1</c_name>
            <c_price>599.49</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>599.49</c_total_price>
            <c_rating>8.3</c_rating>
            <c_delivery/>
        </result>
        <result position="2">
            <c_name>CTest2</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>9.8</c_rating>
            <c_delivery/>
        </result>
        <result position="3">
            <c_name>CTest3</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>8.8</c_rating>
            <c_delivery/>
        </result>
    </results>
</watcher>
</watchers>

I want each row of the data frame to contain the following fields:

shop_name   created_at  code    search_key  date    result
link    price   shipping    origposition    name    
position    c_name  c_price c_shipping  c_total_price   
c_rating    c_delivery

This means that the child nodes must be taken into account as well, which would result in a data frame of three rows in this example (since the results contain 3 positions). The fields

shop_name   created_at  code    search_key
date    result  link    price   shipping    
origposition    name

are the same for each of these rows.

I am able to go through the XML file, but I am unable to get a data frame with the fields I want. When I convert the XML to a data frame I get the following fields:

"code"       "search_key"      "date"     "result"  
"link" "price"      "shipping"   "origposition"  
"name"    "results"     

Here the fields

shop_name   created_at

are missing at the beginning, and the results are concatenated into a single string under the column "results".

It must be possible to get the desired data frame, but I do not know exactly how to do this.

UPDATE

The solution provided by @MvG works brilliantly on the test XML file shown above. However, the column 'result' can also have the value "Not found". Entries with this value lack certain fields (always the same fields) and therefore yield a "number of columns of arguments do not match" error when running the solution. I would like these entries to be put in the data frame as well, with the fields that are not present left empty. I do not understand how to incorporate this scenario.

test.xml

<watchers shop_name="TEST" created_at="September 14, 2012 05:44">
<watcher channel="Site Name">
    <code>123456</code>
    <search_key>TestKey</search_key>
    <date>September 14, 2012 04:15</date>
    <result>Found</result>
    <link>http://www.test.com/fakeurl</link>
    <price>100.0</price>
    <shipping>0.0</shipping>
    <origposition>0</origposition>
    <name>Name Test</name>
    <results>
        <result position="1">
            <c_name>CTest1</c_name>
            <c_price>599.49</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>599.49</c_total_price>
            <c_rating>8.3</c_rating>
            <c_delivery/>
        </result>
        <result position="2">
            <c_name>CTest2</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>9.8</c_rating>
            <c_delivery/>
        </result>
        <result position="3">
            <c_name>CTest3</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>8.8</c_rating>
            <c_delivery/>
        </result>
    </results>
</watcher>
<watcher channel="Shopping">
    <code>12804</code>
    <search_key></search_key>
    <date></date>
    <result>Not found</result>
    <link>https://www.test.com/testing1323p</link>
    <price>0.0</price>
    <shipping>0.0</shipping>
    <origposition>0</origposition>
    <name>MOOVM6002020</name>
    <results>
    </results>
</watcher>
</watchers>

2 solutions

#1


3  

Here is a more generic approach. Every node is classified as one of three cases:

  • If the node name is of kind rows, then the data frames from its child nodes become different rows of the result.

  • If the node name is of kind cols, then the data frames from its child nodes become different columns of the result.

  • If the node name is of kind value, then a data frame with a single value is constructed, using the node name as the column name and the node value as the column value.

  • For all three cases, attributes of the node are added to the data frame.
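
The two combination rules can be tried out in base R without any XML. The sketch below uses made-up data frames; the helper `pad_and_rbind` and the variables `first`, `second`, `parent`, and `children` are illustrative names, not part of the answer's code. For "rows", missing columns are padded with NA before `rbind`; for "cols", `merge(..., by = NULL)` forms a cross join, so a single-row frame is repeated for every row of the other frame.

```r
# "rows" case: stack data frames whose columns differ.
# pad_and_rbind is a hypothetical helper for illustration only.
pad_and_rbind <- function(frames) {
  all_names <- unique(unlist(lapply(frames, colnames)))
  padded <- lapply(frames, function(df) {
    missing <- setdiff(all_names, colnames(df))
    if (length(missing) > 0) {
      df[missing] <- NA          # add absent columns, filled with NA
    }
    df[all_names]                # same column order in every frame
  })
  do.call(rbind, padded)
}

first  <- data.frame(code = "123456", c_name = "CTest1",
                     stringsAsFactors = FALSE)
second <- data.frame(code = "12804", stringsAsFactors = FALSE)
stacked <- pad_and_rbind(list(first, second))

# "cols" case: merge with by = NULL yields the Cartesian product.
parent   <- data.frame(shop_name = "TEST", stringsAsFactors = FALSE)
children <- data.frame(position = c("1", "2", "3"),
                       stringsAsFactors = FALSE)
crossed  <- merge(parent, children, by = NULL)
```

Here `stacked` has two rows, with `c_name` set to NA for the entry that lacks it, and `crossed` has three rows that all carry the same `shop_name`.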

The call for your application is given towards the bottom.

library(XML)

zeroColSingleRow <- function() {
  res <- data.frame(dummy=NA)
  res$dummy <- NULL
  stopifnot(nrow(res) == 1, ncol(res) == 0)
  return (res)
}

xml2df <- function(node, classifier) {
  if (! inherits(node, c("XMLInternalElementNode", "XMLElementNode"))) {
    return (zeroColSingleRow())
  }
  kind <- classifier(node)
  if (kind == "rows") {
    cdf <- lapply(xmlChildren(node), xml2df, classifier)
    if (length(cdf) == 0) {
      res <- zeroColSingleRow()
    }
    else {
      names <- unique(unlist(lapply(cdf, colnames)))
      cdf <- lapply(cdf, function(i) {
        missing <- setdiff(names, colnames(i))
        if (length(missing) > 0) {
          i[missing] <- NA
        }
        return (i)
      })
      res <- do.call(rbind, cdf)
    }
  }
  else if (kind == "cols") {
    cdf <- lapply(xmlChildren(node), xml2df, classifier)
    if (length(cdf) == 0) {
      res <- zeroColSingleRow()
    }
    else {
      res <- cdf[[1]]
      if (length(cdf) > 1) {
        for (i in 2:length(cdf)) {
          res <- merge(res, cdf[[i]], by=NULL)
        }
      }
    }
  }
  else {
    stopifnot(kind == "value")
    res <- data.frame(xmlValue(node))
    names(res) <- xmlName(node)
  }
  if (ncol(res) == 0) {
    res <- zeroColSingleRow()
  }
  attr <- xmlAttrs(node)
  if (length(attr) > 0) {
    attr <- do.call(data.frame, as.list(attr))
    res <- merge(attr, res, by=NULL)
  }
  rownames(res) <- NULL
  return(res)
}

doc<-xmlParse("test.xml")

xml2df(xmlRoot(doc), function(node) {
  name <- xmlName(node)
  if (name %in% c("watchers", "results"))
    return("rows")
  # make sure to treat results/result different from watcher/result
  if (name %in% c("watcher", "result") &&
      xmlName(xmlParent(node)) == paste0(name, "s"))
    return("cols")
  return("value")
})
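
To see why the classifier checks the parent name, the same decision can be written on plain strings. This is only a sketch: `classify_by_name` is a hypothetical stand-in that takes the node name and its parent's name instead of an XML node.

```r
# Hypothetical string-based version of the classifier, for illustration.
classify_by_name <- function(name, parent_name) {
  if (name %in% c("watchers", "results")) {
    return("rows")
  }
  # "result" is a container only below "results"; directly below
  # "watcher" it is a plain value such as "Found".
  if (name %in% c("watcher", "result") &&
      identical(parent_name, paste0(name, "s"))) {
    return("cols")
  }
  "value"
}

kinds <- c(
  classify_by_name("watcher", "watchers"),
  classify_by_name("result", "results"),
  classify_by_name("result", "watcher"),
  classify_by_name("results", "watcher"),
  classify_by_name("code", "watcher")
)
```

The second and third calls differ only in the parent name, which is what separates the nested `results/result` container from the `watcher/result` text field.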

#2


-1  

Here is one possibility:

attr2df <- function(n) do.call(data.frame, as.list(xmlAttrs(n)))
cbind(attr2df(xmlRoot(doc)), 
  do.call(rbind, xpathApply(doc, "//watcher", function(w) {
    x <- xmlToDataFrame(nodes = list(w))
    x$results<-NULL
    cbind(attr2df(w), x,
          xmlToDataFrame(nodes = getNodeSet(w, "results/result")))
  } ))
)

Iterate over all watchers. For each watcher, read its subtree into data frame x, and read its result nodes into another data frame. Remove the results column from the first data frame, then bind the columns of both together, and add the attributes of the watcher as well. This function application yields one data.frame per watcher, and the outer rbind call combines them into a single data frame. The outermost cbind adds the attributes of the root node.

The result will have these names:

 [1] "shop_name"     "created_at"    "channel"       "code"         
 [5] "search_key"    "date"          "result"        "link"         
 [9] "price"         "shipping"      "position"      "name"         
[13] "c_name"        "c_price"       "c_shipping"    "c_total_price"
[17] "c_rating"      "c_delivery"   
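
One caveat with the cbind approach: a single-row data frame is recycled across a multi-row one, but binding it to a zero-row data frame raises an error. A watcher with an empty results element would therefore need a placeholder row. The base-R sketch below shows one way to do that; `bind_results` and the sample frames are hypothetical and merely stand in for the pieces read from the XML (how `xmlToDataFrame` itself behaves on an empty node set is not covered here).

```r
# Hypothetical helper: combine the per-watcher fields with its result rows,
# substituting one all-NA row when there are no results.
bind_results <- function(fields, results, result_cols) {
  if (nrow(results) == 0) {
    placeholder <- as.list(rep(NA_character_, length(result_cols)))
    results <- as.data.frame(setNames(placeholder, result_cols),
                             stringsAsFactors = FALSE)
  }
  cbind(fields, results)   # a one-row `fields` is recycled over all rows
}

result_cols <- c("c_name", "c_price")

found <- bind_results(
  data.frame(code = "123456", stringsAsFactors = FALSE),
  data.frame(c_name = c("CTest1", "CTest2"),
             c_price = c("599.49", "654.0"),
             stringsAsFactors = FALSE),
  result_cols
)

not_found <- bind_results(
  data.frame(code = "12804", stringsAsFactors = FALSE),
  data.frame(c_name = character(0), c_price = character(0),
             stringsAsFactors = FALSE),
  result_cols
)

combined <- rbind(found, not_found)
```

With this guard, `combined` holds the two result rows of the first entry plus one row for the entry without results, whose `c_*` columns are NA.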
