Converting XML with repeated child nodes into a tidy dataset using R

Date: 2022-03-06 23:36:21

I am trying to build a data mash-up from a wide variety of security controls in R. I have had great success with the devices that output CSV, JSON, etc., but XML is really tripping me up. You will quickly see that I am not the expert R developer I wish to be, but I would greatly appreciate any help anyone could provide. Here is a simplified version of the XML I am trying to parse.

 <devices>
    <host id="169274" persistent_id="21741">
      <ip>some_IP_here</ip>
      <hostname>Some_DNS_name_here </hostname>
      <netbiosname>Some_NetBios_Name_here</netbiosname>
      <hscore>663</hscore>
      <howner>4</howner>
      <assetvalue>4</assetvalue>
      <os>Unix Variant</os>
      <nbtshares/>
      <fndvuln id="534" port="80" proto="tcp"/>
      <fndvuln id="1191" port="22" proto="tcp"/>
    </host>
    <host id="169275" persistent_id="21003">
      <ip>some_IP_here</ip>
      <hostname>Some_DNS_name_here </hostname>
      <netbiosname>Some_NetBios_Name_here</netbiosname>
      <hscore>0</hscore>
      <howner>4</howner>
      <assetvalue>4</assetvalue>
      <os>OS Undetermined</os>
      <nbtshares/>
      <fndvuln id="5452" port="ip" proto="ip"/>
      <fndvuln id="5092" port="123" proto="udp"/>
      <fndvuln id="16157" port="123" proto="udp"/>
    </host>
</devices>

The end result I am hoping to achieve is a tidy R data frame that I can use for analysis. In a perfect world it would look as follows:

    host           ip            hostname            netbiosname VulnID port protocol
1 169274 some_IP_here Some_DNS_name_here  Some_NetBios_Name_here    534   80      tcp
2 169274 some_IP_here Some_DNS_name_here  Some_NetBios_Name_here   1191   22      tcp
3 169275 some_IP_here Some_DNS_name_here  Some_NetBios_Name_here   5452   ip       ip
4 169275 some_IP_here Some_DNS_name_here  Some_NetBios_Name_here   5092  123      udp
5 169275 some_IP_here Some_DNS_name_here  Some_NetBios_Name_here  16157  123      udp

At the simplest level, I have no problem parsing the XML and extracting the data I need to build the basic data frame. However, I struggle with how to iterate through the parsed XML and create a separate row for each fndvuln element that appears in its parent host node.

So far, I am guessing it is best to load each element individually and then bind them at the end. I am thinking this would allow me to use sapply to run through the various instances of fndvuln and create a separate entry for each. So far, I have this for the basic structure:

library(XML)

setwd("My_file_location_here")

xmlfile <- "vuln.xml"
xmldoc <- xmlParse(xmlfile)

# One node per <host> element
vuln <- getNodeSet(xmldoc, "//host")

# Build a one-row data frame for each host
x <- lapply(vuln, function(x) data.frame(
  host        = xpathSApply(x, ".", xmlGetAttr, "id"),
  ip          = xpathSApply(x, ".//ip", xmlValue),
  hostname    = xpathSApply(x, ".//hostname", xmlValue),
  netbiosname = xpathSApply(x, ".//netbiosname", xmlValue)
))

# Stack the per-host data frames into one
do.call("rbind", x)

This basically gives me:

    host           ip            hostname            netbiosname
1 169274 some_IP_here Some_DNS_name_here  Some_NetBios_Name_here
2 169275 some_IP_here Some_DNS_name_here  Some_NetBios_Name_here

I am not sure how to go about doing the rest. Also, because this device will produce quite a hefty XML file, my end goal is to know how to do this efficiently.

1 Solution

#1



The host, ip, hostname, etc. values will be repeated automatically when you add the fndvuln attributes to your data.frame, because data.frame recycles length-one columns to match the longer ones (try data.frame("a", 1:3)).

x <- lapply(vuln, function(x) data.frame(
  host     = xpathSApply(x, ".", xmlGetAttr, "id"),
  ip       = xpathSApply(x, ".//ip", xmlValue),
  hostname = xpathSApply(x, ".//hostname", xmlValue),
  VulnID   = xpathSApply(x, ".//fndvuln", xmlGetAttr, "id"),
  port     = xpathSApply(x, ".//fndvuln", xmlGetAttr, "port")
))

do.call("rbind", x)
    host           ip            hostname VulnID port
1 169274 some_IP_here Some_DNS_name_here     534   80
2 169274 some_IP_here Some_DNS_name_here    1191   22
3 169275 some_IP_here Some_DNS_name_here    5452   ip
4 169275 some_IP_here Some_DNS_name_here    5092  123
5 169275 some_IP_here Some_DNS_name_here   16157  123
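
The code above leaves out the netbiosname and protocol columns that the question asked for. A minimal sketch that adds them might look like the following. It makes a few assumptions that are not in the original post: the XML is parsed from an inline string (xml_text, with made-up IP and host values) so the example is self-contained, the names rows and result are only illustrative, trimws is used to drop the trailing space in hostname, and every host is assumed to have at least one fndvuln child (hosts with none would need separate handling).

```r
library(XML)

# Inline sample standing in for vuln.xml (values are placeholders)
xml_text <- '<devices>
  <host id="169274" persistent_id="21741">
    <ip>10.0.0.1</ip>
    <hostname>alpha </hostname>
    <netbiosname>ALPHA</netbiosname>
    <fndvuln id="534" port="80" proto="tcp"/>
    <fndvuln id="1191" port="22" proto="tcp"/>
  </host>
  <host id="169275" persistent_id="21003">
    <ip>10.0.0.2</ip>
    <hostname>beta </hostname>
    <netbiosname>BETA</netbiosname>
    <fndvuln id="5452" port="ip" proto="ip"/>
    <fndvuln id="5092" port="123" proto="udp"/>
    <fndvuln id="16157" port="123" proto="udp"/>
  </host>
</devices>'

xmldoc <- xmlParse(xml_text, asText = TRUE)
vuln <- getNodeSet(xmldoc, "//host")

# One data frame per host: the length-one host fields are recycled
# to match the number of fndvuln children of that host.
rows <- lapply(vuln, function(h) data.frame(
  host        = xmlGetAttr(h, "id"),
  ip          = xpathSApply(h, "./ip", xmlValue),
  hostname    = trimws(xpathSApply(h, "./hostname", xmlValue)),
  netbiosname = xpathSApply(h, "./netbiosname", xmlValue),
  VulnID      = xpathSApply(h, "./fndvuln", xmlGetAttr, "id"),
  port        = xpathSApply(h, "./fndvuln", xmlGetAttr, "port"),
  protocol    = xpathSApply(h, "./fndvuln", xmlGetAttr, "proto"),
  stringsAsFactors = FALSE
))

# Stack the per-host pieces into one tidy data frame
result <- do.call(rbind, rows)
print(result)
```

All columns come back as character here, since port can hold non-numeric values such as "ip"; converting VulnID to integer afterwards is a separate choice.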
