I am working with several XML files that I want to compare, each containing about 200-300 different xml:id values. Let's say there are three files containing the following xml:ids:
file1.xml
<?xml version="1.0" encoding="UTF-8"?>
<div>
<p xml:id= "F23_1b">1</p>
<p xml:id= "F54_34a">3</p>
</div>
file2.xml
<?xml version="1.0" encoding="UTF-8"?>
<div>
<p xml:id= "F23_1b">7</p>
<p xml:id= "F54_34a">8</p>
<p xml:id= "F54_63d">12</p>
</div>
file3.xml
<?xml version="1.0" encoding="UTF-8"?>
<div>
<p xml:id= "F143_32a">5</p>
<p xml:id= "F175_23c">6</p>
<p xml:id= "F95_1a">14</p>
<p xml:id= "F89_9d">15</p>
</div>
My goal is to compare these files with respect to a) which xml:ids are present and b) their respective values (see the table below). I started by using R's XML package and XPath to create a list for each file:
library(XML)
# Parse each file and collect its xml:id attributes into a separate vector
file1 <- xmlTreeParse("file1.xml", useInternalNodes = TRUE)
ids1 <- xpathSApply(file1, "//*[@xml:id]", xmlGetAttr, "xml:id")
file2 <- xmlTreeParse("file2.xml", useInternalNodes = TRUE)
ids2 <- xpathSApply(file2, "//*[@xml:id]", xmlGetAttr, "xml:id")
file3 <- xmlTreeParse("file3.xml", useInternalNodes = TRUE)
ids3 <- xpathSApply(file3, "//*[@xml:id]", xmlGetAttr, "xml:id")
In a second step, I'd like to combine the results into one data frame, but the lists don't have the same length, and this is my main problem. At first I thought I could take the longest list and add empty values for the xml:ids that are present in it but missing from the shorter ones. I quickly realized, though, that this approach would ignore ids that exist only in the shorter lists.
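To make that pitfall concrete, here is a minimal base R sketch. The id vectors are typed in by hand in place of the parsed results, and the names `ids1`, `ids2`, `ids3`, `longest`, `lost` and `all_ids` are chosen only for illustration:

```r
# Hand-typed stand-ins for the xml:ids found in each file
ids1 <- c("F23_1b", "F54_34a")
ids2 <- c("F23_1b", "F54_34a", "F54_63d")
ids3 <- c("F143_32a", "F175_23c", "F95_1a", "F89_9d")

# Padding against the longest list only: these ids from file2 would be dropped
longest <- ids3
lost <- setdiff(ids2, longest)

# The union of all ids keeps every id that occurs in any file
all_ids <- Reduce(union, list(ids1, ids2, ids3))
```

Any combined table would need one row per element of `all_ids` rather than one row per element of the longest list.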
In the end, I'd love to have a data frame that can easily be exported (to .csv), looking similar to this table:
| xml:ids  | file1 | file2 | file3 |
|----------|-------|-------|-------|
| F23_1b   | 1     | 7     | NULL  |
| F54_34a  | 3     | 8     | NULL  |
| F54_63d  | NULL  | 12    | NULL  |
| F143_32a | NULL  | NULL  | 5     |
| F175_23c | NULL  | NULL  | 6     |
| F95_1a   | NULL  | NULL  | 14    |
| F89_9d   | NULL  | NULL  | 15    |
Do you have any suggestions concerning my problem?
3 Answers
#1
2
If you use xml2 and purrr, it might look something like
library(tidyverse)
library(xml2)

xml_data <- sprintf('file%s.xml', 1:3) %>%    # make filepaths
    map_df(~read_xml(.x) %>%    # iterate over filenames; read xml
               xml_find_all('//p') %>%    # select p nodes
               map_df(function(.y) {    # iterate over nodes and combine to data frame of...
                   list(file = basename(.x),    # the filename,
                        id = xml_attr(.y, 'id'),    # the id attribute, and
                        value = as.integer(xml_text(.y)))    # the node value.
               }))

xml_data
#> # A tibble: 9 x 3
#> file id value
#> <chr> <chr> <int>
#> 1 file1.xml F23_1b 1
#> 2 file1.xml F54_34a 3
#> 3 file2.xml F23_1b 7
#> 4 file2.xml F54_34a 8
#> 5 file2.xml F54_63d 12
#> 6 file3.xml F143_32a 5
#> 7 file3.xml F175_23c 6
#> 8 file3.xml F95_1a 14
#> 9 file3.xml F89_9d 15
If you really want to spread it to wide form, from here it's pretty typical:
xml_data %>%
    mutate(file = sub('\\.xml$', '', file)) %>%    # strip the literal ".xml" extension
    spread(file, value)
#> # A tibble: 7 x 4
#> id file1 file2 file3
#> <chr> <int> <int> <int>
#> 1 F143_32a NA NA 5
#> 2 F175_23c NA NA 6
#> 3 F23_1b 1 7 NA
#> 4 F54_34a 3 8 NA
#> 5 F54_63d NA 12 NA
#> 6 F89_9d NA NA 15
#> 7 F95_1a NA NA 14
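Since the question also asks for a CSV export, here is a hedged sketch of that last step. It assumes tidyr 1.0.0 or later, where `pivot_wider()` is the documented successor to `spread()`. The long-format data is typed in by hand to mirror `xml_data`, and `xml_long`, `xml_wide` and `out_path` are illustrative names, not part of the answer above:

```r
library(tidyr)

# Long-format data typed in by hand to mirror xml_data (file extension already removed)
xml_long <- data.frame(
  file  = c("file1", "file1", "file2", "file2", "file2",
            "file3", "file3", "file3", "file3"),
  id    = c("F23_1b", "F54_34a", "F23_1b", "F54_34a", "F54_63d",
            "F143_32a", "F175_23c", "F95_1a", "F89_9d"),
  value = c(1L, 3L, 7L, 8L, 12L, 5L, 6L, 14L, 15L),
  stringsAsFactors = FALSE
)

# One row per id, one column per file; ids missing from a file become NA
xml_wide <- pivot_wider(xml_long, names_from = file, values_from = value)

# Write the table to a CSV file; out_path is a throwaway temporary file here
out_path <- tempfile(fileext = ".csv")
write.csv(xml_wide, out_path, row.names = FALSE)
```

`write.csv()` writes missing values as `NA` by default; its `na` argument can be set to another string if the export should show something else.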
#2
0
Here is a solution using xml2 and dplyr::full_join:
# Read XML files
library(xml2)
fn <- paste0("file", 1:3, ".xml")
files <- lapply(fn, read_xml)

# Extract node attributes and values, store as data.frame
lst <- lapply(files, function(x)
    cbind.data.frame(
        id = xml_attr(xml_children(x), "id"),
        val = as.numeric(xml_text(xml_children(x))),
        stringsAsFactors = FALSE))

# Outer full join on all data.frame's in list
df <- Reduce(function(x, y) dplyr::full_join(x, y, by = "id"), lst)
colnames(df)[2:ncol(df)] <- fn
df
# id file1.xml file2.xml file3.xml
#1 F23_1b 1 7 NA
#2 F54_34a 3 8 NA
#3 F54_63d NA 12 NA
#4 F143_32a NA NA 5
#5 F175_23c NA NA 6
#6 F95_1a NA NA 14
#7 F89_9d NA NA 15
Explanation: Read the XML files with xml2::read_xml; extract node attributes and values with xml_attr and xml_text, respectively, and store them as a list of data.frames; perform a full outer join on the data.frames in the list.
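A variation on the same idea, sketched in base R only: naming each value column after its file before joining means the result does not depend on renaming columns by position afterwards. The per-file data frames are typed in by hand here, and `labels`, `named` and `wide` are illustrative names:

```r
# Hand-typed stand-ins for the per-file data frames built from the XML
lst <- list(
  data.frame(id = c("F23_1b", "F54_34a"), val = c(1, 3),
             stringsAsFactors = FALSE),
  data.frame(id = c("F23_1b", "F54_34a", "F54_63d"), val = c(7, 8, 12),
             stringsAsFactors = FALSE),
  data.frame(id = c("F143_32a", "F175_23c", "F95_1a", "F89_9d"),
             val = c(5, 6, 14, 15), stringsAsFactors = FALSE)
)
labels <- c("file1", "file2", "file3")

# Rename each value column to its file label before joining
named <- Map(function(d, nm) setNames(d, c("id", nm)), lst, labels)

# Full outer join on id; merge() sorts the rows by id by default
wide <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE), named)
```

Unlike `dplyr::full_join`, `merge()` reorders the rows by the join key, so the row order differs from the output shown above.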
#3
0
I don't know how flexible you are in your choice of technology, but here is a solution in XSLT 3.0
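<!-- The text value templates ({...}) below need expand-text="yes" on an enclosing element, e.g. xsl:stylesheet -->
<xsl:variable name="doc1" select="doc('file1.xml')"/>
<xsl:variable name="doc2" select="doc('file2.xml')"/>
<xsl:variable name="doc3" select="doc('file3.xml')"/>
<xsl:merge>
  <!-- for-each-item takes the document nodes themselves; for-each-source would expect URIs -->
  <xsl:merge-source for-each-item="($doc1, $doc2, $doc3)" select=".//p[@xml:id]" sort-before-merge="yes">
    <xsl:merge-key select="@xml:id"/>
  </xsl:merge-source>
  <xsl:merge-action>
    <tr>
      <td>{current-merge-key()}</td>
      <xsl:for-each select="($doc1, $doc2, $doc3)">
        <!-- value from this document if it contains the current id, otherwise 'NA' -->
        <td>{(current-merge-group()[(/) is current()], 'NA')[1]}</td>
      </xsl:for-each>
    </tr>
  </xsl:merge-action>
</xsl:merge>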
Not tested. Easily generalised to N input documents.