Error when reading XML files to parse special characters

Date: 2021-01-22 15:45:33

I am currently reading through a folder of static XML files (thousands of them).

Most of them are well-formed, but some contain special characters that I'd like to escape. As an example, one XML file contains the invalid XML shown below:

<?xml version="1.0" encoding="utf-8"?>
    <INQUIRY version="4.0">
        <AUTHENTICATION>
            <LICENSEKEY>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</LICENSEKEY> 
            <PASSWORD>YYYYYYYYYYY</PASSWORD> 
        </AUTHENTICATION>
        <QUERY>
            <TRACKID>1-1-1</TRACKID> 
            <TYPE>VALID</TYPE>
            <CHANNEL>INTERNET</CHANNEL>
            <INQUIRYTYPE>O</INQUIRYTYPE>
            <DATA>
                <NAME>BARNES & NOBLE</NAME>
            </DATA>
        </QUERY>
    </INQUIRY>

I attempt to replace the & using this code:

install.packages("XML")
library(XML)

location <- "C:/Users/Desktop/temp"
filenames=dir(location)

for (i in 1:length(filenames)){
   tmp <- gsub("&", "&amp;", readLines(paste0(location,"/",filenames[i])))
   data <- xmlParse(tmp)
   TMP<-xmlToDataFrame(nodes=getNodeSet(data,"//DATA"))
   DATAX_DF<-rbind(TMP,DATAX_DF)
}

This results in the following:

Warning message:
In readLines(paste0(location, "/", filenames[i])) :
  incomplete final line found on 'C:/Users/Desktop/tmp/1-1-1_req.XML'

Is there another workaround for replacing the ampersand, and does anyone know why the final line is read as incomplete, so that I can avoid the warning?

2 Answers

#1


0  

First of all, the XML needs &amp; instead of &, as per Section 4.6, Predefined Entities, of Extensible Markup Language (XML) 1.0 (Fifth Edition).

An XML validator can be found here: w3schools XML validator.

<DATA>
     <NAME>BARNES &amp; NOBLE</NAME>
</DATA>

Secondly, the variable DF: I'm not sure rbind can bind to DF while it is still empty or undefined (on the first iteration).

This works, tested with two identical XML files as shown above, with the &amp; fix applied:

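for (i in seq_along(filenames)) {
    # Parse each (already corrected) file and pull out the DATA nodes
    data <- xmlParse(paste0(location, "/", filenames[i]))
    TMP <- xmlToDataFrame(nodes = getNodeSet(data, "//DATA"))
    if (i == 1) {
        DF <- TMP              # first file: start the data frame
    } else {
        DF <- rbind(TMP, DF)   # later files: prepend the new rows
    }
}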
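
The special case for the first iteration can probably be avoided by collecting one data frame per file and binding them in a single call. The following is only a sketch: parse_one is a hypothetical stand-in for the per-file parsing step, and the file names are placeholders, so it runs without any XML files on disk.

```r
# Hypothetical stand-in for the per-file step. In the real loop this would be
# xmlToDataFrame(nodes = getNodeSet(xmlParse(path), "//DATA")).
parse_one <- function(path) {
  data.frame(NAME = "BARNES & NOBLE", stringsAsFactors = FALSE)
}

paths <- c("a.xml", "b.xml")         # placeholder file names (assumption)
pieces <- lapply(paths, parse_one)   # one data frame per file
DF <- do.call(rbind, pieces)         # bind all rows in a single call
```

Because nothing is bound until every piece exists, DF never has to be initialised before the loop.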

The result is:

 str(DF)
'data.frame':   2 obs. of  1 variable:
 $ NAME: Factor w/ 1 level "BARNES & NOBLE": 1 1

I hope this is what you're looking for.

All the best

#2


0  

Assuming you can pre-process or modify your data, try replacing the & with the following:

&amp;
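
One caveat with a plain gsub("&", "&amp;", ...) is that it also rewrites entities that are already valid, turning &amp; into &amp;amp;. A minimal sketch, assuming the files use only the five predefined entities or numeric character references, escapes just the bare ampersands. escape_bare_amp is a hypothetical helper name, and the temp file stands in for one of the real inputs. Passing warn = FALSE to readLines suppresses the "incomplete final line" warning, which only indicates that the file does not end with a newline.

```r
# Hypothetical helper: escape "&" only when it does not already start an entity.
escape_bare_amp <- function(x) {
  gsub("&(?!(amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)", "&amp;", x, perl = TRUE)
}

# Temp file standing in for a real input; written without a trailing newline.
path <- tempfile(fileext = ".xml")
cat("<DATA>\n<NAME>BARNES & NOBLE &amp; CO</NAME>\n</DATA>", file = path)

lines <- readLines(path, warn = FALSE)   # warn = FALSE silences the final-line warning
fixed <- paste(escape_bare_amp(lines), collapse = "\n")
```

The resulting string could then be passed to xmlParse(fixed, asText = TRUE) from the XML package.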
