How to preserve the DOCTYPE declaration when manipulating XML with Jsoup

Time: 2021-07-02 20:20:58

I have an XML document that starts as follows:

<?xml version="1.0"?>
<!DOCTYPE  viewdef [
<!ENTITY nbsp   "&#160;"> <!-- no-break space = non-breaking space U+00A0 ISOnum -->
<!ENTITY copy   "&#169;"> <!-- copyright sign, U+00A9 ISOnum -->
<!ENTITY amp    "&#038;"> <!-- ampersand -->
<!ENTITY shy    "&#173;"> <!-- soft hyphen -->
]>

I am parsing the document with Jsoup 1.8.2 in the following way:

public static void convertXml(String inFile, String outFile) throws Exception {
    String xmlString = FileUtils.readFileToString(new File(inFile), Charset.forName("UTF-8")); 
    Document document = Jsoup.parse(xmlString, "UTF-8", Parser.xmlParser());
    FileUtils.writeStringToFile(new File(outFile), document.html(), "UTF-8");           
}

I expect the output file to be the same as the input in this case, but Jsoup generates this instead:

<?xml version="1.0"?> <!DOCTYPE viewdef> 
<!-- no-break space = non-breaking space U+00A0 ISOnum --> 
<!--ENTITY copy   "&#169;"--> 
<!-- copyright sign, U+00A9 ISOnum --> 
<!--ENTITY amp    "&#038;"--> 
<!-- ampersand --> 
<!--ENTITY shy    "&#173;"--> 
<!-- soft hyphen --> ]&gt;

Is this a bug or is there any way to preserve the original DOCTYPE declaration?

1 solution

#1

Before parsing xmlString with Jsoup, replace the DOCTYPE declaration with a placeholder, then put the original declaration back into the serialized output.

SAMPLE CODE

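import java.io.File;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Placeholder swapped in for the DOCTYPE while Jsoup processes the document.
// An XML comment is used because Jsoup writes comments back unchanged, whereas
// an empty element may be re-serialized differently (for example with a space before "/>").
private final static String DOCTYPE_SEQUENCE = "<!--doctype-sequence-->";
// Reluctant match up to the first "]>", so a later CDATA end ("]]>") is not swallowed.
private final static Pattern pattern = Pattern.compile("(?i)<!DOCTYPE[\\s\\S]+?\\]>");

public static void convertXml(String inFile, String outFile) throws Exception {
    String xmlString = FileUtils.readFileToString(new File(inFile), Charset.forName("UTF-8"));

    // Replace the DOCTYPE declaration with the placeholder, if one is found
    String doctype = "";
    Matcher matcher = pattern.matcher(xmlString);
    if (matcher.find()) {
        doctype = matcher.group(0);
        xmlString = xmlString.replace(doctype, DOCTYPE_SEQUENCE);
    }

    // Parse as XML (the second argument is the base URI, not a charset),
    // then restore the original DOCTYPE in the serialized output
    Document document = Jsoup.parse(xmlString, "", Parser.xmlParser());
    FileUtils.writeStringToFile(new File(outFile), document.html().replace(DOCTYPE_SEQUENCE, doctype), "UTF-8");
}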

The pattern is a static field declared outside convertXml so that it is compiled only once rather than on every call.
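
To try the placeholder technique without Jsoup on the classpath, here is a minimal, dependency-free sketch. It is not part of the original answer: DoctypePlaceholderDemo, withPreservedDoctype and PLACEHOLDER are hypothetical names, and the UnaryOperator<String> only stands in for the Jsoup parse-and-serialize step, under the assumption that this step passes an XML comment through unchanged.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: shield a DOCTYPE with an internal subset from a lossy parse/serialize step. */
final class DoctypePlaceholderDemo {

    // Hypothetical placeholder; assumed to survive the transform untouched.
    static final String PLACEHOLDER = "<!--doctype-placeholder-->";

    // Reluctant quantifier: stop at the first "]>" after "<!DOCTYPE".
    static final Pattern DOCTYPE = Pattern.compile("(?i)<!DOCTYPE[\\s\\S]+?\\]>");

    /** Swaps the DOCTYPE for the placeholder, applies the transform, then restores the DOCTYPE. */
    static String withPreservedDoctype(String xml, UnaryOperator<String> transform) {
        Matcher matcher = DOCTYPE.matcher(xml);
        if (!matcher.find()) {
            return transform.apply(xml); // no DOCTYPE with an internal subset: nothing to shield
        }
        String doctype = matcher.group();
        String masked = xml.substring(0, matcher.start()) + PLACEHOLDER + xml.substring(matcher.end());
        return transform.apply(masked).replace(PLACEHOLDER, doctype);
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE viewdef [\n"
                + "<!ENTITY nbsp \"&#160;\">\n"
                + "]>\n"
                + "<viewdef><![CDATA[a]]></viewdef>";
        // Stand-in for the round trip: it would damage "<!ENTITY" declarations if it saw them.
        UnaryOperator<String> lossy = s -> s.replace("<!ENTITY", "<!--ENTITY");
        System.out.println(withPreservedDoctype(xml, lossy));
    }
}
```

One limitation to keep in mind: the pattern looks for the first "]>" after "<!DOCTYPE", so a DOCTYPE that has no internal subset but is followed by a CDATA section would match up to the end of that CDATA section. For documents like the one in the question, where the DOCTYPE always carries an internal subset, this does not come up.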
