When Jsoup encounters certain kinds of HTML (either complex or malformed), it may emit HTML that is itself badly formed. An example is:
<html>
<head>
<meta name="x" content="y is "bad" here">
</head>
<body/>
</html>
where the inner quotes in the content attribute should have been escaped. When Jsoup parses this, it emits:
<html>
<head>
<meta name="x" content="y is " bad"="" here"="" />
</head>
<body></body>
</html>
which is neither conformant HTML nor well-formed XML. This is a problem because the next parser down the chain will fail on it.
Is there any way to ensure that Jsoup either emits an error message or (like HtmlTidy) outputs well-formed XML, even if some information is lost? After all, we can no longer be sure what the correct content was.
UPDATE: The code that fails is:
@Test
public void testJsoupParseMetaBad() {
    String s = "<html><meta name=\"x\" content=\"y is \"bad\" here\"><body></html>";
    Document doc = Jsoup.parse(s);
    String ss = doc.toString();
    Assert.assertEquals("<html> <head> <meta name=\"x\" content=\"y is \""
            + " bad\"=\"\" here\"=\"\" /> </head> <body></body> </html>", ss);
}
I am using:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.7.2</version>
</dependency>
Others seem to have the same problem: "JSoup - Quotations inside attributes". The answer there doesn't help me, as I have to accept the input I am given.
1 Solution
#1
The problem arises at parse time, because jsoup creates three attributes from:
content="y is "bad" here"
and the names of two of those attributes contain the quote character ("). Jsoup escapes attribute values when it writes them out, but not attribute names.
Since you are building the HTML document from a string, you can detect the error during the parse phase. There is an overload of Jsoup.parse that takes an org.jsoup.parser.Parser as an argument. The default parse method does not track errors.
String s = "<html><meta name=\"x\" content=\"y is \"bad\" here\"><body></html>";
Parser parser = Parser.htmlParser(); // or Parser.xmlParser()
parser.setTrackErrors(100); // record up to 100 parse errors
Document doc = Jsoup.parse(s, "", parser);
System.out.println(parser.getErrors());
Output:
[37: Unexpected character 'a' in input state [AfterAttributeValue_quoted], 40: Unexpected character ' ' in input state [AttributeName], 46: Unexpected character '>' in input state [AttributeName]]
If you don't want to change how you parse and just want valid output, you can simply remove the invalid attributes after parsing:
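import java.util.HashSet;
import java.util.Set;

import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Removes every attribute whose name would serialize as invalid markup.
public static void fixIt(Document doc) {
    Elements els = doc.getAllElements();
    for (Element el : els) {
        Attributes attributes = el.attributes();
        // Collect the names first so the attributes are not modified while iterating.
        Set<String> remove = new HashSet<>();
        for (Attribute a : attributes) {
            if (isForbidden(a.getKey())) {
                remove.add(a.getKey());
            }
        }
        for (String k : remove) {
            el.removeAttr(k);
        }
    }
}

public static boolean isForbidden(String el) {
    return el.contains("\""); // TODO: extend to other invalid characters
}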
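The isForbidden check above only looks for a double quote and is marked TODO. One way it might be filled in is sketched below, using only the standard library. The class name AttributeNames is hypothetical and not part of jsoup, and the ASCII-only pattern is an assumption: it is a conservative subset of the XML Name rule, so it would also reject non-ASCII names that XML itself permits.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of jsoup): decides whether an attribute name is safe to serialize. */
final class AttributeNames {

    // Assumption: a conservative ASCII subset of the XML Name production.
    // First character: letter, '_' or ':'. Rest: letters, digits, '.', '-', '_' or ':'.
    private static final Pattern SAFE_NAME = Pattern.compile("[A-Za-z_:][A-Za-z0-9_.:-]*");

    private AttributeNames() {
    }

    /** Returns true when the attribute should be dropped before the document is written out. */
    static boolean isForbidden(String name) {
        // matches() tests the whole string, so a stray quote or space anywhere rejects the name.
        return name == null || !SAFE_NAME.matcher(name).matches();
    }
}
```

With this version, the two broken names from the question (bad" and here") are rejected, while ordinary names such as content or data-id are kept. The body of isForbidden can be dropped into the fixIt code above without other changes.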