Let's say I have an HTML fragment like this:
<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>
What I want to extract from that is:
foo bar foobar baz
So my question is: how can I strip all the wrapping tags from an HTML fragment and get only the text, in the same order as it appears in the HTML? As you can see in the title, I want to use jsoup for the parsing.
Example of HTML with accented characters (note the 'á' character):
<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>
What I want:
Tarthatatlan biztonsági viszonyok
This HTML is not static. In general, I just want all the text of an arbitrary HTML fragment in decoded, human-readable form, with line breaks.
3 Solutions
#1
47
With Jsoup:
final String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
Document doc = Jsoup.parse(html);
System.out.println(doc.text());
Output:
foo bar foobar baz
If you want only the text of the p tag, use this instead of doc.text():
doc.select("p").text();
... or only body:
doc.body().text();
To get line breaks, print the text of each paragraph separately:
final String html = "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>"
+ "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>";
Document doc = Jsoup.parse(html);
for (Element element : doc.select("p")) {
    System.out.println(element.text());
    // e.g. you can use a StringBuilder and append lines here ...
}
Output:
Tarthatatlan biztonsági viszonyok
Tarthatatlan biztonsági viszonyok
#2
11
Using regex:
String str = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
str = str.replaceAll("<[^>]*>", "");
System.out.println(str);
Output:
foo bar foobar baz
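Note that the plain replaceAll call above removes the tags but leaves the whitespace that surrounded them, so the raw result still contains runs of spaces. A minimal sketch of the same regex idea with whitespace collapsing added is shown here; the class name RegexTagStripper and the method stripTags are hypothetical names chosen for illustration, and only the Java standard library is used.

```java
import java.util.regex.Pattern;

public class RegexTagStripper {
    // Same pattern as in the answer: '<', any characters except '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Runs of whitespace, collapsed to a single space after the tags are removed.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Removes tag-like substrings, collapses whitespace runs, and trims the ends. */
    public static String stripTags(String html) {
        String withoutTags = TAG.matcher(html).replaceAll("");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
        System.out.println(stripTags(html)); // prints "foo bar foobar baz"
    }
}
```

This sketch is only a pattern match, not an HTML parser: it does not decode entities such as &amp;amp;, it joins the text of adjacent block elements without a separator, and it can be fooled by a '>' inside an attribute value, comment, or script. For arbitrary, non-static HTML, a parser-based approach such as Jsoup is likely the safer choice.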
Using Jsoup:
Document doc = Jsoup.parse(str);
String text = doc.text();
#3
2
Actually, the correct way to clean with Jsoup is through a Whitelist
...
final String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
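// Jsoup.clean is a static method and takes the HTML string, not a parsed Document.
// Whitelist.none() is a static factory that allows no tags, so only text nodes remain.
// Note: newer jsoup releases rename Whitelist to Safelist.
Whitelist wl = Whitelist.none();
String cleanText = Jsoup.clean(html, wl);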
If you still want to preserve some tags:
// Start from the relaxed set of allowed tags, then disallow <a>
Whitelist wl = Whitelist.relaxed().removeTags("a");