jsoup - strip all formatting and link tags, keep text only

Let's say I have an HTML fragment like this:

<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>

What I want to extract from that is:

foo bar foobar baz

So my question is: how can I strip all the wrapping tags from the HTML and get only the text, in the same order as it appears in the HTML? As the title says, I want to use jsoup for the parsing.

Example of HTML with accented characters (note the 'á' character):

<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>
<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>

What I want:

Tarthatatlan biztonsági viszonyok
Tarthatatlan biztonsági viszonyok

This HTML is not static; in general I just want all the text of an arbitrary HTML fragment in decoded, human-readable form, with line breaks.

3 solutions

#1

With Jsoup:

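import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
Document doc = Jsoup.parse(html);

// text() returns the combined text of the document with all tags removed
System.out.println(doc.text());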

Output:

foo bar foobar baz

If you want only the text of the p tag, use this instead of doc.text():

doc.select("p").text();

... or only the body:

doc.body().text();

With line breaks:

final String html = "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>"
        + "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>";
Document doc = Jsoup.parse(html);

for( Element element : doc.select("p") )
{
    System.out.println(element.text());
    // eg. you can use a StringBuilder and append lines here ...
}

Output:

Tarthatatlan biztonsági viszonyok  
Tarthatatlan biztonsági viszonyok
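
The StringBuilder idea mentioned in the loop comment can be sketched as below. This is only a minimal sketch: the class name ParagraphLines and the method paragraphsToLines are made up for illustration, and it assumes the accented character arrives as an HTML entity (&aacute;), which Element.text() returns in decoded form. Only text inside p elements is collected; anything outside them is skipped.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class ParagraphLines {

    // Returns the text of every <p> element, one paragraph per line.
    static String paragraphsToLines(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder sb = new StringBuilder();
        for (Element p : doc.select("p")) {
            if (sb.length() > 0) {
                sb.append('\n'); // separator between paragraphs, none after the last
            }
            sb.append(p.text()); // text() strips nested tags and decodes entities
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed input: the accent is written as an entity in the raw HTML.
        String html = "<p><strong>Tarthatatlan biztons&aacute;gi viszonyok</strong></p>"
                + "<p><strong>Tarthatatlan biztons&aacute;gi viszonyok</strong></p>";
        System.out.println(paragraphsToLines(html));
    }
}
```

If other block elements (div, li, headings) should also start a new line, the selector passed to select() would need to list them as well.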

#2

Using regex:

String str = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
str = str.replaceAll("<[^>]*>", "");
System.out.println(str);

Output:

  foo   bar  foobar  baz
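
As the output shows, the plain regex leaves the whitespace that sat between the tags. A rough way to tidy that up with the standard library alone is sketched below; the class TagStripper and the method stripTags are hypothetical names. It treats </p> and <br> as line breaks, which is my assumption about what "line breaks" should mean here. It does not decode entities such as &aacute; and, like any regex approach, it is not a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper class using only java.util.regex.
class TagStripper {
    // Closing </p> and <br> are treated as line breaks (an assumption).
    private static final Pattern LINE_END = Pattern.compile("(?i)</p\\s*>|<br\\s*/?>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        String text = LINE_END.matcher(html).replaceAll("\n");
        text = ANY_TAG.matcher(text).replaceAll("");
        StringBuilder out = new StringBuilder();
        for (String line : text.split("\n")) {
            // Collapse runs of whitespace and drop lines that end up empty.
            String cleaned = WHITESPACE.matcher(line).replaceAll(" ").trim();
            if (!cleaned.isEmpty()) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(cleaned);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String str = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
        System.out.println(stripTags(str)); // foo bar foobar baz
    }
}
```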

Using Jsoup:

Document doc = Jsoup.parse(str); 
String text = doc.text();

#3

Actually, the correct way to clean with Jsoup is through a Whitelist:

...
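final String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
// Whitelist.none() is a static factory that allows no tags at all.
// Jsoup.clean() is static and takes the HTML string, not a Document.
// (Newer jsoup releases rename Whitelist to Safelist.)
Whitelist wl = Whitelist.none();
String cleanText = Jsoup.clean(html, wl);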

If you still want to preserve some tags:

Whitelist wl = Whitelist.relaxed().removeTags("a");
