How do I parse an HTML page and extract text using a tag's class name or id?

Time: 2022-08-14 20:23:04

I want to parse an HTML page and extract text using the class name or id of an HTML tag.

Apache Tika or jsoup? Please suggest a tool that gives more control over manipulating and extracting text by specific tags, ids, or class names in an HTML page.

1 solution

#1


Here is an example of the three use cases using Jsoup; please see the comments in the code:

  • get div elements by class name
  • get all div elements by tag name
  • get element by id

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "...";
Document doc = Jsoup.parse(html);

// get div elements by class name
Elements divs = doc.select("div.myclass");
for (Element div : divs) {
    // print the text contained in the element
    System.out.println(div.text());
}

// get all div elements by tag name
divs = doc.getElementsByTag("div");
for (Element div : divs) {
    // print the text contained in the element
    System.out.println(div.text());
}

// get element by id; getElementById returns null when no element matches
String id = "...";
Element element = doc.getElementById(id);
if (element != null) {
    System.out.println(element.text());
}
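
To try the three lookups above end to end, they can be wrapped in a small class, as in the following minimal sketch. It assumes the jsoup library is on the classpath. The class name `JsoupTextExtractor`, the helper method names, and the sample markup are hypothetical and exist only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.List;

public class JsoupTextExtractor {

    // Hypothetical sample markup, used only for this illustration.
    static final String SAMPLE_HTML =
            "<html><body>"
            + "<div class=\"myclass\">first</div>"
            + "<div class=\"other\">second</div>"
            + "<div id=\"main\" class=\"myclass\">third <b>bold</b></div>"
            + "</body></html>";

    // Text of every div that carries the given class, in document order.
    static List<String> textByClass(Document doc, String className) {
        return doc.select("div." + className).eachText();
    }

    // Text of every element with the given tag name, in document order.
    static List<String> textByTag(Document doc, String tagName) {
        return doc.getElementsByTag(tagName).eachText();
    }

    // Text of the element with the given id, or null when no element matches.
    static String textById(Document doc, String id) {
        Element element = doc.getElementById(id);
        return element == null ? null : element.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        System.out.println(textByClass(doc, "myclass"));
        System.out.println(textByTag(doc, "div"));
        System.out.println(textById(doc, "main"));
        System.out.println(textById(doc, "missing"));
    }
}
```

Returning null from `textById` mirrors what `getElementById` does when the id is absent, so the caller decides how to handle a missing element instead of hitting a NullPointerException.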
