I want to parse an HTML page and extract text using the class name or id of an HTML tag.
Apache Tika or jsoup? Please suggest a tool that gives more control for manipulating and extracting text by specific tags, ids, or class names in an HTML page.
1 Solution
#1
I made you an example of the three use cases using Jsoup; please see the comments in the code:
- get div elements by class name
- get all div elements by tag name
- get element by id
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "...";
Document doc = Jsoup.parse(html);

// get div elements by class name
Elements divs = doc.select("div.myclass");
for (Element div : divs) {
    // print the text contained in the element
    System.out.println(div.text());
}

// get all div elements by tag name
divs = doc.getElementsByTag("div");
for (Element div : divs) {
    // print the text contained in the element
    System.out.println(div.text());
}

// get element by id
String id = "...";
Element element = doc.getElementById(id);
// getElementById returns null when no element has that id
if (element != null) {
    System.out.println(element.text());
}
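To try the three lookups end to end, here is a minimal self-contained sketch. It assumes the jsoup library is on the classpath. The class name `JsoupExtractDemo`, its helper methods, and the sample HTML string are hypothetical and only for illustration; they are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExtractDemo {

    // Hypothetical sample page, used only for this illustration.
    static final String SAMPLE_HTML =
            "<html><body>"
            + "<div id=\"intro\" class=\"myclass\">Hello</div>"
            + "<div class=\"myclass\">World</div>"
            + "<div>Plain <b>text</b></div>"
            + "<p class=\"myclass\">Not a div</p>"
            + "</body></html>";

    // Text of every div that carries the given class (CSS selector "div.<class>").
    static List<String> textsByClass(String html, String className) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element div : doc.select("div." + className)) {
            result.add(div.text());
        }
        return result;
    }

    // Text of every element with the given tag name.
    static List<String> textsByTag(String html, String tagName) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element element : doc.getElementsByTag(tagName)) {
            result.add(element.text());
        }
        return result;
    }

    // Text of the element with the given id, or empty if no such element exists.
    static Optional<String> textById(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element element = doc.getElementById(id); // null when the id is absent
        return Optional.ofNullable(element).map(Element::text);
    }

    public static void main(String[] args) {
        System.out.println(textsByClass(SAMPLE_HTML, "myclass"));
        System.out.println(textsByTag(SAMPLE_HTML, "div"));
        System.out.println(textById(SAMPLE_HTML, "intro").orElse("(no such id)"));
        System.out.println(textById(SAMPLE_HTML, "missing").orElse("(no such id)"));
    }
}
```

Wrapping the id lookup in an `Optional` is one way to make the "id not found" case explicit, since `getElementById` returns `null` in that situation. Note that the class selector here is restricted to `div` elements, so the `<p class="myclass">` in the sample is skipped.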