Jsoup解析和遍历一个HTML文档(一)

时间:2022-05-24 07:38:14

  官方文档如:

Problem
You have HTML in a Java String, and you want to parse that HTML to get at its contents, or to make sure it
's well formed, or to modify it. The String may have come from user input, a file, or from the web.

Solution
Use the
static Jsoup.parse(String html) method, or Jsoup.parse(String html, String baseUri) if the page came from the web, and you want to get at absolute URLs (see [working-with-urls]).

String html
= "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc
= Jsoup.parse(html);
其解析器能够尽最大可能从你提供的HTML文档来创见一个干净的解析结果,无论HTML的格式是否完整。比如它可以处理:

没有关闭的标签 (比如:
<p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
隐式标签 (比如. 它可以自动将
<td>Table data</td>包装成<table><tr><td>?)
创建可靠的文档结构(html标签包含head 和 body,在head只出现恰当的元素)
一个文档的对象模型
文档由多个Elements和TextNodes组成 (以及其它辅助nodes:详细可查看:nodes
package tree).
其继承结构如下:Document继承Element继承Node. TextNode继承 Node.
一个Element包含一个子节点集合,并拥有一个父Element。他们还提供了一个唯一的子元素过滤列表。

附上实例:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class test {

public static void main (String[] args){
try{
Document document
=Jsoup.connect("http://www.baidu.com/").get();
//==========================================
//直接抓取页面元素模块
//==========================================
//抓取文章title标签
String title=document.title();
//抓取文章text标签内容
String text=document.text();
//获取Html文件中的body元素
Element body=document.body();
//获取a标签
Elements aArray=body.getElementsByTag("a");
//类选择器
Elements classArray=body.getElementsByClass("s_form");//此处为类名,截取的div的类名
//获取属性
Elements attributesArray=body.getElementsByAttribute("href");
//获取子元素
Elements children=body.children();
//==========================================
//选择器模块
//==========================================
Elements aSelect=document.select("a[href]");

System.out.println(
"页面标题: "+title+"\n 页面内容: "+text+
"\n body:\n"+ body);
System.out.println(
"=================================================");

System.out.println(
"所有a标签:\n"+aArray);
System.out.println(
"=================================================");
System.out.println(
"div:\n"+classArray);
System.out.println(
"=================================================");
System.out.println(
"href:\n"+attributesArray);
System.out.println(
"=================================================");
System.out.println(
"children:\n"+children);
System.out.println(
"=================================================");
System.out.println(
"aSelect:\n"+aSelect);

}
catch (IOException e){
e.printStackTrace();
}

}

}

运行结果:

页面标题: 百度一下,你就知道
页面内容: 百度一下,你就知道 新闻 hao123 地图 视频 贴吧 登录 更多产品 关于百度 About Baidu
?2017?Baidu?使用百度前必读? 意见反馈?京ICP证030173号?
body:
<body link="#0000cc">
<div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div class="s_form">
<div class="s_form_wrapper">
<div id="lg">
<img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129">
</div>
<form id="form" name="f" action="//www.baidu.com/s" class="fm">
<input type="hidden" name="bdorz_come" value="1">
<input type="hidden" name="ie" value="utf-8">
<input type="hidden" name="f" value="8">
<input type="hidden" name="rsv_bp" value="1">
<input type="hidden" name="rsv_idx" value="1">
<input type="hidden" name="tn" value="baidu">
<span class="bg s_ipt_wr"><input id="kw" name="wd" class="s_ipt" value="" maxlength="255" autocomplete="off" autofocus></span>
<span class="bg s_btn_wr"><input type="submit" id="su" value="百度一下" class="bg s_btn"></span>
</form>
</div>
</div>
<div id="u1">
<a href="http://news.baidu.com" name="tj_trnews" class="mnav">新闻</a>
<a href="http://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a>
<a href="http://map.baidu.com" name="tj_trmap" class="mnav">地图</a>
<a href="http://v.baidu.com" name="tj_trvideo" class="mnav">视频</a>
<a href="http://tieba.baidu.com" name="tj_trtieba" class="mnav">贴吧</a>
<noscript>
<a href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login" class="lb">登录</a>
</noscript>
<script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script>
<a href="//www.baidu.com/more/" name="tj_briicon" class="bri" style="display: block;">更多产品</a>
</div>
</div>
</div>
<div id="ftCon">
<div id="ftConw">
<p id="lh"> <a href="http://home.baidu.com">关于百度</a> <a href="http://ir.baidu.com">About Baidu</a> </p>
<p id="cp">?2017&nbsp;Baidu&nbsp;<a href="http://www.baidu.com/duty/">使用百度前必读</a>&nbsp; <a href="http://jianyi.baidu.com/" class="cp-feedback">意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src="//www.baidu.com/img/gs.gif"> </p>
</div>
</div>
</div>
</body>
=================================================
所有a标签:
<a href="http://news.baidu.com" name="tj_trnews" class="mnav">新闻</a>
<a href="http://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a>
<a href="http://map.baidu.com" name="tj_trmap" class="mnav">地图</a>
<a href="http://v.baidu.com" name="tj_trvideo" class="mnav">视频</a>
<a href="http://tieba.baidu.com" name="tj_trtieba" class="mnav">贴吧</a>
<a href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login" class="lb">登录</a>
<a href="//www.baidu.com/more/" name="tj_briicon" class="bri" style="display: block;">更多产品</a>
<a href="http://home.baidu.com">关于百度</a>
<a href="http://ir.baidu.com">About Baidu</a>
<a href="http://www.baidu.com/duty/">使用百度前必读</a>
<a href="http://jianyi.baidu.com/" class="cp-feedback">意见反馈</a>
=================================================
div:
<div class="s_form">
<div class="s_form_wrapper">
<div id="lg">
<img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129">
</div>
<form id="form" name="f" action="//www.baidu.com/s" class="fm">
<input type="hidden" name="bdorz_come" value="1">
<input type="hidden" name="ie" value="utf-8">
<input type="hidden" name="f" value="8">
<input type="hidden" name="rsv_bp" value="1">
<input type="hidden" name="rsv_idx" value="1">
<input type="hidden" name="tn" value="baidu">
<span class="bg s_ipt_wr"><input id="kw" name="wd" class="s_ipt" value="" maxlength="255" autocomplete="off" autofocus></span>
<span class="bg s_btn_wr"><input type="submit" id="su" value="百度一下" class="bg s_btn"></span>
</form>
</div>
</div>
=================================================
href:
<a href="http://news.baidu.com" name="tj_trnews" class="mnav">新闻</a>
<a href="http://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a>
<a href="http://map.baidu.com" name="tj_trmap" class="mnav">地图</a>
<a href="http://v.baidu.com" name="tj_trvideo" class="mnav">视频</a>
<a href="http://tieba.baidu.com" name="tj_trtieba" class="mnav">贴吧</a>
<a href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login" class="lb">登录</a>
<a href="//www.baidu.com/more/" name="tj_briicon" class="bri" style="display: block;">更多产品</a>
<a href="http://home.baidu.com">关于百度</a>
<a href="http://ir.baidu.com">About Baidu</a>
<a href="http://www.baidu.com/duty/">使用百度前必读</a>
<a href="http://jianyi.baidu.com/" class="cp-feedback">意见反馈</a>
=================================================
children:
<div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div class="s_form">
<div class="s_form_wrapper">
<div id="lg">
<img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129">
</div>
<form id="form" name="f" action="//www.baidu.com/s" class="fm">
<input type="hidden" name="bdorz_come" value="1">
<input type="hidden" name="ie" value="utf-8">
<input type="hidden" name="f" value="8">
<input type="hidden" name="rsv_bp" value="1">
<input type="hidden" name="rsv_idx" value="1">
<input type="hidden" name="tn" value="baidu">
<span class="bg s_ipt_wr"><input id="kw" name="wd" class="s_ipt" value="" maxlength="255" autocomplete="off" autofocus></span>
<span class="bg s_btn_wr"><input type="submit" id="su" value="百度一下" class="bg s_btn"></span>
</form>
</div>
</div>
<div id="u1">
<a href="http://news.baidu.com" name="tj_trnews" class="mnav">新闻</a>
<a href="http://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a>
<a href="http://map.baidu.com" name="tj_trmap" class="mnav">地图</a>
<a href="http://v.baidu.com" name="tj_trvideo" class="mnav">视频</a>
<a href="http://tieba.baidu.com" name="tj_trtieba" class="mnav">贴吧</a>
<noscript>
<a href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login" class="lb">登录</a>
</noscript>
<script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script>
<a href="//www.baidu.com/more/" name="tj_briicon" class="bri" style="display: block;">更多产品</a>
</div>
</div>
</div>
<div id="ftCon">
<div id="ftConw">
<p id="lh"> <a href="http://home.baidu.com">关于百度</a> <a href="http://ir.baidu.com">About Baidu</a> </p>
<p id="cp">?2017&nbsp;Baidu&nbsp;<a href="http://www.baidu.com/duty/">使用百度前必读</a>&nbsp; <a href="http://jianyi.baidu.com/" class="cp-feedback">意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src="//www.baidu.com/img/gs.gif"> </p>
</div>
</div>
</div>
=================================================
aSelect:
<a href="http://news.baidu.com" name="tj_trnews" class="mnav">新闻</a>
<a href="http://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a>
<a href="http://map.baidu.com" name="tj_trmap" class="mnav">地图</a>
<a href="http://v.baidu.com" name="tj_trvideo" class="mnav">视频</a>
<a href="http://tieba.baidu.com" name="tj_trtieba" class="mnav">贴吧</a>
<a href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login" class="lb">登录</a>
<a href="//www.baidu.com/more/" name="tj_briicon" class="bri" style="display: block;">更多产品</a>
<a href="http://home.baidu.com">关于百度</a>
<a href="http://ir.baidu.com">About Baidu</a>
<a href="http://www.baidu.com/duty/">使用百度前必读</a>
<a href="http://jianyi.baidu.com/" class="cp-feedback">意见反馈</a>