Parsing HTML documents with jsoup

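jsoup is an HTML parser for Java that can parse HTML directly from a URL or from a string of HTML text. It provides a simple, low-effort API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. The examples in this article walk through the most common ways to parse HTML with jsoup.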
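As a first taste of the jQuery-like manipulation methods mentioned above, here is a minimal sketch. The class name `ManipulationSketch`, the method `markLinks`, and the `rel`/`class` values are made up for illustration; only the jsoup calls (`Jsoup.parseBodyFragment`, `select`, `Elements.attr`, `Elements.addClass`) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ManipulationSketch {

    // Hypothetical helper: tags every link in an HTML fragment and
    // returns the modified body HTML.
    static String markLinks(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        // Like jQuery, a selection can be modified in bulk and the calls chained.
        doc.select("a[href]")
           .attr("rel", "nofollow")   // set an attribute on every matched link
           .addClass("external");     // add a CSS class to every matched link
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(markLinks("<p><a href=\"http://jsoup.org/\">jsoup</a></p>"));
    }
}
```

The printed HTML should contain the same link with the added `rel` and `class` attributes.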

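Parsing HTML with jsoup takes four main steps:

1. Add the jsoup jar to the project. Download: http://jsoup.org/download

2. Create a Java file in the project and obtain the HTML Document from a URL, a file, or a string

3. Parse the document and extract HTML elements (elements can be found and filtered with DOM methods or CSS selectors)

4. Process the data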
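Steps 2 to 4 can be sketched end to end on an in-memory string, assuming the jar from step 1 is already on the classpath. The class name `FourStepsSketch`, the method `collectLinks`, and the sample HTML are illustrative assumptions, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class FourStepsSketch {

    // Hypothetical helper covering steps 2-4 for an HTML string.
    static List<String> collectLinks(String html) {
        // Step 2: obtain a Document (here from a string)
        Document doc = Jsoup.parse(html);

        // Step 3: extract elements, here with a CSS selector
        Elements links = doc.select("a[href]");

        // Step 4: process the data; here each link becomes "text -> href"
        List<String> result = new ArrayList<>();
        for (Element link : links) {
            result.add(link.text() + " -> " + link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<ul>"
                + "<li><a href=\"http://jsoup.org/download\">Download</a></li>"
                + "<li><a href=\"http://jsoup.org/apidocs/\">API docs</a></li>"
                + "<li><a>no target</a></li>"
                + "</ul>";
        for (String line : collectLinks(html)) {
            System.out.println(line);
        }
    }
}
```

The third anchor has no `href` attribute, so the `a[href]` selector skips it and only two lines are printed.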

Example jsoup code:

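import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test; // JUnit 4 is assumed here

// The class name is arbitrary; the test methods just need a class to live in.
public class JsoupTest {

    // Parse an HTML string with jsoup
    @Test
    public void testJsoupByString() {
        String html = "<p><a href=\"https://www.baidu.com\">百度一下，你就知道</a></p>";
        Document doc = Jsoup.parse(html);
        Elements paragraphs = doc.getElementsByTag("p");
        // First <a> inside the first <p>
        Element link = paragraphs.get(0).getElementsByTag("a").get(0);
        System.out.println(link.text());        // the link text
        System.out.println(link.attr("href"));  // the link target
    }

    // Fetch the HTML behind a URL, then parse and process it
    @Test
    public void testJsoupByUrl() throws IOException {
        Elements links = Jsoup.connect("http://www.csdn.com").get().getElementsByTag("a");
        for (Element link : links) {
            // Print the inner HTML of every link whose text starts with "C"
            if (link.text().startsWith("C")) {
                System.out.println(link.html());
            }
        }
    }

    // Parse a local file
    @Test
    public void testJsoupByFile() throws IOException {
        File file = new File("C:/LC/lc_workspace/SSI/WebRoot/index.jsp");
        Document doc = Jsoup.parse(file, "utf-8");
        Elements selects = doc.getElementsByTag("select");
        for (Element select : selects) {
            System.out.println(select.html());
        }
    }
}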
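Links in real pages are often relative. A document fetched with `Jsoup.connect(...).get()` knows its own location, but when parsing a string or a file a base URI can be supplied so that relative links resolve. The sketch below shows this with `Jsoup.parse(html, baseUri)` and `Element.absUrl`; the class name `AbsoluteLinksSketch`, the method `absoluteLinks`, and the sample HTML are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class AbsoluteLinksSketch {

    // Hypothetical helper: returns every link target as an absolute URL.
    static List<String> absoluteLinks(String html, String baseUri) {
        // The base URI is what relative href values are resolved against.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/download\">Download</a>"
                + "<a href=\"http://jsoup.org/apidocs/\">API docs</a>";
        System.out.println(absoluteLinks(html, "http://jsoup.org/"));
    }
}
```

Here the relative `/download` is resolved against `http://jsoup.org/`, while the already absolute link is returned unchanged.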

Parsing and processing HTML with jsoup really is that simple. jsoup is released under the MIT license, so it can be used easily in any project.

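Note: this article is only a brief introduction to parsing and processing HTML with jsoup. jsoup's most powerful feature is its selectors; for more detailed documentation, see: http://jsoup.org/apidocs/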
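To give a rough idea of what selectors can do, here is a small sketch using a few common CSS selector forms with `Document.select`. The class name `SelectorSketch`, its helper methods, and the sample HTML are hypothetical and exist only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorSketch {

    // Made-up sample markup for the selectors below.
    static final String HTML =
            "<div id=\"nav\">"
          + "<a class=\"ext\" href=\"http://jsoup.org/download\">Download</a>"
          + "<a href=\"/about.html\">About</a>"
          + "</div>"
          + "<select name=\"lang\">"
          + "<option value=\"java\" selected>Java</option>"
          + "<option value=\"go\">Go</option>"
          + "</select>";

    // Number of elements matching a CSS selector.
    static int count(String css) {
        return Jsoup.parse(HTML).select(css).size();
    }

    // Text of the first match, or an empty string if nothing matches.
    static String firstText(String css) {
        Document doc = Jsoup.parse(HTML);
        Element first = doc.select(css).first();
        return first == null ? "" : first.text();
    }

    public static void main(String[] args) {
        System.out.println(count("#nav a"));          // links inside the element with id="nav"
        System.out.println(count("a.ext"));           // links with class "ext"
        System.out.println(count("a[href^=http]"));   // links whose href starts with "http"
        System.out.println(firstText("select[name=lang] option[selected]")); // selected option
    }
}
```

Each selector replaces what would otherwise be a chain of `getElementsByTag` calls and manual filtering.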
