Android中利用jsoup解析html页面

Android 中使用:

添加依赖

    implementation 'org.jsoup:jsoup:1.10.1'

直接上代码:

package com.loaderman.jsoupdemo;

import android.os.Bundle;

import android.support.v7.app.AppCompatActivity;

import android.view.View;

import org.jsoup.Jsoup;

import org.jsoup.nodes.Document;

import org.jsoup.nodes.Element;

import org.jsoup.select.Elements;

import java.io.IOException;

public class Main2Activity extends AppCompatActivity {

    @Override

    protected void onCreate(Bundle savedInstanceState) {

        super.onCreate(savedInstanceState);

        setContentView(R.layout.activity_main2);

        findViewById(R.id.btn).setOnClickListener(new View.OnClickListener() {

            @Override

            public void onClick(View view) {

                new Thread(new Runnable() {

                    @Override

                    public void run() {

                        try {

                            Document doc = (Document) Jsoup.connect("http://192.168.0.195:8088/news.html").get();//解析html

                            Elements links = doc.select("ul[class=w_newslistpage_list]");//获取li标签且class为w_newslistpage_list的标签

                            for (Element link : links) {

                                Elements li = link.select("li");//查找li标签

                                for (Element element : li) {//遍历

                                    Elements select = element.select("a[title]");//查找a标签且带有title属性的标签

                                    if (select!=null&&select.size()>0){

                                        String linkHref = select.get(0).attr("href");//获取href值

                                        String linkText = select.get(0).text();//获取text

                                        System.out.println("爬虫结果 1 -->  " + linkHref +linkText);

                                    }

                                    Elements select1 = link.select("span[class=date]");//获取span标签且class为date的标签

                                    if (select1!=null&&select1.size()>0){

                                        String date = select1.get(0).text();

                                        System.out.println("爬虫结果 2--> " + date);

                                    }

                                }

                            }

                        } catch (IOException e) {

                            e.printStackTrace();

                        }

                    }

                }).start();

            }

        });

    }

}

小结如下:

解析和遍历一个HTML文档

如何解析一个HTML文档：

String html = "<html><head><title>First parse</title></head>"

  + "<body><p>Parsed HTML into a doc.</p></body></html>";

Document doc = Jsoup.parse(html);

其解析器能够尽最大可能从你提供的HTML文档来创见一个干净的解析结果，无论HTML的格式是否完整。比如它可以处理：

没有关闭的标签 (比如： Lorem Ipsum parses to Lorem Ipsum)
隐式标签 (比如. 它可以自动将 <td>Table data</td>包装成<table><tr><td>?)
创建可靠的文档结构（html标签包含head 和 body，在head只出现恰当的元素）

一个文档的对象模型

文档由多个Elements和TextNodes组成 (
其继承结构如下：Document继承Element继承Node. TextNode继承 Node.
一个Element包含一个子节点集合，并拥有一个父Element。他们还提供了一个唯一的子元素过滤列表。

从一个文件加载一个文档

File input = new File("/tmp/input.html");

Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

说明

parse(File in, String charsetName, String baseUri) 这个方法用来加载和解析一个HTML文件。如在加载文件的时候发生错误，将抛出IOException，应作适当处理。

baseUri 参数用于解决文件中URLs是相对路径的问题。如果不需要可以传入一个空的字符串。

另外还有一个方法parse(File in, String charsetName) ，它使用文件的路径做为 baseUri。这个方法适用于如果被解析文件位于网站的本地文件系统，且相关链接也指向该文件系统。

使用选择器语法来查找元素

File input = new File("/tmp/input.html");

Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); //带有href属性的a元素

Elements pngs = doc.select("img[src$=.png]");

  //扩展名为.png的图片

Element masthead = doc.select("div.masthead").first();

  //class等于masthead的div标签

Elements resultLinks = doc.select("h3.r > a"); //在h3元素之后的a元素

说明

jsoup elements对象支持类似于CSS (或jquery)的选择器语法，来实现非常强大和灵活的查找功能。.

这个select 方法在Document, Element,或Elements对象中都可以使用。且是上下文相关的，因此可实现指定元素的过滤，或者链式选择访问。

Select方法将返回一个Elements集合，并提供一组方法来抽取和处理结果。

Selector选择器概述

tagname: 通过标签查找元素，比如：a
ns|tag: 通过标签在命名空间查找元素，比如：可以用 fb|name 语法来查找 <fb:name> 元素
#id: 通过ID查找元素，比如：#logo
.class: 通过class名称查找元素，比如：.masthead
[attribute]: 利用属性查找元素，比如：[href]
[^attr]: 利用属性名前缀来查找元素，比如：可以用[^data-] 来查找带有HTML5 Dataset属性的元素
[attr=value]: 利用属性值来查找元素，比如：[width=500]
[attr^=value], [attr$=value], [attr*=value]: 利用匹配属性值开头、结尾或包含属性值来查找元素，比如：[href*=/path/]
[attr~=regex]: 利用属性值匹配正则表达式来查找元素，比如： img[src~=(?i)\.(png|jpe?g)]
*: 这个符号将匹配所有元素

Selector选择器组合使用

el#id: 元素+ID，比如： div#logo
el.class: 元素+class，比如： div.masthead
el[attr]: 元素+class，比如： a[href]
任意组合，比如：a[href].highlight
ancestor child: 查找某个元素下子元素，比如：可以用.body p 查找在"body"元素下的所有 p元素
parent > child: 查找某个父元素下的直接子元素，比如：可以用div.content > p 查找 p 元素，也可以用body > * 查找body标签下所有直接子元素
siblingA + siblingB: 查找在A元素之前第一个同级元素B，比如：div.head + div
siblingA ~ siblingX: 查找A元素之前的同级X元素，比如：h1 ~ p
el, el, el:多个选择器组合，查找匹配任一选择器的唯一元素，例如：div.masthead, div.logo

伪选择器selectors

:lt(n): 查找哪些元素的同级索引值（它的位置在DOM树中是相对于它的父节点）小于n，比如：td:lt(3) 表示小于三列的元素
:gt(n):查找哪些元素的同级索引值大于n，比如： div p:gt(2)表示哪些div中有包含2个以上的p元素
:eq(n): 查找哪些元素的同级索引值与n相等，比如：form input:eq(1)表示包含一个input标签的Form元素
:has(seletor): 查找匹配选择器包含元素的元素，比如：div:has(p)表示哪些div包含了p元素
:not(selector): 查找与选择器不匹配的元素，比如： div:not(.logo) 表示不包含 class="logo" 元素的所有 div 列表
:contains(text): 查找包含给定文本的元素，搜索不区分大不写，比如： p:contains(jsoup)
:containsOwn(text): 查找直接包含给定文本的元素
:matches(regex): 查找哪些元素的文本匹配指定的正则表达式，比如：div:matches((?i)login)
:matchesOwn(regex): 查找自身包含文本匹配指定正则表达式的元素
注意：上述伪选择器索引是从0开始的，也就是说第一个元素索引值为0，第二个元素index为1等

具体api如下:

CSS-like element selector, that finds elements matching a query.

Selector syntax

A selector is a chain of simple selectors, separated by combinators. Selectors are case insensitive (including against elements, attributes, and attribute values).

The universal selector (*) is implicit when no element selector is supplied (i.e. *.header and .header is equivalent).

Pattern	Matches	Example
`*`	any element	`*`
`tag`	elements with the given tag name	`div`
`*\|E`	elements of type E in any namespace ns	`*\|name` finds `<fb:name>` elements
`ns\|E`	elements of type E in the namespace ns	`fb\|name` finds `<fb:name>` elements
`#id`	elements with attribute ID of "id"	`div#wrap`, `#logo`
`.class`	elements with a class name of "class"	`div.left`, `.result`
`[attr]`	elements with an attribute named "attr" (with any value)	`a[href]`, `[title]`
`[^attrPrefix]`	elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets	`[^data-]`, `div[^data-]`
`[attr=val]`	elements with an attribute named "attr", and value equal to "val"	`img[width=500]`, `a[rel=nofollow]`
`[attr="val"]`	elements with an attribute named "attr", and value equal to "val"	`span[hello="Cleveland"][goodbye="Columbus"]`, `a[rel="nofollow"]`
`[attr^=valPrefix]`	elements with an attribute named "attr", and value starting with "valPrefix"	`a[href^=http:]`
`[attr$=valSuffix]`	elements with an attribute named "attr", and value ending with "valSuffix"	`img[src$=.png]`
`[attr*=valContaining]`	elements with an attribute named "attr", and value containing "valContaining"	`a[href*=/search/]`
`[attr~=regex]`	elements with an attribute named "attr", and value matching the regular expression	`img[src~=(?i)\\.(png\|jpe?g)]`
	The above may be combined in any order	`div.header[title]`
	Combinators
`E F`	an F element descended from an E element	`div a`, `.logo h1`
`E > F`	an F direct child of E	`ol > li`
`E + F`	an F element immediately preceded by sibling E	`li + li`, `div.head + div`
`E ~ F`	an F element preceded by sibling E	`h1 ~ p`
`E, F, G`	all matching elements E, F, or G	`a[href], div, h3`
	Pseudo selectors
`:lt(n)`	elements whose sibling index is less than n	`td:lt(3)` finds the first 3 cells of each row
`:gt(n)`	elements whose sibling index is greater than n	`td:gt(1)` finds cells after skipping the first two
`:eq(n)`	elements whose sibling index is equal to n	`td:eq(0)` finds the first cell of each row
`:has(selector)`	elements that contains at least one element matching the selector	`div:has(p)` finds divs that contain p elements
`:not(selector)`	elements that do not match the selector. See also `Elements.not(String)`	`div:not(.logo)` finds all divs that do not have the "logo" class. `div:not(:has(div))` finds divs that do not contain divs.
`:contains(text)`	elements that contains the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants.	`p:contains(jsoup)` finds p elements containing the text "jsoup".
`:matches(regex)`	elements whose text matches the specified regular expression. The text may appear in the found element, or any of its descendants.	`td:matches(\\d+)` finds table cells containing digits. `div:matches((?i)login)` finds divs containing the text, case insensitively.
`:containsOwn(text)`	elements that directly contain the specified text. The search is case insensitive. The text must appear in the found element, not any of its descendants.	`p:containsOwn(jsoup)` finds p elements with own text "jsoup".
`:matchesOwn(regex)`	elements whose own text matches the specified regular expression. The text must appear in the found element, not any of its descendants.	`td:matchesOwn(\\d+)` finds table cells directly containing digits. `div:matchesOwn((?i)login)` finds divs containing the text, case insensitively.
`:containsData(data)`	elements that contains the specified data. The contents of `script` and `style` elements, and `comment` nodes (etc) are considered data nodes, not text nodes. The search is case insensitive. The data may appear in the found element, or any of its descendants.	`script:contains(jsoup)` finds script elements containing the data "jsoup".
	The above may be combined in any order and with other selectors	`.light:contains(name):eq(0)`
`:matchText`	treats text nodes as elements, and so allows you to match against and select text nodes. Note that using this selector will modify the DOM, so you may want to `clone` your document before using.	`p:matchText:firstChild` with input `<p>One<br />Two</p>` will return one `PseudoTextElement` with text "`One`".
Structural pseudo selectors
`:root`	The element that is the root of the document. In HTML, this is the `html` element	`:root`
`:nth-child(an+b)`	elements that have `an+b-1` siblings before it in the document tree, for any positive integer or zero value of `n`, and has a parent element. For values of `a` and `b` greater than zero, this effectively divides the element's children into groups of a elements (the last group taking the remainder), and selecting the bth element of each group. For example, this allows the selectors to address every other row in a table, and could be used to alternate the color of paragraph text in a cycle of four. The `a` and `b` values must be integers (positive, negative, or zero). The index of the first child of an element is 1. In addition to this, `:nth-child()` can take `odd` and `even` as arguments instead. `odd` has the same signification as `2n+1`, and `even` has the same signification as `2n`.	`tr:nth-child(2n+1)` finds every odd row of a table. `:nth-child(10n-1)` the 9th, 19th, 29th, etc, element. `li:nth-child(5)` the 5h li
`:nth-last-child(an+b)`	elements that have `an+b-1` siblings after it in the document tree. Otherwise like `:nth-child()`	`tr:nth-last-child(-n+2)` the last two rows of a table
`:nth-of-type(an+b)`	pseudo-class notation represents an element that has `an+b-1` siblings with the same expanded element name before it in the document tree, for any zero or positive integer value of n, and has a parent element	`img:nth-of-type(2n+1)`
`:nth-last-of-type(an+b)`	pseudo-class notation represents an element that has `an+b-1` siblings with the same expanded element name after it in the document tree, for any zero or positive integer value of n, and has a parent element	`img:nth-last-of-type(2n+1)`
`:first-child`	elements that are the first child of some other element.	`div > p:first-child`
`:last-child`	elements that are the last child of some other element.	`ol > li:last-child`
`:first-of-type`	elements that are the first sibling of its type in the list of children of its parent element	`dl dt:first-of-type`
`:last-of-type`	elements that are the last sibling of its type in the list of children of its parent element	`tr > td:last-of-type`
`:only-child`	elements that have a parent element and whose parent element hasve no other element children
`:only-of-type`	an element that has a parent element and whose parent element has no other element children with the same expanded element name
`:empty`	elements that have no children at all

秒客网