HTML Page Parsing Component: Using Jsoup

Date: 2022-11-01 10:09:15

Original article: http://blog.sina.com.cn/s/blog_7227719a0100lpix.html


Parsing HTML page content on the Java side

Jsoup turns HTML parsing into DOM manipulation, much like working with an HTML page directly in JavaScript.

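Usage:

Document doc = Jsoup.parse(new URL("http://www.baidu.com"), 5000);

This fetches the HTML page from a URL (the second argument is the timeout in milliseconds) and parses it straight into a DOM object. You can also pass in an HTML String you already have, or even a File object or an input stream.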
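The other input sources mentioned above can be sketched as follows. This is a minimal example, not the original author's code: the sample markup, the class name ParseSources, and the base URI http://example.com/ are assumptions made up for illustration. It parses from a String and from an InputStream so that it runs without network access.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ParseSources {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
            "<html><head><title>Demo</title></head>"
            + "<body><p>Hello <a href='/about'>about</a></p></body></html>";

    // Parse markup that is already in memory.
    // The base URI (an assumption here) lets jsoup resolve relative links.
    static Document fromString() {
        return Jsoup.parse(HTML, "http://example.com/");
    }

    // Parse the same markup from an input stream with an explicit charset.
    static Document fromStream() {
        try (InputStream in = new ByteArrayInputStream(HTML.getBytes(StandardCharsets.UTF_8))) {
            return Jsoup.parse(in, "UTF-8", "http://example.com/");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = fromString();
        System.out.println(doc.title());                             // Demo
        System.out.println(doc.select("a").first().absUrl("href"));  // http://example.com/about
        System.out.println(fromStream().title());                    // Demo
    }
}
```

A File can be parsed the same way with Jsoup.parse(file, "UTF-8"), assuming the file exists and is readable.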

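Each element is wrapped in an Element object.
A collection of elements is wrapped in an Elements object (an ordered list of Element; in current jsoup releases it extends ArrayList<Element>).
Elements links = doc.getElementsByTag("a");

// jsoup has no getElementsByName; match on the name attribute instead
Elements named = doc.getElementsByAttributeValue("name", "name");
...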
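A minimal sketch of these lookup methods in use. The markup and the class and method names (ElementLookup, linkTexts, valueOfInput) are hypothetical and exist only for this illustration; the jsoup calls are getElementsByTag, getElementsByAttributeValue, text, and attr.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class ElementLookup {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
            "<form id='login'><input name='user' value='alice'><input name='pass'></form>"
            + "<a href='http://example.com/a'>A</a><a href='/b'>B</a>";

    // Collect the visible text of every <a> element, in document order.
    static List<String> linkTexts() {
        Document doc = Jsoup.parse(HTML);
        Elements links = doc.getElementsByTag("a");
        List<String> texts = new ArrayList<>();
        for (Element link : links) {
            texts.add(link.text());
        }
        return texts;
    }

    // Find the first element whose name attribute equals the given value
    // and return its value attribute; null if no element matches.
    static String valueOfInput(String name) {
        Document doc = Jsoup.parse(HTML);
        Elements matches = doc.getElementsByAttributeValue("name", name);
        return matches.isEmpty() ? null : matches.first().attr("value");
    }

    public static void main(String[] args) {
        System.out.println(linkTexts());           // [A, B]
        System.out.println(valueOfInput("user"));  // alice
    }
}
```

Note that attr returns an empty string, not null, when the element exists but lacks the requested attribute.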
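The most convenient option is the select method, which takes CSS-style selectors and plays a role similar to XPath.
Elements elems = doc.select("a[href^=http]"); // href starts with http
More rules: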

Selector overview
  • tagname: find elements by tag, e.g. a
  • ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
  • #id: find elements by ID, e.g. #logo
  • .class: find elements by class name, e.g. .masthead
  • [attribute]: elements with attribute, e.g. [href]
  • [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
  • [attr=value]: elements with attribute value, e.g. [width=500]
  • [attr^=value][attr$=value][attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
  • [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
  • *: all elements, e.g. *
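To make the basic selectors above concrete, here is a small sketch that counts matches against a made-up page. The markup and the names BasicSelectors and count are assumptions for illustration only; the selector strings are taken from the list above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class BasicSelectors {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
            "<div id='logo' class='masthead'><img src='logo.PNG'></div>"
            + "<a href='http://example.com/x'>absolute</a>"
            + "<a href='/path/y'>relative</a>"
            + "<a name='top'>no href</a>"
            + "<p data-id='7' width='500'>text</p>";

    static final Document DOC = Jsoup.parse(HTML);

    // Number of elements in the sample page that match the CSS query.
    static int count(String cssQuery) {
        return DOC.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "a",                            // by tag
            "#logo",                        // by ID
            ".masthead",                    // by class
            "[href]",                       // has attribute
            "[^data-]",                     // attribute name prefix
            "[width=500]",                  // attribute value
            "a[href^=http]",                // value starts with
            "[href*=/path/]",               // value contains
            "img[src~=(?i)\\.(png|jpe?g)]"  // value matches regex
        };
        for (String q : queries) {
            System.out.println(q + " -> " + count(q));
        }
    }
}
```

In Java source the backslash in the regex has to be doubled, which is why the last query is written with \\. rather than \. as in the list.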
Selector combinations
  • el#id: elements with ID, e.g. div#logo
  • el.class: elements with class, e.g. div.masthead
  • el[attr]: elements with attribute, e.g. a[href]
  • Any combination, e.g. a[href].highlight
  • ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
  • parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
  • siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
  • siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
  • el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
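The difference between the combinators is easiest to see side by side. The sketch below runs several of them against a made-up page; the markup and the names Combinators and texts are hypothetical and only serve the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class Combinators {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
            "<div class='content'><p>direct</p><section><p>nested</p></section></div>"
            + "<div class='head'>head</div><div>after head</div>"
            + "<h1>title</h1><p>first</p><span>x</span><p>second</p>";

    static final Document DOC = Jsoup.parse(HTML);

    // Text of every element that matches the CSS query.
    static List<String> texts(String cssQuery) {
        List<String> out = new ArrayList<>();
        for (Element e : DOC.select(cssQuery)) {
            out.add(e.text());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(texts("div.content p"));    // any depth: [direct, nested]
        System.out.println(texts("div.content > p"));  // direct children only: [direct]
        System.out.println(texts("div.head + div"));   // immediately following sibling: [after head]
        System.out.println(texts("h1 ~ p"));           // any following sibling: [first, second]
        System.out.println(texts("div.head, h1"));     // union of two selectors
    }
}
```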
Pseudo selectors
  • :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
  • :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
  • :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
  • :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
  • :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
  • :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
  • :containsOwn(text): find elements that directly contain the given text
  • :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
  • :matchesOwn(regex): find elements whose own text matches the specified regular expression
  • Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
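The pseudo selectors can be tried out in the same way. This sketch uses a made-up page (the markup and the names PseudoSelectors and texts are assumptions for illustration) to show the 0-based sibling index and the difference between :contains and :containsOwn.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class PseudoSelectors {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
            "<table><tr><td>c0</td><td>c1</td><td>c2</td><td>c3</td></tr></table>"
            + "<div class='logo'><p>jsoup is neat</p></div>"
            + "<div><span>Login here</span></div>"
            + "<div>plain</div>";

    static final Document DOC = Jsoup.parse(HTML);

    // Text of every element that matches the CSS query.
    static List<String> texts(String cssQuery) {
        List<String> out = new ArrayList<>();
        for (Element e : DOC.select(cssQuery)) {
            out.add(e.text());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(3)"));                // [c0, c1, c2]
        System.out.println(texts("td:gt(2)"));                // [c3]
        System.out.println(texts("td:eq(1)"));                // [c1]
        System.out.println(texts("div:has(p)"));              // [jsoup is neat]
        System.out.println(texts("div:not(.logo)"));          // [Login here, plain]
        System.out.println(texts("p:contains(JSOUP)"));       // case-insensitive: [jsoup is neat]
        System.out.println(texts("div:contains(login)"));     // text anywhere below: [Login here]
        System.out.println(texts("div:containsOwn(login)"));  // text sits in the span: []
        System.out.println(texts("div:matches((?i)login)"));  // [Login here]
    }
}
```

The :containsOwn query returns nothing here because the matching text belongs to the nested span, not to the div itself.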

See the Selector API reference for the full supported list and details.

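Advantages:

1. Very simple to use: working with it feels like manipulating the DOM in JS, so it is intuitive and familiar
2. The selectors are powerful and make it easy to find elements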