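In a recent project I had to get a Java program to call a Python crawler and integrate the two through an interface. It took a fair amount of debugging before I got the hang of it.
Along the way I found that crawlers are fun, so I decided to write one in Java. It is only a modest demo, but I want to record it here for future reference.
Straight to the code: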
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.util.StringUtils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class MySpider {
    public static void main(String[] args) {
        List<NewsEntity> list = new ArrayList<>();
        // Baidu hot-search ranking page (百度风云榜)
        Connection connect = Jsoup.connect("http://top.baidu.com/buzz?b=1&fr=tph_right");
        // Send a browser-like User-Agent (this string identifies as Internet Explorer 9)
        connect.userAgent("Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)");
        try {
            Document document = connect.get(); // fetch the page and parse it into a Document
            Element main = document.getElementById("main"); // root element of the area to scrape
            if (main == null) {
                System.out.println("Element #main not found; the page structure may have changed");
                return;
            }
            Elements rows = main.select("div[class=mainBody]").select("table[class=list-table]")
                    .select("tbody").select("tr"); // CSS selectors
            for (Element element : rows) {
                Elements link = element.select("td[class=keyword]").select("a[class=list-title]");
                String attrUrl = link.attr("href");
                String text = link.text();
                String span = element.select("td[class=last]").select("span").text();
                // Skip header rows and rows that are missing any of the three fields
                if (StringUtils.isEmpty(attrUrl) || StringUtils.isEmpty(text) || StringUtils.isEmpty(span)) {
                    continue;
                }
                NewsEntity entity = new NewsEntity();
                entity.setTitle(text);
                entity.setUrl(attrUrl);
                entity.setHots(span);
                list.add(entity);
                if (list.size() >= 10) { // keep only the first ten entries
                    break;
                }
            }
            System.out.println(list.toString());
            System.out.println(list.size());
        } catch (IOException e) {
            e.printStackTrace();
            System.out.println("Failed to fetch the page (network error or access denied)");
        }
    }
}
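The filter-and-cap logic in the loop above (skip rows with an empty title, link, or heat value, and stop after ten entries) can be separated from the network code and tried out without jsoup. This is a minimal sketch, assuming the rows have already been extracted into String arrays of {title, url, hots}. The class RowFilterSketch, the method firstValid, and the sample data are hypothetical and not part of the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the original MySpider class.
final class RowFilterSketch {

    // Returns at most `limit` rows whose first three fields are all non-empty,
    // preserving the original order.
    static List<String[]> firstValid(List<String[]> rows, int limit) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (kept.size() >= limit) {
                break; // enough entries collected
            }
            if (row == null || row.length < 3
                    || isEmpty(row[0]) || isEmpty(row[1]) || isEmpty(row[2])) {
                continue; // header rows and incomplete rows are skipped
            }
            kept.add(row);
        }
        return kept;
    }

    private static boolean isEmpty(String s) {
        return s == null || s.isEmpty();
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for scraped table rows.
        List<String[]> rows = Arrays.asList(
                new String[] {"", "", ""},                                 // e.g. a header row
                new String[] {"Topic A", "http://example.com/a", "1200"},
                new String[] {"Topic B", "", "900"},                       // missing link
                new String[] {"Topic C", "http://example.com/c", "800"});
        for (String[] row : firstValid(rows, 10)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Keeping this step in a plain method makes it possible to check the skip and limit behaviour with fixed data, independent of whether the target page is reachable.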
A simple entity class that holds each result:
/**
 * @author RYH
 * @description Entity that wraps one news item
 * @date 2019/2/26
 **/
public class NewsEntity {
    private String title;
    private String url;
    private String hots;

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getHots() {
        return hots;
    }

    public void setHots(String hots) {
        this.hots = hots;
    }

    @Override
    public String toString() {
        return "NewsEntity{" +
                "title='" + title + '\'' +
                ", url='" + url + '\'' +
                ", hots=" + hots +
                '}';
    }
}
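Because hots is stored as a String, ordering entries by popularity would require converting it to a number first. The following is a minimal sketch of that step, assuming the heat value is a plain integer string; the class HotsSortSketch and its methods are hypothetical and work on bare strings so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original code.
final class HotsSortSketch {

    // Parses a heat value given as digits; returns -1 when the text is
    // null or not a plain integer, so such values sort last.
    static long parseHots(String hots) {
        if (hots == null) {
            return -1L;
        }
        try {
            return Long.parseLong(hots.trim());
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    // Returns a new list ordered from the highest heat value to the lowest.
    static List<String> sortDescending(List<String> hotsValues) {
        List<String> sorted = new ArrayList<>(hotsValues);
        sorted.sort((a, b) -> Long.compare(parseHots(b), parseHots(a)));
        return sorted;
    }
}
```

The same comparator could be applied to a list of NewsEntity objects by calling parseHots on getHots() for each side of the comparison.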
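The only scraping library involved is jsoup, and it is quite capable. Note that the code also uses StringUtils from Spring, so the project needs jsoup itself (Maven coordinates org.jsoup:jsoup) on the classpath as well as a Spring dependency. The Maven snippet below is the Spring one, which brings in spring-core (where StringUtils lives) transitively: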
<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-jdbc</artifactId>
    <version>5.1.4.RELEASE</version>
</dependency>
The console output is easy to read at a glance, and simple scraping like this turns out to be quite easy.