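Scraping All JD Product Information with the Java Crawler Gecco

The Gecco crawler

If you are not yet familiar with Gecco, see the project's GitHub home page. Gecco is simple to use: scraping all of JD's product information takes only nine classes.

Analyzing the JD site

To scrape all product information from JD, we first analyze the site. It can be roughly divided into three levels: the home page links to product list pages by category, and each product on a list page has its own detail page. So once we have found every category, we can scrape product information one category at a time.

Entry URL

http://www.jd.com/allSort.aspx is JD's list of all product categories. We use this page as the start page for scraping all of JD's product information.

Create the HtmlBean class AllSort for the start page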

import java.util.List;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Request;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.spider.HtmlBean;

@Gecco(matchUrl="http://www.jd.com/allSort.aspx", pipelines={"consolePipeline", "allSortPipeline"})
public class AllSort implements HtmlBean {

    private static final long serialVersionUID = 665662335318691818L;

    // The request that fetched this page; used later to create sub-requests
    @Request
    private HttpRequest request;

    // Mobile phones
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl")
    private List<Category> mobile;

    // Household appliances
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(3) > div.mc > div.items > dl")
    private List<Category> domestic;

    public List<Category> getMobile() {
        return mobile;
    }

    public void setMobile(List<Category> mobile) {
        this.mobile = mobile;
    }

    public List<Category> getDomestic() {
        return domestic;
    }

    public void setDomestic(List<Category> domestic) {
        this.domestic = domestic;
    }

    public HttpRequest getRequest() {
        return request;
    }

    public void setRequest(HttpRequest request) {
        this.request = request;
    }

}

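This example scrapes two top-level categories, mobile phones and household appliances. Each top-level category contains several subcategories, represented as List<Category>. Gecco supports nested beans, which map well onto the structure of an HTML page. Category holds the subcategory information, and HrefBean is a shared bean for links.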

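import java.util.List;

import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HrefBean;
import com.geccocrawler.gecco.spider.HtmlBean;

public class Category implements HtmlBean {

    private static final long serialVersionUID = 3018760488621382659L;

    // Text of the <dt> link: the name of the parent category
    @Text
    @HtmlField(cssPath="dt a")
    private String parentName;

    // Links in the <dd> element: the subcategories
    @HtmlField(cssPath="dd a")
    private List<HrefBean> categorys;

    public String getParentName() {
        return parentName;
    }

    public void setParentName(String parentName) {
        this.parentName = parentName;
    }

    public List<HrefBean> getCategorys() {
        return categorys;
    }

    public void setCategorys(List<HrefBean> categorys) {
        this.categorys = categorys;
    }

}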
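
To make the nesting concrete, here is a minimal sketch in plain Java with no Gecco dependency. The records Group and Link are hypothetical stand-ins for Category and HrefBean, the sample URLs are made up, and the code assumes Java 16 or later for records. It only illustrates how a list of dl groups, each with a dt name and dd links, flattens into a list of subcategory URLs.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-ins for the nested beans; this is not Gecco code.
final class NestedBeanSketch {

    // Hypothetical stand-in for HrefBean: a link title plus its URL.
    record Link(String title, String url) {}

    // Hypothetical stand-in for Category: one <dl>, with the <dt> text as
    // the parent name and the <dd> links as the subcategories.
    record Group(String parentName, List<Link> links) {}

    // Flatten the two-level structure into a list of subcategory URLs.
    static List<String> collectUrls(List<Group> groups) {
        List<String> urls = new ArrayList<>();
        for (Group group : groups) {
            for (Link link : group.links()) {
                urls.add(link.url());
            }
        }
        return urls;
    }

    // Made-up sample data, shaped like the "mobile" field of AllSort.
    static List<Group> sampleMobile() {
        return List.of(
            new Group("Phones", List.of(
                new Link("Smartphones", "http://list.example.com/list.html?cat=1,2,3"),
                new Link("Feature phones", "http://list.example.com/list.html?cat=1,2,4"))),
            new Group("Accessories", List.of(
                new Link("Chargers", "http://list.example.com/list.html?cat=1,5,6"))));
    }

    public static void main(String[] args) {
        for (String url : collectUrls(sampleMobile())) {
            System.out.println(url);
        }
    }
}
```

In the real crawler Gecco fills the beans from the page through the cssPath annotations; the sketch only shows the shape of the data once that has happened.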

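A tip for obtaining an element's cssPath

The hard part of the two classes above is obtaining the cssPath, so here is a tip. Open the page you want to scrape in Chrome and press F12 to open the developer tools, then select the element you want.

With the element selected in the panel on the right side of the browser, right-click it and choose Copy > Copy selector to get the element's cssPath: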

body > div:nth-child(5) > div.main-classify > div.list > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2) > div.mc > div.items

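If you are familiar with jQuery selectors, you will see that the leading part of this path is not needed. Since we also only want the dl elements, it can be simplified to: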

.category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl
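
The simplification can also be written down as a small string operation. The sketch below is plain Java and purely illustrative: the class SelectorSketch and the method simplify are hypothetical names, and it assumes the copied selector uses only " > " child combinators with no ">" inside attribute values. It starts the path at the first step that carries a chosen class, reduces that step to the class alone, and appends the wanted child tag.

```java
import java.util.Arrays;
import java.util.List;

final class SelectorSketch {

    // Shorten a copied selector: begin at the first step that has anchorClass,
    // keep only ".anchorClass" for that step, then append the child tag.
    static String simplify(String copied, String anchorClass, String childTag) {
        List<String> steps = Arrays.asList(copied.trim().split("\\s*>\\s*"));
        int start = -1;
        for (int i = 0; i < steps.size(); i++) {
            // "div.category-items.clearfix" splits into [div, category-items, clearfix];
            // everything after the first element is a class name.
            List<String> parts = Arrays.asList(steps.get(i).split("\\."));
            if (parts.subList(1, parts.size()).contains(anchorClass)) {
                start = i;
                break;
            }
        }
        if (start < 0) {
            throw new IllegalArgumentException("class not found in selector: " + anchorClass);
        }
        StringBuilder out = new StringBuilder("." + anchorClass);
        for (String step : steps.subList(start + 1, steps.size())) {
            out.append(" > ").append(step);
        }
        return out.append(" > ").append(childTag).toString();
    }

    public static void main(String[] args) {
        String copied = "body > div:nth-child(5) > div.main-classify > div.list"
            + " > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2)"
            + " > div.mc > div.items";
        System.out.println(simplify(copied, "category-items", "dl"));
    }
}
```

Starting the path at a class that is stable on the page, rather than at body, makes the selector less sensitive to layout changes higher up in the document.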

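Writing the business-logic class for AllSort

Once AllSort has been populated, we need to process it. Here we do not persist the category information; we only extract the category links so that the product list pages can be scraped next. The code: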

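import java.util.List;

import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.scheduler.SchedulerContext;
import com.geccocrawler.gecco.spider.HrefBean;

@PipelineName("allSortPipeline")
public class AllSortPipeline implements Pipeline<AllSort> {

    @Override
    public void process(AllSort allSort) {
        // The request that fetched the category page; sub-requests are derived from it
        HttpRequest currRequest = allSort.getRequest();
        List<Category> categorys = allSort.getMobile();
        for (Category category : categorys) {
            List<HrefBean> hrefs = category.getCategorys();
            for (HrefBean href : hrefs) {
                // Restrict the list page to in-stock products sold by JD itself
                String url = href.getUrl() + "&delivery=1&page=1&JL=4_10_0&go=0";
                // Queue the product list page for crawling
                SchedulerContext.into(currRequest.subRequest(url));
            }
        }
    }

}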

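@PipelineName defines the pipeline's name, which is referenced in the @Gecco annotation on AllSort. After Gecco has fetched the page and populated the bean, it calls each pipeline listed in @Gecco in turn. The string "&delivery=1&page=1&JL=4_10_0&go=0" is appended to each subcategory link so that only in-stock products sold by JD itself are scraped. SchedulerContext.into() puts the link into the queue of URLs waiting to be fetched.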
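
The URL handling in the pipeline can be sketched in isolation. The example below is plain Java, not Gecco code: FilterUrlSketch, withFilter and enqueueAll are hypothetical names, the ArrayDeque is only a stand-in for the crawler's queue, and the sample links are made up. The pipeline above appends the parameters with "&", which assumes every category link already contains a query string; the sketch picks "?" or "&" as needed, which may or may not matter for the links JD actually serves.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class FilterUrlSketch {

    // Filter parameters quoted in the article.
    static final String FILTER = "delivery=1&page=1&JL=4_10_0&go=0";

    // Append the filter, using "&" if the link already has a query string
    // and "?" otherwise. Fragments ("#...") are not handled in this sketch.
    static String withFilter(String href) {
        String separator = href.contains("?") ? "&" : "?";
        return href + separator + FILTER;
    }

    // Stand-in for the crawl queue: a FIFO of URLs waiting to be fetched.
    static Deque<String> enqueueAll(List<String> hrefs) {
        Deque<String> queue = new ArrayDeque<>();
        for (String href : hrefs) {
            queue.addLast(withFilter(href));
        }
        return queue;
    }

    public static void main(String[] args) {
        Deque<String> queue = enqueueAll(List.of(
            "http://list.example.com/list.html?cat=1,2,3",
            "http://list.example.com/list.html"));
        // Drain the queue in insertion order, as a crawler worker would.
        while (!queue.isEmpty()) {
            System.out.println(queue.pollFirst());
        }
    }
}
```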
