Java crawler: scraping e-commerce data from JD, Tmall, Taobao, Alibaba, Suning, Gome, and Kaola

Date: 2024-04-12 07:34:49

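            I recently put together a search service over e-commerce data from JD, Tmall, Taobao, Alibaba, Suning, Gome, and Kaola. The stack is Java + XPath (and related crawling techniques) + Spring Boot. I built it for my own casual use and for a competition or two, although I fully expected that similar tools already exist online. Known shortcomings: there is no multithreading, and a number of minor details are left unhandled. I did what I could, and to be honest not even that.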

 

  1.    JD:


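                Problem:

                JD loads its search results dynamically: it first loads about thirty items, then loads another thirty. My solution is to add a few query parameters to the request. The URL is: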

    https://search.jd.com/Search?keyword="+question+"&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&stock=1&page="+n+"&s="+(1+(n-1)*30)+"&click=0&scrolling=y 

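    Here page is the page number n, and s is the index of the first item on that page, computed as 1+(n-1)*30. Clicking to the second page therefore starts at item 31. This way all the products found by the search can be loaded.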
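
    As a rough sketch of the paging rule above (this is not the project's code): the class name JdSearchUrl and the method build are made up for illustration, and URL-encoding the keyword is my own addition, since the original concatenates the keyword directly.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the JD search URL for a 1-based page number.
final class JdSearchUrl {
    // JD returns about thirty items per request, per the description above.
    static final int ITEMS_PER_REQUEST = 30;

    static String build(String keyword, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        // Index of the first item on this page: 1, 31, 61, ...
        int start = 1 + (page - 1) * ITEMS_PER_REQUEST;
        String encoded;
        try {
            // Assumption: the keyword is percent-encoded as UTF-8 (matches enc=utf-8).
            encoded = URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
        return "https://search.jd.com/Search?keyword=" + encoded
                + "&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&stock=1&page=" + page
                + "&s=" + start + "&click=0&scrolling=y";
    }
}
```

    Calling JdSearchUrl.build(question, n) in a loop over n = 1, 2, 3, ... then walks through the result pages one request at a time.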

    Only part of the code is shown:

     List<item> item = new ArrayList<>();
                    Document doc = Jsoup.parse(page.getHtml());
                    //System.out.println("doerall"+doc);
                   // String all = "//li[contains(@class,'gl-item')]";
    
                    //String titleXpath = "div/div[@class='p-price']/strong/i/text()";
                    // String timeXpath = "//*[@id='page-tools']/span/span[position() = 1]";
                    List<Element> elements = doc.getElementsByClass("gl-item");
                    for (Element element :
                            elements) {
                        item item1 = new item();
                        //System.out.println(element.html());
                        item1.setItemSellpoint(JsoupParserUtils.getXpathString(element, "//div/div[@class='p-name p-name-type-2']/a/i/text()"));
                        item1.setItemName(JsoupParserUtils.getXpathString(element, "//div/div[@class='p-name p-name-type-2']/a/em/text()"));
                        item1.setPrice(JsoupParserUtils.getXpathString(element, "//div/div[@class='p-price']/strong/i/text()"));
                        item1.setImages("https:"+JsoupParserUtils.getXpathString(element, "//div[@class='gl-i-wrap']/div[@class='p-img']/a/img/@source-data-lazy-img"));
                        item1.setShopName( element.getElementsByClass("p-shop").text());
                        item1.setShopUrl( "https:"+element.getElementsByClass("curr-shop").attr("href"));
                        if (item1.getShopName().equals("")){
                            item1.setShopName("京东自营");
                            item1.setShopUrl("https://www.jd.com");
                        }
                        item1.setEcName(dianshang);
                        Date date = new Date();
                        // MM is the month; lowercase mm would print the minutes in the date part
                        SimpleDateFormat sdft = new SimpleDateFormat("yyyy-MM-dd  HH:mm:ss");
                        item1.setUpdateTime(sdft.format(date));
    
                        if (JsoupParserUtils.getXpathString(element, "//div[@class='gl-i-wrap']/div[@class='p-img']/a/@href").length() > 50) {
                            item1.setItemUrl(JsoupParserUtils.getXpathString(element, "//div[@class='gl-i-wrap']/div[@class='p-img']/a/@href"));
                        } else {
                            item1.setItemUrl("https:" + JsoupParserUtils.getXpathString(element, "//div[@class='gl-i-wrap']/div[@class='p-img']/a/@href"));
                        }
                        item.add(item1);
                        System.out.println("\n\n\n\n");
                    }
                    System.out.println("jd success\n");
                    return item;

     

  3.    Alibaba


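                  Problem:

At first I loaded the desktop web page and parsed it with XPath, but their anti-crawling measures forced me to replace the cookies every other day. So I switched to the mobile web app, inspected the Network panel, and found a URL that returns JSON directly: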

https://m.p4psearch.1688.com/chord/scene.html?q=%E9%98%BF%E9%87%8C%E5%B7%B4%E5%B7%B4%E7%BD%91%E7%AB%99%E6%89%B9%E5%8F%91&cosite=baidujj&trackid=4014000012730004&format=normal&_version=&pagesize=20&beginpage="+n+"&sortType=&scene=WuxianOfferResult&location=landing_t3&v=1&ie=utf-8&prodid=163&pid=&fcatid=&p4pid=1554724111293181203364&keywords="+question

Here question is the search keyword and n is the page number.

 

     2   Tmall


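Tmall is quite formidable and its anti-crawling is well done. It inspects the request headers you send and also detects abnormal behavior: if you keep querying the same search term from the same IP, it hands you a free ticket to the login page, or shows a slider captcha to check whether you are human. A crawler cannot drag the slider by default. It could imitate browser behavior to do so, but overall the cost is too high.

So my solution is to set the request headers, reuse the cookies from a browser session that has already passed the check, and rotate IPs through an IP pool so that the crawler imitates human behavior.

Code: 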

package com.zz.search.crawl.page;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class RequestAndResponseTool {


    public static Page  sendRequstAndGetResponse(String url,String dianshang) {
        Page page = null;
        // 1. Create the HttpClient object and set its parameters
        HttpClient httpClient = new HttpClient();
        // Set the HTTP connection timeout to 5 s
        httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
        // 2. Create the GetMethod object and set its parameters
        GetMethod getMethod = new GetMethod(url);
        // Set the request headers. The long commented-out line below is a sample Tmall cookie string.
        //cna=xc8QFScq3mMCAXWIdnNklw/5; x=__ll%3D-1%26_ato%3D0; enc=22D2pCkbDgD4j4NI690F1syj2pzcmVODKNelTBhnJFSbQKa86y3R4gP2f957TU49KrG4i8Z8A0GZ8WP3yEz0%2BQ%3D%3D; _med=dw:1920&dh:1080&pw:1920&ph:1080&ist:0; otherx=e%3D1%26p%3D*%26s%3D0%26c%3D0%26f%3D0%26g%3D0%26t%3D0; tk_trace=1; hng=CN%7Czh-CN%7CCNY%7C156; t=2feeba33aa8d109a344fc80c085e942e; lid=%E5%85%8B%E6%8B%89%E5%A4%AB%E5%93%88%E8%8B%8F%E4%B8%9C%E5%9D%A1; _tb_token_=e687eb397a673; cookie2=1a050f7d908b3b750603a0c1a47df435; tt=tmall-main; pnm_cku822=098%23E1hv%2BvvUvbpvUvCkvvvvvjiPRLFwtjnCPssyljljPmPW6j1nP2Fw1jDvPsqy6j3WvphvCyCCvvvvvbyCvm3vpvvvvvCvphCvjvUvvhP7phvwv9vvBj1vpCQmvvChpyCvjvUvvhBmuphvmhCvC8evVczpkphvCyEmmvo4e9yCvh1CVfQvIqU3o5%2BO3w0AhjEmJDKXlLJ1nH6Sp42EHFiihFnhiaV1nV9w4B8n3feAOHCTmEcBKFyK2kyZD70wd5QXVAtlK24Abyy6cPs92QhvCvvvMMGtvpvhphvvv8wCvvBvpvpZ; res=scroll%3A1899*5994-client%3A1899*917-offset%3A1899*5994-screen%3A1920*1080; cq=ccp%3D1; isg=BOnpwpgOkbToia07swi_dp8F7JVJNmK-_aISFYveTlBJUgtk0weRvaMEELRBFXUg; l=bBO58m_qvAtE67oMBOCwqZZ49EbTALRb6uWbggHei_5CF19fmY_OlML0Le96VjCP9iTB4QAn21ytieD4rzkf.
        // Every site gets a desktop browser user-agent
        getMethod.setRequestHeader("user-agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36");
        if (dianshang.equals("tm")){
            // Tmall also needs the cookie copied from a browser session that has passed the check
            getMethod.setRequestHeader("cookie","");
            //getMethod.setRequestHeader("refer","https://list.tmall.com/search_product.htm?q=shouji+&type=p&vmarket=&spm=875.7931836%2FB.a2227oh.d100&from=mallfp..pc_1_searchbutton");
            // Alternative mobile user-agent:
            //Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Mobile Safari/537.36
        }
        // Set the encoding, e.g. Content-Encoding -> gzip
        //getMethod.setRequestHeader("Content-Encoding","GBKs");
        // Set the GET request (socket) timeout to 5 s
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Set the retry handler
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler());
        // 3. Execute the HTTP GET request
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // Check the status code
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
            }
            // 4. Handle the HTTP response body
            byte[] responseBody = getMethod.getResponseBody(); // read as a byte array

            if (dianshang.equals("al")){
                page = new Page(responseBody,url,null); // wrap into a Page
            }
            else {
                // Content-Type may be missing; avoid a NullPointerException in that case
                String contentType = getMethod.getResponseHeader("Content-Type") != null
                        ? getMethod.getResponseHeader("Content-Type").getValue()
                        : null;
                page = new Page(responseBody,url,contentType); // wrap into a Page
            }
        } catch (HttpException e) {
            // Fatal error: the protocol may be wrong or the returned content is malformed
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            // Network error
            e.printStackTrace();
        } finally {
            // Release the connection
            getMethod.releaseConnection();
        }
        return page;
    }
}

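For the cookies, open Tmall in your own browser, run a search, open the developer tools' Network panel, look at the request headers of that page, and copy the cookie value into the code.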
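
The IP pool mentioned earlier is not shown in this post. A minimal round-robin sketch could look like the following. The class name ProxyPool and the "host:port" input format are assumptions for illustration, and the addresses you feed it have to come from your own proxy source.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin pool of HTTP proxies.
final class ProxyPool {
    private final List<Proxy> proxies = new ArrayList<>();
    private final AtomicInteger cursor = new AtomicInteger();

    // Each entry is expected in "host:port" form, e.g. "10.0.0.1:8080".
    ProxyPool(List<String> hostPorts) {
        for (String hostPort : hostPorts) {
            int colon = hostPort.lastIndexOf(':');
            if (colon <= 0 || colon == hostPort.length() - 1) {
                throw new IllegalArgumentException("expected host:port, got " + hostPort);
            }
            String host = hostPort.substring(0, colon);
            int port = Integer.parseInt(hostPort.substring(colon + 1));
            // createUnresolved avoids a DNS lookup while building the pool.
            proxies.add(new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port)));
        }
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("proxy list must not be empty");
        }
    }

    // Returns the next proxy, wrapping around at the end of the list.
    Proxy next() {
        int index = Math.floorMod(cursor.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

With the commons-httpclient 3.x client used above, one way to apply a proxy before each request is httpClient.getHostConfiguration().setProxy(host, port), taking host and port from the InetSocketAddress returned by next().address().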

Data processing:

 Document doc = Jsoup.parse(page.getHtml());
                List<item> item = new ArrayList<>();
                // Select the product list
                List<Element> elements = doc.getElementsByClass("product-iWrap");
                // Tmall lazy-loads images: only the first five items carry the image URL in "src";
                // the later items have a different HTML structure and use "data-ks-lazyload"
                int i = 1;
                // MM is the month; lowercase mm would print the minutes in the date part
                SimpleDateFormat sdft = new SimpleDateFormat("yyyy-MM-dd  HH:mm:ss");
                for (Element element : elements) {
                    item item1 = new item();
                    String imgAttr = (i <= 5) ? "src" : "data-ks-lazyload";
                    i++;

                    // Image URL, e.g. https://img.alicdn.com/bao/uploaded/i8/TB1LPeMDRLoK1RjSZFuLG8n0XXa_043355.jpg
                    // (an Elements object never equals a String, so test for emptiness instead)
                    if (!element.getElementsByClass("productImg-wrap").isEmpty()) {
                        item1.setImages("https:" + element.getElementsByClass("productImg-wrap").get(0).getElementsByTag("img").attr(imgAttr));
                    }

                    // Shop URL
                    item1.setShopUrl("https:" + element.getElementsByClass("productShop-name").get(0).attr("href"));

                    // Shop name
                    item1.setShopName(element.getElementsByClass("productShop-name").get(0).text());

                    // Price
                    item1.setPrice(element.getElementsByClass("productPrice").get(0).text());

                    // Product URL (class is "productTitle" or "productTitle productTitle-spu")
                    item1.setItemUrl("https:" + element.getElementsByClass("productTitle").get(0).getElementsByTag("a").attr("href"));

                    // Product name / selling point
                    item1.setItemName(element.getElementsByClass("productTitle productTitle-spu").text());

                    // Monthly sales
                    item1.setBuyNum(element.getElementsByClass("productStatus").text());

                    // E-commerce site name
                    item1.setEcName("tm");

                    // Update time
                    item1.setUpdateTime(sdft.format(new Date()));

                    item.add(item1);
                }
                return item;

 I won't post the search URL here.

      4   Taobao


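Problem:

The product data sits inside a script tag and cannot be selected like ordinary HTML. I got around this by pulling the script text out and parsing it in a wrapper method. Likewise, if Taobao hands you a ticket to the login page, you can add cookies to avoid it.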

 Document doc = Jsoup.parse(page.getHtml());
                List<item> item = new ArrayList<>();
                // The product list is embedded as JSON in one of the page's script tags
                // (index 7 when this was written; this breaks if the page layout changes)
                Elements e = doc.getElementsByTag("script").eq(7);
                String sc = e.html();
                // Cut the script at the end of the JSON object and restore its closing braces
                String[] it = sc.split("}};");
                String it1 = it[0] + "}}";
                // Drop the 16-character assignment prefix (presumably "g_page_config = ") so the text starts at '{'
                it1 = it1.substring(16);
                // Text preparation done
                try {
                    JSONObject obj = new JSONObject(it1);
                    JSONArray jarry = obj.getJSONObject("mods")
                            .getJSONObject("itemlist")
                            .getJSONObject("data")
                            .getJSONArray("auctions");
                    // JSON parsing done
                    // MM is the month; lowercase mm would print the minutes in the date part
                    SimpleDateFormat sdft = new SimpleDateFormat("yyyy-MM-dd  HH:mm:ss");
                    for (int i = 0; i < jarry.length(); i++) {
                        JSONObject auction = jarry.getJSONObject(i);
                        item item1 = new item();
                        // Product name / selling point
                        item1.setItemName(auction.getString("raw_title"));

                        // Product image URL
                        item1.setImages("https:" + auction.getString("pic_url"));

                        // Product URL
                        item1.setItemUrl("https:" + auction.getString("detail_url"));

                        // Product price
                        item1.setPrice(auction.getString("view_price"));

                        // Number of purchases
                        item1.setBuyNum(auction.getString("view_sales"));

                        // Shop name
                        item1.setShopName(auction.getString("nick"));

                        // Shop URL
                        item1.setShopUrl("https:" + auction.getString("shopLink"));

                        // Update time
                        item1.setUpdateTime(sdft.format(new Date()));

                        // Shipping location. The original passed shopLink here, which is the shop URL;
                        // item_loc is assumed to be the field that carries the location.
                        item1.setLocal(auction.optString("item_loc"));

                        // E-commerce site name
                        item1.setEcName("tb");

                        item.add(item1);
                    }

                } catch (JSONException e1) {
                    e1.printStackTrace();
                }

                return item;

In short, the data is obtained by parsing the JSON embedded in the script tag.
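
The fixed script index and the fixed 16-character offset are fragile. As a sketch of a sturdier alternative (not what the project does), one can search for the assignment marker and then scan for the matching closing brace. The class name PageConfigExtractor is made up, and the marker string "g_page_config = " is an assumption about what the page contains.

```java
// Hypothetical helper: pulls the JSON object that follows a marker out of raw HTML/script text.
final class PageConfigExtractor {

    // Returns the balanced {...} object after the marker, or null if it cannot be found.
    static String extract(String html, String marker) {
        int markerAt = html.indexOf(marker);
        if (markerAt < 0) {
            return null;
        }
        int start = html.indexOf('{', markerAt + marker.length());
        if (start < 0) {
            return null;
        }
        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inString) {
                // Braces inside string literals must not affect the depth count.
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return html.substring(start, i + 1);
                }
            }
        }
        return null; // unbalanced braces
    }
}
```

The returned string can be passed straight to new JSONObject(...), and a null result is a hint that the page was a login redirect rather than a result page.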

     5     Suning:


Problem: the price was missing. The normal search URL does not return prices. After a long search I found a URL that returns JSON:

 

https://search.suning.com/emall/mobile/wap/clientSearch.jsonp?keyword="+question+"&cp="+(n-1)+"&ps=30&set=5&ct=-1&channelId=WAP&sp=&sg=&sc=&prune=&operate=0&isAnalysised=0&istongma=1&v=99999999&sesab=ABB&&jzq=1535&callback=success_jsonpCallback

This is the URL, where question is the search term and n is the page number (cp is zero-based, hence n-1).

  List<item> item = new ArrayList<>();
                Document doc = Jsoup.parse(page.getHtml());
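                String json = doc.body().text();
                // Drop the 22-character JSONP prefix "success_jsonpCallback("
                json = json.substring(22);
                // Drop the closing ");" of the JSONP wrapper
                json = json.replace(");", "");
                //System.out.println(json);
                // JSON text preparation done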

                JSONObject obj = new JSONObject(json);
                JSONArray jarry = obj.getJSONArray("goods");
                //System.out.println(jarry);

                for (int i = 0; i < jarry.length(); i++) {
                    item item1 = new item();
                    //System.out.println(i);
                    //title
                    //System.out.println(jarry.getJSONObject(i).getString("catentdesc"));
                    item1.setItemName(jarry.getJSONObject(i).getString("catentdesc"));

                    //sellPoint
                    //System.out.println(jarry.getJSONObject(i).getString("auxdescription"));
                    item1.setItemSellpoint(jarry.getJSONObject(i).getString("auxdescription"));

                    //comment
                    //System.out.println(jarry.getJSONObject(i).getJSONObject("extenalFileds").getString("commentShow"));
                    item1.setBuyNum("评价数:"+jarry.getJSONObject(i).getJSONObject("extenalFileds").getString("commentShow")+" 好评率"+jarry.getJSONObject(i).getString("praiseRate"));

                    //price!!!!!!!!!!!!!!!!!!!!!!!!
                    //System.out.println(jarry.getJSONObject(i).getString("price"));
                    item1.setPrice("¥"+jarry.getJSONObject(i).getString("price"));

                    //picUrl
                    if (!jarry.getJSONObject(i).isNull("dynamicImg")) {
                        //System.out.println("http:" + jarry.getJSONObject(i).getString("dynamicImg"));
                        item1.setImages("https:" + jarry.getJSONObject(i).getString("dynamicImg"));
                    }
                    else {
                        item1.setImages("/images/ZO.png");
                    }

                    //shopName
                    //System.out.println(jarry.getJSONObject(i).getString("salesName"));
                    if (jarry.getJSONObject(i).getString("salesName").equals("苏宁自营")) {
                        item1.setShopUrl("");
                    } else {
                        item1.setShopUrl(jarry.getJSONObject(i).getJSONObject("extenalFileds").getString("specificUrl"));
                    }
                    item1.setShopName(jarry.getJSONObject(i).getString("salesName"));

                    //itemUrl
                    //System.out.println("https://product.suning.com/"+jarry.getJSONObject(i).getString("salesCode")+"/"+jarry.getJSONObject(i).getString("catentryId")+".html");
                    item1.setItemUrl("https://product.suning.com/" + jarry.getJSONObject(i).getString("salesCode") + "/" + jarry.getJSONObject(i).getString("catentryId") + ".html");


                    //ec
                    item1.setEcName("sn");

                    //time
                    Date date = new Date();
                    // MM is the month; lowercase mm would print the minutes in the date part
                    SimpleDateFormat sdft = new SimpleDateFormat("yyyy-MM-dd  HH:mm:ss");
                    item1.setUpdateTime(sdft.format(date));
                    item.add(item1);
                }
                return item;

This one is plain JSON processing.
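
One caveat with the substring(22) and replace(");", "") approach: it depends on the exact callback name and would also delete any ");" that happens to occur inside a product description. A small sketch of a more forgiving unwrap step follows. The class name Jsonp is made up for illustration.

```java
// Hypothetical helper: strips a JSONP wrapper such as callbackName({...}); down to the JSON payload.
final class Jsonp {

    static String unwrap(String body) {
        int open = body.indexOf('(');        // first '(' ends the callback name
        int close = body.lastIndexOf(')');   // last ')' closes the wrapper
        if (open < 0 || close <= open) {
            throw new IllegalArgumentException("not a JSONP response");
        }
        return body.substring(open + 1, close).trim();
    }
}
```

The result of Jsonp.unwrap(...) can then be handed to new JSONObject(...) in place of the two string operations.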

     6     Gome:


Problem:

Gome has no next-page button, so you have to analyze the requests made by the mobile web app to obtain the URL:

https://m.gome.com.cn/category.html?from=1&scat=2&key_word="+question+"&page="+n+"&plsj_flag=N&sort=10

The data processing code is as follows: 

  List<item> item = new ArrayList<>();
                Document doc = Jsoup.parse(page.getHtml());
                // System.out.println();
                List<Element> elements = doc.getElementsByClass("gd_list");
                int i = 0;
                List<item> items = new ArrayList<>();
                for (Element e :
                        elements) {
                    item item1 = new item();
                    //System.out.println(e+"\n\n\n\n");
                    //pic
                    //System.out.println("https:"+e.getElementsByTag("img").attr("src"));
                    item1.setImages("https:" + e.getElementsByTag("img").attr("src"));

                    //itemUrl
                    //System.out.println("https:"+e.getElementsByClass("a-mask").attr("href").split("\\?")[0].replace("product-","").replace(".m",""));
                    item1.setItemUrl("https:" + e.getElementsByClass("a-mask").attr("href").split("\\?")[0].replace("product-", "").replace(".m", ""));

                    //price
                    //System.out.println(e.getElementsByClass("price_warp").text());
                    item1.setPrice(e.getElementsByClass("price_warp").text());

                    //title
                    System.out.println(e.getElementsByClass("title ellipsis-one").text());
                    item1.setItemName(e.getElementsByClass("title ellipsis-one").text());
                    if (item1.getItemName().equals("")){
                        item1.setItemName(e.getElementsByClass("title ellipsis_two").text());
                    }

                    //comment
                    //System.out.println(e.getElementsByClass("cmt").text());
                    item1.setBuyNum(e.getElementsByClass("cmt").text());

                    //sellPoint
                    System.out.println(e.getElementsByClass("sell-point").text());
                    item1.setItemSellpoint(e.getElementsByClass("sell-point").text());

                    //ec
                    item1.setEcName("gm");

                    //time
                    Date date = new Date();
                    // MM is the month; lowercase mm would print the minutes in the date part
                    SimpleDateFormat sdft = new SimpleDateFormat("yyyy-MM-dd  HH:mm:ss");
                    item1.setUpdateTime(sdft.format(date));
                    item.add(item1);


                    // System.out.println(e);
                    //System.out.println(i++);
                    //System.out.println(e.html());
                    // System.out.println("\n\n\n\n\n");
                }
                return item;

 

7     Kaola:

URL:

https://search.kaola.com/search.html?key="+question+"&pageNo="+n+"&type=&pageSize=20&isStock=false&isSelfProduct=false

No particular problems with this one.

Processing code:

 List<item> item = new ArrayList<>();
                Document doc = Jsoup.parse(page.getHtml());
                // System.out.println(doc);
                List<Element> elements = doc.getElementsByClass("goods colorsku");
                int i = 0;
                for (Element e :
                        elements) {
                    item item1 = new item();
                    //System.out.println(e.html());
                    //pic
                    System.out.println("http:"+e.getElementsByTag("img").attr("data-src"));
                    item1.setImages("https:" + e.getElementsByTag("img").attr("data-src"));

                    //itemUrl
                    //System.out.println("https:"+e.getElementsByClass("title").attr("href"));
                    item1.setItemUrl("https:" + e.getElementsByClass("title").attr("href"));

                    //price
                    //System.out.println(e.getElementsByClass("marketprice").text());
                    item1.setPrice(e.getElementsByClass("marketprice").text());

                    //title
                    //System.out.println(e.getElementsByTag("img").attr("alt"));
                    item1.setItemName(e.getElementsByTag("img").attr("alt"));

                    //comment
                    //System.out.println(e.getElementsByClass("comments").text());
                    item1.setBuyNum(e.getElementsByClass("comments").text());

                    //sellPoint
                    //System.out.println(e.getElementsByClass("sell-point").text());
                    //item1.setItemSellpoint(e.getElementsByClass("sell-point").text());

                    //location
                    //System.out.println(e.getElementsByClass("proPlace ellipsis").text());
                    item1.setLocal(e.getElementsByClass("proPlace ellipsis").text());

                    //shop
                    //System.out.println(e.getElementsByClass("selfflag").text());
                    if (e.getElementsByClass("selfflag").text().equals("网易考拉自营")) {
                        item1.setShopName(e.getElementsByClass("selfflag").text());
                        item1.setShopUrl("");
                    } else {
                        item1.setShopName(e.getElementsByClass("selfflag").text());
                        // The shop URL is the href of the link inside "selfflag", not the element's text
                        item1.setShopUrl("https:" + e.getElementsByClass("selfflag").select("a").attr("href"));
                    }
                    //item1.setShopName(e.getElementsByClass("comments").text());

                    //ec
                    item1.setEcName("kl");

                    //time
                    Date date = new Date();
                    // MM is the month; lowercase mm would print the minutes in the date part
                    SimpleDateFormat sdft = new SimpleDateFormat("yyyy-MM-dd  HH:mm:ss");
                    item1.setUpdateTime(sdft.format(date));
                    item.add(item1);

                    //System.out.println(i++);
                    //System.out.println("\n\n\n\n\n\n\n\n\n");
                    //item.add(item1);
                }
                return item;