Java Multithreaded Crawler

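This blog records my process of learning to write crawlers. It may contain mistakes; if you spot one, please @ me so we can learn from each other!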

We use one jar here: Jsoup. Download: http://pan.baidu.com/s/1i5LZv0p

As an example, we will crawl company information for the Henan region from Qichacha.

Entry URL: http://www.qichacha.com/g_HEN

First create a project; I will call mine qichacha_spider.

Start with a class that holds the company information: QiChaChaInfo

package com.qichacha;

import java.util.ArrayList;
import java.util.List;

public class QiChaChaInfo {
    public List<String> company_name = new ArrayList<>();
    public List<String> company_cun = new ArrayList<>();
    public List<String> company_user = new ArrayList<>();
    public List<String> company_money = new ArrayList<>();
    public List<String> company_work = new ArrayList<>();
    public List<String> company_time = new ArrayList<>();
    public List<String> company_address = new ArrayList<>();

    @Override
    public String toString() {
        return company_name.toString() + " " + company_user.toString();
    }
}

This class only holds the company information, which makes it easy to use later on.
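To make the layout of this class concrete, here is a minimal sketch of how the parallel lists are meant to be read: index i in each list describes the same company. The InfoDemo class, the describe and demo helpers, and the sample names are made up for illustration, and only two of the seven lists are copied so the sketch compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

public class InfoDemo {
    // Compact copy of QiChaChaInfo (two of the seven lists) so this sketch is self-contained
    static class QiChaChaInfo {
        public List<String> company_name = new ArrayList<>();
        public List<String> company_user = new ArrayList<>();

        @Override
        public String toString() {
            return company_name.toString() + " " + company_user.toString();
        }
    }

    /** Reads the same index from each list to describe one company. */
    public static String describe(QiChaChaInfo info, int i) {
        return info.company_name.get(i) + " / " + info.company_user.get(i);
    }

    /** Fills the lists with made-up sample data and describes the second company. */
    public static String demo() {
        QiChaChaInfo info = new QiChaChaInfo();
        info.company_name.add("Example Co. A");
        info.company_user.add("Zhang San");
        info.company_name.add("Example Co. B");
        info.company_user.add("Li Si");
        return describe(info, 1);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

This layout only works if every list ends up with the same length, so a page where one field is missing for some company would shift the later entries out of alignment.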

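Next, create the main crawler class: QiCha_Spider

First add the jsoup jar to the project.

Fetch the page source of the entry URL, then extract the page numbers and store each page's URL in a list.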

// HTML of the entry page
Document document = Jsoup.connect(root_url).cookies(map).timeout(10000).get();
// Page count
Elements page_node = document.select("[class=end]");
// Holds the URL of every page
List<String> url_list = new ArrayList<>();
// URL of each page
Elements urls_nodes = document.select("[class=num]");
for (Element element : urls_nodes) {
    url_list.add("http://www.qichacha.com" + element.attr("href"));
}
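The snippet above uses root_url and map without defining them; root_url is presumably the entry URL, and map is the cookie map passed to cookies(). One possible way to build that map, assuming you copy the raw Cookie header from a browser session, is sketched below. The CookieParser class and the cookie names and values are placeholders, not real Qichacha cookies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CookieParser {
    /** Splits a raw Cookie header value such as "a=1; b=2" into name/value pairs. */
    public static Map<String, String> parse(String header) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String pair : header.split(";")) {
            String trimmed = pair.trim();
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty fragments and fragments without a name
            }
            cookies.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
        }
        return cookies;
    }

    public static void main(String[] args) {
        // Placeholder cookie string; replace it with the one from your own session
        Map<String, String> map = parse("sessionid=abc123; token=xyz");
        System.out.println(map);
    }
}
```

The resulting map can then be handed to Jsoup.connect(root_url).cookies(map) as in the snippet above.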
Create a thread pool and start one thread for each page URL to crawl it.
// Give every page its own thread
for (String url : url_list) {
    cachedThreadPool.execute(new Runnable() {
        public void run() {
            try {
                Document document = Jsoup.connect(url).cookies(map).timeout(10000).get();

                // Company name
                Elements elements_name = document.select("[class=name]");
                List<String> list_name = new ArrayList<>();
                // Company status (existing or in operation)
                Elements elements_cun = document.select("[class=label label-success m-l-xs]");
                List<String> list_cun = new ArrayList<>();
                // Founder
                Elements elements_user = document.select("[class=i i-user3]");
                List<String> list_user = new ArrayList<>();
                // Founding date
                Elements elements_time = document.select("[class=i i-clock m-l]");
                List<String> list_time = new ArrayList<>();
                // Founding capital
                Elements elements_money = document.select("[class=i  i-bulb m-l]");
                List<String> list_money = new ArrayList<>();
                // Line of business
                Elements elements_work = document.select("[class=i  i-tag2  m-l]");
                List<String> list_work = new ArrayList<>();
                // Company address
                Elements elements_address = document.select("[class=i i-local]");
                List<String> list_address = new ArrayList<>();

                List<QiChaChaInfo> list = new ArrayList<>();

                // The name and status come from the element text; the other fields
                // are the text node that follows each icon element
                QiChaChaInfo info2 = new QiChaChaInfo();
                for (Element element : elements_name) {
                    info2.company_name.add(element.text());
                }
                for (Element element : elements_cun) {
                    info2.company_cun.add(element.text());
                }
                for (Element element : elements_user) {
                    info2.company_user.add(element.nextSibling().toString());
                }
                for (Element element : elements_money) {
                    info2.company_money.add(element.nextSibling().toString());
                }
                for (Element element : elements_time) {
                    info2.company_time.add(element.nextSibling().toString());
                }
                for (Element element : elements_work) {
                    info2.company_work.add(element.nextSibling().toString());
                }
                for (Element element : elements_address) {
                    info2.company_address.add(element.nextSibling().toString());
                }
                list.add(info2);




Store the crawled information in the database
                // Connect to the database
                Class.forName("com.mysql.jdbc.Driver");
                Connection con = DriverManager.getConnection("jdbc:mysql://localhost:3306/spider?useUnicode=true&characterEncoding=utf-8", "root", "123456");
                // A PreparedStatement keeps quotes in the scraped text from breaking the SQL
                PreparedStatement statement = con.prepareStatement("insert into qichacha values(?,?,?,?,?,?,?)");
                for (QiChaChaInfo qiChaChaInfo : list) {
                    for (int i = 0; i < qiChaChaInfo.company_name.size(); i++) {
                        statement.setString(1, qiChaChaInfo.company_name.get(i));
                        statement.setString(2, qiChaChaInfo.company_cun.get(i));
                        statement.setString(3, qiChaChaInfo.company_user.get(i));
                        statement.setString(4, qiChaChaInfo.company_money.get(i));
                        statement.setString(5, qiChaChaInfo.company_work.get(i));
                        statement.setString(6, qiChaChaInfo.company_time.get(i));
                        statement.setString(7, qiChaChaInfo.company_address.get(i));
                        statement.executeUpdate();
                    }
                }
                statement.close();
                con.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
}
And with that, a multithreaded crawler is written.
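The per-page loop calls cachedThreadPool.execute(...) but the post never shows where that pool comes from. A minimal sketch of creating and shutting down such a pool, assuming it is a standard java.util.concurrent cached thread pool, might look like this. The PoolSketch class, the crawlAll method, and the printed placeholder task are illustrative only; the real task would fetch, parse, and store one page.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolSketch {
    /** Runs one task per URL on a cached thread pool and returns how many tasks finished. */
    public static int crawlAll(List<String> urlList) {
        ExecutorService cachedThreadPool = Executors.newCachedThreadPool();
        AtomicInteger finished = new AtomicInteger();
        for (String url : urlList) {
            cachedThreadPool.execute(() -> {
                // Placeholder for the real work: fetch the page, parse it, store the rows
                System.out.println(Thread.currentThread().getName() + " -> " + url);
                finished.incrementAndGet();
            });
        }
        cachedThreadPool.shutdown(); // accept no new tasks, let submitted ones finish
        try {
            cachedThreadPool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return finished.get();
    }

    public static void main(String[] args) {
        // Placeholder URLs, used only to exercise the pool
        System.out.println(crawlAll(Arrays.asList("page-1", "page-2", "page-3")));
    }
}
```

A cached pool creates threads on demand without an upper bound, so for a large number of pages Executors.newFixedThreadPool(n) may be a gentler choice for both the site and the database.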

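Note: I only collect the page URLs that appear on a single page. To crawl everything, you need a for loop that increments the page number automatically.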
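The loop mentioned in the note could be sketched as follows. The URL pattern here is purely hypothetical; check the real href values of the pagination links on the site before relying on it. PageUrls and build are names invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class PageUrls {
    /** Builds one URL per page by substituting the page number into the pattern. */
    public static List<String> build(String pattern, int firstPage, int lastPage) {
        List<String> urls = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            urls.add(String.format(pattern, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical pattern, NOT confirmed against the real site
        List<String> urlList = build("http://www.qichacha.com/g_HEN_%d", 1, 3);
        for (String url : urlList) {
            System.out.println(url);
        }
    }
}
```

The last page number would have to come from the site itself, for example from the element selected with [class=end] in the entry page.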
