I've recently been scraping some data. I originally used jsoup, but found it inefficient for crawling and not very convenient to work with, so after surveying a few crawler frameworks I settled on SeimiCrawler.
Development environment: IntelliJ IDEA + MyBatis + SeimiCrawler
I won't walk through the environment setup in detail; anyone who has done Java development will recognize it, so here are the configuration files directly. Note: the names of SeimiCrawler-related configuration files must start with seimi.
Global configuration, seimi.xml:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd">
    <!-- Load all .properties files on the classpath so that ${...} placeholders
         (e.g. the jdbc.* values used below) can be resolved -->
    <bean id="propertyConfigurer" class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
        <property name="locations">
            <list>
                <value>classpath:**/*.properties</value>
            </list>
        </property>
    </bean>
</beans>
MyBatis global configuration, mybatis-config.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE configuration PUBLIC "-//mybatis.org//DTD Config 3.0//EN" "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <!-- Global settings -->
    <settings>
        <setting name="mapUnderscoreToCamelCase" value="true"/>
    </settings>
</configuration>
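With mapUnderscoreToCamelCase enabled, MyBatis maps snake_case column names onto camelCase Java properties without any explicit resultMap. A hypothetical illustration (the class and column names are made up, not part of this project):

// Hypothetical: a row from "SELECT proxy_ip, created_time FROM some_table"
// binds onto these properties automatically when mapUnderscoreToCamelCase=true.
public class ExampleRow {
    private String proxyIp;     // column proxy_ip
    private String createdTime; // column created_time

    public String getProxyIp() { return proxyIp; }
    public void setProxyIp(String proxyIp) { this.proxyIp = proxyIp; }
    public String getCreatedTime() { return createdTime; }
    public void setCreatedTime(String createdTime) { this.createdTime = createdTime; }
}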
SeimiCrawler data-source configuration, seimi-mybatis.xml:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd
                           http://www.springframework.org/schema/context
                           http://www.springframework.org/schema/context/spring-context.xsd">
    <context:annotation-config/>
    <!-- DBCP2 connection pool; the jdbc.* placeholders come from seimi.properties -->
    <bean id="mybatisDataSource" class="org.apache.commons.dbcp2.BasicDataSource">
        <property name="driverClassName" value="${jdbc.driver}"/>
        <property name="url" value="${jdbc.url}"/>
        <property name="username" value="${jdbc.username}"/>
        <property name="password" value="${jdbc.password}"/>
    </bean>
    <bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean" abstract="true">
        <property name="configLocation" value="classpath:mybatis-config.xml"/>
    </bean>
    <bean id="seimiSqlSessionFactory" parent="sqlSessionFactory">
        <property name="dataSource" ref="mybatisDataSource"/>
    </bean>
    <!-- Scan the dao package and register every mapper interface as a bean -->
    <bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
        <property name="basePackage" value="com.morse.seimicrawler.dao"/>
        <property name="sqlSessionFactoryBeanName" value="seimiSqlSessionFactory"/>
    </bean>
</beans>
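The MapperScannerConfigurer above registers every interface under com.morse.seimicrawler.dao as a mapper bean. The ProxyIpStoreDao injected into the crawler later isn't shown in this post; a minimal sketch of what it could look like, assuming annotation-based mappers and a proxy_ip table (the table and column names are my assumptions):

package com.morse.seimicrawler.dao;

import com.morse.seimicrawler.entity.ProxyIp; // assumed entity package
import org.apache.ibatis.annotations.Insert;

public interface ProxyIpStoreDao {
    // Assumed schema: proxy_ip(ip, port, speed, addr, time)
    @Insert("INSERT INTO proxy_ip (ip, port, speed, addr, time) "
            + "VALUES (#{ip}, #{port}, #{speed}, #{addr}, #{time})")
    int insert(ProxyIp proxyIp);
}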
Database connection configuration, seimi.properties:
jdbc.driver=com.mysql.jdbc.Driver
jdbc.url=jdbc:mysql://localhost:3306/xiaohuo?useUnicode=true&characterEncoding=utf8&useSSL=false
jdbc.username=root
jdbc.password=123456
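Before wiring everything through Spring, it can be worth sanity-checking these settings with a throwaway connection test (a sketch; it assumes MySQL Connector/J is on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;

public class JdbcSmokeTest {
    public static void main(String[] args) throws Exception {
        // Same settings as seimi.properties above
        String url = "jdbc:mysql://localhost:3306/xiaohuo?useUnicode=true&characterEncoding=utf8&useSSL=false";
        try (Connection conn = DriverManager.getConnection(url, "root", "123456")) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}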
Logging configuration, log4j.properties:
log4j.rootLogger=info, console, log, error
###Console ###
log4j.appender.console = org.apache.log4j.ConsoleAppender
log4j.appender.console.Target = System.out
log4j.appender.console.layout = org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern = %d %p[%C:%L]- %m%n
### log ###
log4j.appender.log = org.apache.log4j.DailyRollingFileAppender
log4j.appender.log.File = logs/debug.log
log4j.appender.log.Append = true
log4j.appender.log.Threshold = DEBUG
log4j.appender.log.DatePattern='.'yyyy-MM-dd
log4j.appender.log.layout = org.apache.log4j.PatternLayout
log4j.appender.log.layout.ConversionPattern = %d %p[%c:%L] - %m%n
### Error ###
log4j.appender.error = org.apache.log4j.DailyRollingFileAppender
log4j.appender.error.File = logs/error.log
log4j.appender.error.Append = true
log4j.appender.error.Threshold = ERROR
log4j.appender.error.DatePattern='.'yyyy-MM-dd
log4j.appender.error.layout = org.apache.log4j.PatternLayout
log4j.appender.error.layout.ConversionPattern =%d %p[%c:%L] - %m%n
### SQL logging ###
### MyBatis 3 logs SQL under the mapper namespace: ###
log4j.logger.com.morse.seimicrawler.dao=DEBUG
### legacy iBATIS / JDBC logger names: ###
log4j.logger.com.ibatis=DEBUG
log4j.logger.com.ibatis.common.jdbc.SimpleDataSource=DEBUG
log4j.logger.com.ibatis.common.jdbc.ScriptRunner=DEBUG
log4j.logger.com.ibatis.sqlmap.engine.impl.SqlMapClientDelegate=DEBUG
log4j.logger.java.sql.Connection=DEBUG
log4j.logger.java.sql.Statement=DEBUG
log4j.logger.java.sql.PreparedStatement=DEBUG
That wraps up the basic configuration; what remains is implementing the crawling logic itself.
SeimiCrawler integrates with Spring and, combined with XPath, makes parsing HTML straightforward. Every concrete crawler class must live in a package whose name ends in .crawlers; SeimiCrawler scans only those packages, so a crawler placed anywhere else won't be found and won't start. Each crawler must extend BaseSeimiCrawler and override startUrls() and start(Response response), plus implement whatever callback methods it registers.
Below, taking proxy-IP scraping as the example, is the implementation, with a thin wrapper layered over the framework:
The abstract base crawler, BaseCrawler:
// Imports assume the SeimiCrawler 1.x package layout; adjust to your version.
import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.seimi.struct.Request;
import cn.wanghaomiao.seimi.struct.Response;
import cn.wanghaomiao.xpath.model.JXDocument;

import java.util.Map;

public abstract class BaseCrawler extends BaseSeimiCrawler {
    /**
     * URL prefix of the pages to collect.
     */
    protected abstract String getUrlPrefix();

    /**
     * URL suffix of the pages to collect.
     */
    protected abstract String getUrlSuffix();

    /**
     * Extract the maximum page number from the entry page.
     */
    protected abstract int getMaxPage(JXDocument document);

    /**
     * Parse one page of results.
     */
    public abstract void operation(Response response);

    /**
     * Extra request headers; subclasses override this to supply their own.
     */
    protected Map<String, String> setHeader() {
        return null;
    }

    @Override
    public void start(Response response) {
        try {
            JXDocument document = response.document();
            int max = getMaxPage(document);
            // Queue one request per page; "operation" names the callback method
            for (int i = 1; i <= max; i++) {
                logger.info("Queuing page {}", i);
                push(Request.build(getUrlPrefix() + i + getUrlSuffix(), "operation").setHeader(setHeader()));
            }
        } catch (Exception e) {
            logger.error("Failed to schedule page requests", e);
        }
    }
}
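By default setHeader() returns null, i.e. no extra headers are sent. A subclass that needs, say, a browser-like User-Agent can override it. A minimal sketch (it goes inside a BaseCrawler subclass, needs java.util.HashMap, and the header value is arbitrary):

@Override
protected Map<String, String> setHeader() {
    Map<String, String> headers = new HashMap<>();
    // Present as a regular browser; some sites reject default client UAs
    headers.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
    return headers;
}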
The concrete crawler implementation, SeCrawler:
// Imports again assume the SeimiCrawler 1.x package layout.
import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.struct.Response;
import cn.wanghaomiao.xpath.model.JXDocument;
import com.morse.seimicrawler.dao.ProxyIpStoreDao;
import com.morse.seimicrawler.entity.ProxyIp; // assumed entity package
import org.springframework.beans.factory.annotation.Autowired;

import java.util.List;

@Crawler(name = "seCrawler")
public class SeCrawler extends BaseCrawler {
    @Autowired
    private ProxyIpStoreDao dao;

    @Override
    public String[] startUrls() {
        return new String[]{"https://ip.seofangfa.com/"};
    }

    @Override
    protected String getUrlPrefix() {
        return "https://ip.seofangfa.com/proxy/";
    }

    @Override
    protected String getUrlSuffix() {
        return ".html";
    }

    @Override
    protected int getMaxPage(JXDocument document) {
        try {
            // The last pagination link holds the highest page number
            List<Object> pages = document.sel("//div[@class='page_nav']/ul/li/a/text()");
            return Integer.parseInt((String) pages.get(pages.size() - 1));
        } catch (Exception e) {
            logger.error("Failed to parse the max page number", e);
        }
        return 0;
    }

    @Override
    public void operation(Response response) {
        try {
            JXDocument document = response.document();
            // Each column of the proxy table, selected positionally
            List<Object> ips = document.sel("//table[@class='table']/tbody/tr/td[1]/text()");
            List<Object> ports = document.sel("//table[@class='table']/tbody/tr/td[2]/text()");
            List<Object> speeds = document.sel("//table[@class='table']/tbody/tr/td[3]/text()");
            List<Object> addrs = document.sel("//table[@class='table']/tbody/tr/td[4]/text()");
            List<Object> times = document.sel("//table[@class='table']/tbody/tr/td[5]/text()");
            for (int i = 0; i < ips.size(); i++) {
                ProxyIp proxyIp = new ProxyIp();
                proxyIp.setIp((String) ips.get(i));
                proxyIp.setPort((String) ports.get(i));
                proxyIp.setSpeed((String) speeds.get(i));
                proxyIp.setAddr((String) addrs.get(i));
                proxyIp.setTime((String) times.get(i));
                dao.insert(proxyIp);
                logger.info("Inserted proxy IP: {}", proxyIp);
            }
        } catch (Exception e) {
            logger.error("Failed to parse the proxy list page", e);
        }
    }
}
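The ProxyIp entity itself isn't shown here; a plain POJO along these lines works, with field names inferred from the setters above (the package name is an assumption):

package com.morse.seimicrawler.entity; // assumed package

public class ProxyIp {
    private String ip;
    private String port;
    private String speed;
    private String addr;
    private String time;

    public String getIp() { return ip; }
    public void setIp(String ip) { this.ip = ip; }
    public String getPort() { return port; }
    public void setPort(String port) { this.port = port; }
    public String getSpeed() { return speed; }
    public void setSpeed(String speed) { this.speed = speed; }
    public String getAddr() { return addr; }
    public void setAddr(String addr) { this.addr = addr; }
    public String getTime() { return time; }
    public void setTime(String time) { this.time = time; }

    @Override
    public String toString() {
        return ip + ":" + port + " (" + speed + ", " + addr + ", " + time + ")";
    }
}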
Launch the crawler:
// Seimi scans the *.crawlers packages and starts the crawler registered as "seCrawler"
public static void main(String... args) {
    Seimi seimi = new Seimi();
    seimi.goRun("seCrawler");
}
That's all it takes to write a crawler with SeimiCrawler. Have you picked it up?