This is my first time working with Nutch, so I am writing the steps down.
First, the database:
CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_unicode_ci;
Then the table:
CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext,
`status` int(11) default NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) default NULL,
`score` float default NULL,
`typ` varchar(32) default NULL,
`baseUrl` varchar(767) default NULL,
`content` longblob,
`title` varchar(2048) default NULL,
`reprUrl` varchar(767) default NULL,
`fetchInterval` int(11) default NULL,
`prevFetchTime` bigint(20) default NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) default NULL,
`retriesSinceFetch` int(11) default NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPRESSED;
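Install the SVN, Ivy, and Ant plugins in Eclipse.
These are the plugins the Nutch project team uses; install them yourself.
The remote SVN repository URL for Nutch 2.1: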
https://svn.apache.org/repos/asf/nutch/tags/release-2.1
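Check out the project.
Accept the defaults, click Finish, and create it as a Java project.
Wait for the download to finish.
Once the download has finished (note: the project called nutch2 here has been renamed to nutch-2.1 in what follows):
In Project Explorer, right-click the project, choose Properties, and open Java Build Path.
Click Add Folder and select the folders listed below, and also add the src/java and src/test folders of each plugin under src/plugin: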
src/bin
src/java
src/test
src/testresources
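Alternatively, you can edit the project's .classpath file directly and then refresh the project so the folders are picked up automatically. This is more convenient, but check carefully that no entry was added by mistake.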
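Because hand-editing .classpath is error-prone, the plugin source entries can be generated instead of typed. The following is only a minimal sketch and is not part of Nutch or Eclipse: the function name plugin_source_entries is made up for illustration, and it assumes the checkout layout used in these notes (src/plugin/<name>/src/java and src/plugin/<name>/src/test).

```python
from pathlib import Path


def plugin_source_entries(project_root):
    """Build <classpathentry> lines for every plugin's src/java and src/test.

    Hypothetical helper: assumes plugins live under <project_root>/src/plugin.
    """
    root = Path(project_root)
    plugin_root = root / "src" / "plugin"
    if not plugin_root.is_dir():
        return []
    entries = []
    for plugin_dir in sorted(plugin_root.iterdir()):
        for sub in ("java", "test"):
            src_dir = plugin_dir / "src" / sub
            # Only emit folders that really exist; not every plugin has tests.
            if src_dir.is_dir():
                rel = src_dir.relative_to(root).as_posix()
                entries.append(f'<classpathentry kind="src" path="{rel}"/>')
    return entries


if __name__ == "__main__":
    # Run from the project root, then paste the output into .classpath.
    for line in plugin_source_entries("."):
        print(line)
```

The output only covers the plugin source folders; conf, src/java, src/test, src/bin, and src/testresources still have to be listed as well.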
Contents of .classpath:
<?xml version="1.0" encoding="UTF-8"?>
<classpath>
<classpathentry kind="src" path="conf"/>
<classpathentry kind="src" path="src/java"/>
<classpathentry kind="src" path="src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-file/src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-httpclient/src/test"/>
<classpathentry kind="src" path="src/plugin/subcollection/src/test"/>
<classpathentry kind="src" path="src/plugin/parse-html/src/test"/>
<classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/test"/>
<classpathentry kind="src" path="src/plugin/parse-html/src/java"/>
<classpathentry kind="src" path="src/plugin/parse-tika/src/test"/>
<classpathentry kind="src" path="src/plugin/lib-http/src/test"/>
<classpathentry kind="src" path="src/plugin/parse-tika/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-regex/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-domain/src/java"/>
<classpathentry kind="src" path="src/plugin/scoring-link/src/java"/>
<classpathentry kind="src" path="src/plugin/index-anchor/src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-http/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/test"/>
<classpathentry kind="src" path="src/plugin/urlfilter-prefix/src/java"/>
<classpathentry kind="src" path="src/plugin/scoring-opic/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-domain/src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-file/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/java"/>
<classpathentry kind="src" path="src/plugin/language-identifier/src/java"/>
<classpathentry kind="src" path="src/plugin/lib-regex-filter/src/test"/>
<classpathentry kind="src" path="src/plugin/language-identifier/src/test"/>
<classpathentry kind="src" path="src/plugin/subcollection/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/test"/>
<classpathentry kind="src" path="src/plugin/index-basic/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/test"/>
<classpathentry kind="src" path="src/plugin/creativecommons/src/java"/>
<classpathentry kind="src" path="src/bin"/>
<classpathentry kind="src" path="src/plugin/protocol-httpclient/src/java"/>
<classpathentry kind="src" path="src/plugin/tld/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/java"/>
<classpathentry kind="src" path="src/plugin/index-basic/src/test"/>
<classpathentry kind="src" path="src/plugin/lib-http/src/java"/>
<classpathentry kind="src" path="src/plugin/protocol-ftp/src/java"/>
<classpathentry kind="src" path="src/plugin/index-anchor/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-validator/src/java"/>
<classpathentry kind="src" path="src/plugin/index-more/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/test"/>
<classpathentry kind="src" path="src/plugin/creativecommons/src/test"/>
<classpathentry kind="src" path="src/plugin/microformats-reltag/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-regex/src/test"/>
<classpathentry kind="src" path="src/plugin/lib-regex-filter/src/java"/>
<classpathentry kind="src" path="src/plugin/index-more/src/test"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/java"/>
<classpathentry kind="src" path="src/testresources"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=ivy%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fcreativecommons%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Ffeed%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Findex-anchor%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Findex-basic%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Findex-more%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flanguage-identifier%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-http%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-nekohtml%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-regex-filter%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-xml%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fmicroformats-reltag%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fnutch-extensionpoints%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-ext%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-html%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-js%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-swf%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-tika%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-zip%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-file%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-ftp%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-http%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-httpclient%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-sftp%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fscoring-link%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fscoring-opic%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fsubcollection%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Ftld%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-automaton%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-domain%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-prefix%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-regex%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-suffix%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-validator%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlnormalizer-basic%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlnormalizer-pass%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlnormalizer-regex%2Fivy.xml&amp;confs=*"/>
<classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
<classpathentry kind="con" path="org.eclipse.jdt.junit.JUNIT_CONTAINER/4"/>
<classpathentry kind="lib" path="lib/org.restlet-2.0.0.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.example.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.atom_1.0.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.atom.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.crypto.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.fileupload_1.2.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.freemarker_2.3.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.freemarker.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.grizzly.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.gwt.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.httpclient.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.jaas.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.jackson.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.jaxb_2.1.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.jaxrs_1.0.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.jaxrs-2.0-RC3.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.jibx_1.1.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.json_2.0.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.json.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.net.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.odata.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.rdf.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.servlet-2.0-RC3.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.servlet-2.0.0.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.servlet.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.spring_2.5.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.spring-2.0.0.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.velocity_1.5.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.wadl_1.0.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.xml.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.ext.xstream.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.gae-2.0-RC3.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.gwt.jar"/>
<classpathentry kind="lib" path="lib/org.restlet.lib.org.json-2.0.jar"/>
<classpathentry kind="lib" path="src/plugin/urlfilter-automaton/lib/automaton.jar"/>
<classpathentry kind="lib" path="lib/mysql-connector-java-5.0.7.jar"/>
<classpathentry kind="output" path="bin"/>
</classpath>
After refreshing the project, the result is the same as with the manual steps above.
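To catch wrongly added entries before refreshing, the .classpath file can be checked against the file system. This is a rough sketch under stated assumptions: missing_classpath_paths is a hypothetical helper, not an Eclipse feature, and it only looks at kind="src" and kind="lib" entries, resolving their paths relative to the folder that holds .classpath.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def missing_classpath_paths(classpath_file):
    """Return src/lib paths listed in an Eclipse .classpath that do not exist.

    Hypothetical helper; paths are resolved relative to the file's folder.
    """
    classpath_file = Path(classpath_file)
    project_root = classpath_file.parent
    # ET.parse also fails loudly if the file is not well-formed XML,
    # for example when a '&' inside an attribute is not written as '&amp;'.
    tree = ET.parse(str(classpath_file))
    missing = []
    for entry in tree.getroot().iter("classpathentry"):
        if entry.get("kind") in ("src", "lib"):
            path = entry.get("path", "")
            if not (project_root / path).exists():
                missing.append(path)
    return missing
```

Calling missing_classpath_paths(".classpath") from the project root returns an empty list when every listed source folder and JAR is present.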
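Next, on the Order and Export tab, move conf to the top so that it is loaded first.
Once that is done, the next step is importing the JARs.
With the Ivy plugin installed, you can right-click an ivy.xml file directly and choose Add Ivy Library.
Just click Finish and the JARs are downloaded automatically. Note that the project contains many ivy.xml files; Add Ivy Library has to be run once for every one that declares JARs.
Tracking them all down takes some time.
After all the ivy.xml files have been imported, a few JARs will still be missing
(download these from the web yourself; I also added some extra packages of my own here).
In addition, the hadoop-core JAR needs a modification to FileUtil.java.
For details see http://yangshangchuan.iteye.com/blog/1839784
The relevant part is excerpted here (without the change, an error is reported at run time).
Error message:
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-ysc\mapred\staging\ysc-2036315919\.staging to 0700
Official bug reference:
https://issues.apache.org/jira/browse/HADOOP-7682
Solution:
1. Download and unpack http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-1.1.2/hadoop-1.1.2.tar.gz
2. Edit hadoop-1.1.2\src\core\org\apache\hadoop\fs\FileUtil.java: search for "Failed to set permissions of path", go to line 689, and change throw new IOException to LOG.warn
3. Edit hadoop-1.1.2\build.xml: search for autoreconf and remove the 6 matching exec elements that have executable="autoreconf"
4. Download and unpack Ant, and add the bin directory inside the Ant directory to the PATH environment variable
5. In a Cygwin shell, change to the hadoop-1.1.2 directory and run ant
6. Replace Nutch's hadoop-core-1.0.3.jar with the newly built hadoop-1.1.2\build\hadoop-core-1.1.3-SNAPSHOT.jar
7. For development in Eclipse, replace C:\Users\ysc\.ivy2\cache\org.apache.hadoop\hadoop-core\jars\hadoop-core-1.1.2.jar
The JAR attached to that post is a modified build of Hadoop 1.2.1. It can be used with Nutch 1.7; other Nutch versions have not been tested.
When I made this change I simply downloaded that JAR and used it to replace the hadoop-core JAR in the Ivy cache, keeping the same file name.
Download: http://pan.baidu.com/s/1i3FBLEP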
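Swapping a JAR inside the Ivy cache is easy to get wrong and hard to undo, so it may be worth keeping a backup of the original. The sketch below is one way to do that; replace_jar is a hypothetical helper, and both paths are whatever applies on your machine (the target file name depends on the hadoop-core version Ivy resolved).

```python
import shutil
from pathlib import Path


def replace_jar(patched_jar, target_jar):
    """Overwrite target_jar with patched_jar, keeping a .bak copy of the original.

    Hypothetical helper; returns the path of the backup file.
    """
    patched_jar, target_jar = Path(patched_jar), Path(target_jar)
    backup = target_jar.with_name(target_jar.name + ".bak")
    # Keep only the first backup, so re-running never overwrites the pristine JAR.
    if target_jar.exists() and not backup.exists():
        shutil.copy2(target_jar, backup)
    shutil.copy2(patched_jar, target_jar)
    return backup
```

The target keeps its original file name, so the Ivy container entries in .classpath continue to resolve it.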
Next comes the configuration.
Under nutch2.1/conf,
in gora.properties,
add:
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=root
and comment out the other database connections.
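A typo in the JDBC URL only shows up once the crawl starts, so it can help to read the settings back and confirm that the host, port, and database name match the database created at the start of these notes. The following is a small sketch; load_properties and describe_jdbc_url are hypothetical helpers, and the parser handles only plain key=value lines with # comments, not the full Java properties syntax.

```python
from urllib.parse import urlsplit, parse_qs


def load_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only; the JDBC URL itself contains '='.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def describe_jdbc_url(jdbc_url):
    """Split a jdbc:mysql://host:port/db?options URL into its parts."""
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError("not a JDBC URL: " + jdbc_url)
    parts = urlsplit(jdbc_url[len("jdbc:"):])
    return {
        "driver": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "options": {k: v[0] for k, v in parse_qs(parts.query).items()},
    }
```

For the URL used here, the database field comes back as nutch, which is the name given in the CREATE DATABASE statement.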
In ivy/ivy.xml,
uncomment the mysql-connector dependency.
Add the following inside the configuration element of /conf/nutch-site.xml.template:
<property>
<name>http.agent.name</name>
<value>Your Nutch Spider</value>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>
<property>
<name>plugin.folders</name>
<value>./src/plugin</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
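The plugin.includes value is a regular expression over plugin directory names, so it is easy to check which plugins a given value lets through. This is only an illustration in Python rather than Nutch's own code, and it assumes that a directory name has to match the entire expression; the list of names passed in is just a sample taken from src/plugin.

```python
import re

# The plugin.includes value from the configuration above.
PLUGIN_INCLUDES = (
    "protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)"
    "|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic"
)


def included_plugins(plugin_dirs, pattern=PLUGIN_INCLUDES):
    """Return the directory names that the whole pattern matches."""
    regex = re.compile(pattern)
    return [name for name in plugin_dirs if regex.fullmatch(name)]


if __name__ == "__main__":
    sample = ["parse-html", "parse-js", "protocol-http", "index-more", "scoring-opic"]
    print(included_plugins(sample))
```

With this value, parse-js and index-more are left out, while parse-html, protocol-http, and scoring-opic are kept.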
In build.xml in the project root, find the following code:
<target name="resolve-default" depends="clean-lib, init" description="--> resolve and retrieve dependencies with ivy">
<ivy:resolve file="${ivy.file}" conf="default" log="download-only" />
<ivy:retrieve pattern="${build.lib.dir}/[artifact]-[revision].[ext]" symlink="false" log="quiet" />
<antcall target="copy-libs" />
</target>
Change the original
pattern="${build.lib.dir}/[artifact]-[revision].[ext]"
to
pattern="${build.lib.dir}/[artifact]-[type]-[revision].[ext]"
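This avoids a compile failure when Ivy downloads the dependencies again. The reason: Ivy downloads both the class JAR and the source JAR, and with the original pattern the two files get the same name and cannot be told apart, which produces a duplicate-file error.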
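The name clash can be seen by expanding both patterns by hand. The sketch below is a simplified stand-in for Ivy's pattern expansion, not Ivy itself: it substitutes [token] placeholders only, uses lib in place of ${build.lib.dir}, and the module name example-lib and its revision are made up.

```python
import re


def expand_pattern(pattern, tokens):
    """Replace each [token] in an Ivy-style retrieve pattern with its value."""
    return re.sub(r"\[(\w+)\]", lambda m: tokens[m.group(1)], pattern)


# Two artifacts of one hypothetical module: the classes JAR and the sources JAR.
artifacts = [
    {"artifact": "example-lib", "type": "jar", "revision": "1.0", "ext": "jar"},
    {"artifact": "example-lib", "type": "source", "revision": "1.0", "ext": "jar"},
]

old = [expand_pattern("lib/[artifact]-[revision].[ext]", a) for a in artifacts]
new = [expand_pattern("lib/[artifact]-[type]-[revision].[ext]", a) for a in artifacts]

print(old)  # both artifacts map to the same file name
print(new)  # the [type] token keeps the two files apart
```

Under the original pattern both artifacts expand to lib/example-lib-1.0.jar, while the modified pattern yields lib/example-lib-jar-1.0.jar and lib/example-lib-source-1.0.jar.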
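After completing the steps above, run build.xml with Ant; the build generates the runtime directory.
In the project root, add a urls folder containing a seed.txt file with one site address in it, for example: http://nutch.apache.org/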
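If you prefer to script this step, the seed file can be created as follows. This is a minimal sketch; write_seed_file is a hypothetical helper, and it assumes the seed list is a plain text file with one URL per line.

```python
from pathlib import Path


def write_seed_file(project_root, urls):
    """Create <project_root>/urls/seed.txt with one URL per line."""
    seed_dir = Path(project_root) / "urls"
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / "seed.txt"
    seed_file.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return seed_file
```

For the example in these notes, the call would be write_seed_file(".", ["http://nutch.apache.org/"]) from the project root.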
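Open the Crawler class in the crawl package under src/java and create a Run Configuration for it.
The first tab is already filled in by default.
Select the second tab, Arguments.
Under Program arguments enter:
urls -depth 3 -topN 5
Under VM arguments enter:
-Xms64m -Xmx512m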
Finally, click Run to crawl the link information of that site.
After it runs, the following is printed:
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://nutch.apache.org/
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread2, activeThreads=1
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread5, activeThreads=1
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0.2 pages/s, 84 84 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Parsing http://nutch.apache.org/
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 6 records. Hit by time limit :0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://cassandra.apache.org/
fetching http://nutch.apache.org/
fetching http://accumulo.apache.org/
fetching http://avro.apache.org/
fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
fetching http://code.google.com/p/crawler-commons/
-finishing thread FetcherThread1, activeThreads=9
-finishing thread FetcherThread2, activeThreads=8
-finishing thread FetcherThread3, activeThreads=7
-finishing thread FetcherThread6, activeThreads=6
-finishing thread FetcherThread0, activeThreads=5
-finishing thread FetcherThread8, activeThreads=4
-finishing thread FetcherThread7, activeThreads=3
-finishing thread FetcherThread9, activeThreads=2
0/2 spinwaiting/active, 4 pages, 0 errors, 0.8 0.8 pages/s, 136 136 kb/s, 0 URLs in 2 queues
0/2 spinwaiting/active, 4 pages, 0 errors, 0.4 0.0 pages/s, 68 0 kb/s, 0 URLs in 2 queues
0/2 spinwaiting/active, 4 pages, 0 errors, 0.3 0.0 pages/s, 45 0 kb/s, 0 URLs in 2 queues
0/2 spinwaiting/active, 4 pages, 0 errors, 0.2 0.0 pages/s, 34 0 kb/s, 0 URLs in 2 queues
fetch of http://code.google.com/p/crawler-commons/ failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
-finishing thread FetcherThread4, activeThreads=1
fetch of http://blog.foofactory.fi/2007/03/twice-speed-half-size.html failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
-finishing thread FetcherThread5, activeThreads=0
0/0 spinwaiting/active, 6 pages, 2 errors, 0.2 0.4 pages/s, 27 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Skipping http://sched.co/1pav9xl; different batch id (null)
Skipping http://sched.co/1pbE15n; different batch id (null)
Skipping http://t.co/k3VLhbJQhg; different batch id (null)
Skipping http://www.eu.apachecon.com/c/aceu2009/; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/136; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/137; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/138; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/165; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/197; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/201; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/250; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/251; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/schedule; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/331; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/332; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/333; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/334; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/335; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/375; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/427; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/428; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/430; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/437; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/461; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/462; different batch id (null)
Skipping http://www.cafepress.com/nutch; different batch id (null)
Skipping https://www.flickr.com/photos/andrewfhart/8106189987/; different batch id (null)
Skipping https://www.flickr.com/photos/andrewfhart/8106200690/; different batch id (null)
Skipping https://www.flickr.com/photos/mrmuskrat/3637703614/; different batch id (null)
Skipping https://www.flickr.com/photos/splorp/3981832163/; different batch id (null)
Skipping https://www.google-melange.com/gsoc/homepage/google/gsoc2014; different batch id (null)
Parsing http://code.google.com/p/crawler-commons/
Skipping https://twitter.com/ApacheNutch; different batch id (null)
Skipping https://twitter.com/ApacheNutch/status/591359830171856896; different batch id (null)
Skipping https://twitter.com/cutting/status/233415059798372353; different batch id (null)
Skipping https://twitter.com/TheASF; different batch id (null)
Skipping http://www.brics.dk/automaton/; different batch id (null)
Skipping http://www.brics.dk/automaton/automaton; different batch id (null)
Parsing http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
Parsing http://accumulo.apache.org/
Parsing http://avro.apache.org/
Skipping https://builds.apache.org/view/M-R/view/Nutch/; different batch id (null)
Parsing http://cassandra.apache.org/
Skipping https://cwiki.apache.org/confluence/display/solr/SolrCloud; different batch id (null)
Skipping http://gora.apache.org/; different batch id (null)
Skipping http://hadoop.apache.org/; different batch id (null)
Skipping http://hbase.apache.org/; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-1047; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-1591; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-841; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH/; different batch id (null)
Skipping http://lucene.apache.org/; different batch id (null)
Skipping http://lucene.apache.org/solr; different batch id (null)
Skipping http://lucene.apache.org/solr/; different batch id (null)
Parsing http://nutch.apache.org/
Skipping http://nutch.apache.org/bot.html; different batch id (null)
Skipping http://nutch.apache.org/credits.html; different batch id (null)
Skipping http://nutch.apache.org/downloads.html; different batch id (null)
Skipping http://nutch.apache.org/index.html; different batch id (null)
Skipping http://nutch.apache.org/javadoc.html; different batch id (null)
Skipping http://nutch.apache.org/mailing_lists.html; different batch id (null)
Skipping http://nutch.apache.org/version_control.html; different batch id (null)
Skipping http://s.apache.org/1.9-release; different batch id (null)
Skipping http://s.apache.org/1zE; different batch id (null)
Skipping http://s.apache.org/LPB; different batch id (null)
Skipping http://s.apache.org/nutch10; different batch id (null)
Skipping http://s.apache.org/nutch_2.3; different batch id (null)
Skipping http://s.apache.org/oHY; different batch id (null)
Skipping http://s.apache.org/PGa; different batch id (null)
Skipping http://tika.apache.org/; different batch id (null)
Skipping http://tika.apache.org/1.2/index.html; different batch id (null)
Skipping https://whimsy.apache.org/board/minutes/Nutch.html; different batch id (null)
Skipping http://wicket.apache.org/; different batch id (null)
Skipping http://wiki.apache.org/nutch/; different batch id (null)
Skipping http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer; different batch id (null)
Skipping http://wiki.apache.org/nutch/FAQ; different batch id (null)
Skipping http://wiki.apache.org/nutch/NutchPropertiesCompleteList; different batch id (null)
Skipping https://wiki.apache.org/nutch/FrontPage; different batch id (null)
Skipping https://wiki.apache.org/nutch/NutchRESTAPI; different batch id (null)
Skipping http://www.apache.org/; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.6/CHANGES_1.6.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.7/1.7-CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.8/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.9/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.0/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.1/CHANGES-2.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.2.1/CHANGES-2.2.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.2/2.2-CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-0.8.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-0.9.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.0.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.2.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.3.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.4.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.5.txt; different batch id (null)
Skipping http://www.apache.org/dyn/closer.cgi/nutch/; different batch id (null)
Skipping http://www.apache.org/foundation/records/minutes/2010/board_minutes_2010_04_21.txt; different batch id (null)
Skipping http://www.apache.org/foundation/sponsorship.html; different batch id (null)
Skipping http://www.apache.org/foundation/thanks.html; different batch id (null)
Skipping http://www.apache.org/licenses/; different batch id (null)
Skipping http://www.apache.org/licenses/LICENSE-2.0; different batch id (null)
Skipping http://www.apache.org/security/; different batch id (null)
Skipping http://creativecommons.org/press-releases/entry/5064; different batch id (null)
Skipping https://creativecommons.org/licenses/by-sa/2.0/; different batch id (null)
Skipping http://www.elasticsearch.org/; different batch id (null)
Skipping http://events.linuxfoundation.org/events/apachecon-europe; different batch id (null)
Skipping http://events.linuxfoundation.org/events/apachecon-north-america; different batch id (null)
Skipping http://search.maven.org/; different batch id (null)
Skipping http://mongodb.org/; different batch id (null)
Skipping http://osuosl.org/news_folder/nutch; different batch id (null)
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
fetching http://cassandra.apache.org/
fetching http://nutch.apache.org/
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://accumulo.apache.org/
fetching http://avro.apache.org/
QueueFeeder finished: total 11 records. Hit by time limit :0
fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
fetching http://www.apache.org/foundation/sponsorship.html
fetching http://code.google.com/p/crawler-commons/
fetching http://www.apache.org/security/
7/10 spinwaiting/active, 5 pages, 0 errors, 1.0 1.0 pages/s, 169 169 kb/s, 3 URLs in 3 queues
* queue: http://www.apache.org
maxThreads = 1
inProgress = 1
crawlDelay = 4000
minCrawlDelay = 0
nextFetchTime = 1445831574525
now = 1445831574814
0. http://www.apache.org/foundation/thanks.html
1. http://www.apache.org/licenses/
2. http://www.apache.org/
fetching http://www.apache.org/foundation/thanks.html
8/10 spinwaiting/active, 7 pages, 0 errors, 0.7 0.4 pages/s, 113 57 kb/s, 2 URLs in 3 queues
* queue: http://www.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 4000
minCrawlDelay = 0
nextFetchTime = 1445831583211
now = 1445831579817
0. http://www.apache.org/licenses/
1. http://www.apache.org/
fetching http://www.apache.org/licenses/
8/10 spinwaiting/active, 8 pages, 0 errors, 0.5 0.2 pages/s, 86 31 kb/s, 1 URLs in 3 queues
* queue: http://www.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 4000
minCrawlDelay = 0
nextFetchTime = 1445831587582
now = 1445831584820
0. http://www.apache.org/
fetching http://www.apache.org/
-finishing thread FetcherThread9, activeThreads=8
-finishing thread FetcherThread2, activeThreads=8
-finishing thread FetcherThread0, activeThreads=7
-finishing thread FetcherThread1, activeThreads=6
-finishing thread FetcherThread4, activeThreads=4
-finishing thread FetcherThread3, activeThreads=4
-finishing thread FetcherThread5, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
0/2 spinwaiting/active, 9 pages, 0 errors, 0.5 0.2 pages/s, 84 81 kb/s, 0 URLs in 2 queues
fetch of http://blog.foofactory.fi/2007/03/twice-speed-half-size.html failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
-finishing thread FetcherThread8, activeThreads=1
fetch of http://code.google.com/p/crawler-commons/ failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
-finishing thread FetcherThread6, activeThreads=0
0/0 spinwaiting/active, 11 pages, 2 errors, 0.4 0.4 pages/s, 67 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
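A note on the queue dumps in the fetcher output above: each host gets its own queue, and the fields maxThreads, inProgress, crawlDelay and nextFetchTime appear to decide when the next URL from that host may be requested. The following is only a minimal sketch of that politeness rule as I understand it from the log. The class HostQueueSketch and its methods are hypothetical names made up for illustration; this is not Nutch's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified, hypothetical model of one per-host fetch queue (NOT Nutch's real class). */
final class HostQueueSketch {
    final int maxThreads;        // "maxThreads" in the log dump
    final long crawlDelayMs;     // "crawlDelay" in the log dump
    final long minCrawlDelayMs;  // "minCrawlDelay" in the log dump
    int inProgress = 0;          // fetches currently running against this host
    long nextFetchTime = 0L;     // earliest time (epoch ms) of the next request
    final Deque<String> urls = new ArrayDeque<>();

    HostQueueSketch(int maxThreads, long crawlDelayMs, long minCrawlDelayMs) {
        this.maxThreads = maxThreads;
        this.crawlDelayMs = crawlDelayMs;
        this.minCrawlDelayMs = minCrawlDelayMs;
    }

    /** Returns the next URL if the host may be contacted at time now, otherwise null. */
    String poll(long now) {
        if (inProgress >= maxThreads) return null; // host is already busy
        if (now < nextFetchTime) return null;      // politeness delay not over yet
        String url = urls.pollFirst();
        if (url != null) inProgress++;
        return url;
    }

    /** Called when a fetch ends; schedules the earliest time of the next request. */
    void finish(long now) {
        inProgress--;
        // Assumption: the short delay is only used when several threads share a host.
        long delay = maxThreads > 1 ? minCrawlDelayMs : crawlDelayMs;
        nextFetchTime = now + delay;
    }

    public static void main(String[] args) {
        HostQueueSketch q = new HostQueueSketch(1, 4000L, 0L);
        q.urls.add("http://www.apache.org/licenses/");
        q.urls.add("http://www.apache.org/");
        // Values copied from the second queue dump in the log above.
        q.nextFetchTime = 1445831583211L;
        long now = 1445831579817L;
        System.out.println(q.poll(now));             // still inside the crawl delay
        System.out.println(q.poll(q.nextFetchTime)); // delay has elapsed
    }
}
```

With the numbers from the second dump, the first call returns nothing because now is still about 3.4 seconds before nextFetchTime, which would explain why most threads are reported as spinwaiting while the three www.apache.org URLs are fetched one after another.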
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Skipping http://sched.co/1pav9xl; different batch id (null)
Skipping http://sched.co/1pbE15n; different batch id (null)
Skipping http://t.co/k3VLhbJQhg; different batch id (null)
Skipping http://accumulosummit.com/; different batch id (null)
Skipping http://www.amazon.com/Cassandra-High-Availability-Robbie-Strickland/dp/1783989122; different batch id (null)
Skipping http://www.eu.apachecon.com/c/aceu2009/; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/136; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/137; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/138; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/165; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/197; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/201; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/250; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/251; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/schedule; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/331; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/332; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/333; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/334; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/335; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/375; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/427; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/428; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/430; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/437; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/461; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/462; different batch id (null)
Skipping http://www.cafepress.com/nutch; different batch id (null)
Skipping http://www.datastax.com/dev/blog/2012-in-review-performance; different batch id (null)
Skipping http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_intro_c.html; different batch id (null)
Skipping http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_primary_index_c.html; different batch id (null)
Skipping http://www.datastax.com/resources/whitepapers/benchmarking-top-nosql-databases; different batch id (null)
Skipping https://www.flickr.com/photos/andrewfhart/8106189987/; different batch id (null)
Skipping https://www.flickr.com/photos/andrewfhart/8106200690/; different batch id (null)
Skipping https://www.flickr.com/photos/mrmuskrat/3637703614/; different batch id (null)
Skipping https://www.flickr.com/photos/splorp/3981832163/; different batch id (null)
Skipping http://getbootstrap.com/; different batch id (null)
Skipping https://github.com/apache/accumulo; different batch id (null)
Skipping http://glyphicons.com/; different batch id (null)
Skipping https://www.google-melange.com/gsoc/homepage/google/gsoc2014; different batch id (null)
Parsing http://code.google.com/p/crawler-commons/
Skipping http://research.google.com/archive/bigtable.html; different batch id (null)
Skipping https://www.linkedin.com/groups/Apache-Accumulo-Professionals-4554913; different batch id (null)
Skipping http://blog.markedup.com/2013/02/cassandra-hive-and-hadoop-how-we-picked-our-analytics-stack/; different batch id (null)
Skipping http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/; different batch id (null)
Skipping http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html; different batch id (null)
Skipping https://twitter.com/apacheaccumulo; different batch id (null)
Skipping https://twitter.com/ApacheNutch; different batch id (null)
Skipping https://twitter.com/ApacheNutch/status/591359830171856896; different batch id (null)
Skipping https://twitter.com/cutting/status/233415059798372353; different batch id (null)
Skipping https://twitter.com/TheASF; different batch id (null)
Skipping http://www.brics.dk/automaton/; different batch id (null)
Skipping http://www.brics.dk/automaton/automaton; different batch id (null)
Parsing http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
Skipping http://fontawesome.io/; different batch id (null)
Skipping http://freenode.net/; different batch id (null)
Skipping http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global-cassandra; different batch id (null)
Skipping http://www.slideshare.net/daveconnors/cassandra-puppet-scaling-data-at-15-per-month; different batch id (null)
Skipping http://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376; different batch id (null)
Skipping http://www.slideshare.net/jbellis; different batch id (null)
Skipping http://www.slideshare.net/jbellis/cassandra-at-nosql-matters-2012; different batch id (null)
Skipping http://www.slideshare.net/planetcassandra/3-mohit-anchlia; different batch id (null)
Skipping http://www.slideshare.net/planetcassandra/nyc-tech-day-using-cassandra-for-dvr-scheduling-at-comcast; different batch id (null)
Skipping http://www.slideshare.net/slideshow/embed_code/15832310; different batch id (null)
Parsing http://accumulo.apache.org/
Skipping http://accumulo.apache.org/1.5/accumulo_user_manual.html; different batch id (null)
Skipping http://accumulo.apache.org/1.5/apidocs; different batch id (null)
Skipping http://accumulo.apache.org/1.5/examples; different batch id (null)
Skipping http://accumulo.apache.org/1.6/accumulo_user_manual.html; different batch id (null)
Skipping http://accumulo.apache.org/1.6/apidocs; different batch id (null)
Skipping http://accumulo.apache.org/1.6/examples; different batch id (null)
Skipping http://accumulo.apache.org/1.7/accumulo_user_manual.html; different batch id (null)
Skipping http://accumulo.apache.org/1.7/apidocs; different batch id (null)
Skipping http://accumulo.apache.org/1.7/examples; different batch id (null)
Skipping http://accumulo.apache.org/bylaws.html; different batch id (null)
Skipping http://accumulo.apache.org/contrib.html; different batch id (null)
Skipping http://accumulo.apache.org/downloads; different batch id (null)
Skipping http://accumulo.apache.org/downloads/; different batch id (null)
Skipping http://accumulo.apache.org/get_involved.html; different batch id (null)
Skipping http://accumulo.apache.org/git.html; different batch id (null)
Skipping http://accumulo.apache.org/glossary.html; different batch id (null)
Skipping http://accumulo.apache.org/governance/consensusBuilding.html; different batch id (null)
Skipping http://accumulo.apache.org/governance/lazyConsensus.html; different batch id (null)
Skipping http://accumulo.apache.org/governance/releasing.html; different batch id (null)
Skipping http://accumulo.apache.org/governance/voting.html; different batch id (null)
Skipping http://accumulo.apache.org/index.html; different batch id (null)
Skipping http://accumulo.apache.org/mailing_list.html; different batch id (null)
Skipping http://accumulo.apache.org/notable_features.html; different batch id (null)
Skipping http://accumulo.apache.org/old_documentation.html; different batch id (null)
Skipping http://accumulo.apache.org/papers.html; different batch id (null)
Skipping http://accumulo.apache.org/people.html; different batch id (null)
Skipping http://accumulo.apache.org/projects.html; different batch id (null)
Skipping http://accumulo.apache.org/rb.html; different batch id (null)
Skipping http://accumulo.apache.org/release_notes/; different batch id (null)
Skipping http://accumulo.apache.org/release_notes/1.5.4.html; different batch id (null)
Skipping http://accumulo.apache.org/release_notes/1.6.4.html; different batch id (null)
Skipping http://accumulo.apache.org/release_notes/1.7.0.html; different batch id (null)
Skipping http://accumulo.apache.org/releasing.html; different batch id (null)
Skipping http://accumulo.apache.org/screenshots.html; different batch id (null)
Skipping http://accumulo.apache.org/source.html; different batch id (null)
Skipping http://accumulo.apache.org/verifying_releases.html; different batch id (null)
Skipping http://accumulo.apache.org/versioning.html; different batch id (null)
Parsing http://avro.apache.org/
Skipping http://avro.apache.org/credits.html; different batch id (null)
Skipping http://avro.apache.org/docs/1.6.3; different batch id (null)
Skipping http://avro.apache.org/docs/1.7.7; different batch id (null)
Skipping http://avro.apache.org/docs/current; different batch id (null)
Skipping http://avro.apache.org/docs/current/; different batch id (null)
Skipping http://avro.apache.org/index.html; different batch id (null)
Skipping http://avro.apache.org/irc.html; different batch id (null)
Skipping http://avro.apache.org/issue_tracking.html; different batch id (null)
Skipping http://avro.apache.org/mailing_lists.html; different batch id (null)
Skipping http://avro.apache.org/releases.html; different batch id (null)
Skipping http://avro.apache.org/version_control.html; different batch id (null)
Skipping http://blogs.apache.org/accumulo; different batch id (null)
Skipping https://blogs.apache.org/accumulo/; different batch id (null)
Skipping https://builds.apache.org/view/A-D/view/Accumulo/; different batch id (null)
Skipping https://builds.apache.org/view/M-R/view/Nutch/; different batch id (null)
Parsing http://cassandra.apache.org/
Skipping http://cassandra.apache.org/download/; different batch id (null)
Skipping http://cassandra.apache.org/privacy.html; different batch id (null)
Skipping https://cwiki.apache.org/confluence/display/AVRO/How+To+Contribute; different batch id (null)
Skipping https://cwiki.apache.org/confluence/display/AVRO/Index; different batch id (null)
Skipping https://cwiki.apache.org/confluence/display/solr/SolrCloud; different batch id (null)
Skipping http://forrest.apache.org/; different batch id (null)
Skipping http://gora.apache.org/; different batch id (null)
Skipping http://hadoop.apache.org/; different batch id (null)
Skipping http://hadoop.apache.org/privacy_policy.html; different batch id (null)
Skipping http://hbase.apache.org/; different batch id (null)
Skipping https://issues.apache.org/jira/browse/accumulo; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-1047; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-1591; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-841; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH/; different batch id (null)
Skipping http://lucene.apache.org/; different batch id (null)
Skipping http://lucene.apache.org/solr; different batch id (null)
Skipping http://lucene.apache.org/solr/; different batch id (null)
Parsing http://nutch.apache.org/
Skipping http://nutch.apache.org/bot.html; different batch id (null)
Skipping http://nutch.apache.org/credits.html; different batch id (null)
Skipping http://nutch.apache.org/downloads.html; different batch id (null)
Skipping http://nutch.apache.org/index.html; different batch id (null)
Skipping http://nutch.apache.org/javadoc.html; different batch id (null)
Skipping http://nutch.apache.org/mailing_lists.html; different batch id (null)
Skipping http://nutch.apache.org/version_control.html; different batch id (null)
Skipping http://s.apache.org/1.9-release; different batch id (null)
Skipping http://s.apache.org/1zE; different batch id (null)
Skipping http://s.apache.org/LPB; different batch id (null)
Skipping http://s.apache.org/nutch10; different batch id (null)
Skipping http://s.apache.org/nutch_2.3; different batch id (null)
Skipping http://s.apache.org/oHY; different batch id (null)
Skipping http://s.apache.org/PGa; different batch id (null)
Skipping http://thrift.apache.org/; different batch id (null)
Skipping http://tika.apache.org/; different batch id (null)
Skipping http://tika.apache.org/1.2/index.html; different batch id (null)
Skipping https://whimsy.apache.org/board/minutes/Nutch.html; different batch id (null)
Skipping http://wicket.apache.org/; different batch id (null)
Skipping http://wiki.apache.org/cassandra; different batch id (null)
Skipping http://wiki.apache.org/cassandra/Durability; different batch id (null)
Skipping http://wiki.apache.org/cassandra/FAQ; different batch id (null)
Skipping http://wiki.apache.org/cassandra/GettingStarted; different batch id (null)
Skipping http://wiki.apache.org/cassandra/HintedHandoff; different batch id (null)
Skipping http://wiki.apache.org/cassandra/HowToContribute; different batch id (null)
Skipping http://wiki.apache.org/cassandra/ReadRepair; different batch id (null)
Skipping http://wiki.apache.org/cassandra/ThirdPartySupport; different batch id (null)
Skipping http://wiki.apache.org/nutch/; different batch id (null)
Skipping http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer; different batch id (null)
Skipping http://wiki.apache.org/nutch/FAQ; different batch id (null)
Skipping http://wiki.apache.org/nutch/NutchPropertiesCompleteList; different batch id (null)
Skipping https://wiki.apache.org/nutch/FrontPage; different batch id (null)
Skipping https://wiki.apache.org/nutch/NutchRESTAPI; different batch id (null)
Parsing http://www.apache.org/
Skipping http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.6/CHANGES_1.6.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.7/1.7-CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.8/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.9/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.0/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.1/CHANGES-2.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.2.1/CHANGES-2.2.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.2/2.2-CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-0.8.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-0.9.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.0.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.2.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.3.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.4.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.5.txt; different batch id (null)
Skipping http://www.apache.org/dyn/closer.cgi/nutch/; different batch id (null)
Skipping http://www.apache.org/foundation/policies/conduct.html; different batch id (null)
Skipping http://www.apache.org/foundation/records/minutes/2010/board_minutes_2010_04_21.txt; different batch id (null)
Parsing http://www.apache.org/foundation/sponsorship.html
Parsing http://www.apache.org/foundation/thanks.html
Parsing http://www.apache.org/licenses/
Skipping http://www.apache.org/licenses/LICENSE-2.0; different batch id (null)
Parsing http://www.apache.org/security/
Skipping http://zookeeper.apache.org/; different batch id (null)
Skipping http://creativecommons.org/press-releases/entry/5064; different batch id (null)
Skipping https://creativecommons.org/licenses/by-sa/2.0/; different batch id (null)
Skipping http://www.elasticsearch.org/; different batch id (null)
Skipping http://hypertable.org/; different batch id (null)
Skipping http://events.linuxfoundation.org/events/apachecon-europe; different batch id (null)
Skipping http://events.linuxfoundation.org/events/apachecon-north-america; different batch id (null)
Skipping http://search.maven.org/; different batch id (null)
Skipping http://mongodb.org/; different batch id (null)
Skipping http://osuosl.org/news_folder/nutch; different batch id (null)
Skipping http://www.planetcassandra.org/; different batch id (null)
Skipping http://planetcassandra.org/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/analytics-at-github-with-apache-cassandra/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/cassandra-at-cern-large-hadron-collider/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/cassandra-used-to-build-scalable-and-highly-available-systems-at-hulu-streaming-content-to-over-5-million-subscribers/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/godaddy-worlds-largest-domain-name-registrar-and-web-host-provider-utilizes-cassandra-for-replication-and-scalability/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/instagram-making-the-switch-to-cassandra-from-redis-75-instasavings/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/make-it-rain-apache-cassandra-at-the-weather-channel-for-severe-weather-alerts/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/reddit-upvotes-apache-cassandras-horizontal-scaling-managing-17000000-votes-daily/; different batch id (null)
Skipping http://planetcassandra.org/companies/; different batch id (null)
Skipping http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf; different batch id (null)
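A note on the many "Skipping ...; different batch id (null)" lines in the parser output above: they are not errors. As far as I can tell, the parser only handles rows that carry a fetch mark from a fetch run, and the outlinks discovered so far have no mark yet (hence "null"). The following is a minimal sketch of that filter under this assumption. The class BatchFilterSketch, the constant ALL and the example batch id are hypothetical and only for illustration; this is not Nutch's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the "parse only fetched rows" filter (NOT Nutch's real class). */
final class BatchFilterSketch {
    // Assumed marker meaning "process every batch", matching "ParserJob: parsing all".
    static final String ALL = "-all";

    /** A row is processed only if it has a fetch mark that matches the requested batch. */
    static boolean shouldProcess(String fetchMark, String batchId) {
        if (fetchMark == null) return false; // never fetched, nothing to parse
        return ALL.equals(batchId) || fetchMark.equals(batchId);
    }

    /** Builds a log line in the same shape as the parser output above. */
    static String describe(String url, String fetchMark, String batchId) {
        return shouldProcess(fetchMark, batchId)
                ? "Parsing " + url
                : "Skipping " + url + "; different batch id (" + fetchMark + ")";
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("http://nutch.apache.org/", "1445831570-12345"); // made-up batch id
        rows.put("http://nutch.apache.org/bot.html", null);       // outlink, not fetched yet
        for (Map.Entry<String, String> e : rows.entrySet()) {
            System.out.println(describe(e.getKey(), e.getValue(), ALL));
        }
    }
}
```

In this model the 11 fetched pages are the only ones reported as "Parsing", and the skipped URLs would be picked up by a later generate/fetch round.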
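The crawl results are now inserted into the webpage table.
At this point, the import into Eclipse is basically complete.
From here on, it is a matter of exploring Nutch on your own.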
---------------------------------------------------------------------------------
Another, simpler approach
File > New > Project > SVN > Checkout Projects from SVN
Create a new repository location >
URL: https://svn.apache.org/repos/asf/nutch/tags/release-1.7/
Select the URL > Finish. The New Project wizard opens; choose Java Project > Next,
enter Project name: nutch1.7 > Finish
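Setting up the environment
In the Package Explorer on the left, right-click the nutch1.7 folder > Build Path > Configure Build Path...
> select the Source tab > select src > Remove > Add Folder... > select src/bin, src/java, src/test and src/testresources
Switch to the Libraries tab >
Add Class Folder... > select nutch1.7/conf
Add Library... > IvyDE Managed Dependencies > Next > Main > Ivy File > Browse > ivy/ivy.xml > Finish
Switch to the Order and Export tab > select conf > Top > OK
Finally: in the Package Explorer on the left, right-click the build.xml file under the nutch1.7 folder > Run As > Ant Build (then wait for it to finish)
In the Package Explorer on the left, right-click the nutch1.7 folder > Refresh
In the Package Explorer on the left, right-click the nutch1.7 folder > Build Path > Configure Build Path... > select the Libraries tab > Add Class Folder... > select build >
wait for it to finish
OK, the whole project is now imported, with no red error markers.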