crawl started in: crwal
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=null
topN = 2
Injector: starting at 2012-04-20 14:39:30
Injector: crawlDb: crwal/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 5 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
... 10 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 13 more
Caused by: java.lang.IllegalArgumentException: plugin.folders is not defined
at org.apache.nutch.plugin.PluginManifestParser.parsePluginFolder(PluginManifestParser.java:78)
at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:72)
at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:117)
at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
... 18 more
12/04/20 10:14:44 INFO mapred.JobClient: map 0% reduce 0%
12/04/20 10:14:44 INFO mapred.JobClient: Job complete: job_local_0001
12/04/20 10:14:44 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
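First of all, please don't blame me for pasting so much error output; it is here only so that people searching for this error can find this page more easily.
The fix is to edit the following property in nutch-default.xml: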
<property>
<name>plugin.folders</name>
<value>./src/plugin</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
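Change the <value> entry (highlighted in red in the original post), set here to ./src/plugin, and the error goes away.
Good luck, everyone!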
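As a quick sanity check on the property shown above, the following is a minimal sketch in plain Java (standard library only, not Nutch's own configuration loader) that reads a Hadoop-style configuration file and reports whether plugin.folders is defined. The class name PluginFoldersCheck, the helper findProperty, and the default path conf/nutch-default.xml are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: looks up one property in a
// <configuration><property><name/><value/></property></configuration> file.
public class PluginFoldersCheck {

    /** Returns the trimmed <value> of the named property, or null if it is absent. */
    public static String findProperty(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                NodeList names = prop.getElementsByTagName("name");
                NodeList values = prop.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            // Wrap parser and I/O errors so callers need no checked-exception handling.
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed default location; pass another path as the first argument.
        Path conf = Paths.get(args.length > 0 ? args[0] : "conf/nutch-default.xml");
        if (!Files.isRegularFile(conf)) {
            System.out.println("Config file not found: " + conf.toAbsolutePath());
            return;
        }
        String xml = new String(Files.readAllBytes(conf), StandardCharsets.UTF_8);
        String folders = findProperty(xml, "plugin.folders");
        if (folders == null || folders.isEmpty()) {
            System.out.println("plugin.folders is not defined in " + conf);
        } else {
            // Treats the value as a single path relative to the working directory.
            System.out.println("plugin.folders = " + folders + " (directory exists here: "
                    + Files.isDirectory(Paths.get(folders)) + ")");
        }
    }
}
```

Run from the project root, this only confirms what the file contains; Nutch itself resolves relative values against the classpath, as the property description says.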
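As a supplement, here are the steps for running Nutch in Eclipse. It took me a whole day to get this working, and thanks go to my classmate Beibei. Haha.
http://wiki.apache.org/nutch/RunNutchInEclipse (the authoritative English reference)
Preparation
1. Install the Subclipse plugin, the IvyDE plugin, and the Maven plugin.
2. Check out the code from https://svn.apache.org/repos/asf/nutch/trunk
3. Remove src from the source folders, then add src/bin, src/java, src/test, src/testsource, src/plugin/xx/src/java, and src/plugin/xx/src/test as source folders (xx stands for each plugin's directory name).
4. Add the two JAR files; the English wiki page explains which ones.
5. On the Libraries tab, click Add Class Folder on the right and select Nutch's conf directory.
6. Still on the Libraries tab, click Add Library > IvyDE Managed Dependencies and select ivy/ivy.xml.
7. Run ant on build.xml.
8. Refresh the nutch project; nutch-site.xml and regex-urlfilter.xml now appear under conf. Configure their contents.
9. In nutch-default.xml, modify the following property: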
<property>
<name>plugin.folders</name>
<value>./src/plugin</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
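This step is critical.
10. In the project root, create a folder named urls containing a file seed.txt, and list the URLs of the pages to crawl in seed.txt.
11. Build again with build.xml (ant).
12. Run it.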
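Step 10 can also be done programmatically. The following is a minimal sketch, assuming one URL per line in seed.txt; the class name SeedFileSetup, the helper writeSeedFile, and the seed URL http://nutch.apache.org/ are illustrative choices, not something the original post prescribes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: prepares the seed list for a crawl.
public class SeedFileSetup {

    /** Creates root/urls/seed.txt with one URL per line and returns its path. */
    public static Path writeSeedFile(Path root, List<String> seedUrls) {
        try {
            // Creates the urls directory if needed; does nothing if it already exists.
            Path urlsDir = Files.createDirectories(root.resolve("urls"));
            Path seedFile = urlsDir.resolve("seed.txt");
            // Overwrites any existing seed.txt with the given URLs.
            Files.write(seedFile, seedUrls, StandardCharsets.UTF_8);
            return seedFile;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumes the working directory is the Nutch project root.
        Path seedFile = writeSeedFile(Paths.get("."),
                Arrays.asList("http://nutch.apache.org/"));
        System.out.println("Wrote " + seedFile);
    }
}
```

The resulting urls directory matches the rootUrlDir = urls line in the crawl output at the top of this post.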