Error when running Nutch in Eclipse

Time: 2022-11-18 15:56:20
solrUrl is not set, indexing will be skipped...
crawl started in: crwal
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=null
topN = 2
Injector: starting at 2012-04-20 14:39:30
Injector: crawlDb: crwal/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 5 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    ... 10 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 13 more
Caused by: java.lang.IllegalArgumentException: plugin.folders is not defined
    at org.apache.nutch.plugin.PluginManifestParser.parsePluginFolder(PluginManifestParser.java:78)
    at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:72)
    at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
    at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:117)
    at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
    ... 18 more

12/04/20 10:14:44 INFO mapred.JobClient:  map 0% reduce 0%
12/04/20 10:14:44 INFO mapred.JobClient: Job complete: job_local_0001
12/04/20 10:14:44 INFO mapred.JobClient: Counters: 0

Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

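First of all, please don't blame me for pasting so much error output; I did it only so that people searching for these messages can find this page more easily.

The fix is to change the following property in nutch-default.xml: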

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

Just change the part marked in red (the <value> element, set to ./src/plugin above) and it will work.
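
If you want to double-check that the edit really took effect, a small stand-alone check like the one below may help. It is only a sketch that uses the JDK's XML parser, not Nutch's or Hadoop's own configuration loader; the class name PluginFoldersCheck and the default path conf/nutch-default.xml are my assumptions, so adjust them to your workspace.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: reads one property from a
// Hadoop-style configuration file such as nutch-default.xml.
class PluginFoldersCheck {

    /** Returns the <value> of the named <property>, or null if it is absent. */
    static String readProperty(String xml, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            NodeList names = prop.getElementsByTagName("name");
            NodeList values = prop.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && name.equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Assumed location of the config file, relative to the project root.
        String path = args.length > 0 ? args[0] : "conf/nutch-default.xml";
        String xml = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        String value = readProperty(xml, "plugin.folders");
        System.out.println("plugin.folders = " + value);
        if (value != null) {
            // Only checks the path against the current working directory.
            System.out.println("directory exists: " + Files.isDirectory(Paths.get(value)));
        }
    }
}
```

Run it from the project root. If it prints null, or the directory does not exist, the crawl will most likely hit the same plugin.folders error again.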

Good luck, everyone!


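As a supplement, here are the steps for running Nutch in Eclipse. It took me a whole day to get this working, and I owe thanks to my classmate Beibei. Haha.

http://wiki.apache.org/nutch/RunNutchInEclipse is the authoritative reference, in English.

Preparation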

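1. Install the Subclipse plugin, the IvyDE plugin, and the Maven plugin.

2. Check out the code from https://svn.apache.org/repos/asf/nutch/trunk

3. Remove src as a source folder, then add src/bin, src/java, src/test, src/testsource, src/plugin/xx/src/java, and src/plugin/xx/src/test as source folders.

4. Add the two JAR files; the English wiki page explains which ones, and it is easy to follow.

5. On the Libraries tab, click Add Class Folder on the right and select Nutch's conf directory.

6. Still on the Libraries tab, click Add Library > IvyDE Managed Dependencies > and select ivy/ivy.xml.

7. Run ant on build.xml.

8. Refresh the Nutch project. nutch-site.xml and regex-urlfilter.xml now appear under conf; fill in their configuration.

9. In nutch-default.xml, modify: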

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

This step is critical.

10. In the project root, create a folder named urls containing a file seed.txt, and list in seed.txt the URLs of the pages you want to crawl.
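
You can of course create the folder and file by hand. If you prefer to script it, here is a minimal sketch in plain Java. The class name SeedFileSetup and the URL http://example.com/ are placeholders of mine, not anything Nutch ships; replace the URL with the sites you actually want to crawl.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: creates <projectRoot>/urls/seed.txt with one URL per line.
class SeedFileSetup {

    static Path writeSeeds(Path projectRoot, List<String> seedUrls) throws IOException {
        Path urlsDir = projectRoot.resolve("urls");
        Files.createDirectories(urlsDir);            // no error if it already exists
        Path seedFile = urlsDir.resolve("seed.txt");
        Files.write(seedFile, seedUrls, StandardCharsets.UTF_8); // overwrites an existing file
        return seedFile;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; the current directory is assumed to be the project root.
        Path seed = writeSeeds(Paths.get("."), Arrays.asList("http://example.com/"));
        System.out.println("Wrote " + seed.toAbsolutePath());
    }
}
```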

11. Build again with build.xml (ant).

12. Run it.
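
For the run itself, the stack trace above shows the entry point is org.apache.nutch.crawl.Crawl, and the log shows the values I used (urls, crwal, 10 threads, depth 2, topN 2). The sketch below just assembles a matching program-argument string for the Eclipse run configuration. The flag names -dir, -threads, -depth and -topN are my assumption about the Crawl command line in this Nutch version, so check them against the usage message Crawl prints; CrawlArgs is a made-up helper name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds program arguments for org.apache.nutch.crawl.Crawl.
class CrawlArgs {

    static String[] build(String urlDir, String crawlDir, int threads, int depth, int topN) {
        List<String> args = new ArrayList<>();
        args.add(urlDir);                                          // seed directory comes first
        args.addAll(Arrays.asList("-dir", crawlDir));              // assumed flag name
        args.addAll(Arrays.asList("-threads", String.valueOf(threads)));
        args.addAll(Arrays.asList("-depth", String.valueOf(depth)));
        args.addAll(Arrays.asList("-topN", String.valueOf(topN)));
        return args.toArray(new String[0]);
    }

    public static void main(String[] ignored) {
        // Values taken from the log at the top of this post.
        System.out.println(String.join(" ", build("urls", "crwal", 10, 2, 2)));
        // prints: urls -dir crwal -threads 10 -depth 2 -topN 2
    }
}
```

Paste the printed line into the Program arguments box of the run configuration, with Crawl as the main class.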
