In-Depth Analysis of the Injector Job (Category: H3_NUTCH, 2015-03-10 15:44)

The main job of the Injector Job is to create a table in HBase named after the crawlId and to inject the seed URLs from a text file into that table.

(I) Command Execution

1. Run the command

[jediael@master local]$ bin/nutch inject seeds/ -crawlId sourcetest
InjectorJob: starting at 2015-03-10 14:59:19
InjectorJob: Injecting urlDir: seeds
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2015-03-10 14:59:26, elapsed: 00:00:06
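
For reference, the seeds directory holds plain-text files with one URL per line; a line may additionally carry tab-separated name=value pairs for a custom score, a custom fetch interval, or arbitrary metadata (the mapper in part (II) shows exactly how they are parsed). A hypothetical seed file, assuming the default metadata key names nutch.score and nutch.fetchInterval, with fields separated by tabs:

# lines starting with # are ignored
http://money.163.com/
http://www.example.com/	nutch.score=2.0	nutch.fetchInterval=86400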

2. Inspect the table contents

hbase(main):004:0> scan 'sourcetest_webpage'
ROW COLUMN+CELL
com.163.money:http/ column=f:fi, timestamp=1425970761871, value=\x00'\x8D\x00
com.163.money:http/ column=f:ts, timestamp=1425970761871, value=\x00\x00\x01L\x02{\x08_
com.163.money:http/ column=mk:_injmrk_, timestamp=1425970761871, value=y
com.163.money:http/ column=mk:dist, timestamp=1425970761871, value=0
com.163.money:http/ column=mtdt:_csh_, timestamp=1425970761871, value=?\x80\x00\x00
com.163.money:http/ column=s:s, timestamp=1425970761871, value=?\x80\x00\x00
1 row(s) in 0.0430 seconds
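
The cell values are raw big-endian bytes: f:fi is the fetch interval, f:ts the fetch time, s:s the score, and mtdt:_csh_ is, as far as I can tell, the initial "cash" kept by the OPIC scoring filter. A small self-contained sketch that decodes the two 4-byte values above:

import java.nio.ByteBuffer;

public class DecodeCells {
  public static void main(String[] args) {
    // f:fi = \x00\x27\x8D\x00, a big-endian int: the fetch interval
    byte[] fi = { 0x00, 0x27, (byte) 0x8D, 0x00 };
    System.out.println(ByteBuffer.wrap(fi).getInt());  // 2592000 s = 30 days

    // s:s = \x3F\x80\x00\x00 ('?' is 0x3F), an IEEE-754 float: the score
    byte[] s = { 0x3F, (byte) 0x80, 0x00, 0x00 };
    System.out.println(ByteBuffer.wrap(s).getFloat()); // 1.0
  }
}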

3. Read the content back from the database

Because the HBase table stores its values as raw bytes, use the following command to dump the records in readable form:

[jediael@master local]$ bin/nutch readdb  -dump ./test -crawlId sourcetest -content
WebTable dump: starting
WebTable dump: done
[jediael@master local]$ cat test/part-r-00000
http://money.163.com/ key: com.163.money:http/
baseUrl: null
status: 0 (null)
fetchTime: 1425970759775
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
marker _injmrk_ : y
marker dist : 0
reprUrl: null
metadata _csh_ : ?锟
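
The dump is simply a readable rendering of the row scanned above: fetchInterval 2592000 and score 1.0 correspond to the f:fi and s:s cells decoded earlier, the score presumably being the db.score.injected default of 1.0, and the _csh_ metadata prints as garbage because it is that same float stored as raw bytes.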

(II) Source Code Walkthrough

Class: org.apache.nutch.crawl.InjectorJob

1. Program entry point

public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(NutchConfiguration.create(), new InjectorJob(),
      args);
  System.exit(res);
}

2. run(String[] args), invoked via ToolRunner

This step mainly calls the inject() method; everything else is argument validation.

public int run(String[] args) throws Exception {
  …………
  inject(new Path(args[0]));
  …………
}
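
The elided code is argument handling: it validates the arguments and, if I read the 2.x source correctly, stores the -crawlId value into the configuration (storage.crawl.id) so that Gora later uses it as a table-name prefix.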

3. The inject() method

Nutch runs all of its jobs through Map<String, Object> run(Map<String, Object> args): the job takes a Map of named arguments and returns a Map of results.

public void inject(Path urlDir) throws Exception {
  run(ToolUtil.toArgMap(Nutch.ARG_SEEDDIR, urlDir));
}
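
ToolUtil.toArgMap just packs alternating key/value varargs into a Map. A minimal sketch of the idea (the real implementation lives in org.apache.nutch.util.ToolUtil; this sketch uses java.util.HashMap):

public static Map<String, Object> toArgMap(Object... args) {
  if (args == null) {
    return null;
  }
  if (args.length % 2 != 0) {
    throw new RuntimeException("expected name/value pairs");
  }
  Map<String, Object> res = new HashMap<String, Object>();
  for (int i = 0; i < args.length; i += 2) {
    res.put(String.valueOf(args[i]), args[i + 1]);  // name -> value
  }
  return res;
}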

4. Configure the job and create the HBase table

public Map<String, Object> run(Map<String, Object> args) throws Exception {
  numJobs = 1;
  currentJobNum = 0;
  currentJob = new NutchJob(getConf(), "inject " + input);
  FileInputFormat.addInputPath(currentJob, input);
  currentJob.setMapperClass(UrlMapper.class);
  currentJob.setMapOutputKeyClass(String.class);
  currentJob.setMapOutputValueClass(WebPage.class);
  currentJob.setOutputFormatClass(GoraOutputFormat.class);

  // createWebStore creates the backing HBase table if it does not exist yet
  DataStore<String, WebPage> store = StorageUtils.createWebStore(
      currentJob.getConfiguration(), String.class, WebPage.class);
  GoraOutputFormat.setOutput(currentJob, store, true);

  // map-only job: identity Reducer, zero reduce tasks
  currentJob.setReducerClass(Reducer.class);
  currentJob.setNumReduceTasks(0);

  currentJob.waitForCompletion(true);
  ToolUtil.recordJobStatus(null, currentJob, results);
  return results;
}
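
This is where the sourcetest_webpage table seen in part (I) comes from. A rough sketch of the naming rule as I understand it (webpage is the default schema name for the WebPage class, and the -crawlId value, stored under storage.crawl.id, becomes a prefix):

// Assumed naming rule, for illustration only
String crawlId = conf.get("storage.crawl.id", "");  // "sourcetest"
String schema = "webpage";                          // default schema for WebPage
String table = crawlId.isEmpty() ? schema : crawlId + "_" + schema;
// -> "sourcetest_webpage"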

5. The mapper

Since the Injector Job runs with zero reduce tasks, only the mapper matters.

The mapper does three things:

(1) Parse each line of the seed file and extract the optional inline parameters (custom score, custom fetch interval, extra metadata).

(2) Normalize the URL and drop it if any URL filter rejects it.

(3) Reverse the URL to form the row key, build a WebPage object as the value, and write the pair into the table (see the reverseUrl sketch after the code).

protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String url = value.toString().trim(); // value is line of text

  if (url != null && (url.length() == 0 || url.startsWith("#"))) {
    /* Ignore line that start with # */
    return;
  }

  // if tabs : metadata that could be stored
  // must be name=value and separated by \t
  float customScore = -1f;
  int customInterval = interval;
  Map<String, String> metadata = new TreeMap<String, String>();
  if (url.indexOf("\t") != -1) {
    String[] splits = url.split("\t");
    url = splits[0];
    for (int s = 1; s < splits.length; s++) {
      // find separation between name and value
      int indexEquals = splits[s].indexOf("=");
      if (indexEquals == -1) {
        // skip anything without a =
        continue;
      }
      String metaname = splits[s].substring(0, indexEquals);
      String metavalue = splits[s].substring(indexEquals + 1);
      if (metaname.equals(nutchScoreMDName)) {
        try {
          customScore = Float.parseFloat(metavalue);
        } catch (NumberFormatException nfe) {
        }
      } else if (metaname.equals(nutchFetchIntervalMDName)) {
        try {
          customInterval = Integer.parseInt(metavalue);
        } catch (NumberFormatException nfe) {
        }
      } else
        metadata.put(metaname, metavalue);
    }
  }

  try {
    url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
    url = filters.filter(url); // filter the url
  } catch (Exception e) {
    LOG.warn("Skipping " + url + ":" + e);
    url = null;
  }

  if (url == null) {
    context.getCounter("injector", "urls_filtered").increment(1);
    return;
  } else { // if it passes
    String reversedUrl = TableUtil.reverseUrl(url); // collect it
    WebPage row = WebPage.newBuilder().build();
    row.setFetchTime(curTime);
    row.setFetchInterval(customInterval);

    // now add the metadata
    Iterator<String> keysIter = metadata.keySet().iterator();
    while (keysIter.hasNext()) {
      String keymd = keysIter.next();
      String valuemd = metadata.get(keymd);
      row.getMetadata().put(new Utf8(keymd),
          ByteBuffer.wrap(valuemd.getBytes()));
    }

    if (customScore != -1)
      row.setScore(customScore);
    else
      row.setScore(scoreInjected);

    try {
      scfilters.injectedScore(url, row);
    } catch (ScoringFilterException e) {
      if (LOG.isWarnEnabled()) {
        LOG.warn("Cannot filter injected score for url " + url
            + ", using default (" + e.getMessage() + ")");
      }
    }

    context.getCounter("injector", "urls_injected").increment(1);
    row.getMarkers()
        .put(DbUpdaterJob.DISTANCE, new Utf8(String.valueOf(0)));
    Mark.INJECT_MARK.putMark(row, YES_STRING);
    context.write(reversedUrl, row);
  }
}
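
TableUtil.reverseUrl deserves a note: reversing the host labels makes all pages of one domain sort next to each other in HBase, which is why http://money.163.com/ appeared under the row key com.163.money:http/ in part (I). Below is a simplified, self-contained re-implementation for illustration only; the real org.apache.nutch.util.TableUtil.reverseUrl also handles non-default ports and other edge cases.

import java.net.URL;

public class ReverseUrlSketch {
  // Simplified version of TableUtil.reverseUrl, for illustration only.
  static String reverseUrl(String urlString) throws Exception {
    URL url = new URL(urlString);
    String[] labels = url.getHost().split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = labels.length - 1; i >= 0; i--) { // reverse host labels
      sb.append(labels[i]);
      if (i > 0) sb.append('.');
    }
    sb.append(':').append(url.getProtocol()).append(url.getFile());
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(reverseUrl("http://money.163.com/")); // com.163.money:http/
  }
}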

(III) Key Source Code to Study
