Wrestling with Nutch, Part 1

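Problem:
When running generate with Nutch 1.0 against a crawldb holding 500 million URLs, the input is split by default into 64 MB chunks, which produces 777 map tasks. Late in the run, the following exception appeared:
Could not find taskTracker/jobcache/job_200903231519_0017/attempt_200903231519_0017_r_000051_0/output/file.out in any of the configured local directories
Solution:
Reduce the number of tasks by switching to a strategy that creates one split per file in the crawldb: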

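// Imports needed by the enclosing source file:
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public static class InputFormat extends SequenceFileInputFormat<WritableComparable, Writable> {
  /** Don't split inputs, to keep things polite. */
  @Override
  public InputSplit[] getSplits(JobConf job, int nSplits)
      throws IOException {
    // One split per input file, regardless of the requested nSplits.
    FileStatus[] files = listStatus(job);
    InputSplit[] splits = new InputSplit[files.length];
    for (int i = 0; i < files.length; i++) {
      FileStatus cur = files[i];
      // Each split covers the whole file; no host locality hints are given.
      splits[i] = new FileSplit(cur.getPath(), 0,
          cur.getLen(), (String[]) null);
    }
    return splits;
  }
}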
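
To get a feel for how much the per-file strategy reduces the task count, here is a rough, self-contained sketch that compares the two approaches. The file sizes (10 files of 5 GB each) are hypothetical and not taken from the crawldb above, and the block-based count is only an approximation of what Hadoop computes, since it ignores details such as the minimum split size.

```java
import java.util.Arrays;

public class SplitCountSketch {
    // 64 MB chunk size, as mentioned in the problem description.
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    /** Approximate split count when each file is cut into block-sized pieces. */
    static long blockBasedSplits(long[] fileLengths, long blockSize) {
        long total = 0;
        for (long len : fileLengths) {
            total += (len + blockSize - 1) / blockSize; // ceiling division
        }
        return total;
    }

    /** Split count when each file becomes exactly one split. */
    static long perFileSplits(long[] fileLengths) {
        return fileLengths.length;
    }

    public static void main(String[] args) {
        // Hypothetical input: 10 files of 5 GB each.
        long[] files = new long[10];
        Arrays.fill(files, 5L * 1024 * 1024 * 1024);

        System.out.println("block-based splits: " + blockBasedSplits(files, BLOCK_SIZE)); // 800
        System.out.println("per-file splits:    " + perFileSplits(files));               // 10
    }
}
```

With per-file splits, the number of map tasks depends only on how many files the crawldb contains, so each task runs longer but there are far fewer of them.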

This raised a new problem: several tasks showed no activity for ten minutes, which caused the whole job to fail.
Solution:
Edit hadoop-site.xml:

<property>
  <name>mapred.task.timeout</name>
  <value>3600000</value>
  <description>The number of milliseconds before a task will be
  terminated if it neither reads an input, writes an output, nor
  updates its status string.
  </description>
</property>
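
The value is given in milliseconds, so it is easy to misread. The small sketch below converts the two timeouts into minutes and checks them against a task that stays silent for 25 minutes. The 25-minute figure and the `wouldTimeOut` helper are hypothetical illustrations, not Hadoop code.

```java
import java.time.Duration;

public class TaskTimeoutSketch {
    // Ten minutes, the timeout that killed the tasks described above.
    static final long DEFAULT_TIMEOUT_MS = Duration.ofMinutes(10).toMillis();
    // The raised value from the configuration above.
    static final long RAISED_TIMEOUT_MS = 3600000L;

    /** Hypothetical check: has a task been silent longer than the timeout? */
    static boolean wouldTimeOut(long silentMs, long timeoutMs) {
        return silentMs > timeoutMs;
    }

    public static void main(String[] args) {
        // Assumed example: a task that reports nothing for 25 minutes.
        long silentMs = Duration.ofMinutes(25).toMillis();

        System.out.println("raised timeout in minutes: "
            + Duration.ofMillis(RAISED_TIMEOUT_MS).toMinutes());
        System.out.println("killed at 10 min timeout: "
            + wouldTimeOut(silentMs, DEFAULT_TIMEOUT_MS));
        System.out.println("killed at 60 min timeout: "
            + wouldTimeOut(silentMs, RAISED_TIMEOUT_MS));
    }
}
```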

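Summary:
Big versus small, many versus few, long versus short: the right choice keeps changing with the circumstances. With large data volumes it is all the more important to adapt to the actual situation. As the saying goes, the art of application lies in one's own judgment.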
