seq2sparse corresponds to org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles in Mahout. From the job-monitoring page of yesterday's run we can see that this step consists of 7 jobs: (1) DocumentTokenizer, (2) WordCount, (3) MakePartialVectors, (4) MergePartialVectors, (5) VectorTfIdf Document Frequency Count, (6) MakePartialVectors, (7) MergePartialVectors. Printing the parameter help of SparseVectorsFromSequenceFiles shows the following:
[java]
Usage:
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default
                                      Value: 2
  --analyzerName (-a) analyzerName    The class name of the analyzer
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output                The directory pathname for output.
  --input (-i) input                  Path to job input directory.
  --minDF (-md) minDF                 The minimum document frequency. Default
                                      is 1
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors
                                      to be used, expressed in times the
                                      standard deviation (sigma) of the
                                      document frequencies of these vectors.
                                      Can be used to remove really high
                                      frequency terms. Expressed as a double
                                      value. Good value to be specified is
                                      3.0. In case the value is less than 0 no
                                      vectors will be filtered out. Default is
                                      -1.0. Overrides maxDFPercent
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.
                                      Can be used to remove really high
                                      frequency terms. Expressed as an integer
                                      between 0 and 100. Default is 99. If
                                      maxDFSigma is also set, it will override
                                      this value.
  --weight (-wt) weight               The kind of weight to use. Currently TF
                                      or TFIDF
  --norm (-n) norm                    The norm to use, expressed as either a
                                      float or "INF" if you want to use the
                                      Infinite norm. Must be greater or equal
                                      to 0. The default is not to normalize
  --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood
                                      Ratio(Float). Default is 1.0
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
                                      Default Value: 1
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
                                      create (2 = bigrams, 3 = trigrams, etc).
                                      Default Value: 1
  --overwrite (-ow)                   If set, overwrite the output directory
  --help (-h)                         Print out help
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
                                      be SequentialAccessVectors. If set true
                                      else false
  --namedVector (-nv)                 (Optional) Whether output vectors should
                                      be NamedVectors. If set true else false
  --logNormalize (-lnorm)             (Optional) Whether output vectors should
                                      be logNormalize. If set true else false
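The help text above can be reproduced without a cluster by invoking the driver class directly with --help. This is only a sketch; it assumes SparseVectorsFromSequenceFiles exposes the standard main entry point that the bin/mahout script calls:
[java]
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class PrintSeq2SparseHelp {

  public static void main(String[] args) throws Exception {
    // passing --help makes the option parser print the usage text and return
    SparseVectorsFromSequenceFiles.main(new String[] {"--help"});
  }
}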
In the terminal output of yesterday's run, this step was invoked with the following command:
[python]
./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf
Looking only at the parameters we actually used: -lnorm means the output vectors are normalized with the log function (true when the flag is set); -nv means the output vectors are written as named vectors (what exactly "named" means here is not clear yet); -wt tfidf selects the weighting scheme, see http://zh.wikipedia.org/wiki/TF-IDF.
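As a rough illustration of what -wt tfidf selects, here is the classic TF-IDF formula as a minimal sketch. This is only for intuition; Mahout's actual TFIDF class (backed by a Lucene similarity) applies its own tf/idf smoothing, so the exact numbers will differ:
[java]
// Illustrative only: classic TF-IDF, not Mahout's exact implementation.
public class TfIdfSketch {

  // weight of a term = term frequency in the document * inverse document frequency
  static double tfIdf(int termFreqInDoc, int docFreq, int numDocs) {
    double tf = termFreqInDoc;                          // raw term frequency
    double idf = Math.log((double) numDocs / docFreq);  // rarer terms get a larger idf
    return tf * idf;
  }

  public static void main(String[] args) {
    // a term appearing 3 times in a document and present in 10 of 1000 documents
    System.out.println(tfIdf(3, 10, 1000)); // ~13.82
  }
}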
Step (1) is invoked at line 253 of SparseVectorsFromSequenceFiles:
[java]
DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);
Stepping into this call shows that the Mapper used is SequenceFileTokenizerMapper and that there is no Reducer. The Mapper code is as follows:
[java]
protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  // tokenize the document text with the Analyzer configured in setup()
  TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
  CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
  StringTuple document = new StringTuple();
  stream.reset();
  while (stream.incrementToken()) {
    if (termAtt.length() > 0) {
      // append each non-empty token to the output tuple
      document.add(new String(termAtt.buffer(), 0, termAtt.length()));
    }
  }
  context.write(key, document);
}
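Since no Reducer is configured, the job's output is simply what this Mapper emits: a SequenceFile mapping Text keys (document ids) to StringTuple values (the token lists). Below is a minimal sketch for inspecting that output; the part-file path is only an assumption and has to be adjusted to the actual tokenized-documents directory under the seq2sparse output path:
[java]
package mahout.fansy.test.bayes;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.StringTuple;

public class ReadTokenizedDocuments {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // assumed location of one output part file of the DocumentTokenizer job
    Path path = new Path("/home/mahout/mahout-work-mahout/20news-vectors/tokenized-documents/part-m-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
    Text key = new Text();
    StringTuple value = new StringTuple();
    while (reader.next(key, value)) {
      // key is the document id, value holds the tokens produced by the Mapper
      System.out.println(key + " => " + value.getEntries());
    }
    reader.close();
  }
}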
The Mapper's setup function mainly sets up the Analyzer. For the Analyzer API, see http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html; the method used inside map is reusableTokenStream(String fieldName, Reader reader): "Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method."
I wrote the following test program:
[java]
package mahout.fansy.test.bayes;

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.ClassUtils;
import org.apache.mahout.common.StringTuple;
import org.apache.mahout.vectorizer.DefaultAnalyzer;
import org.apache.mahout.vectorizer.DocumentProcessor;

public class TestSequenceFileTokenizerMapper {

  // same analyzer that the Mapper's setup() creates
  private static Analyzer analyzer = ClassUtils.instantiateAs("org.apache.mahout.vectorizer.DefaultAnalyzer",
      Analyzer.class);

  public static void main(String[] args) throws IOException {
    testMap();
  }

  public static void testMap() throws IOException {
    Text key = new Text("4096");
    Text value = new Text("today is also late.what about tomorrow?");
    // replay the body of SequenceFileTokenizerMapper.map() on a single record
    TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    StringTuple document = new StringTuple();
    stream.reset();
    while (stream.incrementToken()) {
      if (termAtt.length() > 0) {
        document.add(new String(termAtt.buffer(), 0, termAtt.length()));
      }
    }
    System.out.println("key:" + key.toString() + ",document" + document);
  }

}
The output is as follows:
[plain]
key:4096,document[today, also, late.what, about, tomorrow]
The TokenStream has a stopwords attribute whose value is: [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of], so whenever one of these words is encountered it is simply skipped.
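This list matches Lucene's standard English stop word set, which the default analyzer appears to delegate to. The following small check (an assumption: Lucene 3.x with DefaultAnalyzer wrapping StandardAnalyzer) shows how to test whether a given token would be dropped:
[java]
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class StopWordCheck {

  public static void main(String[] args) {
    // assumption: the stopwords attribute seen above is StandardAnalyzer's default set
    System.out.println(StandardAnalyzer.STOP_WORDS_SET.contains("the"));      // true: filtered out
    System.out.println(StandardAnalyzer.STOP_WORDS_SET.contains("tomorrow")); // false: kept
  }
}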