seq2sparse corresponds to org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles in Mahout. From the job-monitoring page of yesterday's run we can see that this step consists of 7 jobs: (1) DocumentTokenizer, (2) WordCount, (3) MakePartialVectors, (4) MergePartialVectors, (5) VectorTfIdf Document Frequency Count, (6) MakePartialVectors, (7) MergePartialVectors. Printing the parameter help of SparseVectorsFromSequenceFiles shows the following:
[java]
Usage:
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default
                                      Value: 2
  --analyzerName (-a) analyzerName    The class name of the analyzer
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output                The directory pathname for output.
  --input (-i) input                  Path to job input directory.
  --minDF (-md) minDF                 The minimum document frequency. Default
                                      is 1
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors
                                      to be used, expressed in times the
                                      standard deviation (sigma) of the
                                      document frequencies of these vectors.
                                      Can be used to remove really high
                                      frequency terms. Expressed as a double
                                      value. Good value to be specified is
                                      3.0. In case the value is less than 0 no
                                      vectors will be filtered out. Default is
                                      -1.0. Overrides maxDFPercent
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.
                                      Can be used to remove really high
                                      frequency terms. Expressed as an integer
                                      between 0 and 100. Default is 99. If
                                      maxDFSigma is also set, it will override
                                      this value.
  --weight (-wt) weight               The kind of weight to use. Currently TF
                                      or TFIDF
  --norm (-n) norm                    The norm to use, expressed as either a
                                      float or "INF" if you want to use the
                                      Infinite norm. Must be greater or equal
                                      to 0. The default is not to normalize
  --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood
                                      Ratio(Float). Default is 1.0
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
                                      Default Value: 1
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
                                      create (2 = bigrams, 3 = trigrams, etc).
                                      Default Value: 1
  --overwrite (-ow)                   If set, overwrite the output directory
  --help (-h)                         Print out help
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
                                      be SequentialAccessVectors. If set true
                                      else false
  --namedVector (-nv)                 (Optional) Whether output vectors should
                                      be NamedVectors. If set true else false
  --logNormalize (-lnorm)             (Optional) Whether output vectors should
                                      be logNormalize. If set true else false
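The help text above can be reproduced without a cluster by invoking the driver class directly with --help. This is only a sketch; it assumes SparseVectorsFromSequenceFiles exposes the standard main entry point that the bin/mahout script calls:
[java]
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class PrintSeq2SparseHelp {

  public static void main(String[] args) throws Exception {
    // passing --help makes the option parser print the usage text and return
    SparseVectorsFromSequenceFiles.main(new String[] {"--help"});
  }
}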
In the terminal output of yesterday's run, this step was invoked with the following command:
[python]
./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf
Looking only at the parameters we actually used: -lnorm means the output vectors are normalized with the log function (true when the flag is set); -nv means the output vectors are written as named vectors (what exactly "named" means here is not clear yet); -wt tfidf selects the weighting scheme, see http://zh.wikipedia.org/wiki/TF-IDF.
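As a rough illustration of what -wt tfidf selects, here is the classic TF-IDF formula as a minimal sketch. This is only for intuition; Mahout's actual TFIDF class (backed by a Lucene similarity) applies its own tf/idf smoothing, so the exact numbers will differ:
[java]
// Illustrative only: classic TF-IDF, not Mahout's exact implementation.
public class TfIdfSketch {

  // weight of a term = term frequency in the document * inverse document frequency
  static double tfIdf(int termFreqInDoc, int docFreq, int numDocs) {
    double tf = termFreqInDoc;                          // raw term frequency
    double idf = Math.log((double) numDocs / docFreq);  // rarer terms get a larger idf
    return tf * idf;
  }

  public static void main(String[] args) {
    // a term appearing 3 times in a document and present in 10 of 1000 documents
    System.out.println(tfIdf(3, 10, 1000)); // ~13.82
  }
}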
Step (1) is invoked at line 253 of SparseVectorsFromSequenceFiles:
[java]
DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);
Stepping into this call shows that the Mapper used is SequenceFileTokenizerMapper and that there is no Reducer. The Mapper code is as follows:
[java]
protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  // tokenize the document text with the Analyzer configured in setup()
  TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
  CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
  StringTuple document = new StringTuple();
  stream.reset();
  while (stream.incrementToken()) {
    if (termAtt.length() > 0) {
      // append each non-empty token to the output tuple
      document.add(new String(termAtt.buffer(), 0, termAtt.length()));
    }
  }
  context.write(key, document);
}
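Since no Reducer is configured, the job's output is simply what this Mapper emits: a SequenceFile mapping Text keys (document ids) to StringTuple values (the token lists). Below is a minimal sketch for inspecting that output; the part-file path is only an assumption and has to be adjusted to the actual tokenized-documents directory under the seq2sparse output path:
[java]
package mahout.fansy.test.bayes;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.StringTuple;

public class ReadTokenizedDocuments {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // assumed location of one output part file of the DocumentTokenizer job
    Path path = new Path("/home/mahout/mahout-work-mahout/20news-vectors/tokenized-documents/part-m-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
    Text key = new Text();
    StringTuple value = new StringTuple();
    while (reader.next(key, value)) {
      // key is the document id, value holds the tokens produced by the Mapper
      System.out.println(key + " => " + value.getEntries());
    }
    reader.close();
  }
}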
The Mapper's setup function mainly sets up the Analyzer. For the Analyzer API, see http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html; the method used inside map is reusableTokenStream(String fieldName, Reader reader): "Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method."
I wrote the following test program:
[java]
package mahout.fansy.test.bayes;

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.ClassUtils;
import org.apache.mahout.common.StringTuple;
import org.apache.mahout.vectorizer.DefaultAnalyzer;
import org.apache.mahout.vectorizer.DocumentProcessor;

public class TestSequenceFileTokenizerMapper {

  // same analyzer that the Mapper's setup() creates
  private static Analyzer analyzer = ClassUtils.instantiateAs("org.apache.mahout.vectorizer.DefaultAnalyzer",
      Analyzer.class);

  public static void main(String[] args) throws IOException {
    testMap();
  }

  public static void testMap() throws IOException {
    Text key = new Text("4096");
    Text value = new Text("today is also late.what about tomorrow?");
    // replay the body of SequenceFileTokenizerMapper.map() on a single record
    TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    StringTuple document = new StringTuple();
    stream.reset();
    while (stream.incrementToken()) {
      if (termAtt.length() > 0) {
        document.add(new String(termAtt.buffer(), 0, termAtt.length()));
      }
    }
    System.out.println("key:" + key.toString() + ",document" + document);
  }

}
The output is as follows:
[plain]
key:4096,document[today, also, late.what, about, tomorrow]
The TokenStream has a stopwords attribute whose value is: [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of], so whenever one of these words is encountered it is simply skipped.
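This list matches Lucene's standard English stop word set, which the default analyzer appears to delegate to. The following small check (an assumption: Lucene 3.x with DefaultAnalyzer wrapping StandardAnalyzer) shows how to test whether a given token would be dropped:
[java]
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class StopWordCheck {

  public static void main(String[] args) {
    // assumption: the stopwords attribute seen above is StandardAnalyzer's default set
    System.out.println(StandardAnalyzer.STOP_WORDS_SET.contains("the"));      // true: filtered out
    System.out.println(StandardAnalyzer.STOP_WORDS_SET.contains("tomorrow")); // false: kept
  }
}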