Twenty Newsgroups Classification任务之二seq2sparse

时间:2022-09-18 12:00:12

seq2sparse对应于mahout中的org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles,从昨天跑的算法中的任务监控界面可以看到这一步包含了7个Job信息,分别是:(1)DocumentTokenizer(2)WordCount(3)MakePartialVectors(4)MergePartialVectors(5)VectorTfIdf Document Frequency Count(6)MakePartialVectors(7)MergePartialVectors。打印SparseVectorsFromSequenceFiles的参数帮助信息可以看到如下的信息:

Usage:
[--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
--minSupport (-s) minSupport (Optional) Minimum Support. Default
Value: 2
--analyzerName (-a) analyzerName The class name of the analyzer
--chunkSize (-chunk) chunkSize The chunkSize in MegaBytes. 100-10000 MB
--output (-o) output The directory pathname for output.
--input (-i) input Path to job input directory.
--minDF (-md) minDF The minimum document frequency. Default
is 1
--maxDFSigma (-xs) maxDFSigma What portion of the tf (tf-idf) vectors
to be used, expressed in times the
standard deviation (sigma) of the
document frequencies of these vectors.
Can be used to remove really high
frequency terms. Expressed as a double
value. Good value to be specified is 3.0.
In case the value is less then 0 no
vectors will be filtered out. Default is
-1.0. Overrides maxDFPercent
--maxDFPercent (-x) maxDFPercent The max percentage of docs for the DF.
Can be used to remove really high
frequency terms. Expressed as an integer
between 0 and 100. Default is 99. If
maxDFSigma is also set, it will override
this value.
--weight (-wt) weight The kind of weight to use. Currently TF
or TFIDF
--norm (-n) norm The norm to use, expressed as either a
float or "INF" if you want to use the
Infinite norm. Must be greater or equal
to 0. The default is not to normalize
--minLLR (-ml) minLLR (Optional)The minimum Log Likelihood
Ratio(Float) Default is 1.0
--numReducers (-nr) numReducers (Optional) Number of reduce tasks.
Default Value: 1
--maxNGramSize (-ng) ngramSize (Optional) The maximum size of ngrams to
create (2 = bigrams, 3 = trigrams, etc)
Default Value:1
--overwrite (-ow) If set, overwrite the output directory
--help (-h) Print out help
--sequentialAccessVector (-seq) (Optional) Whether output vectors should
be SequentialAccessVectors. If set true
else false
--namedVector (-nv) (Optional) Whether output vectors should
be NamedVectors. If set true else false
--logNormalize (-lnorm) (Optional) Whether output vectors should
be logNormalize. If set true else false

在昨天算法的终端信息中该步骤的调用命令如下:

./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf

我们只看对应的参数,首先是-lnorm 对应的解释为输出向量是否要使用log函数进行归一化(设置则为true),-nv解释为输出向量被设置为named 向量,这里的named是啥意思?(暂时不清楚),-wt tfidf解释为使用权重的算法,具体参考 http://zh.wikipedia.org/wiki/TF-IDF 。

第(1)步在SparseVectorsFromSequenceFiles的253行的:

DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);

这里进入可以看到使用的Mapper是:SequenceFileTokenizerMapper,没有使用Reducer。Mapper的代码如下:

protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
StringTuple document = new StringTuple();
stream.reset();
while (stream.incrementToken()) {
if (termAtt.length() > 0) {
document.add(new String(termAtt.buffer(), 0, termAtt.length()));
}
}
context.write(key, document);
}

该Mapper的setup函数主要设置Analyzer的,关于Analyzer的api参考: http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html ,其中在map中用到的函数为 reusableTokenStream( String fieldName,  Reader reader) :Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method.
编写下面的测试程序:

package mahout.fansy.test.bayes;

import java.io.IOException;
import java.io.StringReader; import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.ClassUtils;
import org.apache.mahout.common.StringTuple;
import org.apache.mahout.vectorizer.DefaultAnalyzer;
import org.apache.mahout.vectorizer.DocumentProcessor; public class TestSequenceFileTokenizerMapper { /**
* @param args
*/
private static Analyzer analyzer = ClassUtils.instantiateAs("org.apache.mahout.vectorizer.DefaultAnalyzer",
Analyzer.class);
public static void main(String[] args) throws IOException {
testMap();
} public static void testMap() throws IOException{
Text key=new Text("4096");
Text value=new Text("today is also late.what about tomorrow?");
TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
StringTuple document = new StringTuple();
stream.reset();
while (stream.incrementToken()) {
if (termAtt.length() > 0) {
document.add(new String(termAtt.buffer(), 0, termAtt.length()));
}
}
System.out.println("key:"+key.toString()+",document"+document);
} }

得出的结果如下:

key:4096,document[today, also, late.what, about, tomorrow]

其中,TokenStream有一个stopwords属性,值为:[but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of],所以当遇到这些单词的时候就不进行计算了。

额,又太晚了。哎,早困了,刷个牙线。。。

分享,快乐,成长

转载请注明出处:http://blog.csdn.net/fansy1990

Twenty Newsgroups Classification任务之二seq2sparse的更多相关文章

  1. Twenty Newsgroups Classification任务之二seq2sparse(5)

    接上篇blog,继续分析.接下来要调用代码如下: // Should document frequency features be processed if (shouldPrune || proce ...

  2. Twenty Newsgroups Classification任务之二seq2sparse(3)

    接上篇,如果想对上篇的问题进行测试其实可以简单的编写下面的代码: package mahout.fansy.test.bayes.write; import java.io.IOException; ...

  3. Twenty Newsgroups Classification任务之二seq2sparse(2)

    接上篇,SequenceFileTokenizerMapper的输出文件在/home/mahout/mahout-work-mahout0/20news-vectors/tokenized-docum ...

  4. mahout 运行Twenty Newsgroups Classification实例

    按照mahout官网https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups的说法,我只用运行一条命令就可以完成这个算法 ...

  5. Twenty Newsgroups Classification实例任务之TrainNaiveBayesJob&lpar;一&rpar;

    接着上篇blog,继续看log里面的信息如下: + echo 'Training Naive Bayes model' Training Naive Bayes model + ./bin/mahou ...

  6. 项目笔记《DeepLung&colon;Deep 3D Dual Path Nets for Automated Pulmonary Nodule Detection and Classification》(二)(上)模型设计

    我只讲讲检测部分的模型,后面两样性分类的试验我没有做,这篇论文采用了很多肺结节检测论文都采用的u-net结构,准确地说是具有DPN结构的3D版本的u-net,直接上图. DPN是颜水成老师团队的成果, ...

  7. 深度学习数据集Deep Learning Datasets

    Datasets These datasets can be used for benchmarking deep learning algorithms: Symbolic Music Datase ...

  8. Open Data for Deep Learning

    Open Data for Deep Learning Here you’ll find an organized list of interesting, high-quality datasets ...

  9. 深度学习课程笔记(二)Classification: Probility Generative Model

    深度学习课程笔记(二)Classification: Probility Generative Model  2017.10.05 相关材料来自:http://speech.ee.ntu.edu.tw ...

随机推荐

  1. connect&sol;express 的参考

    1.Node.js[5] connect & express简介    对connect中间件的分类比较容易理解. http://www.cnblogs.com/luics/archive/2 ...

  2. C语言结构体位域

    demo: typedef struct { int a:2; int b:2; int c:1; }test; int main() { test t; t.a=1; t.b=3; t.c=1; / ...

  3. CentOS修改默认编码为UTF-8,使java程序字符集默认为UTF-8

    java程序在本地接受php的utf8字符串好好的,到了服务器就行了. 解决,修改vi /etc/sysconfig/i18n,修改之后ssh断开,重连后KILL你的java. LANG=" ...

  4. Linux创建用户命令

    创建用户.设置密码.修改用户.删除用户: useradd testuser   创建用户testuser passwd testuser   给已创建的用户testuser设置密码 说明:新创建的用户 ...

  5. Ridge Regression and Ridge Regression Kernel

    Ridge Regression and Ridge Regression Kernel Reference: 1. scikit-learn linear_model ridge regressio ...

  6. WHAT&quest;【 &dollar;&period;fn&period;extend&lpar;&rpar; 】vs【 &dollar;&period;extend&lpar;&rpar; 】

    废话不多说,干货来了,转自http://www.cnblogs.com/hellman/p/4349777.html (function($){ $.fn.extend({ test:function ...

  7. 第四章初始CSS3预习笔记

    第四章 初始CSS3预习笔记 一: 1: 什么是CSS? 全称是层叠样式表;/通常又称为风格样式表,.他是用来进行网页风格设计的; 2:CSS的优势: 1>内容以表现分离,即使用u前面学习的HT ...

  8. 马拉车算法&comma;mannacher查找最长回文子串

    作用: 在线性时间内找到一个字符串的最大回文子串 原理: 奇偶变换:为处理字符串方便,现将给定的任意字符串进行处理,使所有可能的奇数/偶数长度的回文子串都转换成了奇数长度. 具体就是在每个字符的两边都 ...

  9. ES6 字符串

    拓展的方法 子串的识别 ES6 之前判断字符串是否包含子串,用 indexOf 方法,ES6 新增了子串的识别方法. includes():返回布尔值,判断是否找到参数字符串. startsWith( ...

  10. hdu-1173(最短距离)

    题目链接:http://acm.hdu.edu.cn/showproblem.php?pid=1173 思路:最短距离:就是现将x,y从小到大排序,然后去中间点就行了.(注意:本题答案不唯一) #in ...