Mahout Linear Regression Algorithm Source Code Analysis (2) -- RunLogistic

Mahout version: 0.9

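First, a caveat: Mahout's logistic regression does not run in parallel. The implementation is single-machine, so in my opinion running it on Hadoop gains nothing. Second, the way the formula is derived during model building is too involved, so I skip TrainLogistic and analyze only RunLogistic, that is, how a prediction is computed from the trained formula. (I have been feeling restless lately and cannot settle down, completely out of form... sigh...)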

TrainLogistic produces a formula such as the following:

-0.149*Intercept Term + -0.701*x + -0.427*y

First, the Intercept Term: it is simply a constant, and in Mahout's implementation its value is 1.

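At first I simply plugged in a test record, for example [0.97073650965467,0.989339149091393] for [x,y], used 1 for the constant, multiplied each value by its coefficient, and summed the products. That gives -1.25253208955095, which is far from the 0.222 in the final output row 0,0.222,-0.251366. So how does -1.25253208955095 become 0.222?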

The prediction code in RunLogistic:

    while (line != null) {
      Vector v = new SequentialAccessSparseVector(lmp.getNumFeatures());
      int target = csv.processLine(line, v);
      double score = lr.classifyScalar(v);
      if (showScores) {
        output.printf(Locale.ENGLISH, "%d,%.3f,%.6f%n", target, score, lr.logLikelihood(target, v));
      }
      collector.add(target, score);
      line = in.readLine();
    }

First, look at csv.processLine. Its code is:

    public int processLine(String line, Vector featureVector) {
      List<String> values = parseCsvLine(line);
      int targetValue = targetDictionary.intern(values.get(target));
      if (targetValue >= maxTargetValue) {
        targetValue = maxTargetValue - 1;
      }
      for (Integer predictor : predictors) {
        String value;
        if (predictor >= 0) {
          value = values.get(predictor);
        } else {
          value = null;
        }
        predictorEncoders.get(predictor).addToVector(value, featureVector);
      }
      return targetValue;
    }

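The targetValue returned here is the class label from line (note that it is remapped: 2-->0, 1-->1). predictors holds three values, [-1,0,1]: -1 stands for the constant, 0 for the variable x, and 1 for the variable y. The predictorEncoders call then takes x and y from line and puts them into the corresponding positions of featureVector.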
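To make the predictor loop concrete, here is a minimal sketch in plain Java, without Mahout. It assumes a CSV layout of x,y,target and a fixed slot order [intercept, x, y]. The class FeatureVectorSketch and the helper toFeatures are hypothetical names for illustration only. As far as I know, Mahout's real encoders pick vector positions themselves through feature hashing, so the fixed order is a simplification.

```java
import java.util.Arrays;

/** Simplified stand-in for the predictor loop in processLine (not Mahout code). */
final class FeatureVectorSketch {

    // Hypothetical helper: predictor -1 means "intercept", values >= 0 are CSV column indexes.
    static double[] toFeatures(String line, int[] predictors) {
        String[] values = line.split(",");
        double[] features = new double[predictors.length];
        for (int i = 0; i < predictors.length; i++) {
            int p = predictors[i];
            // The intercept has no CSV column, so it contributes the constant 1.
            features[i] = (p >= 0) ? Double.parseDouble(values[p].trim()) : 1.0;
        }
        return features;
    }

    public static void main(String[] args) {
        // Assumed column order: x, y, target.
        double[] v = toFeatures("0.97073650965467,0.989339149091393,2", new int[] {-1, 0, 1});
        System.out.println(Arrays.toString(v));
    }
}
```

The resulting array holds the constant 1 followed by x and y, which is the input the dot product in classifyScalarNoLink expects conceptually.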

double score = lr.classifyScalar(v); is where the actual classification happens:

    public double classifyScalar(Vector instance) {
      Preconditions.checkArgument(numCategories() == 2, "Can only call classifyScalar with two categories");
      // apply pending regularization to whichever coefficients matter
      regularize(instance);
      // result is a vector with one element so we can just use dot product
      return link(classifyScalarNoLink(instance));
    }

regularize, as far as I can tell, does essentially nothing here.

The part that matters is the final return.

    public double classifyScalarNoLink(Vector instance) {
      return beta.viewRow(0).dot(instance);
    }

This multiplies each coefficient by the corresponding feature value and sums the products, giving the -1.25253208955095 computed earlier.
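The dot product can be checked by hand with a small sketch. It uses the coefficients exactly as printed in the formula (-0.149, -0.701, -0.427), which are rounded to three decimals, so it yields roughly -1.2519 rather than -1.25253208955095. Presumably the trained model holds unrounded coefficients; that is my assumption, not something the output confirms. DotProductSketch and dot are hypothetical names, not Mahout API.

```java
/** Hand-rolled dot product mirroring beta.viewRow(0).dot(instance) (not Mahout code). */
final class DotProductSketch {

    static double dot(double[] beta, double[] x) {
        if (beta.length != x.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < beta.length; i++) {
            sum += beta[i] * x[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Rounded coefficients from the printed formula: intercept, x, y.
        double[] beta = {-0.149, -0.701, -0.427};
        // Feature values: constant 1, then the test record [x, y].
        double[] x = {1.0, 0.97073650965467, 0.989339149091393};
        System.out.println(dot(beta, x));
    }
}
```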

Finally, the link function:

    public static double link(double r) {
      if (r < 0.0) {
        double s = Math.exp(r);
        return s / (1.0 + s);
      } else {
        double s = Math.exp(-r);
        return 1.0 / (1.0 + s);
      }
    }

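Here r is -1.25253208955095. link is the logistic (sigmoid) function 1/(1+e^(-r)); both branches compute the same value and are split only so that Math.exp is never called with a large positive argument. Substituting r gives the final prediction, 0.222262129784177.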
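The last step can be reproduced with the sketch below. link copies the branch structure quoted above. logLikelihood is my assumption about how the third output column is formed for two categories (log(p) for target 1, log(1-p) for target 0); I have not traced that method in the Mahout source here. LinkSketch is a hypothetical class name.

```java
import java.util.Locale;

/** Reproduces the score and log-likelihood columns for one record (not Mahout code). */
final class LinkSketch {

    // Logistic function, split by sign so Math.exp never sees a large positive argument.
    static double link(double r) {
        if (r < 0.0) {
            double s = Math.exp(r);
            return s / (1.0 + s);
        } else {
            double s = Math.exp(-r);
            return 1.0 / (1.0 + s);
        }
    }

    // Assumed two-class log-likelihood: log(p) when target is 1, log(1 - p) when target is 0.
    static double logLikelihood(int target, double p) {
        return target > 0 ? Math.log(p) : Math.log1p(-p);
    }

    public static void main(String[] args) {
        double r = -1.25253208955095; // linear score from the dot product
        double p = link(r);
        // Same format string as the printf call in RunLogistic.
        System.out.printf(Locale.ENGLISH, "%d,%.3f,%.6f%n", 0, p, logLikelihood(0, p));
    }
}
```

With r = -1.25253208955095 this gives a score of about 0.2223 and a log-likelihood of about -0.2514, in line with the row 0,0.222,-0.251366 quoted earlier.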

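Finally, a few personal thoughts on parallelizing this algorithm. If feasible, one could fit several linear equations (over the same variables) and output the one that evaluates best. Strictly speaking this is not parallelism, just a way to improve accuracy. The idea comes from the random forest algorithm, which builds many decision trees, and building many trees does count as parallel.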

Share, grow, enjoy.

When reposting, please cite the blog address: http://blog.csdn.net/fansy1990
