Common algorithm implementations in Spark / PySpark

Date: 2021-11-26 13:13:56

Code examples (Python) for clustering, classification, and regression analysis with Spark MLlib:

http://www.cnblogs.com/adienhsuan/p/5654481.html

Sparse vectors:

On Spark MLlib's basic data structures (Spark-MLlib-Basics):

http://blog.csdn.net/canglingye/article/details/41316193
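In MLlib, a sparse vector (`pyspark.mllib.linalg.Vectors.sparse(size, indices, values)`) stores only the non-zero entries, which matters for high-dimensional features like the 10,000-dimension hashed term frequencies used below. As a minimal pure-Python sketch of the idea (the helper names here are made up for illustration, not MLlib APIs):

```python
def to_sparse(dense):
    """Represent a dense vector as (size, {index: value}), keeping only non-zeros."""
    return len(dense), {i: v for i, v in enumerate(dense) if v != 0.0}

def sparse_dot(a, b):
    """Dot product of two sparse vectors in the (size, {index: value}) form."""
    size_a, entries_a = a
    size_b, entries_b = b
    assert size_a == size_b, "vectors must have the same dimension"
    return sum(v * entries_b.get(i, 0.0) for i, v in entries_a.items())

dense = [1.0, 0.0, 0.0, 3.0]
sparse = to_sparse(dense)
print(sparse)                    # (4, {0: 1.0, 3: 3.0})
print(sparse_dot(sparse, sparse))  # 1*1 + 3*3 = 10.0
```

Only two of the four entries are stored, and the dot product touches only the stored entries; this is why sparse formats pay off when most features are zero.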

On the regularization term: http://www.itnose.net/detail/6266100.html

Precision and recall: http://f.dataguru.cn/thread-707310-1-1.html
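The linked post covers precision and recall; as a small pure-Python sketch of the definitions themselves (precision = TP / (TP + FP), recall = TP / (TP + FN), with 1 as the positive label):

```python
def precision_recall(predictions, labels):
    """Compute precision and recall for binary predictions (1 = positive)."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

preds  = [1, 1, 0, 1, 0]
labels = [1, 0, 0, 1, 1]
print(precision_recall(preds, labels))  # tp=2, fp=1, fn=1 -> precision 2/3, recall 2/3
```

In the spam example below, precision is the fraction of emails flagged as spam that really are spam, and recall is the fraction of actual spam that gets caught.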

Machine learning: http://www.cnblogs.com/Leo_wl/p/5544239.html

[Important] Regularization and norms in detail: http://www.fuqingchuan.com/2015/03/500.html

http://www.cnblogs.com/tovin/p/3816289.html
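The regularization links above come down to two penalties on the weight vector: the L1 norm (sum of absolute weights, which pushes weights to exactly zero) and the squared L2 norm (sum of squared weights, which shrinks them smoothly). A minimal sketch of what each penalty computes; in MLlib, `LogisticRegressionWithSGD.train` takes a `regParam` and `regType` ("l1" or "l2") that add such a term to the loss:

```python
def l1_norm(w):
    """L1 penalty: sum of absolute weights (encourages sparse weights)."""
    return sum(abs(x) for x in w)

def l2_norm_squared(w):
    """Squared L2 penalty: sum of squared weights (shrinks weights smoothly)."""
    return sum(x * x for x in w)

w = [0.5, -2.0, 0.0, 1.5]
print(l1_norm(w))          # 0.5 + 2.0 + 0.0 + 1.5 = 4.0
print(l2_norm_squared(w))  # 0.25 + 4.0 + 0.0 + 2.25 = 6.5
```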

Computing AUC: http://www.toutiao.com/i6259948874706715138/
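In PySpark, AUC is available via `pyspark.mllib.evaluation.BinaryClassificationMetrics(scoreAndLabels).areaUnderROC` on an RDD of (score, label) pairs. As a self-contained sketch, AUC can also be computed directly from its rank interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (the O(P*N) pairwise form below is for clarity, not speed):

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half a win. Pairwise version, O(P*N)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    total = len(pos) * len(neg)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / total

scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(auc(scores, labels))  # 3 of 4 pairs ranked correctly -> 0.75
```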

Example: spam classification with logistic regression (Spark MLlib)

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD

# `sc` is the SparkContext (created automatically in the pyspark shell)
spam = sc.textFile("spam.txt")
normal = sc.textFile("normal.txt")

# Create a HashingTF instance to map email text to vectors of 10,000 features
tf = HashingTF(numFeatures=10000)
# Each email is split into words, and each word is mapped to one feature
spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
normalFeatures = normal.map(lambda email: tf.transform(email.split(" ")))

# Build LabeledPoint datasets for the positive (spam) and negative (normal) examples
positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = normalFeatures.map(lambda features: LabeledPoint(0, features))
trainingData = positiveExamples.union(negativeExamples)
trainingData.cache()  # Logistic regression is iterative, so cache the training RDD

# Train a logistic regression model using the SGD algorithm
model = LogisticRegressionWithSGD.train(trainingData)

# Test on a positive (spam) and a negative (normal) example
posTest = tf.transform("O M G GET cheap stuff by sending money to...".split(" "))
negTest = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
print("Prediction for positive test example: %g" % model.predict(posTest))
print("Prediction for negative test example: %g" % model.predict(negTest))
