1. Introduction
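In August 2017, Andrew Ng, former chief scientist at Baidu, announced on Twitter his first move after leaving Baidu: a Deep Learning course on Coursera that builds neural networks from scratch. The announcement drew wide attention.
As of today (Thursday, August 17, 2017), I have enrolled in the course and completed the first two weeks of lectures and assignments. In those two weeks, Andrew Ng uses Logistic Regression to explain how a neural network works, and describes in plain language how backpropagation is an application of the chain rule for derivatives. The week-2 programming assignment asks students to implement a Logistic Regression model from scratch in a Python notebook and use it for binary classification on a given image set, deciding whether each picture shows a cat; the trained model reaches a test accuracy of 70%. Andrew Ng also says that later parts of the course will cover optimization methods for neural networks that push the cat-recognition accuracy higher.
The point Andrew Ng stresses most in the programming exercise is vectorization. He shows with an example that a vectorized product in Python's NumPy package runs more than 300 times faster than summing in a plain for loop, which is the difference between 1 minute and 5 hours. In real engineering work, however, Python is widely seen as a tool for trying out ideas and getting results quickly, and as somewhat less suited to building and deploying production models. In this article I use Scala to implement the assignments from Andrew Ng's Deep Learning course. Thank you for reading; I hope we can improve together.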
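In Python, the package that deep learning algorithms and vectorized operations rely on is NumPy, short for Numerical Python. NumPy provides vector and matrix types along with functions for matrix operations and decompositions. In Scala the counterpart is the Breeze package, which provides the DenseVector and DenseMatrix data structures and, for very sparse data, SparseVector and CSCMatrix as well; NumPy itself has no sparse types (those live in SciPy), so in this respect Breeze offers somewhat more out of the box. Most importantly, Scala is a statically typed, type-safe language, which makes it practical to use not only for implementing the algorithm but also for data preprocessing and cleaning, that is, ETL.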
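As a rough illustration of the NumPy-to-Breeze correspondence described above, the sketch below builds a dense vector and matrix, takes a matrix-vector product, and compares a hand-written loop with Breeze's `dot`. The object name `BreezeBasics` and the sample values are made up for this example, and it assumes Breeze is on the classpath.

```scala
import breeze.linalg.{DenseMatrix, DenseVector, SparseVector}

// Hypothetical demo object; names and values are illustrative only.
object BreezeBasics {
  // Dot product written as an explicit loop, for comparison with `a dot b`.
  def loopDot(a: DenseVector[Double], b: DenseVector[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      sum += a(i) * b(i)
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    val a = DenseVector(1.0, 2.0, 3.0)
    val b = DenseVector(4.0, 5.0, 6.0)
    println(loopDot(a, b)) // explicit loop
    println(a dot b)       // vectorized equivalent

    // A 2x3 matrix times a length-3 vector gives a length-2 vector,
    // much like np.dot(m, a) in NumPy.
    val m = DenseMatrix((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
    println(m * a)

    // A sparse vector stores only the entries that were explicitly set.
    val s = SparseVector.zeros[Double](1000)
    s(3) = 1.5
    println(s.activeSize)
  }
}
```

Both dot products should agree; how much faster the vectorized call is depends on vector size and on whether Breeze can use native BLAS, so no timing is claimed here.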
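The rest of this article has four parts. The first describes the overall project structure; the second explains in detail the code that implements Logistic Regression in Scala; the third explains the supporting code, such as data preprocessing, the plotting tool, and a few other helper classes; the fourth gives the demo results and the download address of the dataset. All the code for this project is on GitHub at https://github.com/pan5431333/coursera-deeplearning-practice-in-scala and will be updated as Andrew Ng's course progresses. You are welcome to follow it.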
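2. Project Structure
The project is organized into five subpackages: data, demo, helper, model, and utils.
The data package contains a single class, Cat, defined as a Scala case class, which suits modeling a real-world entity. The demo package holds a runnable example for each lesson's assignment. The helper package currently contains two classes. CatDataHelper uses Java's ImageIO to read images from the local file system, converts each one to an RGB matrix, and then reshapes it into a vector. DlCollection is a generic collection class with three methods commonly needed in deep learning: split divides the data into a training set and a test set, getFeatureAsMatrix returns the feature matrix the algorithm needs, and getLabelAsVector returns the label vector. The model package currently contains only the Logistic Regression model. The utils package currently contains PlotUtils, whose plotCostHistory method plots how the cost changes with the number of iterations.
The next section walks through the Scala implementation of Logistic Regression.
3. Logistic Regression in Scala
First, define the LogisticRegressionModel class: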
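class LogisticRegressionModel() {
  // Hyperparameters
  var learningRate: Double = _
  var iterationTime: Int = _
  // Model parameters, set by train()
  var w: DenseVector[Double] = _
  var b: Double = _
  // Cost per iteration: key = iteration number, value = cost
  val costHistory: mutable.TreeMap[Int, Double] = new mutable.TreeMap[Int, Double]()
  // (the class body continues with the methods shown below)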
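The class has five instance variables. The first two are hyperparameters: learningRate is the learning rate and iterationTime is the maximum number of iterations. w and b are the model parameters, which are optimized as the iterations proceed. costHistory is a TreeMap that records how the cost changes during training; its key is the iteration number and its value is the cost.
Next come the two setters for the hyperparameters: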
def setLearningRate(learningRate: Double): this.type = {
  this.learningRate = learningRate
  this
}

def setIterationTime(iterationTime: Int): this.type = {
  this.iterationTime = iterationTime
  this
}
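Note that these setters differ from conventional Java setters: each one returns the object itself so that calls can be chained. A caller can write val model = new LogisticRegressionModel().setLearningRate(0.0001).setIterationTime(3000), which makes the code read more fluently. Method chaining is also used widely in Spark, where it looks particularly elegant when building a data pipeline (Pipeline).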
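One reason to declare the return type as `this.type` rather than the class name shows up once a model is subclassed: the chain keeps the subclass type. The sketch below illustrates this with two hypothetical classes, `BaseModel` and `RegularizedModel`, which are not part of the project.

```scala
// Hypothetical classes, used only to illustrate the `this.type` return type.
class BaseModel {
  var learningRate: Double = 0.0

  // Returning this.type keeps the caller's static type in the chain.
  def setLearningRate(learningRate: Double): this.type = {
    this.learningRate = learningRate
    this
  }
}

class RegularizedModel extends BaseModel {
  var lambda: Double = 0.0

  def setLambda(lambda: Double): this.type = {
    this.lambda = lambda
    this
  }
}

object FluentSetterDemo {
  def build(): RegularizedModel =
    // setLearningRate is defined on BaseModel, but because it returns
    // this.type the result is still a RegularizedModel, so setLambda compiles.
    new RegularizedModel().setLearningRate(0.005).setLambda(0.1)

  def main(args: Array[String]): Unit = {
    val model = build()
    println(s"learningRate=${model.learningRate}, lambda=${model.lambda}")
  }
}
```

Had `setLearningRate` been declared to return `BaseModel`, the call to `setLambda` in `build()` would not compile.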
Next is the training method:
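def train(feature: DenseMatrix[Double], label: DenseVector[Double]): this.type = {
  // Start from random parameters, one weight per feature column
  var (w, b) = initializeParams(feature.cols)

  (1 to this.iterationTime).foreach { i =>
    val (cost, dw, db) = propagate(feature, label, w, b)

    // Log every 100 iterations, but record the cost on every iteration
    if (i % 100 == 0)
      println("INFO: Cost in " + i + "th time of iteration: " + cost)
    costHistory.put(i, cost)

    // Decay the learning rate; i / 1000 is integer division, so the rate
    // drops in steps, once every 1000 iterations
    val adjustedLearningRate = this.learningRate / (log(i / 1000 + 1) + 1)
    w :-= adjustedLearningRate * dw
    b -= adjustedLearningRate * db
  }

  this.w = w
  this.b = b
  this
}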
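This method uses two private methods, initializeParams() and propagate(), both explained below. It also applies a simple adjustment to learningRate so that it shrinks as the iteration count grows, which reduces the chance of stepping over the optimum. Because i/1000 is integer division, the rate falls in steps, once every 1000 iterations, rather than continuously.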
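To make the decay schedule concrete, the sketch below evaluates the same expression outside the training loop. The object and method names are hypothetical, and `math.log` from the standard library stands in for Breeze's `log`.

```scala
// Hypothetical helper that mirrors the decay expression used in train().
object LearningRateDecayDemo {
  def adjustedRate(baseRate: Double, iteration: Int): Double =
    // iteration / 1000 is integer division, as in the training loop.
    baseRate / (math.log(iteration / 1000 + 1) + 1)

  def main(args: Array[String]): Unit = {
    Seq(1, 999, 1000, 2000, 3000).foreach { i =>
      println(s"iteration $i -> rate ${adjustedRate(0.005, i)}")
    }
  }
}
```

For iterations 1 through 999 the divisor is log(1) + 1 = 1, so the base rate is used unchanged; the first reduction happens at iteration 1000.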
Next is the method that initializes the model parameters:
private def initializeParams(featureSize: Int): (DenseVector[Double], Double) = {
  val w = DenseVector.rand[Double](featureSize)
  val b = DenseVector.rand[Double](1).data(0)
  (w, b)
}
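Here w and b are given random initial values between 0 and 1.
Next is the core of Logistic Regression, the method that implements forward and backward propagation: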
private def propagate(feature: DenseMatrix[Double],
                      label: DenseVector[Double],
                      w: DenseVector[Double],
                      b: Double): (Double, DenseVector[Double], Double) = {
  val numExamples = feature.rows
  val ones = DenseVector.ones[Double](numExamples)

  // Forward pass: predicted probability for every example
  val labelHat = sigmoid(feature * w + b)
  // println("DEBUG: feature * w + b is " + feature * w + b)
  // println("DEBUG: the feature's number of cols is " + feature.cols)
  // println("DEBUG: the feature's number of rows is " + feature.rows)
  // println("DEBUG: the labelHat is " + labelHat)

  // Cross-entropy cost averaged over the examples
  val cost = -(label.t * log(labelHat) + (ones - label).t * log(ones - labelHat)) / numExamples

  // Backward pass: gradients of the cost with respect to w and b
  val dw = feature.t * (labelHat - label) /:/ numExamples.toDouble
  val db = ones.t * (labelHat - label) / numExamples.toDouble
  // println("DEBUG: the (dw, db) is " + dw + ", " + db)

  (cost, dw, db)
}
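The commented-out lines are debugging code from development; since I did not add a logging library to this project, println was the only way to debug. feature.rows and feature.cols correspond to feature.shape[0] and feature.shape[1] in Python's NumPy. sigmoid is a function provided by breeze.numerics._ that accepts a DenseVector or a DenseMatrix as its argument. For how cost, dw, and db are computed, see the theory of Logistic Regression; if anything is unclear, Andrew Ng's Deep Learning course covers it. One point needs attention: NumPy supports broadcasting, so 1 - np.array([1, 2, 3]) yields np.array([0, -1, -2]); when a constant is combined with a vector or matrix, NumPy applies the operation to every element automatically. Breeze's support for this is more limited, so when computing the cost we write DenseVector.ones[Double](numExamples) - label instead of 1 - label.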
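For reference, the quantities computed in propagate() can be written out as follows. The notation is introduced here for illustration: X is the feature matrix with m = numExamples rows, y is the label vector, ŷ is labelHat, and 1 is the all-ones vector of length m.

```latex
\hat{y} = \sigma(Xw + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

J(w, b) = -\frac{1}{m}\left[\, y^{\top}\log\hat{y} + (\mathbf{1} - y)^{\top}\log(\mathbf{1} - \hat{y}) \,\right]

\frac{\partial J}{\partial w} = \frac{1}{m}\, X^{\top}(\hat{y} - y), \qquad
\frac{\partial J}{\partial b} = \frac{1}{m}\, \mathbf{1}^{\top}(\hat{y} - y)
```

These match the code term by term: cost is J, dw is the gradient with respect to w, and db is the gradient with respect to b.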
Next is the method that makes predictions with the trained model:
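def predict(feature: DenseMatrix[Double]): DenseVector[Double] = {
  // Same forward pass as in training, including the bias term b,
  // followed by a 0.5 decision threshold on the predicted probability
  val yPredicted = sigmoid(feature * this.w + this.b).map { eachY =>
    if (eachY <= 0.5) 0.0 else 1.0
  }
  yPredicted
}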
Here we use map, a staple of functional programming, which keeps the code concise.
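The thresholding step can be seen in isolation in the sketch below, which uses plain Scala collections instead of Breeze. The object name, the helper names, and the 0.5 cutoff are assumptions made for this illustration.

```scala
// Hypothetical stand-alone version of the thresholding step in predict().
object ThresholdDemo {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Map each score to a probability, then to a hard 0/1 label.
  // The default cutoff of 0.5 is an assumption of this sketch.
  def toLabels(scores: Seq[Double], threshold: Double = 0.5): Seq[Double] =
    scores.map(sigmoid).map(p => if (p <= threshold) 0.0 else 1.0)

  def main(args: Array[String]): Unit = {
    println(toLabels(Seq(-2.0, 0.0, 3.0)))
  }
}
```

A score of exactly 0 gives a probability of 0.5, which this sketch assigns to class 0 because the comparison is `<=`.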
Next is the method that computes prediction accuracy:
def accuracy(label: DenseVector[Double], labelPredicted: DenseVector[Double]): Double = {
  // Count the positions where the prediction equals the true label
  val numCorrect = (0 until label.length)
    .count { index => label(index) == labelPredicted(index) }
  numCorrect.toDouble / label.length.toDouble
}
This again draws on functional programming features, and the code stays very concise.
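The same computation can also be written with zip and count on ordinary Scala sequences, as sketched below. The object `AccuracyDemo` is hypothetical and works on `Seq[Double]` rather than Breeze vectors.

```scala
// Hypothetical stand-alone accuracy function on plain Scala sequences.
object AccuracyDemo {
  def accuracy(label: Seq[Double], predicted: Seq[Double]): Double = {
    require(label.length == predicted.length, "label and prediction lengths differ")
    // Pair each true label with its prediction and count the matches.
    val numCorrect = label.zip(predicted).count { case (y, yHat) => y == yHat }
    numCorrect.toDouble / label.length
  }

  def main(args: Array[String]): Unit = {
    println(accuracy(Seq(1.0, 0.0, 1.0, 1.0), Seq(1.0, 0.0, 0.0, 1.0)))
  }
}
```

Comparing doubles with `==` is acceptable here only because both sequences hold exact 0.0 and 1.0 values.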
Finally, there are a few auxiliary getter methods:
def getCostHistory: mutable.TreeMap[Int, Double] = this.costHistory

def getLearningRate: Double = this.learningRate

def getIterationTime: Int = this.iterationTime
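That is Logistic Regression implemented in Scala; the complete code follows. As the old saying goes, what you learn from paper always feels shallow, and real understanding comes only from doing. If you have questions, copy the code into your local environment, run it, change the parts you are unsure about, and watch how the program behaves; you will learn a great deal.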
package org.mengpan.deeplearning.model

import breeze.linalg.{DenseMatrix, DenseVector, max}
import breeze.numerics.{log, sigmoid}

import scala.collection.mutable

/**
  * Created by mengpan on 2017/8/15.
  */
class LogisticRegressionModel() {
  // Hyperparameters
  var learningRate: Double = _
  var iterationTime: Int = _
  // Model parameters, set by train()
  var w: DenseVector[Double] = _
  var b: Double = _
  // Cost per iteration: key = iteration number, value = cost
  val costHistory: mutable.TreeMap[Int, Double] = new mutable.TreeMap[Int, Double]()

  def setLearningRate(learningRate: Double): this.type = {
    this.learningRate = learningRate
    this
  }

  def setIterationTime(iterationTime: Int): this.type = {
    this.iterationTime = iterationTime
    this
  }

  def train(feature: DenseMatrix[Double], label: DenseVector[Double]): this.type = {
    var (w, b) = initializeParams(feature.cols)

    (1 to this.iterationTime).foreach { i =>
      val (cost, dw, db) = propagate(feature, label, w, b)

      if (i % 100 == 0)
        println("INFO: Cost in " + i + "th time of iteration: " + cost)
      costHistory.put(i, cost)

      // i / 1000 is integer division, so the rate drops once every 1000 iterations
      val adjustedLearningRate = this.learningRate / (log(i / 1000 + 1) + 1)
      w :-= adjustedLearningRate * dw
      b -= adjustedLearningRate * db
    }

    this.w = w
    this.b = b
    this
  }

  def predict(feature: DenseMatrix[Double]): DenseVector[Double] = {
    // Forward pass including the bias term, then a 0.5 decision threshold
    val yPredicted = sigmoid(feature * this.w + this.b).map { eachY =>
      if (eachY <= 0.5) 0.0 else 1.0
    }
    yPredicted
  }

  def accuracy(label: DenseVector[Double], labelPredicted: DenseVector[Double]): Double = {
    val numCorrect = (0 until label.length)
      .count { index => label(index) == labelPredicted(index) }
    numCorrect.toDouble / label.length.toDouble
  }

  def getCostHistory: mutable.TreeMap[Int, Double] = this.costHistory

  def getLearningRate: Double = this.learningRate

  def getIterationTime: Int = this.iterationTime

  private def initializeParams(featureSize: Int): (DenseVector[Double], Double) = {
    val w = DenseVector.rand[Double](featureSize)
    val b = DenseVector.rand[Double](1).data(0)
    (w, b)
  }

  private def propagate(feature: DenseMatrix[Double],
                        label: DenseVector[Double],
                        w: DenseVector[Double],
                        b: Double): (Double, DenseVector[Double], Double) = {
    val numExamples = feature.rows
    val ones = DenseVector.ones[Double](numExamples)

    // Forward pass
    val labelHat = sigmoid(feature * w + b)
    // println("DEBUG: feature * w + b is " + feature * w + b)
    // println("DEBUG: the feature's number of cols is " + feature.cols)
    // println("DEBUG: the feature's number of rows is " + feature.rows)
    // println("DEBUG: the labelHat is " + labelHat)

    // Cross-entropy cost averaged over the examples
    val cost = -(label.t * log(labelHat) + (ones - label).t * log(ones - labelHat)) / numExamples

    // Backward pass
    val dw = feature.t * (labelHat - label) /:/ numExamples.toDouble
    val db = ones.t * (labelHat - label) / numExamples.toDouble
    // println("DEBUG: the (dw, db) is " + dw + ", " + db)

    (cost, dw, db)
  }
}
4. Other Supporting Code
First, the case class that represents a Cat:
package org.mengpan.deeplearning.data

import breeze.linalg.DenseVector

/**
  * Created by mengpan on 2017/8/15.
  */
case class Cat(feature: DenseVector[Double], label: Double)
Next is CatDataHelper, which reads the image data from local disk. It is a Scala object, the counterpart of a class with only static members:
package org.mengpan.deeplearning.helper

import java.io.File
import javax.imageio.ImageIO

import breeze.linalg.{DenseMatrix, DenseVector}
import org.mengpan.deeplearning.data.Cat

import scala.io.Source

/**
  * Created by mengpan on 2017/8/15.
  */
object CatDataHelper {

  def getAllCatData: DlCollection[Cat] = {
    val labels = getLabels
    val catNonCatLabels = getBalancedBatNonCatLabels(labels)

    val catList = catNonCatLabels
      .map { indexedLabel =>
        val fileNumber = indexedLabel._1
        val label = indexedLabel._2
        val animalFileName: String = "/Users/mengpan/Downloads/train/" + fileNumber + ".png"
        val feature = getFeatureForOneAnimal(animalFileName)
        feature match {
          case Some(s) => Cat(s, label)
          // A length-10 placeholder marks an unreadable image; it is filtered out below
          case None => Cat(DenseVector.zeros[Double](10), label)
        }
      }
      .filter { cat => cat.feature.length != 10 }
      .toList

    new DlCollection[Cat](catList)
  }

  private def getFeatureForOneAnimal(animalFileName: String): Option[DenseVector[Double]] = {
    println("Reading file: " + animalFileName)
    try {
      val image = ImageIO.read(new File(animalFileName))
      val imageData = image.getData
      val height = imageData.getHeight
      val width = imageData.getWidth

      val redVector = DenseVector.zeros[Double](height * width)
      val greenVector = DenseVector.zeros[Double](height * width)
      val blueVector = DenseVector.zeros[Double](height * width)

      (0 until height).foreach { y =>
        (0 until width).foreach { x =>
          val rgb = imageData.getPixel(x, y, Array(0, 0, 0))
          // Row-major index: the row stride must be the image width
          // (a fixed stride of 10 would make rows overwrite each other)
          val index = x + y * width
          redVector(index) = rgb(0).toDouble
          greenVector(index) = rgb(1).toDouble
          blueVector(index) = rgb(2).toDouble
        }
      }

      // Stack the three channels and flatten them into one feature vector
      val resVector = DenseMatrix(redVector, greenVector, blueVector)
        .reshape(height * width * 3, 1)
        .toDenseVector

      // Standardize to zero mean and unit standard deviation
      Some((resVector - breeze.stats.mean(resVector)) /:/ breeze.stats.stddev(resVector))
    } catch {
      case _: Exception => None
    }
  }

  // Reads (id, label) pairs from the CSV file, skipping the header row
  private def getLabels: Vector[(Int, String)] = {
    Source
      .fromFile("/Users/mengpan/Downloads/trainLabels.csv")
      .getLines()
      .map { eachRow =>
        val split = eachRow.split(",")
        (split(0), split(1))
      }
      .filter { eachRow => eachRow._1 != "id" }
      .map { eachRow => (eachRow._1.toInt, eachRow._2) }
      .toVector
  }

  // Keeps only two classes: "cat" becomes 1, "automobile" becomes 0
  private def getBalancedBatNonCatLabels(labels: Vector[(Int, String)]): Vector[(Int, Int)] = {
    labels
      .map { label =>
        val numLabel = label._2 match {
          case "cat" => 1
          case "automobile" => 0
          case _ => 2
        }
        (label._1, numLabel)
      }
      .filter { label => label._2 != 2 }
  }
}
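Two small pieces of arithmetic sit inside the image loader: mapping a pixel position to a position in the flattened vector, and standardizing the result. The sketch below shows both on plain Scala values. The object and method names are hypothetical, and dividing by n - 1 (the sample standard deviation) is an assumption about which convention to use.

```scala
// Hypothetical helpers illustrating the flattening and standardization steps.
object PixelIndexDemo {
  // Row-major position of pixel (x, y) in an image that is `width` pixels wide.
  def flatIndex(x: Int, y: Int, width: Int): Int = x + y * width

  // Shift to zero mean and scale to unit sample standard deviation
  // (dividing by n - 1 is an assumption of this sketch).
  def standardize(values: Seq[Double]): Seq[Double] = {
    val mean = values.sum / values.length
    val variance = values.map(v => (v - mean) * (v - mean)).sum / (values.length - 1)
    val std = math.sqrt(variance)
    values.map(v => (v - mean) / std)
  }

  def main(args: Array[String]): Unit = {
    println(flatIndex(31, 31, 32))
    println(standardize(Seq(1.0, 2.0, 3.0)))
  }
}
```

With the image width as the row stride, every pixel of a 32x32 image gets its own index between 0 and 1023, so no two pixels share a slot.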
Next is DlCollection, the container this project uses to hold a dataset:
package org.mengpan.deeplearning.helper

import breeze.linalg.{DenseMatrix, DenseVector}
import org.mengpan.deeplearning.data.Cat

/**
  * Created by mengpan on 2017/8/15.
  */
class DlCollection[E <: Cat](data: List[E]) {
  private val numRows: Int = this.data.size
  private val numCols: Int = this.data.head.feature.length

  // Splits at a fixed position: the first trainingSize fraction is the training set
  def split(trainingSize: Double): (DlCollection[E], DlCollection[E]) = {
    val splited = data.splitAt((data.length * trainingSize).toInt)
    (new DlCollection[E](splited._1), new DlCollection[E](splited._2))
  }

  // One row per example, one column per feature
  def getFeatureAsMatrix: DenseMatrix[Double] = {
    val feature = DenseMatrix.zeros[Double](this.numRows, this.numCols)

    var i = 0
    this.data.foreach { eachRow =>
      feature(i, ::) := eachRow.feature.t
      i = i + 1
    }

    feature
  }

  def getLabelAsVector: DenseVector[Double] = {
    val label = DenseVector.zeros[Double](this.numRows)

    var i: Int = 0
    this.data.foreach { eachRow =>
      label(i) = eachRow.label
      i += 1
    }

    label
  }

  override def toString = s"DlCollection($numRows, $numCols, $getFeatureAsMatrix, $getLabelAsVector)"
}
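Because split cuts the list at a fixed position without shuffling, the training and test sets reflect whatever order the data was loaded in. If that order is a concern, one option is to shuffle before splitting. The sketch below shows the idea on a plain List; `SplitDemo`, `shuffledSplit`, and the seed are hypothetical and not part of DlCollection.

```scala
import scala.util.Random

// Hypothetical helper: shuffle with a fixed seed, then split as DlCollection does.
object SplitDemo {
  def shuffledSplit[A](data: List[A], trainingSize: Double, seed: Long): (List[A], List[A]) = {
    // A fixed seed keeps the split reproducible between runs.
    val shuffled = new Random(seed).shuffle(data)
    shuffled.splitAt((data.length * trainingSize).toInt)
  }

  def main(args: Array[String]): Unit = {
    val (train, test) = shuffledSplit((1 to 10).toList, 0.8, seed = 42L)
    println(s"train=$train test=$test")
  }
}
```

The split sizes are the same as with the unshuffled version; only the membership of each part changes.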
Finally, the plotting utility, a thin wrapper around JFreeChart:
package org.mengpan.deeplearning.utils

import javax.swing.JFrame

import org.jfree.chart.plot.PlotOrientation
import org.jfree.chart.{ChartFactory, ChartPanel, JFreeChart}
import org.jfree.data.xy.DefaultXYDataset

import scala.collection.mutable

/**
  * Created by mengpan on 2017/8/17.
  */
object PlotUtils {

  def plotCostHistory(costHistory: mutable.TreeMap[Int, Double]): Unit = {
    // x = iteration numbers, y = cost values, in key order
    val x = costHistory.keys.toArray.map { _.toDouble }
    val y = costHistory.values.toArray[Double]
    val data = Array(x, y)

    val xyDataset: DefaultXYDataset = new DefaultXYDataset()
    xyDataset.addSeries("Iteration v.s. Cost", data)

    val jFreeChart: JFreeChart = ChartFactory.createScatterPlot(
      "Cost History", "Iteration", "Cost",
      xyDataset, PlotOrientation.VERTICAL, true, false, false
    )

    // Show the chart in a Swing window
    val panel = new ChartPanel(jFreeChart, true)
    val frame = new JFrame()
    frame.add(panel)
    frame.setBounds(50, 50, 800, 600)
    frame.setVisible(true)
  }
}
5. Demo
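Because I could not find the cat image set that Andrew Ng uses in the Deep Learning course, I test with CIFAR-10, a well-known dataset in image recognition. This example uses only two of its 10 classes, cat and automobile. CIFAR-10 download links are easy to find online; alternatively, the dataset can be downloaded directly from Kaggle: https://www.kaggle.com/c/cifar-10
Here is the demo code used in this article: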
package org.mengpan.deeplearning.demo

import org.mengpan.deeplearning.data.Cat
import org.mengpan.deeplearning.helper.{CatDataHelper, DlCollection}
import org.mengpan.deeplearning.model.LogisticRegressionModel
import org.mengpan.deeplearning.utils.PlotUtils

/**
  * Created by mengpan on 2017/8/15.
  */
object ClassOneLogisticRegressionDemo extends App {
  // Load the cat image dataset
  val catData: DlCollection[Cat] = CatDataHelper.getAllCatData

  // Split into training set and test set
  val (training, test) = catData.split(0.8)

  // Extract features and labels for the training and test sets
  val trainingFeature = training.getFeatureAsMatrix
  val trainingLabel = training.getLabelAsVector
  val testFeature = test.getFeatureAsMatrix
  val testLabel = test.getLabelAsVector

  // Configure the LR model
  val lrModel: LogisticRegressionModel = new LogisticRegressionModel()
    .setLearningRate(0.005)
    .setIterationTime(3000)

  // Train the model on the training set
  val trainedModel: LogisticRegressionModel = lrModel.train(trainingFeature, trainingLabel)

  // Evaluate the model on both sets
  val yPredicted = trainedModel.predict(testFeature)
  val trainYPredicted = trainedModel.predict(trainingFeature)

  val testAccuracy = trainedModel.accuracy(testLabel, yPredicted)
  val trainAccuracy = trainedModel.accuracy(trainingLabel, trainYPredicted)
  println("\n The train accuracy of this model is: " + trainAccuracy)
  println("\n The test accuracy of this model is: " + testAccuracy)

  // Plot how the cost changes with the number of iterations during training
  val costHistory = trainedModel.getCostHistory
  PlotUtils.plotCostHistory(costHistory)
}
The final accuracy of the model in my run was:
The train accuracy of this model is: 0.634125
The test accuracy of this model is: 0.6195
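Note that the dataset comes from a Kaggle competition and its images are small and blurry (32x32 pixels), and Logistic Regression is not a particularly powerful algorithm, so a fairly low accuracy is to be expected. As the course goes on, I will apply what I learn to improve this classification result.
Thank you for reading.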