1. Introduction

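In August 2017, Andrew Ng, the former chief scientist at Baidu, announced on Twitter his first move after leaving the company: a Deep Learning course on Coursera that builds neural networks from scratch. The announcement drew wide attention.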

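As of today (Thursday, August 17, 2017), I have enrolled in the course and completed the first two weeks of lectures and assignments. In those two weeks, Andrew Ng uses Logistic Regression to explain how a neural network works, and he shows in plain language that backpropagation is simply the chain rule of differentiation. In the week-two programming assignment, students implement a Logistic Regression model from scratch in a Python notebook and use it for binary classification on a given image set, deciding whether each picture is a cat. The trained model reaches a test accuracy of 70%. Andrew Ng also says that later parts of the course will cover optimization methods for neural networks that further improve the accuracy of the cat classifier.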

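The programming exercise emphasizes vectorized computation. Andrew Ng demonstrates with an example that a vector product computed with Python's NumPy package is more than 300 times faster than summing in a plain for loop, which is the difference between 1 minute and 5 hours. In real engineering work, however, Python is mostly used to try out ideas quickly and get results fast, and to some extent it is less suited to production model development and deployment. This article uses Scala to implement the assignments from Andrew Ng's Deep Learning course. Thank you for reading; I hope we can make progress together.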
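
As a rough illustration of the loop-versus-vectorized comparison, the sketch below computes the same dot product twice with Breeze: once with an explicit while loop and once with the built-in dot operator. The object name DotProductComparison and the vector size are made up for this example. The 300x figure quoted above was measured in Python with NumPy, and the gap on the JVM may well be different, so the timings printed here are indicative only.

```scala
import breeze.linalg.DenseVector

object DotProductComparison {
  // Element-by-element dot product with an explicit loop.
  def loopDot(a: DenseVector[Double], b: DenseVector[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      sum += a(i) * b(i)
      i += 1
    }
    sum
  }

  // The same computation delegated to Breeze's vectorized operator.
  def vectorizedDot(a: DenseVector[Double], b: DenseVector[Double]): Double = a dot b

  // Runs a block once and returns its result with the elapsed nanoseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    (result, System.nanoTime() - start)
  }

  def main(args: Array[String]): Unit = {
    val n = 1000000 // hypothetical size, chosen only for illustration
    val a = DenseVector.rand[Double](n)
    val b = DenseVector.rand[Double](n)
    val (loopResult, loopNanos) = timed(loopDot(a, b))
    val (vecResult, vecNanos) = timed(vectorizedDot(a, b))
    println(s"loop:       $loopResult in ${loopNanos / 1e6} ms")
    println(s"vectorized: $vecResult in ${vecNanos / 1e6} ms")
  }
}
```

A single timed run like this includes JIT warm-up, so a fair benchmark would repeat each measurement several times before comparing.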

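In Python, the package that deep learning algorithms and vectorized computation rely on is NumPy, short for Numerical Python. NumPy provides array types for vectors and matrices, along with functions for matrix operations and decompositions. The Scala counterpart used here is Breeze, which provides the DenseVector and DenseMatrix data structures, plus SparseVector and CSCMatrix (Breeze's sparse matrix type) for very sparse data, which in some respects goes beyond what NumPy itself offers. Most importantly, Scala is a statically typed, type-safe language, so we can use it not only to implement the algorithm but also for data preprocessing and cleaning, that is, ETL.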
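
To make the NumPy-to-Breeze correspondence concrete, here is a minimal sketch of the data structures just mentioned. The object name BreezeBasics and all the values are invented for illustration; the NumPy expressions in the comments are the rough equivalents, not a line-by-line translation.

```scala
import breeze.linalg.{DenseMatrix, DenseVector, SparseVector}

object BreezeBasics {
  // NumPy: np.array([1.0, 2.0, 3.0])
  val v: DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)

  // NumPy: np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]])
  val m: DenseMatrix[Double] = DenseMatrix((1.0, 0.0, 2.0), (0.0, 1.0, 0.0))

  // Matrix-vector product, NumPy: m.dot(v)
  val product: DenseVector[Double] = m * v

  // A sparse vector of length 1000 that stores only the entries that were set.
  val sparse: SparseVector[Double] = SparseVector.zeros[Double](1000)
  sparse(3) = 5.0
  sparse(500) = 1.5

  def main(args: Array[String]): Unit = {
    println(s"shape of m: ${m.rows} x ${m.cols}")
    println(s"m * v = $product")
    println(s"stored entries in sparse vector: ${sparse.activeSize}")
  }
}
```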

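This article has four parts. The first describes the overall project structure; the second explains in detail the Scala code that implements Logistic Regression; the third explains the supporting code, such as data preprocessing, the plotting utility, and other helper classes; the fourth presents the demo results and where to download the data set. All the code for this project is available on GitHub at https://github.com/pan5431333/coursera-deeplearning-practice-in-scala. The code will be updated as Andrew Ng's course progresses; you are welcome to follow the repository.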

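2. Project Structure

The project is organized into five subpackages: data, demo, helper, model, and utils.

The data package contains a single class, Cat, defined as a Scala case class, which is well suited to representing a real-world entity. The demo package holds a runnable example for each homework assignment. The helper package currently contains two classes. CatDataHelper uses Java's ImageIO to read images from the local file system, converts each one to its RGB matrix representation, and then reshapes it into a vector. DlCollection is a generic collection class that provides three methods commonly needed in deep learning: split, which divides the data into a training set and a test set; getFeatureAsMatrix, which returns the feature matrix the algorithm needs; and getLabelAsVector, which returns the label vector. The model package currently contains only the Logistic Regression model. The utils package currently contains PlotUtils, which provides a plotCostHistory method that plots how the cost changes with the number of iterations.

The next section describes the Scala implementation of Logistic Regression.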

3. Logistic Regression in Scala

First, define the LogisticRegressionModel class:

class LogisticRegressionModel() {
  var learningRate: Double = _
  var iterationTime: Int = _
  var w: DenseVector[Double] = _
  var b: Double = _
  val costHistory: mutable.TreeMap[Int, Double] = new mutable.TreeMap[Int, Double]()

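The class has five instance variables. The first two are hyperparameters: learningRate is the learning rate and iterationTime is the maximum number of iterations. w and b are the model parameters, which are optimized as the iterations proceed. costHistory is a TreeMap that records how the cost changes during training; its key is the iteration number and its value is the cost.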

Next come the two setters for the hyperparameters:

def setLearningRate(learningRate: Double): this.type = {
  this.learningRate = learningRate
  this
}

def setIterationTime(iterationTime: Int): this.type = {
  this.iterationTime = iterationTime
  this
}

Note that these setters differ from typical Java setters: each one returns this, which allows method chaining. A caller can write val model = new LogisticRegressionModel().setLearningRate(0.0001).setIterationTime(3000), which makes the code read more fluently. Method chaining is also widely used in Spark, where it is particularly elegant for building data pipelines (Pipeline).
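
One reason to declare the return type as this.type rather than the class name is that chaining keeps working across subclasses. The sketch below illustrates this with two hypothetical classes, BaseModel and RegularizedModel, which are not part of the project and exist only for this example.

```scala
object ChainedSetters {
  class BaseModel {
    var learningRate: Double = 0.0

    // Returning this.type keeps the static type of the receiver,
    // so a subclass instance is still typed as the subclass after the call.
    def setLearningRate(learningRate: Double): this.type = {
      this.learningRate = learningRate
      this
    }
  }

  // Hypothetical subclass, used only to show why this.type matters.
  class RegularizedModel extends BaseModel {
    var lambda: Double = 0.0

    def setLambda(lambda: Double): this.type = {
      this.lambda = lambda
      this
    }
  }

  def main(args: Array[String]): Unit = {
    // If setLearningRate returned BaseModel, the call to setLambda would not compile.
    val model = new RegularizedModel().setLearningRate(0.0001).setLambda(0.1)
    println(s"learningRate=${model.learningRate}, lambda=${model.lambda}")
  }
}
```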

Next is the training method:

def train(feature: DenseMatrix[Double], label: DenseVector[Double]): this.type = {

  var (w, b) = initializeParams(feature.cols)

  (1 to this.iterationTime)
    .foreach{i =>
      val (cost, dw, db) = propagate(feature, label, w, b)

      if (i % 100 == 0) println("INFO: Cost in " + i + "th time of iteration: " + cost)
      costHistory.put(i, cost)

      val adjustedLearningRate = this.learningRate / (log(i/1000 + 1) + 1)
      w :-= adjustedLearningRate * dw
      b -= adjustedLearningRate * db
    }

  this.w = w
  this.b = b
  this
}

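Note that this method calls two private methods, initializeParams() and propagate(), which are explained in detail below. In addition, learningRate is adjusted in a simple way so that it decreases as the iteration count grows, which reduces the chance of overshooting the optimum. Because i/1000 is integer division, the rate actually drops in steps, once every 1000 iterations, rather than continuously.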
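
To see what the learning-rate adjustment does in practice, the following sketch evaluates the same formula for a few iteration numbers. It uses scala.math.log on a scalar instead of breeze.numerics.log, which gives the same natural logarithm; the object name and the sample iteration numbers are chosen only for illustration.

```scala
object LearningRateSchedule {
  // Same formula as in train, using scala.math.log on a scalar.
  // i / 1000 is integer division, so the value only changes every 1000 iterations.
  def adjustedLearningRate(learningRate: Double, i: Int): Double =
    learningRate / (math.log(i / 1000 + 1) + 1)

  def main(args: Array[String]): Unit = {
    val base = 0.005
    Seq(1, 999, 1000, 1999, 2000, 3000).foreach { i =>
      println(f"iteration $i%4d -> ${adjustedLearningRate(base, i)}%.6f")
    }
  }
}
```

For iterations 1 to 999 the divisor is log(1) + 1 = 1, so the base rate is used unchanged; replacing i / 1000 with i / 1000.0 would be one way to get a smooth decay instead.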

Next is the method that initializes the model parameters:

private def initializeParams(featureSize: Int): (DenseVector[Double], Double) = {
  val w = DenseVector.rand[Double](featureSize)
  val b = DenseVector.rand[Double](1).data(0)
  (w, b)
}

Here w and b are assigned random values between 0 and 1.

Next is the core of Logistic Regression, the implementation of forward and backward propagation:

  private def propagate(feature: DenseMatrix[Double], label: DenseVector[Double], w: DenseVector[Double], b: Double): (Double, DenseVector[Double], Double) = {
    val numExamples = feature.rows
    val labelHat = sigmoid(feature * w + b)

//    println("DEBUG: feature * w + b is " + feature * w + b)
//    println("DEBUG: the feature's number of cols is " + feature.cols)
//    println("DEBUG: the feature's number of rows is " + feature.rows)
//    println("DEBUG: the labelHat is " + labelHat)

    val cost = -(label.t * log(labelHat) + (DenseVector.ones[Double](numExamples) - label).t * log(DenseVector.ones[Double](numExamples) - labelHat)) / numExamples

    val dw = feature.t * (labelHat - label) /:/ numExamples.toDouble
    val db = DenseVector.ones[Double](numExamples).t * (labelHat - label) / numExamples.toDouble

//    println("DEBUG: the (dw, db) is " + dw + ", " + db)

    (cost, dw, db)
  }

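The commented-out lines are debugging code from development; I did not add a logging library to this project, so println was the only way to debug. feature.rows and feature.cols correspond to feature.shape[0] and feature.shape[1] in Python NumPy. sigmoid is a function provided by breeze.numerics._ that accepts a DenseVector or a DenseMatrix as its argument. For the computation of cost, dw, and db, see the theory of Logistic Regression; if anything is unclear, Andrew Ng's Deep Learning course covers it. One point to note: Python NumPy supports broadcasting, so 1 - np.array([1, 2, 3]) yields np.array([0, -1, -2]); that is, when a constant is combined with a vector or matrix, NumPy automatically applies the operation between the constant and every element. Breeze's support for this is limited, so when computing cost we have to write DenseVector.ones[Double](numExamples) - label rather than simply 1 - label.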
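
For reference, the quantities computed in propagate correspond to the standard Logistic Regression formulas below. The notation is chosen here for illustration and does not appear in the code: X is the feature matrix with one example per row, y is the label vector, m is the number of examples (numExamples), and the bold 1 is the all-ones vector.

```latex
\begin{align}
\hat{y} &= \sigma(Xw + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
J(w, b) &= -\frac{1}{m}\left[\, y^{\top}\log\hat{y} + (\mathbf{1} - y)^{\top}\log(\mathbf{1} - \hat{y}) \,\right] \\
\frac{\partial J}{\partial w} &= \frac{1}{m}\, X^{\top}(\hat{y} - y) \\
\frac{\partial J}{\partial b} &= \frac{1}{m}\, \mathbf{1}^{\top}(\hat{y} - y)
\end{align}
```

The two gradients follow from the chain rule: with z = Xw + b, the derivative of J with respect to z simplifies to (ŷ - y)/m, and propagating it through z gives the expressions for dw and db.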

Next is the method that makes predictions with the trained model:

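def predict(feature: DenseMatrix[Double]): DenseVector[Double] = {

  // Use the same linear term as in training, including the bias b.
  val yPredicted = sigmoid(feature * this.w + this.b).map{eachY =>
    // Predict 1 when the estimated probability exceeds 0.5.
    if (eachY <= 0.5) 0.0 else 1.0
  }

  yPredicted
}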

Here we use map, a staple of functional programming, which keeps the code concise.

Next is the method that computes prediction accuracy:

def accuracy(label: DenseVector[Double], labelPredicted: DenseVector[Double]): Double = {
  val numCorrect = (0 until label.length)
    .map{index =>
      if (label(index) == labelPredicted(index)) 1 else 0
    }
    .count(_ == 1)
  numCorrect.toDouble / label.length.toDouble
}

This again relies on functional programming features, and the code stays very compact.
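
The same count can be obtained in a single pass by zipping labels with predictions, which avoids the intermediate sequence of 0/1 flags. The sketch below works on plain Scala Seq values rather than DenseVector so that it runs without Breeze; the object name and the sample data are made up for this example.

```scala
object AccuracySketch {
  // Fraction of positions where the predicted label equals the true label.
  def accuracy(label: Seq[Double], predicted: Seq[Double]): Double = {
    require(label.length == predicted.length, "label and prediction lengths differ")
    require(label.nonEmpty, "cannot compute accuracy of an empty set")
    val numCorrect = label.zip(predicted).count { case (y, yHat) => y == yHat }
    numCorrect.toDouble / label.length
  }

  def main(args: Array[String]): Unit = {
    val label = Seq(1.0, 0.0, 1.0, 1.0)
    val predicted = Seq(1.0, 0.0, 0.0, 1.0)
    println(accuracy(label, predicted)) // three of the four predictions match
  }
}
```

Comparing doubles with == is safe here only because both sides hold exact 0.0 or 1.0 values.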

Finally, there are a few auxiliary getter methods:

def getCostHistory: mutable.TreeMap[Int, Double] = this.costHistory

def getLearningRate: Double = this.learningRate

def getIterationTime: Int = this.iterationTime

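That completes the Scala implementation of Logistic Regression; the complete code is listed below. As the old verse goes, what is learned from books always feels shallow, and real understanding only comes from doing it yourself. If anything is unclear, copy the code into your local environment, run it, and modify the parts in question to see how the program behaves; you will learn a lot that way.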

package org.mengpan.deeplearning.model

import breeze.linalg.{DenseMatrix, DenseVector, max}
import breeze.numerics.{log, sigmoid}

import scala.collection.mutable

/**
  * Created by mengpan on 2017/8/15.
  */
class LogisticRegressionModel() {
  var learningRate: Double = _
  var iterationTime: Int = _
  var w: DenseVector[Double] = _
  var b: Double = _
  val costHistory: mutable.TreeMap[Int, Double] = new mutable.TreeMap[Int, Double]()

  def setLearningRate(learningRate: Double): this.type = {
    this.learningRate = learningRate
    this
  }

  def setIterationTime(iterationTime: Int): this.type = {
    this.iterationTime = iterationTime
    this
  }

  def train(feature: DenseMatrix[Double], label: DenseVector[Double]): this.type = {

    var (w, b) = initializeParams(feature.cols)

    (1 to this.iterationTime)
      .foreach{i =>
        val (cost, dw, db) = propagate(feature, label, w, b)

        if (i % 100 == 0) println("INFO: Cost in " + i + "th time of iteration: " + cost)
        costHistory.put(i, cost)

        val adjustedLearningRate = this.learningRate / (log(i/1000 + 1) + 1)
        w :-= adjustedLearningRate * dw
        b -= adjustedLearningRate * db
      }

    this.w = w
    this.b = b
    this
  }

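  def predict(feature: DenseMatrix[Double]): DenseVector[Double] = {

    // Use the same linear term as in training, including the bias b.
    val yPredicted = sigmoid(feature * this.w + this.b).map{eachY =>
      // Predict 1 when the estimated probability exceeds 0.5.
      if (eachY <= 0.5) 0.0 else 1.0
    }

    yPredicted
  }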

  def accuracy(label: DenseVector[Double], labelPredicted: DenseVector[Double]): Double = {
    val numCorrect = (0 until label.length)
      .map{index =>
        if (label(index) == labelPredicted(index)) 1 else 0
      }
      .count(_ == 1)
    numCorrect.toDouble / label.length.toDouble
  }

  def getCostHistory: mutable.TreeMap[Int, Double] = this.costHistory

  def getLearningRate: Double = this.learningRate

  def getIterationTime: Int = this.iterationTime

  private def initializeParams(featureSize: Int): (DenseVector[Double], Double) = {
    val w = DenseVector.rand[Double](featureSize)
    val b = DenseVector.rand[Double](1).data(0)
    (w, b)
  }

  private def propagate(feature: DenseMatrix[Double], label: DenseVector[Double], w: DenseVector[Double], b: Double): (Double, DenseVector[Double], Double) = {
    val numExamples = feature.rows
    val labelHat = sigmoid(feature * w + b)

//    println("DEBUG: feature * w + b is " + feature * w + b)
//    println("DEBUG: the feature's number of cols is " + feature.cols)
//    println("DEBUG: the feature's number of rows is " + feature.rows)
//    println("DEBUG: the labelHat is " + labelHat)

    val cost = -(label.t * log(labelHat) + (DenseVector.ones[Double](numExamples) - label).t * log(DenseVector.ones[Double](numExamples) - labelHat)) / numExamples

    val dw = feature.t * (labelHat - label) /:/ numExamples.toDouble
    val db = DenseVector.ones[Double](numExamples).t * (labelHat - label) / numExamples.toDouble

//    println("DEBUG: the (dw, db) is " + dw + ", " + db)

    (cost, dw, db)
  }
}

4. Other Utility Code

First, the case class that represents a Cat:

package org.mengpan.deeplearning.data

import breeze.linalg.DenseVector

/**
  * Created by mengpan on 2017/8/15.
  */
case class Cat(feature: DenseVector[Double], label: Double)

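Next is CatDataHelper, which reads the image data from the local disk. It is a Scala object, the counterpart of a class with only static members: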

package org.mengpan.deeplearning.helper

import java.io.File
import javax.imageio.ImageIO

import breeze.linalg.{DenseMatrix, DenseVector}
import org.mengpan.deeplearning.data.Cat

import scala.io.Source

/**
  * Created by mengpan on 2017/8/15.
  */
object CatDataHelper {
  def getAllCatData: DlCollection[Cat] = {

    val labels = getLabels
    val catNonCatLabels = getBalancedBatNonCatLabels(labels)

    val catList = catNonCatLabels.map{indexedLabel =>

      val fileNumber = indexedLabel._1
      val label = indexedLabel._2
      val animalFileName: String = "/Users/mengpan/Downloads/train/" + fileNumber + ".png"
      val feature = getFeatureForOneAnimal(animalFileName)

      feature match {
        case Some(s) => Cat(s, label)
        case None => Cat(DenseVector.zeros[Double](10), label)
      }
    }
      .filter{cat =>
        cat.feature.length != 10
      }
      .toList

    new DlCollection[Cat](catList)
 }

  private def getFeatureForOneAnimal(animalFileName: String): Option[DenseVector[Double]] = {
    println("Reading file: " + animalFileName)

    try {
      val image = ImageIO.read(new File(animalFileName))
      val imageData = image.getData

      val redVector = DenseVector.zeros[Double](imageData.getHeight * imageData.getWidth)
      val greenVector = DenseVector.zeros[Double](imageData.getHeight * imageData.getWidth)
      val blueVector = DenseVector.zeros[Double](imageData.getHeight * imageData.getWidth)

      (0 until imageData.getHeight).foreach{height =>
        (0 until imageData.getWidth).foreach{width =>
          val RGB = imageData.getPixel(width, height, Array(0, 0, 0))
          // Row-major index: each row of pixels occupies getWidth consecutive slots.
          val index = width + height * imageData.getWidth
          redVector(index) = RGB(0)
          greenVector(index) = RGB(1)
          blueVector(index) = RGB(2)
        }
      }

      val resVector = DenseMatrix(redVector, greenVector, blueVector).reshape(imageData.getHeight*imageData.getWidth*3, 1).toDenseVector
      Some((resVector - breeze.stats.mean(resVector)) /:/ breeze.stats.stddev(resVector))
    } catch {
      case _: Exception => None
    }
  }

  private def getLabels: Vector[(Int, String)] = {
    Source
      .fromFile("/Users/mengpan/Downloads/trainLabels.csv")
      .getLines()
      .map{eachRow =>
        val split = eachRow.split(",")
        (split(0), split(1))
      }
      .filter{eachRow =>
        eachRow._1 != "id"
      }
      .map{eachRow =>
        (eachRow._1.toInt, eachRow._2)
      }
      .toVector
  }

  private def getBalancedBatNonCatLabels(labels: Vector[(Int, String)]): Vector[(Int, Int)] = {
    labels
      .map{label =>
      val numLabel = label._2 match {
        case "cat" => 1
        case "automobile" => 0
        case _ => 2
      }
      (label._1, numLabel)
    }
      .filter{label =>
        label._2 != 2
      }
  }

}
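
getFeatureForOneAnimal stores each pixel of a two-dimensional image in a one-dimensional vector, which requires a mapping from (width, height) coordinates to a single index. A common choice, sketched here with plain arrays and hypothetical names, is row-major order, where the row number is multiplied by the image width; every pixel gets its own slot only when that multiplier equals the width.

```scala
object PixelIndexSketch {
  // Row-major position of pixel (x, y) in a flattened image with the given width.
  def flatIndex(x: Int, y: Int, width: Int): Int = x + y * width

  // Flattens a height x width grid of channel values into one array, row by row.
  def flatten(channel: Array[Array[Double]]): Array[Double] = {
    val height = channel.length
    val width = if (height == 0) 0 else channel(0).length
    val flat = Array.ofDim[Double](height * width)
    for (y <- 0 until height; x <- 0 until width) {
      flat(flatIndex(x, y, width)) = channel(y)(x)
    }
    flat
  }

  def main(args: Array[String]): Unit = {
    // A made-up 2 x 3 single-channel image.
    val channel = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0))
    println(flatten(channel).mkString(", "))
  }
}
```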

Next is DlCollection, the container this project uses to hold a data set:

package org.mengpan.deeplearning.helper

import breeze.linalg.{DenseMatrix, DenseVector}
import org.mengpan.deeplearning.data.Cat

/**
  * Created by mengpan on 2017/8/15.
  */
class DlCollection[E <: Cat](data: List[E]) {
  private val numRows: Int = this.data.size
  private val numCols: Int = this.data.head.feature.length

  def split(trainingSize: Double): (DlCollection[E], DlCollection[E]) = {
    val splited = data.splitAt((data.length * trainingSize).toInt)
    (new DlCollection[E](splited._1), new DlCollection[E](splited._2))
  }

  def getFeatureAsMatrix: DenseMatrix[Double] = {
    val feature = DenseMatrix.zeros[Double](this.numRows, this.numCols)

    var i = 0
    this.data.foreach{eachRow =>
      feature(i, ::) := eachRow.feature.t
      i = i+1
    }

    feature
  }

  def getLabelAsVector: DenseVector[Double] = {
    val label = DenseVector.zeros[Double](this.numRows)

    var i: Int = 0
    this.data.foreach{eachRow =>
      label(i) = eachRow.label
      i += 1
    }

    label
  }


  override def toString = s"DlCollection($numRows, $numCols, $getFeatureAsMatrix, $getLabelAsVector)"
}
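
The split method cuts the list at a fixed position in its existing order. Whether the rows in the label file are already in random order is not something this article establishes, so shuffling before splitting is an optional precaution. The sketch below shows one way to do it on a plain List; the object name, the function shuffledSplit, and the seed value are assumptions made for this example and are not part of the project.

```scala
import scala.util.Random

object ShuffledSplitSketch {
  // Shuffles with a fixed seed for reproducibility, then cuts at the requested fraction.
  def shuffledSplit[A](data: List[A], trainingSize: Double, seed: Long): (List[A], List[A]) = {
    require(trainingSize >= 0.0 && trainingSize <= 1.0, "trainingSize must be in [0, 1]")
    val shuffled = new Random(seed).shuffle(data)
    shuffled.splitAt((data.length * trainingSize).toInt)
  }

  def main(args: Array[String]): Unit = {
    val (training, test) = shuffledSplit((1 to 10).toList, 0.8, seed = 42L)
    println(s"training size = ${training.length}, test size = ${test.length}")
  }
}
```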

Finally, the plotting utility, a thin wrapper around JFreeChart:

package org.mengpan.deeplearning.utils

import javax.swing.JFrame

import org.jfree.chart.plot.PlotOrientation
import org.jfree.chart.{ChartFactory, ChartPanel, JFreeChart}
import org.jfree.data.xy.DefaultXYDataset

import scala.collection.mutable

/**
  * Created by mengpan on 2017/8/17.
  */
object PlotUtils {
  def plotCostHistory(costHistory: mutable.TreeMap[Int, Double]): Unit = {

    val x = costHistory.keys.toArray.map{_.toDouble}
    val y = costHistory.values.toArray[Double]

    val data = Array(x, y)

    val xyDataset: DefaultXYDataset = new DefaultXYDataset()
    xyDataset.addSeries("Iteration v.s. Cost", data)

    val jFreeChart: JFreeChart = ChartFactory.createScatterPlot("Cost History",
      "Iteration", "Cost", xyDataset, PlotOrientation.VERTICAL, true, false, false)

    val panel = new ChartPanel(jFreeChart, true)

    val frame = new JFrame()

    frame.add(panel)
    frame.setBounds(50, 50, 800, 600)
    frame.setVisible(true)
  }
}
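
plotCostHistory relies on the fact that a TreeMap iterates in ascending key order, so the x array (iterations) and the y array (costs) stay aligned even if entries were inserted out of order. The sketch below isolates that conversion step without any JFreeChart dependency; the object name, the helper toSeries, and the cost values are made up for illustration.

```scala
import scala.collection.mutable

object CostHistorySketch {
  // Converts the cost history into the Array(x, y) layout passed to
  // DefaultXYDataset.addSeries: one row of x values, one row of y values.
  def toSeries(costHistory: mutable.TreeMap[Int, Double]): Array[Array[Double]] = {
    val x = costHistory.keys.toArray.map(_.toDouble)
    val y = costHistory.values.toArray
    Array(x, y)
  }

  def main(args: Array[String]): Unit = {
    val history = new mutable.TreeMap[Int, Double]()
    // Inserted out of order on purpose; the values are invented.
    history.put(3, 0.40)
    history.put(1, 0.69)
    history.put(2, 0.55)
    val series = toSeries(history)
    println(series(0).mkString(", ")) // iterations in ascending order
    println(series(1).mkString(", ")) // costs aligned with those iterations
  }
}
```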

5. Demo

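I could not find the cat image set that Andrew Ng uses in the Deep Learning course, so I tested with CIFAR-10, a well-known data set in image recognition. This example uses only two of its ten classes, cat and automobile, for classification. CIFAR-10 can be found by searching online, or downloaded directly from Kaggle: https://www.kaggle.com/c/cifar-10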

The demo code used in this article follows:

package org.mengpan.deeplearning.demo

import org.mengpan.deeplearning.data.Cat
import org.mengpan.deeplearning.helper.{CatDataHelper, DlCollection}
import org.mengpan.deeplearning.model.LogisticRegressionModel
import org.mengpan.deeplearning.utils.PlotUtils

/**
  * Created by mengpan on 2017/8/15.
  */
object ClassOneLogisticRegressionDemo extends App{
  // Load the cat image data set
  val catData: DlCollection[Cat] = CatDataHelper.getAllCatData

  // Split into a training set and a test set
  val (training, test) = catData.split(0.8)

  // Get the features and labels of the training and test sets
  val trainingFeature = training.getFeatureAsMatrix
  val trainingLabel = training.getLabelAsVector
  val testFeature = test.getFeatureAsMatrix
  val testLabel = test.getLabelAsVector

  // Initialize the LR model
  val lrModel: LogisticRegressionModel = new LogisticRegressionModel()
    .setLearningRate(0.005)
    .setIterationTime(3000)

  // Train the model on the training set
  val trainedModel: LogisticRegressionModel = lrModel.train(trainingFeature, trainingLabel)

  // Evaluate the model to obtain its quality metrics
  val yPredicted = trainedModel.predict(testFeature)
  val trainYPredicted = trainedModel.predict(trainingFeature)

  val testAccuracy = trainedModel.accuracy(testLabel, yPredicted)
  val trainAccuracy = trainedModel.accuracy(trainingLabel, trainYPredicted)
  println("\n The train accuracy of this model is: " + trainAccuracy)
  println("\n The test accuracy of this model is: " + testAccuracy)

  // Plot the cost against the iteration count for the training run
  val costHistory = trainedModel.getCostHistory
  PlotUtils.plotCostHistory(costHistory)
}

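The final model accuracy is:

The train accuracy of this model is: 0.634125

The test accuracy of this model is: 0.6195

Note that the data set comes from a Kaggle competition and the images are very blurry, and Logistic Regression is not a particularly powerful algorithm, so a fairly low accuracy is to be expected. In later lessons I will use what I learn to improve this classification result.

Thank you for reading.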
