1-背景：

此前，我们已经介绍过单隐藏层的神经网络模型，本文要介绍的是多隐藏层的神经网络模型。
采用非线性的如RELU激活函数

符号说明：

上标 [l] 表示层号， lth
- 例如: a[L] 是第 Lth 层的激活函数. W[L] 和 b[L] 分别是 Lth 层的参数。
上标 (i) 表示第 ith 个样本。
- 例如: x(i) 表示第 ith 个训练样本。
下标 i 表示 ith 神经元位置。
- 例如: a[l]i 表示第 lth 层，第 ith 个神经元的激活函数。

2- 准备工作：

预先需要的一个库和文件

import numpy as np
import h5py
import matplotlib.pyplot as plt
from testCases_v2 import *
from dnn_utils_v2 import sigmoid, sigmoid_backward, relu, relu_backward

%matplotlib inline
plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

np.random.seed(1)#使得随机函数的调用具有一致性

3- 概要

双隐藏层和L层神经网络模型的参数初始化
做前向传播操作
- 计算正向传播的LINEAR 部分，因为每个神经元节点都是由两部分组成，这在逻辑回归里面有阐述。线性部分即Z=WX+b这部分，输出部分就是A,就是将线性部分的结果输入到激活函数所产生的结果。
- 采用RELU或者sigmoid激活函数计算结果值
- 联合上述两个步骤，进行前向传播操作[LINEAR->ACTIVATION]
- 对输出层之前的L-1层，做L-1次的前向传播 [LINEAR->RELU] ，并将结果输出到第L层[LINEAR->SIGMOID]。所以在前面L-1层我们的激活函数是RELU，在输出层我们的激活函数是sigmoid。
计算损失函数
做后向传播操作（下图红色区域部分）
- 计算神经网络反向传播的LINEAR部分
- 计算激活函数（RELU或者sigmoid）的梯度
- 结合前面两个步骤，产生一个新的后向函数[LINEAR->ACTIVATION]
更新参数

流程图：
DeepLearing学习笔记-Building your Deep Neural Network: Step by Step(第四周作业)

Figure 1
注意：每个正向函数都是和反向函数相关联的。所以，正向传播模块的每一个步骤都会将反向传播需要用到的值存储在cache中。在反向传播模块，我们需要用到cache中的值计算梯度。

4- 初始化

在此，设计了两个初始化化函数，分别用以双层模型和泛化的L层模型。

4-1 双层神经网络

该模型的结构是：LINEAR -> RELU -> LINEAR -> SIGMOID
对于权重矩阵采用随机化方式进行初始化（np.random.randn(shape)*0.01），对于偏移值b矩阵则采用0矩阵即可(np.zeros(shape))。
初始化代码如下：

# GRADED FUNCTION: initialize_parameters

def initialize_parameters(n_x, n_h, n_y):
"""
 Argument:
 n_x -- size of the input layer
 n_h -- size of the hidden layer
 n_y -- size of the output layer

 Returns:
 parameters -- python dictionary containing your parameters:
 W1 -- weight matrix of shape (n_h, n_x)
 b1 -- bias vector of shape (n_h, 1)
 W2 -- weight matrix of shape (n_y, n_h)
 b2 -- bias vector of shape (n_y, 1)
 """

    np.random.seed(1)

### START CODE HERE ### (≈ 4 lines of code)
    W1 = np.random.randn(n_h, n_x)*0.01 
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h)*0.01 
    b2 = np.zeros((n_y, 1))
### END CODE HERE ###

assert(W1.shape == (n_h, n_x))
assert(b1.shape == (n_h, 1))
assert(W2.shape == (n_y, n_h))
assert(b2.shape == (n_y, 1))

    parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2}

return parameters

测试代码如下：

parameters = initialize_parameters(2,2,1)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

测试代码运行结果如下：

W1 = [[ 0.01624345 -0.00611756]
 [-0.00528172 -0.01072969]]
b1 = [[ 0.]
 [ 0.]]
W2 = [[ 0.00865408 -0.02301539]]
b2 = [[ 0.]]

4-2 L层神经网络

对于L层的神经网络由于涉及到很多的权重矩阵和偏移矩阵显得更加复杂。要特别注意的是矩阵之间的尺寸匹配。 n[l] 表示 l 层的神经元数量。例如输入 X 的尺寸是 (12288,209) ( m=209 表示样本数) ：

	Shape of W	Shape of b	Activation	Shape of Activation
Layer 1	(n[1],12288)	(n[1],1)	Z[1]=W[1]X+b[1]	(n[1],209)
Layer 2	(n[2],n[1])	(n[2],1)	Z[2]=W[2]A[1]+b[2]	(n[2],209)
⋮	⋮	⋮	⋮	⋮
Layer L-1	(n[L−1],n[L−2])	(n[L−1],1)	Z[L−1]=W[L−1]A[L−2]+b[L−1]	(n[L−1],209)
Layer L	(n[L],n[L−1])	(n[L],1)	Z[L]=W[L]A[L−1]+b[L]	(n[L],209)

对于 WX+b 在python中是由于broadcasting机制存在所以可以正常执行。例如:

W = ⎡ ⎣ ⎢ j m p k n q l o r ⎤ ⎦ ⎥ X = ⎡ ⎣ ⎢ a d g b e h c f i ⎤ ⎦ ⎥ b = ⎡ ⎣ ⎢ s t u ⎤ ⎦ ⎥ (1)

Then WX+b will be:

W X + b = ⎡ ⎣ ⎢ (j a + k d + l g) + s (m a + n d + o g) + t (p a + q d + r g) + u (j b + k e + l h) + s (m b + n e + o h) + t (p b + q e + r h) + u (j c + k f + l i) + s (m c + n f + o i) + t (p c + q f + r i) + u ⎤ ⎦ ⎥ (2)

对于L层模型：

模型结构： [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID。所以 L−1 层是需要用到 ReLU激活函数的。输出层用的是sigmoid函数。
权重矩阵采用仍旧是随机化初始化的方式： np.random.rand(shape) * 0.01
偏移矩阵仍旧是0矩阵进行处初始化： np.zeros(shape).
我们将每层的神经元数量 n[l] 信息进行存储，layer_dims。例如在平面数据分类模型中 layer_dims 的值是[2,4,1]，其中输入层的神经元个数是2，隐藏层的神经元个数是4，输出层的神经元个数是1。对应的 W1尺寸= (4,2), b1尺寸= (4,1), W2尺寸= (1,4) ， b2 尺寸= (1,1)。

代码如下：

# GRADED FUNCTION: initialize_parameters_deep

def initialize_parameters_deep(layer_dims):
"""
 Arguments:
 layer_dims -- python array (list) containing the dimensions of each layer in our network

 Returns:
 parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
 Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
 bl -- bias vector of shape (layer_dims[l], 1)
 """

    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)            # number of layers in the network

for l in range(1, L):
### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
### END CODE HERE ###

assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))


return parameters

测试代码：

parameters = initialize_parameters_deep([5,4,3])
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

测试结果如下：

W1 = [[ 0.01788628 0.0043651 0.00096497 -0.01863493 -0.00277388]
 [-0.00354759 -0.00082741 -0.00627001 -0.00043818 -0.00477218]
 [-0.01313865 0.00884622 0.00881318 0.01709573 0.00050034]
 [-0.00404677 -0.0054536 -0.01546477 0.00982367 -0.01101068]]
b1 = [[ 0.]
 [ 0.]
 [ 0.]
 [ 0.]]
W2 = [[-0.01185047 -0.0020565 0.01486148 0.00236716]
 [-0.01023785 -0.00712993 0.00625245 -0.00160513]
 [-0.00768836 -0.00230031 0.00745056 0.01976111]]
b2 = [[ 0.]
 [ 0.]
 [ 0.]]

5- 前向传播模型

5-1 线性传播部分

前向传播的过程，先计算如下的线性部分：

Z [l] = W [l] A [l - 1] + b [l] (3)

其中

A[0]=X .
代码如下：

# GRADED FUNCTION: linear_forward

def linear_forward(A, W, b):
"""
 Implement the linear part of a layer's forward propagation.

 Arguments:
 A -- activations from previous layer (or input data): (size of previous layer, number of examples)
 W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
 b -- bias vector, numpy array of shape (size of the current layer, 1)

 Returns:
 Z -- the input of the activation function, also called pre-activation parameter 
 cache -- a python dictionary containing "A", "W" and "b" ; stored for computing the backward pass efficiently
 """

### START CODE HERE ### (≈ 1 line of code)
    Z = np.dot(W, A) + b
### END CODE HERE ###

assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)

return Z, cache

测试代码：

def linear_forward_test_case():
    np.random.seed(1)
"""
 X = np.array([[-1.02387576, 1.12397796],
 [-1.62328545, 0.64667545],
 [-1.74314104, -0.59664964]])
 W = np.array([[ 0.74505627, 1.97611078, -1.24412333]])
 b = np.array([[1]])
 """
    A = np.random.randn(3,2)
    W = np.random.randn(1,3)
    b = np.random.randn(1,1)

return A, W, b


A, W, b = linear_forward_test_case()

Z, linear_cache = linear_forward(A, W, b)
print("Z = " + str(Z))

测试代码运行结果：

Z = [[ 3.26295337 -1.23429987]]

5-2 激活部分

在激活部分，本文用到两个激活函数：

Sigmoid: σ(Z)=σ(WA+b)=11+e−(WA+b) 。在这个步骤我们需要两个结果，一个是激活函数的结果值，另一个是包含”Z” 的”cache“值，这个我们在后向传播过程需要用到。

A, activation_cache = sigmoid(Z)

ReLU: 其数学表达式： A=RELU(Z)=max(0,Z) 。同样结果值有两部分，其一是激活函数结果值 “A” ，另一个是包含”Z“的 “cache“值。

A, activation_cache = relu(Z)

5-2-1 相邻两层的激活实现

代码实现：

# GRADED FUNCTION: linear_activation_forward

def linear_activation_forward(A_prev, W, b, activation):
"""
 Implement the forward propagation for the LINEAR->ACTIVATION layer

 Arguments:
 A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
 W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
 b -- bias vector, numpy array of shape (size of the current layer, 1)
 activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

 Returns:
 A -- the output of the activation function, also called the post-activation value 
 cache -- a python dictionary containing "linear_cache" and "activation_cache";
 stored for computing the backward pass efficiently
 """

if activation == "sigmoid":
# Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
### START CODE HERE ### (≈ 2 lines of code)
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)
### END CODE HERE ###

elif activation == "relu":
# Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
### START CODE HERE ### (≈ 2 lines of code)
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)
### END CODE HERE ###

assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)

return A, cache

#其中sigmoid和relu定义如下：
def sigmoid(Z):
"""
 Implements the sigmoid activation in numpy

 Arguments:
 Z -- numpy array of any shape

 Returns:
 A -- output of sigmoid(z), same shape as Z
 cache -- returns Z as well, useful during backpropagation
 """

    A = 1/(1+np.exp(-Z))
    cache = Z

return A, cache

def relu(Z):
"""
 Implement the RELU function.

 Arguments:
 Z -- Output of the linear layer, of any shape

 Returns:
 A -- Post-activation parameter, of the same shape as Z
 cache -- a python dictionary containing "A" ; stored for computing the backward pass efficiently
 """

    A = np.maximum(0,Z)

assert(A.shape == Z.shape)

    cache = Z 
return A, cache

测试代码：

def linear_activation_forward_test_case():
"""
 X = np.array([[-1.02387576, 1.12397796],
 [-1.62328545, 0.64667545],
 [-1.74314104, -0.59664964]])
 W = np.array([[ 0.74505627, 1.97611078, -1.24412333]])
 b = 5
 """
    np.random.seed(2)
    A_prev = np.random.randn(3,2)
    W = np.random.randn(1,3)
    b = np.random.randn(1,1)
return A_prev, W, b

A_prev, W, b = linear_activation_forward_test_case()

A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation = "sigmoid")
print("With sigmoid: A = " + str(A))

A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation = "relu")
print("With ReLU: A = " + str(A))

测试代码运行结果：

With sigmoid: A = [[ 0.96890023 0.11013289]]
With ReLU: A = [[ 3.43896131 0. ]]

5-2-2 L层模型：

上面已经阐述了相邻两层之间的激活模型，那么对于L层的神经网络，激活函数为RELU的linear_activation_forward 需要重复L-1次，而最后的输出层采用的参数为SIGMOID的linear_activation_forward 。
L层的前向传播如下：
DeepLearing学习笔记-Building your Deep Neural Network: Step by Step(第四周作业)

Figure 2 : [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID model

代码中我们用 AL 表示 A[L]=σ(Z[L])=σ(W[L]A[L−1]+b[L]) ，即 Y^

Tips:

复用此前的代码
循环 [LINEAR->RELU] (L-1) 次
注意保持 “caches” 中的数据。

代码：

# GRADED FUNCTION: L_model_forward

def L_model_forward(X, parameters):
"""
 Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation

 Arguments:
 X -- data, numpy array of shape (input size, number of examples)
 parameters -- output of initialize_parameters_deep()

 Returns:
 AL -- last post-activation value
 caches -- list of caches containing:
 every cache of linear_relu_forward() (there are L-1 of them, indexed from 0 to L-2)
 the cache of linear_sigmoid_forward() (there is one, indexed L-1)
 """

    caches = []
    A = X
    L = len(parameters) // 2                  # number of layers in the neural network

# Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
for l in range(1, L):
        A_prev = A 
### START CODE HERE ### (≈ 2 lines of code)
        A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)], parameters['b' + str(l)], activation = "relu")
        caches.append(cache)
### END CODE HERE ###

# Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
### START CODE HERE ### (≈ 2 lines of code)
    AL, cache = linear_activation_forward(A, parameters['W' + str(L)], parameters['b' + str(L)], activation = "sigmoid")
    caches.append(cache)
### END CODE HERE ###

assert(AL.shape == (1,X.shape[1]))

return AL, caches

测试代码：

def L_model_forward_test_case():
"""
 X = np.array([[-1.02387576, 1.12397796],
 [-1.62328545, 0.64667545],
 [-1.74314104, -0.59664964]])
 parameters = {'W1': np.array([[ 1.62434536, -0.61175641, -0.52817175],
 [-1.07296862, 0.86540763, -2.3015387 ]]),
 'W2': np.array([[ 1.74481176, -0.7612069 ]]),
 'b1': np.array([[ 0.],
 [ 0.]]),
 'b2': np.array([[ 0.]])}
 """
    np.random.seed(1)
    X = np.random.randn(4,2)
    W1 = np.random.randn(3,4)
    b1 = np.random.randn(3,1)
    W2 = np.random.randn(1,3)
    b2 = np.random.randn(1,1)
    parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2}

return X, parameters


X, parameters = L_model_forward_test_case()
AL, caches = L_model_forward(X, parameters)
print("AL = " + str(AL))
print("Length of caches list = " + str(len(caches)))

测试代码运行结果：

AL = [[ 0.17007265 0.2524272 ]]
Length of caches list = 2

至此，我们可以计算得到AL的值，该值包含了所有的预测结果。在caches中也记录了中间值。为此，我们可以用AL值来计算代价。

6- 代价函数

代价函数 J :

- 1 m \sum i = 1 m (y (i) log (a [L] (i)) + (1 - y (i)) log (1 - a [L] (i))) (4)

代码：

# GRADED FUNCTION: compute_cost

def compute_cost(AL, Y):
"""
 Implement the cost function defined by equation (7).

 Arguments:
 AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
 Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

 Returns:
 cost -- cross-entropy cost
 """

    m = Y.shape[1]

# Compute loss from aL and y.
### START CODE HERE ### (≈ 1 lines of code)
    cost = -np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1-Y, np.log(1-AL)), axis=1 ,keepdims=True)/m
### END CODE HERE ###

    cost = np.squeeze(cost)      # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
assert(cost.shape == ())

return cost

测试代码：

def compute_cost_test_case():
    Y = np.asarray([[1, 1, 1]])
    aL = np.array([[.8,.9,0.4]])

return Y, aL


Y, AL = compute_cost_test_case()

print("cost = " + str(compute_cost(AL, Y)))

测试代码运行结果：

cost = 0.414931599615397

7-后向传播模型

后向传播是为了计算各个参数梯度，其模型如下：
DeepLearing学习笔记-Building your Deep Neural Network: Step by Step(第四周作业)

Figure 3 : 前向传播和后向传播： LINEAR->RELU->LINEAR->SIGMOID
紫色模块表示前向传播, 红色模块表示反向传播
和之前的前向传播类似，后向传播模块的建立分以下三个步骤：

后向LINEAR（Linear backward）
ReLU 或者 sigmoid 激活函数的后向LINEAR -> ACTIVATION
[LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID backward (whole model)

7-1 后向Linear

对于 l 层，linear part= Z[l]=W[l]A[l−1]+b[l]
假设 dZ[l]=∂L∂Z[l] 已知，我们想要计算 (dW[l],db[l]dA[l−1]) 。
DeepLearing学习笔记-Building your Deep Neural Network: Step by Step(第四周作业)

Figure 4

三个输出 (dW[l],db[l],dA[l]) 可以通过输入 dZ[l] 计算获得。公式如下:

d W [l] = \partial L \partial W [ l ] = 1 m d Z [l] A [l - 1] T (5)

d b [l] = \partial L \partial b [ l ] = 1 m \sum i = 1 m d Z [l] (i) (6)

d A [l - 1] = \partial L \partial A [ l - 1 ] = W [l] T d Z [l] (7)

代码：

# GRADED FUNCTION: linear_backward

def linear_backward(dZ, cache):
"""
 Implement the linear portion of backward propagation for a single layer (layer l)

 Arguments:
 dZ -- Gradient of the cost with respect to the linear output (of current layer l)
 cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

 Returns:
 dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
 dW -- Gradient of the cost with respect to W (current layer l), same shape as W
 db -- Gradient of the cost with respect to b (current layer l), same shape as b
 """
    A_prev, W, b = cache
    m = A_prev.shape[1]

### START CODE HERE ### (≈ 3 lines of code)
    dW = np.dot(dZ,A_prev.T)/m
    db = np.sum(dZ,axis = 1, keepdims=True)/m
    dA_prev = np.dot(W.T, dZ)
### END CODE HERE ###

assert (dA_prev.shape == A_prev.shape)
assert (dW.shape == W.shape)
assert (db.shape == b.shape)

return dA_prev, dW, db

测试代码：

def linear_backward_test_case():
"""
 z, linear_cache = (np.array([[-0.8019545 , 3.85763489]]), (np.array([[-1.02387576, 1.12397796],
 [-1.62328545, 0.64667545],
 [-1.74314104, -0.59664964]]), np.array([[ 0.74505627, 1.97611078, -1.24412333]]), np.array([[1]]))
 """
    np.random.seed(1)
    dZ = np.random.randn(1,2)
    A = np.random.randn(3,2)
    W = np.random.randn(1,3)
    b = np.random.randn(1,1)
    linear_cache = (A, W, b)
return dZ, linear_cache


# Set up some test inputs
dZ, linear_cache = linear_backward_test_case()

dA_prev, dW, db = linear_backward(dZ, linear_cache)
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db))

测试代码运行结果：

dA_prev = [[ 0.51822968 -0.19517421]
 [-0.40506361 0.15255393]
 [ 2.37496825 -0.89445391]]
dW = [[-0.10076895 1.40685096 1.64992505]]
db = [[ 0.50629448]]

7-2 Linear-Activation backward

对于sigmoid函数，可以定义两个函数：

sigmoid_backward:用以计算 SIGMOID单元：

dZ = sigmoid_backward(dA, activation_cache)#其用到的cache值是Z值

relu_backward: 用以计算RELU的 backward propagation：

dZ = relu_backward(dA, activation_cache)

对于 g(.) 的激活函数：
sigmoid_backward 和relu_backward 的计算如下

d Z [l] = d A [l] * g' (Z [l]) (8)

代码：

# GRADED FUNCTION: linear_activation_backward

def linear_activation_backward(dA, cache, activation):
"""
 Implement the backward propagation for the LINEAR->ACTIVATION layer.

 Arguments:
 dA -- post-activation gradient for current layer l 
 cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
 activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

 Returns:
 dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
 dW -- Gradient of the cost with respect to W (current layer l), same shape as W
 db -- Gradient of the cost with respect to b (current layer l), same shape as b
 """
    linear_cache, activation_cache = cache

if activation == "relu":
### START CODE HERE ### (≈ 2 lines of code)
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
### END CODE HERE ###

elif activation == "sigmoid":
### START CODE HERE ### (≈ 2 lines of code)
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
### END CODE HERE ###

return dA_prev, dW, db

#relu_backward定义如下：
def relu_backward(dA, cache):
"""
 Implement the backward propagation for a single RELU unit.

 Arguments:
 dA -- post-activation gradient, of any shape
 cache -- 'Z' where we store for computing backward propagation efficiently

 Returns:
 dZ -- Gradient of the cost with respect to Z
 """

    Z = cache
    dZ = np.array(dA, copy=True) # just converting dz to a correct object.

# When z <= 0, you should set dz to 0 as well. 
    dZ[Z <= 0] = 0

assert (dZ.shape == Z.shape)

return dZ

#sigmoid_backward定义如下：
def sigmoid_backward(dA, cache):
"""
 Implement the backward propagation for a single SIGMOID unit.

 Arguments:
 dA -- post-activation gradient, of any shape
 cache -- 'Z' where we store for computing backward propagation efficiently

 Returns:
 dZ -- Gradient of the cost with respect to Z
 """

    Z = cache

    s = 1/(1+np.exp(-Z))
    dZ = dA * s * (1-s)

assert (dZ.shape == Z.shape)

return dZ

注意上述两个激活函数dZ的求法。

测试代码：

def linear_activation_backward_test_case():
"""
 aL, linear_activation_cache = (np.array([[ 3.1980455 , 7.85763489]]), ((np.array([[-1.02387576, 1.12397796], [-1.62328545, 0.64667545], [-1.74314104, -0.59664964]]), np.array([[ 0.74505627, 1.97611078, -1.24412333]]), 5), np.array([[ 3.1980455 , 7.85763489]])))
 """
    np.random.seed(2)
    dA = np.random.randn(1,2)
    A = np.random.randn(3,2)
    W = np.random.randn(1,3)
    b = np.random.randn(1,1)
    Z = np.random.randn(1,2)
    linear_cache = (A, W, b)
    activation_cache = Z
    linear_activation_cache = (linear_cache, activation_cache)

return dA, linear_activation_cache


AL, linear_activation_cache = linear_activation_backward_test_case()

dA_prev, dW, db = linear_activation_backward(AL, linear_activation_cache, activation = "sigmoid")
print ("sigmoid:")
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db) + "\n")

dA_prev, dW, db = linear_activation_backward(AL, linear_activation_cache, activation = "relu")
print ("relu:")
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db))

测试代码运行结果：

sigmoid:
dA_prev = [[ 0.11017994 0.01105339]
 [ 0.09466817 0.00949723]
 [-0.05743092 -0.00576154]]
dW = [[ 0.10266786 0.09778551 -0.01968084]]
db = [[-0.05729622]]

relu:
dA_prev = [[ 0.44090989 -0. ]
 [ 0.37883606 -0. ]
 [-0.2298228 0. ]]
dW = [[ 0.44513824 0.37371418 -0.10478989]]
db = [[-0.20837892]]

7-3 L层模型的后向传播

现在我们开始对整个神经网络做后向传播，定义函数为L_model_forward。在每次的迭代过程中，我们都将 cache值=(X,W,b, z)保留，用以后向模块中梯度的计算。在L_model_forward中，我们是重复了L次上述的步骤。
DeepLearing学习笔记-Building your Deep Neural Network: Step by Step(第四周作业)

Figure 5 : Backward流程

后向传播初始化：
对于后向传播，我们知道前向传播的输出是 A[L]=σ(Z[L]) ，我们需要计算 dAL =∂L∂A[L] ，我们用以下的公式表示：

dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
# derivative of cost with respect to AL

之后，我们可以用这个后向激活的梯度 dAL 进行向后传播。再从 dAL 计算 LINEAR->SIGMOID 的后向传播结果。对于 LINEAR->RELU backward 函数，我们可以采用for 循环来处理这L-1次操作。在此期间，我们要存储 dA, dW, db，本文用grads字典来存储：

g r a d s [" d W " + s t r (l)] = d W [l] (9)

例如, 对于 l=3 ，则 dW[l] 以 grads["dW3"]形式存储。
模型如下：
[LINEAR->RELU] × (L-1) -> LINEAR -> SIGMOID
代码实现：

# GRADED FUNCTION: L_model_backward

def L_model_backward(AL, Y, caches):
"""
 Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group

 Arguments:
 AL -- probability vector, output of the forward propagation (L_model_forward())
 Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
 caches -- list of caches containing:
 every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
 the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])

 Returns:
 grads -- A dictionary with the gradients
 grads["dA" + str(l)] = ...
 grads["dW" + str(l)] = ...
 grads["db" + str(l)] = ...
 """
    grads = {}
    L = len(caches) # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL

# Initializing the backpropagation
### START CODE HERE ### (1 line of code)
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
### END CODE HERE ###

# Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "AL, Y, caches". Outputs: "grads["dAL"], grads["dWL"], grads["dbL"]
### START CODE HERE ### (approx. 2 lines)
    current_cache = caches[L-1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, activation = "sigmoid")
### END CODE HERE ###
print (L)
for l in reversed(range(L - 1)):
print (l)
# lth layer: (RELU -> LINEAR) gradients.
# Inputs: "grads["dA" + str(l + 2)], caches". Outputs: "grads["dA" + str(l + 1)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)] 
### START CODE HERE ### (approx. 5 lines)
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l + 2)], current_cache, activation = "relu")
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
### END CODE HERE ###

return grads

测试代码：

AL, Y_assess, caches = L_model_backward_test_case()
grads = L_model_backward(AL, Y_assess, caches)
print ("dW1 = "+ str(grads["dW1"]))
print ("db1 = "+ str(grads["db1"]))
print ("dA1 = "+ str(grads["dA1"]))


def L_model_backward_test_case():
"""
 X = np.random.rand(3,2)
 Y = np.array([[1, 1]])
 parameters = {'W1': np.array([[ 1.78862847, 0.43650985, 0.09649747]]), 'b1': np.array([[ 0.]])}

 aL, caches = (np.array([[ 0.60298372, 0.87182628]]), [((np.array([[ 0.20445225, 0.87811744],
 [ 0.02738759, 0.67046751],
 [ 0.4173048 , 0.55868983]]),
 np.array([[ 1.78862847, 0.43650985, 0.09649747]]),
 np.array([[ 0.]])),
 np.array([[ 0.41791293, 1.91720367]]))])
 """
    np.random.seed(3)
    AL = np.random.randn(1, 2)
    Y = np.array([[1, 0]])

    A1 = np.random.randn(4,2)
    W1 = np.random.randn(3,4)
    b1 = np.random.randn(3,1)
    Z1 = np.random.randn(3,2)
    linear_cache_activation_1 = ((A1, W1, b1), Z1)

    A2 = np.random.randn(3,2)
    W2 = np.random.randn(1,3)
    b2 = np.random.randn(1,1)
    Z2 = np.random.randn(1,2)
    linear_cache_activation_2 = ( (A2, W2, b2), Z2)

    caches = (linear_cache_activation_1, linear_cache_activation_2)

return AL, Y, caches

测试代码运行结果如下：

dW1 = [[ 0.41010002 0.07807203 0.13798444 0.10502167]
 [ 0. 0. 0. 0. ]
 [ 0.05283652 0.01005865 0.01777766 0.0135308 ]]
db1 = [[-0.22007063]
 [ 0. ]
 [-0.02835349]]
dA1 = [[ 0. 0.52257901]
 [ 0. -0.3269206 ]
 [ 0. -0.32070404]
 [ 0. -0.74079187]]

7-4 参数更新

采用梯度下降进行参数的更新：

W [l] = W [l] - α d W [l] (10)

b [l] = b [l] - α d b [l] (11)

其中 α 是学习率。
代码实现：

# GRADED FUNCTION: update_parameters

def update_parameters(parameters, grads, learning_rate):
"""
 Update parameters using gradient descent

 Arguments:
 parameters -- python dictionary containing your parameters 
 grads -- python dictionary containing your gradients, output of L_model_backward

 Returns:
 parameters -- python dictionary containing your updated parameters 
 parameters["W" + str(l)] = ... 
 parameters["b" + str(l)] = ...
 """

    L = len(parameters) // 2 # number of layers in the neural network

# Update rule for each parameter. Use a for loop.
### START CODE HERE ### (≈ 3 lines of code)
for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate*grads["dW"+str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate*grads["db"+str(l+1)]
### END CODE HERE ###

return parameters

测试代码运行：

parameters, grads = update_parameters_test_case()
parameters = update_parameters(parameters, grads, 0.1)

print ("W1 = "+ str(parameters["W1"]))
print ("b1 = "+ str(parameters["b1"]))
print ("W2 = "+ str(parameters["W2"]))
print ("b2 = "+ str(parameters["b2"]))


def update_parameters_test_case():
"""
 parameters = {'W1': np.array([[ 1.78862847, 0.43650985, 0.09649747],
 [-1.8634927 , -0.2773882 , -0.35475898],
 [-0.08274148, -0.62700068, -0.04381817],
 [-0.47721803, -1.31386475, 0.88462238]]),
 'W2': np.array([[ 0.88131804, 1.70957306, 0.05003364, -0.40467741],
 [-0.54535995, -1.54647732, 0.98236743, -1.10106763],
 [-1.18504653, -0.2056499 , 1.48614836, 0.23671627]]),
 'W3': np.array([[-1.02378514, -0.7129932 , 0.62524497],
 [-0.16051336, -0.76883635, -0.23003072]]),
 'b1': np.array([[ 0.],
 [ 0.],
 [ 0.],
 [ 0.]]),
 'b2': np.array([[ 0.],
 [ 0.],
 [ 0.]]),
 'b3': np.array([[ 0.],
 [ 0.]])}
 grads = {'dW1': np.array([[ 0.63070583, 0.66482653, 0.18308507],
 [ 0. , 0. , 0. ],
 [ 0. , 0. , 0. ],
 [ 0. , 0. , 0. ]]),
 'dW2': np.array([[ 1.62934255, 0. , 0. , 0. ],
 [ 0. , 0. , 0. , 0. ],
 [ 0. , 0. , 0. , 0. ]]),
 'dW3': np.array([[-1.40260776, 0. , 0. ]]),
 'da1': np.array([[ 0.70760786, 0.65063504],
 [ 0.17268975, 0.15878569],
 [ 0.03817582, 0.03510211]]),
 'da2': np.array([[ 0.39561478, 0.36376198],
 [ 0.7674101 , 0.70562233],
 [ 0.0224596 , 0.02065127],
 [-0.18165561, -0.16702967]]),
 'da3': np.array([[ 0.44888991, 0.41274769],
 [ 0.31261975, 0.28744927],
 [-0.27414557, -0.25207283]]),
 'db1': 0.75937676204411464,
 'db2': 0.86163759922811056,
 'db3': -0.84161956022334572}
 """
    np.random.seed(2)
    W1 = np.random.randn(3,4)
    b1 = np.random.randn(3,1)
    W2 = np.random.randn(1,3)
    b2 = np.random.randn(1,1)
    parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2}
    np.random.seed(3)
    dW1 = np.random.randn(3,4)
    db1 = np.random.randn(3,1)
    dW2 = np.random.randn(1,3)
    db2 = np.random.randn(1,1)
    grads = {"dW1": dW1,
"db1": db1,
"dW2": dW2,
"db2": db2}

return parameters, grads

运行结果如下：

W1 = [[-0.59562069 -0.09991781 -2.14584584 1.82662008]
 [-1.76569676 -0.80627147 0.51115557 -1.18258802]
 [-1.0535704 -0.86128581 0.68284052 2.20374577]]
b1 = [[-0.04659241]
 [-1.28888275]
 [ 0.53405496]]
W2 = [[-0.55569196 0.0354055 1.32964895]]
b2 = [[-0.84610769]]

秒客网

DeepLearing学习笔记-Building your Deep Neural Network: Step by Step(第四周作业)

1-背景：

2- 准备工作：

3- 概要

4- 初始化

4-1 双层神经网络

4-2 L层神经网络

5- 前向传播模型

5-1 线性传播部分

5-2 激活部分

5-2-1 相邻两层的激活实现

5-2-2 L层模型：

6- 代价函数

7-后向传播模型

7-1 后向Linear

7-2 Linear-Activation backward

7-3 L层模型的后向传播

7-4 参数更新

相关文章