DeepLearing学习笔记-Building your Deep Neural Network: Step by Step(第四周作业)

时间:2021-01-15 20:14:57

1-背景:

此前,我们已经介绍过单隐藏层的神经网络模型,本文要介绍的是多隐藏层的神经网络模型。
采用非线性的如RELU激活函数

符号说明:

  • 上标 [l] 表示层号, lth
    • 例如: a[L] 是第 Lth 层的激活函数. W[L] b[L] 分别是 Lth 层的参数。
  • 上标 (i) 表示第 ith 个样本。
    • 例如: x(i) 表示第 ith 个训练样本。
  • 下标 i 表示 ith 神经元位置。
    • 例如: a[l]i 表示第 lth 层,第 ith 个神经元的激活函数。

2- 准备工作:

预先需要的一个库和文件

import numpy as np
import h5py
import matplotlib.pyplot as plt
from testCases_v2 import *
from dnn_utils_v2 import sigmoid, sigmoid_backward, relu, relu_backward

%matplotlib inline
plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

np.random.seed(1)#使得随机函数的调用具有一致性

3- 概要

  • 双隐藏层和L层神经网络模型的参数初始化
  • 做前向传播操作
    • 计算正向传播的LINEAR 部分,因为每个神经元节点都是由两部分组成,这在逻辑回归里面有阐述。线性部分即Z=WX+b这部分,输出部分就是A,就是将线性部分的结果输入到激活函数所产生的结果。
    • 采用RELU或者sigmoid激活函数计算结果值
    • 联合上述两个步骤,进行前向传播操作[LINEAR->ACTIVATION]
    • 对输出层之前的L-1层,做L-1次的前向传播 [LINEAR->RELU] ,并将结果输出到第L层[LINEAR->SIGMOID]。所以在前面L-1层我们的激活函数是RELU,在输出层我们的激活函数是sigmoid。
  • 计算损失函数
  • 做后向传播操作(下图红色区域部分)
    • 计算神经网络反向传播的LINEAR部分
    • 计算激活函数(RELU或者sigmoid)的梯度
    • 结合前面两个步骤,产生一个新的后向函数[LINEAR->ACTIVATION]
  • 更新参数

流程图:
DeepLearing学习笔记-Building your Deep Neural Network: Step by Step(第四周作业)

Figure 1

注意:每个正向函数都是和反向函数相关联的。所以,正向传播模块的每一个步骤都会将反向传播需要用到的值存储在cache中。在反向传播模块,我们需要用到cache中的值计算梯度。

4- 初始化

在此,设计了两个初始化化函数,分别用以双层模型和泛化的L层模型。

4-1 双层神经网络

该模型的结构是:LINEAR -> RELU -> LINEAR -> SIGMOID
对于权重矩阵采用随机化方式进行初始化(np.random.randn(shape)*0.01),对于偏移值b矩阵则采用0矩阵即可(np.zeros(shape))。
初始化代码如下:

# GRADED FUNCTION: initialize_parameters

def initialize_parameters(n_x, n_h, n_y):
"""
Argument:
n_x -- size of the input layer
n_h -- size of the hidden layer
n_y -- size of the output layer

Returns:
parameters -- python dictionary containing your parameters:
W1 -- weight matrix of shape (n_h, n_x)
b1 -- bias vector of shape (n_h, 1)
W2 -- weight matrix of shape (n_y, n_h)
b2 -- bias vector of shape (n_y, 1)
"""


np.random.seed(1)

### START CODE HERE ### (≈ 4 lines of code)
W1 = np.random.randn(n_h, n_x)*0.01
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(n_y, n_h)*0.01
b2 = np.zeros((n_y, 1))
### END CODE HERE ###

assert(W1.shape == (n_h, n_x))
assert(b1.shape == (n_h, 1))
assert(W2.shape == (n_y, n_h))
assert(b2.shape == (n_y, 1))

parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2}

return parameters

测试代码如下:

parameters = initialize_parameters(2,2,1)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

测试代码运行结果如下:

W1 = [[ 0.01624345 -0.00611756]
[-0.00528172 -0.01072969]]

b1 = [[ 0.]
[ 0.]]

W2 = [[ 0.00865408 -0.02301539]]
b2 = [[ 0.]]

4-2 L层神经网络

对于L层的神经网络由于涉及到很多的权重矩阵和偏移矩阵显得更加复杂。要特别注意的是矩阵之间的尺寸匹配。 n[l] 表示 l 层的神经元数量。例如输入 X 的尺寸是 (12288,209) ( m=209 表示样本数) :

Shape of W Shape of b Activation Shape of Activation
Layer 1 (n[1],12288) (n[1],1) Z[1]=W[1]X+b[1] (n[1],209)
Layer 2 (n[2],n[1]) (n[2],1) Z[2]=W[2]A[1]+b[2] (n[2],209)
Layer L-1 (n[L1],n[L2]) (n[L1],1) Z[L1]=W[L1]A[L2]+b[L1] (n[L1],209)
Layer L (n[L],n[L1]) (n[L],1) Z[L]=W[L]A[L1]+b[L] (n[L],209)

对于 WX+b 在python中是由于broadcasting机制存在所以可以正常执行。例如:

W=jmpknqlorX=adgbehcfib=stu(1)

Then WX+b will be:

WX+b=(ja+kd+lg)+s(ma+nd+og)+t(pa+qd+rg)+u(jb+ke+lh)+s(mb+ne+oh)+t(pb+qe+rh)+u(jc+kf+li)+s(mc+nf+oi)+t(pc+qf+ri)+u(2)

对于L层模型:

  • 模型结构: [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID。所以 L1 层是需要用到 ReLU激活函数的。输出层用的是sigmoid函数。
  • 权重矩阵采用仍旧是随机化初始化的方式: np.random.rand(shape) * 0.01
  • 偏移矩阵仍旧是0矩阵进行处初始化: np.zeros(shape).
  • 我们将每层的神经元数量 n[l] 信息进行存储,layer_dims。例如在平面数据分类模型中 layer_dims 的值是[2,4,1],其中输入层的神经元个数是2,隐藏层的神经元个数是4,输出层的神经元个数是1。对应的 W1尺寸= (4,2), b1尺寸= (4,1), W2尺寸= (1,4) , b2 尺寸= (1,1)。

代码如下:

# GRADED FUNCTION: initialize_parameters_deep

def initialize_parameters_deep(layer_dims):
"""
Arguments:
layer_dims -- python array (list) containing the dimensions of each layer in our network

Returns:
parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
bl -- bias vector of shape (layer_dims[l], 1)
"""


np.random.seed(3)
parameters = {}
L = len(layer_dims) # number of layers in the network

for l in range(1, L):
### START CODE HERE ### (≈ 2 lines of code)
parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
### END CODE HERE ###

assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))


return parameters

测试代码:

parameters = initialize_parameters_deep([5,4,3])
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

测试结果如下:

W1 = [[ 0.01788628 0.0043651 0.00096497 -0.01863493 -0.00277388]
[-0.00354759 -0.00082741 -0.00627001 -0.00043818 -0.00477218]
[-0.01313865 0.00884622 0.00881318 0.01709573 0.00050034]
[-0.00404677 -0.0054536 -0.01546477 0.00982367 -0.01101068]]

b1 = [[ 0.]
[ 0.]
[ 0.]
[ 0.]]

W2 = [[-0.01185047 -0.0020565 0.01486148 0.00236716]
[-0.01023785 -0.00712993 0.00625245 -0.00160513]
[-0.00768836 -0.00230031 0.00745056 0.01976111]]

b2 = [[ 0.]
[ 0.]
[ 0.]]

5- 前向传播模型

5-1 线性传播部分

前向传播的过程,先计算如下的线性部分:

Z[l]=W[l]A[l1]+b[l](3)

其中 A[0]=X .
代码如下:

# GRADED FUNCTION: linear_forward

def linear_forward(A, W, b):
"""
Implement the linear part of a layer's forward propagation.

Arguments:
A -- activations from previous layer (or input data): (size of previous layer, number of examples)
W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
b -- bias vector, numpy array of shape (size of the current layer, 1)

Returns:
Z -- the input of the activation function, also called pre-activation parameter
cache -- a python dictionary containing "A", "W" and "b" ; stored for computing the backward pass efficiently
"""


### START CODE HERE ### (≈ 1 line of code)
Z = np.dot(W, A) + b
### END CODE HERE ###

assert(Z.shape == (W.shape[0], A.shape[1]))
cache = (A, W, b)

return Z, cache

测试代码:

def linear_forward_test_case():
np.random.seed(1)
"""
X = np.array([[-1.02387576, 1.12397796],
[-1.62328545, 0.64667545],
[-1.74314104, -0.59664964]])
W = np.array([[ 0.74505627, 1.97611078, -1.24412333]])
b = np.array([[1]])
"""

A = np.random.randn(3,2)
W = np.random.randn(1,3)
b = np.random.randn(1,1)

return A, W, b


A, W, b = linear_forward_test_case()

Z, linear_cache = linear_forward(A, W, b)
print("Z = " + str(Z))

测试代码运行结果:

Z = [[ 3.26295337 -1.23429987]]

5-2 激活部分

在激活部分,本文用到两个激活函数:

  • Sigmoid: σ(Z)=σ(WA+b)=11+e(WA+b) 。在这个步骤我们需要两个结果,一个是激活函数的结果值,另一个是包含”Z” 的”cache“值 ,这个我们在后向传播过程需要用到。
A, activation_cache = sigmoid(Z)
  • ReLU: 其数学表达式: A=RELU(Z)=max(0,Z) 。同样结果值有两部分,其一是激活函数结果值 “A” ,另一个是包含”Z“的 “cache“值。
A, activation_cache = relu(Z)

5-2-1 相邻两层的激活实现

代码实现:

# GRADED FUNCTION: linear_activation_forward

def linear_activation_forward(A_prev, W, b, activation):
"""
Implement the forward propagation for the LINEAR->ACTIVATION layer

Arguments:
A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
b -- bias vector, numpy array of shape (size of the current layer, 1)
activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

Returns:
A -- the output of the activation function, also called the post-activation value
cache -- a python dictionary containing "linear_cache" and "activation_cache";
stored for computing the backward pass efficiently
"""


if activation == "sigmoid":
# Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
### START CODE HERE ### (≈ 2 lines of code)
Z, linear_cache = linear_forward(A_prev, W, b)
A, activation_cache = sigmoid(Z)
### END CODE HERE ###

elif activation == "relu":
# Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
### START CODE HERE ### (≈ 2 lines of code)
Z, linear_cache = linear_forward(A_prev, W, b)
A, activation_cache = relu(Z)
### END CODE HERE ###

assert (A.shape == (W.shape[0], A_prev.shape[1]))
cache = (linear_cache, activation_cache)

return A, cache

#其中sigmoid和relu定义如下:
def sigmoid(Z):
"""
Implements the sigmoid activation in numpy

Arguments:
Z -- numpy array of any shape

Returns:
A -- output of sigmoid(z), same shape as Z
cache -- returns Z as well, useful during backpropagation
"""


A = 1/(1+np.exp(-Z))
cache = Z

return A, cache

def relu(Z):
"""
Implement the RELU function.

Arguments:
Z -- Output of the linear layer, of any shape

Returns:
A -- Post-activation parameter, of the same shape as Z
cache -- a python dictionary containing "A" ; stored for computing the backward pass efficiently
"""


A = np.maximum(0,Z)

assert(A.shape == Z.shape)

cache = Z
return A, cache

测试代码:

def linear_activation_forward_test_case():
"""
X = np.array([[-1.02387576, 1.12397796],
[-1.62328545, 0.64667545],
[-1.74314104, -0.59664964]])
W = np.array([[ 0.74505627, 1.97611078, -1.24412333]])
b = 5
"""

np.random.seed(2)
A_prev = np.random.randn(3,2)
W = np.random.randn(1,3)
b = np.random.randn(1,1)
return A_prev, W, b

A_prev, W, b = linear_activation_forward_test_case()

A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation = "sigmoid")
print("With sigmoid: A = " + str(A))

A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation = "relu")
print("With ReLU: A = " + str(A))

测试代码运行结果:

With sigmoid: A = [[ 0.96890023 0.11013289]]
With ReLU: A = [[ 3.43896131 0. ]]

5-2-2 L层模型:

上面已经阐述了相邻两层之间的激活模型,那么对于L层的神经网络,激活函数为RELU的linear_activation_forward 需要重复L-1次,而最后的输出层采用的参数为SIGMOID的linear_activation_forward 。
L层的前向传播如下:
DeepLearing学习笔记-Building your Deep Neural Network: Step by Step(第四周作业)

Figure 2 : [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID model

代码中我们用 AL 表示 A[L]=σ(Z[L])=σ(W[L]A[L1]+b[L]) ,即 Y^

Tips:

  • 复用此前的代码
  • 循环 [LINEAR->RELU] (L-1) 次
  • 注意保持 “caches” 中的数据。

代码:

# GRADED FUNCTION: L_model_forward

def L_model_forward(X, parameters):
"""
Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation

Arguments:
X -- data, numpy array of shape (input size, number of examples)
parameters -- output of initialize_parameters_deep()

Returns:
AL -- last post-activation value
caches -- list of caches containing:
every cache of linear_relu_forward() (there are L-1 of them, indexed from 0 to L-2)
the cache of linear_sigmoid_forward() (there is one, indexed L-1)
"""


caches = []
A = X
L = len(parameters) // 2 # number of layers in the neural network

# Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
for l in range(1, L):
A_prev = A
### START CODE HERE ### (≈ 2 lines of code)
A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)], parameters['b' + str(l)], activation = "relu")
caches.append(cache)
### END CODE HERE ###

# Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
### START CODE HERE ### (≈ 2 lines of code)
AL, cache = linear_activation_forward(A, parameters['W' + str(L)], parameters['b' + str(L)], activation = "sigmoid")
caches.append(cache)
### END CODE HERE ###

assert(AL.shape == (1,X.shape[1]))

return AL, caches

测试代码:

def L_model_forward_test_case():
"""
X = np.array([[-1.02387576, 1.12397796],
[-1.62328545, 0.64667545],
[-1.74314104, -0.59664964]])
parameters = {'W1': np.array([[ 1.62434536, -0.61175641, -0.52817175],
[-1.07296862, 0.86540763, -2.3015387 ]]),
'W2': np.array([[ 1.74481176, -0.7612069 ]]),
'b1': np.array([[ 0.],
[ 0.]]),
'b2': np.array([[ 0.]])}
"""

np.random.seed(1)
X = np.random.randn(4,2)
W1 = np.random.randn(3,4)
b1 = np.random.randn(3,1)
W2 = np.random.randn(1,3)
b2 = np.random.randn(1,1)
parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2}

return X, parameters


X, parameters = L_model_forward_test_case()
AL, caches = L_model_forward(X, parameters)
print("AL = " + str(AL))
print("Length of caches list = " + str(len(caches)))

测试代码运行结果:

AL = [[ 0.17007265 0.2524272 ]]
Length of caches list = 2

至此,我们可以计算得到AL的值,该值包含了所有的预测结果。在caches中也记录了中间值。为此,我们可以用AL值来计算代价。

6- 代价函数

代价函数 J :

1mi=1m(y(i)log(a[L](i))+(1y(i))log(1a[L](i)))(4)

代码:

# GRADED FUNCTION: compute_cost

def compute_cost(AL, Y):
"""
Implement the cost function defined by equation (7).

Arguments:
AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

Returns:
cost -- cross-entropy cost
"""


m = Y.shape[1]

# Compute loss from aL and y.
### START CODE HERE ### (≈ 1 lines of code)
cost = -np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1-Y, np.log(1-AL)), axis=1 ,keepdims=True)/m
### END CODE HERE ###

cost = np.squeeze(cost) # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
assert(cost.shape == ())

return cost

测试代码:

def compute_cost_test_case():
Y = np.asarray([[1, 1, 1]])
aL = np.array([[.8,.9,0.4]])

return Y, aL


Y, AL = compute_cost_test_case()

print("cost = " + str(compute_cost(AL, Y)))

测试代码运行结果:

cost = 0.414931599615397

7-后向传播模型

后向传播是为了计算各个参数梯度,其模型如下:
DeepLearing学习笔记-Building your Deep Neural Network: Step by Step(第四周作业)

Figure 3 : 前向传播和后向传播: LINEAR->RELU->LINEAR->SIGMOID
紫色模块表示前向传播, 红色模块表示反向传播

和之前的前向传播类似,后向传播模块的建立分以下三个步骤:

  • 后向LINEAR(Linear backward)
  • ReLU 或者 sigmoid 激活函数的后向LINEAR -> ACTIVATION
  • [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID backward (whole model)

7-1 后向Linear

对于 l 层,linear part= Z[l]=W[l]A[l1]+b[l]
假设 dZ[l]=LZ[l] 已知,我们想要计算 (dW[l],db[l]dA[l1])
DeepLearing学习笔记-Building your Deep Neural Network: Step by Step(第四周作业)

Figure 4

三个输出 (dW[l],db[l],dA[l]) 可以通过输入 dZ[l] 计算获得。公式如下:

dW[l]=LW[l]=1mdZ[l]A[l1]T(5)

db[l]=Lb[l]=1mi=1mdZ[l](i)(6)

dA[l1]=LA[l1]=W[l]TdZ[l](7)

代码:

# GRADED FUNCTION: linear_backward

def linear_backward(dZ, cache):
"""
Implement the linear portion of backward propagation for a single layer (layer l)

Arguments:
dZ -- Gradient of the cost with respect to the linear output (of current layer l)
cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

Returns:
dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
dW -- Gradient of the cost with respect to W (current layer l), same shape as W
db -- Gradient of the cost with respect to b (current layer l), same shape as b
"""

A_prev, W, b = cache
m = A_prev.shape[1]

### START CODE HERE ### (≈ 3 lines of code)
dW = np.dot(dZ,A_prev.T)/m
db = np.sum(dZ,axis = 1, keepdims=True)/m
dA_prev = np.dot(W.T, dZ)
### END CODE HERE ###

assert (dA_prev.shape == A_prev.shape)
assert (dW.shape == W.shape)
assert (db.shape == b.shape)

return dA_prev, dW, db

测试代码:

def linear_backward_test_case():
"""
z, linear_cache = (np.array([[-0.8019545 , 3.85763489]]), (np.array([[-1.02387576, 1.12397796],
[-1.62328545, 0.64667545],
[-1.74314104, -0.59664964]]), np.array([[ 0.74505627, 1.97611078, -1.24412333]]), np.array([[1]]))
"""

np.random.seed(1)
dZ = np.random.randn(1,2)
A = np.random.randn(3,2)
W = np.random.randn(1,3)
b = np.random.randn(1,1)
linear_cache = (A, W, b)
return dZ, linear_cache


# Set up some test inputs
dZ, linear_cache = linear_backward_test_case()

dA_prev, dW, db = linear_backward(dZ, linear_cache)
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db))

测试代码运行结果:

dA_prev = [[ 0.51822968 -0.19517421]
[-0.40506361 0.15255393]
[ 2.37496825 -0.89445391]]

dW = [[-0.10076895 1.40685096 1.64992505]]
db = [[ 0.50629448]]

7-2 Linear-Activation backward

对于sigmoid函数,可以定义两个函数:

  • sigmoid_backward:用以计算 SIGMOID单元:
dZ = sigmoid_backward(dA, activation_cache)#其用到的cache值是Z值
  • relu_backward: 用以计算RELU的 backward propagation:
dZ = relu_backward(dA, activation_cache)

对于 g(.) 的激活函数:
sigmoid_backwardrelu_backward 的计算如下

dZ[l]=dA[l]g(Z[l])(8)
.

代码:

# GRADED FUNCTION: linear_activation_backward

def linear_activation_backward(dA, cache, activation):
"""
Implement the backward propagation for the LINEAR->ACTIVATION layer.

Arguments:
dA -- post-activation gradient for current layer l
cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

Returns:
dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
dW -- Gradient of the cost with respect to W (current layer l), same shape as W
db -- Gradient of the cost with respect to b (current layer l), same shape as b
"""

linear_cache, activation_cache = cache

if activation == "relu":
### START CODE HERE ### (≈ 2 lines of code)
dZ = relu_backward(dA, activation_cache)
dA_prev, dW, db = linear_backward(dZ, linear_cache)
### END CODE HERE ###

elif activation == "sigmoid":
### START CODE HERE ### (≈ 2 lines of code)
dZ = sigmoid_backward(dA, activation_cache)
dA_prev, dW, db = linear_backward(dZ, linear_cache)
### END CODE HERE ###

return dA_prev, dW, db

#relu_backward定义如下:
def relu_backward(dA, cache):
"""
Implement the backward propagation for a single RELU unit.

Arguments:
dA -- post-activation gradient, of any shape
cache -- 'Z' where we store for computing backward propagation efficiently

Returns:
dZ -- Gradient of the cost with respect to Z
"""


Z = cache
dZ = np.array(dA, copy=True) # just converting dz to a correct object.

# When z <= 0, you should set dz to 0 as well.
dZ[Z <= 0] = 0

assert (dZ.shape == Z.shape)

return dZ

#sigmoid_backward定义如下:
def sigmoid_backward(dA, cache):
"""
Implement the backward propagation for a single SIGMOID unit.

Arguments:
dA -- post-activation gradient, of any shape
cache -- 'Z' where we store for computing backward propagation efficiently

Returns:
dZ -- Gradient of the cost with respect to Z
"""


Z = cache

s = 1/(1+np.exp(-Z))
dZ = dA * s * (1-s)

assert (dZ.shape == Z.shape)

return dZ

注意上述两个激活函数dZ的求法。

测试代码:

def linear_activation_backward_test_case():
"""
aL, linear_activation_cache = (np.array([[ 3.1980455 , 7.85763489]]), ((np.array([[-1.02387576, 1.12397796], [-1.62328545, 0.64667545], [-1.74314104, -0.59664964]]), np.array([[ 0.74505627, 1.97611078, -1.24412333]]), 5), np.array([[ 3.1980455 , 7.85763489]])))
"""

np.random.seed(2)
dA = np.random.randn(1,2)
A = np.random.randn(3,2)
W = np.random.randn(1,3)
b = np.random.randn(1,1)
Z = np.random.randn(1,2)
linear_cache = (A, W, b)
activation_cache = Z
linear_activation_cache = (linear_cache, activation_cache)

return dA, linear_activation_cache


AL, linear_activation_cache = linear_activation_backward_test_case()

dA_prev, dW, db = linear_activation_backward(AL, linear_activation_cache, activation = "sigmoid")
print ("sigmoid:")
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db) + "\n")

dA_prev, dW, db = linear_activation_backward(AL, linear_activation_cache, activation = "relu")
print ("relu:")
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db))

测试代码运行结果:

sigmoid:
dA_prev = [[ 0.11017994 0.01105339]
[ 0.09466817 0.00949723]
[-0.05743092 -0.00576154]]

dW = [[ 0.10266786 0.09778551 -0.01968084]]
db = [[-0.05729622]]

relu:
dA_prev = [[ 0.44090989 -0. ]
[ 0.37883606 -0. ]
[-0.2298228 0. ]]

dW = [[ 0.44513824 0.37371418 -0.10478989]]
db = [[-0.20837892]]

7-3 L层模型的后向传播

现在我们开始对整个神经网络做后向传播,定义函数为L_model_forward。在每次的迭代过程中,我们都将 cache值=(X,W,b, z)保留,用以后向模块中梯度的计算。在L_model_forward中,我们是重复了L次上述的步骤。
DeepLearing学习笔记-Building your Deep Neural Network: Step by Step(第四周作业)

Figure 5 : Backward流程

后向传播初始化
对于后向传播,我们知道前向传播的输出是 A[L]=σ(Z[L]) ,我们需要计算 dAL =LA[L] ,我们用以下的公式表示:

dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
# derivative of cost with respect to AL

之后,我们可以用这个后向激活的梯度 dAL 进行向后传播。再从 dAL 计算 LINEAR->SIGMOID 的后向传播结果。对于 LINEAR->RELU backward 函数,我们可以采用for 循环来处理这L-1次操作。在此期间,我们要存储 dA, dW, db,本文用grads字典来存储:

grads["dW"+str(l)]=dW[l](9)

例如, 对于 l=3 ,则 dW[l] grads["dW3"]形式存储。
模型如下:
[LINEAR->RELU] × (L-1) -> LINEAR -> SIGMOID
代码实现:

# GRADED FUNCTION: L_model_backward

def L_model_backward(AL, Y, caches):
"""
Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group

Arguments:
AL -- probability vector, output of the forward propagation (L_model_forward())
Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
caches -- list of caches containing:
every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])

Returns:
grads -- A dictionary with the gradients
grads["dA" + str(l)] = ...
grads["dW" + str(l)] = ...
grads["db" + str(l)] = ...
"""

grads = {}
L = len(caches) # the number of layers
m = AL.shape[1]
Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL

# Initializing the backpropagation
### START CODE HERE ### (1 line of code)
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
### END CODE HERE ###

# Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "AL, Y, caches". Outputs: "grads["dAL"], grads["dWL"], grads["dbL"]
### START CODE HERE ### (approx. 2 lines)
current_cache = caches[L-1]
grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, activation = "sigmoid")
### END CODE HERE ###
print (L)
for l in reversed(range(L - 1)):
print (l)
# lth layer: (RELU -> LINEAR) gradients.
# Inputs: "grads["dA" + str(l + 2)], caches". Outputs: "grads["dA" + str(l + 1)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)]
### START CODE HERE ### (approx. 5 lines)
current_cache = caches[l]
dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l + 2)], current_cache, activation = "relu")
grads["dA" + str(l + 1)] = dA_prev_temp
grads["dW" + str(l + 1)] = dW_temp
grads["db" + str(l + 1)] = db_temp
### END CODE HERE ###

return grads

测试代码:

AL, Y_assess, caches = L_model_backward_test_case()
grads = L_model_backward(AL, Y_assess, caches)
print ("dW1 = "+ str(grads["dW1"]))
print ("db1 = "+ str(grads["db1"]))
print ("dA1 = "+ str(grads["dA1"]))


def L_model_backward_test_case():
"""
X = np.random.rand(3,2)
Y = np.array([[1, 1]])
parameters = {'W1': np.array([[ 1.78862847, 0.43650985, 0.09649747]]), 'b1': np.array([[ 0.]])}

aL, caches = (np.array([[ 0.60298372, 0.87182628]]), [((np.array([[ 0.20445225, 0.87811744],
[ 0.02738759, 0.67046751],
[ 0.4173048 , 0.55868983]]),
np.array([[ 1.78862847, 0.43650985, 0.09649747]]),
np.array([[ 0.]])),
np.array([[ 0.41791293, 1.91720367]]))])
"""

np.random.seed(3)
AL = np.random.randn(1, 2)
Y = np.array([[1, 0]])

A1 = np.random.randn(4,2)
W1 = np.random.randn(3,4)
b1 = np.random.randn(3,1)
Z1 = np.random.randn(3,2)
linear_cache_activation_1 = ((A1, W1, b1), Z1)

A2 = np.random.randn(3,2)
W2 = np.random.randn(1,3)
b2 = np.random.randn(1,1)
Z2 = np.random.randn(1,2)
linear_cache_activation_2 = ( (A2, W2, b2), Z2)

caches = (linear_cache_activation_1, linear_cache_activation_2)

return AL, Y, caches

测试代码运行结果如下:

dW1 = [[ 0.41010002 0.07807203 0.13798444 0.10502167]
[ 0. 0. 0. 0. ]
[ 0.05283652 0.01005865 0.01777766 0.0135308 ]]

db1 = [[-0.22007063]
[ 0. ]
[-0.02835349]]

dA1 = [[ 0. 0.52257901]
[ 0. -0.3269206 ]
[ 0. -0.32070404]
[ 0. -0.74079187]]

7-4 参数更新

采用梯度下降进行参数的更新:

W[l]=W[l]α dW[l](10)

b[l]=b[l]α db[l](11)

其中 α 是学习率。
代码实现:

# GRADED FUNCTION: update_parameters

def update_parameters(parameters, grads, learning_rate):
"""
Update parameters using gradient descent

Arguments:
parameters -- python dictionary containing your parameters
grads -- python dictionary containing your gradients, output of L_model_backward

Returns:
parameters -- python dictionary containing your updated parameters
parameters["W" + str(l)] = ...
parameters["b" + str(l)] = ...
"""


L = len(parameters) // 2 # number of layers in the neural network

# Update rule for each parameter. Use a for loop.
### START CODE HERE ### (≈ 3 lines of code)
for l in range(L):
parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate*grads["dW"+str(l+1)]
parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate*grads["db"+str(l+1)]
### END CODE HERE ###

return parameters

测试代码运行:

parameters, grads = update_parameters_test_case()
parameters = update_parameters(parameters, grads, 0.1)

print ("W1 = "+ str(parameters["W1"]))
print ("b1 = "+ str(parameters["b1"]))
print ("W2 = "+ str(parameters["W2"]))
print ("b2 = "+ str(parameters["b2"]))


def update_parameters_test_case():
"""
parameters = {'W1': np.array([[ 1.78862847, 0.43650985, 0.09649747],
[-1.8634927 , -0.2773882 , -0.35475898],
[-0.08274148, -0.62700068, -0.04381817],
[-0.47721803, -1.31386475, 0.88462238]]),
'W2': np.array([[ 0.88131804, 1.70957306, 0.05003364, -0.40467741],
[-0.54535995, -1.54647732, 0.98236743, -1.10106763],
[-1.18504653, -0.2056499 , 1.48614836, 0.23671627]]),
'W3': np.array([[-1.02378514, -0.7129932 , 0.62524497],
[-0.16051336, -0.76883635, -0.23003072]]),
'b1': np.array([[ 0.],
[ 0.],
[ 0.],
[ 0.]]),
'b2': np.array([[ 0.],
[ 0.],
[ 0.]]),
'b3': np.array([[ 0.],
[ 0.]])}
grads = {'dW1': np.array([[ 0.63070583, 0.66482653, 0.18308507],
[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ]]),
'dW2': np.array([[ 1.62934255, 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]]),
'dW3': np.array([[-1.40260776, 0. , 0. ]]),
'da1': np.array([[ 0.70760786, 0.65063504],
[ 0.17268975, 0.15878569],
[ 0.03817582, 0.03510211]]),
'da2': np.array([[ 0.39561478, 0.36376198],
[ 0.7674101 , 0.70562233],
[ 0.0224596 , 0.02065127],
[-0.18165561, -0.16702967]]),
'da3': np.array([[ 0.44888991, 0.41274769],
[ 0.31261975, 0.28744927],
[-0.27414557, -0.25207283]]),
'db1': 0.75937676204411464,
'db2': 0.86163759922811056,
'db3': -0.84161956022334572}
"""

np.random.seed(2)
W1 = np.random.randn(3,4)
b1 = np.random.randn(3,1)
W2 = np.random.randn(1,3)
b2 = np.random.randn(1,1)
parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2}
np.random.seed(3)
dW1 = np.random.randn(3,4)
db1 = np.random.randn(3,1)
dW2 = np.random.randn(1,3)
db2 = np.random.randn(1,1)
grads = {"dW1": dW1,
"db1": db1,
"dW2": dW2,
"db2": db2}

return parameters, grads

运行结果如下:

W1 = [[-0.59562069 -0.09991781 -2.14584584 1.82662008]
[-1.76569676 -0.80627147 0.51115557 -1.18258802]
[-1.0535704 -0.86128581 0.68284052 2.20374577]]

b1 = [[-0.04659241]
[-1.28888275]
[ 0.53405496]]

W2 = [[-0.55569196 0.0354055 1.32964895]]
b2 = [[-0.84610769]]