Stochastic Gradient Descent and Convolutional Neural Networks - 码农场

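Exercise: Convolutional Neural Network
Overview
Step 0: Initialize Parameters and Load Data
Step 1: Implement the CNN Cost Function
Forward Propagation
Computing the Cost
Backpropagation
Gradient Calculation
Step 2: Gradient Check
Step 3: Learn Parameters
Step 4: Test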
Reference

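Exercise: Convolutional Neural Network

Overview

In this exercise we implement a convolutional neural network for handwritten digit recognition. The network consists of one convolution-plus-pooling layer followed by a fully connected layer with a softmax output. The pooling layer uses mean pooling. Training uses backpropagation and stochastic gradient descent with momentum.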

Step 0: Initialize Parameters and Load Data

%% STEP 0: Initialize Parameters and Load Data
%  Here we initialize some parameters used for the exercise.

% Configuration
imageDim = 28;
numClasses = 10;  % Number of classes (MNIST images fall into 10 classes)
filterDim = 9;    % Filter size for conv layer
numFilters = 20;   % Number of filters for conv layer
poolDim = 2;      % Pooling dimension, (should divide imageDim-filterDim+1)

% Load MNIST Train
addpath ../common/;
images = loadMNISTImages('../common/train-images-idx3-ubyte');
images = reshape(images,imageDim,imageDim,[]);
labels = loadMNISTLabels('../common/train-labels-idx1-ubyte');
labels(labels==0) = 10; % Remap 0 to 10

% Initialize Parameters
theta = cnnInitParams(imageDim,filterDim,numFilters,poolDim,numClasses);

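As the parameters show, there are 20 filters of size 9×9 and the pooling region is 2×2. The data is loaded the same way as in the previous post.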
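The layer sizes implied by this configuration can be worked out by hand. The short sketch below repeats the arithmetic that cnnCost performs later (convDim, outputDim) and adds hiddenSize, following the formula given in the cnnParamsToStack comments. It is only an illustration of the sizes, not part of the exercise code.

```matlab
% Derived layer sizes for the configuration above (illustrative sketch).
imageDim = 28; filterDim = 9; numFilters = 20; poolDim = 2;

convDim = imageDim - filterDim + 1;      % side of the 'valid' convolution output
assert(mod(convDim, poolDim) == 0);      % pooling regions must tile the feature map
outputDim = convDim / poolDim;           % side of the pooled feature map
hiddenSize = outputDim^2 * numFilters;   % number of inputs to the softmax layer

fprintf('convDim=%d, outputDim=%d, hiddenSize=%d\n', convDim, outputDim, hiddenSize);
```

With these values each image yields a 20×20 map per filter, pooled to 10×10, so the softmax layer sees 2000 inputs per image.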

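Step 1: Implement the CNN Cost Function

The network has two layers: a convolutional layer with mean pooling and a fully connected softmax layer. The cost function is the cross-entropy between the predicted and true distributions over the 10 classes.

Forward Propagation

Apply every filter to every image (it is best to store the activations, since backpropagation needs them later), then average the responses to perform pooling. This part is similar to the previous exercise. Rather than describe it further, here is the code: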

function [cost, grad, preds] = cnnCost(theta,images,labels,numClasses, filterDim,numFilters,poolDim,pred)
% Calculate cost and gradient for a single layer convolutional
% neural network followed by a softmax layer with cross entropy
% objective.
%
% Parameters:
%  theta      -  unrolled parameter vector
%  images     -  stores images in imageDim x imageDim x numImages
%                array
%  labels     -  corresponding labels in numImages x 1 vector
%  numClasses -  number of classes to predict
%  filterDim  -  dimension of convolutional filter
%  numFilters -  number of convolutional filters
%  poolDim    -  dimension of pooling area
%  pred       -  boolean only forward propagate and return
%                predictions
%
%
% Returns:
%  cost       -  cross entropy cost
%  grad       -  gradient with respect to theta (if pred==False)
%  preds      -  list of predictions for each example (if pred==True)


if ~exist('pred','var')
    pred = false;
end;


imageDim = size(images,1); % height/width of image
numImages = size(images,3); % number of images
lambda = 3e-3; % weight decay parameter
%% Reshape parameters and setup gradient matrices

% Wc is filterDim x filterDim x numFilters parameter matrix
% bc is the corresponding bias

% Wd is numClasses x hiddenSize parameter matrix where hiddenSize
% is the number of output units from the convolutional layer
% bd is corresponding bias
[Wc, Wd, bc, bd] = cnnParamsToStack(theta,imageDim,filterDim,numFilters,...
                        poolDim,numClasses);

% Same sizes as Wc,Wd,bc,bd. Used to hold gradient w.r.t above params.
Wc_grad = zeros(size(Wc));
Wd_grad = zeros(size(Wd));
bc_grad = zeros(size(bc));
bd_grad = zeros(size(bd));

%%======================================================================
%% STEP 1a: Forward Propagation
%  In this step you will forward propagate the input through the
%  convolutional and subsampling (mean pooling) layers.  You will then use
%  the responses from the convolution and pooling layer as the input to a
%  standard softmax layer.

%% Convolutional Layer
%  For each image and each filter, convolve the image with the filter, add
%  the bias and apply the sigmoid nonlinearity.  Then subsample the
%  convolved activations with mean pooling.  Store the results of the
%  convolution in activations and the results of the pooling in
%  activationsPooled.  You will need to save the convolved activations for
%  backpropagation.
convDim = imageDim-filterDim+1; % dimension of convolved output
outputDim = (convDim)/poolDim; % dimension of subsampled output

% convDim x convDim x numFilters x numImages tensor for storing activations
activations = zeros(convDim,convDim,numFilters,numImages);

% outputDim x outputDim x numFilters x numImages tensor for storing
% subsampled activations
activationsPooled = zeros(outputDim,outputDim,numFilters,numImages);

%%% YOUR CODE HERE %%%
convolvedFeatures = cnnConvolve(filterDim, numFilters, images, Wc, bc);
activationsPooled = cnnPool(poolDim, convolvedFeatures);

% Reshape activations into 2-d matrix, hiddenSize x numImages,
% for Softmax layer
activationsPooled = reshape(activationsPooled,[],numImages);

%% Softmax Layer
%  Forward propagate the pooled activations calculated above into a
%  standard softmax layer. For your convenience we have reshaped
%  activationPooled into a hiddenSize x numImages matrix.  Store the
%  results in probs.

% numClasses x numImages for storing probability that each image belongs to
% each class.
probs = zeros(numClasses,numImages);

%%% YOUR CODE HERE %%%
%Wd=(numClasses,hiddenSize)
M = Wd*activationsPooled+repmat(bd,[1,numImages]);
M = bsxfun(@minus,M,max(M,[],1));
M = exp(M);
probs = bsxfun(@rdivide, M, sum(M));

The code calls cnnParamsToStack, which converts the unrolled parameter vector back into weight matrices and bias vectors.

function [Wc, Wd, bc, bd] = cnnParamsToStack(theta,imageDim,filterDim,...
                                 numFilters,poolDim,numClasses)
% Converts unrolled parameters for a single layer convolutional neural
% network followed by a softmax layer into structured weight
% tensors/matrices and corresponding biases
%
% Parameters:
%  theta      -  unrolled parameter vector
%  imageDim   -  height/width of image
%  filterDim  -  dimension of convolutional filter
%  numFilters -  number of convolutional filters
%  poolDim    -  dimension of pooling area
%  numClasses -  number of classes to predict
%
%
% Returns:
%  Wc      -  filterDim x filterDim x numFilters parameter matrix
%  Wd      -  numClasses x hiddenSize parameter matrix, hiddenSize is
%             calculated as numFilters*((imageDim-filterDim+1)/poolDim)^2
%  bc      -  bias for convolution layer of size numFilters x 1
%  bd      -  bias for dense layer of size numClasses x 1

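Just remember that c stands for convolution and d stands for dense.

Next, the two functions written in the previous exercise perform convolution and pooling: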

%%% YOUR CODE HERE %%%
convolvedFeatures = cnnConvolve(filterDim, numFilters, images, Wc, bc);
activationsPooled = cnnPool(poolDim, convolvedFeatures);

Then forward propagate through the fully connected layer and apply softmax:

%%% YOUR CODE HERE %%%
%Wd=(numClasses,hiddenSize)
M = Wd*activationsPooled+repmat(bd,[1,numImages]);
M = bsxfun(@minus,M,max(M,[],1));
M = exp(M);
probs = bsxfun(@rdivide, M, sum(M));

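Here, the line

M = bsxfun(@minus,M,max(M,[],1));

is a small trick to avoid floating-point overflow: subtracting each column's maximum before exponentiating leaves the softmax output unchanged.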
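The effect of subtracting the column maximum is easy to see on a toy example. The sketch below uses made-up scores (3 classes, 2 images) that are large enough to overflow exp in double precision, and compares the naive softmax with the shifted one. The variable names naive, naiveProbs and shifted are introduced here only for illustration.

```matlab
% Toy scores for 3 classes x 2 images; the first column overflows exp().
M = [1000 1; 1001 2; 1002 3];

% Naive softmax: exp(1000) is Inf, so the first column becomes Inf/Inf = NaN
naive = exp(M);
naiveProbs = bsxfun(@rdivide, naive, sum(naive, 1));

% Shifted softmax: the largest entry of each column becomes exp(0) = 1
shifted = exp(bsxfun(@minus, M, max(M, [], 1)));
probs = bsxfun(@rdivide, shifted, sum(shifted, 1));

disp(any(isnan(naiveProbs(:))));   % the naive version contains NaN
disp(sum(probs, 1));               % each column of probs sums to 1
```

Both columns of M differ only by a constant offset, so the shifted version gives them the same probabilities, which is exactly the invariance the trick relies on.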

Computing the Cost

The distribution of the true labels can be built quickly with MATLAB's sparse:

%%======================================================================
%% STEP 1b: Calculate Cost
%  In this step you will use the labels given as input and the probs
%  calculated above to evaluate the cross entropy objective.  Store your
%  results in cost.

cost = 0; % save objective into cost

%%% YOUR CODE HERE %%%
groundTruth = full(sparse(labels, 1:numImages, 1));
cost = -1./numImages*groundTruth(:)'*log(probs(:))+(lambda/2.)*(sum(Wd(:).^2)+sum(Wc(:).^2));

% Makes predictions given probs and returns without backpropagating errors.
if pred
    [~,preds] = max(probs,[],1);
    preds = preds';
    grad = 0;
    return;
end;

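This computes the regularized cross-entropy. Here $n$ is the number of images, $y$ is the one-hot true label, $a^L$ is the softmax output, and the last sum runs over the weights in $W_c$ and $W_d$:

$$ \begin{align} C = -\frac{1}{n} \sum_{x} \sum_{j} y_j \ln a^L_j + \frac{\lambda}{2} \sum_w w^2 \end{align} $$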
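To make the cost computation concrete, here is a small sketch on made-up numbers (3 classes, 4 images). It builds the one-hot matrix with sparse exactly as the code above does, evaluates the cross-entropy term without the weight decay penalty, and cross-checks it by picking out the true-class probabilities directly. The names idx and costCheck are assumptions introduced for this check.

```matlab
% Toy batch: 3 classes, 4 images (labels are 1-based, as in the exercise).
labels = [1; 3; 2; 3];
numImages = numel(labels);
probs = [0.7 0.1 0.2 0.1;
         0.2 0.2 0.5 0.3;
         0.1 0.7 0.3 0.6];           % each column sums to 1

% groundTruth(k,i) is 1 if image i has label k, and 0 otherwise
groundTruth = full(sparse(labels, 1:numImages, 1));

% Unregularized cross entropy: mean of -log(probability of the true class)
cost = -1/numImages * groundTruth(:)' * log(probs(:));

% Same value, computed by indexing the true-class probabilities directly
idx = sub2ind(size(probs), labels', 1:numImages);
costCheck = -mean(log(probs(idx)));

fprintf('cost = %f, check = %f\n', cost, costCheck);
```

Only the entries of log(probs) that line up with a 1 in groundTruth contribute, which is why the dot product and the indexed version agree.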

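Backpropagation

Compute the error $\delta_d$ of the cost function at the fully connected layer, then backpropagate it through the subsampling layer to the convolutional layer. MATLAB's kron function can be used to upsample the error.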

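Suppose 2×2 pooling is applied to a 4×4 image. During backpropagation the error arriving at the pooling layer is 2×2 and must be upsampled to 4×4. Because mean pooling is used, every unit in a pooling region contributes equally to the pooled unit, so the error is simply replicated across the region and averaged. For example, if the error at the pooling layer is

$$\text{delta} =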
\begin{pmatrix}
1 & 2 \\
3 & 4 \\
\end{pmatrix}$$

then calling kron(delta, ones(2,2)) multiplies each element of the 2×2 delta by the all-ones matrix, producing a 4×4 matrix:

$$
\text{kron} \left(
\begin{pmatrix}
1 & 2 \\
3 & 4 \\
\end{pmatrix}
,
\begin{pmatrix}
1 & 1 \\
1 & 1 \\
\end{pmatrix}
\right)
\rightarrow
\begin{pmatrix}
1 & 1 & 2 & 2 \\
1 & 1 & 2 & 2 \\
3 & 3 & 4 & 4 \\
3 & 3 & 4 & 4
\end{pmatrix}
$$

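After upsampling, all that remains is to divide the matrix by the size of the pooling region (poolDim^2):

% Upsample the incoming error using kron
delta_pool = (1/poolDim^2) * kron(delta,ones(poolDim));

Only with this scaling is the sum of the matrix elements the same before and after upsampling.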
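The upsampling step can be checked on the 2×2 example from the text. This sketch reuses the delta_pool expression shown above and confirms that the total error is preserved. The intermediate variable upsampled is introduced only for readability.

```matlab
poolDim = 2;
delta = [1 2; 3 4];                       % error arriving at the pooled layer

% Replicate each entry over its poolDim x poolDim pooling region
upsampled = kron(delta, ones(poolDim));

% Each input contributed 1/poolDim^2 of the mean, so scale accordingly
delta_pool = (1/poolDim^2) * upsampled;

disp(delta_pool);
fprintf('sum before = %g, sum after = %g\n', sum(delta(:)), sum(delta_pool(:)));
```

Without the 1/poolDim^2 factor the upsampled matrix would sum to four times the original error.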

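Gradient Calculation

The gradient of the fully connected layer follows from the standard gradient formulas for a fully connected network: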

$$ \begin{align} \nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T+\lambda W, \\ \nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}. \end{align} $$

% images--> convolvedFeatures--> activationsPooled--> probs
% Wd = (numClasses,hiddenSize)
% bd = (numClasses,1)
% Wc = (filterDim,filterDim,numFilters)
% bc = (numFilters,1)
% activationsPooled = zeros(outputDim,outputDim,numFilters,numImages);
% convolvedFeatures = (convDim,convDim,numFilters,numImages)
% images(imageDim,imageDim,numImages)
delta_d = -(groundTruth-probs); % error w.r.t. the softmax layer's pre-activation
Wd_grad = (1./numImages)*delta_d*activationsPooled'+lambda*Wd;
bd_grad = (1./numImages)*sum(delta_d,2);

To backpropagate the error from the fully connected layer to the subsampling layer, just multiply by $W_d^T$:

delta_s = Wd'*delta_d; % error at the pooling/subsampling layer's output
delta_s = reshape(delta_s,outputDim,outputDim,numFilters,numImages);

For the convolutional layer, the error must first be upsampled:

delta_c = zeros(convDim,convDim,numFilters,numImages);
for i=1:numImages
    for j=1:numFilters
        delta_c(:,:,j,i) = (1./poolDim^2)*kron(squeeze(delta_s(:,:,j,i)), ones(poolDim));
    end
end
delta_c = convolvedFeatures.*(1-convolvedFeatures).*delta_c;

The last line multiplies by the derivative of the sigmoid, $\textstyle f'(z^{(l)}_i) = a^{(l)}_i (1- a^{(l)}_i)$.
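The sigmoid derivative identity can be confirmed numerically. The sketch below compares a.*(1-a) against a central finite difference at a few arbitrary points. The step size h and the test points are illustrative choices.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

z = [-2 -0.5 0 0.5 2];          % arbitrary pre-activation values
a = sigmoid(z);
analytic = a .* (1 - a);        % derivative expressed through the activation

h = 1e-5;                       % finite-difference step
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2*h);

maxDiff = max(abs(analytic - numeric));
fprintf('largest difference: %g\n', maxDiff);
```

Because the derivative depends only on the activation, the stored convolvedFeatures are enough for the backward pass, with no need to keep the pre-activations.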

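Then, for a given filter, the weight gradient is the sum over all images of the convolution of the original image with that filter's error term at the convolutional layer: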

for i=1:numFilters
    Wc_i = zeros(filterDim,filterDim);
    for j=1:numImages
        Wc_i = Wc_i+conv2(squeeze(images(:,:,j)),rot90(squeeze(delta_c(:,:,i,j)),2),'valid');
    end
   % Wc_i = convn(images,rot180(squeeze(delta_c(:,:,i,:))),'valid');
    % add weight decay penalty
    Wc_grad(:,:,i) = (1./numImages)*Wc_i+lambda*Wc(:,:,i);

    bc_i = delta_c(:,:,i,:);
    bc_i = bc_i(:);
    bc_grad(i) = sum(bc_i)/numImages;
end

The bias gradient is the sum of all error terms for the given filter, divided by the number of training images.
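The conv2 expression for the filter gradient can be sanity-checked on a tiny case. The sketch below assumes the forward pass flips the filter before calling conv2 (as the UFLDL cnnConvolve starter code does, though that function is not shown in this post), ignores the bias and sigmoid, and uses a made-up scalar objective J whose derivative with respect to the convolution output is exactly delta. All values are arbitrary.

```matlab
image = magic(4);                    % 4x4 toy "image"
W = [0.1 -0.2; 0.3 0.4];             % 2x2 filter
delta = [1 0 -1; 2 1 0; 0 -1 1];     % error on the 3x3 convolution output

% Assumed forward pass: flip the filter, then 'valid' conv2
forward = @(W) conv2(image, rot90(W, 2), 'valid');
% Scalar objective whose gradient w.r.t. the output is delta
J = @(W) sum(sum(delta .* forward(W)));

% Gradient formula used in the text
gradW = conv2(image, rot90(delta, 2), 'valid');

% Central-difference gradient, one filter entry at a time
numGrad = zeros(size(W));
h = 1e-6;
for k = 1:numel(W)
    E = zeros(size(W));
    E(k) = h;
    numGrad(k) = (J(W + E) - J(W - E)) / (2*h);
end

disp(max(abs(gradW(:) - numGrad(:))));
```

If the forward pass did not flip the filter, the gradient formula would need a corresponding flip, so this check is tied to that assumption.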

Step 2: Gradient Check

Setting DEBUG to true checks the cost function and gradient using the computeNumericalGradient function.
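The idea behind a numerical gradient check can be sketched on a toy objective with a known gradient. This is not the computeNumericalGradient implementation from the exercise, just a minimal central-difference version. The objective J, the matrices A and b, and the step epsilon are all made up for illustration.

```matlab
% Toy objective with a known gradient: J(theta) = 0.5*||A*theta - b||^2
A = [1 2; 3 4; 5 6];
b = [1; 0; -1];
J = @(theta) 0.5 * sum((A*theta - b).^2);
analyticGrad = @(theta) A' * (A*theta - b);

theta = [0.3; -0.2];
epsilon = 1e-4;
numGrad = zeros(size(theta));
for k = 1:numel(theta)
    e = zeros(size(theta));
    e(k) = epsilon;                              % perturb one coordinate
    numGrad(k) = (J(theta + e) - J(theta - e)) / (2*epsilon);
end

% Relative error between numerical and analytic gradients
g = analyticGrad(theta);
relErr = norm(numGrad - g) / norm(numGrad + g);
fprintf('relative error: %g\n', relErr);
```

For the CNN the same loop would perturb each entry of theta and call the cost function twice, which is slow, so such checks are normally run on a small network and a handful of images.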

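Step 3: Learn Parameters

Training uses stochastic gradient descent with momentum. The learning rate follows a simple heuristic schedule: it is halved after every epoch.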

function [opttheta] = minFuncSGD(funObj,theta,data,labels, options)
% Runs stochastic gradient descent with momentum to optimize the
% parameters for the given objective.
%
% Parameters:
%  funObj     -  function handle which accepts as input theta,
%                data, labels and returns cost and gradient w.r.t.
%                theta.
%  theta      -  unrolled parameter vector
%  data       -  stores data in m x n x numExamples tensor
%  labels     -  corresponding labels in numExamples x 1 vector
%  options    -  struct to store specific options for optimization
%
% Returns:
%  opttheta   -  optimized parameter vector
%
% Options (* required)
%  epochs*     - number of epochs through data
%  alpha*      - initial learning rate
%  minibatch*  - size of minibatch
%  momentum    - momentum constant, defaults to 0.9


%%======================================================================
%% Setup
assert(all(isfield(options,{'epochs','alpha','minibatch'})), 'Some options not defined');
if ~isfield(options,'momentum')
    options.momentum = 0.9;
end;
epochs = options.epochs;
alpha = options.alpha;
minibatch = options.minibatch;
m = length(labels); % training set size
% Setup for momentum
mom = 0.5;
momIncrease = 20;
velocity = zeros(size(theta));

%%======================================================================
%% SGD loop
it = 0;
for e = 1:epochs

    % randomly permute indices of data for quick minibatch sampling
    rp = randperm(m);

    for s = 1:minibatch:(m-minibatch+1)
        it = it + 1;

        % increase momentum after momIncrease iterations
        if it == momIncrease
            mom = options.momentum;
        end;

        % get next randomly selected minibatch
        mb_data = data(:,:,rp(s:s+minibatch-1));
        mb_labels = labels(rp(s:s+minibatch-1));

        % evaluate the objective function on the next minibatch
        [cost grad] = funObj(theta,mb_data,mb_labels);

        % Instructions: Add in the weighted velocity vector to the
        % gradient evaluated above scaled by the learning rate.
        % Then update the current weights theta according to the
        % sgd update rule

        %%% YOUR CODE HERE %%%
        velocity = velocity* mom + alpha*grad;
        theta = theta - velocity;

        fprintf('Epoch %d: Cost on iteration %d is %f\n',e,it,cost);
    end;

    % anneal learning rate by factor of two after each epoch
    alpha = alpha/2.0;

end;

opttheta = theta;

end

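Nothing much to add here. The update is a simple formula, where $\gamma$ is the momentum constant (mom in the code) and $\alpha$ is the learning rate: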

$$\begin{align}v &= \gamma v+ \alpha \nabla_{\theta} J(\theta; x^{(i)},y^{(i)}) \\\theta &= \theta -
v\end{align}$$
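To see the update rule in action without the CNN, the sketch below applies the same two lines to a made-up quadratic objective, $J(\theta) = \frac{1}{2}\theta^T\theta$, whose gradient is simply $\theta$. The learning rate, momentum value and iteration count are illustrative choices, not the settings used in the exercise.

```matlab
% Minimize J(theta) = 0.5*theta'*theta, whose gradient is theta itself.
theta = [4; -3];
velocity = zeros(size(theta));
alpha = 0.1;    % learning rate (illustrative value)
mom = 0.9;      % momentum constant gamma

for it = 1:200
    grad = theta;                               % gradient of the toy objective
    velocity = mom * velocity + alpha * grad;   % v = gamma*v + alpha*grad
    theta = theta - velocity;                   % theta = theta - v
end

finalNorm = norm(theta);
fprintf('norm of theta after 200 iterations: %g\n', finalNorm);
```

The iterates oscillate around the minimum at zero while shrinking, which is the typical behaviour of momentum on a smooth bowl-shaped objective.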

Step 4: Test

Final result:

Epoch 3: Cost on iteration 700 is 0.130076
Epoch 3: Cost on iteration 701 is 0.130717
Epoch 3: Cost on iteration 702 is 0.108138
Accuracy is 0.985000
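An accuracy figure like the one above is the fraction of test images whose predicted class matches the label. The sketch below shows that computation on made-up probabilities, using the same max-based prediction as the pred branch of cnnCost. The test script itself is not shown in this post, so the exact accuracy line here is an assumption.

```matlab
% Toy softmax outputs: 3 classes x 4 images, with made-up labels
probs = [0.7 0.1 0.2 0.1;
         0.2 0.2 0.5 0.3;
         0.1 0.7 0.3 0.6];
labels = [1; 3; 2; 2];

% Predicted class is the row index of the largest probability per column
[~, preds] = max(probs, [], 1);
preds = preds';

% Fraction of predictions that match the labels
acc = sum(preds == labels) / length(preds);
fprintf('Accuracy is %f\n', acc);
```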

Reference

https://github.com/hankcs/stanford_dl_ex

http://www.cs.ucf.edu/~mtappen/cap5415/lecs/lec1.pdf

Some Chinese terminology follows https://github.com/ysh329/Chinese-UFLDL-Tutorial

Creative Commons Attribution-NonCommercial-ShareAlike: 码农场 » Stochastic Gradient Descent and Convolutional Neural Networks
