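Exercise: Convolutional Neural Network
Overview
In this exercise we implement a convolutional neural network for handwritten digit recognition. The network consists of one convolutional layer with pooling, followed by a fully connected softmax layer. The pooling layer uses mean pooling. Training uses backpropagation and stochastic gradient descent with momentum.
Step 0: Initialize parameters and load data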
    %% STEP 0: Initialize Parameters and Load Data
    %  Here we initialize some parameters used for the exercise.

    % Configuration
    imageDim = 28;
    numClasses = 10;  % Number of classes (MNIST images fall into 10 classes)
    filterDim = 9;    % Filter size for conv layer
    numFilters = 20;  % Number of filters for conv layer
    poolDim = 2;      % Pooling dimension, (should divide imageDim-filterDim+1)

    % Load MNIST Train
    addpath ../common/;
    images = loadMNISTImages('../common/train-images-idx3-ubyte');
    images = reshape(images,imageDim,imageDim,[]);
    labels = loadMNISTLabels('../common/train-labels-idx1-ubyte');
    labels(labels==0) = 10; % Remap 0 to 10

    % Initialize Parameters
    theta = cnnInitParams(imageDim,filterDim,numFilters,poolDim,numClasses);
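As the configuration shows, there are 20 filters of size 9×9 and the pooling region is 2×2. The data is loaded the same way as in the previous exercise.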
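The layer sizes implied by this configuration can be worked out by hand. The following is a minimal sketch that only repeats the arithmetic used later in cnnCost; the variable hiddenSize is named after the comment in cnnParamsToStack and is not set by the configuration itself.

```matlab
% Sizes implied by the STEP 0 configuration (values copied from above).
imageDim = 28; filterDim = 9; numFilters = 20; poolDim = 2;

convDim = imageDim - filterDim + 1;        % 'valid' convolution output: 20
assert(mod(convDim, poolDim) == 0, 'poolDim must divide convDim');
outputDim = convDim / poolDim;             % pooled feature map side: 10
hiddenSize = outputDim^2 * numFilters;     % inputs to the softmax layer: 2000

fprintf('convDim=%d outputDim=%d hiddenSize=%d\n', convDim, outputDim, hiddenSize);
```

Each 28×28 image therefore becomes twenty 20×20 feature maps, which pool down to twenty 10×10 maps, or 2000 inputs to the softmax layer.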
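Step 1: Implement the CNN cost function
The network has two layers: a convolutional layer with mean pooling and a fully connected softmax layer. The cost is the cross-entropy between the predicted distribution and the true distribution over the 10 classes.
Forward propagation
Apply every filter to every image (store the activations, since backpropagation needs them later), then average the responses over each pooling region to complete the pooling step; this part is similar to the previous exercise. The code speaks for itself: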
    function [cost, grad, preds] = cnnCost(theta,images,labels,numClasses,...
                                    filterDim,numFilters,poolDim,pred)
    % Calculate cost and gradient for a single layer convolutional
    % neural network followed by a softmax layer with cross entropy
    % objective.
    %
    % Parameters:
    %  theta      -  unrolled parameter vector
    %  images     -  stores images in imageDim x imageDim x numImages
    %                array
    %  labels     -  numImages x 1 vector of class labels (1..numClasses)
    %  numClasses -  number of classes to predict
    %  filterDim  -  dimension of convolutional filter
    %  numFilters -  number of convolutional filters
    %  poolDim    -  dimension of pooling area
    %  pred       -  boolean only forward propagate and return
    %                predictions
    %
    % Returns:
    %  cost       -  cross entropy cost
    %  grad       -  gradient with respect to theta (if pred==False)
    %  preds      -  list of predictions for each example (if pred==True)

    if ~exist('pred','var')
        pred = false;
    end;

    imageDim = size(images,1); % height/width of image
    numImages = size(images,3); % number of images
    lambda = 3e-3; % weight decay parameter

    %% Reshape parameters and setup gradient matrices

    % Wc is filterDim x filterDim x numFilters parameter matrix
    % bc is the corresponding bias

    % Wd is numClasses x hiddenSize parameter matrix where hiddenSize
    % is the number of output units from the convolutional layer
    % bd is corresponding bias
    [Wc, Wd, bc, bd] = cnnParamsToStack(theta,imageDim,filterDim,numFilters,...
                            poolDim,numClasses);

    % Same sizes as Wc,Wd,bc,bd. Used to hold gradient w.r.t above params.
    Wc_grad = zeros(size(Wc));
    Wd_grad = zeros(size(Wd));
    bc_grad = zeros(size(bc));
    bd_grad = zeros(size(bd));

    %%======================================================================
    %% STEP 1a: Forward Propagation
    %  In this step you will forward propagate the input through the
    %  convolutional and subsampling (mean pooling) layers.  You will then use
    %  the responses from the convolution and pooling layer as the input to a
    %  standard softmax layer.

    %% Convolutional Layer
    %  For each image and each filter, convolve the image with the filter, add
    %  the bias and apply the sigmoid nonlinearity.  Then subsample the
    %  convolved activations with mean pooling.  Store the results of the
    %  convolution in activations and the results of the pooling in
    %  activationsPooled.  You will need to save the convolved activations for
    %  backpropagation.
    convDim = imageDim-filterDim+1; % dimension of convolved output
    outputDim = (convDim)/poolDim; % dimension of subsampled output

    % convDim x convDim x numFilters x numImages tensor for storing activations
    activations = zeros(convDim,convDim,numFilters,numImages);

    % outputDim x outputDim x numFilters x numImages tensor for storing
    % subsampled activations
    activationsPooled = zeros(outputDim,outputDim,numFilters,numImages);

    %%% YOUR CODE HERE %%%
    convolvedFeatures = cnnConvolve(filterDim, numFilters, images, Wc, bc);
    activationsPooled = cnnPool(poolDim, convolvedFeatures);

    % Reshape activations into 2-d matrix, hiddenSize x numImages,
    % for Softmax layer
    activationsPooled = reshape(activationsPooled,[],numImages);

    %% Softmax Layer
    %  Forward propagate the pooled activations calculated above into a
    %  standard softmax layer. For your convenience we have reshaped
    %  activationPooled into a hiddenSize x numImages matrix.  Store the
    %  results in probs.

    % numClasses x numImages for storing probability that each image belongs to
    % each class.
    probs = zeros(numClasses,numImages);

    %%% YOUR CODE HERE %%%
    % Wd = (numClasses,hiddenSize)
    M = Wd*activationsPooled+repmat(bd,[1,numImages]);
    M = bsxfun(@minus,M,max(M,[],1));
    M = exp(M);
    probs = bsxfun(@rdivide, M, sum(M,1));
The code calls cnnParamsToStack, which restores the unrolled parameter vector into weight matrices and bias vectors.
    function [Wc, Wd, bc, bd] = cnnParamsToStack(theta,imageDim,filterDim,...
                                     numFilters,poolDim,numClasses)
    % Converts unrolled parameters for a single layer convolutional neural
    % network followed by a softmax layer into structured weight
    % tensors/matrices and corresponding biases
    %
    % Parameters:
    %  theta      -  unrolled parameter vector
    %  imageDim   -  height/width of image
    %  filterDim  -  dimension of convolutional filter
    %  numFilters -  number of convolutional filters
    %  poolDim    -  dimension of pooling area
    %  numClasses -  number of classes to predict
    %
    % Returns:
    %  Wc      -  filterDim x filterDim x numFilters parameter matrix
    %  Wd      -  numClasses x hiddenSize parameter matrix, hiddenSize is
    %             calculated as numFilters*((imageDim-filterDim+1)/poolDim)^2
    %  bc      -  bias for convolution layer of size numFilters x 1
    %  bd      -  bias for dense layer of size numClasses x 1
Just remember that c stands for convolution and d stands for dense.
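To make the idea of unrolling concrete, here is a minimal round-trip sketch on toy sizes. The ordering [Wc; Wd; bc; bd] inside theta is an assumption made for illustration, as are the toy dimensions and the names Wc2, Wd2, bc2, bd2; the body of cnnParamsToStack is not shown in this post.

```matlab
% Toy sizes; the ordering [Wc; Wd; bc; bd] is assumed for illustration only.
filterDim = 3; numFilters = 2; hiddenSize = 8; numClasses = 4;

Wc = rand(filterDim, filterDim, numFilters);
Wd = rand(numClasses, hiddenSize);
bc = rand(numFilters, 1);
bd = rand(numClasses, 1);

% Unroll everything into a single column vector.
theta = [Wc(:); Wd(:); bc(:); bd(:)];

% Restore each block by slicing theta and reshaping.
indS = 1; indE = numel(Wc);
Wc2 = reshape(theta(indS:indE), filterDim, filterDim, numFilters);
indS = indE + 1; indE = indE + numel(Wd);
Wd2 = reshape(theta(indS:indE), numClasses, hiddenSize);
indS = indE + 1; indE = indE + numFilters;
bc2 = theta(indS:indE);
bd2 = theta(indE+1:end);
```

Keeping all parameters in one vector lets a generic optimizer such as minFuncSGD update them without knowing the network structure.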
Next, the two functions written in the previous exercise perform the convolution and the pooling:
    %%% YOUR CODE HERE %%%
    convolvedFeatures = cnnConvolve(filterDim, numFilters, images, Wc, bc);
    activationsPooled = cnnPool(poolDim, convolvedFeatures);
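For readers who do not have the previous exercise at hand, mean pooling of a single feature map can be sketched as follows. This is one possible way to do it (average with conv2, then keep every poolDim-th entry) and is not necessarily how cnnPool is written; the 4×4 feature map is a made-up toy input.

```matlab
% Mean pooling of one toy feature map over non-overlapping 2x2 regions.
poolDim = 2;
feature = [ 1  2  3  4;
            5  6  7  8;
            9 10 11 12;
           13 14 15 16];

% Averaging kernel: every 'valid' output is the mean of a poolDim x poolDim patch.
avg = conv2(feature, ones(poolDim) / poolDim^2, 'valid');

% Keep only the patches that start at multiples of poolDim (no overlap).
pooled = avg(1:poolDim:end, 1:poolDim:end);
disp(pooled);
```

The result is a 2×2 map whose entries are the means of the four 2×2 blocks of the input.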
Then forward propagate into the fully connected layer and apply softmax:
    %%% YOUR CODE HERE %%%
    % Wd = (numClasses,hiddenSize)
    M = Wd*activationsPooled+repmat(bd,[1,numImages]);
    M = bsxfun(@minus,M,max(M,[],1));
    M = exp(M);
    probs = bsxfun(@rdivide, M, sum(M,1));
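The line

    M = bsxfun(@minus,M,max(M,[],1));

is a small trick to avoid floating-point overflow: subtracting each column's maximum before exponentiating leaves the softmax output unchanged.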
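A small experiment shows why the shift matters. The score matrix below is a made-up toy input whose first column is large enough that exp overflows to Inf in double precision; the names naive, shifted and E are illustrative.

```matlab
% Columns are examples, rows are class scores (toy values).
M = [1000 1;
     1001 2;
     1002 3];

% Naive softmax: exp(1000) overflows to Inf, so the first column is Inf/Inf = NaN.
naive = exp(M) ./ repmat(sum(exp(M), 1), size(M, 1), 1);

% Shifted softmax: subtract each column's maximum first.
shifted = bsxfun(@minus, M, max(M, [], 1));
E = exp(shifted);
probs = bsxfun(@rdivide, E, sum(E, 1));

disp(naive);
disp(probs);
```

Both columns of probs come out identical, because the two columns of M differ only by a constant offset and softmax is invariant to such an offset.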
Computing the cost
The one-hot distribution of the true labels can be built quickly with MATLAB's sparse:
    %%======================================================================
    %% STEP 1b: Calculate Cost
    %  In this step you will use the labels given as input and the probs
    %  calculated above to evaluate the cross entropy objective.  Store your
    %  results in cost.

    cost = 0; % save objective into cost

    %%% YOUR CODE HERE %%%
    % Pass the size explicitly so groundTruth is numClasses x numImages even
    % when the largest label does not occur in this batch.
    groundTruth = full(sparse(labels, 1:numImages, 1, numClasses, numImages));
    cost = -1./numImages*groundTruth(:)'*log(probs(:))+(lambda/2.)*(sum(Wd(:).^2)+sum(Wc(:).^2));

    % Makes predictions given probs and returns without backpropagating errors.
    if pred
        [~,preds] = max(probs,[],1);
        preds = preds';
        grad = 0;
        return;
    end;
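This computes the regularized cross-entropy over $m$ images and $K$ classes, where $p^{(i)}_k$ is the predicted probability probs(k,i) and $\lambda$ is the weight decay parameter:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y^{(i)}=k\} \ln p^{(i)}_k + \frac{\lambda}{2} \left( \sum_{w \in W_c} w^2 + \sum_{w \in W_d} w^2 \right)$$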
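The data term of the cost can be checked on a tiny example. The labels and probabilities below are invented toy values, and dataCost, dataCost2 and idx are illustrative names; the sketch passes the matrix size to sparse explicitly so that groundTruth always has numClasses rows.

```matlab
% Toy batch: 4 images, 3 classes (all values invented for illustration).
numClasses = 3;
labels = [1; 3; 2; 3];
numImages = numel(labels);

% One-hot matrix: groundTruth(k,i) = 1 exactly when image i has label k.
groundTruth = full(sparse(labels, 1:numImages, 1, numClasses, numImages));

% Predicted distribution for each image (each column sums to 1).
probs = [0.7 0.1 0.2 0.1;
         0.2 0.1 0.6 0.1;
         0.1 0.8 0.2 0.8];

% Cross-entropy without the weight decay term.
dataCost = -(1 / numImages) * sum(sum(groundTruth .* log(probs)));

% Same value: average negative log-probability assigned to the true class.
idx = sub2ind(size(probs), labels', 1:numImages);
dataCost2 = -mean(log(probs(idx)));
```

Because groundTruth is one-hot, the double sum only picks out the log-probability of the correct class for each image.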
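Backpropagation
First compute the error $\delta_d$ of the cost at the fully connected layer, then propagate that error back through the subsampling layer to the convolutional layer. MATLAB's kron function can be used to upsample the error.
Suppose 2×2 pooling is applied to a 4×4 image. The error arriving at the pooling layer during backpropagation is then 2×2 and must be upsampled to 4×4. Because mean pooling is used, every unit in a pooling region contributes equally to the pooled output, so the error is simply copied to every unit of the region and averaged. For example, when the error at the pooling layer is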
$$\delta =
\begin{pmatrix}
1 & 2 \\
3 & 4 \\
\end{pmatrix}$$
calling kron(delta, ones(2,2)) multiplies each element of the 2×2 delta by the all-ones matrix and yields a 4×4 matrix:
$$
\text{kron} \left(
\begin{pmatrix}
1 & 2 \\
3 & 4 \\
\end{pmatrix}
,
\begin{pmatrix}
1 & 1 \\
1 & 1 \\
\end{pmatrix}
\right)
\rightarrow
\begin{pmatrix}
1 & 1 & 2 & 2 \\
1 & 1 & 2 & 2 \\
3 & 3 & 4 & 4 \\
3 & 3 & 4 & 4
\end{pmatrix}
$$
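After upsampling, what remains is to divide the matrix by the number of units in the pooling region, poolDim^2:

    % Upsample the incoming error using kron
    delta_pool = (1/poolDim^2) * kron(delta,ones(poolDim));

Only then is the sum of the matrix elements the same before and after upsampling.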
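The sum-preserving property is easy to verify numerically. This minimal sketch reuses the 2×2 delta from the example; the intermediate name up is illustrative.

```matlab
% Upsample a 2x2 pooled error to 4x4 and spread it evenly over each region.
poolDim = 2;
delta = [1 2;
         3 4];

up = kron(delta, ones(poolDim));   % each entry copied into a 2x2 block
delta_pool = up / poolDim^2;       % each unit receives an equal share

fprintf('sum before: %g, sum after: %g\n', sum(delta(:)), sum(delta_pool(:)));
```

Without the division, the total error would be multiplied by poolDim^2, since every entry is copied poolDim^2 times.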
Gradient computation
The gradients of the fully connected layer follow the standard formulas for a fully connected network:
$$ \begin{align} \nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T+\lambda W^{(l)}, \\ \nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}. \end{align} $$
    % images--> convolvedFeatures--> activationsPooled--> probs
    % Wd = (numClasses,hiddenSize)
    % bd = (numClasses,1)
    % Wc = (filterDim,filterDim,numFilters)
    % bc = (numFilters,1)
    % activationsPooled = zeros(outputDim,outputDim,numFilters,numImages);
    % convolvedFeatures = (convDim,convDim,numFilters,numImages)
    % images(imageDim,imageDim,numImages)

    delta_d = -(groundTruth-probs); % error at the softmax layer's pre-activation
    Wd_grad = (1./numImages)*delta_d*activationsPooled'+lambda*Wd;
    bd_grad = (1./numImages)*sum(delta_d,2);
Propagating the error of the fully connected layer back to the subsampling layer only requires multiplying by $W_d^T$:
    delta_s = Wd'*delta_d; % the pooling/sample layer's preactivation
    delta_s = reshape(delta_s,outputDim,outputDim,numFilters,numImages);
For the convolutional layer, the error must be upsampled first:
    delta_c = zeros(convDim,convDim,numFilters,numImages);
    for i=1:numImages
        for j=1:numFilters
            delta_c(:,:,j,i) = (1./poolDim^2)*kron(squeeze(delta_s(:,:,j,i)), ones(poolDim));
        end
    end
    delta_c = convolvedFeatures.*(1-convolvedFeatures).*delta_c;
The last line multiplies by the derivative of the sigmoid, $\textstyle f'(z^{(l)}_i) = a^{(l)}_i (1- a^{(l)}_i)$.
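The identity for the sigmoid derivative can be confirmed with a central finite difference. This is a minimal sketch; the sample points z and the step h are arbitrary choices, and sigmoid is defined locally rather than taken from the exercise code.

```matlab
% Compare a.*(1-a) against a central-difference estimate of the derivative.
sigmoid = @(z) 1 ./ (1 + exp(-z));

z = linspace(-3, 3, 7);
a = sigmoid(z);
analytic = a .* (1 - a);

h = 1e-5;
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h);

maxErr = max(abs(analytic - numeric));
fprintf('largest difference: %g\n', maxErr);
```

Expressing the derivative through the activation a is what allows the backward pass to reuse convolvedFeatures instead of recomputing the pre-activations.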
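Then, for a given filter, the weight gradient is the sum over all images of the convolution of the input image with the error term of the convolutional layer: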
    for i=1:numFilters
        Wc_i = zeros(filterDim,filterDim);
        for j=1:numImages
            Wc_i = Wc_i+conv2(squeeze(images(:,:,j)),rot90(squeeze(delta_c(:,:,i,j)),2),'valid');
        end
        % Wc_i = convn(images,rot180(squeeze(delta_c(:,:,i,:))),'valid');
        % add penalize
        Wc_grad(:,:,i) = (1./numImages)*Wc_i+lambda*Wc(:,:,i);

        bc_i = delta_c(:,:,i,:);
        bc_i = bc_i(:);
        bc_grad(i) = sum(bc_i)/numImages;
    end
The bias gradient is the sum of all error terms of the given filter, divided by the number of training images.
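The rot90(...,2) inside conv2 can look mysterious. conv2 flips its kernel, so flipping the error first turns the call into a plain sliding sum of products, which is what the weight gradient requires. The sketch below checks this on one toy image and one toy error map; both are invented, and it assumes the forward pass computes a sliding sum of products between filter and image.

```matlab
% One toy 5x5 image and a 3x3 error map for a single filter.
image = magic(5);
delta = [1 2 3;
         4 5 6;
         7 8 9];
filterDim = size(image, 1) - size(delta, 1) + 1;   % 3

% Gradient as computed in the loop above (single image, no averaging).
gradConv = conv2(image, rot90(delta, 2), 'valid');

% Explicit version: each weight (u,v) collects the error times the pixel it touched.
gradLoop = zeros(filterDim);
for u = 1:filterDim
    for v = 1:filterDim
        patch = image(u:u+size(delta, 1)-1, v:v+size(delta, 2)-1);
        gradLoop(u, v) = sum(sum(patch .* delta));
    end
end

% The bias touches every output unit, so its gradient is the plain sum.
biasGrad = sum(delta(:));
```

The two weight gradients agree entry by entry, which is the single-image version of the accumulation performed over j in the loop.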
Step 2: Check the gradient
Setting DEBUG to true checks the cost function and gradient against a numerical estimate computed by computeNumericalGradient.
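The idea behind such a check can be sketched with central differences on a toy objective whose gradient is known in closed form. This is an illustration of the technique, not the code of computeNumericalGradient; the quadratic J, the matrix A and the step epsilon are assumptions chosen for the example.

```matlab
% Toy objective J(theta) = 0.5*theta'*A*theta with known gradient A*theta.
A = [2 0.5;
     0.5 1];
J = @(theta) 0.5 * theta' * A * theta;

theta = [1; -2];
analyticGrad = A * theta;

% Perturb one coordinate at a time and take a central difference.
epsilon = 1e-4;
numGrad = zeros(size(theta));
for i = 1:numel(theta)
    e = zeros(size(theta));
    e(i) = epsilon;
    numGrad(i) = (J(theta + e) - J(theta - e)) / (2 * epsilon);
end

% A very small relative difference suggests the analytic gradient is right.
relDiff = norm(numGrad - analyticGrad) / norm(numGrad + analyticGrad);
fprintf('relative difference: %g\n', relDiff);
```

Since every parameter needs two cost evaluations, a check like this is practical only on a small network and a handful of images.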
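Step 3: Learn the parameters
Training uses stochastic gradient descent with momentum. The learning rate follows a simple heuristic schedule: it is halved after each epoch.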
    function [opttheta] = minFuncSGD(funObj,theta,data,labels,...
                            options)
    % Runs stochastic gradient descent with momentum to optimize the
    % parameters for the given objective.
    %
    % Parameters:
    %  funObj     -  function handle which accepts as input theta,
    %                data, labels and returns cost and gradient w.r.t
    %                to theta.
    %  theta      -  unrolled parameter vector
    %  data       -  stores data in m x n x numExamples tensor
    %  labels     -  corresponding labels in numExamples x 1 vector
    %  options    -  struct to store specific options for optimization
    %
    % Returns:
    %  opttheta   -  optimized parameter vector
    %
    % Options (* required)
    %  epochs*     - number of epochs through data
    %  alpha*      - initial learning rate
    %  minibatch*  - size of minibatch
    %  momentum    - momentum constant, defaults to 0.9

    %%======================================================================
    %% Setup
    assert(all(isfield(options,{'epochs','alpha','minibatch'})),...
            'Some options not defined');
    if ~isfield(options,'momentum')
        options.momentum = 0.9;
    end;
    epochs = options.epochs;
    alpha = options.alpha;
    minibatch = options.minibatch;
    m = length(labels); % training set size

    % Setup for momentum
    mom = 0.5;
    momIncrease = 20;
    velocity = zeros(size(theta));

    %%======================================================================
    %% SGD loop
    it = 0;
    for e = 1:epochs

        % randomly permute indices of data for quick minibatch sampling
        rp = randperm(m);

        for s = 1:minibatch:(m-minibatch+1)
            it = it + 1;

            % increase momentum after momIncrease iterations
            if it == momIncrease
                mom = options.momentum;
            end;

            % get next randomly selected minibatch
            mb_data = data(:,:,rp(s:s+minibatch-1));
            mb_labels = labels(rp(s:s+minibatch-1));

            % evaluate the objective function on the next minibatch
            [cost, grad] = funObj(theta,mb_data,mb_labels);

            % Instructions: Add in the weighted velocity vector to the
            % gradient evaluated above scaled by the learning rate.
            % Then update the current weights theta according to the
            % sgd update rule

            %%% YOUR CODE HERE %%%
            velocity = velocity*mom + alpha*grad;
            theta = theta - velocity;

            fprintf('Epoch %d: Cost on iteration %d is %f\n',e,it,cost);
        end;

        % anneal learning rate by factor of two after each epoch
        alpha = alpha/2.0;

    end;

    opttheta = theta;

    end
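The update is the standard momentum rule, where $\gamma$ is the momentum constant and $\alpha$ the learning rate:
$$\begin{align}
v &= \gamma v+ \alpha \nabla_{\theta} J(\theta; x^{(i)},y^{(i)}) \\
\theta &= \theta - v
\end{align}$$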
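The two-line update is easiest to understand on a toy problem. The sketch below applies it to the quadratic J(theta) = 0.5*||theta||^2, whose gradient is theta itself; the objective, the starting point, the constants and the iteration count are all assumptions chosen for illustration, and the full gradient is used rather than a minibatch estimate.

```matlab
% Momentum update on a toy quadratic with minimum at the origin.
theta = [4; -3];
velocity = zeros(size(theta));
alpha = 0.1;   % learning rate
mom = 0.9;     % momentum constant (gamma in the formula)

for it = 1:300
    grad = theta;                              % gradient of 0.5*||theta||^2
    velocity = mom * velocity + alpha * grad;  % accumulate a decaying average of gradients
    theta = theta - velocity;                  % step along the velocity
end

fprintf('distance from the minimum: %g\n', norm(theta));
```

The velocity keeps part of the previous step, so consistent gradient directions build up speed while oscillating ones partly cancel.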
Step 4: Test
Final result

    Epoch 3: Cost on iteration 700 is 0.130076
    Epoch 3: Cost on iteration 701 is 0.130717
    Epoch 3: Cost on iteration 702 is 0.108138
    Accuracy is 0.985000
Reference
https://github.com/hankcs/stanford_dl_ex
http://www.cs.ucf.edu/~mtappen/cap5415/lecs/lec1.pdf