The task in this assignment is to design a neural network language model that, given the preceding three words, predicts the fourth. Training the model yields dense word representations. Most of the code is already written; the handful of tricky spots are posed as multiple-choice questions.
Dataset
The vocabulary contains 250 words. The training set consists of roughly 372,500 4-grams (3725 mini-batches of 100, as we will see below), and the validation and test sets hold about 50,000 each. The data is extracted from very simple sentences:
No , he says now . And what did he do ? The money 's there . That was less than a year ago . But he made only the first .
Warm-up
Loading data.mat gives a struct data with four fields:
>> fieldnames(data)
ans =
  4×1 cell array
    'testData'
    'trainData'
    'validData'
    'vocab'
data.vocab is the vocabulary itself, i.e. the 250 words.
The other three fields are 4 × N matrices whose entries are ids into the vocabulary. The datasets are loaded with:
[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);
The function automatically assigns the first three words of each 4-gram to x and the fourth to t, forming (x, t) training cases. The argument 100 is the mini-batch size; after loading, the training data is split into 3725 mini-batches of 100 cases each.
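Conceptually, the splitting and batching inside load_data amounts to something like the sketch below; raw is a stand-in name for data.trainData, not the function's actual internals.

batchsize = 100;
raw = data.trainData;                          % 4 x N matrix of training 4-grams (word ids)
D = size(raw, 1) - 1;                          % context length, here 3
M = floor(size(raw, 2) / batchsize);           % number of full mini-batches (3725)
train_input  = reshape(raw(1:D, 1:batchsize * M), D, batchsize, M);    % first three words
train_target = reshape(raw(D + 1, 1:batchsize * M), 1, batchsize, M);  % fourth word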
Training
Training is started with:
model = train(1);
The training framework is already in place; the few key steps are posed as multiple-choice questions with four options each, so you only have to uncomment the correct one. The default hyperparameters in the function are:
% SET HYPERPARAMETERS HERE.
batchsize = 100;       % Mini-batch size.
learning_rate = 0.1;   % Learning rate; default = 0.1.
momentum = 0.5;        % Momentum; default = 0.9.
numhid1 = 50;          % Dimensionality of embedding space; default = 50.
numhid2 = 200;         % Number of units in hidden layer; default = 200.
init_wt = 0.01;        % Standard deviation of the normal distribution
                       % which is sampled to get the initial weights; default = 0.01
So the word embeddings are 50-dimensional and the hidden layer has 200 units. The input layer is the 3-word context window (numwords = 3), whose looked-up embeddings are concatenated into a 3 × 50 = 150-dimensional vector.
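With vocab_size = 250 and numwords = 3, these hyperparameters imply the following parameter shapes; the initialization below is a sketch of what the training script does, not quoted from it verbatim.

word_embedding_weights = init_wt * randn(vocab_size, numhid1);          % 250 x 50
embed_to_hid_weights   = init_wt * randn(numwords * numhid1, numhid2);  % 150 x 200
hid_to_output_weights  = init_wt * randn(numhid2, vocab_size);          % 200 x 250
hid_bias    = zeros(numhid2, 1);                                        % 200 x 1
output_bias = zeros(vocab_size, 1);                                     % 250 x 1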
Forward propagation
Each epoch trains on every mini-batch in turn; the first step for a batch is a forward pass:
% LOOP OVER MINI-BATCHES.
for m = 1:numbatches
  input_batch = train_input(:, :, m);
  target_batch = train_target(:, :, m);

  % FORWARD PROPAGATE.
  % Compute the state of each layer in the network given the input batch
  % and all weights and biases
  [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
    fprop(input_batch, ...
          word_embedding_weights, embed_to_hid_weights, ...
          hid_to_output_weights, hid_bias, output_bias);
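Before reaching the multiple-choice lines, fprop computes the embedding layer state as a table lookup followed by a linear map into the hidden layer, roughly like this (a sketch of the idea, not fprop.m verbatim):

% Look up the embedding of every word in the batch and stack the three context
% embeddings of each training case into one numhid1 * numwords = 150-dim column.
embedding_layer_state = reshape( ...
  word_embedding_weights(reshape(input_batch, 1, []), :)', ...
  numhid1 * numwords, []);
% Linear map into the hidden layer (pre-activation).
inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
  repmat(hid_bias, 1, batchsize);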
Logistic activation function
Two multiple-choice questions are left open in the forward pass. The first is the logistic activation function:
% Apply logistic activation function.
% FILL IN CODE. Replace the line below by one of the options.
hidden_layer_state = zeros(numhid2, batchsize);
% Options
% (a) hidden_layer_state = 1 ./ (1 + exp(inputs_to_hidden_units));
% (b) hidden_layer_state = 1 ./ (1 - exp(-inputs_to_hidden_units));
hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units));
% (d) hidden_layer_state = -1 ./ (1 + exp(-inputs_to_hidden_units));
The correct choice, left uncommented above, is (c): it corresponds to the logistic function σ(z) = 1 / (1 + e^(-z)).
softmax
The other is computing the input to the softmax layer:
%% COMPUTE STATE OF OUTPUT LAYER.
% Compute inputs to softmax.
% FILL IN CODE. Replace the line below by one of the options.
inputs_to_softmax = zeros(vocab_size, batchsize);
% Options
inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, 1, batchsize);
% (b) inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, batchsize, 1);
% (c) inputs_to_softmax = hidden_layer_state * hid_to_output_weights' + repmat(output_bias, 1, batchsize);
% (d) inputs_to_softmax = hid_to_output_weights * hidden_layer_state + repmat(output_bias, batchsize, 1);
This is simply a matter of making the matrix and vector dimensions line up:
% hidden_layer_state: State of units in the hidden layer as a matrix of size
%   numhid2 X batchsize
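For completeness, fprop then turns inputs_to_softmax into the output distribution, roughly as follows (the per-column maximum is subtracted first for numerical stability; a sketch, not the assignment file verbatim):

inputs_to_softmax = inputs_to_softmax - repmat(max(inputs_to_softmax), vocab_size, 1);
output_layer_state = exp(inputs_to_softmax);
output_layer_state = output_layer_state ./ repmat(sum(output_layer_state, 1), vocab_size, 1);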
After the forward pass, compute the derivative of the cross-entropy with respect to z, the input to the softmax:
% COMPUTE DERIVATIVE.
%% Expand the target to a sparse 1-of-K vector.
expanded_target_batch = expansion_matrix(:, target_batch);
%% Compute derivative of cross-entropy loss function.
error_deriv = output_layer_state - expanded_target_batch;
Here y denotes the softmax output (output_layer_state) and z the input to the softmax (inputs_to_softmax).
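Written out, using the standard softmax-plus-cross-entropy result rather than anything quoted from the handout: with y_j = e^{z_j} / \sum_k e^{z_k} and CE = -\sum_j t_j \log y_j for a one-hot target t,

\frac{\partial CE}{\partial z_j} = \sum_i \frac{\partial CE}{\partial y_i}\,\frac{\partial y_i}{\partial z_j} = y_j - t_j

which over a whole mini-batch is exactly error_deriv = output_layer_state - expanded_target_batch.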
Cross-entropy
Then compute the cross-entropy itself:
% MEASURE LOSS FUNCTION.
CE = -sum(sum( ...
  expanded_target_batch .* log(output_layer_state + tiny))) / batchsize;
count = count + 1;
this_chunk_CE = this_chunk_CE + (CE - this_chunk_CE) / count;
trainset_CE = trainset_CE + (CE - trainset_CE) / m;
fprintf(1, '\rBatch %d Train CE %.3f', m, this_chunk_CE);
if mod(m, show_training_CE_after) == 0
  fprintf(1, '\n');
  count = 0;
  this_chunk_CE = 0;
end
if OctaveMode
  fflush(1);
end
Both are incremental averages: trainset_CE is the running average CE over all mini-batches processed so far in the current epoch (note the division by the batch index m), while this_chunk_CE averages only the current chunk of batches and is reset every show_training_CE_after batches, so the printed value tracks recent progress.
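A throwaway check (not part of the assignment) that the incremental form really does compute an average:

ce_values = [4.2 3.9 3.7 3.6];             % pretend these are four batch CEs
running = 0;
for n = 1:numel(ce_values)
  running = running + (ce_values(n) - running) / n;   % same form as trainset_CE
end
disp([running, mean(ce_values)]);          % both print 3.8500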
Backpropagation
Next comes the backward pass.
Output layer
%% OUTPUT LAYER.
hid_to_output_weights_gradient = hidden_layer_state * error_deriv';
output_bias_gradient = sum(error_deriv, 2);
back_propagated_deriv_1 = (hid_to_output_weights * error_deriv) ...
  .* hidden_layer_state .* (1 - hidden_layer_state);
The derivation here is a bit hazier, since the course never worked it out explicitly. But a softmax output with cross-entropy differs from the lecture's squared-error network only in the loss function. In the squared-error chain rule, the final factor is the derivative of the loss with respect to the output; for softmax plus cross-entropy that factor (combined with the softmax derivative) collapses to y - t, so it is simply replaced by error_deriv. The lecture's formula also gives the derivative with respect to the weights, whereas back_propagated_deriv_1 needs the derivative of the cross-entropy with respect to the hidden layer's output; with z = Wx + b, the factor dz/dW = x is therefore swapped for dz/dx = W, i.e. hid_to_output_weights, and the result is then multiplied elementwise by the logistic derivative h .* (1 - h).
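Spelled out (my restatement of the chain rule, matching the three lines of code above), with h the hidden activations, W_{ho} = hid_to_output_weights, z^{(o)} = W_{ho}^\top h + b_o the softmax input and z^{(h)} the hidden pre-activation:

\frac{\partial CE}{\partial W_{ho}} = h\,(y - t)^{\top}, \qquad
\frac{\partial CE}{\partial b_o} = \sum_{\mathrm{batch}} (y - t), \qquad
\frac{\partial CE}{\partial z^{(h)}} = \bigl(W_{ho}\,(y - t)\bigr) \odot h \odot (1 - h)

These correspond line by line to hid_to_output_weights_gradient, output_bias_gradient and back_propagated_deriv_1.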
Hidden layer
The first multiple-choice question here asks for the derivative of the cross-entropy with respect to the hidden-layer weights embed_to_hid_weights, so we multiply by that layer's input x, i.e. embedding_layer_state:
% FILL IN CODE. Replace the line below by one of the options.
embed_to_hid_weights_gradient = zeros(numhid1 * numwords, numhid2);
% Options:
% (a) embed_to_hid_weights_gradient = back_propagated_deriv_1' * embedding_layer_state;
embed_to_hid_weights_gradient = embedding_layer_state * back_propagated_deriv_1';
% (c) embed_to_hid_weights_gradient = back_propagated_deriv_1;
% (d) embed_to_hid_weights_gradient = embedding_layer_state;
The bias gradient amounts to multiplying by a vector of ones, in other words summing over the batch dimension:
% FILL IN CODE. Replace the line below by one of the options.
hid_bias_gradient = zeros(numhid2, 1);
% Options
hid_bias_gradient = sum(back_propagated_deriv_1, 2);
% (b) hid_bias_gradient = sum(back_propagated_deriv_1, 1);
% (c) hid_bias_gradient = back_propagated_deriv_1;
% (d) hid_bias_gradient = back_propagated_deriv_1';
Next the chain rule is applied once more, this time multiplying by this layer's weights embed_to_hid_weights to propagate the derivative back to the embedding layer:
% FILL IN CODE. Replace the line below by one of the options.
back_propagated_deriv_2 = zeros(numhid2, batchsize);
% Options
back_propagated_deriv_2 = embed_to_hid_weights * back_propagated_deriv_1;
% (b) back_propagated_deriv_2 = back_propagated_deriv_1 * embed_to_hid_weights;
% (c) back_propagated_deriv_2 = back_propagated_deriv_1' * embed_to_hid_weights;
% (d) back_propagated_deriv_2 = back_propagated_deriv_1 * embed_to_hid_weights';
This yields back_propagated_deriv_2, the derivative of the cross-entropy with respect to the output of the embedding layer, a (numwords * numhid1, batch_size) matrix. The first dimension is numwords * numhid1 because the embedding layer's output is the concatenation of the three context words' embeddings.
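A quick dimension check with the default hyperparameters (numhid1 = 50, numwords = 3, numhid2 = 200, batchsize = 100) shows that option (a) is the only product whose dimensions even conform:

% embed_to_hid_weights    : (numwords*numhid1) x numhid2   = 150 x 200
% back_propagated_deriv_1 : numhid2 x batchsize            = 200 x 100
% option (a) product      : (numwords*numhid1) x batchsize = 150 x 100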
Embedding layer
For the embedding layer, word_embedding_weights_gradient is a vocab_size × numhid1 matrix, one row per word's embedding. Each word's raw input form is a vocab_size-dimensional one-hot vector; multiplying it by the vocab_size × numhid1 weight matrix picks out exactly that word's numhid1-dimensional vector.
So how do we recover the gradient of each embedding weight from back_propagated_deriv_2?
word_embedding_weights_gradient(:) = 0;
for w = 1:numwords
  word_embedding_weights_gradient = word_embedding_weights_gradient + ...
    expansion_matrix(:, input_batch(w, :)) * ...
    (back_propagated_deriv_2(1 + (w - 1) * numhid1 : w * numhid1, :)');
end
The transpose of the sub-matrix spanning rows 1 + (w - 1) * numhid1 through w * numhid1 is the derivative with respect to the embedding of the w-th context word in each of the batch_size training cases, i.e. a batch_size × numhid1 matrix. But the w-th word of each case could be any of the vocab_size words, so this derivative has to be attributed to the word that actually appeared. That is what expansion_matrix(:, input_batch(w, :)) does: it fetches the one-hot vector of the w-th word of every case, forming a vocab_size × batch_size matrix. Multiplying the two gives a vocab_size × numhid1 matrix, the gradient of every word's embedding. This produces numwords such word_embedding_weights_gradient terms, one per context position, and the code simply accumulates them. Note that Google's word2vec sums the input word vectors up front (instead of concatenating them), whereas here the vectors are concatenated first and the position-wise gradients summed at the end; the two schemes are subtly different.
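expansion_matrix itself is nothing mysterious: as far as I can tell it is just the vocab_size × vocab_size identity matrix, so indexing its columns by word ids produces one-hot columns. A tiny illustration with a 5-word vocabulary:

expansion_matrix = eye(5);             % stand-in for eye(vocab_size)
ids = [2 5 2];                         % the w-th word of three training cases
onehots = expansion_matrix(:, ids)     % 5 x 3 matrix, one one-hot column per case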
Computing the updates
% UPDATE WEIGHTS AND BIASES.
word_embedding_weights_delta = ...
  momentum .* word_embedding_weights_delta + ...
  word_embedding_weights_gradient ./ batchsize;
word_embedding_weights = word_embedding_weights ...
  - learning_rate * word_embedding_weights_delta;

embed_to_hid_weights_delta = ...
  momentum .* embed_to_hid_weights_delta + ...
  embed_to_hid_weights_gradient ./ batchsize;
embed_to_hid_weights = embed_to_hid_weights ...
  - learning_rate * embed_to_hid_weights_delta;

hid_to_output_weights_delta = ...
  momentum .* hid_to_output_weights_delta + ...
  hid_to_output_weights_gradient ./ batchsize;
hid_to_output_weights = hid_to_output_weights ...
  - learning_rate * hid_to_output_weights_delta;

hid_bias_delta = momentum .* hid_bias_delta + ...
  hid_bias_gradient ./ batchsize;
hid_bias = hid_bias - learning_rate * hid_bias_delta;

output_bias_delta = momentum .* output_bias_delta + ...
  output_bias_gradient ./ batchsize;
output_bias = output_bias - learning_rate * output_bias_delta;
Every delta is computed the same way: delta = momentum * prev_delta + gradient / batchsize. This is the usual momentum update: the gradient is averaged over the mini-batch and blended with the previous update, which smooths the trajectory and keeps any single batch from changing the weights too much. The delta is then scaled by the learning rate and subtracted from the weights, i.e. mini-batch gradient descent with momentum.
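As a formula, just restating the code:

\Delta_t = \mu\,\Delta_{t-1} + \frac{1}{B}\,G_t, \qquad \theta \leftarrow \theta - \eta\,\Delta_t

where \mu is momentum, B is batchsize, G_t is the gradient summed over the mini-batch, and \eta is learning_rate.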
After that there is little left to discuss; run the evaluation on the validation and test sets:
>> model = train(3);
Average Training CE 3.919
Finished Training.
Final Training CE 3.919
Running validation ... Final Validation CE 3.769
Running test ... Final Test CE 3.776
Training took 40.41 seconds
>> display_nearest_words('she', model, 5)
he 0.50
they 1.23
we 1.27
i 1.61
my 1.62
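display_nearest_words ranks words by distance to the query word in the learned embedding space; conceptually it does something like the sketch below (Euclidean distance assumed; this is not the assignment file verbatim):

word = 'she';  k = 5;
idx = find(strcmp(word, model.vocab));                 % index of the query word
rep = model.word_embedding_weights(idx, :);            % its 50-dim embedding
diffs = model.word_embedding_weights - repmat(rep, numel(model.vocab), 1);
dists = sqrt(sum(diffs .^ 2, 2));
[d, order] = sort(dists);
for i = 2:k + 1                                        % skip the query word itself
  fprintf('%s %.2f\n', model.vocab{order(i)}, d(i));
end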
Reference
https://github.com/hankcs/coursera-neural-net