The task in this assignment is to design a neural network language model that, given the preceding three words, predicts the fourth. Training the model yields dense word representations. Most of the code is already written; the handful of tricky spots are posed as multiple-choice questions.
Dataset
The vocabulary contains 250 words. The training set consists of roughly 372,500 4-grams (3725 mini-batches of 100, as we will see below), and the validation and test sets hold about 50,000 each. The data is extracted from very simple sentences:
No , he says now . And what did he do ? The money 's there . That was less than a year ago . But he made only the first .
Warm-up
Loading data.mat gives a struct data with four fields:
>> fieldnames(data)
ans =
  4×1 cell array
    'testData'
    'trainData'
    'validData'
    'vocab'
data.vocab is the vocabulary itself, i.e. the 250 words.
The other three fields are 4 × N matrices whose entries are ids into the vocabulary. The datasets are loaded with:
[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);
The function automatically assigns the first three words of each 4-gram to x and the fourth to t, forming (x, t) training cases. The argument 100 is the mini-batch size; after loading, the training data is split into 3725 mini-batches of 100 cases each.
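Conceptually, the splitting and batching inside load_data amounts to something like the sketch below; raw is a stand-in name for data.trainData, not the function's actual internals.

batchsize = 100;
raw = data.trainData;                          % 4 x N matrix of training 4-grams (word ids)
D = size(raw, 1) - 1;                          % context length, here 3
M = floor(size(raw, 2) / batchsize);           % number of full mini-batches (3725)
train_input  = reshape(raw(1:D, 1:batchsize * M), D, batchsize, M);    % first three words
train_target = reshape(raw(D + 1, 1:batchsize * M), 1, batchsize, M);  % fourth word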
Training
Training is started with:
model = train(1);
The training framework is already in place; the few key steps are posed as multiple-choice questions with four options each, so you only have to uncomment the correct one. The default hyperparameters in the function are:
% SET HYPERPARAMETERS HERE.
batchsize = 100;       % Mini-batch size.
learning_rate = 0.1;   % Learning rate; default = 0.1.
momentum = 0.5;        % Momentum; default = 0.9.
numhid1 = 50;          % Dimensionality of embedding space; default = 50.
numhid2 = 200;         % Number of units in hidden layer; default = 200.
init_wt = 0.01;        % Standard deviation of the normal distribution
                       % which is sampled to get the initial weights; default = 0.01
So the word embeddings are 50-dimensional and the hidden layer has 200 units. The input layer is the 3-word context window (numwords = 3), whose looked-up embeddings are concatenated into a 3 × 50 = 150-dimensional vector.
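With vocab_size = 250 and numwords = 3, these hyperparameters imply the following parameter shapes; the initialization below is a sketch of what the training script does, not quoted from it verbatim.

word_embedding_weights = init_wt * randn(vocab_size, numhid1);          % 250 x 50
embed_to_hid_weights   = init_wt * randn(numwords * numhid1, numhid2);  % 150 x 200
hid_to_output_weights  = init_wt * randn(numhid2, vocab_size);          % 200 x 250
hid_bias    = zeros(numhid2, 1);                                        % 200 x 1
output_bias = zeros(vocab_size, 1);                                     % 250 x 1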
Forward propagation
Each epoch trains on every mini-batch in turn; the first step for a batch is a forward pass:
% LOOP OVER MINI-BATCHES.
for m = 1:numbatches
  input_batch = train_input(:, :, m);
  target_batch = train_target(:, :, m);

  % FORWARD PROPAGATE.
  % Compute the state of each layer in the network given the input batch
  % and all weights and biases
  [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
    fprop(input_batch, ...
          word_embedding_weights, embed_to_hid_weights, ...
          hid_to_output_weights, hid_bias, output_bias);
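Before reaching the multiple-choice lines, fprop computes the embedding layer state as a table lookup followed by a linear map into the hidden layer, roughly like this (a sketch of the idea, not fprop.m verbatim):

% Look up the embedding of every word in the batch and stack the three context
% embeddings of each training case into one numhid1 * numwords = 150-dim column.
embedding_layer_state = reshape( ...
  word_embedding_weights(reshape(input_batch, 1, []), :)', ...
  numhid1 * numwords, []);
% Linear map into the hidden layer (pre-activation).
inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
  repmat(hid_bias, 1, batchsize);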
Logistic activation function
Two multiple-choice questions are left open in the forward pass. The first is the logistic activation function:
% Apply logistic activation function.
% FILL IN CODE. Replace the line below by one of the options.
hidden_layer_state = zeros(numhid2, batchsize);
% Options
% (a) hidden_layer_state = 1 ./ (1 + exp(inputs_to_hidden_units));
% (b) hidden_layer_state = 1 ./ (1 - exp(-inputs_to_hidden_units));
hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units));
% (d) hidden_layer_state = -1 ./ (1 + exp(-inputs_to_hidden_units));
The correct choice, left uncommented above, is (c): it corresponds to the logistic function σ(z) = 1 / (1 + e^(-z)).
softmax
The other is computing the input to the softmax layer:
%% COMPUTE STATE OF OUTPUT LAYER.
% Compute inputs to softmax.
% FILL IN CODE. Replace the line below by one of the options.
inputs_to_softmax = zeros(vocab_size, batchsize);
% Options
inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, 1, batchsize);
% (b) inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, batchsize, 1);
% (c) inputs_to_softmax = hidden_layer_state * hid_to_output_weights' + repmat(output_bias, 1, batchsize);
% (d) inputs_to_softmax = hid_to_output_weights * hidden_layer_state + repmat(output_bias, batchsize, 1);
This is simply a matter of making the matrix and vector dimensions line up:
% hidden_layer_state: State of units in the hidden layer as a matrix of size
%   numhid2 X batchsize
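For completeness, fprop then turns inputs_to_softmax into the output distribution, roughly as follows (the per-column maximum is subtracted first for numerical stability; a sketch, not the assignment file verbatim):

inputs_to_softmax = inputs_to_softmax - repmat(max(inputs_to_softmax), vocab_size, 1);
output_layer_state = exp(inputs_to_softmax);
output_layer_state = output_layer_state ./ repmat(sum(output_layer_state, 1), vocab_size, 1);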
After the forward pass, compute the derivative of the cross-entropy with respect to z, the input to the softmax:
% COMPUTE DERIVATIVE.
%% Expand the target to a sparse 1-of-K vector.
expanded_target_batch = expansion_matrix(:, target_batch);
%% Compute derivative of cross-entropy loss function.
error_deriv = output_layer_state - expanded_target_batch;
Here y denotes the softmax output (output_layer_state) and z the input to the softmax (inputs_to_softmax).
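Written out, using the standard softmax-plus-cross-entropy result rather than anything quoted from the handout: with y_j = e^{z_j} / \sum_k e^{z_k} and CE = -\sum_j t_j \log y_j for a one-hot target t,

\frac{\partial CE}{\partial z_j} = \sum_i \frac{\partial CE}{\partial y_i}\,\frac{\partial y_i}{\partial z_j} = y_j - t_j

which over a whole mini-batch is exactly error_deriv = output_layer_state - expanded_target_batch.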
Cross-entropy
Then compute the cross-entropy itself:
% MEASURE LOSS FUNCTION.
CE = -sum(sum( ...
  expanded_target_batch .* log(output_layer_state + tiny))) / batchsize;
count = count + 1;
this_chunk_CE = this_chunk_CE + (CE - this_chunk_CE) / count;
trainset_CE = trainset_CE + (CE - trainset_CE) / m;
fprintf(1, '\rBatch %d Train CE %.3f', m, this_chunk_CE);
if mod(m, show_training_CE_after) == 0
  fprintf(1, '\n');
  count = 0;
  this_chunk_CE = 0;
end
if OctaveMode
  fflush(1);
end
Both are incremental averages: trainset_CE is the running average CE over all mini-batches processed so far in the current epoch (note the division by the batch index m), while this_chunk_CE averages only the current chunk of batches and is reset every show_training_CE_after batches, so the printed value tracks recent progress.
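A throwaway check (not part of the assignment) that the incremental form really does compute an average:

ce_values = [4.2 3.9 3.7 3.6];             % pretend these are four batch CEs
running = 0;
for n = 1:numel(ce_values)
  running = running + (ce_values(n) - running) / n;   % same form as trainset_CE
end
disp([running, mean(ce_values)]);          % both print 3.8500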
Backpropagation
Next comes the backward pass.
Output layer
%% OUTPUT LAYER.
hid_to_output_weights_gradient = hidden_layer_state * error_deriv';
output_bias_gradient = sum(error_deriv, 2);
back_propagated_deriv_1 = (hid_to_output_weights * error_deriv) ...
  .* hidden_layer_state .* (1 - hidden_layer_state);
The derivation here is a bit hazier, since the course never worked it out explicitly. But a softmax output with cross-entropy differs from the lecture's squared-error network only in the loss function. In the squared-error chain rule, the final factor is the derivative of the loss with respect to the output; for softmax plus cross-entropy that factor (combined with the softmax derivative) collapses to y - t, so it is simply replaced by error_deriv. The lecture's formula also gives the derivative with respect to the weights, whereas back_propagated_deriv_1 needs the derivative of the cross-entropy with respect to the hidden layer's output; with z = Wx + b, the factor dz/dW = x is therefore swapped for dz/dx = W, i.e. hid_to_output_weights, and the result is then multiplied elementwise by the logistic derivative h .* (1 - h).
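Spelled out (my restatement of the chain rule, matching the three lines of code above), with h the hidden activations, W_{ho} = hid_to_output_weights, z^{(o)} = W_{ho}^\top h + b_o the softmax input and z^{(h)} the hidden pre-activation:

\frac{\partial CE}{\partial W_{ho}} = h\,(y - t)^{\top}, \qquad
\frac{\partial CE}{\partial b_o} = \sum_{\mathrm{batch}} (y - t), \qquad
\frac{\partial CE}{\partial z^{(h)}} = \bigl(W_{ho}\,(y - t)\bigr) \odot h \odot (1 - h)

These correspond line by line to hid_to_output_weights_gradient, output_bias_gradient and back_propagated_deriv_1.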
Hidden layer
The first multiple-choice question here asks for the derivative of the cross-entropy with respect to the hidden-layer weights embed_to_hid_weights, so we multiply by that layer's input x, i.e. embedding_layer_state:
% FILL IN CODE. Replace the line below by one of the options.
embed_to_hid_weights_gradient = zeros(numhid1 * numwords, numhid2);
% Options:
% (a) embed_to_hid_weights_gradient = back_propagated_deriv_1' * embedding_layer_state;
embed_to_hid_weights_gradient = embedding_layer_state * back_propagated_deriv_1';
% (c) embed_to_hid_weights_gradient = back_propagated_deriv_1;
% (d) embed_to_hid_weights_gradient = embedding_layer_state;
The bias gradient amounts to multiplying by a vector of ones, in other words summing over the batch dimension:
% FILL IN CODE. Replace the line below by one of the options.
hid_bias_gradient = zeros(numhid2, 1);
% Options
hid_bias_gradient = sum(back_propagated_deriv_1, 2);
% (b) hid_bias_gradient = sum(back_propagated_deriv_1, 1);
% (c) hid_bias_gradient = back_propagated_deriv_1;
% (d) hid_bias_gradient = back_propagated_deriv_1';
Next the chain rule is applied once more, this time multiplying by this layer's weights embed_to_hid_weights to propagate the derivative back to the embedding layer:
% FILL IN CODE. Replace the line below by one of the options.
back_propagated_deriv_2 = zeros(numhid2, batchsize);
% Options
back_propagated_deriv_2 = embed_to_hid_weights * back_propagated_deriv_1;
% (b) back_propagated_deriv_2 = back_propagated_deriv_1 * embed_to_hid_weights;
% (c) back_propagated_deriv_2 = back_propagated_deriv_1' * embed_to_hid_weights;
% (d) back_propagated_deriv_2 = back_propagated_deriv_1 * embed_to_hid_weights';
This yields back_propagated_deriv_2, the derivative of the cross-entropy with respect to the output of the embedding layer, a (numwords * numhid1, batch_size) matrix. The first dimension is numwords * numhid1 because the embedding layer's output is the concatenation of the three context words' embeddings.
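A quick dimension check with the default hyperparameters (numhid1 = 50, numwords = 3, numhid2 = 200, batchsize = 100) shows that option (a) is the only product whose dimensions even conform:

% embed_to_hid_weights    : (numwords*numhid1) x numhid2   = 150 x 200
% back_propagated_deriv_1 : numhid2 x batchsize            = 200 x 100
% option (a) product      : (numwords*numhid1) x batchsize = 150 x 100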
Embedding layer
For the embedding layer, word_embedding_weights_gradient is a vocab_size × numhid1 matrix, one row per word's embedding. Each word's raw input form is a vocab_size-dimensional one-hot vector; multiplying it by the vocab_size × numhid1 weight matrix picks out exactly that word's numhid1-dimensional vector.
So how do we recover the gradient of each embedding weight from back_propagated_deriv_2?
word_embedding_weights_gradient(:) = 0;
for w = 1:numwords
  word_embedding_weights_gradient = word_embedding_weights_gradient + ...
    expansion_matrix(:, input_batch(w, :)) * ...
    (back_propagated_deriv_2(1 + (w - 1) * numhid1 : w * numhid1, :)');
end
The transpose of the sub-matrix spanning rows 1 + (w - 1) * numhid1 through w * numhid1 is the derivative with respect to the embedding of the w-th context word in each of the batch_size training cases, i.e. a batch_size × numhid1 matrix. But the w-th word of each case could be any of the vocab_size words, so this derivative has to be attributed to the word that actually appeared. That is what expansion_matrix(:, input_batch(w, :)) does: it fetches the one-hot vector of the w-th word of every case, forming a vocab_size × batch_size matrix. Multiplying the two gives a vocab_size × numhid1 matrix, the gradient of every word's embedding. This produces numwords such word_embedding_weights_gradient terms, one per context position, and the code simply accumulates them. Note that Google's word2vec sums the input word vectors up front (instead of concatenating them), whereas here the vectors are concatenated first and the position-wise gradients summed at the end; the two schemes are subtly different.
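expansion_matrix itself is nothing mysterious: as far as I can tell it is just the vocab_size × vocab_size identity matrix, so indexing its columns by word ids produces one-hot columns. A tiny illustration with a 5-word vocabulary:

expansion_matrix = eye(5);             % stand-in for eye(vocab_size)
ids = [2 5 2];                         % the w-th word of three training cases
onehots = expansion_matrix(:, ids)     % 5 x 3 matrix, one one-hot column per case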
Computing the updates
% UPDATE WEIGHTS AND BIASES.
word_embedding_weights_delta = ...
  momentum .* word_embedding_weights_delta + ...
  word_embedding_weights_gradient ./ batchsize;
word_embedding_weights = word_embedding_weights ...
  - learning_rate * word_embedding_weights_delta;

embed_to_hid_weights_delta = ...
  momentum .* embed_to_hid_weights_delta + ...
  embed_to_hid_weights_gradient ./ batchsize;
embed_to_hid_weights = embed_to_hid_weights ...
  - learning_rate * embed_to_hid_weights_delta;

hid_to_output_weights_delta = ...
  momentum .* hid_to_output_weights_delta + ...
  hid_to_output_weights_gradient ./ batchsize;
hid_to_output_weights = hid_to_output_weights ...
  - learning_rate * hid_to_output_weights_delta;

hid_bias_delta = momentum .* hid_bias_delta + ...
  hid_bias_gradient ./ batchsize;
hid_bias = hid_bias - learning_rate * hid_bias_delta;

output_bias_delta = momentum .* output_bias_delta + ...
  output_bias_gradient ./ batchsize;
output_bias = output_bias - learning_rate * output_bias_delta;
Every delta is computed the same way: delta = momentum * prev_delta + gradient / batchsize. This is the usual momentum update: the gradient is averaged over the mini-batch and blended with the previous update, which smooths the trajectory and keeps any single batch from changing the weights too much. The delta is then scaled by the learning rate and subtracted from the weights, i.e. mini-batch gradient descent with momentum.
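As a formula, just restating the code:

\Delta_t = \mu\,\Delta_{t-1} + \frac{1}{B}\,G_t, \qquad \theta \leftarrow \theta - \eta\,\Delta_t

where \mu is momentum, B is batchsize, G_t is the gradient summed over the mini-batch, and \eta is learning_rate.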
After that there is little left to discuss; run the evaluation on the validation and test sets:
>> model = train(3);
Average Training CE 3.919
Finished Training.
Final Training CE 3.919
Running validation ... Final Validation CE 3.769
Running test ... Final Test CE 3.776
Training took 40.41 seconds
>> display_nearest_words('she', model, 5)
he 0.50
they 1.23
we 1.27
i 1.61
my 1.62
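display_nearest_words ranks words by distance to the query word in the learned embedding space; conceptually it does something like the sketch below (Euclidean distance assumed; this is not the assignment file verbatim):

word = 'she';  k = 5;
idx = find(strcmp(word, model.vocab));                 % index of the query word
rep = model.word_embedding_weights(idx, :);            % its 50-dim embedding
diffs = model.word_embedding_weights - repmat(rep, numel(model.vocab), 1);
dists = sqrt(sum(diffs .^ 2, 2));
[d, order] = sort(dists);
for i = 2:k + 1                                        % skip the query word itself
  fprintf('%s %.2f\n', model.vocab{order(i)}, d(i));
end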
Reference
https://github.com/hankcs/coursera-neural-net