This is an application of classification algorithms to fraud detection.
5.4.1 A Use Case: Fraud Detection in Transaction Data
Suppose we have the following sample data:
Legitimate transaction descriptions: data/ch05/fraud/descriptions.txt
AMAZON.COM
USAIRWAY
EXPEDIA TRAVEL
Fraudulent transaction descriptions: data/ch05/fraud/fraud-descriptions.txt
CAFE QWERTY
whole flash
food ASDFG
A training dataset is then generated from these collections.
Each transaction is defined by the following attribute values (in order):
•User ID
•Transaction ID
•Transaction description
•Transaction amount
•Transaction location x-coordinate
•Transaction location y-coordinate
•A boolean flag marking the transaction as fraudulent (true) or legitimate (false)
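As seen later in the program output, such a transaction is serialized as a colon-separated line like `1:1:EXPEDIA TRAVEL:63.29:856.0:717.0:false`. Below is a hypothetical parser for that format; the class `TxnRecord` and its `parse` method are my own illustration, not part of the book's codebase.

```java
public class TxnRecord {
    public final int userId;
    public final int txnId;
    public final String description;
    public final double amount;
    public final double x, y;
    public final boolean fraud;

    public TxnRecord(int userId, int txnId, String description,
                     double amount, double x, double y, boolean fraud) {
        this.userId = userId; this.txnId = txnId; this.description = description;
        this.amount = amount; this.x = x; this.y = y; this.fraud = fraud;
    }

    // Parse one colon-separated line, e.g.
    // "1:1:EXPEDIA TRAVEL:63.29:856.0:717.0:false"
    public static TxnRecord parse(String line) {
        String[] f = line.split(":");
        return new TxnRecord(Integer.parseInt(f[0]), Integer.parseInt(f[1]), f[2],
                Double.parseDouble(f[3]), Double.parseDouble(f[4]),
                Double.parseDouble(f[5]), Boolean.parseBoolean(f[6]));
    }
}
```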
The goal is to build a classifier that learns from this data how to recognize a fraudulent transaction.
5.4.2 An Overview of Neural Networks
A neural network is made up of neuron-like nodes, each with inputs and outputs, connected to other nodes through weighted links.
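A minimal sketch of what a single such node computes, assuming the common sigmoid activation (the class name `Neuron` and this exact formula are my illustration, not taken from the book's code):

```java
public class Neuron {
    // Standard logistic sigmoid activation
    public static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // A node "fires" on the weighted sum of its inputs plus a bias,
    // squashed through the sigmoid into the range (0, 1)
    public static double fire(double[] inputs, double[] weights, double bias) {
        double sum = bias;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return sigmoid(sum);
    }
}
```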
5.4.3 A Working Neural-Network Fraud Detector
Again, three steps: training, validation, and production:
```java
package com.hankcs;

import iweb2.ch5.usecase.fraud.NNFraudClassifier;
import iweb2.ch5.usecase.fraud.data.TransactionDataset;
import iweb2.ch5.usecase.fraud.data.TransactionLoader;
import iweb2.ch5.usecase.fraud.util.FraudErrorEstimator;

public class ch5_3_FraudNN {

    public static void main(String[] args) throws Exception {
        // Load the training set
        TransactionDataset ds = TransactionLoader.loadTrainingDataset();
        // Collect each user's spending habits
        ds.calculateUserStats();

        //
        // CREATE the classifier
        //
        // The classifier implementation wraps a neural network model
        NNFraudClassifier nnFraudClassifier = new NNFraudClassifier(ds);

        // Give it a name.
        // It will be used later when we serialize the classifier
        nnFraudClassifier.setName("MyNeuralClassifier");

        //
        // TRAIN the classifier
        //
        // Configure classifier with attributes that will be used as inputs into NN:
        // transaction amount, location, and description
        nnFraudClassifier.useDefaultAttributes();

        // Set the number of training iterations,
        // i.e. how many times the data is propagated through the network
        nnFraudClassifier.setNTrainingIterations(10);

        // Start the training ...
        nnFraudClassifier.train();

        //
        // STORE the classifier
        //
        // Serialize the classifier so it survives a crash or restart
        nnFraudClassifier.save();

        // You can load a previously saved (already trained) classifier
        NNFraudClassifier nnClone = NNFraudClassifier.load(nnFraudClassifier.getName());

        // Classify a couple of samples from the training set as a sanity check.
        // The first (ID 1) should be a legitimate transaction
        nnClone.classify("1");

        // The second (ID 305) should be a fraudulent transaction
        nnClone.classify("305");

        // Now, calculate the error rate on a fresh test set
        TransactionDataset testDS = TransactionLoader.loadTestDataset();

        // A helper class that evaluates the accuracy of the classifier
        FraudErrorEstimator auditor = new FraudErrorEstimator(testDS, nnClone);
        auditor.run();
    }
}
```
In ds.calculateUserStats(), each user's spending profile includes: the maximum and minimum amounts of their legitimate transactions; the set of words occurring in their legitimate transaction descriptions; and the range and centroid of their transaction locations.
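As a hedge, here is a hypothetical sketch of what such a per-user profile could look like; the class `UserStats` and its field names are my own, and the real `calculateUserStats()` in the iweb2 codebase may differ in detail.

```java
import java.util.*;

public class UserStats {
    public double minAmount = Double.POSITIVE_INFINITY;  // smallest legitimate amount
    public double maxAmount = Double.NEGATIVE_INFINITY;  // largest legitimate amount
    public final Set<String> descriptionWords = new HashSet<>();
    private double sumX = 0, sumY = 0;
    private int n = 0;

    // Fold one legitimate transaction into the profile
    public void addLegitTxn(double amount, String description, double x, double y) {
        minAmount = Math.min(minAmount, amount);
        maxAmount = Math.max(maxAmount, amount);
        descriptionWords.addAll(Arrays.asList(description.toUpperCase().split("\\s+")));
        sumX += x; sumY += y; n++;
    }

    // Centroid of the user's legitimate transaction locations
    public double centroidX() { return sumX / n; }
    public double centroidY() { return sumY / n; }
}
```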
Output:
```
saved classifier in file: C:\iWeb2\data\ch05\MyNeuralClassifier
loaded classifier from file: MyNeuralClassifier

Transaction:
>> 1:1:EXPEDIA TRAVEL:63.29:856.0:717.0:false

Assessment:
>> This is a VALID_TXN

Transaction:
>> 1:305:CANADIAN PHARMACY:3978.57:52.0:70.0:true

Assessment:
>> This is a FRAUD_TXN

Total test dataset txns: 1100, Number of fraud txns:100
Classified correctly: 1100, Misclassified valid txns: 0, Misclassified fraud txns: 0
```
The error rate appears to be zero. But if we replace "BLACK DIAMOND COFFEE" in data/ch05/fraud/test-txns.txt with "TAOBAO", misclassifications appear:
```
saved classifier in file: C:\iWeb2\data\ch05\MyNeuralClassifier
loaded classifier from file: MyNeuralClassifier

Transaction:
>> 1:1:EXPEDIA TRAVEL:63.29:856.0:717.0:false

Assessment:
>> This is a VALID_TXN

Transaction:
>> 1:305:CANADIAN PHARMACY:3978.57:52.0:70.0:true

Assessment:
>> This is a FRAUD_TXN

- n_txnamt = 0.33646216373137205
- n_location = 0.6601082057290067
- n_description = 0.0
- userid = 25.0
- txnid = 500523
- txnamt = 63.79
- location_x = 533.0
- location_y = 503.0
- description = TAOBAO
--> VALID_TXN

- n_txnamt = 1.0138677641585399
- n_location = 0.5745841533228392
- n_description = 0.0
- userid = 26.0
- txnid = 500574
- txnamt = 127.97
- location_x = 734.0
- location_y = 507.0
- description = TAOBAO
--> VALID_TXN

- n_txnamt = 0.35626185958254264
- n_location = 0.658153849503683
- n_description = 0.0
- userid = 23.0
- txnid = 500273
- txnamt = 47.76
- location_x = 966.0
- location_y = 991.0
- description = TAOBAO
--> VALID_TXN

- n_txnamt = 0.48453914767096135
- n_location = 0.655796929157372
- n_description = 0.0
- userid = 21.0
- txnid = 500025
- txnamt = 50.47
- location_x = 980.0
- location_y = 996.0
- description = TAOBAO
--> VALID_TXN

Total test dataset txns: 1100, Number of fraud txns:100
Classified correctly: 1096, Misclassified valid txns: 4, Misclassified fraud txns: 0
```
This is because the first test used transactions whose attribute values matched the training data exactly, whereas "TAOBAO" is a description the classifier has never seen (note that n_description = 0.0 for every one of them). Of the 39 TAOBAO transactions, 4 were wrongly flagged.
5.4.4 Anatomy of the Neural-Network Fraud Detector
The most important step is training the neural network:
```java
/**
 * Train the neural network
 * @param nIterations the number of passes of the instances through the network
 */
private void trainNeuralNetwork(int nIterations) {
    for (int i = 1; i <= nIterations; i++) {
        for (Instance instance : ts.getInstances().values()) {
            double[] nnInput = createNNInputs(instance);
            double[] nnExpectedOutput = createNNOutputs(instance);
            nn.train(nnInput, nnExpectedOutput);
        }
        if (verbose) {
            System.out.println("finished training pass: " + i + " out of " + nIterations);
        }
    }
}
```
Here nn is a TransactionNN, a neural network specialized for the fraud-detection use case:
```java
public TransactionNN(String name) {
    super(name);
    createNN351();
}
```
The network has a 3-5-1 topology:
```java
/**
 * Three input nodes, five hidden nodes, and one output node
 */
private void createNN351() {
    // 1. Define Layers, Nodes and Node Biases
    Layer inputLayer = createInputLayer(
            0, // layer id
            3  // number of nodes
    );

    Layer hiddenLayer = createHiddenLayer(
            1, // layer id
            5, // number of nodes
            new double[]{1, 1.5, 1, 0.5, 1} // node biases
    );

    Layer outputLayer = createOutputLayer(
            2, // layer id
            1, // number of nodes
            new double[]{1.5} // node biases
    );

    setInputLayer(inputLayer);
    setOutputLayer(outputLayer);
    addHiddenLayer(hiddenLayer);

    // 2. Define links and weights between nodes
    // Id format: <layerId:nodeIdwithinLayer>

    // Weights for links from Input Layer to Hidden Layer:
    // we create the links (synapses) between nodes one by one
    setLink("0:0", "1:0", 0.25);
    setLink("0:0", "1:1", -0.5);
    setLink("0:0", "1:2", 0.25);
    setLink("0:0", "1:3", 0.25);
    setLink("0:0", "1:4", -0.5);

    setLink("0:1", "1:0", 0.25);
    setLink("0:1", "1:1", -0.5);
    setLink("0:1", "1:2", 0.25);
    setLink("0:1", "1:3", 0.25);
    setLink("0:1", "1:4", -0.5);

    setLink("0:2", "1:0", 0.25);
    setLink("0:2", "1:1", -0.5);
    setLink("0:2", "1:2", 0.25);
    setLink("0:2", "1:3", 0.25);
    setLink("0:2", "1:4", -0.5);

    // Weights for links from Hidden Layer to Output Layer
    setLink("1:0", "2:0", -0.5);
    setLink("1:1", "2:0", 0.5);
    setLink("1:2", "2:0", -0.5);
    setLink("1:3", "2:0", -0.5);
    setLink("1:4", "2:0", 0.5);

    if (isVerbose()) {
        System.out.println("NN created");
    }
}
```
In the 3-5-1 topology, the 3 stands for the three inputs: the normalized transaction amount, the Jaccard coefficient of the transaction description, and the distance between the user's transaction centroid and the current transaction's location.
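The description input can be sketched as the Jaccard coefficient between the words of a new description and the user's known vocabulary, i.e. |A ∩ B| / |A ∪ B|. This class is my own illustration (the book's implementation may differ), but it reproduces the behavior seen above: an unseen description such as "TAOBAO" scores 0.0, matching the n_description values in the output.

```java
import java.util.*;

public class Jaccard {
    // |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty
    public static double coefficient(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) return 0.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }
}
```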
Here setLink() is the key method:
```java
/**
 * Create a synaptic link between two nodes
 * @param fromNodeId the source node
 * @param toNodeId   the destination node
 * @param w          the link weight
 */
public void setLink(String fromNodeId, String toNodeId, double w)
```
5.4.5 A Base Class for General-Purpose Neural Networks
That is, TransactionNN's base class, the general-purpose neural network implementation: BaseNN.
BaseNN (structural side): excerpts from the general neural-network base class
```java
/**
 * Create the network's input layer: given a layer id and a node count,
 * instantiate a BaseLayer object.
 */
public Layer createInputLayer(int layerId, int nNodes) {
    BaseLayer baseLayer = new BaseLayer(layerId);
    for (int i = 0; i < nNodes; i++) {
        // the node itself
        Node node = createInputNode(layerId + ":" + i);
        // its synapse (incoming link)
        Link inlink = new BaseLink();
        inlink.setFromNode(node);
        // the initial weight is 1 and stays fixed during training
        inlink.setWeight(1.0);
        node.addInlink(inlink);
        baseLayer.addNode(node);
    }
    return baseLayer;
}

/**
 * Create a hidden layer: takes the layer id, the node count,
 * and the bias values for those nodes.
 */
public Layer createHiddenLayer(int layerId, int nNodes, double[] bias) {
    if (bias.length != nNodes) {
        throw new RuntimeException("Each node should have bias defined.");
    }
    BaseLayer baseLayer = new BaseLayer(layerId);
    for (int i = 0; i < nNodes; i++) {
        Node node = createHiddenNode(layerId + ":" + i);
        node.setBias(bias[i]);
        baseLayer.addNode(node);
    }
    return baseLayer;
}

/**
 * Create the output layer.
 */
public Layer createOutputLayer(int layerId, int nNodes, double[] bias) {
    if (bias.length != nNodes) {
        throw new RuntimeException("Each node should have bias defined.");
    }
    BaseLayer baseLayer = new BaseLayer(layerId);
    for (int i = 0; i < nNodes; i++) {
        Node node = createOutputNode(layerId + ":" + i);
        node.setBias(bias[i]);
        baseLayer.addNode(node);
    }
    return baseLayer;
}
```
BaseNN (operational side): excerpts from the general neural-network base class:
```java
/**
 * Train on one sample
 * @param tX the input values
 * @param tY the expected output values
 */
public void train(double[] tX, double[] tY) {
    double lastError = 0.0;
    int i = 0;
    while (true) { // keep refining the classifier's accuracy on this sample
        i++;

        // Evaluate the sample
        double[] y = classify(tX);
        double err = error(tY, y);

        if (Double.isInfinite(err) || Double.isNaN(err)) {
            // Couldn't even evaluate the error. Stop.
            throw new RuntimeException(
                    "Training failed. Couldn't evaluate the error: " + err
                            + ". Try some other NN configuration, parameters.");
        }

        double convergence = Math.abs(err - lastError);

        if (err <= ERROR_THRESHOLD) {
            // Good enough: the error fell below the threshold.
            // No need to adjust weights for this sample.
            lastError = err;
            if (verbose) {
                System.out.print("Error Threshold: " + ERROR_THRESHOLD);
                System.out.print(" | Error Achieved: " + err);
                System.out.print(" | Number of Iterations: " + i);
                System.out.println(" | Absolute convergence: " + convergence);
            }
            break;
        }

        if (convergence <= CONVERGENCE_THRESHOLD) {
            // We made almost no progress: the error stopped improving. Stop.
            if (verbose) {
                System.out.print("Error Threshold: " + ERROR_THRESHOLD);
                System.out.print(" | Error Achieved: " + err);
                System.out.print(" | Number of Iterations: " + i);
                System.out.println(" | Absolute convergence: " + convergence);
            }
            break;
        }

        lastError = err;

        // Set the expected values so that we can determine the error
        outputLayer.setExpectedOutputValues(tY);

        /*
         * Calculate weight adjustments in the whole network
         */
        // adjust the output-layer weights
        outputLayer.calculateWeightAdjustments();
        for (Layer hLayer : hiddenLayers) {
            // layer order doesn't matter because we will update weights later;
            // adjust the hidden-layer weights
            hLayer.calculateWeightAdjustments();
        }

        /*
         * Update Weights
         */
        outputLayer.updateWeights();
        for (Layer hLayer : hiddenLayers) {
            // layer order doesn't matter.
            hLayer.updateWeights();
        }
    }
}
```
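The essence of this loop (evaluate, compare the error, adjust the weights, repeat) can be sketched with a single sigmoid neuron trained by gradient descent. This is my own simplification for illustration, not the book's full backpropagation, but the two stopping conditions mirror those in BaseNN.train(): error below a threshold, or error no longer improving.

```java
public class TinyTrainer {
    static final double ERROR_THRESHOLD = 1e-4;
    static final double CONVERGENCE_THRESHOLD = 1e-12;
    static final double LEARNING_RATE = 0.5;

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    /** Train w and bias so the neuron maps input x to target; returns the final output. */
    public static double trainOne(double x, double target) {
        double w = 0.5, bias = 0.0, lastError = Double.MAX_VALUE, y = 0.0;
        for (int i = 0; i < 1_000_000; i++) {           // hard cap as a safety net
            y = sigmoid(w * x + bias);                  // evaluate the sample
            double err = 0.5 * (target - y) * (target - y); // squared error
            if (err <= ERROR_THRESHOLD) break;          // good enough
            if (Math.abs(err - lastError) <= CONVERGENCE_THRESHOLD) break; // stalled
            lastError = err;
            double delta = (target - y) * y * (1.0 - y); // sigmoid gradient term
            w += LEARNING_RATE * delta * x;              // adjust the weight
            bias += LEARNING_RATE * delta;               // adjust the bias
        }
        return y;
    }
}
```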
The network keeps computing an output, comparing it against the expected value, adjusting the synapse weights, computing again, comparing again... terminating once the error is small enough or stops going down. Algorithms of the Intelligent Web does not go deeper into neural-network fundamentals, and I have no need for that here, so this is where we stop. It is, after all, an introductory book; for the academic side you have to read the papers.