Goals of this section:
- Use binary classification to separate legitimate email from spam.
- Classify email into finer-grained categories.
5.3.1 Naive Bayes Classification
This is the same Bayes classifier used in Chapter 2 — even the test data is the same.
The main program:
```java
package com.hankcs;

import iweb2.ch5.usecase.email.EmailClassifier;
import iweb2.ch5.usecase.email.data.Email;
import iweb2.ch5.usecase.email.data.EmailData;
import iweb2.ch5.usecase.email.data.EmailDataset;

public class ch5_1_EmailClass {
    public static void main(String[] args) throws Exception {
        // Create and train the classifier: load the document list
        EmailDataset trainEmailDS = EmailData.createTrainingDataset();
        // Consider only the 10 most frequent terms
        EmailClassifier emailFilter = new EmailClassifier(trainEmailDS, 10);
        emailFilter.train();

        // Let's classify some emails from the training set.
        // If we can't get them right then we are in trouble :-)
        Email email = null;
        email = trainEmailDS.findEmailById("biz-04.html");
        emailFilter.classify(email);
        email = trainEmailDS.findEmailById("usa-03.html");
        emailFilter.classify(email);

        // Now, let's classify previously unseen emails from the test dataset
        EmailDataset testEmailDS = EmailData.createTestDataset();
        email = testEmailDS.findEmailById("biz-01.html");
        emailFilter.classify(email);
        email = testEmailDS.findEmailById("sport-01.html");
        emailFilter.classify(email);
        email = testEmailDS.findEmailById("usa-01.html");
        emailFilter.classify(email);
        email = testEmailDS.findEmailById("world-01.html");
        emailFilter.classify(email);
        email = testEmailDS.findEmailById("spam-biz-01.html");
        emailFilter.classify(email);
    }
}
```
Output:
```
P(i,c) = 0.06666666666666667, P(c) = 0.8823529411764706, P(i) = 0.06535947712418301
P(i,c) = 0.05555555555555555, P(c) = 0.11764705882352941, P(i) = 0.06535947712418301
Classified biz-04.html as NOT SPAM
P(i,c) = 0.06666666666666667, P(c) = 0.8823529411764706, P(i) = 0.06535947712418301
P(i,c) = 0.05555555555555555, P(c) = 0.11764705882352941, P(i) = 0.06535947712418301
Classified usa-03.html as NOT SPAM
P(i,c) = 0.13333333333333333, P(c) = 0.8823529411764706, P(i) = 0.12418300653594772
P(i,c) = 0.05555555555555555, P(c) = 0.11764705882352941, P(i) = 0.12418300653594772
Classified biz-01.html as NOT SPAM
P(i,c) = 0.06666666666666667, P(c) = 0.8823529411764706, P(i) = 0.06535947712418301
P(i,c) = 0.05555555555555555, P(c) = 0.11764705882352941, P(i) = 0.06535947712418301
Classified sport-01.html as NOT SPAM
P(i,c) = 0.05555555555555555, P(c) = 0.8823529411764706, P(i) = 0.055555555555555546
P(i,c) = 0.05555555555555555, P(c) = 0.11764705882352941, P(i) = 0.055555555555555546
Classified usa-01.html as NOT SPAM
P(i,c) = 0.2, P(c) = 0.8823529411764706, P(i) = 0.18300653594771243
P(i,c) = 0.05555555555555555, P(c) = 0.11764705882352941, P(i) = 0.18300653594771243
Classified world-01.html as NOT SPAM
P(i,c) = 0.05555555555555555, P(c) = 0.8823529411764706, P(i) = 0.10784313725490195
P(i,c) = 0.5, P(c) = 0.11764705882352941, P(i) = 0.10784313725490195
Classified spam-biz-01.html as SPAM
```
EmailClassifier obtains its training set through the getTrainingSet method of an EmailDataset instance. After loading these email instances, the classifier trains itself, learning how to map concepts to instances based on the assignments in the training set. EmailClassifier does not use the entire email during training; it uses only a single attribute, whose value is generated when the EmailInstance is constructed:
```java
/**
 * Create an instance from an email.
 * @param emailCategory
 * @param email
 * @param topNTerms
 */
public EmailInstance(String emailCategory, Email email, int topNTerms) {
    super();
    this.id = email.getId();
    // The email category is our concept/class
    this.setConcept(new BaseConcept(emailCategory));
    /**
     * TODO: 5.3 -- Considering more attributes as part of the EmailInstance
     *
     * -- Separate "subject" and "body"
     * -- timestamp
     * -- "from"
     * -- "to"
     * -- "to" cardinality
     */
    // Extract the top N terms from the email content and subject:
    // concatenate the subject and the body of the email
    String text = email.getSubject() + " " + email.getTextBody();
    // Find the N most frequent terms
    Content content = new Content(email.getId(), text, topNTerms);
    Map<String, Integer> tfMap = content.getTFMap();

    attributes = new StringAttribute[1];
    String attrName = "Email_Text_Attribute";
    String attrValue = "";
    for (Map.Entry<String, Integer> tfEntry : tfMap.entrySet()) {
        attrValue = attrValue + " " + tfEntry.getKey();
    }
    // Create the single attribute: the N most frequent terms joined together
    attributes[0] = new StringAttribute(attrName, attrValue);
}
```
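The Content class above comes from the book's codebase; roughly, it tokenizes the text and keeps the N most frequent terms. A minimal sketch of that idea, assuming plain whitespace tokenization (the real implementation also does HTML stripping and other cleanup):

```java
import java.util.*;
import java.util.stream.Collectors;

public class TopTerms {
    // Return the topN most frequent terms of the text, highest count first.
    public static Map<String, Integer> tfMap(String text, int topN) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);  // increment term frequency
            }
        }
        // Keep only the topN entries, preserving descending-count order
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(topN)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        System.out.println(tfMap("buy drugs now buy cheap drugs buy", 2));
        // prints {buy=3, drugs=2}
    }
}
```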
Details of the Naive Bayes Classifier
The classifier learns the associations between instances and concepts from the training instances, and then assigns the appropriate concept to a given instance. The Bayes algorithm was already covered in section 2.4 of Algorithms of the Intelligent Web ("Improving search results based on user clicks"), so the annotated code is given directly here; see 2.4 for the detailed explanation.
```java
package iweb2.ch5.classification.bayes;

import iweb2.ch5.classification.core.TrainingSet;
import iweb2.ch5.classification.core.intf.Classifier;
import iweb2.ch5.ontology.core.AttributeValue;
import iweb2.ch5.ontology.intf.Attribute;
import iweb2.ch5.ontology.intf.Concept;
import iweb2.ch5.ontology.intf.Instance;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/**
 * A basic implementation of the Naive Bayes algorithm.
 * <p/>
 * The emphasis is on teaching the algorithm, not optimizing its performance.
 *
 * @author babis
 */
public class NaiveBayes implements Classifier {

    /**
     * You can use the NaiveBayes classifier on many occasions,
     * so let's give it a name to identify each instance of the classifier.
     */
    private String name;

    /**
     * Every classifier needs a training set. Both the name of the classifier
     * and its training set are intentionally set during construction.
     * <p/>
     * Once you have created an instance of the NaiveBayes classifier you
     * cannot replace its TrainingSet, but you can always get a reference
     * to it and add instances.
     */
    protected TrainingSet tSet;

    /**
     * The instance counts for each concept in the training set; the prior
     * probabilities are obtained from these counts by a simple division.
     */
    protected Map<Concept, Double> conceptPriors;

    /**
     * The conditional probabilities, i.e. the probability of observing
     * concept X given instance Y (in the Chapter 2 example: the probability
     * that user A wants link X when issuing query Q).
     * This structure contains the fundamental calculation elements of
     * the Naive Bayes method.
     */
    protected Map<Concept, Map<Attribute, AttributeValue>> p;

    /**
     * The attribute names that should be considered for training.
     */
    protected ArrayList<String> attributeList;

    /**
     * An auxiliary variable.
     */
    protected boolean verbose = false;

    /**
     * The only constructor for this classifier takes a name and
     * a training set as arguments.
     *
     * @param name the name of the classifier
     * @param set  the training set for this classifier
     */
    public NaiveBayes(String name, TrainingSet set) {
        this.name = name;
        tSet = set;
        conceptPriors = new HashMap<Concept, Double>(tSet.getNumberOfConcepts());
        verbose = false;
    }

    /**
     * Returns the concept with the highest posterior probability as the
     * classification result for the given instance.
     */
    public Concept classify(Instance instance) {
        Concept bestConcept = null;
        double bestP = 0.0;
        if (tSet == null || tSet.getConceptSet().size() == 0) {
            throw new RuntimeException("You have to train classifier first.");
        }
        if (verbose) {
            System.out.println("\n*** Classifying instance: " + instance.toString() + "\n");
        }
        for (Concept c : tSet.getConceptSet()) {
            double p = getProbability(c, instance);
            if (verbose) {
                System.out.printf("P(%s|%s) = %.15f\n", c.getName(), instance.toString(), p);
            }
            if (p >= bestP) {
                bestConcept = c;
                bestP = p;
            }
        }
        return bestConcept;
    }

    /**
     * Training simply sets the probability for each concept.
     */
    public boolean train() {
        long t0 = System.currentTimeMillis();
        boolean hasTrained = false;
        // Make sure that at least one attribute is available
        if (attributeList == null || attributeList.size() == 0) {
            System.out.print("Can't train the classifier without specifying the attributes for training!");
            System.out.print("Use the method --> trainOnAttribute(Attribute a)");
        } else {
            calculateConceptPriors();
            calculateConditionalProbabilities();
            hasTrained = true;
        }
        if (verbose) {
            System.out.print(" Naive Bayes training completed in ");
            System.out.println((System.currentTimeMillis() - t0) + " (ms)");
        }
        return hasTrained;
    }

    public void trainOnAttribute(String aName) {
        if (attributeList == null) {
            attributeList = new ArrayList<String>();
        }
        attributeList.add(aName);
    }

    /**
     * Computes the prior P(c). Strictly speaking these are not the prior
     * probabilities but just the counts. However, we want to reuse these
     * counts, and each prior can be obtained by simply dividing the count
     * by the total number of instances in the training set.
     */
    private void calculateConceptPriors() {
        for (Concept c : tSet.getConceptSet()) {
            // Count how often each concept appears in the training set
            int totalConceptCount = 0;
            for (Instance i : tSet.getInstances().values()) {
                if (i.getConcept().equals(c)) {
                    totalConceptCount++;
                }
            }
            conceptPriors.put(c, new Double(totalConceptCount));
        }
    }

    /**
     * Counts how often each particular attribute value appears within a concept.
     */
    protected void calculateConditionalProbabilities() {
        p = new HashMap<Concept, Map<Attribute, AttributeValue>>();
        for (Instance i : tSet.getInstances().values()) {
            for (Attribute a : i.getAtrributes()) {
                if (a != null && attributeList.contains(a.getName())) {
                    if (p.get(i.getConcept()) == null) {
                        p.put(i.getConcept(), new HashMap<Attribute, AttributeValue>());
                    }
                    Map<Attribute, AttributeValue> aMap = p.get(i.getConcept());
                    AttributeValue aV = aMap.get(a);
                    if (aV == null) {
                        aV = new AttributeValue(a.getValue());
                        aMap.put(a, aV);
                    } else {
                        aV.count();
                    }
                }
            }
        }
    }

    /**
     * Calculates the <I>posterior probability</I> that we deal with concept
     * <CODE>c</CODE> provided that we observed instance <CODE>i</CODE>
     * (in the Chapter 2 example: the probability that the user's real intent
     * is c when issuing query i). This is the application of Bayes' theorem.
     *
     * @param c is a probable concept for instance <CODE>i</CODE>
     * @param i is the observed instance
     * @return the posterior probability of <CODE>c</CODE> given instance <CODE>i</CODE>
     */
    public double getProbability(Concept c, Instance i) {
        double cP = 0;
        // If the concept appears in the training set,
        // the posterior can be computed with Bayes' formula
        if (tSet.getConceptSet().contains(c)) {
            cP = (getProbability(i, c) * getProbability(c)) / getProbability(i);
            System.out.println("P(i,c) = " + getProbability(i, c)
                    + ", P(c) = " + getProbability(c)
                    + ", P(i) = " + getProbability(i));
        } else {
            // We have never seen this concept before,
            // so assign it a "reasonable" value
            cP = 1 / (tSet.getNumberOfConcepts() + 1.0);
        }
        return cP;
    }

    /**
     * Calculates the denominator of Bayes' theorem.
     *
     * @param i
     * @return the probability of observing <CODE>Instance</CODE> i
     */
    public double getProbability(Instance i) {
        double cP = 0;
        for (Concept c : getTset().getConceptSet()) {
            cP += getProbability(i, c) * getProbability(c);
        }
        return (cP == 0) ? (double) 1 / tSet.getSize() : cP;
    }

    /**
     * The prior probability of concept <CODE>c</CODE>, obtained from its
     * count in the training set.
     */
    public double getProbability(Concept c) {
        Double trInstanceCount = conceptPriors.get(c);
        if (trInstanceCount == null) {
            trInstanceCount = 0.0;
        }
        return trInstanceCount / tSet.getSize();
    }

    /**
     * The likelihood P(i|c): the probability of observing instance
     * <CODE>i</CODE> given that the user wants concept <CODE>c</CODE>.
     */
    public double getProbability(Instance i, Concept c) {
        double cP = 1;
        // Iterate over every attribute of the instance
        for (Attribute a : i.getAtrributes()) {
            // Only use the attributes selected for training
            if (a != null && attributeList.contains(a.getName())) {
                Map<Attribute, AttributeValue> aMap = p.get(c);
                AttributeValue aV = aMap.get(a);
                if (aV == null) {
                    // The specific attribute value is not present for the
                    // current concept. Can you justify the following estimate?
                    // Can you think of a better choice?
                    cP *= ((double) 1 / (tSet.getSize() + 1));
                } else {
                    // The likelihood of instance i is the product of the
                    // probabilities of observing its attribute values; each
                    // probability = the value's count / the concept's instance count
                    cP *= (aV.getCount() / conceptPriors.get(c));
                }
            }
        }
        return (cP == 1) ? (double) 1 / tSet.getNumberOfConcepts() : cP;
    }

    /**
     * @return the name
     */
    public String getName() {
        return name;
    }

    /**
     * @return the training set
     */
    public TrainingSet getTset() {
        return tSet;
    }
}
```
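As a sanity check, the posterior computed by getProbability(Concept, Instance) can be reproduced by hand from the values printed for spam-biz-01.html in the run above, via P(c|i) = P(i|c)·P(c) / P(i):

```java
public class BayesCheck {
    // Bayes' theorem: posterior = likelihood * prior / evidence
    public static double posterior(double pIgivenC, double pC, double pI) {
        return pIgivenC * pC / pI;
    }

    public static void main(String[] args) {
        // Values printed for spam-biz-01.html under the SPAM concept:
        // P(i,c) = 0.5, P(c) = 0.1176..., P(i) = 0.1078...
        double spam = posterior(0.5, 0.11764705882352941, 0.10784313725490195);
        // And under the NOT SPAM concept: P(i,c) = 0.0555..., P(c) = 0.8823...
        double notSpam = posterior(0.05555555555555555, 0.8823529411764706, 0.10784313725490195);
        System.out.println(spam);     // ~0.545
        System.out.println(notSpam);  // ~0.455 -> SPAM wins
    }
}
```

The two posteriors sum to 1, and the larger one (≈0.545 for SPAM) decides the classification.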
A General-Purpose Email Classifier
EmailClassifier extends NaiveBayes with several refinements:
First, it ensures the training set has been loaded:
```java
@Override
public boolean train() {
    // Make sure the training data has been loaded
    if (emailDataset.getSize() == 0) {
        System.out.println("Can't train classifier - training dataset is empty.");
        return false;
    }
    for (String attrName : getTset().getAttributeNameSet()) {
        trainOnAttribute(attrName);
    }
    super.train();
    return true;
}
```
Second, attribute-value matching is fuzzified. Since there is only one attribute and every email's attribute value is different, the classifier has to find the attribute value "most similar" to a and increment that value's count. Similarity here is measured as intersection over union (the Jaccard coefficient):
```java
/**
 * Finds, among the attribute values in aMap, the value of the attribute
 * whose terms overlap most with those of attribute a.
 * Because the attribute is a subject-plus-keywords string that differs
 * for every email, this method is used to match two attribute values
 * approximately instead of exactly.
 *
 * @param aMap
 * @param a
 * @return
 */
private AttributeValue findBestAttributeValue(Map<Attribute, AttributeValue> aMap, Attribute a) {
    JaccardCoefficient jaccardCoeff = new JaccardCoefficient();
    String aValue = (String) a.getValue();
    String[] aTerms = aValue.split(" ");
    Attribute bestMatch = null;
    double bestSim = 0.0;
    /*
     * Here we only check representative attribute values. Other attribute
     * values associated with representative attribute values will be
     * ignored by this implementation.
     */
    for (Attribute attr : aMap.keySet()) {
        String attrValue = (String) attr.getValue();
        String[] attrTerms = attrValue.split(" ");
        double sim = jaccardCoeff.similarity(aTerms, attrTerms);
        if (sim > jaccardThreshold && sim > bestSim) {
            bestSim = sim;
            bestMatch = attr;
        }
    }
    return aMap.get(bestMatch);
}
```
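JaccardCoefficient comes from the book's codebase; the similarity it computes over two term arrays is simply |A∩B| / |A∪B|. A minimal re-implementation for illustration (a sketch, not the library's actual class):

```java
import java.util.*;

public class Jaccard {
    // Jaccard similarity of two term arrays: |intersection| / |union|
    public static double similarity(String[] x, String[] y) {
        Set<String> a = new HashSet<>(Arrays.asList(x));
        Set<String> b = new HashSet<>(Arrays.asList(y));
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);                 // A ∩ B
        Set<String> union = new HashSet<>(a);
        union.addAll(b);                           // A ∪ B
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Two emails sharing 2 of 4 distinct terms -> 0.5
        System.out.println(similarity(
                new String[]{"cheap", "drugs", "online"},
                new String[]{"buy", "drugs", "online"}));
    }
}
```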
The same fuzzification is applied when computing "the probability of observing instance i given that the user wants concept c":
```java
@Override
public double getProbability(Instance i, Concept c) {
    double cP = 1;
    for (Attribute a : i.getAtrributes()) {
        if (a != null && attributeList.contains(a.getName())) {
            Map<Attribute, AttributeValue> aMap = p.get(c);
            AttributeValue bestAttributeValue = findBestAttributeValue(aMap, a);
            if (bestAttributeValue == null) {
                // The specific attribute value is not present for the current
                // concept. Can you justify the following estimate?
                // Can you think of a better choice?
                cP *= ((double) 1 / (tSet.getSize() + 1));
            } else {
                cP *= (bestAttributeValue.getCount() / conceptPriors.get(c));
            }
        }
    }
    return (cP == 1) ? (double) 1 / tSet.getNumberOfConcepts() : cP;
}
```
The conditional-probability calculation also drops one loop (there is only one attribute) and fuzzifies the matching:
```java
@Override
protected void calculateConditionalProbabilities() {
    p = new HashMap<Concept, Map<Attribute, AttributeValue>>();
    for (Instance i : tSet.getInstances().values()) {
        // In this specific implementation we have exactly one attribute,
        // so the loop over the attributes is unnecessary.
        // In general, you need a loop over the attributes.
        Attribute a = i.getAtrributes()[0];
        Map<Attribute, AttributeValue> aMap = p.get(i.getConcept());
        if (aMap == null) {
            aMap = new HashMap<Attribute, AttributeValue>();
            p.put(i.getConcept(), aMap);
        }
        /**
         * TODO: 5.3
         */
        AttributeValue bestAttributeValue = findBestAttributeValue(aMap, a);
        if (bestAttributeValue != null) {
            bestAttributeValue.count();
        } else {
            AttributeValue aV = new AttributeValue(a.getValue());
            // Register this attribute as a representative attribute
            aMap.put(a, aV);
        }
    }
}
```
5.3.2 Rule-Based Classification
A rule is an if-then programming clause. For example:
```
when
    Email( $s : subject )
    eval( classificationResult.isSimilar($s, "viagra") )
then
    classificationResult.setSpamEmail(true);
```
This rule says: if the subject contains "viagra", classify the email as spam. Such rules are usually entered manually or semi-manually.
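Stripped of the DRL syntax, the rule above is nothing more than a plain if-then statement. A hypothetical Java equivalent (isSimilar here is a simple substring stand-in for the book's helper method):

```java
public class ViagraRule {
    // Stand-in for ClassificationResult.isSimilar(...): naive substring test
    static boolean isSimilar(String subject, String term) {
        return subject.toLowerCase().contains(term);
    }

    // The rule: "when" the subject matches, "then" mark the email as spam
    public static boolean classify(String subject) {
        boolean isSpam = false;
        if (isSimilar(subject, "viagra")) {   // the "when" part
            isSpam = true;                    // the "then" part
        }
        return isSpam;
    }

    public static void main(String[] args) {
        System.out.println(classify("Cheap viagra, limited offer"));  // true
        System.out.println(classify("Quarterly sales report"));       // false
    }
}
```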
The Drools Rule Engine
Drools is a Java-based rule engine, an open-source project under JBoss. It consists of two main modules: a pattern-matching module and an agenda module. The pattern-matching module identifies which rules are triggered by the facts; once identified, those rules are placed on the agenda. Drools implements and extends the Rete algorithm, though that algorithm is beyond the scope of Algorithms of the Intelligent Web.
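The two-module design can be pictured with a deliberately simplified toy engine (this is only a sketch of the concept, not Drools' actual Rete-based implementation): pattern matching collects the rules whose condition the fact satisfies, and the agenda orders and fires them.

```java
import java.util.*;
import java.util.function.Predicate;

public class ToyEngine {
    public static class Rule {
        final String name;
        final int salience;
        final Predicate<String> when;
        public Rule(String name, int salience, Predicate<String> when) {
            this.name = name;
            this.salience = salience;
            this.when = when;
        }
    }

    /** Returns the names of the fired rules, in firing order. */
    public static List<String> run(String fact, List<Rule> rules) {
        // Pattern matching: find the rules activated by the fact
        List<Rule> agenda = new ArrayList<>();
        for (Rule r : rules) {
            if (r.when.test(fact)) {
                agenda.add(r);
            }
        }
        // Agenda: order the activated rules (higher salience fires first)
        agenda.sort(Comparator.comparingInt((Rule r) -> r.salience).reversed());
        List<String> fired = new ArrayList<>();
        for (Rule r : agenda) {
            fired.add(r.name);  // here the "then" part would execute
        }
        return fired;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
                new Rule("drugs in subject", 100, s -> s.contains("drugs")),
                new Rule("known sender", 10, s -> true));
        System.out.println(run("buy drugs", rules));
    }
}
```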
Drools provides a scripting language (DRL):
```
package demo;

import iweb2.ch5.usecase.email.data.Email;
import iweb2.ch5.classification.rules.ClassificationResult;

global ClassificationResult classificationResult;

rule "Tests for viagra in subject"
    salience 100
when
    Email( $s : subject )
    eval( classificationResult.isSimilar($s, "viagra") )
then
    classificationResult.setSpamEmail(true);
end

rule "Tests for 'drugs' in subject"
    salience 100
when
    Email( $s : subject )
    eval( classificationResult.isSimilar($s, "drugs") )
then
    classificationResult.setSpamEmail(true);
end
```
The script is loaded and run by a main program:
```java
package com.hankcs;

import iweb2.ch5.classification.rules.EmailRuleClassifier;
import iweb2.ch5.usecase.email.data.EmailData;
import iweb2.ch5.usecase.email.data.EmailDataset;

public class ch5_2_EmailRules {
    public static void main(String[] args) throws Exception {
        // Load the email data
        EmailDataset ds = EmailData.createTestDataset();

        // Create a rule-based email classifier; we expect one spam email
        EmailRuleClassifier classifier =
                new EmailRuleClassifier("c:/iWeb2/data/ch05/spamRules.drl");
        // "Train" the classifier. In a rule-based system the rules are the
        // knowledge, so training merely loads the rules from the file.
        classifier.train();
        // Pass the dataset and a descriptive message to the classifier
        classifier.run(ds, "Expecting one spam email. :-(");

        // There should be no spam emails.
        // The rule that checks for a known email address should win over
        // the rules that detect spam content.
        //
        // classifier = new EmailRuleClassifier("c:/iWeb2/data/ch05/spamRulesWithConflict.drl");
        // classifier.train();
        // classifier.run(ds, "No spam emails here. Hurray!\n");
    }
}
```
Output:
```
Expecting one spam email. :-(
__________________________________________________
Classifying email: world-01.html ...
Rules classified email: world-01.html as: NOT-SPAM
Classifying email: spam-biz-01.html ...
Invoked ClassificationResult.setSpamEmail(true)
Rules classified email: spam-biz-01.html as: SPAM
Classifying email: sport-01.html ...
Rules classified email: sport-01.html as: NOT-SPAM
Classifying email: usa-01.html ...
Rules classified email: usa-01.html as: NOT-SPAM
Classifying email: biz-01.html ...
Rules classified email: biz-01.html as: NOT-SPAM
__________________________________________________
```
The spam email represented by spam-biz-01.html triggered the classifier because the word "drugs" in its subject matched the spam rule; once a rule's condition is satisfied, the rule can fire.
Code Details
The rule engine is built on Drools and bridges DRL scripts with Java; the runtime compiler used here is an impressive showcase of Java's flexibility. The implementation:
```java
package iweb2.ch5.classification.rules;

import iweb2.ch5.usecase.email.data.Email;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Properties;

import org.drools.RuleBase;
import org.drools.RuleBaseFactory;
import org.drools.WorkingMemory;
import org.drools.compiler.PackageBuilder;
import org.drools.compiler.PackageBuilderConfiguration;
import org.drools.rule.Package;

public class RuleEngine {

    private RuleBase rules;

    public RuleEngine(String rulesFile) throws RuleEngineException {
        try {
            // Read the rules file
            Reader source = new InputStreamReader(
                    new BufferedInputStream(new FileInputStream(rulesFile)));

            // Switch to the JANINO runtime compiler
            Properties properties = new Properties();
            properties.setProperty("drools.dialect.java.compiler", "JANINO");
            PackageBuilderConfiguration cfg = new PackageBuilderConfiguration(properties);

            // Build a rule package
            PackageBuilder builder = new PackageBuilder(cfg);

            // Parse and compile the rules
            builder.addPackageFromDrl(source);

            // The package containing our custom rules
            Package pkg = builder.getPackage();

            // The runtime container for the rules
            rules = RuleBaseFactory.newRuleBase();
            rules.addPackage(pkg);
        } catch (Exception e) {
            throw new RuleEngineException(
                    "Could not load/compile rules from file: '" + rulesFile + "' ", e);
        }
    }

    public void executeRules(ClassificationResult classificationResult, Email email) {
        WorkingMemory workingMemory = rules.newStatefulSession();
        // Make a Java object available inside the DRL script!
        workingMemory.setGlobal("classificationResult", classificationResult);
        workingMemory.insert(email);
        workingMemory.fireAllRules();
    }
}
```
Conflict Resolution
Two rules can conflict. For example, one rule may classify email E as spam while another says it is not; or rule A modifies a fact, which triggers rule B, whose modification of the fact re-triggers A, producing an infinite loop.
Drools resolves such conflicts through rule attributes.
A simple spam-filtering rule set (with conflicts):
```
package demo;

import iweb2.ch5.usecase.email.data.Email;
import iweb2.ch5.classification.rules.ClassificationResult;

global ClassificationResult classificationResult;

rule "Rule 1: Tests for viagra in subject"
    salience 100
    # no-loop true
    # lock-on-active true
when
    email: Email( $s : subject )
    eval( classificationResult.isSimilar($s, "viagra") )
then
    email.setRuleFired(1);
    classificationResult.setSpamEmail(true);
    # update(email);
end

rule "Rule 2: Tests for 'drugs' in subject"
    # no-loop true
    # lock-on-active true
    salience 100
when
    email: Email( $s : subject )
    eval( classificationResult.isSimilar($s, "drugs") )
then
    email.setRuleFired(2);
    # email.setSubject("processed");
    # email.setFrom("processed");
    classificationResult.setSpamEmail(true);
    # update(email);
end

rule "Rule 3: Tests for known sender address"
    # no-loop true
    # lock-on-active true
    salience 10
when
    email: Email( $sender : from )
    eval( classificationResult.isSimilar($sender, "friend@senderhost") )
then
    email.setRuleFired(3);
    classificationResult.setSpamEmail(false);
    # update(email);
end
```
salience indicates a rule's priority: rules with higher salience fire first. Rule 3 has the lowest salience, so it fires last, and because it writes the classification result after the others, its decision overrides theirs; in that sense the lowest-salience rule gets the final word here.
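The effect can be traced by hand for spam-biz-01.html: Rules 2 and 3 both match, the agenda fires the higher-salience Rule 2 first (setSpamEmail(true)), then Rule 3 (setSpamEmail(false)), and the last write wins. A minimal sketch of that firing order:

```java
public class SalienceDemo {
    // Simulates the two conflicting rules firing in salience order.
    public static boolean classify(boolean subjectMatchesDrugs, boolean knownSender) {
        boolean spam = false;
        // Rule 2 (salience 100) fires first...
        if (subjectMatchesDrugs) {
            spam = true;
        }
        // ...then Rule 3 (salience 10); firing last, it overrides the earlier result.
        if (knownSender) {
            spam = false;
        }
        return spam;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, true));   // false -> NOT-SPAM, Rule 3 wins
        System.out.println(classify(true, false));  // true  -> SPAM
    }
}
```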
Output:
```
Classifying email: world-01.html ...
Rules classified email: world-01.html as: NOT-SPAM
Classifying email: spam-biz-01.html ...
Invoked Email.setRuleFired(2), current value ruleFired=0, emailId: spam-biz-01.html
Invoked ClassificationResult.setSpamEmail(true)
Invoked Email.setRuleFired(3), current value ruleFired=2, emailId: spam-biz-01.html
Invoked ClassificationResult.setSpamEmail(false)
Rules classified email: spam-biz-01.html as: NOT-SPAM
Classifying email: sport-01.html ...
Rules classified email: sport-01.html as: NOT-SPAM
Classifying email: usa-01.html ...
Rules classified email: usa-01.html as: NOT-SPAM
Classifying email: biz-01.html ...
Rules classified email: biz-01.html as: NOT-SPAM
__________________________________________________
```