If we rely solely on the index to produce retrieval results, they are easily polluted by spam pages. Uncomment the line in the Section 2.1 source code that adds the spam page, and the spam page comes out ranked first:
Search results using Lucene index scores:
Query: armstrong

Document Title: Cheap medicine -- low interest loans
Document URL: file:/c:/iWeb2/data/ch02/spam-01.html --> Relevance Score: 0.591894507408142
_______________________________________________________________________
Document Title: Lance Armstrong meets goal in painful marathon debut
Document URL: file:/c:/iWeb2/data/ch02/sport-01.html --> Relevance Score: 0.370989531278610
_______________________________________________________________________
Document Title: New York 'tour' Lance's toughest
Document URL: file:/c:/iWeb2/data/ch02/sport-03.html --> Relevance Score: 0.291807949542999
_______________________________________________________________________
Document Title: New York City Marathon
Document URL: file:/c:/iWeb2/data/ch02/sport-02.html --> Relevance Score: 0.210920616984367
_______________________________________________________________________
Spam pages climb the index ranking by stuffing keywords, but very few legitimate pages link to them. We can therefore exploit the directed graph of links between pages to measure how "non-spammy" a page is; that measure is a rough, intuitive way to think about PageRank.
To compute PageRank values we must first compute the H matrix.
In a web of n pages, define the n x n link matrix H as follows: if page i has outdeg(i) outgoing links and one of them points to page j, then H[i][j] = 1 / outdeg(i); otherwise H[i][j] = 0.
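As a concrete illustration, here is a minimal sketch of how such an H matrix could be built from an adjacency list. This is not part of the book's code; the class name HMatrixDemo and the outLinks array are made up for the example.

/**
 * Minimal sketch: build the row-stochastic link matrix H from an adjacency list.
 * HMatrixDemo and outLinks are illustrative names, not part of the book's code.
 */
public class HMatrixDemo {

    public static double[][] buildH(int[][] outLinks, int n) {
        double[][] h = new double[n][n];
        for (int i = 0; i < n; i++) {
            int outDegree = outLinks[i].length;      // number of outgoing links of page i
            for (int j : outLinks[i]) {
                h[i][j] = 1.0 / outDegree;           // each outlink gets an equal share
            }
            // a page with no outlinks (a dangling node) leaves its row all zeros
        }
        return h;
    }

    public static void main(String[] args) {
        // 3 pages: page 0 links to 1 and 2, page 1 links to 2, page 2 has no outlinks
        int[][] outLinks = { {1, 2}, {2}, {} };
        double[][] h = buildH(outLinks, 3);
        for (double[] row : h) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}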
Computing PageRank is an iterative process:

p(k+1) = p(k) * H,    with the initial value p(0)[i] = 1 / n for every page i
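A bare-bones version of this power iteration, written against the buildH sketch above rather than the book's PageRank class, might look like the following. As the next paragraph explains, iterating on the raw H matrix behaves poorly, so this only illustrates the update rule itself.

/**
 * Illustrative power iteration p(k+1) = p(k) * H (could be added to HMatrixDemo).
 * Without the dangling-node and teleportation fixes discussed below, probability
 * mass can leak out through pages that have no outlinks.
 */
public static double[] iterate(double[][] h, int steps) {
    int n = h.length;
    double[] p = new double[n];
    java.util.Arrays.fill(p, 1.0 / n);               // p(0)[i] = 1 / n
    for (int s = 0; s < steps; s++) {
        double[] next = new double[n];
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < n; i++) {
                next[j] += p[i] * h[i][j];           // (p * H)[j]
            }
        }
        p = next;
    }
    return p;
}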
Computing the H matrix is simple, but H cannot be used directly to compute PageRank: its convergence behaviour is poor, so many iterations would be needed before the PageRank vector settles down. Moreover, H ignores two realities of how people actually browse:
- When a page has no outgoing links, the user will type some related URL into the browser's address bar, for example sport-03.html. In that case the contribution of such a dangling node is no longer 0; it should be 1 / n, where n is the total number of pages.
- After browsing for a while the user gets bored and simply closes the browser, or jumps to a completely unrelated site such as www.hankcs.com. To model this, define a value alpha for the probability that the user does not make such a random jump; the contribution of random jumps (teleportation) is then (1 - alpha) * 1 / n. Both corrections are combined in the sketch after this list.
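Putting the two corrections together, here is a hedged sketch of how H could be turned into the Google matrix G. It assumes the common convention that a dangling row is replaced by a uniform 1/n distribution before being scaled by alpha; the book's own code expresses the same idea through getDanglingNodeMatrix(), and its details may differ.

/**
 * Illustrative construction of G from H (could be added to HMatrixDemo):
 * - a dangling row (all zeros) is treated as linking to every page with probability 1/n
 * - every entry then receives the teleportation term (1 - alpha) / n
 * This mirrors the idea behind the book's getDanglingNodeMatrix(), not its exact code.
 */
public static double[][] buildG(double[][] h, double alpha) {
    int n = h.length;
    double[][] g = new double[n][n];
    for (int i = 0; i < n; i++) {
        double rowSum = 0;
        for (int j = 0; j < n; j++) {
            rowSum += h[i][j];
        }
        boolean dangling = (rowSum == 0);
        for (int j = 0; j < n; j++) {
            double linkPart = dangling ? 1.0 / n : h[i][j];
            g[i][j] = alpha * linkPart + (1 - alpha) / n;
        }
    }
    return g;
}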
In the book's code, these two realities are used to define a G matrix that replaces H:
/**
 * Compute the PageRank vector.
 *
 * @param alpha   probability that the user does not jump randomly (i.e. follows a link)
 * @param epsilon convergence threshold
 */
public void findPageRank(double alpha, double epsilon) {

    // A counter for our iterations
    int k = 0;

    // auxiliary variable
    PageRankMatrixH matrixH = getH();

    // The H matrix has size nxn and the PageRank vector has size n
    int n = matrixH.getSize();

    // auxiliary variable: the inverse of n
    double inv_n = (double) 1 / n;

    // This is the actual nxn matrix of double values
    double[][] H = matrixH.getMatrix();

    // A dummy variable that holds our error --
    // arbitrarily set to an initial value of 1
    double error = 1;

    // This holds the values of the PageRank vector
    pR = new double[n];

    // This is a copy of the PageRank vector from the previous iteration.
    // The only reason that we need it is for evaluating the error
    double[] tmpPR = new double[n];

    // Set the initial values (ad hoc)
    for (int i = 0; i < n; i++) {
        pR[i] = inv_n;
    }

    /*
     * Book Section 2.3 -- Altering the H matrix: Dangling nodes
     * Get the contribution of the dangling nodes
     */
    double[][] dNodes = getDanglingNodeMatrix();

    /*
     * TODO: 2.5 -- Altering the G matrix: Teleportation (Book Section 2.3)
     *
     * The following code defines the teleportation contribution,
     * i.e. jumping randomly to a page that is not connected with the one
     * that our surfer is currently viewing.
     *
     * Notice that it is the same for all pages. An interesting variation
     * of the algorithm would introduce a "teleportation" contribution that
     * relates the probability of an arbitrary transition to the degree of
     * interest that a user has for the content of a page.
     *
     * Exercise: Could that be done? If so, how? What problems can you see
     * with that variation?
     */
    double tNodes = (1 - alpha) * inv_n;

    // Replace the H matrix with the G matrix
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            H[i][j] = alpha * H[i][j] + dNodes[i][j] + tNodes;
        }
    }

    // Print the G matrix
    StringBuilder txt = new StringBuilder("G Matrix\n\n");
    for (int i = 0; i < n; i++) {
        double lineTotal = 0;
        for (int j = 0; j < n; j++) {
            txt.append(" ");
            txt.append(H[i][j]);
            lineTotal += H[i][j];
            txt.append(", ");
        }
        txt.append("total:" + lineTotal);
        txt.append("\n");
    }
    System.out.println(txt.toString());

    // Print the initial pR before the iteration starts
    System.out.println("\n Before Iteration, PageRank convergence error: " + error);
    for (int i = 0; i < n; i++) {
        System.out.println("Index: " + i + " --> PageRank: " + pR[i]);
    }

    // Iterate until convergence.
    // We have found the PageRank values if our error is smaller than epsilon
    while (error >= epsilon) {

        // Make a copy of the PageRank vector before we update it
        for (int i = 0; i < n; i++) {
            tmpPR[i] = pR[i];
        }

        double dummy = 0;

        // Now we get the next point in the iteration
        for (int i = 0; i < n; i++) {
            dummy = 0;
            for (int j = 0; j < n; j++) {
                dummy += pR[j] * H[j][i];
            }
            pR[i] = dummy;
        }

        // Get the error, so that we can check convergence
        error = norm(pR, tmpPR);

        // Display the progress
        System.out.println("\n Iteration: " + k + ", PageRank convergence error: " + error);
        for (int i = 0; i < n; i++) {
            System.out.println("Index: " + i + " --> PageRank: " + pR[i]);
        }

        // Increase the value of the counter by one
        k++;
    }

    // Report the final values
    List<RelevanceScore> allRankings = new ArrayList<RelevanceScore>();
    for (int i = 0; i < n; i++) {
        String url = matrixH.getIndexMapping().getValue(i);
        RelevanceScore r = new RelevanceScore(url, pR[i]);
        allRankings.add(r);
    }
    RelevanceScore.sort(allRankings);

    System.out.println("\n______________ Calculation Results _______________\n");
    for (RelevanceScore r : allRankings) {
        System.out.printf("Page URL: %-42s --> Rank: %.15f\n", r.getId(), r.getScore());
    }
    System.out.println("\n____________________________________________________\n");
}
Compared with the H matrix, the key change in the G matrix is the line
H[i][j] = alpha * H[i][j] + dNodes[i][j] + tNodes;
The iteration repeats until the difference between the PageRank vectors of two consecutive iterations drops below the predefined parameter epsilon; the result is the final PageRank vector.
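That difference is measured by the norm(pR, tmpPR) call in the loop above, whose implementation is not shown here. A plausible version, and only an assumption since the book may use a different vector norm, is the largest absolute component-wise difference:

/**
 * Hypothetical sketch of the norm() helper used in the convergence test:
 * the largest absolute component-wise difference between the current and
 * previous PageRank vectors. The book's actual implementation may differ.
 */
private double norm(double[] a, double[] b) {
    double max = 0;
    for (int i = 0; i < a.length; i++) {
        double diff = Math.abs(a[i] - b[i]);
        if (diff > max) {
            max = diff;
        }
    }
    return max;
}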
With the PageRank values in hand, we can combine them with the index score discussed earlier and compute a blended score according to some rule, improving the search results:
package com.hankcs;

import iweb2.ch2.shell.FetchAndProcessCrawler;
import iweb2.ch2.shell.LuceneIndexer;
import iweb2.ch2.shell.MySearcher;
import iweb2.ch2.shell.PageRank;

/**
 * @author hankcs
 */
public class ch2_3_LuceneAndPageRank
{
    public static void main(String[] args) throws Exception
    {
        //
        // -- Data (Business news URLs and 3 spam pages)
        //
        FetchAndProcessCrawler crawler = new FetchAndProcessCrawler("C:/iWeb2/data/ch02", 5, 200);
        crawler.setUrls("biz");
        // spam pages
        crawler.addUrl("file:///c:/iWeb2/data/ch02/spam-biz-01.html");
        crawler.addUrl("file:///c:/iWeb2/data/ch02/spam-biz-02.html");
        crawler.addUrl("file:///c:/iWeb2/data/ch02/spam-biz-03.html");
        crawler.run();

        //
        // -- Lucene
        //
        LuceneIndexer luceneIndexer = new LuceneIndexer(crawler.getRootDir());
        // index all pages
        luceneIndexer.run();
        MySearcher oracle = new MySearcher(luceneIndexer.getLuceneDir());

        //
        // -- Lucene Index ranking only
        //
        oracle.search("nvidia", 5);
        oracle.search("economy news", 5);

        //
        // -- PageRank
        //
        PageRank pageRank = new PageRank(crawler.getCrawlData());
        pageRank.setAlpha(0.99);
        pageRank.setEpsilon(0.00000001);
        // build the PageRank vector
        pageRank.build();

        //
        // -- Combined ranking
        //
        oracle.search("nvidia", 5, pageRank);
        oracle.search("economy news", 5, pageRank);
    }
}
The search method now takes an extra PageRank argument:
/**
 * A method that combines the score of an index based search
 * and the score of the PageRank algorithm to achieve better
 * relevance results.
 *
 * @param query           the search query
 * @param numberOfMatches maximum number of matches
 * @param pR              the PageRank values
 * @return the combined search results
 */
public SearchResult[] search(String query, int numberOfMatches, Rank pR) {

    SearchResult[] docResults = search(query, numberOfMatches);

    String url;

    int n = pR.getH().getSize();

    /*
     * TODO: 2.3 -- The PageRank scaling factor m (Book Section 2.3)
     *
     * When the number of pages in your graph is small, the PageRank values
     * need some boosting. As the number of pages increases, m approaches
     * the value 1 quickly because 1/n goes to zero.
     */
    // compute the scaling factor
    double m = 1 - (double) 1 / n;

    // actualNumberOfMatches <= numberOfMatches
    int actualNumberOfMatches = docResults.length;

    for (int i = 0; i < actualNumberOfMatches; i++) {
        url = docResults[i].getUrl();
        // compute the hybrid score
        double hScore = docResults[i].getScore() * Math.pow(pR.getPageRank(url), m);
        // Update the score of the results
        docResults[i].setScore(hScore);
    }

    // sort results by score
    SearchResult.sortByScore(docResults);

    String header = "Search results using combined Lucene scores and page rank scores:";
    boolean showTitle = false;
    printResults(header, "Query: " + query, docResults, showTitle);

    return docResults;
}
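To see why the scaling factor m matters, consider a made-up toy calculation: with only n = 8 pages, m = 1 - 1/8 = 0.875, so a PageRank of 0.05 contributes 0.05^0.875 ≈ 0.073 rather than 0.05, softening the penalty that tiny PageRank values would otherwise inflict on the Lucene score. As n grows, m approaches 1 and the raw PageRank is used almost unchanged. A stand-alone sketch of just the combination rule, with illustrative names:

/**
 * Illustrative stand-alone version of the combined scoring rule used in search().
 * luceneScore and pageRank are assumed to be precomputed for a given URL;
 * CombinedScoreDemo is a made-up name, not part of the book's code.
 */
public class CombinedScoreDemo {

    public static double combinedScore(double luceneScore, double pageRank, int n) {
        double m = 1 - 1.0 / n;                      // scaling factor, approaches 1 as n grows
        return luceneScore * Math.pow(pageRank, m);  // same formula as hScore above
    }

    public static void main(String[] args) {
        // toy numbers: a Lucene score of 0.37 and a PageRank of 0.05 in an 8-page graph
        System.out.println(combinedScore(0.37, 0.05, 8));
    }
}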