
If we rely on the Lucene index alone to rank results, spam pages get in the way. Uncomment the line in the Section 2.1 source that adds the spam pages, and you will find a spam page ranked first:
Search results using Lucene index scores:
Query: armstrong

Document Title: Cheap medicine -- low interest loans
Document URL: file:/c:/iWeb2/data/ch02/spam-01.html  -->  Relevance Score: 0.591894507408142
_______________________________________________________________________
Document Title: Lance Armstrong meets goal in painful marathon debut
Document URL: file:/c:/iWeb2/data/ch02/sport-01.html  -->  Relevance Score: 0.370989531278610
_______________________________________________________________________
Document Title: New York 'tour' Lance's toughest
Document URL: file:/c:/iWeb2/data/ch02/sport-03.html  -->  Relevance Score: 0.291807949542999
_______________________________________________________________________
Document Title: New York City Marathon
Document URL: file:/c:/iWeb2/data/ch02/sport-02.html  -->  Relevance Score: 0.210920616984367
_______________________________________________________________________
Spam pages push up their index ranking by stuffing keywords, but few legitimate pages link to them. So the directed graph of links between pages can be used to measure how "non-spammy" a page is; that measure is a rough, intuitive way to think about PageRank.
To compute PageRank values, we must first compute the H matrix.
In a network of n pages, define the link matrix H (n x n) as follows: if page i has k outlinks and one of them points to page j, then H[i][j] = 1 / k; otherwise H[i][j] = 0.
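Building H is mechanical. The following is only a rough sketch of the idea, assuming a hypothetical adjacency-list input; the class and method names are made up for illustration and are not part of the iWeb2 code:

import java.util.List;
import java.util.Map;

public class HMatrixSketch
{
    /**
     * Hypothetical helper: build the link matrix H from an adjacency list.
     *
     * @param n        total number of pages, indexed 0..n-1
     * @param outlinks outlinks.get(i) lists the pages that page i links to
     */
    public static double[][] buildH(int n, Map<Integer, List<Integer>> outlinks)
    {
        double[][] H = new double[n][n];
        for (int i = 0; i < n; i++)
        {
            List<Integer> links = outlinks.get(i);
            if (links == null || links.isEmpty())
            {
                continue; // page i has no outlinks (a dangling node): its row stays all zeros
            }
            double share = 1.0 / links.size(); // 1/k, where k is the number of outlinks of page i
            for (int j : links)
            {
                H[i][j] = share;
            }
        }
        return H;
    }
}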
Computing PageRank is an iterative process:
p(k+1) = p(k) * H,   with the initial vector p(0) = (1/n, ..., 1/n)
The H matrix is easy to compute, but it cannot be used directly to calculate PageRank: convergence is poor, so the PR values would need a great many iterations to settle. Moreover, H ignores two realities:
-
When a page has no outlinks, the user will type some related URL into the browser's address bar, for example sport-03.html.
In that case the contribution of such a dangling node is no longer 0; it should be 1 / n, where n is the total number of pages (a matrix sketch of this correction follows this list).
-
After surfing for a while the user gets bored and simply closes the browser, or jumps to an unrelated site such as www.hankcs.com.
To model this, define a value alpha, the probability that the user does not jump randomly (i.e. keeps following links). The contribution of a random jump is then (1 - alpha) * 1 / n.
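The dangling-node correction can itself be written as an n x n matrix that is added to alpha * H. The sketch below is only an illustration of that idea; it is not the getDanglingNodeMatrix() method used in the listing further down, and the alpha scaling shown here is an assumption made so that every row of the resulting G matrix sums to 1:

// Hypothetical illustration of a dangling-node correction matrix (not the iWeb2 implementation).
// A row of H that is all zeros means the page has no outlinks; for such a page the surfer
// is assumed to jump to any of the n pages with equal probability. The alpha factor is an
// assumption that keeps each row of G = alpha * H + dNodes + tNodes summing to 1.
public static double[][] danglingNodeMatrix(double[][] H, double alpha)
{
    int n = H.length;
    double[][] dNodes = new double[n][n];
    for (int i = 0; i < n; i++)
    {
        double rowSum = 0;
        for (int j = 0; j < n; j++)
        {
            rowSum += H[i][j];
        }
        if (rowSum == 0) // dangling node: no outlinks at all
        {
            for (int j = 0; j < n; j++)
            {
                dNodes[i][j] = alpha / n;
            }
        }
    }
    return dNodes;
}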
Next, using these two realities, define a matrix G to replace H:
/**
 * Compute the PageRank values.
 *
 * @param alpha   the probability that the surfer does not jump randomly, i.e. keeps following links
 * @param epsilon the convergence threshold
 */
public void findPageRank(double alpha, double epsilon)
{
    // A counter for our iterations
    int k = 0;

    // The H matrix has size n x n and the PageRank vector has size n
    PageRankMatrixH matrixH = getH();
    int n = matrixH.getSize();

    // Auxiliary variable: the reciprocal of n
    double inv_n = (double) 1 / n;

    // This is the actual n x n matrix of double values
    double[][] H = matrixH.getMatrix();

    // A dummy variable that holds our error -- arbitrarily set to an initial value of 1
    double error = 1;

    // This holds the values of the PageRank vector
    pR = new double[n];

    // This is a copy of the PageRank vector from the previous iteration.
    // The only reason that we need it is for evaluating the error.
    double[] tmpPR = new double[n];

    // Set the initial values (uniformly to 1/n)
    for (int i = 0; i < n; i++)
    {
        pR[i] = inv_n;
    }

    /*
     * Book Section 2.3 -- Altering the H matrix: dangling nodes.
     * Get the contribution of the dangling nodes.
     */
    double[][] dNodes = getDanglingNodeMatrix();

    /*
     * TODO: 2.5 -- Altering the G matrix: Teleportation (Book Section 2.3)
     *
     * The following value defines the teleportation contribution,
     * i.e. jumping randomly to a page that is not connected with the one
     * that our surfer is currently viewing.
     *
     * Notice that it is the same for all pages. An interesting variation
     * of the algorithm would introduce a "teleportation" contribution that
     * relates the probability of an arbitrary transition to the degree of
     * interest that a user has for the content of a page.
     *
     * Exercise: Could that be done? If so, how? What problems can you see
     * with that variation?
     */
    double tNodes = (1 - alpha) * inv_n;

    // Replace the H matrix with the G matrix
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            H[i][j] = alpha * H[i][j] + dNodes[i][j] + tNodes;
        }
    }

    // Print the G matrix together with each row total
    StringBuilder txt = new StringBuilder("G Matrix\n\n");
    for (int i = 0; i < n; i++)
    {
        double lineTotal = 0;
        for (int j = 0; j < n; j++)
        {
            txt.append(" ");
            txt.append(H[i][j]);
            lineTotal += H[i][j];
            txt.append(", ");
        }
        txt.append("total:" + lineTotal);
        txt.append("\n");
    }
    System.out.println(txt.toString());

    // Print the initial PageRank vector before iterating
    System.out.println("\n Before Iteration, PageRank convergence error: " + error);
    for (int i = 0; i < n; i++)
    {
        System.out.println("Index: " + i + " --> PageRank: " + pR[i]);
    }

    // Iterate until convergence.
    // We have found the PageRank values once the error is smaller than epsilon.
    while (error >= epsilon)
    {
        // Make a copy of the PageRank vector before we update it
        for (int i = 0; i < n; i++)
        {
            tmpPR[i] = pR[i];
        }

        double dummy = 0;
        // Now we get the next point in the iteration
        for (int i = 0; i < n; i++)
        {
            dummy = 0;
            for (int j = 0; j < n; j++)
            {
                dummy += pR[j] * H[j][i];
            }
            pR[i] = dummy;
        }

        // Get the error, so that we can check convergence
        error = norm(pR, tmpPR);

        // Display the progress
        System.out.println("\n Iteration: " + k + ", PageRank convergence error: " + error);
        for (int i = 0; i < n; i++)
        {
            System.out.println("Index: " + i + " --> PageRank: " + pR[i]);
        }

        // Increase the value of the counter by one
        k++;
    }

    // Report the final values
    List<RelevanceScore> allRankings = new ArrayList<RelevanceScore>();
    for (int i = 0; i < n; i++)
    {
        String url = matrixH.getIndexMapping().getValue(i);
        RelevanceScore r = new RelevanceScore(url, pR[i]);
        allRankings.add(r);
    }
    RelevanceScore.sort(allRankings);

    System.out.println("\n______________ Calculation Results _______________\n");
    for (RelevanceScore r : allRankings)
    {
        System.out.printf("Page URL: %-42s --> Rank: %.15f\n", r.getId(), r.getScore());
    }
    System.out.println("\n____________________________________________________\n");
}
Compared with the H matrix, the key change that turns it into the G matrix is this line:
H[i][j] = alpha * H[i][j] + dNodes[i][j] + tNodes;
The iteration repeats until the difference between the PR vectors of two consecutive iterations falls below the predefined parameter epsilon; the result is the final PageRank vector.
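The norm() call in the loop above is the convergence measure, but it is not shown in the listing. A minimal sketch, assuming it is simply the Euclidean distance between the current and the previous PageRank vectors (the actual iWeb2 implementation may use a different norm):

// Assumed convergence measure: Euclidean (L2) distance between two PageRank vectors.
private double norm(double[] a, double[] b)
{
    double sum = 0;
    for (int i = 0; i < a.length; i++)
    {
        double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return Math.sqrt(sum);
}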
With the PageRank values available, we can combine them with the Lucene index scores mentioned earlier, computing a combined score according to some rule and thereby improving the search results:
package com.hankcs;

import iweb2.ch2.shell.FetchAndProcessCrawler;
import iweb2.ch2.shell.LuceneIndexer;
import iweb2.ch2.shell.MySearcher;
import iweb2.ch2.shell.PageRank;

/**
 * @author hankcs
 */
public class ch2_3_LuceneAndPageRank
{
    public static void main(String[] args) throws Exception
    {
        //
        // -- Data (Business news URLs and 3 spam pages)
        //
        FetchAndProcessCrawler crawler = new FetchAndProcessCrawler("C:/iWeb2/data/ch02", 5, 200);
        crawler.setUrls("biz");
        // Spam pages
        crawler.addUrl("file:///c:/iWeb2/data/ch02/spam-biz-01.html");
        crawler.addUrl("file:///c:/iWeb2/data/ch02/spam-biz-02.html");
        crawler.addUrl("file:///c:/iWeb2/data/ch02/spam-biz-03.html");
        crawler.run();

        //
        // -- Lucene
        //
        LuceneIndexer luceneIndexer = new LuceneIndexer(crawler.getRootDir());
        // Index all pages
        luceneIndexer.run();
        MySearcher oracle = new MySearcher(luceneIndexer.getLuceneDir());

        //
        // -- Lucene index ranking only
        //
        oracle.search("nvidia", 5);
        oracle.search("economy news", 5);

        //
        // -- PageRank
        //
        PageRank pageRank = new PageRank(crawler.getCrawlData());
        pageRank.setAlpha(0.99);
        pageRank.setEpsilon(0.00000001);
        // Build the PageRank vector
        pageRank.build();

        //
        // -- Combined ranking
        //
        oracle.search("nvidia", 5, pageRank);
        oracle.search("economy news", 5, pageRank);
    }
}
The search method now takes an extra PageRank parameter:
/**
 * A method that combines the score of an index-based search
 * and the score of the PageRank algorithm to achieve better
 * relevance results.
 *
 * @param query           the search query
 * @param numberOfMatches maximum number of matches
 * @param pR              the PageRank values
 * @return the re-ranked search results
 */
public SearchResult[] search(String query, int numberOfMatches, Rank pR)
{
    SearchResult[] docResults = search(query, numberOfMatches);
    String url;
    int n = pR.getH().getSize();

    /**
     * TODO: 2.3 -- The PageRank scaling factor m (Book Section 2.3)
     *
     * When the number of pages in your graph is small, the PageRank values
     * need some boosting. As the number of pages increases, m approaches
     * the value 1 quickly because 1/n goes to zero.
     */
    // Compute the scaling factor
    double m = 1 - (double) 1 / n;

    // actualNumberOfMatches <= numberOfMatches
    int actualNumberOfMatches = docResults.length;
    for (int i = 0; i < actualNumberOfMatches; i++)
    {
        url = docResults[i].getUrl();
        // Compute the combined (hybrid) score
        double hScore = docResults[i].getScore() * Math.pow(pR.getPageRank(url), m);
        // Update the score of the results
        docResults[i].setScore(hScore);
    }

    // Sort results by score
    SearchResult.sortByScore(docResults);

    String header = "Search results using combined Lucene scores and page rank scores:";
    boolean showTitle = false;
    printResults(header, "Query: " + query, docResults, showTitle);
    return docResults;
}
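To see why the scaling factor m matters, here is a tiny illustration with made-up numbers (neither the page count nor the scores come from the book's data set): a page that wins on the Lucene score alone can still fall behind once its low PageRank is raised to the power m and multiplied in.

// Illustration only: how the combined score luceneScore * pageRank^m behaves.
// All numbers below are invented for this example.
public class CombinedScoreExample
{
    public static void main(String[] args)
    {
        int n = 15;                       // suppose the link graph has 15 pages
        double m = 1 - 1.0 / n;           // scaling factor; approaches 1 as n grows

        double spamLucene = 0.59, spamPR = 0.02;    // keyword-stuffed page: high index score, low PageRank
        double sportLucene = 0.37, sportPR = 0.12;  // genuine page: lower index score, higher PageRank

        System.out.println(spamLucene * Math.pow(spamPR, m));   // roughly 0.015
        System.out.println(sportLucene * Math.pow(sportPR, m)); // roughly 0.051 -- the spam page drops below the genuine one
    }
}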