
An index plus user clicks can indeed power search over hundreds of thousands of documents, but since there is no notion of a "link" between plain documents, PageRank cannot be applied to rank them. Algorithms of the Intelligent Web introduces the DocRank algorithm for ranking such documents instead.
package com.hankcs;

import iweb2.ch2.shell.*;

/**
 * @author hankcs
 */
public class ch2_5_LuceneAndDocRank
{
    public static void main(String[] args) throws Exception
    {
        //
        // -- Data (business news in Word document format plus 3 spam documents)
        //
        FetchAndProcessCrawler c = new FetchAndProcessCrawler("C:/iWeb2/data/ch02", 5, 200);
        c.setUrls("biz-docs");
        c.addDocSpam();
        c.run();
        //
        // -- Lucene: index the documents
        //
        LuceneIndexer lidx = new LuceneIndexer(c.getRootDir());
        lidx.run();
        MySearcher oracle = new MySearcher(lidx.getLuceneDir());
        //
        // -- Lucene index ranking only
        //
        oracle.search("nvidia", 5);
        //
        // -- DocRank
        //
        DocRank dr = new DocRank(lidx.getLuceneDir(), 7);
        dr.setAlpha(0.9);
        dr.setEpsilon(0.00000001);
        dr.build();
        //
        // -- Combined ranking
        //
        oracle.search("nvidia", 5, dr);
    }
}
DocRank also has an H matrix, but it is based on keywords rather than links. We use Lucene's term vectors to represent all the terms in a document together with their frequencies. The closer document Y's term vector is to document X's, the more important Y is to X. Concretely: take the intersection of the most frequent terms of the two documents; for each shared term, compute the ratio of its frequency in Y to its frequency in X, take the hyperbolic tangent of that ratio, and round the result. I had forgotten the definition of tanh, so a quick review:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)); it is monotonically increasing, equals 0 at x = 0, and approaches 1 as x grows.
Finally, sum these rounded values over all shared terms. This sum i becomes the element in row X, column Y of matrix H, which is equivalent to document X having i links pointing to document Y.
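To make the arithmetic concrete, here is a tiny standalone sketch with made-up frequencies (the data and class name are mine, not the book's):

import java.util.*;

public class DocRankToyExample
{
    public static void main(String[] args)
    {
        // Hypothetical top-term frequencies for documents X and Y
        Map<String, Integer> x = new HashMap<String, Integer>();
        x.put("nvidia", 10); x.put("gpu", 4); x.put("stock", 2);
        Map<String, Integer> y = new HashMap<String, Integer>();
        y.put("nvidia", 8); y.put("stock", 1); y.put("earnings", 5);

        // Intersection of the two term sets: {nvidia, stock}
        Set<String> shared = new HashSet<String>(x.keySet());
        shared.retainAll(y.keySet());

        double sum = 0.0;
        for (String term : shared) {
            double ratio = y.get(term).doubleValue() / x.get(term).doubleValue();
            // nvidia: tanh(8/10) = 0.664 -> 1;  stock: tanh(1/2) = 0.462 -> 0
            sum += Math.round(Math.tanh(ratio));
        }
        System.out.println(sum); // 1.0, i.e. one "link" from X to Y
    }
}

Since Math.round(Math.tanh(r)) is 1 exactly when r >= atanh(0.5) ≈ 0.549, a shared term contributes a "link" only if its frequency in Y is at least roughly 55% of its frequency in X.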
The detailed computation is as follows:
private PageRankMatrixH buildMatrixH(IndexReader idxR)
        throws IOException {
    // Only consider URLs whose content has been fetched and parsed
    List<Integer> allDocs = getProcessedDocs(idxR);
    PageRankMatrixH docMatrix = new PageRankMatrixH( allDocs.size() );
    for(int i = 0, n = allDocs.size(); i < n; i++) {
        for(int j = 0, k = allDocs.size(); j < k; j++) {
            double similarity = 0.0d;
            Document docX = idxR.document(i);
            String xURL = docX.get("url");
            if ( i == j ) {
                // Avoid shameless self-promotion 😉 : the diagonal stays at 0
                docMatrix.addLink(xURL, xURL, similarity);
            } else {
                // Lucene term vectors: each document's terms and their frequencies
                TermFreqVector x = idxR.getTermFreqVector(i, "content");
                TermFreqVector y = idxR.getTermFreqVector(j, "content");
                similarity = getImportance(x.getTerms(), x.getTermFrequencies(),
                                           y.getTerms(), y.getTermFrequencies());
                // Add a weighted link from docX to docY
                Document docY = idxR.document(j);
                String yURL = docY.get("url");
                docMatrix.addLink(xURL, yURL, similarity);
            }
        }
    }
    docMatrix.calculate();
    return docMatrix;
}
/**
 * Calculates the importance of document Y in the context of document X.
 *
 * @param xTerms    the most frequent terms of document X
 * @param xTermFreq the frequencies of xTerms
 * @param yTerms    the most frequent terms of document Y
 * @param yTermFreq the frequencies of yTerms
 * @return the importance of Y for X
 */
private double getImportance(String[] xTerms, int[] xTermFreq,
                             String[] yTerms, int[] yTermFreq) {
    // xTerms holds the most frequent terms of the first document
    Map<String, Integer> xFreqMap = buildFreqMap(xTerms, xTermFreq);
    // yTerms holds the most frequent terms of the second document
    Map<String, Integer> yFreqMap = buildFreqMap(yTerms, yTermFreq);
    // sharedTerms is the intersection of the two term sets
    Set<String> sharedTerms = new HashSet<String>(xFreqMap.keySet());
    sharedTerms.retainAll(yFreqMap.keySet());
    double sharedTermsSum = 0.0;
    // Notice that this way of assigning importance is not symmetric:
    // swapping X and Y gives a different value, unless the
    // frequencies happen to be equal, of course!
    double xF, yF;
    for(String term : sharedTerms) {
        // Ratio of the term's frequency in Y to its frequency in X
        xF = xFreqMap.get(term).doubleValue();
        yF = yFreqMap.get(term).doubleValue();
        // tanh is monotonically increasing and approaches 1; round to 0 or 1, then accumulate
        sharedTermsSum += Math.round(Math.tanh(yF/xF));
    }
    return sharedTermsSum;
}
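The asymmetry the comment warns about is easy to see with the toy frequencies from the sketch above: Y's importance for X is round(tanh(8/10)) + round(tanh(1/2)) = 1 + 0 = 1, while X's importance for Y is round(tanh(10/8)) + round(tanh(2/1)) = 1 + 1 = 2. Swapping the arguments inverts every ratio, so the two directions generally differ.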
Once we have the H matrix, the G matrix is computed exactly the same way as in the previous section, and so is the final combination of the scores.
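For reference, here is a minimal sketch of that step, assuming the standard formulation G = alpha * H + (1 - alpha)/n * E, where E is the uniform teleportation matrix. The class and method names are mine, not the book's, and I assume each row of h has already been normalized to sum to 1 (dangling rows handled upstream, which is what docMatrix.calculate() appears to do):

import java.util.Arrays;

public class PowerMethodSketch
{
    /**
     * Computes the stationary vector of G = alpha * H + (1 - alpha)/n * E
     * by power iteration. Assumes each row of h sums to 1.
     */
    public static double[] powerMethod(double[][] h, double alpha, double epsilon)
    {
        int n = h.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n);              // start from the uniform distribution
        double delta = Double.MAX_VALUE;
        while (delta > epsilon) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    // G[i][j] = alpha * h[i][j] + (1 - alpha)/n
                    sum += pr[i] * (alpha * h[i][j] + (1.0 - alpha) / n);
                }
                next[j] = sum;
            }
            delta = 0.0;                       // L1 distance between iterations
            for (int i = 0; i < n; i++) {
                delta += Math.abs(next[i] - pr[i]);
            }
            pr = next;
        }
        return pr;
    }
}

With alpha = 0.9 and epsilon = 1e-8 this mirrors the dr.setAlpha(0.9) and dr.setEpsilon(0.00000001) calls in the driver above.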