An index plus user-click data can certainly power search over hundreds of thousands of documents, but plain documents have no notion of a "link", so PageRank cannot be applied to rank them. Algorithms of the Intelligent Web introduces the DocRank algorithm for ranking such documents.
package com.hankcs;

import iweb2.ch2.shell.*;

/**
 * @author hankcs
 */
public class ch2_5_LuceneAndDocRank
{
    public static void main(String[] args) throws Exception
    {
        //
        // -- Data (business news in Word document format and 3 spam documents)
        //
        FetchAndProcessCrawler c = new FetchAndProcessCrawler("C:/iWeb2/data/ch02", 5, 200);
        c.setUrls("biz-docs");
        c.addDocSpam();
        c.run();

        //
        // -- Lucene
        //
        LuceneIndexer lidx = new LuceneIndexer(c.getRootDir());
        lidx.run();

        MySearcher oracle = new MySearcher(lidx.getLuceneDir());

        //
        // -- Lucene index ranking only
        //
        oracle.search("nvidia", 5);

        //
        // -- PageRank (DocRank variant)
        //
        DocRank dr = new DocRank(lidx.getLuceneDir(), 7);
        dr.setAlpha(0.9);
        dr.setEpsilon(0.00000001);
        dr.build();

        //
        // -- Combined ranking
        //
        oracle.search("nvidia", 5, dr);
    }
}
DocRank also has an H matrix, but it is built from keywords rather than links. We use Lucene's term vectors to represent every keyword in a document along with its frequency. The closer document Y's term vector is to document X's, the more important Y is to X. Take the intersection of the most frequent terms in the two documents; for each term in the intersection, compute the ratio of its frequency in Y to its frequency in X, then take the hyperbolic tangent of that ratio and round the result. I had forgotten the definition of the hyperbolic tangent, so a quick refresher:
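\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}

Since the frequency ratio is always positive, its tanh lies in (0, 1), so rounding collapses each shared term's contribution to either 0 or 1; it becomes 1 exactly when the ratio exceeds artanh(0.5) ≈ 0.549.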
Finally, sum these rounded values over all shared terms. That sum i becomes the entry in row X, column Y of the matrix H, as if document X had i links pointing to document Y.
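A quick worked example with made-up frequencies: suppose X and Y share the terms "nvidia" (4 occurrences in X, 6 in Y) and "gpu" (10 in X, 2 in Y). Then round(tanh(6/4)) = round(0.905) = 1 and round(tanh(2/10)) = round(0.197) = 0, so the entry H[X][Y] = 1, as if X had a single link pointing to Y.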
The computation in detail:
private PageRankMatrixH buildMatrixH(IndexReader idxR) throws IOException
{
    // only consider URLs whose content has been fetched and parsed
    List<Integer> allDocs = getProcessedDocs(idxR);

    PageRankMatrixH docMatrix = new PageRankMatrixH(allDocs.size());

    for (int i = 0, n = allDocs.size(); i < n; i++)
    {
        for (int j = 0, k = allDocs.size(); j < k; j++)
        {
            double similarity = 0.0d;

            Document docX = idxR.document(i);
            String xURL = docX.get("url");

            if (i == j)
            {
                // avoid shameless self-promotion: the diagonal stays 0
                docMatrix.addLink(xURL, xURL, similarity);
            }
            else
            {
                // term vectors: the terms of each document with their frequencies
                TermFreqVector x = idxR.getTermFreqVector(i, "content");
                TermFreqVector y = idxR.getTermFreqVector(j, "content");

                similarity = getImportance(x.getTerms(), x.getTermFrequencies(),
                                           y.getTerms(), y.getTermFrequencies());

                // add a link from docX to docY, weighted by that importance
                Document docY = idxR.document(j);
                String yURL = docY.get("url");

                docMatrix.addLink(xURL, yURL, similarity);
            }
        }
    }

    docMatrix.calculate();

    return docMatrix;
}

/**
 * Calculates the importance of document Y in the context of document X.
 * Note that this measure is not symmetric: swapping X and Y generally
 * yields a different value, unless the frequencies are equal.
 */
private double getImportance(String[] xTerms, int[] xTermFreq,
                             String[] yTerms, int[] yTermFreq)
{
    // xTerms is an array of the most frequent terms of the first document
    Map<String, Integer> xFreqMap = buildFreqMap(xTerms, xTermFreq);

    // yTerms is an array of the most frequent terms of the second document
    Map<String, Integer> yFreqMap = buildFreqMap(yTerms, yTermFreq);

    // sharedTerms is the intersection of the two term sets
    Set<String> sharedTerms = new HashSet<String>(xFreqMap.keySet());
    sharedTerms.retainAll(yFreqMap.keySet());

    double sharedTermsSum = 0.0;

    double xF, yF;
    for (String term : sharedTerms)
    {
        // the term's frequency in each of the two documents
        xF = xFreqMap.get(term).doubleValue();
        yF = yFreqMap.get(term).doubleValue();

        // tanh is monotonically increasing and approaches 1;
        // rounding turns each shared term into a 0/1 vote
        sharedTermsSum += Math.round(Math.tanh(yF / xF));
    }

    return sharedTermsSum;
}
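To see the scoring in isolation, here is a minimal standalone sketch of the same term-overlap scheme, detached from Lucene. The class name, documents, and frequencies are all made up for illustration; they reuse the numbers from the worked example above.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical demo class, not part of the iweb2 code base.
public class DocRankImportanceDemo
{
    // Same idea as getImportance above, but taking frequency maps directly.
    static double importance(Map<String, Integer> xFreq, Map<String, Integer> yFreq)
    {
        // intersection of the two documents' frequent terms
        Set<String> shared = new HashSet<String>(xFreq.keySet());
        shared.retainAll(yFreq.keySet());

        double sum = 0.0;
        for (String term : shared)
        {
            double xF = xFreq.get(term);
            double yF = yFreq.get(term);
            // tanh squashes the ratio into (0, 1); rounding makes it a 0/1 vote
            sum += Math.round(Math.tanh(yF / xF));
        }
        return sum;
    }

    public static void main(String[] args)
    {
        Map<String, Integer> x = new HashMap<String, Integer>();
        x.put("nvidia", 4);
        x.put("gpu", 10);
        x.put("earnings", 3);

        Map<String, Integer> y = new HashMap<String, Integer>();
        y.put("nvidia", 6);
        y.put("gpu", 2);
        y.put("market", 5);

        // round(tanh(6/4)) + round(tanh(2/10)) = 1 + 0 = 1
        System.out.println(importance(x, y)); // prints 1.0
    }
}

Rounding keeps the "link weight" integral, preserving the analogy with real hyperlinks, at the cost of discarding weak overlaps (ratios below ≈ 0.55 contribute nothing).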
With the H matrix in place, the G matrix is computed exactly as in the previous section, and the final scores again sum to 1 in the same way.
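As a reminder, in the usual PageRank formulation that the previous section used:

G = \alpha S + (1 - \alpha) \frac{1}{n} \mathbf{e}\mathbf{e}^{\top}

where S is H with dangling-node rows replaced by uniform rows, \alpha is the teleportation factor (0.9 in the driver above), n is the number of documents, and \mathbf{e} is the all-ones vector; the power iteration stops once successive score vectors differ by less than \epsilon (here 10^{-8}).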