This is the world's largest online movie rental service, and a recommender system is its technical core.
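3.4.1 Introducing the movie dataset and the recommender
The dataset comes from MovieLens. The recommender needs three improvements:
1. Data normalization. If a user is a shill who always gives high ratings, it is better to work with that user's relative ratings.
2. Neighbor selection. Collaborative filtering picks a set of items or users from which to derive the score of an unrated item. How do we choose the best neighbors?
3. Neighbor weighting. How much should each neighbor's rating count?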
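To make the first point concrete, here is a minimal sketch of one common way to get relative ratings: subtract each user's own mean rating. The class and method names (RatingNormalizationSketch, centerOnUserMean) and the sample ratings are made up for illustration and are not part of the iweb2 code.

```java
import java.util.Arrays;

// Hypothetical helper, not part of iweb2: centers a user's ratings on that user's mean.
final class RatingNormalizationSketch {

    /** Returns each rating minus the mean of all ratings in the array. */
    static double[] centerOnUserMean(double[] ratings) {
        if (ratings.length == 0) {
            return new double[0];
        }
        double sum = 0.0;
        for (double r : ratings) {
            sum += r;
        }
        double mean = sum / ratings.length;

        double[] centered = new double[ratings.length];
        for (int i = 0; i < ratings.length; i++) {
            centered[i] = ratings[i] - mean;
        }
        return centered;
    }

    public static void main(String[] args) {
        // Invented sample data: a generous rater and a strict rater with the same taste
        double[] generous = {5, 5, 4, 5};
        double[] strict = {3, 3, 2, 3};

        System.out.println(Arrays.toString(centerOnUserMean(generous)));
        System.out.println(Arrays.toString(centerOnUserMean(strict)));
    }
}
```

Both users end up with the same centered vector, so the always-high rater no longer looks different from the strict one once only relative preferences are compared.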
Main program:
package com.hankcs;

import iweb2.ch3.collaborative.data.MovieLensData;
import iweb2.ch3.collaborative.data.MovieLensDataset;
import iweb2.ch3.collaborative.model.User;
import iweb2.ch3.collaborative.recommender.MovieLensDelphi;

public class ch3_6_MovieLens {
    public static void main(String[] args) throws Exception {
        // Create the dataset
        MovieLensDataset ds = MovieLensData.createDataset();

        // Create the recommender
        MovieLensDelphi delphi = new MovieLensDelphi(ds);

        // Pick a few arbitrary users and create recommendations for them
        User u1 = ds.getUser(1);
        delphi.recommend(u1);

        User u155 = ds.getUser(155);
        delphi.recommend(u155);

        User u876 = ds.getUser(876);
        delphi.recommend(u876);
    }
}
Running it fails with:
Exception in thread "main" java.lang.RuntimeException: Failed to load rating from file (file: 'C:\iWeb2\data\ch03\MovieLens\ratings.dat'):
make sure that you are using at least: -Xmx1024m
at iweb2.ch3.collaborative.data.MovieLensDataset.loadRatings(MovieLensDataset.java:310)
at iweb2.ch3.collaborative.data.MovieLensDataset.loadData(MovieLensDataset.java:118)
at iweb2.ch3.collaborative.data.MovieLensDataset.<init>(MovieLensDataset.java:97)
at iweb2.ch3.collaborative.data.MovieLensData.createDataset(MovieLensData.java:36)
at iweb2.ch3.collaborative.data.MovieLensData.createDataset(MovieLensData.java:19)
at iweb2.ch3.collaborative.data.MovieLensData.createDataset(MovieLensData.java:15)
at com.hankcs.ch3_6_MovieLens.main(ch3_6_MovieLens.java:27)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: java.io.FileNotFoundException: C:\iWeb2\data\ch03\MovieLens\ratings.dat (The system cannot find the file specified.)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at java.io.FileReader.<init>(FileReader.java:72)
at iweb2.ch3.collaborative.data.MovieLensDataset.getReader(MovieLensDataset.java:265)
at iweb2.ch3.collaborative.data.MovieLensDataset.loadRatings(MovieLensDataset.java:300)
... 11 more
Process finished with exit code 1
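The -Xmx1024m hint in the message is misleading here: the root cause is the FileNotFoundException at the bottom of the trace, because ratings.dat does not exist yet. The fix is to extract C:\iWeb2\data\ch03\MovieLens\ml.zip into the directory that contains it.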
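Any archive tool will do for this step. If you would rather script it, the following is a minimal sketch using only the JDK's java.util.zip classes. The class name UnzipSketch and its unzip method are hypothetical, and the sketch assumes ml.zip holds ratings.dat in a layout that matches what the loader expects, which the error message alone does not confirm.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of iweb2: extracts a zip archive into a directory.
final class UnzipSketch {

    /** Extracts every entry of zipFile into targetDir, creating directories as needed. */
    static void unzip(Path zipFile, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        try (InputStream in = Files.newInputStream(zipFile);
             ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Refuse entries whose names would escape the target directory
                if (!out.startsWith(root)) {
                    throw new IOException("Unsafe zip entry: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies the bytes of the current entry only; the zip stream stays open
                    Files.copy(zin, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: UnzipSketch <zip-file> <target-dir>");
            return;
        }
        unzip(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

For the case above, the two arguments would be the path to ml.zip and its parent directory, both taken from the error message.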
3.4.2 Data normalization and the correlation coefficient
The correlation coefficient r is computed in iweb2/ch3/collaborative/similarity/PearsonCorrelation.java
public double calculate() {
    if (n == 0) {
        return 0.0;
    }

    double rho = 0.0d;

    // Compute the mean of each vector
    double avgX = getAverage(x);
    double avgY = getAverage(y);

    // Compute the standard deviation of each vector
    double sX = getStdDev(avgX, x);
    double sY = getStdDev(avgY, y);

    double xy = 0;
    for (int i = 0; i < n; i++) {
        // Dot product of the mean-centered vectors, i.e. the covariance before dividing by n
        xy += (x[i] - avgX) * (y[i] - avgY);
    }

    // No variation -- all points have the same values for either X or Y or both.
    // The standard deviations go into the denominator, so guard against division by zero.
    if (sX == ZERO || sY == ZERO) {
        double indX = ZERO;
        double indY = ZERO;

        for (int i = 1; i < n; i++) {
            // The signed sums below are wrong: for the values 1 2 0 the result is 0,
            // even though the values are clearly not all the same.
            // indX += (x[0] - x[i]);
            // indY += (y[0] - y[i]);
            // Summing absolute differences fixes this.
            indX += Math.abs(x[0] - x[i]);
            indY += Math.abs(y[0] - y[i]);
        }

        if (indX == ZERO && indY == ZERO) {
            // All points refer to the same value
            // This is a degenerate case of correlation
            return 1.0;
        } else {
            // Either the values of the X vary or the values of Y
            if (sX == ZERO) {
                sX = sY;
            } else {
                sY = sX;
            }
        }
    }

    rho = xy / (n * (sX * sY));

    return rho;
}
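For readers who want to try the coefficient without the iweb2 classes, here is a self-contained sketch of the textbook Pearson formula. PearsonSketch and its methods are hypothetical names. Note one deliberate difference, which is an assumption of this sketch and not the behaviour of the listing above: when either vector has no variation, it simply returns 0.0 instead of applying the degenerate-case rules.

```java
// Hypothetical stand-alone sketch, not the iweb2 PearsonCorrelation class.
final class PearsonSketch {

    static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    /** Pearson correlation of two equally long vectors; 0.0 for empty or constant input. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        int n = x.length;
        if (n == 0) {
            return 0.0;
        }
        double avgX = mean(x);
        double avgY = mean(y);

        double sxy = 0.0; // sum of products of deviations
        double sxx = 0.0; // sum of squared deviations of x
        double syy = 0.0; // sum of squared deviations of y
        for (int i = 0; i < n; i++) {
            double dx = x[i] - avgX;
            double dy = y[i] - avgY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // Assumption of this sketch: no variation means "no usable correlation"
        if (sxx == 0.0 || syy == 0.0) {
            return 0.0;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {2, 4, 6}));
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {3, 2, 1}));
        System.out.println(pearson(new double[] {1, 2, 3, 4}, new double[] {1, 3, 2, 4}));
    }
}
```

The first pair is perfectly correlated, the second perfectly anti-correlated, and the third lies in between. If the standard deviation is computed over n values (population form), sxy / sqrt(sxx * syy) is algebraically the same as xy / (n * sX * sY) in the listing.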
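Data normalization here means that, instead of the vector of raw user ratings, we use a new vector obtained by subtracting the item's average rating from each raw user rating.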
public PearsonCorrelation(Dataset ds, Item iA, Item iB) {
    double aAvgR = iA.getAverageRating();
    double bAvgR = iB.getAverageRating();

    // Only users who rated both items contribute to the correlation
    Integer[] uid = Item.getSharedUserIds(iA, iB);
    n = uid.length;

    // Normalize the ratings by each item's average rating
    x = new double[n];
    y = new double[n];

    User u;
    double urA = 0;
    double urB = 0;

    for (int i = 0; i < n; i++) {
        u = ds.getUser(uid[i]);
        urA = u.getItemRating(iA.getId()).getRating();
        urB = u.getItemRating(iB.getId()).getRating();

        x[i] = urA - aAvgR;
        y[i] = urB - bAvgR;
    }
}
The final rating computation:
/**
 * Rating estimation that includes re-normalization and rescaling.
 *
 * @param user the user to estimate a rating for
 * @param item the item whose rating is estimated
 * @return the user's existing rating if there is one, otherwise the estimate
 */
private double estimateItemBasedRating(User user, Item item) {
    double itemRating = item.getAverageRating();
    int itemId = item.getId();
    int userId = user.getId();
    double itemAvgRating = item.getAverageRating();

    double weightedDeltaSum = 0.0;
    int sumN = 0;

    // Check if the user has already rated the item
    Rating existingRatingByUser = user.getItemRating(item.getId());

    if (existingRatingByUser != null) {
        itemRating = existingRatingByUser.getRating();
    } else {
        double similarityBetweenItems = 0;
        double weightedDelta = 0;
        double delta = 0;

        for (Item anotherItem : dataSet.getItems()) {
            // Only consider items that were rated by the user
            Rating anotherItemRating = anotherItem.getUserRating(userId);

            if (anotherItemRating != null) {
                // Normalize again: deviation of this rating from the target item's average
                delta = itemAvgRating - anotherItemRating.getRating();

                // Look up the similarity between the two items
                similarityBetweenItems = itemSimilarityMatrix.getValue(itemId, anotherItem.getId());

                if (Math.abs(similarityBetweenItems) > similarityThreshold) {
                    // similarity * normalized rating
                    weightedDelta = similarityBetweenItems * delta;
                    weightedDeltaSum += weightedDelta;
                    sumN++;
                }
            }
        }

        if (sumN > 0) {
            // Item average minus the mean weighted deviation
            itemRating = itemAvgRating - (weightedDeltaSum / sumN);
        }
    }

    return itemRating;
}
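The arithmetic is easier to follow on a tiny worked example. The sketch below mirrors the estimate in the listing, estimate = itemAvg - (sum of sim * (itemAvg - rating)) / N over neighbors whose absolute similarity exceeds the threshold, but it uses plain maps in place of the iweb2 Dataset and similarity matrix. ItemBasedEstimateSketch, its parameters, and all the numbers are invented for illustration. Skipping items with no similarity entry is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch, not the iweb2 recommender.
final class ItemBasedEstimateSketch {

    /**
     * @param itemAvg            average rating of the target item
     * @param userRatings        the user's ratings of other items, keyed by item id
     * @param similarityToTarget similarity of each other item to the target item
     * @param threshold          neighbors with |similarity| at or below this are ignored
     */
    static double estimate(double itemAvg,
                           Map<Integer, Double> userRatings,
                           Map<Integer, Double> similarityToTarget,
                           double threshold) {
        double weightedDeltaSum = 0.0;
        int used = 0;
        for (Map.Entry<Integer, Double> e : userRatings.entrySet()) {
            Double sim = similarityToTarget.get(e.getKey());
            // Assumption of this sketch: items without a similarity entry are skipped
            if (sim == null || Math.abs(sim) <= threshold) {
                continue;
            }
            // Neighbor weight (similarity) times the normalized rating
            weightedDeltaSum += sim * (itemAvg - e.getValue());
            used++;
        }
        // Fall back to the item average when no neighbor qualifies
        return used > 0 ? itemAvg - weightedDeltaSum / used : itemAvg;
    }

    public static void main(String[] args) {
        Map<Integer, Double> ratings = new LinkedHashMap<Integer, Double>();
        ratings.put(10, 5.0);
        ratings.put(11, 2.0);
        ratings.put(12, 4.0);

        Map<Integer, Double> sims = new LinkedHashMap<Integer, Double>();
        sims.put(10, 0.5);
        sims.put(11, 0.25);
        sims.put(12, 0.0625); // below the threshold used here, so it is ignored

        // (0.5 * (3.5 - 5.0) + 0.25 * (3.5 - 2.0)) / 2 = -0.1875, so the estimate is 3.6875
        System.out.println(estimate(3.5, ratings, sims, 0.1));
    }
}
```

This also shows where the other two improvements from 3.4.1 enter: the threshold does the neighbor selection, and the similarity acts as the neighbor weight. The deviation sum is divided by the number of neighbors used, as in the listing, not by the sum of the similarities.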