
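This is the world's largest online movie rental service, and its core technology is a recommender system.
3.4.1 Introduction to the movie dataset and the recommender
The dataset comes from MovieLens. The recommender needs three improvements:
1. Data normalization. If a user is a shill (always gives high scores), it is better to consider that user's relative ratings.
2. Neighbor selection. Collaborative filtering picks a set of items or users from which to derive the score of an unrated item; how should the best neighbors be chosen?
3. Neighbor weighting. How much should each neighbor's rating count?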
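As a minimal sketch of the first point, a user's ratings can be re-expressed relative to that user's own mean, so that a user who rates everything high contributes only the differences between items. The class and method names below (RelativeRatings, center) are hypothetical and are not part of the iweb2 code.

```java
import java.util.Arrays;

public class RelativeRatings {
    // Hypothetical helper: subtract the user's own mean rating from each rating.
    public static double[] center(double[] ratings) {
        if (ratings.length == 0) {
            return new double[0];
        }
        double mean = Arrays.stream(ratings).average().orElse(0.0);
        double[] centered = new double[ratings.length];
        for (int i = 0; i < ratings.length; i++) {
            centered[i] = ratings[i] - mean;
        }
        return centered;
    }

    public static void main(String[] args) {
        // A user who always rates high: the raw values all look enthusiastic,
        // while the centered values show which items were rated above or below
        // this user's usual level.
        double[] raw = {5, 5, 4, 5, 4};
        System.out.println(Arrays.toString(center(raw)));
    }
}
```

Centering on the user's mean is only one possible normalization; the code discussed in 3.4.2 centers on the item's average instead.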
Main program:
package com.hankcs;

import iweb2.ch3.collaborative.data.MovieLensData;
import iweb2.ch3.collaborative.data.MovieLensDataset;
import iweb2.ch3.collaborative.recommender.MovieLensDelphi;

public class ch3_6_MovieLens
{
    public static void main(String[] args) throws Exception
    {
        // Create the dataset
        MovieLensDataset ds = MovieLensData.createDataset();
        // Create the recommender
        MovieLensDelphi delphi = new MovieLensDelphi(ds);
        // Pick a few arbitrary users as a test and create recommendations
        iweb2.ch3.collaborative.model.User u1 = ds.getUser(1);
        delphi.recommend(u1);
        iweb2.ch3.collaborative.model.User u155 = ds.getUser(155);
        delphi.recommend(u155);
        iweb2.ch3.collaborative.model.User u876 = ds.getUser(876);
        delphi.recommend(u876);
    }
}
Running it fails with:
Exception in thread "main" java.lang.RuntimeException: Failed to load rating from file (file: 'C:\iWeb2\data\ch03\MovieLens\ratings.dat'):
make sure that you are using at least: -Xmx1024m
at iweb2.ch3.collaborative.data.MovieLensDataset.loadRatings(MovieLensDataset.java:310)
at iweb2.ch3.collaborative.data.MovieLensDataset.loadData(MovieLensDataset.java:118)
at iweb2.ch3.collaborative.data.MovieLensDataset.<init>(MovieLensDataset.java:97)
at iweb2.ch3.collaborative.data.MovieLensData.createDataset(MovieLensData.java:36)
at iweb2.ch3.collaborative.data.MovieLensData.createDataset(MovieLensData.java:19)
at iweb2.ch3.collaborative.data.MovieLensData.createDataset(MovieLensData.java:15)
at com.hankcs.ch3_6_MovieLens.main(ch3_6_MovieLens.java:27)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: java.io.FileNotFoundException: C:\iWeb2\data\ch03\MovieLens\ratings.dat (系统找不到指定的文件。)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at java.io.FileReader.<init>(FileReader.java:72)
at iweb2.ch3.collaborative.data.MovieLensDataset.getReader(MovieLensDataset.java:265)
at iweb2.ch3.collaborative.data.MovieLensDataset.loadRatings(MovieLensDataset.java:300)
… 11 more
Process finished with exit code 1
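Despite the -Xmx1024m hint in the message, the root cause is the FileNotFoundException at the bottom of the trace (the Chinese system message reads "The system cannot find the file specified"). The fix is to extract C:\iWeb2\data\ch03\MovieLens\ml.zip into that same directory.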
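Because the error message points at memory rather than at the missing file, a quick check before loading can save some confusion. The following is a minimal sketch, not part of iweb2: the class DataFileCheck and the method missingFiles are hypothetical, and the directory and file name are simply the ones that appear in the stack trace.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class DataFileCheck {
    // Returns the names from 'required' that are not regular files inside 'dir'.
    public static List<String> missingFiles(Path dir, String... required) {
        List<String> missing = new ArrayList<>();
        for (String name : required) {
            if (!Files.isRegularFile(dir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Directory taken from the stack trace; adjust it to your own install.
        Path dir = Paths.get("C:\\iWeb2\\data\\ch03\\MovieLens");
        List<String> missing = missingFiles(dir, "ratings.dat");
        if (!missing.isEmpty()) {
            System.err.println("Missing in " + dir + ": " + missing
                    + " (extract ml.zip there first)");
        }
    }
}
```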
3.4.2 Data normalization and the correlation coefficient
The correlation coefficient r is computed in iweb2/ch3/collaborative/similarity/PearsonCorrelation.java
public double calculate()
{
    if (n == 0)
    {
        return 0.0;
    }
    double rho = 0.0d;
    // Compute the mean of each vector
    double avgX = getAverage(x);
    double avgY = getAverage(y);
    // Compute the standard deviation of each vector
    double sX = getStdDev(avgX, x);
    double sY = getStdDev(avgY, y);
    double xy = 0;
    for (int i = 0; i < n; i++)
    {
        // Dot product of the mean-centered vectors (the covariance before dividing by n)
        xy += (x[i] - avgX) * (y[i] - avgY);
    }
    // No variation -- all points have the same values for either X or Y or both
    // The standard deviations go into the denominator, so guard against dividing by zero
    if (sX == ZERO || sY == ZERO)
    {
        double indX = ZERO;
        double indY = ZERO;
        for (int i = 1; i < n; i++)
        {
            // The computation below is wrong: for the counterexample 1 2 0 it yields 0,
            // even though the values clearly differ
            // indX += (x[0] - x[i]);
            // indY += (y[0] - y[i]);
            // Using absolute values fixes it
            indX += Math.abs(x[0] - x[i]);
            indY += Math.abs(y[0] - y[i]);
        }
        if (indX == ZERO && indY == ZERO)
        {
            // All points refer to the same value
            // This is a degenerate case of correlation
            return 1.0;
        }
        else
        {
            // Either the values of the X vary or the values of Y
            if (sX == ZERO)
            {
                sX = sY;
            }
            else
            {
                sY = sX;
            }
        }
    }
    rho = xy / (n * (sX * sY));
    return rho;
}
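The method above amounts to r = sum((x_i - avgX) * (y_i - avgY)) / (n * sX * sY). The following self-contained sketch reproduces that computation so it can be tried on small vectors. It assumes that getStdDev returns the population standard deviation (dividing by n), which the excerpt does not show; the class name PearsonSketch and its helpers are hypothetical. When exactly one vector is constant, the original replaces the zero deviation with the other one, but the numerator is already 0 in that case, so the sketch returns 0.0 directly.

```java
public class PearsonSketch {
    static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    // Population standard deviation (divides by n); an assumption about getStdDev.
    static double stdDev(double mean, double[] v) {
        double sum = 0.0;
        for (double d : v) {
            sum += (d - mean) * (d - mean);
        }
        return Math.sqrt(sum / v.length);
    }

    public static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        int n = x.length;
        if (n == 0) {
            return 0.0;
        }
        double avgX = mean(x);
        double avgY = mean(y);
        double sX = stdDev(avgX, x);
        double sY = stdDev(avgY, y);
        if (sX == 0.0 && sY == 0.0) {
            return 1.0; // both vectors constant: the degenerate case handled above
        }
        if (sX == 0.0 || sY == 0.0) {
            return 0.0; // one vector constant: the sum of products is zero
        }
        double xy = 0.0;
        for (int i = 0; i < n; i++) {
            xy += (x[i] - avgX) * (y[i] - avgY);
        }
        return xy / (n * sX * sY);
    }

    public static void main(String[] args) {
        // Perfectly correlated and perfectly anti-correlated vectors.
        System.out.println(pearson(new double[]{1, 2, 3}, new double[]{2, 4, 6}));
        System.out.println(pearson(new double[]{1, 2, 3}, new double[]{3, 2, 1}));
    }
}
```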
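Data normalization means that, instead of using vectors of the raw user ratings, we use new vectors obtained by subtracting the item's average rating from each raw user rating.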
public PearsonCorrelation(Dataset ds, Item iA, Item iB)
{
    double aAvgR = iA.getAverageRating();
    double bAvgR = iB.getAverageRating();
    Integer[] uid = Item.getSharedUserIds(iA, iB);
    n = uid.length;
    // Normalize the ratings against each item's average rating
    x = new double[n];
    y = new double[n];
    User u;
    double urA = 0;
    double urB = 0;
    for (int i = 0; i < n; i++)
    {
        u = ds.getUser(uid[i]);
        urA = u.getItemRating(iA.getId()).getRating();
        urB = u.getItemRating(iB.getId()).getRating();
        x[i] = urA - aAvgR;
        y[i] = urB - bAvgR;
    }
}
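To see what the constructor produces, here is a minimal self-contained sketch that stands in for the iweb2 Dataset, Item and User classes with plain maps from user id to rating. Everything in it (ItemCentering, centeredSharedRatings, average) is hypothetical, and it assumes that getAverageRating is the mean over all users who rated the item, not only the shared ones.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class ItemCentering {
    // Mean of all ratings an item received (stand-in for getAverageRating()).
    static double average(Map<Integer, Double> ratingsByUser) {
        if (ratingsByUser.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double r : ratingsByUser.values()) {
            sum += r;
        }
        return sum / ratingsByUser.size();
    }

    // Returns {x, y}: the ratings of items A and B from users who rated both,
    // each shifted by that item's own average rating.
    public static double[][] centeredSharedRatings(Map<Integer, Double> a,
                                                   Map<Integer, Double> b) {
        TreeSet<Integer> shared = new TreeSet<>(a.keySet());
        shared.retainAll(b.keySet());
        double avgA = average(a);
        double avgB = average(b);
        double[] x = new double[shared.size()];
        double[] y = new double[shared.size()];
        int i = 0;
        for (int uid : shared) {
            x[i] = a.get(uid) - avgA;
            y[i] = b.get(uid) - avgB;
            i++;
        }
        return new double[][]{x, y};
    }

    public static void main(String[] args) {
        Map<Integer, Double> itemA = new TreeMap<>();
        itemA.put(1, 5.0);
        itemA.put(2, 3.0);
        itemA.put(3, 4.0);
        Map<Integer, Double> itemB = new TreeMap<>();
        itemB.put(2, 2.0);
        itemB.put(3, 4.0);
        itemB.put(4, 3.0);
        double[][] xy = centeredSharedRatings(itemA, itemB);
        System.out.println(Arrays.toString(xy[0]) + " " + Arrays.toString(xy[1]));
    }
}
```

In this example only users 2 and 3 rated both items, so the vectors have length 2 even though each item's average is taken over three ratings.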
The final rating computation:
/**
 * Rating estimation that includes renormalization and rescaling.
 * @param user the user whose rating is estimated
 * @param item the item to estimate a rating for
 * @return the user's existing rating if there is one, otherwise the estimate
 */
private double estimateItemBasedRating(User user, Item item)
{
    double itemRating = item.getAverageRating();
    int itemId = item.getId();
    int userId = user.getId();
    double itemAvgRating = item.getAverageRating();
    double weightedDeltaSum = 0.0;
    int sumN = 0;
    // check if the user has already rated the item
    Rating existingRatingByUser = user.getItemRating(item.getId());
    if (existingRatingByUser != null)
    {
        itemRating = existingRatingByUser.getRating();
    }
    else
    {
        double similarityBetweenItems = 0;
        double weightedDelta = 0;
        double delta = 0;
        for (Item anotherItem : dataSet.getItems())
        {
            // only consider items that were rated by the user
            Rating anotherItemRating = anotherItem.getUserRating(userId);
            if (anotherItemRating != null)
            {
                // Renormalize again: deviation of the user's rating from this item's average
                delta = itemAvgRating - anotherItemRating.getRating();
                // Look up the similarity between the two items
                similarityBetweenItems = itemSimilarityMatrix.getValue(itemId, anotherItem.getId());
                if (Math.abs(similarityBetweenItems) > similarityThreshold)
                {
                    // similarity * normalized rating
                    weightedDelta = similarityBetweenItems * delta;
                    weightedDeltaSum += weightedDelta;
                    sumN++;
                }
            }
        }
        if (sumN > 0)
        {
            // Average rating minus the mean weighted deviation
            itemRating = itemAvgRating - (weightedDeltaSum / sumN);
        }
    }
    return itemRating;
}
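For an item the user has not rated, the method above computes itemAvgRating - (1/N) * sum(similarity_j * (itemAvgRating - rating_j)) over the N rated items whose absolute similarity exceeds the threshold. Note that it divides by the count N, not by the sum of the similarity weights. The sketch below isolates that arithmetic with plain arrays; the class ItemBasedEstimate, its parameters and the sample numbers are hypothetical and only illustrate the formula.

```java
public class ItemBasedEstimate {
    // itemAvg:      average rating of the item being estimated
    // otherRatings: the user's ratings of other items
    // sims:         similarity of each of those items to the target item
    // threshold:    neighbors with |similarity| <= threshold are ignored
    public static double estimate(double itemAvg, double[] otherRatings,
                                  double[] sims, double threshold) {
        if (otherRatings.length != sims.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double weightedDeltaSum = 0.0;
        int sumN = 0;
        for (int i = 0; i < otherRatings.length; i++) {
            if (Math.abs(sims[i]) > threshold) {
                weightedDeltaSum += sims[i] * (itemAvg - otherRatings[i]);
                sumN++;
            }
        }
        // With no qualifying neighbor, fall back to the item's average rating.
        return sumN > 0 ? itemAvg - weightedDeltaSum / sumN : itemAvg;
    }

    public static void main(String[] args) {
        // Deltas are 3-5 = -2, 3-4 = -1 and 3-1 = 2; the third neighbor is
        // dropped because 0.25 does not exceed the threshold of 0.4.
        // Weighted sum = 1.0*(-2) + 0.5*(-1) = -2.5, mean = -1.25,
        // so the estimate is 3 - (-1.25) = 4.25.
        double r = estimate(3.0, new double[]{5, 4, 1}, new double[]{1.0, 0.5, 0.25}, 0.4);
        System.out.println(r);
    }
}
```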