
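The idea is to use the ratings a user has given to some music to predict how they would rate other music, and then recommend music they are likely to enjoy. Three concepts matter here: users, items, and ratings.
3.1.1 The concepts of distance and similarity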
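Distance measures how much two users' ratings differ. The Euclidean distance is the usual choice: for each item both users have rated, take the difference between the two ratings, square it, sum the squares, and take the square root of the sum.
Similarity measures how alike two users are. It is computed as sim = 1 - tanh(sqrt(sum of squared differences / number of common items)), which yields 1 when the ratings on the common items are identical and approaches 0 as they diverge. The improved formula is sim = sim * number of common items / maximum possible number of common items, where the maximum possible number of common items is the smaller of the two users' rated-item counts.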
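As a minimal sketch of the two formulas above, the following stand-alone Java class computes both variants from two maps of item ID to rating. The class `SimilaritySketch` and its method names are hypothetical and are not part of the book's code; it assumes ratings are plain doubles keyed by integer item IDs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the book's MusicUser class.
class SimilaritySketch {

    // Basic formula: 1 - tanh(sqrt(sumOfSquaredDifferences / commonItems)).
    static double similarity(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sumSq = 0.0;
        int common = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey()); // null when b has not rated this item
            if (other != null) {
                double diff = e.getValue() - other;
                sumSq += diff * diff;
                common++;
            }
        }
        if (common == 0) {
            return 0.0; // no shared items: similarity is treated as 0
        }
        return 1.0 - Math.tanh(Math.sqrt(sumSq / common));
    }

    // Improved formula: basic similarity scaled by commonItems / min(|a|, |b|).
    static double weightedSimilarity(Map<Integer, Double> a, Map<Integer, Double> b) {
        int common = 0;
        for (Integer id : a.keySet()) {
            if (b.containsKey(id)) {
                common++;
            }
        }
        if (common == 0) {
            return 0.0;
        }
        int maxCommon = Math.min(a.size(), b.size());
        return similarity(a, b) * common / maxCommon;
    }

    public static void main(String[] args) {
        Map<Integer, Double> alice = new HashMap<>();
        alice.put(1, 5.0);
        alice.put(2, 3.0);
        alice.put(3, 4.0);

        Map<Integer, Double> bob = new HashMap<>();
        bob.put(1, 5.0);
        bob.put(2, 3.0);
        bob.put(4, 1.0);

        // Items 1 and 2 are shared and rated identically.
        System.out.println(similarity(alice, bob));         // prints 1.0
        System.out.println(weightedSimilarity(alice, bob)); // 2 of 3 possible items, roughly 0.667
    }
}
```

In this example the two users agree perfectly on the items they share, so the basic formula gives 1.0, while the improved formula lowers the score because they share only two of a possible three items.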
3.1.2 A closer look at computing similarity
The code follows the formulas given in 3.1.1.
/**
 * Computes the similarity between this user and another user.
 *
 * @param u the user to compare against
 * @param simType which similarity measure to use (0 or 1)
 * @return the similarity, a value between 0 and 1
 */
public double getSimilarity(MusicUser u, int simType)
{
    double sim = 0.0d;
    int commonItems = 0;

    /**
     * TODO: 3.1 -- Types of similarity (Book section 3.1.2)
     *
     * In the following switch, we include two types of similarity.
     * You can extend the functionality of this method by adding more
     * types. For example, the Jaccard similarity could be defined as
     * the ratio of the intersection over the union of the items between
     * two users. In other words,
     *
     *                          Number of songs in common
     * Jaccard Similarity = ------------------------------------------------
     *                      Number of all songs listened to by either user
     *
     * Are more complicated similarity metrics more accurate?
     */
    switch (simType)
    {
        case 0:
            for (Rating r : this.ratingsByItemId.values())
            {
                // Find all the items the two users have in common
                for (Rating r2 : u.ratingsByItemId.values())
                {
                    // Find the same item
                    if (r.getItemId() == r2.getItemId())
                    {
                        commonItems++;
                        // Sum the squared rating differences
                        sim += Math.pow((r.getRating() - r2.getRating()), 2);
                    }
                }
            }

            // If there are no common items, we cannot tell whether
            // the users are similar or not. So, we let it return 0.
            if (commonItems > 0)
            {
                // This is the RMSE, which is more like the distance
                sim = Math.sqrt(sim / commonItems);

                // Similarity should be between 0 and 1
                // For the value 0, the two users are as dissimilar as they come
                // For the value 1, their preferences (based on the available data) are identical.
                //
                // Here is a function that accomplishes exactly that
                sim = 1.0d - Math.tanh(sim);
            }
            break;

        // ---------------------------------------------------------
        case 1:
            for (Rating r : this.ratingsByItemId.values())
            {
                for (Rating r2 : u.ratingsByItemId.values())
                {
                    // Find the same item
                    if (r.getItemId() == r2.getItemId())
                    {
                        commonItems++;
                        sim += Math.pow((r.getRating() - r2.getRating()), 2);
                    }
                }
            }

            // If there are no common items, we cannot tell whether
            // the users are similar or not. So, we let it return 0.
            if (commonItems > 0)
            {
                // Same as before (case 0)
                sim = Math.sqrt(sim / commonItems);

                // Similarity should be between 0 and 1
                // For the value 0, the two users are as dissimilar as they come
                // For the value 1, their preferences (based on the available data) are identical.
                //
                // Here is a function that accomplishes exactly that
                sim = 1.0d - Math.tanh(sim);

                // However, the above calculation takes into account only the common items.
                // It does not account for the number of items that the users could have in common.
                // So, let us consider the following.

                // This is the maximum number of items that the two users can have in common
                int maxCommonItems = Math.min(this.ratingsByItemId.size(), u.ratingsByItemId.size());

                // Adjust the similarity to account for the importance of the common items
                // through the ratio of the common items over the number of all possible common items
                sim = sim * ((double) commonItems / (double) maxCommonItems);
            }
            break;
    }

    // Let us know what it is
    System.out.print("\n"); // Just for pretty printing in the Shell
    System.out.print(" User Similarity between");
    System.out.print(" " + this.getName());
    System.out.print(" and " + u.getName());
    System.out.println(" is equal to " + sim);
    System.out.print("\n"); // Just for pretty printing in the Shell

    return sim;
}
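3.1.3 Which similarity formula is best?
Besides the formulas above, the Jaccard formula can also be used: Jaccard = size of the intersection / size of the union of the two users' item sets. Some research reportedly finds that the Euclidean-based similarity works best.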
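The Jaccard formula can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class `JaccardSketch` is hypothetical, each user is represented simply as a set of integer item IDs, and returning 0 when both sets are empty is a convention chosen here, not something the source specifies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the book's code.
class JaccardSketch {

    // Jaccard similarity = |intersection| / |union| of the two item sets.
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // assumption: no data means no evidence of similarity
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b); // keep only the items present in both sets
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<Integer> alice = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> bob = new HashSet<>(Arrays.asList(1, 2, 4));

        // Two shared items out of four distinct items overall.
        System.out.println(jaccard(alice, bob)); // prints 0.5
    }
}
```

Note that this measure looks only at which items were rated, not at the rating values, which is one reason it may behave differently from the Euclidean-based formulas.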