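A recommender uses the ratings a user has given to some music to predict how that user would rate other music, and then recommends music the user is likely to enjoy. Three concepts are central: users, items, and ratings.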
3.1.1 The concepts of distance and similarity
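Distance measures how much two users' ratings differ. The Euclidean distance is the usual choice: take the difference between each pair of ratings, square the differences, sum the squares, and take the square root of the sum.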
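To make the definition concrete, here is a minimal sketch of the Euclidean distance between two rating vectors. The class name `EuclideanDistanceDemo`, the method name, and the sample ratings are illustrative assumptions, not part of the original code.

```java
public class EuclideanDistanceDemo {

    /**
     * Euclidean distance between two rating arrays of equal length.
     * Position i in both arrays is assumed to refer to the same item.
     */
    public static double euclideanDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff; // square each difference and accumulate
        }
        return Math.sqrt(sumOfSquares);  // square root of the sum
    }

    public static void main(String[] args) {
        // Hypothetical ratings of the same three songs by two users
        double[] first = {5, 3, 4};
        double[] second = {3, 2, 2};
        // Differences 2, 1, 2 -> squares 4, 1, 4 -> sum 9 -> square root 3
        System.out.println(euclideanDistance(first, second)); // prints 3.0
    }
}
```

A distance of 0 means the two users rated every shared item identically; larger values mean stronger disagreement.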
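Similarity expresses how alike two users are. It is computed as sim = 1 - tanh(sqrt(sum of squared rating differences / number of common items)), so identical ratings give a similarity of 1 and the value approaches 0 as the ratings diverge. An improved formula also accounts for how much the two users overlap: sim_adjusted = sim * (number of common items / maximum possible number of common items), where the maximum possible number of common items is the smaller of the two users' rated-item counts.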
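The two formulas can be sketched as standalone functions, as below. This is an illustration under stated assumptions: the class `SimilarityDemo`, its method names, and the sample numbers are hypothetical, and the inputs (sum of squared differences, item counts) are assumed to have been computed beforehand.

```java
public class SimilarityDemo {

    /** sim = 1 - tanh(sqrt(sumOfSquaredDiffs / commonItems)); 0 if nothing is shared. */
    public static double basicSimilarity(double sumOfSquaredDiffs, int commonItems) {
        if (commonItems <= 0) {
            return 0.0; // no common items: similarity cannot be judged
        }
        double rootMeanSquare = Math.sqrt(sumOfSquaredDiffs / commonItems);
        return 1.0 - Math.tanh(rootMeanSquare);
    }

    /** Basic similarity scaled by commonItems / min(itemsRatedByA, itemsRatedByB). */
    public static double adjustedSimilarity(double sumOfSquaredDiffs, int commonItems,
                                            int itemsRatedByA, int itemsRatedByB) {
        int maxCommonItems = Math.min(itemsRatedByA, itemsRatedByB);
        if (commonItems <= 0 || maxCommonItems <= 0) {
            return 0.0;
        }
        double overlapRatio = (double) commonItems / maxCommonItems;
        return basicSimilarity(sumOfSquaredDiffs, commonItems) * overlapRatio;
    }

    public static void main(String[] args) {
        // Two users agree exactly on their 2 common items (sum of squares = 0)
        System.out.println(basicSimilarity(0.0, 2));           // prints 1.0
        // One user rated 4 items, the other 10, so at most 4 could be shared
        System.out.println(adjustedSimilarity(0.0, 2, 4, 10)); // prints 0.5
    }
}
```

The example shows why the adjustment matters: perfect agreement on only 2 of 4 possible shared items is scored 0.5 rather than 1.0.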
3.1.2 A closer look at the similarity calculation
The method below implements the two formulas given above.
/**
 * Computes the similarity between this user and another user.
 *
 * @param u       the user to compare against
 * @param simType which similarity to use (0 = basic, 1 = adjusted for overlap)
 * @return the similarity, between 0 and 1
 */
public double getSimilarity(MusicUser u, int simType) {

    double sim = 0.0d;
    int commonItems = 0;

    /**
     * TODO: 3.1 -- Types of similarity (Book section 3.1.2)
     *
     * In the following switch, we include two types of similarity.
     * You can extend the functionality of this method by adding more
     * types. For example, the Jaccard similarity could be defined as
     * the ratio of the intersection over the union of the items between
     * two users. In other words,
     *                                 Number of songs in common
     * Jaccard Similarity = -------------------------------------------
     *                       Number of all songs listened by either user
     *
     * Are more complicated similarity metrics more accurate?
     */
    switch (simType) {

    case 0:
        for (Rating r : this.ratingsByItemId.values()) {
            // Find all the items that both users have rated
            for (Rating r2 : u.ratingsByItemId.values()) {
                if (r.getItemId() == r2.getItemId()) {
                    commonItems++;
                    // Sum the squares of the rating differences
                    sim += Math.pow((r.getRating() - r2.getRating()), 2);
                }
            }
        }

        // If there are no common items, we cannot tell whether
        // the users are similar or not. So, we let it return 0.
        if (commonItems > 0) {
            // This is the RMSE, which is more like the distance
            sim = Math.sqrt(sim / commonItems);

            // Similarity should be between 0 and 1.
            // For the value 0, the two users are as dissimilar as they come.
            // For the value 1, their preferences (based on the available data) are identical.
            //
            // Here is a function that accomplishes exactly that
            sim = 1.0d - Math.tanh(sim);
        }
        break;

    // ---------------------------------------------------------
    case 1:
        for (Rating r : this.ratingsByItemId.values()) {
            for (Rating r2 : u.ratingsByItemId.values()) {
                // Find the same item
                if (r.getItemId() == r2.getItemId()) {
                    commonItems++;
                    sim += Math.pow((r.getRating() - r2.getRating()), 2);
                }
            }
        }

        // If there are no common items, we cannot tell whether
        // the users are similar or not. So, we let it return 0.
        if (commonItems > 0) {
            // Same as before (case 0)
            sim = Math.sqrt(sim / commonItems);
            sim = 1.0d - Math.tanh(sim);

            // However, the above calculation takes into account only the common items.
            // It does not account for the number of items that the users could have in common.
            // This is the maximum number of items that the two users can have in common.
            int maxCommonItems = Math.min(this.ratingsByItemId.size(),
                                          u.ratingsByItemId.size());

            // Adjust the similarity to account for the importance of the common items
            // through the ratio of the common items over the number of all possible common items
            sim = sim * ((double) commonItems / (double) maxCommonItems);
        }
        break;
    }

    // Let us know what it is
    System.out.print("\n"); // Just for pretty printing in the Shell
    System.out.print(" User Similarity between");
    System.out.print(" " + this.getName());
    System.out.print(" and " + u.getName());
    System.out.println(" is equal to " + sim);
    System.out.print("\n"); // Just for pretty printing in the Shell

    return sim;
}
3.1.3 Which similarity formula is best?
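Besides the formulas above, the Jaccard formula can also be used: Jaccard = size of the intersection / size of the union of the two users' items. Some studies reportedly find that the Euclidean-based similarity works best.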
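As an illustration of the Jaccard formula, the sketch below computes it over two sets of item IDs. The class `JaccardDemo`, its method, and the sample IDs are assumptions made for this example; returning 0 when both sets are empty is one possible convention, not something the original specifies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardDemo {

    /** Jaccard similarity: |A intersect B| / |A union B| over two sets of item IDs. */
    public static double jaccard(Set<Integer> itemsA, Set<Integer> itemsB) {
        if (itemsA.isEmpty() && itemsB.isEmpty()) {
            return 0.0; // assumed convention: nothing rated means nothing in common
        }
        // Copy before retainAll so the caller's set is not modified
        Set<Integer> intersection = new HashSet<>(itemsA);
        intersection.retainAll(itemsB);
        int unionSize = itemsA.size() + itemsB.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Hypothetical item IDs rated by each user
        Set<Integer> first = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> second = new HashSet<>(Arrays.asList(2, 3, 4, 5));
        // Intersection {2, 3} has 2 items; union {1, 2, 3, 4, 5} has 5
        System.out.println(jaccard(first, second)); // prints 0.4
    }
}
```

Unlike the Euclidean-based similarity, this measure ignores the rating values themselves and looks only at which items the two users have in common.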