This chapter covers
A closer look at user-based recommenders
Similarity metrics
Item-based and other recommenders
We've learned how to evaluate a recommender and how to represent its input data; now let's look at recommenders themselves in detail. The previous chapter briefly introduced the two families of recommenders implemented in Mahout: user-based and item-based. In fact, you already met a user-based recommender in chapter 2. Both families depend on a similarity metric, and there are many ways to measure similarity, including the Pearson correlation, the log-likelihood test, and the Spearman correlation. Finally, you'll also meet other styles of recommender algorithm, including slope-one, singular value decomposition, and cluster-based recommenders.
4.1 Understanding user-based recommenders
The defining characteristic of a user-based recommender algorithm is that it relies on a similarity metric between users. In fact, you probably run into this kind of reasoning every day.
When recommendation goes wrong
Consider this scene: an adult walks into a record store to buy a CD:
ADULT: I'm looking for a CD for a teenager.
EMPLOYEE: What kind of teenager?
ADULT: Oh, you know, like the kids these days.
EMPLOYEE: What kind of music or genre does he like?
ADULT: I'm at a loss; I really don't know.
EMPLOYEE: Well, maybe something like the New 2 Town album, then.
ADULT: I'll take it.
This kind of recommendation often goes wrong, because different teenagers like quite different kinds of music. It makes far more sense to base a recommendation on people who are actually similar to the recipient.
When recommendation goes right
Let's replay the scene, but set it up a bit better:
ADULT: I'm looking for a CD for a teenager.
EMPLOYEE: What kind of music does he like?
ADULT: I don't know, but his best friend is always wearing a certain band's T-shirt.
EMPLOYEE: Ah, that hugely popular new metal band from Cleveland...
That's more like it. The recommendation rests on the assumption that two close friends have similar taste in music, and because that assumption is credible, this prediction is likely to be more accurate.
4.2 Exploring the user-based recommender
The algorithm
A user-based recommender produces recommendations for a user u as follows:

for every item i that u has no preference for yet
  for every other user v that has a preference for i
    compute the similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average
return the top items, ranked by weighted average
Computing this for every item would be very slow. In practice, a neighborhood of most-similar users is computed first, and only items known to those neighbors are considered:

for every other user w
  compute the similarity s between u and w
retain the top users, ranked by similarity, as a neighborhood n of u
for every item i that some user in n has a preference for, but u has no preference for yet
  for every user v in n that has a preference for i
    compute the similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average
return the top items, ranked by weighted average

This is the standard user-based recommender algorithm as implemented in Mahout.
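The running-average step above can be sketched in a few lines of plain Java. This is an illustrative sketch, not Mahout's implementation; the class and method names here are invented for the example:

```java
import java.util.Map;

public class UserBasedEstimate {
    /**
     * Estimates user u's preference for an item as the similarity-weighted
     * average of the neighbors' preferences for it. 'sims' maps each neighbor
     * to its similarity with u; 'prefs' maps each neighbor to that neighbor's
     * preference value for the item in question.
     */
    static double estimate(Map<String, Double> sims, Map<String, Double> prefs) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<String, Double> e : prefs.entrySet()) {
            Double sim = sims.get(e.getKey());
            if (sim == null || sim <= 0.0) {
                continue; // ignore unknown or non-positive similarities
            }
            weighted += sim * e.getValue();
            totalWeight += sim;
        }
        // Undefined when no usable neighbor rated the item
        return totalWeight == 0.0 ? Double.NaN : weighted / totalWeight;
    }
}
```

For example, a neighbor with similarity 1.0 who rated the item 4.0 and a neighbor with similarity 0.5 who rated it 2.0 yield an estimate of (1.0*4.0 + 0.5*2.0) / 1.5 ≈ 3.33.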
Implementing the algorithm with GenericUserBasedRecommender
In chapter 2 we built our first user-based recommender engine. Let's return to that example and explore how each of its components is used.

DataModel model = new FileDataModel(new File("intro.csv"));
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
UserNeighborhood neighborhood =
    new NearestNUserNeighborhood(2, similarity, model);
Recommender recommender =
    new GenericUserBasedRecommender(model, neighborhood, similarity);

UserSimilarity encapsulates the notion of similarity between users, and UserNeighborhood groups the most similar users together; these are the standard components of a user-based recommender. There are many ways to define the neighborhood of most similar users: the most similar 5? 20? Or everyone above some threshold? As an analogy, imagine you're drawing up the guest list for your wedding. You'd like to invite your closest friends and family, but you have many friends and a limited budget, so you must decide how to choose. Cap the list at, say, 50 people? Or 40, or 100? Or invite everyone who counts as a good friend? Or only your truly closest friends? Choosing a neighborhood of similar users involves the same kind of decision. Mahout isn't just a single recommender engine; its components can be plugged in and parameterized to create a recommender tailored to a particular domain.
The components are:
A data model, implemented via DataModel
A user-user similarity metric, implemented via UserSimilarity
A definition of user neighborhood, implemented via UserNeighborhood
A recommender engine, implemented via Recommender (here, GenericUserBasedRecommender)
Exploring with GroupLens
Let's return to the GroupLens data set, this time with a hundred times as much data. From http://grouplens.org, download the MovieLens data set of 10 million ratings, found at http://www.grouplens.org/node/73, and extract ratings.dat from it. Unlike the 100,000-rating data set, whose ua.base file was ready for use with FileDataModel, this file uses a different format. You could convert it to a CSV file, or implement a DataModel for it yourself; fortunately, Mahout already provides GroupLensDataModel, used in the following code:

DataModel model = new GroupLensDataModel(new File("ratings.dat"));
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
UserNeighborhood neighborhood =
    new NearestNUserNeighborhood(100, similarity, model);
Recommender recommender =
    new GenericUserBasedRecommender(model, neighborhood, similarity);
LoadEvaluator.runLoad(recommender);

Remember to increase the JVM's heap size, or it will run out of memory.
Exploring user neighborhoods
Let's evaluate the accuracy of this recommender, using the code in listing 4.3, and then consider how the neighborhood implementation can be configured and modified.
Listing 4.3

DataModel model = new GroupLensDataModel(new File("ratings.dat"));
RecommenderEvaluator evaluator =
    new AverageAbsoluteDifferenceRecommenderEvaluator();
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
  @Override
  public Recommender buildRecommender(DataModel model) throws TasteException {
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(100, similarity, model);
    return new GenericUserBasedRecommender(model, neighborhood, similarity);
  }
};
double score = evaluator.evaluate(recommenderBuilder, null, model, 0.95, 0.05);
System.out.println(score);
The final 0.05 argument to evaluate() means that only 5 percent of the data is used for evaluation. Evaluation is time consuming, and using all of the data would take far too long, so a small value is used for convenience. The 0.95 argument means that 95 percent of the evaluated data is used to build the model, and the remaining 5 percent is held out for testing. The resulting score is about 0.89.
Fixed-size neighborhoods
The recommender above derives its recommendations from a neighborhood of exactly 100 most-similar users (that's what NearestNUserNeighborhood was configured with). What if we used 10 instead? The recommendations would be based on fewer, more-similar users, but would exclude some less-similar users. Try replacing 100 with 10: the evaluation result rises to about 0.98. Larger values are worse here, which suggests that 10 neighbors is too few. Try 500 instead, and the result drops to about 0.75, which is better. You can experiment with other values too.
Threshold-based neighborhoods
If you'd rather not cap the neighborhood at a fixed number of users, you can define it by a similarity threshold instead. Try new ThresholdUserNeighborhood(0.7, similarity, model). Here 0.7 is the similarity threshold, and the evaluation result is about 0.84. Varying the threshold, 0.9 yields 0.92 and 0.5 yields 0.78.
4.3 Exploring similarity metrics
The other crucial component of a user-based recommender is the UserSimilarity implementation; a user-based recommender lives and dies by its similarity metric.
Pearson correlation-based similarity
Let's look at the Pearson correlation-based similarity implementation. The Pearson correlation is a number between -1 and 1, usually written r, that measures the strength of the linear relationship between two variables. If r > 0, the variables are positively correlated: as one grows larger, so does the other. If r < 0, they're negatively correlated: as one grows larger, the other grows smaller. The larger the absolute value of r, the stronger the correlation; note that correlation does not imply causation. If r = 0, the two variables have no linear relationship, though they may still be related in some other way. For more detail, look up the Pearson product-moment correlation coefficient.
This concept is widely used in statistics, and it can also be used to measure the similarity between users, by computing the correlation between two users' preference values. Consider the following data:
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
Notice that users 1 and 5 seem similar, because their preference values run in the same direction. Over items 101, 102, and 103: both rate 101 highest, 102 next, and 103 lowest. By the same reasoning, users 1 and 2 are not similar. Note also that user 1 has no preference values for items 104 to 106; similarity can only be computed over the items both users have rated. From this data we can derive the correlation between user 1 and each of the other users, computed over items 101, 102, and 103, as shown in table 4.1.
Table 4.1 The Pearson correlation between user 1 and each of the other users, based on items 101, 102, and 103

         Item 101   Item 102   Item 103   Correlation to user 1
User 1   5.0        3.0        2.5         1.000
User 2   2.0        2.5        5.0        -0.764
User 3   2.5        -          -           -
User 4   5.0        -          3.0         1.000
User 5   4.0        3.0        2.0         0.945
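The correlations in table 4.1 can be reproduced with a few lines of plain Java. This pearson helper is an illustrative sketch over the co-rated items only, not Mahout's implementation:

```java
public class PearsonExample {
    /** Pearson correlation over two equal-length series of co-rated preferences. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0.0 || varY == 0.0) {
            return Double.NaN; // undefined when either series is constant
        }
        return cov / Math.sqrt(varX * varY);
    }
}
```

With user 1's values {5.0, 3.0, 2.5} and user 5's {4.0, 3.0, 2.0}, this yields about 0.945, and with user 2's {2.0, 2.5, 5.0} it yields about -0.764, matching the table. Note that a single shared item, as with user 3, makes both variances zero, so no value can be computed.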
Problems with the Pearson correlation
Although the results do match intuition, the Pearson correlation has some quirks as a similarity metric.
First, it doesn't take into account the number of items over which two users' preferences overlap, which is arguably a weakness in the context of a recommender engine. Two users who have watched 200 of the same movies are probably more similar, even if they often disagree on ratings, than two users who have watched only 2 in common. Yet as the table shows, user 1 and user 5 have similar ratings for all 3 items they share and seem to have similar taste, while user 1 and user 4, overlapping on only 2 items, come out with a higher correlation of 1.0. That seems counterintuitive.
Second, if two users overlap on only one item, no correlation can be computed at all. This is why no value appears between users 1 and 3, and it's a problem for small or sparse data sets.
Finally, the correlation is also undefined if either user's preference values are all identical. For example, if user 5 had rated all 3 items 3.0, no correlation between user 5 and user 1 could be computed.
Employing weighting
PearsonCorrelationSimilarity offers an extension to the standard computation, called weighting, which mitigates the first problem: rather than using the correlation value directly, it adjusts the result to reflect how many items the two users have in common, pushing the correlation toward 1.0 or -1.0 when it's based on more items.
Defining similarity by Euclidean distance
Let's rerun the evaluation of listing 4.3 with EuclideanDistanceSimilarity instead. This implementation is based on the distance between users. Picture each user as a point in a space whose dimensions are the items, with coordinates equal to the preference values. This similarity metric computes the Euclidean distance d between two such points; a larger distance means less similar users. Because larger similarity values should mean more similar, the result is returned as 1/(1+d), which grows toward 1 as the distance shrinks to 0. Table 4.2 shows the Euclidean distances and the resulting similarity values.
Table 4.2 The Euclidean distance between user 1 and each of the other users, and the resulting similarity scores

         Item 101   Item 102   Item 103   Distance   Similarity to user 1
User 1   5.0        3.0        2.5        0.000      1.000
User 2   2.0        2.5        5.0        3.937      0.203
User 3   2.5        -          -          2.500      0.286
User 4   5.0        -          3.0        0.500      0.667
User 5   4.0        3.0        2.0        1.118      0.472
Rerun listing 4.3 using EuclideanDistanceSimilarity, and the result is 0.75, a bit better. Notice also that a similarity between users 1 and 3 can now be computed.
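The distances and similarities in table 4.2 follow directly from the 1/(1+d) formula. A minimal sketch over the co-rated items (illustrative, not Mahout's exact implementation, which also normalizes for the number of dimensions):

```java
public class EuclideanExample {
    /** Similarity as 1/(1+d), where d is the Euclidean distance over co-rated items. */
    static double similarity(double[] x, double[] y) {
        double sumSq = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sumSq += diff * diff;
        }
        double d = Math.sqrt(sumSq);
        return 1.0 / (1.0 + d); // larger values mean more similar
    }
}
```

For users 1 and 2 this gives 1/(1+3.937) ≈ 0.203, and for users 1 and 3, who share only item 101, it still produces a value: 1/(1+2.5) ≈ 0.286.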
Defining similarity with the cosine measure
The cosine measure similarity is another similarity metric that depends on envisioning user preferences as points in space. Hold in mind the image of user preferences as points in an n-dimensional space. Now imagine two lines from the origin, or point (0,0,...,0), to each of these two points. When two users are similar, they'll have similar ratings, and so will be relatively close in space; at least, they'll be in roughly the same direction from the origin, and the angle formed between these two lines will be relatively small. In contrast, when the two users are dissimilar, their points will be distant, and likely in different directions from the origin, forming a wide angle.
This angle can be used as the basis for a similarity metric in the same way that the Euclidean distance was used to form a similarity metric. In this case, the cosine of the angle leads to a similarity value. If you're rusty on trigonometry, all you need to remember to understand this is that the cosine value is always between -1 and 1: the cosine of a small angle is near 1, and the cosine of a large angle near 180 degrees is close to -1. This is good, because small angles should map to high similarity, near 1, and large angles should map to near -1.
You may be searching for something like CosineMeasureSimilarity in Mahout, but you've already found it, under an unexpected name: PearsonCorrelationSimilarity. The cosine measure similarity and Pearson correlation aren't the same thing, but if you bother to work out the math, you'll discover that they actually reduce to the same computation when the two series of input values each have a mean of 0 (centered). The Mahout implementation centers the input, so the two are the same. The cosine measure similarity is commonly referenced in research on collaborative filtering. You can employ this similarity metric by simply using PearsonCorrelationSimilarity.
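You can verify this equivalence numerically: mean-center two preference series, and their cosine matches the Pearson correlation (0.945 for users 1 and 5 in table 4.1). A sketch, with illustrative helper names:

```java
public class CosineCenteringExample {
    /** Cosine of the angle between two vectors: dot product over the product of norms. */
    static double cosine(double[] x, double[] y) {
        double dot = 0.0, nx = 0.0, ny = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        return dot / Math.sqrt(nx * ny);
    }

    /** Subtracts the mean from every component, giving the vector a mean of 0. */
    static double[] center(double[] x) {
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;
        double[] c = new double[x.length];
        for (int i = 0; i < x.length; i++) c[i] = x[i] - mean;
        return c;
    }
}
```

Centering user 1's {5.0, 3.0, 2.5} and user 5's {4.0, 3.0, 2.0} and taking the cosine reproduces the Pearson value of about 0.945.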
Defining similarity by relative rank with the Spearman correlation
The Spearman correlation is an interesting variant on the Pearson correlation, for our purposes. Rather than compute a correlation based on the original preference values, it computes a correlation based on the relative rank of preference values. Imagine that, for each user, their least-preferred item's preference value is overwritten with a 1. Then the next-least-preferred item's preference value is changed to 2, and so on. To illustrate this, imagine that you were rating movies and gave your least-preferred movie one star, the next-least favorite two stars, and so on. Then, a Pearson correlation is computed on the transformed values. This is the Spearman correlation.
This process loses some information. Although it preserves the essence of the preference values (their ordering) it removes information about exactly how much more each item was liked than the last. This may or may not be a good idea; it's somewhere between keeping preference values and forgetting them entirely, two possibilities that we've explored before.
Table 4.3 shows the resulting Spearman correlations. Its simplifications on this already-simple data set result in some extreme values: in fact, all correlations are 1 or -1 here, depending on whether a user's preference values run with or counter to user 1's preferences. As with the Pearson correlation, no value can be computed between users 1 and 3.
SpearmanCorrelationSimilarity implements this idea. You could try using this as the UserSimilarity in the evaluator code from before. Run it, and take a long coffee break. Turn in for the night. It won't finish anytime soon. This implementation is far slower because it must do some nontrivial work to compute and store these ranks. The Spearman correlation-based similarity metric is expensive to compute, and is therefore possibly more of academic interest than practical use. For some small data sets, though, it may be desirable.
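The rank transformation itself is simple. The following sketch is illustrative (it ignores ties, which a real implementation must average); it converts preference values to 1-based ranks, after which an ordinary Pearson correlation is applied:

```java
import java.util.Arrays;

public class SpearmanSketch {
    /**
     * Replaces each preference value with its 1-based rank.
     * Ties are not handled here; a real implementation must average tied ranks.
     */
    static double[] toRanks(double[] prefs) {
        double[] sorted = prefs.clone();
        Arrays.sort(sorted);
        double[] ranks = new double[prefs.length];
        for (int i = 0; i < prefs.length; i++) {
            // Position in the sorted order, converted to a 1-based rank
            ranks[i] = Arrays.binarySearch(sorted, prefs[i]) + 1;
        }
        return ranks;
    }

    /** Ordinary Pearson correlation, applied to the ranks. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }
}
```

User 1's {5.0, 3.0, 2.5} becomes ranks {3, 2, 1} and user 2's {2.0, 2.5, 5.0} becomes {1, 2, 3}; the Pearson correlation of the ranks is exactly -1, as in table 4.3.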
This is a fine time to introduce one of many caching wrapper implementations available in Mahout. CachingUserSimilarity is a UserSimilarity implementation that wraps another UserSimilarity implementation and caches its results. That is, it delegates computation to another, given implementation, and remembers those results internally. Later, when it's asked for a user-user similarity value that was previously computed, it can answer immediately rather than ask the given implementation to compute it again. In this way, you can add caching to any similarity implementation. When the cost of performing a computation is relatively high, as here, it can be worthwhile to employ. The cost, of course, is the memory consumed by the cache. Instead of just using SpearmanCorrelationSimilarity, try using the implementation shown in listing 4.5.

UserSimilarity similarity = new CachingUserSimilarity(
    new SpearmanCorrelationSimilarity(model), model);
It’s also advisable to decrease the amount of test data from 5 percent to 1 percent by
increasing the trainingPercentageargument to evaluate()from 0.95 to 0.99. It
would also be wise to decrease the evaluation percentage from 5 percent to 1 percent
by changing the last parameter from 0.05 to 0.01. This will allow the evaluation to fin-
ish in more like tens of minutes.
The result should be near0.80. Again, broad conclusions are difficult to draw: on
this particular data set, it was not quite as effective as other similarity metrics.
Table 4.3 The preference values transformed into ranks, and the resulting Spearman correlation between user 1 and each of the other users

         Item 101   Item 102   Item 103   Correlation to user 1
User 1   3.0        2.0        1.0         1.0
User 2   1.0        2.0        3.0        -1.0
User 3   1.0        -          -           -
User 4   2.0        -          1.0         1.0
User 5   3.0        2.0        1.0         1.0

Listing 4.5 Employing caching with a UserSimilarity implementation
Ignoring preference values with the Tanimoto coefficient
Interestingly, there are also UserSimilarity implementations that ignore preference values entirely. They don't care whether a user expresses a high or low preference for an item, only that the user expresses a preference at all. As we discussed in chapter 3, it doesn't necessarily hurt to ignore preference values.
TanimotoCoefficientSimilarity is one such implementation, based on (surprise) the Tanimoto coefficient. This value is also known as the Jaccard coefficient. It's the number of items that two users express some preference for, divided by the number of items that either user expresses some preference for, as illustrated in figure 4.3.
In other words, it's the ratio of the size of the intersection to the size of the union of their preferred items. It has the required properties: when two users' items completely overlap, the result is 1.0. When they have nothing in common, it's 0.0. The value is never negative, but that's OK. It's possible to expand the results into the range -1 to 1 with some simple math: similarity = 2 * similarity - 1. It won't matter to the framework.
Table 4.4 shows the value of the Tanimoto coefficient similarity between user 1 and the other users. Note that this similarity metric doesn't depend only on the items that both users have some preference for, but on the items that either user has some preference for. Hence, all seven items appear in the calculation, unlike before.
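A sketch of the computation over plain item-ID sets (illustrative, not Mahout's code) reproduces the values in table 4.4:

```java
import java.util.HashSet;
import java.util.Set;

public class TanimotoExample {
    /** Ratio of intersection size to union size of the two users' preferred-item sets. */
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Long> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }
}
```

User 1 prefers {101, 102, 103} and user 2 prefers {101, 102, 103, 104}: the intersection has 3 items and the union 4, so the similarity is 0.75. For user 3's {101, 104, 105, 107}, only item 101 is shared out of 6 distinct items, giving about 0.17.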
You’re likely to use this similarity metricif, and only if, your underlying data con-tains only Boolean preferences and you haveno preference values to begin with. If
Table 4.4 The similarity values between user 1 and other users, computed using the
Tanimoto coefficient. Note that preference values themselves are omitted, because they
aren’t used in the computation.
Item
101
Item
102
Item
103
Item
104
Item
105
Item
106
Item
107
Similarity
to user 1
User 1 XXX 1.0 User 2 XXXX 0.75 User 3 X X X X 0.17
User 4 XXXX 0.4 User 5 XXXXXX 0.5
you do have preference values, you might use this metric because you believe there’s
more signal than noise in the values. You would usually do better with a metric that
uses the preference information. In the GroupLens data set, using the Tanimoto coef-ficient gives a slightly worse score of 0.82.
Computing smarter similarity with a log-likelihood test
Log-likelihood-based similarity is similar to the Tanimoto coefficient-based similarity, though it's more difficult to understand intuitively. It's another metric that doesn't take account of individual preference values. The math involved in computing this similarity metric is beyond the scope of this book to explain. Like the Tanimoto coefficient, it's based on the number of items in common between two users, but its value is more an expression of how unlikely it is for two users to have so much overlap, given the total number of items out there and the number of items each user has a preference for.
To illustrate, consider two movie fans who have each seen and rated several movies but have only both seen Star Wars and Casablanca. Are they similar? If they've each seen hundreds of movies, it wouldn't mean much. Many people have seen these movies, and if these two have seen many movies but only managed to overlap in these two, they're probably not similar. On the other hand, if each user has seen just a few movies, and these two were on both users' lists, it would seem to imply that they're similar people when it comes to movies; the overlap would be significant.
The Tanimoto coefficient already encapsulates some of this thinking, because it looks at the ratio of the size of the intersection of their interests to the union. The log-likelihood is computing something slightly different. It's trying to assess how unlikely it is that the overlap between the two users is just due to chance. That is to say, two dissimilar users will no doubt happen to rate a couple movies in common, but two similar users will show an overlap that looks quite unlikely to be mere chance. With some statistical tests, this similarity metric attempts to determine just how strongly unlikely it is that two users have no resemblance in their tastes; the more unlikely, the more similar the two should be. The resulting similarity value may be interpreted as a probability that the overlap isn't due to chance.
These similarity values are shown for our small data set in table 4.5. As you can see, user 1's similarity with user 1, or any other user with identical preferred items, isn't 1.0 under this measure. Using a log-likelihood-based similarity metric is as easy as inserting new LogLikelihoodSimilarity in listing 4.3, as before.

Table 4.5 The similarity values between user 1 and other users, computed using the log-likelihood similarity metric

         Item 101  Item 102  Item 103  Item 104  Item 105  Item 106  Item 107  Similarity to user 1
User 1   X         X         X                                                 0.90
User 2   X         X         X         X                                       0.84
User 3   X                             X         X                   X         0.55
User 4   X                   X         X                   X                   0.16
User 5   X         X         X         X         X         X                   0.55

Although it's hard to generalize, log-likelihood-based similarity will probably outperform Tanimoto coefficient-based similarity. It is, in a sense, a more intelligent metric. Rerunning the evaluation shows that, at least for this data set and recommender, it improves performance over TanimotoCoefficientSimilarity, to 0.73.
Inferring preferences
Sometimes too little data is a problem. In a few cases, for example, the Pearson correlation was unable to compute any similarity value at all because some pairs of users overlap in only one item. The Pearson correlation also can't take account of preference values for items for which only one user has expressed a preference.
What if a default value were filled in for all the missing data points? For example, the framework could pretend that each user has rated every item by inferring preferences for items for which the user hasn't explicitly expressed a preference. This sort of strategy is enabled via the PreferenceInferrer interface, which currently has one implementation, AveragingPreferenceInferrer. This implementation computes the average preference value for each user and fills in this average as the preference value for any item not already associated with the user. It can be enabled on a UserSimilarity implementation with a call to setPreferenceInferrer().
Although this strategy is available, it's not usually helpful in practice. It's provided primarily because it's mentioned in early research papers on recommender engines. In theory, making up information purely based on existing information isn't adding anything, but it certainly does slow down computations drastically. It's available for experimentation but will likely not be useful when applied to real data sets.
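The idea behind AveragingPreferenceInferrer is just a per-user mean. A sketch with illustrative names, not Mahout's code:

```java
import java.util.Map;

public class AveragingInferrerSketch {
    /** Infers a preference for any unrated item as the user's mean preference value. */
    static double inferPreference(Map<Long, Double> userPrefs) {
        double sum = 0.0;
        for (double v : userPrefs.values()) {
            sum += v;
        }
        return sum / userPrefs.size();
    }
}
```

For user 1's preferences {101: 5.0, 102: 3.0, 103: 2.5}, every unrated item (104, 105, and so on) would be inferred as 3.5, which is exactly why the strategy adds no real information.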
You’ve now seen all of the user-user similarity metrics that are implemented within
Mahout. This knowledge pays double dividends, because the very same implementa-tions in Mahout also provide an analogous notion of item-item similarity. That is, the
same computations can be applied to define how similar items are, not just users. This
is just in time, because the notion of similarity is likewise the basis of the next style of
Mahout recommender implementation you’ll meet, the item-based recommender.
4.4 Item-based recommenders
We've looked at user-based recommenders in Mahout: not one recommender, but tools to build a nearly limitless number of variations on the basic user-based approach, by plugging different and differently configured components into the implementation.
It's natural to look next at item-based recommenders. This section will be shorter, because several of the components in the preceding discussion (data models, similarity implementations) also apply to item-based recommenders.
Item-based recommendation is derived from how similar items are to items, instead of users to users. In Mahout, this means they're based on an ItemSimilarity implementation instead of UserSimilarity. To illustrate, let's return to the pair we left in the music store, doing their best to pick an album that a teenage boy would like. Imagine yet another line of reasoning they could have adopted:
ADULT: I’m looking for a CD for a teenage boy.
EMPLOYEE: What kind of music or bands does he like?
ADULT: He wears a Bowling In Hades T-shirt all the time and seems to have all of
their albums. Anything else you’d recommend?
EMPLOYEE: Well, about everyone I know that likes Bowling In Hades seems to like the new
Rock Mobster album.
This sounds reasonable. It's different from previous examples, too. The record store employee is recommending an item that's similar to something they already know the boy likes. This isn't the same as before, where the question was, "Who is similar to the boy, and what do they like?" Here the question is, "What is similar to what the boy likes?"
Figure 4.4 attempts to illustrate this difference, which is the essential difference
between user-based and item-based recommenders. They go by different paths to find
recommended items: via similar users, and via similar items, respectively.
The algorithm
The algorithm will feel familiar to you since you have seen user-based recommenders already. This is also how the algorithm is implemented in Mahout.

for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average

The third line shows how it's based on item-item similarities, not user-user similarities as before. The algorithms are similar, but not entirely symmetric. They do have notably different properties. For instance, the running time of an item-based recommender scales up as the number of items increases, whereas a user-based recommender's running time goes up as the number of users increases.
This suggests one reason that you might choose an item-based recommender: if the number of items is relatively low compared to the number of users, the performance advantage could be significant.
Also, items are typically less subject to change than users. When items are things like DVDs, it's reasonable to expect that over time, as you acquire more data, estimates of the similarities between items converge. There is no reason to expect them to change radically or frequently. Some of the same may be said of users, but users can change over time and new knowledge of users is likely to come in bursts of new information that must be digested quickly. To connect this to the last example, it's likely that Bowling In Hades albums and Rock Mobster albums will remain as similar to each other next year as today. But it's a lot less likely that the same fans mentioned above will have the same tastes next year, and so their similarities will change more.
If item-item similarities are more fixed, they're better candidates for precomputation. Precomputing similarities takes work, but it speeds up recommendations at runtime. This could be desirable in contexts where delivering recommendations quickly at runtime is essential; think about a news site that must potentially deliver recommendations immediately with each news article view.
In Mahout, the GenericItemSimilarity class can be used to precompute and store the results from any ItemSimilarity. It can be applied to any of the implementations you have seen, and can be added to the code snippets that follow, if you wish.
Exploring the item-based recommender
Let's insert a simple item-based recommender into our familiar evaluation framework, using the following code. It deploys a GenericItemBasedRecommender rather than GenericUserBasedRecommender, and it requires a different and simpler set of dependencies.
public Recommender buildRecommender(DataModel model)
throws TasteException {
ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
return new GenericItemBasedRecommender(model, similarity);
}
PearsonCorrelationSimilarity still works here because it also implements the ItemSimilarity interface, which is entirely analogous to the UserSimilarity interface. It implements the same notion of similarity, based on the Pearson correlation, but between items instead of users. That is, it compares series of preferences expressed by many users, for one item, rather than by one user for many items.
GenericItemBasedRecommender is simpler. It only needs a DataModel and ItemSimilarity; there's no ItemNeighborhood. You might wonder at the apparent asymmetry. Recall that the item-based recommendation process already begins with a limited number of starting points: the items that the user in question already expresses a preference for. This is analogous to the neighborhood of similar users that the user-based approach first identifies. It doesn't make sense in the second half of the algorithm to compute neighborhoods around each of the user's preferred items.
You're invited to experiment with different similarity metrics, as we did for users. Not all of the implementations of UserSimilarity also implement ItemSimilarity. You already know how to evaluate the accuracy of this item-based recommender when using various similarity metrics on the now-familiar GroupLens data set. The results are reproduced for convenience in table 4.6.
One thing you may notice is that this recommender setup runs significantly faster. That's not surprising, given that the data set has about 70,000 users and 10,000 items. Item-based recommenders are generally faster when there are fewer items than users. You may, as a result, wish to increase the percentage of data used in the evaluation to 20 percent or so (pass 0.2 as the final argument to evaluate()). This should result in a more reliable evaluation. Note that there's little apparent difference among these implementations on this data set.
Table 4.6 Evaluation results for the item-based recommender with various similarity metrics

Implementation                              Score
PearsonCorrelationSimilarity                0.75
PearsonCorrelationSimilarity + weighting    0.75
EuclideanDistanceSimilarity                 0.76
EuclideanDistanceSimilarity + weighting     0.78
TanimotoCoefficientSimilarity               0.77
LogLikelihoodSimilarity                     0.77
4.5 Slope-one recommender
Did you like the movie Carlito's Way? Most people who liked this movie, it seems, also liked another film starring Al Pacino, Scarface. But people tend to like Scarface a bit more. We'd imagine most people who think of Carlito's Way as a four-star movie would give Scarface five stars. So if you told us you thought Carlito's Way was a three-star movie, we might guess you'd give Scarface four stars: one more than the other film.
If you agree with this sort of reasoning, you'll like the slope-one recommender (http://en.wikipedia.org/wiki/Slope_One). It estimates preferences for new items based on the average difference in preference value (the diffs) between a new item and the other items the user prefers.
For example, let's say that, on average, people rate Scarface higher by 1.0 than Carlito's Way. Let's also say that everyone rates Scarface the same as The Godfather, on average. And now, there's a user who rates Carlito's Way as 2.0, and The Godfather as 4.0. What's a good estimate of their preference for Scarface?
Based on Carlito's Way, a good guess would be 2.0 + 1.0 = 3.0. Based on The Godfather, it might be 4.0 + 0.0 = 4.0. A better guess still might be the average of the two: 3.5. This is the essence of the slope-one recommender approach.
The algorithm
The slope-one name comes from the fact that the recommender algorithm starts with the assumption that there's some linear relationship between the preference values for one item and another, that it's valid to estimate the preferences for some item Y based on the preferences for item X, via some linear function like Y = mX + b. Then the slope-one recommender makes the additional simplifying assumption that m = 1: slope one. It just remains to find b = Y - X, the (average) difference in preference value, for every pair of items.
This means the algorithm consists of a significant preprocessing phase, in which all item-item preference value differences are computed:

for every item i
  for every other item j
    for every user u expressing a preference for both i and j
      add the difference in u's preferences for i and j to an average
And then, the recommendation algorithm looks like this:

for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u's preference value for j
    add this to a running average
return the top items, ranked by these averages
The average diffs over the small sample recommender inputs from several prior examples in the book are shown in table 4.7.
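Using the small data set from section 4.3, the diffs in table 4.7 can be reproduced directly. The averageDiff helper below is illustrative, not Mahout's code, with diffs oriented as (preference for j) minus (preference for i), averaged over users who rated both:

```java
import java.util.Map;

public class SlopeOneDiffs {
    /**
     * Average diff (preference for item j minus preference for item i)
     * over all users who rated both. Returns NaN when no user rated both.
     */
    static double averageDiff(Map<Integer, Map<Integer, Double>> prefsByUser,
                              int i, int j) {
        double sum = 0.0;
        int count = 0;
        for (Map<Integer, Double> prefs : prefsByUser.values()) {
            Double pi = prefs.get(i);
            Double pj = prefs.get(j);
            if (pi != null && pj != null) {
                sum += pj - pi;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

For items 101 and 102, users 1, 2, and 5 rated both, giving (-2.0 + 0.5 - 1.0) / 3 ≈ -0.833. For items 106 and 107, no user rated both, so no diff exists.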
Table 4.7 Average differences in preference values between all pairs of items. Cells along the diagonal are 0.0. Cells in the bottom left are simply the negative of their counterparts across the diagonal, so these aren't represented explicitly. Some diffs don't exist, such as 102-107, because no user expressed a preference for both 102 and 107.

           Item 101  Item 102  Item 103  Item 104  Item 105  Item 106  Item 107
Item 101             -0.833    -0.875     0.25      0.75     -0.5       2.5
Item 102                        0.333     0.25      0.5       1.0       -
Item 103                                  0.167     1.5       1.5       -
Item 104                                            0.0      -0.25      1.0
Item 105                                                      0.5       0.5
Item 106                                                                -
Item 107
Slope-one is attractive because the online portion of the algorithm is fast. Like an item-based recommender, its performance doesn't depend upon the number of users in the data model. It depends only upon the average preference difference between every pair of items, which can be precomputed. Further, its underlying data structure can be efficiently updated: when a preference changes, it's simple to update the relevant diff values. In contexts where preferences may change quickly, this is an asset.
Note that the memory requirements necessary to store all of these item-item differences in preference values grow as the square of the number of items. Twice as many items means four times the memory!
Slope-one in practice
You can easily try the slope-one recommender. It takes no similarity metric as a necessary argument:

new SlopeOneRecommender(model)

After running a standard evaluation using, again, the GroupLens 10 million rating data set, you'll get a result near 0.65. That's the best yet. Indeed, the simple slope-one approach works well in many cases. This algorithm doesn't make use of a similarity metric, unlike the other approaches so far. It has relatively few knobs to twiddle.
Like the Pearson correlation, the simplest form of the slope-one algorithm has a vulnerability: item-item diffs are given equal weighting regardless of how reliable they are, that is, how much data they're based upon. Let's say only one user in the history of movie watching has rated both Carlito's Way and The Notebook. It's possible; they're quite different films. It's easy to compute a diff for these two films, but would it be as useful as the diff between Carlito's Way and The Godfather, which is averaged over thousands of users? It sounds unlikely. The latter diff is probably more reliable because it's an average over a higher number of users.
Again, some form of weighting may help to improve these recommendations. SlopeOneRecommender offers two types of weighting: weighting based on count and on standard deviation. Recall that slope-one estimates the preference values by adding diffs to all of the user's current preference values and then averaging all of those results together to form an estimate. Count weighting will weight more heavily those diffs that are based on more data. The average becomes a weighted average, where the diff count is the weight: the number of users on which the diff is based.
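A sketch of count weighting (illustrative, not Mahout's code): each candidate estimate, the user's preference for item j plus the average diff from j to the target item, contributes with weight equal to the number of users behind that diff:

```java
public class CountWeightedEstimate {
    /**
     * Count-weighted slope-one estimate. For each item j the user has rated:
     * prefs[j] is the user's preference, diffs[j] is the average diff from j
     * to the target item, and counts[j] is the number of users that diff is
     * based on, used as its weight.
     */
    static double estimate(double[] prefs, double[] diffs, long[] counts) {
        double weightedSum = 0.0;
        long totalCount = 0;
        for (int k = 0; k < prefs.length; k++) {
            weightedSum += counts[k] * (prefs[k] + diffs[k]);
            totalCount += counts[k];
        }
        return totalCount == 0 ? Double.NaN : weightedSum / totalCount;
    }
}
```

In the Carlito's Way example, a candidate estimate of 3.0 backed by 1,000 users and a candidate estimate of 4.0 backed by a single user average to about 3.001, not 3.5: the unreliable diff is effectively drowned out.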
Similarly, standard deviation weighting will weight according to the standard deviation of the difference in preference value. A lower standard deviation means a higher weighting. If the difference in preference value between two films is very consistent across many users, it seems more reliable and should be given more weight. If it varies considerably from user to user, it should be deemphasized.
These variants turn out to be enough of a good idea that they're enabled by default. You already used this strategy when you ran the previous evaluation. Disable them to see the effect, as follows.
DiffStorage diffStorage = new MemoryDiffStorage(
    model, Weighting.UNWEIGHTED, Long.MAX_VALUE);
return new SlopeOneRecommender(
    model,
    Weighting.UNWEIGHTED,
    Weighting.UNWEIGHTED,
    diffStorage);
The result is 0.67—only slightly worse on this data set.
Diff storage and memory considerations
Slope-one does have its price: memory consumption. In fact, if you tweak the evaluation
to use just 10 percent of all data (about 100,000 ratings), even a gigabyte of heap
space won't be enough. The diffs are used so frequently, and it's so relatively expensive
to compute them, that they do need to be computed and stored ahead of time.
But keeping them all in memory can get expensive. It may become necessary to store
diffs elsewhere.
Fortunately, implementations like MySQLJDBCDiffStorage exist for this purpose,
allowing diffs to be computed and updated from a database. It must be used in conjunction
with a JDBC-backed DataModel implementation like MySQLJDBCDataModel, as
shown in the next listing.
AbstractJDBCDataModel model = new MySQLJDBCDataModel();
DiffStorage diffStorage = new MySQLJDBCDiffStorage(model);
Recommender recommender = new SlopeOneRecommender(
model, Weighting.WEIGHTED, Weighting.WEIGHTED, diffStorage);
As with MySQLJDBCDataModel, the table name and column names used by MySQLJDBCDiffStorage can be customized via constructor parameters.
Precomputation
Precomputing the item-item diffs is significant work. Although it's more likely that the
size of your data will cause problems with memory requirements before the time
required to compute these diffs becomes problematic, you might be wondering if
there are ways to distribute this computation to complete it faster. Diffs can be updated
easily at runtime in response to new information, so a relatively infrequent
offline precomputation process is feasible in this model.
Distributing the diff computation via Hadoop is supported. Chapter 6 will introduce
all of the Hadoop-related recommender support in Mahout, to explore this process.
4.6 New and experimental recommenders
Mahout also contains implementations of other approaches to recommendation. The
three implementations presented briefly in this section are newer: the implementations
may still be evolving, or the technique may be more recent and experimental.
All are worthy ideas that may be useful for use or for modification.
Recommenders based on singular value decomposition
Among the most intriguing of these implementations is SVDRecommender, based on
the singular value decomposition (SVD). This is an important technique in linear algebra
that pops up in machine learning techniques. Fully understanding it requires
some advanced matrix math and an understanding of matrix factorization, but this isn't
necessary to appreciate SVD's application to recommenders.
To outline the intuition behind what SVD does for recommenders, let's say you ask
a friend what sort of music she likes, and she lists the following artists:
Brahms, Louis Armstrong, Chopin, Schumann, Miles Davis, John Coltrane, Tchaikovsky, Charlie Parker
She might as well have summarized that she likes classical and jazz music. That communicates
less precise information, but not a great deal less. From either statement,
you could (probably correctly) infer that she would appreciate Beethoven more than
the classic rock band Deep Purple.
Of course, recommender engines operate in a world of many specific data points,
not generalities. The input is user preferences for a lot of particular items, more like
the preceding list than this summary. It would be nice to operate on a smaller set of
data, all else equal, for reasons of performance. If, for example, iTunes could base its
Genius recommendations not on billions of individual song ratings, but instead on millions
of ratings of genres, it would be faster, and, as a basis for recommending music,
it might not be much worse.
Here, SVD is the magic that can do the equivalent of the preceding summarization.
It boils down the world of user preferences for individual items to a world of user
preferences for more general and less numerous features (like genre). This is, potentially,
a much smaller set of data.
Although this process loses some information, it can sometimes improve recommendation
results. The process smooths the input in useful ways. For example, imagine
two car enthusiasts. One loves Corvettes, and the other loves Camaros, and they
want car recommendations. These enthusiasts have similar tastes: both love a Chevrolet
sports car. But in a typical data model for this problem, these two cars would be different
items. Without any overlap in their preferences, these two users would be
deemed unrelated. However, an SVD-based recommender would perhaps find the similarity.
The SVD output may contain features that correspond to concepts like Chevrolet
or sports car, with which both users would be associated. From the overlap in features, a
similarity could be computed.
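The benefit of the feature space can be sketched with a toy similarity computation. The feature names and affinity numbers below are invented; the point is only that two users with no items in common can still come out similar once both are expressed over shared features.

```java
// Toy sketch: user similarity in a reduced feature space (as produced by
// something like SVD). Each user is a vector of affinities for latent
// features such as "Chevrolet" or "sports car" (values invented here).
public class FeatureSpaceSimilarity {
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

In the raw item space, the Corvette fan and the Camaro fan share no items, so their vectors are orthogonal and similarity is 0. In the feature space, both load heavily on the same two features and come out highly similar.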
Using the SVDRecommender is as simple as this:
new SVDRecommender(model, new ALSWRFactorizer(model, 10, 0.05, 10))
SVDRecommender uses a Factorizer to perform its work; a fine first choice is the
ALSWRFactorizer implementation used here. We won't discuss your choices of
Factorizer in detail here.
The first numeric argument is the number of features that the SVD should target.
There's no right answer for this; it would be equivalent to the number of genres that
you condensed someone's musical taste into, in this section's first example. The second
argument is lambda, and it controls a feature of the factorizer called regularization.
The last argument is the number of training steps to run. Think of this as
controlling the amount of time it should spend producing this summary; larger values
mean longer training.
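To make the three parameters concrete, here is a minimal factorization sketch. It is not what ALSWRFactorizer does internally (that uses alternating least squares); this simplified gradient-descent variant only illustrates how the number of features, lambda, and the iteration count shape the result.

```java
import java.util.Random;

// Minimal sketch of factorizing a ratings matrix into user and item feature
// matrices, illustrating numFeatures, lambda (regularization), and the
// number of training iterations. Purely illustrative, not Mahout's ALS.
public class FactorizationSketch {
    // Returns { userFeatures, itemFeatures }; NaN entries in r are unknown.
    public static double[][][] factorize(double[][] r, int numFeatures,
                                         double lambda, int iterations) {
        int users = r.length, items = r[0].length;
        Random rnd = new Random(42);
        double[][] u = new double[users][numFeatures];
        double[][] v = new double[items][numFeatures];
        for (double[] row : u)
            for (int f = 0; f < numFeatures; f++) row[f] = rnd.nextDouble() * 0.1;
        for (double[] row : v)
            for (int f = 0; f < numFeatures; f++) row[f] = rnd.nextDouble() * 0.1;
        double rate = 0.05;  // learning rate for the gradient steps
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < users; i++) {
                for (int j = 0; j < items; j++) {
                    if (Double.isNaN(r[i][j])) continue;  // skip unknown ratings
                    double pred = 0.0;
                    for (int f = 0; f < numFeatures; f++) pred += u[i][f] * v[j][f];
                    double err = r[i][j] - pred;
                    for (int f = 0; f < numFeatures; f++) {
                        double uf = u[i][f];
                        // lambda penalizes large feature values (regularization)
                        u[i][f] += rate * (err * v[j][f] - lambda * uf);
                        v[j][f] += rate * (err * uf - lambda * v[j][f]);
                    }
                }
            }
        }
        return new double[][][] { u, v };
    }

    public static double predict(double[][][] model, int user, int item) {
        double p = 0.0;
        for (int f = 0; f < model[0][user].length; f++)
            p += model[0][user][f] * model[1][item][f];
        return p;
    }
}
```

More features and more iterations fit the known ratings more closely; a larger lambda pulls the feature values toward zero, trading exact fit for generalization.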
This approach can give good results (0.69 on the GroupLens data set). At the
moment, the major issue with the implementation is that it computes the SVD in
memory. This requires the entire data set to fit in memory, and it's precisely when
this isn't the case that this technique is appealing, because it can shrink the input
without compromising output quality significantly. In the future, this algorithm will
be reimplemented in terms of Hadoop, so the necessarily massive SVD computation
can be distributed across multiple machines. It isn't yet available at this stage of
Mahout's evolution.
Recommenders based on linear interpolation
Linear interpolation is a somewhat different take on item-based recommendation,
implemented as KnnItemBasedRecommender. Knn is short for k nearest neighbors, which
is an idea also embodied in NearestNUserNeighborhood. As you saw earlier in
figure 4.1, a UserNeighborhood implementation selects a fixed number of most similar
users as a neighborhood of similar users. This linear interpolation algorithm uses
the concept of a user neighborhood, but in a different way.
KnnItemBasedRecommender still estimates preference values by means of a
weighted average of the items the user already has a preference for, but the weights
aren't the results of a similarity metric. Instead, the algorithm calculates the optimal
set of weights to use between all pairs of items by means of some linear algebra;
here's where the linear interpolation comes in. Yes, it's possible to optimize the
weights with some mathematical wizardry.
In reality, it would be very expensive to compute this across all pairs of items, so
instead it first calculates a neighborhood of items most similar to the target item (the
item for which a preference is being estimated). It chooses the n nearest neighbors, in
much the same way that NearestNUserNeighborhood did. One can try KnnItemBasedRecommender
as shown in the following listing.
Listing 4.9 Deploying KnnItemBasedRecommender

ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
Optimizer optimizer = new NonNegativeQuadraticOptimizer();
return new KnnItemBasedRecommender(model, similarity, optimizer, 10);

This code will cause the recommender to use a log-likelihood similarity metric to calculate
nearest-10 neighborhoods of items. Then it'll use a quadratic programming–based
strategy to calculate the linear interpolation. The details of this calculation are
outside the scope of the book.
The implementation is quite functional, but in its current form, it's also slow on
moderately sized data sets. It should be viewed as viable for small data sets, or for study
and extension. On the GroupLens data set, it yields an evaluation result of 0.76.
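The neighborhood step described above, keeping only the n items most similar to the target before solving for interpolation weights, can be sketched as follows. The similarity scores are hypothetical inputs; Mahout computes them via the configured ItemSimilarity.

```java
import java.util.Arrays;

// Sketch of the neighborhood step in a k-nearest-neighbor item recommender:
// given precomputed similarities to the target item, keep the indices of
// the n most similar items.
public class NearestItems {
    public static int[] nearest(double[] similarityToTarget, int n) {
        Integer[] idx = new Integer[similarityToTarget.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort item indices by descending similarity to the target item
        Arrays.sort(idx, (a, b) -> Double.compare(similarityToTarget[b], similarityToTarget[a]));
        int[] result = new int[Math.min(n, idx.length)];
        for (int i = 0; i < result.length; i++) result[i] = idx[i];
        return result;
    }
}
```

Only the items in this small neighborhood then enter the (expensive) quadratic-programming step, which is what keeps the computation tractable.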
Cluster-based recommenders
Cluster-based recommendation is best thought of as a variant on user-based recommendation.
Instead of recommending items to individual users, items are recommended to clusters
of similar users. This entails a preprocessing phase, in which all users are
partitioned into clusters. Recommendations are then produced for each cluster, such
that the recommended items are most interesting to the largest number of users.
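The per-cluster precomputation can be sketched as follows: aggregate the preferences of all users in a cluster and rank items by total preference. This is an illustrative simplification with hypothetical data, not Mahout's implementation.

```java
import java.util.*;

// Sketch of the cluster-based precomputation: for one cluster of users,
// aggregate preferences across all members and rank items, so the same
// top items can be recommended to every user in the cluster.
public class ClusterRecommendation {
    // prefs: one map of item -> preference value per user in the cluster
    public static List<String> topItemsForCluster(List<Map<String, Double>> prefs,
                                                  int howMany) {
        Map<String, Double> totals = new HashMap<>();
        for (Map<String, Double> userPrefs : prefs) {
            for (Map.Entry<String, Double> e : userPrefs.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        List<String> items = new ArrayList<>(totals.keySet());
        items.sort((a, b) -> Double.compare(totals.get(b), totals.get(a)));
        return items.subList(0, Math.min(howMany, items.size()));
    }
}
```

At runtime, serving a recommendation is then just a lookup of the user's cluster and its precomputed item list.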
The upside of this approach is that recommendation is fast at runtime, because
almost everything is precomputed. One could argue that
the recommendations are less personal this way, because
recommendations are computed for a group rather than
an individual. This approach may be more effective at
producing recommendations for new users, who have little
preference data available. As long as the user can be
attached to a reasonably relevant cluster, the recommendations
ought to be as good as they'll be when more is
known about the user.
The name comes from the fact that the algorithm
repeatedly joins most-similar clusters into larger clusters,
and this implicitly organizes users into a sort of hierarchy,
or tree, as depicted in figure 4.5.
Unfortunately, the clustering takes a long time, as
you'll see if you attempt to run the code in the following
listing, which employs a TreeClusteringRecommender to
implement this idea.

Listing 4.10 Creating a cluster-based recommender

UserSimilarity similarity = new LogLikelihoodSimilarity(model);
ClusterSimilarity clusterSimilarity =
    new FarthestNeighborClusterSimilarity(similarity);
return new TreeClusteringRecommender(model, clusterSimilarity, 10);

Similarity between users is, as usual, defined by a UserSimilarity implementation.
Similarity between two clusters of users is defined by a ClusterSimilarity implementation.
Currently, two implementations are available: one defines cluster similarity as
the similarity between the two most similar users, one chosen from each cluster;
the other defines cluster similarity as the similarity between the two least similar users.
Both approaches are reasonable; in both cases the risk is that one outlier on the
edge of a cluster distorts the notion of cluster similarity. Two clusters whose members
are on average distant but happen to be close at one edge would be considered quite
close by the most-similar-user rule, which is implemented by NearestNeighborClusterSimilarity.
The least-similar-user rule, implemented by FarthestNeighborClusterSimilarity
in listing 4.10, likewise may consider two fairly close clusters to be
distant from one another if each contains an outlier far away from the opposite cluster.
A third approach, defining cluster similarity as the distance between the centers,
or means, of two clusters, is also possible, though not yet implemented in this part of
Mahout.
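The third, centroid-based approach, which the text notes is not implemented in Mahout, can be sketched directly: represent each cluster by the mean of its users' preference vectors and compare the means. The vectors here are hypothetical dense preference arrays.

```java
// Sketch of centroid-based cluster comparison (not implemented in Mahout):
// represent each cluster by the mean of its members' preference vectors,
// then compare the two means, here via Euclidean distance.
public class CentroidClusterDistance {
    public static double[] centroid(double[][] users) {
        double[] mean = new double[users[0].length];
        for (double[] u : users)
            for (int i = 0; i < mean.length; i++) mean[i] += u[i] / users.length;
        return mean;
    }

    public static double distance(double[][] clusterA, double[][] clusterB) {
        double[] ca = centroid(clusterA);
        double[] cb = centroid(clusterB);
        double sum = 0.0;
        for (int i = 0; i < ca.length; i++) sum += (ca[i] - cb[i]) * (ca[i] - cb[i]);
        return Math.sqrt(sum);
    }
}
```

Because the centroid averages over all members, a single outlier shifts the result only slightly, which is exactly the robustness the nearest- and farthest-neighbor rules lack.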
4.7 Comparison to other recommenders
As mentioned earlier in the book, content-based recommendation is a broad and
often-mentioned approach to recommendation that takes into account the content or
attributes of items. For this reason, it's similar to, yet distinct from, collaborative filtering
approaches, which are based on user associations with items only, and treat items
as black boxes without attributes. Although Mahout largely doesn't implement
content-based approaches, it does offer some opportunities to make use of item attributes
in recommendation computations.
Injecting content-based techniques into Mahout
For example, consider an online bookseller who stocks multiple editions of some
books. This seller might recommend books to its customers. Its items are books, of
course, and it might naturally define a book by its ISBN number (a unique product
identifier). But for a popular public-domain book like Jane Eyre, there may be many
printings by different publishers of the same text, under different ISBN numbers. It
seems more natural to recommend books based on their texts, rather than on a particular
edition; do you care more about reading Jane Eyre, or "Jane Eyre as printed by
ACME Publications in 1993 in paperback"? Rather than treat various publications of
Jane Eyre as distinct items, it might be more useful to think of the book itself, the text,
as the item, and recommend all editions of this book equally. This would be, in a
sense, content-based recommendation. By treating the underlying text of a book product,
which is its dominant attribute, as the item in a collaborative filtering sense,
and then applying collaborative filtering techniques with Mahout, the seller would be
engaging in a form of content-based recommendation.
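The edition-folding idea amounts to a preprocessing step: map each edition's ISBN to a canonical work identifier before preferences reach the collaborative filtering engine. The ISBN strings and work IDs below are invented placeholders.

```java
import java.util.Map;

// Sketch of folding editions into works: translate each edition's ISBN to a
// canonical work ID before building the preference data, so all editions of
// Jane Eyre are treated as one item. All identifiers here are invented.
public class EditionFolding {
    private static final Map<String, String> ISBN_TO_WORK = Map.of(
        "isbn-edition-1", "jane-eyre",
        "isbn-edition-2", "jane-eyre",
        "isbn-other", "wuthering-heights");

    // Unknown ISBNs fall through unchanged, remaining distinct items.
    public static String canonicalItem(String isbn) {
        return ISBN_TO_WORK.getOrDefault(isbn, isbn);
    }
}
```

Preferences recorded against either edition then accumulate on the same item, so the recommender sees one popular book instead of several unpopular editions.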
Or recall that item-based recommenders require some notion of similarity
between two given items. This similarity is encapsulated by an ItemSimilarity implementation.
So far, implementations have derived similarity from user preferences
only; this is classic collaborative filtering. But there's no reason the implementation
couldn't be based on item attributes. For example, a movie recommender might
define item (movie) similarity as a function of movie attributes like genre, director,
actor, and year of release. Using such an implementation within a traditional item-based
recommender would also be an example of content-based recommendation.
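Such an attribute-based similarity could be sketched as below. The attribute choices, weights, and the 50-year scale are all hypothetical; a real ItemSimilarity implementation would wrap logic like this behind Mahout's interface.

```java
// Sketch of a content-based item similarity for movies, derived from item
// attributes rather than user preferences. The attributes and weights
// chosen here are hypothetical.
public class MovieAttributeSimilarity {
    public static double similarity(String genreA, String directorA, int yearA,
                                    String genreB, String directorB, int yearB) {
        double s = 0.0;
        if (genreA.equals(directorB == null ? "" : genreB)) s += 0.5;  // same genre
        if (directorA.equals(directorB)) s += 0.3;                     // same director
        // Closeness in release year, scaled so 50+ years apart contributes 0
        s += 0.2 * Math.max(0.0, 1.0 - Math.abs(yearA - yearB) / 50.0);
        return s;
    }
}
```

Two crime films by the same director released two years apart score near the maximum, even if no user has ever rated both.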
Looking deeper into content-based recommendation
Taking this a step further, imagine content-based recommendation as a generalization
of collaborative filtering. In collaborative filtering, computations are based on preferences,
which are user-item associations. But what drives these user-item associations?
It's likely that users have implicit preferences for certain item attributes, which come
out in their preferences for certain items and not others. For example, if your friend
told you she likes the albums Led Zeppelin I, Led Zeppelin II, and Led Zeppelin III, you
might well guess she is actually expressing a preference for an attribute of these items:
the band Led Zeppelin. By discovering these associations, and discovering attributes
of items, it's possible to construct recommender engines based on these more
nuanced understandings of user-item associations.
These techniques come to resemble search and document-retrieval techniques:
asking what items a user might like based on user-attribute associations and item attributes
resembles retrieving search results based on query terms and the occurrence of
terms in documents. Although Mahout's recommender support doesn't yet embrace
these techniques, it's a natural direction for future versions to address.
Comparison to model-based recommenders
Another future direction for Mahout is model-based recommendation. This family of
techniques attempts to build some model of user preferences based on existing preferences,
and then infer new preferences. These techniques generally fall into the
broader category of collaborative filtering, because they typically derive from user
preferences only.
The model might be a probabilistic picture of users' preferences, in the form of a
Bayesian network, for example. The algorithm then attempts to judge the probability of
liking an item given its knowledge of all user preferences, and it ranks the recommendations
accordingly.
Association rule learning can be applied in a similar sense to recommendations. By
learning rules such as "when a user prefers item X and item Y, they will prefer item Z"
from the data, and judging confidence in the reliability of such rules, a recommender
can put together the most likely set of new, preferred items.
Cluster-based recommenders might be considered a type of model-based recommender.
The clusters represent a model of how users group together and therefore
how their preferences might run the same way. In this limited sense, Mahout supports
model-based recommenders. But this is an area that's still largely under construction
in Mahout as of this writing.
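The rule-confidence idea above can be sketched directly: the confidence of "users who prefer X and Y also prefer Z" is the fraction of users holding the antecedent items who also hold the consequent. The item-set data here is hypothetical.

```java
import java.util.List;
import java.util.Set;

// Sketch of association-rule confidence for recommendation: confidence of
// the rule {X, Y} -> Z is the fraction of users who have all of {X, Y}
// and also have Z. Item sets here stand in for users' preferred items.
public class RuleConfidence {
    public static double confidence(List<Set<String>> users,
                                    Set<String> antecedent, String consequent) {
        int withAntecedent = 0;
        int withBoth = 0;
        for (Set<String> items : users) {
            if (items.containsAll(antecedent)) {
                withAntecedent++;
                if (items.contains(consequent)) withBoth++;
            }
        }
        return withAntecedent == 0 ? 0.0 : (double) withBoth / withAntecedent;
    }
}
```

A recommender built on such rules would rank candidate items Z by the confidence of the best rules whose antecedents the target user satisfies.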
4.8 Summary
In this chapter, we thoroughly explored the core recommender algorithms offered by
Mahout.
We started by explaining the general user-based recommender algorithm in terms
of real-world reasoning. From there, we looked at how this algorithm is realized in
Mahout, as GenericUserBasedRecommender. Many pieces of this generic approach can
be customized, such as the definition of user similarity and user neighborhood.
We looked at the classic user similarity metric, based on the Pearson correlation,
noted some possible issues with this approach, and looked at responses such as
weighting. We looked at similarity metrics based on the Euclidean distance, Spearman
correlation, Tanimoto coefficient, and log-likelihood ratio.
Then we examined another canonical recommendation technique, item-based
recommendation, as implemented by GenericItemBasedRecommender. It reuses some
concepts already covered in the context of user-based recommenders, such as the
Pearson correlation.
Next, we examined a slope-one recommender, a unique and relatively simple
approach to recommendation based on average differences in preference values
between items. It requires significant precomputation and storage for these diffs, so
we explored how to store these both in memory and in a database.
Last, we looked briefly at a few newer, more experimental implementations currently
in the framework. These include implementations based on singular value
decomposition, linear interpolation, and clustering. These may be useful for small
data sets, or may be of academic interest, because they're still a work in progress.
The key parameters and features for each implementation are summarized in
table 4.8.
GenericUserBasedRecommender
  Key parameters: user similarity metric; neighborhood definition and size
  Key features: conventional implementation; fast when number of users is relatively small

GenericItemBasedRecommender
  Key parameters: item similarity metric
  Key features: fast when number of items is relatively small; useful when an external notion of item similarity is available

SlopeOneRecommender
  Key parameters: diff storage strategy
  Key features: recommendations and updates are fast at runtime; requires large precomputation; suitable when number of items is relatively small

SVDRecommender
  Key parameters: number of features
  Key features: good results; requires large precomputation

KnnItemBasedRecommender
  Key parameters: number of means (k); item similarity metric; neighborhood size
  Key features: good when number of items is relatively small

TreeClusteringRecommender
  Key parameters: number of clusters; cluster similarity definition; user similarity metric
  Key features: recommendations are fast at runtime; requires large precomputation; good when number of users is relatively small
We’re done introducing Mahout’s recommender engine support. Now we’re ready to
examine even larger and more realistic data sets, from the practitioner’s perspective.
You might wonder why we’ve made little mention of Hadoop yet. Hadoop is a power-
ful tool, and it’s necessary when dealing with massive data sets, where one must make
use of many machines. This has drawbacks:such computations are massive, they’re
resource-intensive, and they complete in hours, not milliseconds. We’ll reach Hadoop
in the last chapter in this section.
First, in the next chapter, we create a production-ready recommender engine
based on Mahout that fits onto one machine—one that can respond to requests for
recommendations in a fraction of a second and incorporate updates immediately.本作品采用知识共享署名 4.0 国际许可协议进行许可。