4 创建推荐系统


本章包括
  深入了解基于用户的推荐系统
  相似度度量
  基于商品的和其它推荐系统

  我们已经学习了怎么评估推荐系统以及怎么表示推荐系统的输入数据,现在来详细看看推荐系统本身。前一章简单介绍了Mahout中实现的两类推荐系统:基于用户的和基于商品的。实际上,在第二章你已经接触到了基于用户的推荐系统。两类算法都依赖相似性度量,而度量相似性的方法有很多,包括皮尔逊相关性、欧几里得距离、斯皮尔曼相关性、对数似然比等。最后,我们还会接触到其它类型的推荐算法,包括slope-one、奇异值分解和基于聚类的推荐引擎。

4.1 理解基于用户的推荐系统

  基于用户的推荐算法的特征是:它依赖用户之间相似性的度量。实际上,你每天都可能接触到了这种算法。
什么时候推荐是错误的

  先看看下面一个成年人去CD店买CD的场景:
成年人:我要为一个青少年找一张CD。
雇员:什么样的青少年?
成年人:就是现在很普通的那种青少年。
雇员:什么样的音乐或类型?
成年人:我糊涂了,我不知道。
雇员:我想,类似New 2 Town的专辑吧。
成年人:就买它了。

  但是这样的推荐经常出错:青少年之间的口味差别很大,某些青少年喜欢这张专辑,其它的未必喜欢。所以,从和目标用户真正相似的人群出发去做推荐会更合理。

什么时候推荐是正确的

  让我们重现这个场景,并把它构建好一些:
成年人:我要为一个青少年找一张CD。
雇员:他喜欢什么类型的音乐?
成年人:我不知道,但是他最好的朋友总是穿着同一个乐队的T-shirt。
雇员:哦,那个非常流行的克利夫兰新金属乐队……
这就对了:推荐基于"两个好朋友在音乐品味上相似"这个假设。沿着这个可信的维度,这次推测将会更准一些。

4.2 探索基于用户的推荐系统

算法
  基于用户的推荐算法为用户u推荐商品的过程如下所示:

for 用户u还没有参考值的每个项i
  for 其它对项i有参考值的每个用户v
    计算用户u和用户v之间的相似度s
    把用户v对项i的参考值按权重s累加进加权平均值(s越大,权重越大)
return 按加权平均值排序,返回最靠前的若干项

  对每个项都这样计算会非常慢。现实中,通常先计算出当前用户的一个邻居集合,只有邻居拥有的项才会被考虑:

for 每个其它用户w
  计算用户u和w之间的相似度s
  按相似度保留最相似的若干用户,作为用户u的邻居n
for 邻居n中有人有参考值、但用户u还没有参考值的每个项i
  for 邻居n中对项i有参考值的每个用户v
    计算用户u和v之间的相似度s
    把用户v对项i的参考值按权重s累加进加权平均值(s越大,权重越大)

  这就是Mahout中实现的标准的基于用户的推荐算法。

用GenericUserBasedRecommender实现算法
  在第二章中我们运行了第一个基于用户的推荐引擎。现在回到这个例子,看看其中每个组件的作用。

		DataModel model = new FileDataModel(new File("intro.csv"));
		UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
		UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
		Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
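  构建好Recommender之后,就可以为某个用户请求推荐了。下面是一个简单的用法示意(沿用上面的recommender;用户ID取1、推荐个数取10只是示例参数):

		List<RecommendedItem> recommendations = recommender.recommend(1, 10);
		for (RecommendedItem recommendation : recommendations) {
			System.out.println(recommendation.getItemID() + " : " + recommendation.getValue());
		}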

  UserSimilarity封装了用户之间相似性的计算,UserNeighborhood按相似度为当前用户圈出一组最相似的用户,它们是基于用户的推荐系统的标准组件。定义"最相似用户邻居"的方法有很多:取最相似的5个?20个?还是取相似度超过某个阈值的所有用户?举个例子,想象你要为婚礼拟一份客人名单:你想邀请最好的朋友和家人,但朋友太多而预算有限,于是你必须决定是限制在50个人、40个还是100个,或者干脆只邀请"真正要好"的朋友而不管人数。决定怎么选择相似用户邻居就是类似的取舍。Mahout提供的不是一个单一的推荐引擎,而是一组可以插拔、可以参数化的组件,你可以用它们组装出适合特定领域的推荐引擎。
组件如下:
  数据模型,通过DataModel实现
  用户到用户的相似性度量,通过UserSimilarity实现
  用户邻居定义,通过UserNeighborhood实现
  推荐引擎,通过Recommender实现(这里是GenericUserBasedRecommender)

使用GroupLens探索
  让我们回到GroupLens数据集,这次使用大得多的数据:从http://grouplens.org (具体地址为http://www.grouplens.org/node/73)下载包含1000万条评分的MovieLens数据集,并从中提取出ratings.dat。与之前10万条评分数据集里的ua.base不同,ratings.dat的格式不能直接交给FileDataModel使用。你可以把它转换成CSV文件,也可以自己实现一个DataModel。幸好,Mahout的示例代码里已经提供了GroupLensDataModel,参考如下代码:

		DataModel model = new GroupLensDataModel(new File("ratings.dat"));
		UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
		UserNeighborhood neighborhood = new NearestNUserNeighborhood(100, similarity, model);
		Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
		LoadEvaluator.runLoad(recommender);

  请调大JVM的堆内存(例如 -Xmx2048m),不然运行时会内存不足。

探索用户的邻居数据
  下面的代码用于评估这个推荐系统的精度。在此基础上,我们再看看怎么配置和修改邻居的定义。

代码4.3

		DataModel model = new GroupLensDataModel(new File("ratings.dat"));
		RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
		RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
			@Override
			public Recommender buildRecommender(DataModel model) throws TasteException {
				UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
				UserNeighborhood neighborhood = new NearestNUserNeighborhood(100, similarity, model);
				return new GenericUserBasedRecommender(model, neighborhood, similarity);
			}
		};
		double score = evaluator.evaluate(recommenderBuilder, null, model, 0.95, 0.05);
		System.out.println(score);

  evaluate()的最后一个参数0.05表示只用全部数据中的5%(按用户抽样)来做评估:评估是个费时的过程,用全部数据评估几乎不现实,取较小的比例更方便。参数0.95表示对参与评估的每个用户,其95%的参考值用来构建模型,剩下的5%用于测试。这个score的结果在0.89左右。

固定大小的邻居数据
  上面代码的推荐引擎使用100个最相似的用户作为邻居(NearestNUserNeighborhood里是这么设置的)。如果改成10呢?推荐将只依赖很少的几个最相似用户,而把相似度稍低的用户全部排除在外。试着用10替代100,评估结果将在0.98左右;评估值越大说明效果越差,可以解释为10个相似用户太少了。再试试500,结果约为0.75,更好了。你还可以试试其它值。

基于阈值的邻居数据
  如果你不想用固定的个数来限定邻居,也可以改用相似度阈值:试试new ThresholdUserNeighborhood(0.7, similarity, model)。0.7就是相似度的阈值,此时评估结果将在0.84左右。改变阈值:0.9时结果约0.92,0.5时约0.78。
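  作为示意,把代码4.3中buildRecommender()里的邻居定义换成基于阈值的实现,大致是下面这样(0.7只是示例阈值,其余代码保持不变):

		UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
		UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.7, similarity, model);
		return new GenericUserBasedRecommender(model, neighborhood, similarity);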

4.3 探索相似性度量标准

  基于用户的推荐系统另一个重要的部分是用户相似性度量的实现。它是整个基于用户的推荐系统中最重要的。

基于皮尔逊相关性相似性度量
  现在我们来看看基于皮尔逊相关性的相似性度量的实现。皮尔逊相关系数是一个介于-1到1之间的数值,用r表示,它描述的是两个变量间线性相关的强弱程度。若r>0,表明两个变量正相关,即一个变量的值越大,另一个变量的值也趋向于越大;若r<0,表明两个变量负相关,即一个变量的值越大,另一个变量的值反而趋向于越小。r的绝对值越大表明相关性越强,但要注意这里并不蕴含因果关系。若r=0,表明两个变量间没有线性相关,但仍可能存在其它形式的相关。更多内容可参考"皮尔逊积矩相关系数"。
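  作为补充,皮尔逊相关系数的标准定义如下(这只是通用的统计定义,写成LaTeX形式;Mahout内部就是在两个用户的共同项上做等价的计算):

r = \frac{\sum_i (X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_i (X_i-\bar{X})^2}\,\sqrt{\sum_i (Y_i-\bar{Y})^2}}

  其中X和Y是两个用户在共同项上的参考值序列,\bar{X}、\bar{Y}是各自的平均值。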
这个系数在统计学中应用很广,也可以用来度量用户之间的相似性:把两个用户对共同项的参考值当作两个变量序列,计算它们之间的相关性。我们来看看下面的数据:

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

  可以注意到用户1和用户5看起来很相似,因为他们的参考值接近:对项101、102和103,两人都给101打了最高分,102其次,103最低。同样的道理,用户1和用户2就不太相似。还要注意用户1对104到106没有参考值,相似性只能在双方都有参考值的项上计算。基于项101、102和103上的参考值,可以得出用户1与其它各用户之间的皮尔逊相关性,如表4.1所示:
表4.1 用户1与其它用户之间的皮尔逊相关性(基于项101、102、103)

        项101  项102  项103  和用户1的相关性
用户1   5.0    3.0    2.5    1.000
用户2   2.0    2.5    5.0    -0.764
用户3   2.5    -      -      -
用户4   5.0    -      3.0    1.000
用户5   4.0    3.0    2.0    0.945

皮尔逊相似性问题
  虽然结果确实和直觉类似,但是皮尔逊相似性有一些奇怪的特性。
  首先,它没有把两个用户参考值重叠项的数目纳入计算,这在推荐引擎的场景下可能是个弱点。两个看过200部相同电影的用户,即使评分不完全一致,也应该比只共同看过2部电影的两个用户更相似。但如上表所示,用户1和用户5在3个共同项上的评分都很接近,看起来爱好更相似,而用户1和用户4却仅凭2个共同项就得到了更高的相关性1.0,这不太符合直觉。
  其次,如果两个用户只有一个重叠项,相关系数就无法计算,这也是表中用户1和用户3之间没有相关性的原因。在小而稀疏的数据集上,这会是个问题。
  最后,只要其中一个用户对所有共同项的参考值完全相同(方差为0),相关系数也无法计算。例如,如果用户5对3个项的评分都是3.0,那么用户5与用户1的相关系数就算不出来。

利用权值
  PearsonCorrelationSimilarity提供了一个对标准计算的扩展,叫做加权(weighting):根据相关性所基于的共同项数量对结果进行调整,共同项越多,算出的相关性就越被看重(向1.0或-1.0方向推),从而缓解上面提到的第一个问题。
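  在代码里启用加权只需要换一个构造器,示意如下(Weighting枚举来自org.apache.mahout.cf.taste.common包):

		UserSimilarity similarity = new PearsonCorrelationSimilarity(model, Weighting.WEIGHTED);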

使用欧几里得距离定义相似性
  我们换用基于欧几里得距离的相似性重新运行代码4.3。这种实现基于用户之间的"距离":想象每个用户是多维空间中的一个点,每个维度对应一个项,坐标值就是参考值。在这些维度上算出的用户间距离d越大,表示两人越不相似。由于相似度需要越大表示越相似,实现中返回的是1/(1+d):距离为0时相似度为1,距离越大相似度越接近0。表4.2给出了各用户与用户1之间的欧几里得距离和相似度:
表4.2 用户1与其它用户之间的欧几里得距离和相似度(基于项101、102、103)

        项101  项102  项103  距离    和用户1的相似性
用户1   5.0    3.0    2.5    0.000   1.000
用户2   2.0    2.5    5.0    3.937   0.203
用户3   2.5    -      -      2.500   0.286
用户4   5.0    -      3.0    0.500   0.667
用户5   4.0    3.0    2.0    1.118   0.472

  使用EuclideanDistanceSimilarity重新运行代码4.3,结果约为0.75,比之前更好了。还要注意,用户1和用户3之间现在也能算出相似度了。
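  作为验证,可以按上面的定义手工算一下表4.2中用户1与用户2的相似度,下面是一个与Mahout实现无关的纯Java示意:

		double[] u1 = {5.0, 3.0, 2.5};   // 用户1对项101、102、103的参考值
		double[] u2 = {2.0, 2.5, 5.0};   // 用户2对相同三项的参考值
		double sumOfSquares = 0.0;
		for (int i = 0; i < u1.length; i++) {
			double diff = u1[i] - u2[i];
			sumOfSquares += diff * diff;
		}
		double distance = Math.sqrt(sumOfSquares);      // 约3.937
		double similarity = 1.0 / (1.0 + distance);     // 约0.203,与表4.2一致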

使用余弦定义相似性
The cosine measure similarity is another similarity metric that depends on envisioning user preferences as points in space. Hold in mind the image of user preferences as points in an n-dimensional space. Now imagine two lines from the origin, or point (0,0,...,0), to each of these two points. When two users are similar, they'll have similar ratings, and so will be relatively close in space; at least, they'll be in roughly the same direction from the origin. The angle formed between these two lines will be relatively small. In contrast, when the two users are dissimilar, their points will be distant, and likely in different directions from the origin, forming a wide angle.
This angle can be used as the basis for a similarity metric in the same way that the Euclidean distance was used to form a similarity metric. In this case, the cosine of the angle leads to a similarity value. If you're rusty on trigonometry, all you need to remember to understand this is that the cosine value is always between -1 and 1: the cosine of a small angle is near 1, and the cosine of a large angle near 180 degrees is close to -1. This is good, because small angles should map to high similarity, near 1, and large angles should map to near -1.
You may be searching for something like CosineMeasureSimilarity in Mahout, but you've already found it, under an unexpected name: PearsonCorrelationSimilarity. The cosine measure similarity and the Pearson correlation aren't the same thing, but if you bother to work out the math, you'll discover that they reduce to the same computation when the two series of input values each have a mean of 0 (centered). The Mahout implementation centers the input, so the two are the same.
The cosine measure similarity is commonly referenced in research on collaborative filtering. You can employ this similarity metric by simply using PearsonCorrelationSimilarity.
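For reference, the cosine of the angle between two preference vectors X and Y is the standard quantity below (shown in LaTeX form only to make the geometric picture concrete; this is textbook trigonometry, not a Mahout-specific formula):

\cos\theta = \frac{X \cdot Y}{\lVert X \rVert\,\lVert Y \rVert} = \frac{\sum_i X_i Y_i}{\sqrt{\sum_i X_i^2}\,\sqrt{\sum_i Y_i^2}}

If X and Y are first centered by subtracting their means, this expression becomes exactly the Pearson correlation, which is why the two coincide in Mahout's centered implementation.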
用斯皮尔曼等级相关系数定义相似性
The Spearman correlation is an interesting variant on the Pearson correlation, for our purposes. Rather than compute a correlation based on the original preference values, it computes a correlation based on the relative rank of preference values. Imagine that, for each user, the least-preferred item's preference value is overwritten with a 1. Then the next-least-preferred item's preference value is changed to 2, and so on. To illustrate this, imagine that you were rating movies and gave your least-preferred movie one star, the next-least favorite two stars, and so on. Then, a Pearson correlation is computed on the transformed values. This is the Spearman correlation.
This process loses some information. Although it preserves the essence of the preference values (their ordering), it removes information about exactly how much more each item was liked than the last. This may or may not be a good idea; it's somewhere between keeping preference values and forgetting them entirely, two possibilities that we've explored before.
Table 4.3 shows the resulting Spearman correlations. Its simplifications on this already-simple data set result in some extreme values: in fact, all correlations are 1 or -1 here, depending on whether a user's preference values run with or counter to user 1's preferences. As with the Pearson correlation, no value can be computed between users 1 and 3.

Table 4.3 The preference values transformed into ranks, and the resulting Spearman correlation between user 1 and each of the other users

        Item 101  Item 102  Item 103  Correlation to user 1
User 1  3.0       2.0       1.0       1.0
User 2  1.0       2.0       3.0       -1.0
User 3  1.0       -         -         -
User 4  2.0       -         1.0       1.0
User 5  3.0       2.0       1.0       1.0

SpearmanCorrelationSimilarity implements this idea. You could try using it as the UserSimilarity in the evaluator code from before. Run it, and take a long coffee break. Turn in for the night. It won't finish anytime soon. This implementation is far slower because it must do some nontrivial work to compute and store these ranks. The Spearman correlation-based similarity metric is expensive to compute, and is therefore possibly more of academic interest than practical use. For some small data sets, though, it may be desirable.
This is a fine time to introduce one of the many caching wrapper implementations available in Mahout. CachingUserSimilarity is a UserSimilarity implementation that wraps another UserSimilarity implementation and caches its results. That is, it delegates computation to another, given implementation, and remembers those results internally. Later, when it's asked for a user-user similarity value that was previously computed, it can answer immediately rather than ask the given implementation to compute it again. In this way, you can add caching to any similarity implementation. When the cost of performing a computation is relatively high, as here, it can be worthwhile to employ. The cost, of course, is the memory consumed by the cache. Instead of just using SpearmanCorrelationSimilarity, try the implementation shown in the following listing.

Listing 4.5 Employing caching with a UserSimilarity implementation

UserSimilarity similarity = new CachingUserSimilarity(
    new SpearmanCorrelationSimilarity(model), model);

It's also advisable to decrease the amount of test data from 5 percent to 1 percent by increasing the trainingPercentage argument to evaluate() from 0.95 to 0.99. It would also be wise to decrease the evaluation percentage from 5 percent to 1 percent by changing the last parameter from 0.05 to 0.01. This will allow the evaluation to finish in more like tens of minutes.
The result should be near 0.80. Again, broad conclusions are difficult to draw: on this particular data set, it was not quite as effective as other similarity metrics.
使用Tanimoto系数定义忽略参考值的相似性
Interestingly, there are also UserSimilarity implementations that ignore preference values entirely. They don't care whether a user expresses a high or low preference for an item, only that the user expresses a preference at all. As we discussed in chapter 3, it doesn't necessarily hurt to ignore preference values.
TanimotoCoefficientSimilarity is one such implementation, based on (surprise) the Tanimoto coefficient. This value is also known as the Jaccard coefficient. It's the number of items that two users express some preference for, divided by the number of items that either user expresses some preference for, as illustrated in figure 4.3. In other words, it's the ratio of the size of the intersection to the size of the union of their preferred items. It has the required properties: when two users' items completely overlap, the result is 1.0. When they have nothing in common, it's 0.0. The value is never negative, but that's OK. It's possible to expand the results into the range -1 to 1 with some simple math: similarity = 2 * similarity - 1. It won't matter to the framework.
Table 4.4 shows the value of the Tanimoto coefficient similarity between user 1 and the other users. Note that this similarity metric doesn't depend only on the items that both users have some preference for, but on the items that either user has some preference for. Hence, all seven items appear in the calculation, unlike before.

Table 4.4 The similarity values between user 1 and other users, computed using the Tanimoto coefficient. Note that preference values themselves are omitted, because they aren't used in the computation.

        101  102  103  104  105  106  107  Similarity to user 1
User 1  X    X    X    -    -    -    -    1.0
User 2  X    X    X    X    -    -    -    0.75
User 3  X    -    -    X    X    -    X    0.17
User 4  X    -    X    X    -    X    -    0.4
User 5  X    X    X    X    X    X    -    0.5

You're likely to use this similarity metric if, and only if, your underlying data contains only Boolean preferences and you have no preference values to begin with. If you do have preference values, you might use this metric because you believe there's more signal than noise in the values, but you would usually do better with a metric that uses the preference information. In the GroupLens data set, using the Tanimoto coefficient gives a slightly worse score of 0.82.
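As a quick sanity check of table 4.4, here is a small hand computation of the Tanimoto coefficient between user 1 and user 2 (a plain-Java sketch using java.util, independent of Mahout's implementation):

		Set<Long> user1Items = new HashSet<Long>(Arrays.asList(101L, 102L, 103L));
		Set<Long> user2Items = new HashSet<Long>(Arrays.asList(101L, 102L, 103L, 104L));
		Set<Long> intersection = new HashSet<Long>(user1Items);
		intersection.retainAll(user2Items);              // {101, 102, 103}
		Set<Long> union = new HashSet<Long>(user1Items);
		union.addAll(user2Items);                        // {101, 102, 103, 104}
		double tanimoto = (double) intersection.size() / union.size();  // 3 / 4 = 0.75, matches table 4.4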
使用对数似然函数测试比较得到更好的相似度量
Log-likelihood-based similarity is similar to the Tanimoto coefficient-based similarity, though it's more difficult to understand intuitively. It's another metric that doesn't take account of individual preference values. The math involved in computing this similarity metric is beyond the scope of this book to explain. Like the Tanimoto coefficient, it's based on the number of items in common between two users, but its value is more an expression of how unlikely it is for two users to have so much overlap, given the total number of items out there and the number of items each user has a preference for.
To illustrate, consider two movie fans who have each seen and rated several movies but have only both seen Star Wars and Casablanca. Are they similar? If they've each seen hundreds of movies, it wouldn't mean much. Many people have seen these movies, and if these two have seen many movies but only managed to overlap in these two, they're probably not similar. On the other hand, if each user has seen just a few movies, and these two were on both users' lists, it would seem to imply that they're similar people when it comes to movies; the overlap would be significant.
The Tanimoto coefficient already encapsulates some of this thinking, because it looks at the ratio of the size of the intersection of their interests to the union. The log-likelihood is computing something slightly different. It's trying to assess how unlikely it is that the overlap between the two users is just due to chance. That is to say, two dissimilar users will no doubt happen to rate a couple of movies in common, but two similar users will show an overlap that looks quite unlikely to be mere chance. With some statistical tests, this similarity metric attempts to determine just how strongly unlikely it is that two users have no resemblance in their tastes; the more unlikely, the more similar the two should be. The resulting similarity value may be interpreted as a probability that the overlap isn't due to chance.
These similarity values are shown for our small data set in table 4.5. As you can see, user 1's similarity with user 1, or with any other user with identical preferred items, isn't 1.0 under this measure. Using a log-likelihood-based similarity metric is as easy as inserting new LogLikelihoodSimilarity in listing 4.3, as before.

Table 4.5 The similarity values between user 1 and other users, computed using the log-likelihood similarity metric

        101  102  103  104  105  106  107  Similarity to user 1
User 1  X    X    X    -    -    -    -    0.90
User 2  X    X    X    X    -    -    -    0.84
User 3  X    -    -    X    X    -    X    0.55
User 4  X    -    X    X    -    X    -    0.16
User 5  X    X    X    X    X    X    -    0.55

Although it's hard to generalize, log-likelihood-based similarity will probably outperform Tanimoto coefficient-based similarity. It is, in a sense, a more intelligent metric. Rerunning the evaluation shows that, at least for this data set and recommender, it improves performance over TanimotoCoefficientSimilarity, to 0.73.
推断参考值
Sometimes too little data is a problem. In a few cases, for example, the Pearson correlation was unable to compute any similarity value at all because some pairs of users overlap in only one item. The Pearson correlation also can't take account of preference values for items for which only one user has expressed a preference.
What if a default value were filled in for all the missing data points? For example, the framework could pretend that each user has rated every item, by inferring preferences for items for which the user hasn't explicitly expressed a preference. This sort of strategy is enabled via the PreferenceInferrer interface, which currently has one implementation, AveragingPreferenceInferrer. This implementation computes the average preference value for each user and fills in this average as the preference value for any item not already associated with the user. It can be enabled on a UserSimilarity implementation with a call to setPreferenceInferrer().
Although this strategy is available, it's not usually helpful in practice. It's provided primarily because it's mentioned in early research papers on recommender engines. In theory, making up information purely based on existing information isn't adding anything, but it certainly does slow down computations drastically. It's available for experimentation but will likely not be useful when applied to real data sets.
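As a sketch of how the inferrer would be wired in (enabling it on a Pearson-based UserSimilarity; this is illustrative rather than a recommended configuration):

		UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
		similarity.setPreferenceInferrer(new AveragingPreferenceInferrer(model));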
You’ve now seen all of the user-user similarity metrics that are implemented within
Mahout. This knowledge pays double dividends, because the very same implementa-tions in Mahout also provide an analogous notion of item-item similarity. That is, the
same computations can be applied to define how similar items are, not just users. This
is just in time, because the notion of similarity is likewise the basis of the next style of
Mahout recommender implementation you’ll meet, the item-based recommender.

4.4 基于商品的推荐系统

We've looked at user-based recommenders in Mahout: not one recommender, but tools to build a nearly limitless number of variations on the basic user-based approach, by plugging different and differently configured components into the implementation.
It's natural to look next at item-based recommenders. This section will be shorter, because several of the components in the preceding discussion (data models, similarity implementations) also apply to item-based recommenders.
Item-based recommendation is derived from how similar items are to items, instead of users to users. In Mahout, this means it's based on an ItemSimilarity implementation instead of UserSimilarity. To illustrate, let's return to the pair we left in the music store, doing their best to pick an album that a teenage boy would like. Imagine yet another line of reasoning they could have adopted:

ADULT: I'm looking for a CD for a teenage boy.
EMPLOYEE: What kind of music or bands does he like?
ADULT: He wears a Bowling In Hades T-shirt all the time and seems to have all of their albums. Anything else you'd recommend?
EMPLOYEE: Well, about everyone I know that likes Bowling In Hades seems to like the new Rock Mobster album.

This sounds reasonable. It's different from previous examples, too. The record store employee is recommending an item that's similar to something they already know the boy likes. This isn't the same as before, where the question was, "Who is similar to the boy, and what do they like?" Here the question is, "What is similar to what the boy likes?"
Figure 4.4 attempts to illustrate this difference, which is the essential difference between user-based and item-based recommenders. They go by different paths to find recommended items: via similar users, and via similar items, respectively.
算法
The algorithm will feel familiar since you've already seen user-based recommenders. This is also how the algorithm is implemented in Mahout:

for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average

The third line shows how it's based on item-item similarities, not user-user similarities as before. The algorithms are similar, but not entirely symmetric. They do have notably different properties. For instance, the running time of an item-based recommender scales up as the number of items increases, whereas a user-based recommender's running time goes up as the number of users increases.
This suggests one reason that you might choose an item-based recommender: if the number of items is relatively low compared to the number of users, the performance advantage could be significant.
Also, items are typically less subject to change than users. When items are things like DVDs, it's reasonable to expect that over time, as you acquire more data, estimates of the similarities between items converge. There is no reason to expect them to change radically or frequently. Some of the same may be said of users, but users can change over time, and new knowledge of users is likely to come in bursts of new information that must be digested quickly. To connect this to the last example, it's likely that Bowling In Hades albums and Rock Mobster albums will remain as similar to each other next year as they are today. But it's a lot less likely that the same fans mentioned above will have the same tastes next year, and so their similarities will change more.
If item-item similarities are more fixed, they're better candidates for precomputation. Precomputing similarities takes work, but it speeds up recommendations at runtime. This could be desirable in contexts where delivering recommendations quickly at runtime is essential; think of a news site that must potentially deliver recommendations immediately with each news article view.
In Mahout, the GenericItemSimilarity class can be used to precompute and store the results from any ItemSimilarity. It can be applied to any of the implementations you have seen, and can be added to the code snippets that follow, if you wish.
探索基于商品的推荐系统
Let's insert a simple item-based recommender into our familiar evaluation framework, using the following code. It deploys a GenericItemBasedRecommender rather than GenericUserBasedRecommender, and it requires a different and simpler set of dependencies.

public Recommender buildRecommender(DataModel model)
    throws TasteException {
  ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
  return new GenericItemBasedRecommender(model, similarity);
}

PearsonCorrelationSimilarity still works here because it also implements the ItemSimilarity interface, which is entirely analogous to the UserSimilarity interface. It implements the same notion of similarity, based on the Pearson correlation, but between items instead of users. That is, it compares the series of preferences expressed by many users for one item, rather than by one user for many items.
GenericItemBasedRecommender is simpler. It only needs a DataModel and an ItemSimilarity; there's no ItemNeighborhood. You might wonder at the apparent asymmetry. Recall that the item-based recommendation process already begins with a limited number of starting points: the items that the user in question already expresses a preference for. This is analogous to the neighborhood of similar users that the user-based approach first identifies. It doesn't make sense in the second half of the algorithm to compute neighborhoods around each of the user's preferred items.
You're invited to experiment with different similarity metrics, as we did for users. Note that not all of the implementations of UserSimilarity also implement ItemSimilarity. You already know how to evaluate the accuracy of this item-based recommender when using various similarity metrics on the now-familiar GroupLens data set. The results are reproduced for convenience in table 4.6.
One thing you may notice is that this recommender setup runs significantly faster. That's not surprising, given that the data set has about 70,000 users and 10,000 items; item-based recommenders are generally faster when there are fewer items than users. You may, as a result, wish to increase the percentage of data used in the evaluation to 20 percent or so (pass 0.2 as the final argument to evaluate()). This should result in a more reliable evaluation. Note that there's little apparent difference among these implementations on this data set.

Table 4.6 Evaluation results for the item-based recommender, using various similarity metrics on the GroupLens data set

Implementation                              Score
PearsonCorrelationSimilarity                0.75
PearsonCorrelationSimilarity + weighting    0.75
EuclideanDistanceSimilarity                 0.76
EuclideanDistanceSimilarity + weighting     0.78
TanimotoCoefficientSimilarity               0.77
LogLikelihoodSimilarity                     0.77

4.5 Slope-one推荐系统

Did you like the movie Carlito's Way? Most people who liked this movie, it seems, also liked another film starring Al Pacino, Scarface. But people tend to like Scarface a bit more. We'd imagine most people who think of Carlito's Way as a four-star movie would give Scarface five stars. So if you told us you thought Carlito's Way was a three-star movie, we might guess you'd give Scarface four stars, one more than the other film.
If you agree with this sort of reasoning, you'll like the slope-one recommender (http://en.wikipedia.org/wiki/Slope_One). It estimates preferences for new items based on the average difference in preference value (diffs) between a new item and the other items the user prefers.
For example, let's say that, on average, people rate Scarface higher by 1.0 than Carlito's Way. Let's also say that everyone rates Scarface the same as The Godfather, on average. And now, there's a user who rates Carlito's Way as 2.0 and The Godfather as 4.0. What's a good estimate of their preference for Scarface?
Based on Carlito's Way, a good guess would be 2.0 + 1.0 = 3.0. Based on The Godfather, it might be 4.0 + 0.0 = 4.0. A better guess still might be the average of the two: 3.5. This is the essence of the slope-one recommender approach.
算法
The slope-one name comes from the fact that the recommender algorithm starts with the assumption that there's some linear relationship between the preference values for one item and another, that it's valid to estimate the preferences for some item Y based on the preferences for item X, via some linear function like Y = mX + b. Then the slope-one recommender makes the additional simplifying assumption that m = 1: slope one. It just remains to find b = Y - X, the (average) difference in preference value, for every pair of items.
This means the algorithm consists of a significant preprocessing phase, in which all item-item preference value differences are computed:

for every item i
  for every other item j
    for every user u expressing a preference for both i and j
      add the difference in u's preference for i and j to an average

And then, the recommendation algorithm looks like this:

for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u's preference value for j
    add this to a running average
return the top items, ranked by these averages

The average diffs over the small sample recommender inputs from several prior examples in the book are shown in table 4.7.
Table 4.7 Average differences in preference values between all pairs of items. Cells along the diagonal are 0.0. Cells in the bottom left are simply the negative of their counterparts across the diagonal, so these aren't represented explicitly. Some diffs don't exist, such as 102-107, because no user expressed a preference for both 102 and 107.

          Item 101  Item 102  Item 103  Item 104  Item 105  Item 106  Item 107
Item 101  0.0       -0.833    0.875     0.25      0.75      -0.5      2.5
Item 102            0.0       0.333     0.25      0.5       1.0       -
Item 103                      0.0       0.167     1.5       1.5       -
Item 104                                0.0       0.0       -0.25     1.0
Item 105                                          0.0       0.5       0.5
Item 106                                                    0.0       -
Item 107                                                              0.0

Slope-one is attractive because the online portion of the algorithm is fast. Like an item-based recommender, its performance doesn't depend upon the number of users in the data model. It depends only upon the average preference difference between every pair of items, which can be precomputed. Further, its underlying data structure can be efficiently updated: when a preference changes, it's simple to update the relevant diff values. In contexts where preferences may change quickly, this is an asset.
Note that the memory required to store all of these item-item differences in preference values grows as the square of the number of items. Twice as many items means four times the memory!
Slope-one实践
You can easily try the slope-one recommender. It takes no similarity metric as a necessary argument:

new SlopeOneRecommender(model)

After running a standard evaluation using, again, the GroupLens 10 million rating data set, you'll get a result near 0.65. That's the best yet. Indeed, the simple slope-one approach works well in many cases. This algorithm doesn't make use of a similarity metric, unlike the other approaches so far. It has relatively few knobs to twiddle.
Like the Pearson correlation, the simplest form of the slope-one algorithm has a vulnerability: item-item diffs are given equal weighting regardless of how reliable they are, that is, how much data they're based upon. Let's say only one user in the history of movie watching has rated both Carlito's Way and The Notebook. It's possible; they're quite different films. It's easy to compute a diff for these two films, but would it be as useful as the diff between Carlito's Way and The Godfather, which is averaged over thousands of users? It sounds unlikely. The latter diff is probably more reliable because it's an average over a higher number of users.
Again, some form of weighting may help to improve these recommendations. SlopeOneRecommender offers two types of weighting: weighting based on count and on standard deviation. Recall that slope-one estimates preference values by adding diffs to all of the user's current preference values and then averaging all of those results together to form an estimate. Count weighting will weight more heavily those diffs that are based on more data. The average becomes a weighted average, where the diff count is the weight: the number of users on which the diff is based.
Similarly, standard deviation weighting will weight according to the standard deviation of the difference in preference value. A lower standard deviation means a higher weighting. If the difference in preference value between two films is very consistent across many users, it seems more reliable and should be given more weight. If it varies considerably from user to user, it should be deemphasized.
These variants turn out to be enough of a good idea that they're enabled by default. You already used this strategy when you ran the previous evaluation. Disable them to see the effect, as follows.
DiffStorage diffStorage = new MemoryDiffStorage(
    model, Weighting.UNWEIGHTED, Long.MAX_VALUE);
return new SlopeOneRecommender(
    model,
    Weighting.UNWEIGHTED,
    Weighting.UNWEIGHTED,
    diffStorage);
The result is 0.67—only slightly worse on this data set.
不同存储和内存考虑
Slope-one does have its price: memory consumption. In fact, if you tweak the evaluation to use just 10 percent of all data (about 100,000 ratings), even a gigabyte of heap space won't be enough. The diffs are used so frequently, and it's relatively expensive to compute them, that they do need to be computed and stored ahead of time. But keeping them all in memory can get expensive, and it may become necessary to store diffs elsewhere.
Fortunately, implementations like MySQLJDBCDiffStorage exist for this purpose, allowing diffs to be computed and updated from a database. It must be used in conjunction with a JDBC-backed DataModel implementation like MySQLJDBCDataModel, as shown in the next listing.

AbstractJDBCDataModel model = new MySQLJDBCDataModel();
DiffStorage diffStorage = new MySQLJDBCDiffStorage(model);
Recommender recommender = new SlopeOneRecommender(
    model, Weighting.WEIGHTED, Weighting.WEIGHTED, diffStorage);

As with MySQLJDBCDataModel, the table name and column names used by MySQLJDBCDiffStorage can be customized via constructor parameters.
预先计算
Precomputing the item-item diffs is significant work. Although it's more likely that the size of your data will cause problems with memory requirements before the time required to compute these diffs becomes problematic, you might be wondering whether there are ways to distribute this computation to complete it faster. Diffs can be updated easily at runtime in response to new information, so a relatively infrequent offline precomputation process is feasible in this model.
Distributing the diff computation via Hadoop is supported. Chapter 6 will introduce all of the Hadoop-related recommender support in Mahout, to explore this process.

4.6 新的实验性推荐系统

Mahout also contains implementations of other approaches to recommendation. The three implementations presented briefly in this section are newer: the implementations may still be evolving, or the technique may be more recent and experimental. All are worthy ideas that may be useful as they are, or as a starting point for modification.
基于奇异值分解的推荐系统
Among the most intriguing of these implementations is SVDRecommender, based on the singular value decomposition (SVD). This is an important technique in linear algebra that pops up in many machine learning techniques. Fully understanding it requires some advanced matrix math and an understanding of matrix factorization, but this isn't necessary to appreciate SVD's application to recommenders.
To outline the intuition behind what SVD does for recommenders, let's say you ask a friend what sort of music she likes, and she lists the following artists:
  Brahms
  Chopin
  Miles Davis
  Tchaikovsky
  Louis Armstrong
  Schumann
  John Coltrane
  Charlie Parker
She might as well have summarized that she likes classical and jazz music. That communicates less precise information, but not a great deal less. From either statement, you could (probably correctly) infer that she would appreciate Beethoven more than the classic rock band Deep Purple.
Of course, recommender engines operate in a world of many specific data points, not generalities. The input is user preferences for a lot of particular items, more like the preceding list than the summary. It would be nice to operate on a smaller set of data, all else being equal, for reasons of performance. If, for example, iTunes could base its Genius recommendations not on billions of individual song ratings, but instead on millions of ratings of genres, it would be faster, and, as a basis for recommending music, it might not be much worse.
Here, SVD is the magic that can do the equivalent of the preceding summarization. It boils down the world of user preferences for individual items to a world of user preferences for more general and less numerous features (like genre). This is, potentially, a much smaller set of data.
Although this process loses some information, it can sometimes improve recommendation results. The process smooths the input in useful ways. For example, imagine two car enthusiasts. One loves Corvettes, and the other loves Camaros, and they want car recommendations. These enthusiasts have similar tastes: both love a Chevrolet sports car. But in a typical data model for this problem, these two cars would be different items. Without any overlap in their preferences, these two users would be deemed unrelated. However, an SVD-based recommender would perhaps find the similarity. The SVD output may contain features that correspond to concepts like Chevrolet or sports car, with which both users would be associated. From the overlap in features, a similarity could be computed.
Using the SVDRecommender is as simple as this:

new SVDRecommender(model, new ALSWRFactorizer(model, 10, 0.05, 10))

SVDRecommender uses a Factorizer to perform its work; a fine first choice is the ALSWRFactorizer implementation used here. We won't discuss the choices of Factorizer in detail here.
The first numeric argument is the number of features that the SVD should target. There's no right answer for this; it would be equivalent to the number of genres that you condensed someone's musical taste into in this section's first example. The second argument is lambda, and it controls a feature of the factorizer called regularization. The last argument is the number of training steps to run. Think of this as controlling the amount of time it should spend producing this summary; larger values mean longer training.
This approach can give good results (0.69 on the GroupLens data set). At the moment, the major issue with the implementation is that it computes the SVD in memory. This requires the entire data set to fit in memory, and it's precisely when this isn't the case that this technique is appealing, because it can shrink the input without compromising output quality significantly. In the future, this algorithm will be reimplemented in terms of Hadoop, so the necessarily massive SVD computation can be distributed across multiple machines. It isn't yet available at this stage of Mahout's evolution.
基于线性插值的推荐系统
Linear interpolation is a somewhat different take on item-based recommendation, implemented as KnnItemBasedRecommender. Knn is short for k nearest neighbors, an idea also embodied in NearestNUserNeighborhood. As you saw earlier in figure 4.1, a UserNeighborhood implementation selects a fixed number of most similar users as a neighborhood of similar users. This linear interpolation algorithm uses the concept of a user neighborhood, but in a different way.
KnnItemBasedRecommender still estimates preference values by means of a weighted average of the items the user already has a preference for, but the weights aren't the results of a similarity metric. Instead, the algorithm calculates the optimal set of weights to use between all pairs of items by means of some linear algebra; here's where the linear interpolation comes in. Yes, it's possible to optimize the weights with some mathematical wizardry.
In reality, it would be very expensive to compute this across all pairs of items, so instead it first calculates a neighborhood of items most similar to the target item, the item for which a preference is being estimated. It chooses the n nearest neighbors, in much the same way that NearestNUserNeighborhood did. One can try KnnItemBasedRecommender as shown in the following listing.

Listing 4.9 Deploying KnnItemBasedRecommender

ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
Optimizer optimizer = new NonNegativeQuadraticOptimizer();
return new KnnItemBasedRecommender(model, similarity, optimizer, 10);

This code will cause the recommender to use a log-likelihood similarity metric to calculate nearest-10 neighborhoods of items. Then it'll use a quadratic programming-based strategy to calculate the linear interpolation. The details of this calculation are outside the scope of the book.
The implementation is quite functional, but in its current form it's also slow on moderately sized data sets. It should be viewed as viable for small data sets, or for study and extension. On the GroupLens data set, it yields an evaluation result of 0.76.
基于集群的推荐系统
Cluster-based recommendation is best thought of as a variant on user-based recommendation. Instead of recommending items to individual users, items are recommended to clusters of similar users. This entails a preprocessing phase, in which all users are partitioned into clusters. Recommendations are then produced for each cluster, such that the recommended items are most interesting to the largest number of users in the cluster.
The upside of this approach is that recommendation is fast at runtime, because almost everything is precomputed. One could argue that the recommendations are less personal this way, because recommendations are computed for a group rather than an individual. But this approach may be more effective at producing recommendations for new users, who have little preference data available. As long as the user can be attached to a reasonably relevant cluster, the recommendations ought to be about as good as they'll be when more is known about the user.
The name comes from the fact that the algorithm repeatedly joins most-similar clusters into larger clusters, and this implicitly organizes users into a sort of hierarchy, or tree, as depicted in figure 4.5.
Unfortunately, the clustering takes a long time, as you'll see if you attempt to run the code in the following listing, which employs a TreeClusteringRecommender to implement this idea.

Listing 4.10 Creating a cluster-based recommender

UserSimilarity similarity = new LogLikelihoodSimilarity(model);
ClusterSimilarity clusterSimilarity =
    new FarthestNeighborClusterSimilarity(similarity);
return new TreeClusteringRecommender(model, clusterSimilarity, 10);

Similarity between users is, as usual, defined by a UserSimilarity implementation. Similarity between two clusters of users is defined by a ClusterSimilarity implementation. Currently, two implementations are available: one defines cluster similarity as the similarity between the two most similar users, one chosen from each cluster; the other defines cluster similarity as the similarity between the two least similar users.
Both approaches are reasonable; in both cases the risk is that one outlier on the edge of a cluster distorts the notion of cluster similarity. Two clusters whose members are on average distant but happen to be close at one edge would be considered quite close by the most-similar-user rule, which is implemented by NearestNeighborClusterSimilarity. The least-similar-user rule, implemented by FarthestNeighborClusterSimilarity in listing 4.10, likewise may consider two fairly close clusters to be distant from one another if each contains an outlier far away from the opposite cluster.
A third approach, defining cluster similarity as the distance between the centers, or means, of two clusters, is also possible, though not yet implemented in this part of Mahout.

4.7 对比其它推荐系统

As mentioned earlier in the book, content-based recommendation is a broad and often-mentioned approach to recommendation that takes into account the content or attributes of items. For this reason, it's similar to, yet distinct from, collaborative filtering approaches, which are based on user associations with items only and treat items as black boxes without attributes. Although Mahout largely doesn't implement content-based approaches, it does offer some opportunities to make use of item attributes in recommendation computations.
注入基于内容的计算到Mahout中
For example, consider an online bookseller who stocks multiple editions of some books and recommends books to its customers. Its items are books, of course, and it might naturally define a book by its ISBN (a unique product identifier). But for a popular public-domain book like Jane Eyre, there may be many printings of the same text by different publishers, under different ISBNs. It seems more natural to recommend books based on their texts rather than on a particular edition: do you care more about reading "Jane Eyre," or "Jane Eyre as printed by ACME Publications in 1993 in paperback"? Rather than treat various publications of Jane Eyre as distinct items, it might be more useful to think of the book itself, the text, as the item, and recommend all editions of this book equally. This would be, in a sense, content-based recommendation. By treating the underlying text of a book product, which is its dominant attribute, as the item in a collaborative filtering sense, and then applying collaborative filtering techniques with Mahout, the seller would be engaging in a form of content-based recommendation.
Or recall that item-based recommenders require some notion of similarity between two given items. This similarity is encapsulated by an ItemSimilarity implementation. So far, implementations have derived similarity from user preferences only; this is classic collaborative filtering. But there's no reason the implementation couldn't be based on item attributes. For example, a movie recommender might define item (movie) similarity as a function of movie attributes like genre, director, actor, and year of release. Using such an implementation within a traditional item-based recommender would also be an example of content-based recommendation.
深入基于内容的推荐系统
Taking this a step further, imagine content-based recommendation as a generalization of collaborative filtering. In collaborative filtering, computations are based on preferences, which are user-item associations. But what drives these user-item associations? It's likely that users have implicit preferences for certain item attributes, which come out in their preferences for certain items and not others. For example, if your friend told you she likes the albums Led Zeppelin I, Led Zeppelin II, and Led Zeppelin III, you might well guess she is actually expressing a preference for an attribute of these items: the band Led Zeppelin. By discovering these associations, and discovering attributes of items, it's possible to construct recommender engines based on these more nuanced understandings of user-item associations.
These techniques come to resemble search and document-retrieval techniques: asking what items a user might like based on user-attribute associations and item attributes resembles retrieving search results based on query terms and the occurrence of terms in documents. Although Mahout's recommender support doesn't yet embrace these techniques, it's a natural direction for future versions to address.
和基于模型的推荐系统比较
Another future direction for Mahout is model-based recommendation. This family of techniques attempts to build some model of user preferences based on existing preferences, and then infer new preferences. These techniques generally fall into the broader category of collaborative filtering, because they typically derive from user preferences only.
The model might be a probabilistic picture of users' preferences, in the form of a Bayesian network, for example. The algorithm then attempts to judge the probability of liking an item given its knowledge of all user preferences, and it ranks the recommendations accordingly.
Association rule learning can be applied in a similar sense to recommendations. By learning rules such as "when a user prefers item X and item Y, they will prefer item Z" from the data, and judging confidence in the reliability of such rules, a recommender can put together the most likely set of new, preferred items.
Cluster-based recommenders might be considered a type of model-based recommender. The clusters represent a model of how users group together and therefore how their preferences might run the same way. In this limited sense, Mahout supports model-based recommenders, but this is an area that's still largely under construction in Mahout as of this writing.

4.8 总结

In this chapter, we thoroughly explored the core recommender algorithms offered by Mahout.
We started by explaining the general user-based recommender algorithm in terms of real-world reasoning. From there, we looked at how this algorithm is realized in Mahout, as GenericUserBasedRecommender. Many pieces of this generic approach can be customized, such as the definition of user similarity and user neighborhood. We looked at the classic user similarity metric, based on the Pearson correlation, noted some possible issues with this approach, and looked at responses such as weighting. We also looked at similarity metrics based on the Euclidean distance, the Spearman correlation, the Tanimoto coefficient, and the log-likelihood ratio.
Then we examined another canonical recommendation technique, item-based recommendation, as implemented by GenericItemBasedRecommender. It reuses some concepts already covered in the context of user-based recommenders, such as the Pearson correlation.
Next, we examined the slope-one recommender, a unique and relatively simple approach to recommendation based on average differences in preference values between items. It requires significant precomputation and storage for these diffs, so we explored how to store them both in memory and in a database.
Last, we looked briefly at a few newer, more experimental implementations currently in the framework. These include implementations based on singular value decomposition, linear interpolation, and clustering. They may be useful for small data sets, or may be of academic interest, because they're still a work in progress.
The key parameters and features for each implementation are summarized in table 4.8.
Table 4.8 Key parameters and features of each recommender implementation

GenericUserBasedRecommender
  Key parameters: user similarity metric; neighborhood definition and size
  Key features: conventional implementation; fast when the number of users is relatively small

GenericItemBasedRecommender
  Key parameters: item similarity metric
  Key features: fast when the number of items is relatively small; useful when an external notion of item similarity is available

SlopeOneRecommender
  Key parameters: diff storage strategy
  Key features: recommendations and updates are fast at runtime; requires large precomputation; suitable when the number of items is relatively small

SVDRecommender
  Key parameters: number of features
  Key features: good results; requires large precomputation

KnnItemBasedRecommender
  Key parameters: number of means (k); item similarity metric; neighborhood size
  Key features: good when the number of items is relatively small

TreeClusteringRecommender
  Key parameters: number of clusters; cluster similarity definition; user similarity metric
  Key features: recommendations are fast at runtime; requires large precomputation; good when the number of users is relatively small
We're done introducing Mahout's recommender engine support. Now we're ready to examine even larger and more realistic data sets, from the practitioner's perspective. You might wonder why we've made little mention of Hadoop yet. Hadoop is a powerful tool, and it's necessary when dealing with massive data sets, where one must make use of many machines. That approach has drawbacks: such computations are massive, they're resource-intensive, and they complete in hours, not milliseconds. We'll reach Hadoop in the last chapter in this section.
First, in the next chapter, we create a production-ready recommender engine based on Mahout that fits onto one machine, one that can respond to requests for recommendations in a fraction of a second and incorporate updates immediately.

3 推荐系统数据表示


本章包括
  Mahout怎么表示推荐系统数据
  DataModel实现和使用方法
  处理没有参考值的数据

  数据的质量在很大程度上决定了推荐系统的质量。拥有高质量的数据很好,拥有大量的数据也很好。推荐引擎需要从数据中挖掘大量信息,它的效果和运行性能都强烈地受到数据质量和数据表示方式的影响。聪明地选择数据结构,可以把性能提升几个数量级;数据规模越大,影响越明显。本章探索Mahout中用来表示和存取数据的几个关键类,你将了解为什么Mahout表示用户、项和参考值的方式更加高效、更具伸缩性。本章也会详细介绍Mahout的数据模型DataModel。最后,我们将看看数据没有评分或参考值时的情况,即布尔型参考数据。

3.1 参考数据表示

  推荐系统的输入是参考数据——谁喜欢什么,喜欢多少。即输入数据由用户ID,项ID 和参考值构成的元组表示,有些时候,甚至没有参考值。
参考数据对象
  一个参考数据对象由用户ID、项ID和参考值构成,表示某个用户对某个项的偏好有多强。Preference是接口,有多种实现,通常可以用GenericPreference。例如,用户123对项456的参考值为3.0可以表示为new GenericPreference(123, 456, 3.0f)。可是参考数据的集合该怎么表示呢?用Preference[]吗?如果你这么想,在Mahout里就不合适了。一个GenericPreference只有20字节的有用数据:8字节的用户ID(Java long)、8字节的项ID(long)和4字节的参考值(float),但每个对象还要额外占用约28字节:对象头信息加上对象引用,开销达到了有用数据的140%。所以如果用Preference[]保存大量参考数据,就会有很多浪费。

  参考数据数组和实现
PreferenceArray就是这种数组式的实现。例如GenericUserPreferenceArray表示一个用户所关联的所有项以及对每项的参考值。在这种表示下,每条参考数据只需要约12字节(8字节项ID和4字节参考值),大约只有前一种方式的1/4。

代码实现:

PreferenceArray user1Prefs = new GenericUserPreferenceArray(2);
user1Prefs.setUserID(0, 1L);
user1Prefs.setItemID(0, 101L);
user1Prefs.setValue(0, 2.0f);
user1Prefs.setItemID(1, 102L);
user1Prefs.setValue(1, 3.0f);
Preference pref = user1Prefs.get(1);

提高集合速度

FastByIDMap和FastIDSet
  Mahout大量地使用Map和Set,但并不使用Java自带的TreeSet、HashMap等实现,而是使用自己定义的FastMap、FastByIDMap和FastIDSet。它们的主要目的是减少内存占用,而不是显著提高速度。它们与Java集合有几点显著不同:
  和HashMap一样,FastByIDMap也是基于散列的,但它用线性探测而不是分离链接来处理散列冲突,这样每个条目就不再需要额外的Map.Entry对象;键和值直接使用long原生类型而不是对象,这也能节省内存并提高性能。
  Set的实现并不是简单地在内部包一层Map。
  FastByIDMap可以像缓存一样使用:它可以设置最大容量,当新项加入且超出容量时,不常用的项会被移除。
  这些存储上的差异是显著的:FastIDSet平均每个成员约需14字节,而HashSet约需84字节;FastByIDMap每个条目约需28字节,而HashMap约需84字节。
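  下面是这两个类的一个简单用法示意(键直接用long,不需要装箱成Long对象;仅用来说明接口形态):

		FastIDSet itemIDs = new FastIDSet();
		itemIDs.add(101L);
		itemIDs.add(102L);
		boolean seen = itemIDs.contains(101L);   // true

		FastByIDMap<PreferenceArray> userData = new FastByIDMap<PreferenceArray>();
		userData.put(1L, new GenericUserPreferenceArray(2));
		PreferenceArray prefs = userData.get(1L);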

3.2 内存中的数据模型

  封装推荐引擎输入数据的是DataModel。不同的推荐引擎算法有各自的实现。例如,一个DataModel可以提供输入数据中用户ID的数量或列表,或提供每项的所有参考值,或提供对某个项集合有多少用户有过推荐值。这节只关注几个关键的,更多的请参考 (https://builds.apache.org/job/Mahout-Quality/javadoc/)。

通用数据模型
  GenericDataModel这是最简单的内存数据模型实现,当你要在内存中构建数据,而非使用文件或数据库时,它是合适的。它能简单的描述输入数据,通过如下代码的形式:

FastByIDMap<PreferenceArray> preferences = new FastByIDMap<PreferenceArray>();
PreferenceArray prefsForUser1 = new GenericUserPreferenceArray(10);
prefsForUser1.setUserID(0, 1L);
prefsForUser1.setItemID(0, 101L);
prefsForUser1.setValue(0, 3.0f);
prefsForUser1.setItemID(1, 102L);
prefsForUser1.setValue(1, 4.5f);
preferences.put(1L, prefsForUser1);
DataModel model = new GenericDataModel(preferences);

文件类型数据
  你可能不会经常直接使用GenericDataModel,更常用的是FileDataModel:它从文件中读取数据,并把结果以GenericDataModel的形式保存在内存中。文件可以是逗号分隔或制表符分隔的文本文件,每行一条参考数据,也可以是这类文件的压缩包。
可刷新组件
  说到加载数据,就有必要说说重新加载数据,也就是刷新。Mahout中与推荐相关的组件大多实现了刷新接口,它们通过暴露refresh()方法来重新加载底层数据并重新计算。需要注意的是,FileDataModel只有在你调用refresh()时才会重新读取文件,而不是自动进行的。
  而且通常你不止希望FileDataModel刷新,还希望依赖它的其它对象一起刷新,这就是为什么refresh()一般在最外层的Recommender上调用:

DataModel dataModel = new FileDataModel(new File("input.csv"));
Recommender recommender = new SlopeOneRecommender(dataModel);
...
recommender.refresh(null);

更新文件

数据库类型数据
  有时数据太大,无法全部装进内存,这时就需要用关系数据库来存取数据了。Mahout的一些推荐组件可以把数据和中间结果存在数据库中。但是请记住,即使经过合适的调优和建立索引,数据库方式也会因为大量的查询、序列化和反序列化而比内存方式慢很多。不过有时没有别的选择,比如你的数据本来就存储在数据库中,需要直接集成时。
  JDBC和MySQL
  通过JDBC的实现为JDBCDataModel,如MySQL5.x: MySQLJDBCDataModel。
  配置JNDI
  这里请自己参考实现。
用程序配置
例如:

		MysqlDataSource dataSource = new MysqlDataSource();
		dataSource.setServerName("my_database_host");
		dataSource.setUser("my_user");
		dataSource.setPassword("my_password");
		dataSource.setDatabaseName("my_database_name");
		JDBCDataModel dataModel = new MySQLJDBCDataModel(dataSource, "my_prefs_table", "my_user_column",
				"my_item_column", "my_pref_value_column");

3.3 复制没有参考值的参考数据

  有时,参考数据是没有具体数值的,只表明用户和项之间存在关联。比如一个新闻网站要给用户推荐文章,要求用户给文章打分不太现实,我们只知道用户看过哪些文章而已。这种情况下参考数据就没有参考值。在Mahout中,这种数据称为布尔参考数据:用户和项之间的关联要么存在,要么不存在,没有强弱之分。

什么时候忽略参考值
  什么时候忽略参考值?可能和上下文相关,如只关注喜欢或不喜欢。

使用内存表示没有参考值的参考数据
  Mahout中有专门用于存储没有参考值的数据的模型GenericBooleanPrefDataModel。它只用FastIDSet存储用户和项之间的关联关系,每条参考数据可以省下存参考值的4字节。这种实现的某些方法会更快,如getItemIDsForUser(),因为结果可以直接返回;但有些方法会更慢,如getPreferencesFromUser(),因为它内部没有PreferenceArray,只能临时构造。你可能会问getPreferenceValue()该返回什么呢?它总是返回一个人为的值:1.0。
  关于GenericBooleanPrefDataModel,这儿有个例子:

		DataModel model = new GenericBooleanPrefDataModel(GenericBooleanPrefDataModel.toDataMap(new FileDataModel(
				new File("ua.base"))));
		RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
		RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
			public Recommender buildRecommender(DataModel model) throws TasteException {
				UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
				UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
				return new GenericUserBasedRecommender(model, neighborhood, similarity);
			}
		};
		DataModelBuilder modelBuilder = new DataModelBuilder() {
			public DataModel buildDataModel(FastByIDMap trainingData) {
				return new GenericBooleanPrefDataModel(GenericBooleanPrefDataModel.toDataMap(trainingData));
			}
		};
		double score = evaluator.evaluate(recommenderBuilder, modelBuilder, model, 0.9, 1.0);
		System.out.println(score);

选择兼容的实现方式
  你会发现运行上面的代码会抛出IllegalArgumentException,异常来自PearsonCorrelationSimilarity的构造器。为什么呢?GenericBooleanPrefDataModel不也是DataModel吗?原因是像PearsonCorrelationSimilarity、EuclideanDistanceSimilarity这类依赖参考值的相似性度量,在数据没有参考值时会拒绝工作,因为即使强行计算,结果也没有意义。要解决这个问题,必须选择一个不依赖参考值的相似性度量,LogLikelihoodSimilarity就是其中之一。把它换进去重新运行,得到的评估结果是0.0。这是正确的结果吗?是的:估计值和真实值都是1.0,差异为0,所以评估分数为0。但这个评估本身没有意义,因为无论怎样结果都是0。相比之下,精度和重现率的评估还是有效的,看下面的代码:

		DataModel model = new GenericBooleanPrefDataModel(new FileDataModel(new File("ua.base")));
		RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
		RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
			@Override
			public Recommender buildRecommender(DataModel model) {
				UserSimilarity similarity = new LogLikelihoodSimilarity(model);
				UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
				return new GenericUserBasedRecommender(model, neighborhood, similarity);
			}
		};
		DataModelBuilder modelBuilder = new DataModelBuilder() {
			@Override
			public DataModel buildDataModel(FastByIDMap trainingData) {
				return new GenericBooleanPrefDataModel(GenericBooleanPrefDataModel.toDataMap(trainingData));
			}
		};
		IRStatistics stats = evaluator.evaluate(recommenderBuilder, modelBuilder, model, null, 10,
				GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
		System.out.println(stats.getPrecision());
		System.out.println(stats.getRecall());

  运行的结果大致是:精度和重现率都在24.7%左右。这并不理想:返回的推荐里只有大约1/4是好的推荐,所有好的推荐里也只有大约1/4被返回了。
我们再来看另一个问题:在GenericUserBasedRecommender中,推荐结果仍然是按估计出来的参考值排序的,但这里所有参考值都是1.0,这样的排序是无意义的。可以改用GenericBooleanPrefUserBasedRecommender,它会产生有意义的排序:它把拥有某个项的相似用户的相似度加权汇总,作为该项的排序依据,关联的用户越多、越相似,权重越高。替换后重新运行,结果在22.9%左右,和之前差不多,这是因为我们本来也没有在用一个特别有效的推荐系统。
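  补充一个示意性的替换写法(只需要改动RecommenderBuilder中构造推荐引擎的那一行,其余与上面的精度/重现率评估代码相同):

		RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
			@Override
			public Recommender buildRecommender(DataModel model) throws TasteException {
				UserSimilarity similarity = new LogLikelihoodSimilarity(model);
				UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
				// 用相似度的加权和来排序,而不是依赖恒为1.0的估计参考值
				return new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);
			}
		};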

3.4 总结

略。

本作品采用知识共享署名 4.0 国际许可协议进行许可。

2 推荐系统介绍

原创文章,转载请注明: 转载自慢慢的回味

本文链接地址: 2 推荐系统介绍

本章包括
Mahout里面的推荐引擎
实践第一个推荐引擎
评估推荐引擎的准确性和质量
使用GroupLens真实数据评估推荐引擎

  每天我们都在对各种事物表达喜欢、不喜欢或者不关心,而且往往是无意识的。你在电台听到一首歌,注意到了它,可能因为它很吸引人,也可能因为它难听,也可能你根本没注意到它。同样的事也发生在衬衫、沙拉、发型、面孔等各种东西上。虽然人们的爱好各不相同,但都遵循同一个模式:人们会喜欢和他们已经喜欢的东西类似的东西。比如某人喜欢牛肉拉面,你可以猜到他大概也喜欢羊肉泡馍,因为都是清真食品。同样,人们也会喜欢和自己相似的人所喜欢的东西。推荐系统就是利用这些模式来判定喜好,用来帮你发现那些你还不知道、但可能想要的东西。在对这个观点做更深入的研究之前,你将先运行一个简单的推荐引擎,并弄明白它是怎么工作的。

2.1 定义一个推荐系统

  比如你去书架找这本书,你可以看到挨着它放的书你可能也喜欢。因为人们喜欢把相关的书放在一块儿。如果要发现你可能喜欢的东西,可以从和你有类似喜好的人喜欢的东西里面找,也可以从类似你已经拥有的东西里面去找。实际上有两类推荐算法引擎:基于用户的和基于物品的,这两者Mahout都含有。严格的讲,上述就是采用协同过滤来推荐的。这种技术不需要项的属性。也就是说,这个推荐引擎框架不关心项是书,花还是其他人,即不需要输入数据的属性。
  当然也有依赖于项属性的算法,通常称为基于内容的推荐技术。比如,你朋友把这本书推荐给你,是因为它是Manning出版的,而你的朋友喜欢其它Manning的书,那么你朋友做的就更像是基于内容的推荐:推荐依据是书的一个属性,即出版商。如果要实现基于内容的推荐引擎,你必须决定使用项的哪些属性以及每个属性的权重,而且换一个领域,这套做法往往就不适用了。
基于这个原因,Mahout并没有在基于内容的推荐引擎上投入太多,它提供的是一个协同过滤框架;如何在其上结合领域信息,可以参见第5章关于dating网站的例子。现在我们来看一个简单的例子。

2.2 运行第一个推荐引擎

  Mahout含有多种类型的推荐引擎,我们以一个简单的基于用户的推荐引擎开始吧。

创建输入
  数据输入采用参考数据的形式,每条参考数据包含用户ID、物品ID和对应的参考值。因为我们要把物品推荐给用户,这样表达输入数据比较合理。用户ID和物品ID都是整型的ID,参考值是一个范围内的数值,比如1到5,数值越大表示越喜欢。
  按照这个逻辑先创建一个如下形式的输入文件intro.csv。依次为用户ID,物品ID,参考值。

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

  可以看到用户1和用户5具有相似的喜好,他们对这几项的喜好顺序都是101>102>103。用户1和用户4也类似,他们都喜欢101和103。用户1和用户2的口味看起来相反,用户1喜欢102而用户2不喜欢。用户1和用户3则看不出什么关系,因为他们共同评价过的只有101。

创建推荐系统
  如果要给用户1推荐,该推荐什么呢?不会是101、102和103,因为用户1已经评价过它们了,推荐系统要发现的是新的东西。用户4和用户5的喜好和用户1类似,可以从他们喜欢的东西里去找候选。在剩下的104、105、106中,104看起来最有可能。
  现在,运行下面的代码:

class RecommenderIntro {
	public static void main(String[] args) throws Exception {
		DataModel model = new FileDataModel(new File("intro.csv"));
		UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
		UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
		Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
		List<RecommendedItem> recommendations = recommender.recommend(1, 1);
		for (RecommendedItem recommendation : recommendations) {
			System.out.println(recommendation);
		}
	}
}

  接下来的两章会详细介绍这些组件,这里先稍微解释一下。DataModel的实现负责存储参考数据并提供访问接口,UserSimilarity的实现用来计算两个用户之间的相似度,UserNeighborhood的实现用来定义某个用户的相似用户邻居,最后Recommender的实现把这些组合起来,计算并产生推荐结果。

分析输出
  运行代码会有如下输出:
RecommendedItem [item:104, value:4.257081]
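  输出中的value就是推荐引擎为用户1在项104上估计出来的参考值,推荐列表正是按这种估计值排序的。如果想单独查看某个用户对某个项的估计值,可以直接调用Recommender的estimatePreference方法,下面是接着前面代码的一个小示意:

		// 估计用户1对项104的参考值,应当与上面输出中的value一致
		float estimate = recommender.estimatePreference(1, 104);
		System.out.println(estimate);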

2.3 评估一个推荐系统

  现在我们要回答一个问题:对一个用户来说,怎样的推荐才算是最好的?为此我们需要评估推荐系统。
一个理想的推荐系统,应该能准确地预测你对那些还没有表达过喜好的物品会给出什么评价。推荐系统本质上就是在估计这些参考值,所以评估推荐系统,就是衡量它估计出来的参考值和实际参考值的吻合程度。
训练数据和评分
  那些"实际的参考值"并不存在,任何人都不可能预先知道,但我们可以通过把已有的输入数据分成训练数据和测试数据来模拟:用训练数据构建推荐系统,对测试数据中的项估计参考值,再和测试数据里真实的参考值进行比较,差异越小越好。这种差异既可以取绝对差的平均值,也可以取差的平方的均值再开方(均方根)。如下表所示:

Item 1 Item 2 Item 3
测试数据值 3.0 5.0 4.0
推荐数据值 3.5 2.0 5.0
差异 0.5 3.0 1.0
差异平均值 = (0.5 + 3.0 + 1.0) / 3 = 1.5
差异的均方根值 = √((0.5^2 + 3.0^2 + 1.0^2) / 3) ≈ 1.8484

运行推荐评估系统
参考如下代码:
代码2.3

		RandomUtils.useTestSeed();
		DataModel model = new FileDataModel(new File("intro.csv"));
		RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
		RecommenderBuilder builder = new RecommenderBuilder() {
			@Override
			public Recommender buildRecommender(DataModel model) throws TasteException {
				UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
				UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
				return new GenericUserBasedRecommender(model, neighborhood, similarity);
			}
		};
		double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);
		System.out.println(score);

  RecommenderEvaluator会把数据分成训练数据和测试数据,用训练数据构建一个数据模型和推荐系统,最后比较估计出来的参考值和实际值。

获取结果
  运行测试,输出结果为1.0。因为调用了RandomUtils.useTestSeed(),每次都会使用同样的随机种子,所以多次运行都能得到同一个结果,这便于测试。这里用的是AverageAbsoluteDifferenceRecommenderEvaluator,它的结果是估计值和测试值之间差异的平均绝对值。你也可以用RMSRecommenderEvaluator来替代它,得到差异的均方根值。evaluator.evaluate方法中的1.0表示使用全部(100%)的输入数据参与评估,0.7表示其中70%用于训练推荐系统,剩下的30%用于测试。
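  例如,只需要把构造evaluator的那一行换掉,其余代码不变(示意):

		// 用均方根差异(RMS)代替平均绝对差异
		RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();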
  更一般地说,我们还可以用经典的信息检索指标来评估推荐系统:精度和重现度。这就像搜索引擎从大量可能的结果中挑出最好的返回一样:它不应该把相关度不高的结果排在最上面,而应该返回最相关的结果。精度表示排在最前面的结果中相关结果所占的比例,比如"精度为10%"意味着返回的前若干个结果里只有10%是相关的;重现度则表示所有相关结果中被返回到最前面的比例。用在推荐引擎上就是:精度是排在最前面的推荐中好推荐所占的比例,重现度是所有好的推荐中被排到最前面的比例。

2.4 评估精度和重现能力

运行RecommenderIRStatsEvaluator
  代码如下:

		RandomUtils.useTestSeed();
		DataModel model = new FileDataModel(new File("intro.csv"));
		RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
		RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
			@Override
			public Recommender buildRecommender(DataModel model) throws TasteException {
				UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
				UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
				return new GenericUserBasedRecommender(model, neighborhood, similarity);
			}
		};
		IRStatistics stats = evaluator.evaluate(recommenderBuilder, null, model, null, 2,
				GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
		System.out.println(stats.getPrecision());
		System.out.println(stats.getRecall());

  输出应为:
0.75
1.0
  这里的精度和重现度是针对"前2个推荐"(代码中的参数2)计算的:精度0.75表示平均有3/4的推荐是好的推荐;重现度1.0表示所有好的推荐都被推荐出来了。

精度和重现能力问题

2.5 评估GroupLens数据集

  工具在手,我们就不仅可以讨论推荐系统的速度,还可以讨论它的质量了。

分解输入
  GroupLens (http://grouplens.org/)是一个研究项目,提供了多个不同大小的数据集,每个数据集都来自真实用户的评分。到http://www.grouplens.org/node/73下载100K数据集,解压后找到ua.base文件,它是用制表符(Tab)分割的数据文件。把ua.base应用到代码2.3中,运行几分钟后,你会得到0.9左右的结果。难道是我们实现的推荐引擎还不够好?
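  具体做法只需要把代码2.3中构建DataModel的那一行换成指向ua.base即可(FileDataModel可以识别制表符分割的文件),其余评估代码保持不变,示意如下:

		// ua.base里每行是:用户ID<Tab>电影ID<Tab>评分<Tab>时间戳
		DataModel model = new FileDataModel(new File("ua.base"));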
使用其它推荐系统试验
  我们用org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender试试吧。

                RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
			@Override
			public Recommender buildRecommender(DataModel model) throws TasteException {
				return new SlopeOneRecommender(model);
			}
		};

  再次运行,结果在0.748左右,看起来更好了。每种算法都有自己的特点和优势,比如slope-one在线计算推荐非常快,但它需要预先花时间计算内部数据结构。

2.6 总结

  本章我们介绍了推荐引擎的概念,创建了一个基于小数据集的Mahout推荐引擎,运行并解释了它的结果;然后评估了推荐系统并解释了评估输出;最后用真实数据对推荐系统做了评估。

本作品采用知识共享署名 4.0 国际许可协议进行许可。

1 了解 Apache Mahout

原创文章,转载请注明: 转载自慢慢的回味

本文链接地址: 1 了解 Apache Mahout

本章包括
Mahout是什么,它从哪儿来
一瞥现实世界的推荐系统,集群系统和分类系统
配置Mahout

  从标题可以猜到,这本书讲的是一个特定工具Apache Mahout在现实中的有效使用。它有三个突出的特性。首先,Mahout是Apache旗下一个开源的机器学习库,实现了一部分机器学习算法,主要包括推荐引擎(协同过滤)、集群和分类。其次,它是可伸缩的:当要处理的数据量非常大时,Mahout是一个很好的机器学习工具选择;目前Mahout实现的可伸缩机器学习算法用Java编写,其中一部分基于Hadoop分布式计算框架。最后,它是一个Java库,不提供用户界面、预先打包的服务或安装包,而是一个供开发者使用和扩展的框架工具。本章会先简要看看Mahout提供的机器学习算法;为了亲手操作Mahout,你还需要做一些必要的设置和安装。

1.1 Mahout的历史

  首先,介绍一些Mahout的背景知识。你可能正在琢磨Mahout怎么发音:这是个英语化的词,读音与trout押韵,原本是印地语中"驭象人"的意思。这里有一段简短的历史:Mahout始于2008年,最初是Apache Lucene的一个子项目,Lucene是非常有名的开源搜索引擎项目,提供搜索、文本挖掘和信息提取技术,其中不少概念与集群、分类等机器学习技术相邻。后来,一些Lucene贡献者把注意力转向了机器学习,之后Mahout又吸纳了开源协同过滤项目Taste。2010年4月,Mahout成为以象为Logo的Apache顶级项目。
  Mahout的大量工作不仅在于把这些算法转换成有效、可伸缩的实现,还包括把其中一部分算法放到Hadoop上实现。Mahout孵化了一系列技术和算法,其中许多还在开发中或处于试验状态(https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms)。在项目的初始阶段,有3个核心主题:推荐引擎(协同过滤)、集群和分类。这并不代表Mahout里只有这些,但它们是目前最突出、最成熟的部分。因此,本书也只关注这三大类技术。

1.2 Mahout机器学习主题

  虽然从定位上说,Mahout对各种机器学习技术都是开放的,但它目前主要关注三个关键领域:推荐引擎(协同过滤)、集群和分类。

推荐引擎
  推荐引擎是目前最容易被大家直观认识到的机器学习技术。你一定见过一些网站和服务会向你推荐书籍、电影或文章,它们试图推断并识别出你可能喜欢但还不了解的项:
     Amazon.com可能是最著名的部署了推荐系统的电子商务网站:基于你的购买记录和站内活动,Amazon会推荐你可能喜欢的书籍或其它商品。
     Netflix同样会推荐你可能喜欢的DVD,并且愿意向能够提升其推荐引擎质量的研究者支付100万美元的奖金。
     Dating网站就像Líbímseti甚至可以推荐人给人。
     社交网站像Facebook使用推荐技术的各种变种,去识别那些你可能认识但还没有建立联系的朋友。
  就像Amazon和上面列举的其它例子一样,推荐引擎通过创造交叉销售的机会带来了商业价值。有报道称,向用户推荐商品可以带来8%到12%的销售增长。

集群
  集群技术就没那么直观了,但它同样出现在很多著名的场景中。顾名思义,集群技术试图把相似的东西归到同一个集群里。它是一种在庞大或难以理解的数据中发现结构和层次的方法,可以揭示有意义的模式,让数据更容易被理解。
     Google新闻使用集群技术按主题把新闻分类。
     搜索引擎如Clusty按照相似性把搜索结果分组。
     消费者也可以按照收入、地区和购买习惯被划分成不同的群体。集群可以帮助识别大数据集中难以察觉的结构甚至层次。企业可以用这些技术发现用户中隐藏的分组,有效地组织大量文档,或者找出网站日志中的常见使用模式。

分类
  分类技术用来判断一个事物在多大程度上属于或不属于某个类别。分类和集群一样无处不在,但它往往更不显眼。这些系统通常通过检查大量已经标注了类别的实例来学习分类规则。分类同样有很多应用:
     Yahoo! Mail通过以前的邮件和用户的垃圾邮件规则决定收到的邮件是否垃圾邮件。
     Google的Picasa和其他照片管理程序能确定图片的一个区域是否包括人脸。
     光学字符识别软件通过分类把扫描到的区域识别成独立的字符。
  分类可以判断新的输入是否与以前见过的模式相匹配,也可以用来检测可疑的网络活动,或者判断用户的留言表达的是消极还是积极的情绪。这些技术在有大量良好的样本输入时工作得很好。在某些场景下,它们不仅要能处理大量输入,还必须很快产生结果,这些因素让可伸缩性成为重要问题。如前面提到的,Mahout的一个长处就是能在大规模输入数据上应用这些技术。

1.3 Mahout和Hadoop的大型可伸缩

  机器学习算法的规模问题真的存在吗?让我们看几个你可能需要部署Mahout的场景。
  据一些粗略的估计,Picasa在三年里收录了超过5亿张照片,这意味着每天有上百万张新图片需要分析。单独分析一张照片不是大问题,即使重复上百万次也还好;但要从如此规模的照片中同时获取信息,在单台机器上是不现实的。
  类似地,Google新闻每天要处理大约350万篇新文章。这本身已经是不小的数量,而且这些文章还必须被及时地集群处理。
  Netflix为Netflix Prize发布的数据集含有1亿条评分,而这仅仅是公布出来供竞赛使用的部分,Netflix内部每天需要处理的推荐数据想必比这还要大好几倍。
机器学习技术往往就部署在这样的环境中:数据量大到在一台机器上,甚至一台很强大的机器上也难以处理。没有像Mahout这样的实现,这些任务几乎无法完成。这也是Mahout把可伸缩性作为首要优先级的原因,也是本书关注的重点:有效地处理大量数据。过去,只有少数大型的、技术领先的公司才会考虑把复杂的机器学习技术部署到可伸缩的环境上;而今天,随着计算能力越来越便宜,以及Apache Hadoop这类开源框架越来越容易获得,Mahout试图补上最后一块拼图,为所有公司提供一个高质量的、在Hadoop上可伸缩的开源实现。
  Mahout使用的Hadoop是一个开源的、基于Java的MapReduce分布式计算框架,类似于谷歌内部广泛使用的系统(http://labs.google.com/papers/mapreduce.html)。MapReduce初听起来像是一种奇怪的编程范式,简单得出奇,却很强大。MapReduce范式要求问题的输入是键值对的集合:一个Map函数把这些键值对转换成中间键值对,一个Reduce函数再把相同键的中间键值对合并,产生最终输出。实际上,许多问题都可以表达成MapReduce的形式。这种范式非常适合并行处理:各个处理之间相互独立,意味着计算可以被分解到许多机器上。这里不再详细阐述MapReduce,推荐阅读Hadoop提供的教程(http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html)。
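  下面是一个极简的示意(省略import,与Mahout的实现无关,只是借Hadoop的Mapper/Reducer接口展示这种范式,类名和输入格式都是假设的):Map把输入文件中每行"用户ID,项ID"映射成中间键值对,Reduce再把同一个用户的所有项合并成一行输出。

// Map阶段:把每行"用户ID,项ID"映射成 (用户ID, 项ID) 的中间键值对
class ItemsByUserMapper extends Mapper<LongWritable, Text, LongWritable, LongWritable> {
	@Override
	protected void map(LongWritable offset, Text line, Context context)
			throws IOException, InterruptedException {
		String[] tokens = line.toString().split(",");
		context.write(new LongWritable(Long.parseLong(tokens[0])),
				new LongWritable(Long.parseLong(tokens[1])));
	}
}

// Reduce阶段:把同一个用户的所有项ID合并成一行输出
class ItemsByUserReducer extends Reducer<LongWritable, LongWritable, LongWritable, Text> {
	@Override
	protected void reduce(LongWritable userID, Iterable<LongWritable> itemIDs, Context context)
			throws IOException, InterruptedException {
		StringBuilder items = new StringBuilder();
		for (LongWritable itemID : itemIDs) {
			items.append(itemID.get()).append(' ');
		}
		context.write(userID, new Text(items.toString().trim()));
	}
}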
Hadoop实现了MapReduce范式,即便MapReduce听起来很简单,这也绝不是一件小事。它需要管理输入数据、中间键值对和输出结果的存储;这些海量数据必须对许多工作节点可用,而不仅仅是存在本地。它还要负责在工作节点之间分割和传输数据,并检测和恢复单个节点的故障。理解了这些,你就知道Hadoop要处理的工作有多复杂。它不只是一个加到项目里的库,而是由若干部分组成,可能以服务的形式运行在单台机器上,也可能运行在多台机器上。
运维Hadoop并不轻松,但在可伸缩、分布式的实现上投资是值得的:你的数据可能很快增长好几个数量级,这种可伸缩的实现是一种面向未来的做法。
  在第6章,本书将试着让一些复杂的例子在Hadoop上快速运行起来,之后你可以更深入地探索集群的操作和这个框架的调优。由于这样复杂的框架需要大量的计算资源,云计算提供商提供Hadoop相关的服务也就不足为奇了。例如Amazon提供了Elastic MapReduce(http://aws.amazon.com/elasticmapreduce/),它负责管理Hadoop集群、提供计算能力,并通过友好的界面来代替复杂的手工操作,帮助你运行和监控Hadoop上的大规模任务。

1.4 设置Mahout环境

  如果你想跟随本书动手运行书中的代码,还需要准备一些工具。我们假设你已经具备基本的Java开发经验。Mahout及其相关框架都是基于Java的,因此是平台无关的,你可以在任何运行现代JVM的平台上使用它。有时不同平台的操作方式不一样,比如命令行命令在Windows和FreeBSD下并不相同,所以这里统一使用Bash的写法,它在各种Linux发行版、Mac OS X和其他Unix变种上都可以使用,在Windows上也可以借助Cygwin(一个在Windows上运行的Unix环境)来使用。

Java和IDEs
  如果你做过Java开发,Java很可能已经安装好了。Mahout需要Java 6。Windows和Linux用户请从http://www.oracle.com/technetwork/java/下载;Mac用户可以使用Apple自带的Java,在/Applications/Utilities目录下确认并把Java 6设置为默认版本。
  你可能还需要一个IDE来方便编程,可以参考如下选择:
Eclipse (http://www.eclipse.org)
NetBeans (http://netbeans.org/)
IntelliJ IDEA(http://www.jetbrains.com/idea/index.html)

安装Maven
  和其它Apache项目一样,Mahout的编译和发布都使用Maven(http://maven.apache.org)管理。至于Maven怎么安装,这里就不翻译了。
  
安装Mahout
  Mahout还在开发中,这本书用的是0.5版本的,需要从https://cwiki.apache.org/confluence/display/MAHOUT/Downloads下载。
  下载安装好后,可以从Manning的网站(http://www.manning.com/MahoutinAction/)或GitHub(https://github.com/tdunning/MiA)下载样例代码设置工作环境。
安装Hadoop
  Hadoop请自行安装,可以从http://hadoop.apache.org/common/releases.html下载0.20.2版本的,然后设置一个伪分布的单节点(http://hadoop.apache.org/common/docs/current/single_node_setup.html).

1.5 总结

  Mahout是Apache旗下一个年轻的、开源的、可伸缩的机器学习库。本书是一本利用Mahout把机器学习技术应用到现实问题上的实践指南。你马上将探索推荐引擎、集群和分类技术。如果你偏重机器学习理论,这本书会引导你把算法用于实践。我们已经列举了推荐引擎、集群和分类在真实环境中部署的著名例子:电子商务、邮件、视频、图片等大规模机器学习场景。这些技术被用来解决真实的问题,并通过Mahout为企业带来价值。

本作品采用知识共享署名 4.0 国际许可协议进行许可。

深入浅出Mahout(Mahout in Action)-目录

原创文章,转载请注明: 转载自慢慢的回味

本文链接地址: 深入浅出Mahout(Mahout in Action)-目录

1 了解 Apache Mahout

1.1 Mahout的历史

1.2 Mahout机器学习主题

推荐引擎
集群
分类

1.3 Mahout和Hadoop的大型可伸缩

1.4 设置Mahout环境

Java和IDEs
安装Maven
安装Mahout
安装Hadoop

1.5 总结

第一部分 推荐

2 推荐系统介绍

2.1 定义一个推荐系统

2.2 运行第一个推荐引擎

创建输入
创建推荐系统
分析输出

2.3 评估一个推荐系统

训练数据和评分
运行推荐评估系统
获取结果

2.4 评估精度和重现能力

运行RecommenderIRStatsEvaluator
精度和重现能力问题

2.5 评估GroupLens数据集

分解输入
使用其它推荐系统试验

2.6 总结

3 推荐系统数据表示

3.1 参考数据表示

参考数据对象
参考数据数组和实现
提高集合速度
FastByIDMap和FastIDSet

3.2 内存中的数据模型

通用数据模型
文件类型数据
可刷新组件
更新文件
数据库类型数据
JDBC和MySQL
配置JNDI
用程序配置

3.3 处理没有参考值的参考数据

什么时候忽略参考值
使用内存表示没有参考值的参考数据
选择兼容的实现方式

3.4 总结

4 创建推荐系统

4.1 理解基于用户的推荐系统

什么时候推荐是错误的
什么时候推荐是正确的

4.2 探索基于用户的推荐系统

算法
用GenericUserBasedRecommender实现算法
使用GroupLens探索
探索用户的邻居数据
固定大小的邻居数据
基于阈值的邻居数据

4.3 探索相似性度量标准

基于皮尔逊相关性相似性度量
皮尔逊相似性问题
利用权值
使用欧几里得距离定义相似性
使用余弦定义相似性
用斯皮尔曼等级相关系数定义相似性
使用Tanimoto系数定义忽略参考值的相似性
使用对数似然函数测试比较得到更好的相似度量
推断参考值

4.4 基于商品的推荐系统

算法
探索基于商品的推荐系统

4.5 Slope-one推荐系统

算法
Slope-one实践
不同存储和内存考虑
预先计算

4.6 新的实验性推荐系统

基于奇异值分解的推荐系统
基于线性插值的推荐系统
基于集群的推荐系统

4.7 对比其它推荐系统

注入基于内容的计算到Mahout中
深入基于内容的推荐系统
和基于模型的推荐系统比较

4.8 总结

5 把推荐系统用于生产

5.1 分析一个来自dating网站的样例数据

5.2 找到一个有效的推荐系统

基于用户的推荐系统
基于商品的推荐系统
Slope-one推荐系统
评估精度和重现能力
评估性能

5.3 注入领域特性信息

利用定制的相似度量
推荐
基于内容
用IDRescorer修改推荐系统
合并性别信息到IDRescorer中
组建一个定制的推荐系统

5.4 向匿名用户推荐

使用PlusAnonymousUserDataModel添加临时用户
汇聚匿名用户

5.5 创建一个可以在网站上使用的推荐系统

打包一个WAR文件
测试部署

5.6 更新检测推荐系统

5.7 总结

6 分布式推荐系统

6.1 分析Wikipedia数据

规模化
评估分布计算的好处和坏处

6.2 设计一个分布的基于商品的算法

构建一个共现矩阵
计算出用户矢量
产生推荐结果
理解结果
分布实现

6.3 使用MapReduce实现分布式算法

介绍MapReduce
转换为MapReduce: 生成用户矢量
转换为MapReduce: 计算共现矩阵
转换为MapReduce: 考虑矩阵乘法
转换为MapReduce: 使用部分集做矩阵乘法
转换为MapReduce: 计算推荐

6.4 使用Hadoop运行MapReduces

部署Hadoop
使用Hadoop运行推荐算法
配置mappers和reducers

6.5 伪分布推荐系统

6.6 推荐系统的其他使用方式

在云端运算
推荐系统的非常规用法

6.7 总结

第二部分 集群

8 数据表示

8.1 可视化矢量

把数据转为矢量
准备Mahout使用的矢量

8.2 把文本文档转为矢量

使用TF-IDF改进权值
使用n-gram集合计算单词的依赖性

8.3 从文档生成矢量

8.4 使用归一化改进矢量

8.5 总结

9 Mahout中的集群算法

9.1 K均值集群

关于K均值
运行K均值集群
使用canopy集群找到合适的K
样例学习: 使用K均值集群新闻文章

9.2 超越K均值: 集群技术概览

不同种类的集群问题
不同的集群方法

9.3 模糊K均值集群

运行模糊K均值集群
怎样的模糊算好?
样例学习: 使用模糊K均值集群新闻文章

9.4 基于模型的集群

K均值的缺点
狄利克雷集群
运行一个基于模型的集群例子

9.5 使用隐含狄利克雷分布(LDA)的主题模型

理解隐含狄利克雷分析
TF-IDF与LDA
调优LDA的参数
样例学习: 找到新闻文章中的主题
主题模型的应用

9.6 总结

10 评估改进集群质量

10.1 检查集群输出

10.2 分析集群输出

距离度量与特征选择
集群之间和集群内部的距离
集群的混合和重叠

10.3 改进集群质量

改进文档矢量的生成
写一个自定义的距离度量

10.4 总结

11 把集群应用于生产上

11.1 运行集群到hadoop上的快速入门例子

在hadoop本地集群上运行
Hadoop参数的配置

11.2 调优集群性能

避免处理器受限操作的性能陷阱
避免I/O受限操作的性能陷阱

11.3 批量和在线集群

样例学习: 在线新闻集群
样例学习: 集群维基百科文章

11.4 总结

12 现实世界的集群程序

12.1 在Twitter上查找相似的用户

数据处理与特征权值化
在特征选择上避免常见的陷阱

12.2 Last.fm上的艺术家建议标签

标签建议的共现矩阵
创建Last.fm艺术家词典
把Last.fm标签转成以音乐家为特征的矢量
在Last.fm数据上运行K均值集群

12.3 分析Stack Overflow数据集

得到Stack Overflow数据集
在Stack Overflow数据上集群的问题

12.4 总结

第三部分 分类

13 分类介绍

13.1 为什么使用Mahout分类

13.2 分类系统的原理

分类,推荐和集群的不同
分类应用

13.3 分类是怎么工作的

模型
训练,测试与生产
预测变量与目标变量
记录,字段和值
预测变量值的四种类型
监督与非监督学习

13.4 典型的分类过程流程

第一步: 训练分类模型
第二步: 评估分类模型
第三步: 使用模型在生产上分类

13.5 手把手的简单分类例子

数据与挑战
训练一个模型来进行着色分类: 初步想法
选择一个算法来分类
提高着色分类器的性能

13.6 总结

14 训练分类器

14.1 提取特征来构建一个Mahout分类器

14.2 预处理元数据到分类数据

转换元数据
处理一个市场数据的例子

14.3 把分类数据转化为矢量

用矢量表示数据
使用Mahout APIs hash特征

14.4 使用SGD分类对20 newsgroups数据进行分类

预览数据
从 20 newsgroups数据中获取特征
20 newsgroups数据的训练代码

14.5 选择一个算法来训练分类器

非并行但功能强大: 使用SGD和SVM
功能强大的贝叶斯分类: 使用朴素贝叶斯和互补朴素贝叶斯
复杂结构上的优势: 使用随机森林

14.6 使用朴素贝叶斯对20 newsgroups数据进行分类

数据分解
训练朴素贝叶斯分类器
测试朴素贝叶斯分类器

14.7 总结

15 评估和调优分类器

15.1 Mahout中的分类器评估

获取快速的回应
决定什么是好的方法
认识不同的错误代价

15.2 分类器评估API

计算AUC
混淆矩阵和熵矩阵
计算平均log似然度
分析模型
使用SGD对20 newsgroups分类的性能

15.3 什么时候一个分类器不好

目标泄露
不好的特征提取

15.4 调优性能

调节问题
调节分类器

15.5 总结

16 部署分类器

16.1 在大型系统中部署

问题的范围
按需优化特征提取
按需优化矢量
部署一个可伸缩的分类服务

16.2 确定规模和速度的要求

多大就可以了?
平衡大和速度

16.3 在大系统上构建一个训练管道

获取与重新训练可伸缩的数据
反规范化和下采样
训练陷阱
高速的读取编码数据

16.4 集成Mahout分类器

集成的关节问题
模型序列化

16.5 例子: 基于Thrift的分类器

运行一个分类器服务器
获取分类服务

16.6 总结

17 样例学习: Shop It To Me

17.1 为什么Shop It To Me选择Mahout

Shop It To Me做什么
为什么Shop It To Me需要一个分类系统
Mahout解决剩下的问题

17.2 Email销售系统的通常结构

17.3 训练模型

定义分类工程的目标
使用时间分区
避免目标泄露
学习算法调整
特征矢量编码

17.4 提速分类器

特征向量的线性组合
模型分数的线性膨胀

17.5 总结

本作品采用知识共享署名 4.0 国际许可协议进行许可。