1 Collaborative filtering
Collaborative filtering is one of the most commonly used algorithms in today's recommender systems. It comes in two variants: user-based (user-CF) and item-based (item-CF).
The movie recommender in this article uses item-CF, mainly because the number of users is far larger than the number of movies, so building the matrix is much cheaper; in addition, item-based recommendations are more convincing to users of a movie recommender, since each recommendation can be explained by movies the user has already watched. User-CF is therefore only briefly introduced, and the focus is on item-CF.
1.1 User-based collaborative filtering (user-CF)
a. Compute the similarity between every pair of users, yielding the user similarity matrix;
b. Predict the user's preferences with the formula

$$p(u,i) = \sum_{v \in S(u,K) \cap N(i)} W_{uv} \, R_{vi}$$

where p(u,i) is user u's predicted interest in item i, S(u,K) is the set of the K users whose interests are closest to u's, N(i) is the set of users who have interacted with item i, W_uv is the interest similarity between users u and v, and R_vi is user v's interest in (rating of) item i.
c. Make recommendations based on the predicted preference scores (a small worked example follows the list).
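As a hypothetical worked example of step b: suppose K = 2 and the two most similar users to u who have rated item i are v1 (W_uv1 = 0.8, R_v1i = 4) and v2 (W_uv2 = 0.5, R_v2i = 3). Then

p(u,i) = 0.8 × 4 + 0.5 × 3 = 4.7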
1.2 Item-based collaborative filtering (item-CF)
1.2.1 Computing item similarity
Item similarity can be computed in many ways; here a co-occurrence matrix is used. The element in row m, column n measures the similarity between items m and n; concretely, every time one user has watched both movie m and movie n, the (m, n) entry is incremented by 1.
Afterwards the co-occurrence matrix has to be normalized, dividing each entry by the sum of its row. Note that after normalization the matrix is no longer symmetric.
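As a hypothetical example with three movies m1, m2, m3, suppose the raw co-occurrence counts are

        m1  m2  m3
  m1 [  2   1   1 ]
  m2 [  1   2   0 ]
  m3 [  1   0   2 ]

Row-normalizing (the row sums are 4, 3, 3) gives

  m1 [ 0.50  0.25  0.25 ]
  m2 [ 0.33  0.67  0.00 ]
  m3 [ 0.33  0.00  0.67 ]

which is indeed not symmetric: entry (m1, m2) = 0.25 while (m2, m1) = 0.33.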
1.2.2 Predicting a user's rating for unwatched movies
The predicted rating is given by

$$p(u,i) = \sum_{j} w_{ij} \, r_{uj}$$

where w_ij is the normalized co-occurrence between movies i and j, and r_uj is user u's rating of movie j.
In matrix form, the final prediction matrix is therefore obtained by directly multiplying the normalized co-occurrence matrix with the (movie × user) rating matrix: P = W × R.
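Continuing the hypothetical example above: if user u has rated m1 = 5 and m2 = 3 and has not seen m3, then R_u = (5, 3, 0)ᵀ and

p(u, m1) = 0.50×5 + 0.25×3 + 0.25×0 = 3.25
p(u, m2) = 0.33×5 + 0.67×3 + 0.00×0 ≈ 3.67
p(u, m3) = 0.33×5 + 0.00×3 + 0.67×0 ≈ 1.67

so the predicted score for the unseen movie m3 is about 1.67.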
1.2.3 Recommendation
Based on the predicted scores, the top-k movies the user has not yet watched are selected to form the recommendation list.
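This selection step is not covered by the five MapReduce jobs in section 3; a minimal local sketch (assuming Java 8; the class name, movie IDs, and scores are hypothetical) can keep the k best candidates with a min-heap:

import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopKSelector {

    // Return the movie IDs with the k highest predicted scores, best first.
    public static List<String> topK(Map<String, Double> predictedScores, int k) {
        // min-heap ordered by score: the weakest of the current top-k is evicted first
        PriorityQueue<Map.Entry<String, Double>> heap =
                new PriorityQueue<Map.Entry<String, Double>>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Double> entry : predictedScores.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll(); // drop the current minimum
            }
        }
        // empty the heap (ascending) and prepend, so the result is descending by score
        LinkedList<String> result = new LinkedList<String>();
        while (!heap.isEmpty()) {
            result.addFirst(heap.poll().getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<String, Double>();
        scores.put("10001", 3.25);
        scores.put("10002", 3.67);
        scores.put("10003", 1.67);
        System.out.println(topK(scores, 2)); // prints [10002, 10001]
    }
}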
2 MapReduce workflow
2.1 Input data format
Each input line is a comma-separated triple: userID,movieID,rating.
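A few hypothetical sample records:

1,10001,5.0
1,10002,3.0
2,10001,4.0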
2.2 Overall workflow
The pipeline chains five MapReduce jobs, each reading the previous job's output (see Driver.java in section 3): MR1 groups the raw ratings by user, MR2 builds the co-occurrence matrix, MR3 normalizes it, MR4 multiplies the matrix cells, and MR5 sums the partial products into the final predictions.
2.3 MR1
MR1 handles data preprocessing, merging all records of the same user into one line.
The mapper splits each record: from the input line userID,movieID,rating it emits key = userID, value = movieID:rating.
The reducer merges the values of one user, producing one line per user of the form userID \t movie1:rating1,movie2:rating2,...
2.4 MR2
MR2 builds the co-occurrence matrix.
The mapper takes all movies watched by one user and emits every ordered pair of them: key = movie1:movie2, value = 1.
The reducer merges these values, summing the 1s to produce each cell of the co-occurrence matrix, keyed by rowId:colId.
2.5 MR3
MR3 normalizes the co-occurrence matrix.
The mapper reads the co-occurrence matrix cells produced by the previous job and sends them to the reducers keyed by row number (normalization is done per row, so the row id has to be the key).
The reducer sums one whole row, divides each original value by that sum to get the normalized value, and writes every cell to HDFS keyed by its column number (storing by column prepares for the matrix multiplication that follows).
MR3's input and output therefore look like: input movieA:movieB \t count, output movieB \t movieA=normalizedValue.
2.6 MR4
MR4 performs the cell-by-cell multiplication of the two matrices.
mapper1 reads the cells of the normalized co-occurrence matrix and emits them keyed by column number (they were already stored by column in the previous step, so it simply reads and forwards them).
mapper2 reads the raw data file, i.e. every cell of the rating matrix, and emits it keyed by row number (movie id): key = movieID, value = userID:rating.
Each reducer therefore receives values coming from column x of the co-occurrence matrix and row x of the rating matrix. Recall that cell (i, j) of the prediction matrix equals the co-occurrence cell (i, x) times the rating cell (x, j), summed over all x. Since a single reducer gathers all cells sharing the same x from both matrices, every co-occurrence cell can be multiplied with every rating cell. The two kinds of cells are told apart by their separators: = marks cells from the co-occurrence matrix and : marks cells from the rating matrix. After separating them, the reducer multiplies them pairwise, obtaining the different row/column combinations of the prediction matrix, and writes each product to HDFS keyed by that combination.
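For instance, with hypothetical values, the reducer for key 10001 might receive

10002=0.25, 10003=0.50   (column 10001 of the normalized co-occurrence matrix)
1:5.0, 2:3.0             (row 10001 of the rating matrix: users 1 and 2 rated movie 10001)

and would emit the partial products

1:10002 → 1.25,  1:10003 → 2.50,  2:10002 → 0.75,  2:10003 → 1.50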
2.7 MR5
MR5 adds up the partial products: its mapper forwards each user:movie \t partialProduct pair unchanged, and its reducer sums all partial products sharing the same user:movie key, yielding the final predicted score.
3 Main code
DataDividerByUser.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

public class DataDividerByUser {

    public static class DataDividerMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

        // input: user,movie,rating
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] user_movie_rating = value.toString().split(",");
            int userId = Integer.parseInt(user_movie_rating[0]);
            String outPutKey = user_movie_rating[1] + ":" + user_movie_rating[2];

            // divide data by user
            context.write(new IntWritable(userId), new Text(outPutKey));
        }
    }

    public static class DataDividerReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

        // merge all movie:rating pairs of one user into a single line
        @Override
        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text value : values) {
                sb.append(value.toString());
                sb.append(",");
            }
            sb.deleteCharAt(sb.length() - 1);
            context.write(key, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);
        job.setMapperClass(DataDividerMapper.class);
        job.setReducerClass(DataDividerReducer.class);

        job.setJarByClass(DataDividerByUser.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        TextInputFormat.setInputPaths(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
CoOccurrenceMatrixGenerator.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

public class CoOccurrenceMatrixGenerator {

    public static class MatrixGeneratorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        // input: userid \t movie1:rating,movie2:rating...
        // output: key = movie1:movie2, value = 1
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] movie_rating = value.toString().split("\t")[1].split(",");

            // emit every ordered pair <movieA, movieB> from one user's rating list
            for (int i = 0; i < movie_rating.length; i++) {
                for (int j = 0; j < movie_rating.length; j++) {
                    String outPutKey = movie_rating[i].split(":")[0] + ":" + movie_rating[j].split(":")[0];
                    context.write(new Text(outPutKey), new IntWritable(1));
                }
            }
        }
    }

    public static class MatrixGeneratorReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        // key = movie1:movie2, value = Iterable<1, 1, 1...>
        // count how many users have watched both movies
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);
        job.setMapperClass(MatrixGeneratorMapper.class);
        job.setReducerClass(MatrixGeneratorReducer.class);

        job.setJarByClass(CoOccurrenceMatrixGenerator.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        TextInputFormat.setInputPaths(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
Normalize.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class Normalize {

    public static class NormalizeMapper extends Mapper<LongWritable, Text, Text, Text> {

        // input: movieA:movieB \t relation
        // collect the relationship list for movieA (one row of the matrix)
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String movieA = value.toString().split("\t")[0].split(":")[0];
            String movieB = value.toString().split("\t")[0].split(":")[1];
            String relation = value.toString().split("\t")[1];

            context.write(new Text(movieA), new Text(movieB + ":" + relation));
        }
    }

    public static class NormalizeReducer extends Reducer<Text, Text, Text, Text> {

        // key = movieA, value = <movieB:relation, movieC:relation...>
        // normalize each cell of the row, then emit it keyed by column (movieB)
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Map<String, Double> map = new HashMap<String, Double>();
            double sum = 0;

            for (Text value : values) {
                String[] movie_relation = value.toString().split(":");
                map.put(movie_relation[0], Double.parseDouble(movie_relation[1]));
                sum += Double.parseDouble(movie_relation[1]);
            }

            for (Map.Entry<String, Double> entry : map.entrySet()) {
                String outputKey = entry.getKey();
                String outputValue = key.toString() + "=" + String.valueOf(entry.getValue() / sum);
                context.write(new Text(outputKey), new Text(outputValue));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);
        job.setMapperClass(NormalizeMapper.class);
        job.setReducerClass(NormalizeReducer.class);

        job.setJarByClass(Normalize.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        TextInputFormat.setInputPaths(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
Multiplication.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class Multiplication {

    public static class CooccurrenceMapper extends Mapper<LongWritable, Text, Text, Text> {

        // input: movieB \t movieA=relation
        // forward the co-occurrence cell to the reducer, keyed by column (movieB)
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] movieB_movieARelation = value.toString().split("\t");
            context.write(new Text(movieB_movieARelation[0]), new Text(movieB_movieARelation[1]));
        }
    }

    public static class RatingMapper extends Mapper<LongWritable, Text, Text, Text> {

        // input: user,movie,rating
        // forward the rating cell to the reducer, keyed by row (movie)
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] user_movie_rating = value.toString().split(",");
            String outputKey = user_movie_rating[0] + ":" + user_movie_rating[2];
            context.write(new Text(user_movie_rating[1]), new Text(outputKey));
        }
    }

    public static class MultiplicationReducer extends Reducer<Text, Text, Text, DoubleWritable> {

        // key = movieB
        // value = <movieA=relation, movieC=relation..., userA:rating, userB:rating...>
        // separate the two kinds of cells, then multiply them pairwise
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Map<String, Double> coMap = new HashMap<String, Double>();
            Map<String, Double> ratingMap = new HashMap<String, Double>();

            for (Text value : values) {
                String s = value.toString();
                if (s.contains("=")) {
                    // co-occurrence cell: movieA=relation
                    coMap.put(s.split("=")[0], Double.parseDouble(s.split("=")[1]));
                } else {
                    // rating cell: user:rating
                    ratingMap.put(s.split(":")[0], Double.parseDouble(s.split(":")[1]));
                }
            }

            for (Map.Entry<String, Double> entry1 : coMap.entrySet()) {
                for (Map.Entry<String, Double> entry2 : ratingMap.entrySet()) {
                    double mult = entry1.getValue() * entry2.getValue();
                    String outputKey = entry2.getKey() + ":" + entry1.getKey();
                    context.write(new Text(outputKey), new DoubleWritable(mult));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);
        job.setJarByClass(Multiplication.class);

        job.setReducerClass(MultiplicationReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // two input paths, each read by its own mapper
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CooccurrenceMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, RatingMapper.class);

        TextOutputFormat.setOutputPath(job, new Path(args[2]));

        job.waitForCompletion(true);
    }
}
Sum.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

public class Sum {

    public static class SumMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

        // input: user:movie \t partialProduct
        // pass each partial product to the reducer unchanged
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] key_value = value.toString().split("\t");
            context.write(new Text(key_value[0]), new DoubleWritable(Double.parseDouble(key_value[1])));
        }
    }

    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

        // key = user:movie; sum the partial products to get the predicted score
        @Override
        public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            for (DoubleWritable value : values) {
                sum += value.get();
            }
            context.write(key, new DoubleWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);
        job.setMapperClass(SumMapper.class);
        job.setReducerClass(SumReducer.class);

        job.setJarByClass(Sum.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        TextInputFormat.setInputPaths(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
Driver.java
public class Driver {
    public static void main(String[] args) throws Exception {
        String rawInput = args[0];
        String userMovieListOutputDir = args[1];
        String coOccurrenceMatrixDir = args[2];
        String normalizeDir = args[3];
        String multiplicationDir = args[4];
        String sumDir = args[5];

        String[] path1 = {rawInput, userMovieListOutputDir};
        String[] path2 = {userMovieListOutputDir, coOccurrenceMatrixDir};
        String[] path3 = {coOccurrenceMatrixDir, normalizeDir};
        String[] path4 = {normalizeDir, rawInput, multiplicationDir};
        String[] path5 = {multiplicationDir, sumDir};

        // run the five jobs in sequence; each one waits for completion before the next starts
        DataDividerByUser.main(path1);
        CoOccurrenceMatrixGenerator.main(path2);
        Normalize.main(path3);
        Multiplication.main(path4);
        Sum.main(path5);
    }
}
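Assuming the classes above are packaged into a single jar (the jar name and HDFS paths here are hypothetical), the whole pipeline might be launched with:

hadoop jar movie-recommender.jar Driver \
    /input/rawdata \
    /output/userMovieList \
    /output/coOccurrenceMatrix \
    /output/normalized \
    /output/multiplication \
    /output/sum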