[File attributes]
File name: Mahout in Action
File size: 10.29 MB
File format: PDF
Last updated: 2015-02-26 21:00:05
Tags: Mahout, Hadoop, MapReduce
brief contents
1 ■ Meet Apache Mahout 1
PART 1 RECOMMENDATIONS ...................................................11
2 ■ Introducing recommenders 13
3 ■ Representing recommender data 26
4 ■ Making recommendations 41
5 ■ Taking recommenders to production 70
6 ■ Distributing recommendation computations 91
PART 2 CLUSTERING .............................................................115
7 ■ Introduction to clustering 117
8 ■ Representing data 130
9 ■ Clustering algorithms in Mahout 145
10 ■ Evaluating and improving clustering quality 184
11 ■ Taking clustering to production 198
12 ■ Real-world applications of clustering 210
PART 3 CLASSIFICATION ........................................................225
13 ■ Introduction to classification 227
14 ■ Training a classifier 255
15 ■ Evaluating and tuning a classifier 281
16 ■ Deploying a classifier 307
17 ■ Case study: Shop It To Me 341
contents
preface xvii
acknowledgments xix
about this book xx
about multimedia extras xxiii
about the cover illustration xxv
1 Meet Apache Mahout 1
1.1 Mahout’s story 2
1.2 Mahout’s machine learning themes 3
Recommender engines 3 ■ Clustering 3 ■ Classification 4
1.3 Tackling large scale with Mahout and Hadoop 5
1.4 Setting up Mahout 6
Java and IDEs 7 ■ Installing Maven 8
Installing Mahout 8 ■ Installing Hadoop 9
1.5 Summary 9
PART 1 RECOMMENDATIONS...........................................11
2 Introducing recommenders 13
2.1 Defining recommendation 14
2.2 Running a first recommender engine 15
Creating the input 15 ■ Creating a recommender 16
Analyzing the output 17
2.3 Evaluating a recommender 18
Training data and scoring 18 ■ Running RecommenderEvaluator 19
Assessing the result 20
2.4 Evaluating precision and recall 21
Running RecommenderIRStatsEvaluator 21
Problems with precision and recall 23
2.5 Evaluating the GroupLens data set 23
Extracting the recommender input 23
Experimenting with other recommenders 24
2.6 Summary 25
3 Representing recommender data 26
3.1 Representing preference data 27
The Preference object 27 ■ PreferenceArray and implementations 28
Speeding up collections 28 ■ FastByIDMap and FastIDSet 29
3.2 In-memory DataModels 30
GenericDataModel 30 ■ File-based data 30 ■ Refreshable components 31
Update files 32 ■ Database-based data 32 ■ JDBC and MySQL 32
Configuring via JNDI 33 ■ Configuring programmatically 34
3.3 Coping without preference values 34
When to ignore values 35
In-memory representations without preference values 36
Selecting compatible implementations 37
3.4 Summary 39
4 Making recommendations 41
4.1 Understanding user-based recommendation 42
When recommendation goes wrong 42 ■ When recommendation goes right 42
4.2 Exploring the user-based recommender 43
The algorithm 43
Implementing the algorithm with GenericUserBasedRecommender 44
Exploring with GroupLens 45 ■ Exploring user neighborhoods 46
Fixed-size neighborhoods 46 ■ Threshold-based neighborhood 47
4.3 Exploring similarity metrics 48
Pearson correlation–based similarity 48 ■ Pearson correlation problems 50
Employing weighting 50 ■ Defining similarity by Euclidean distance 51
Adapting the cosine measure similarity 52
Defining similarity by relative rank with the Spearman correlation 52
Ignoring preference values in similarity with the Tanimoto coefficient 54
Computing smarter similarity with a log-likelihood test 55
Inferring preferences 56
4.4 Item-based recommendation 56
The algorithm 57 ■ Exploring the item-based recommender 58
4.5 Slope-one recommender 59
The algorithm 60 ■ Slope-one in practice 61
DiffStorage and memory considerations 62 ■ Distributing the precomputation 62
4.6 New and experimental recommenders 63
Singular value decomposition–based recommenders 63
Linear interpolation item–based recommendation 64
Cluster-based recommendation 65
4.7 Comparison to other recommenders 66
Injecting content-based techniques into Mahout 66
Looking deeper into content-based recommendation 67
Comparison to model-based recommenders 67
4.8 Summary 68
5 Taking recommenders to production 70
5.1 Analyzing example data from a dating site 71
5.2 Finding an effective recommender 72
User-based recommenders 73 ■ Item-based recommenders 74
Slope-one recommender 75 ■ Evaluating precision and recall 75
Evaluating performance 76
5.3 Injecting domain-specific information 77
Employing a custom item similarity metric 77 ■ Recommending based on content 78
Modifying recommendations with IDRescorer 79
Incorporating gender in an IDRescorer 80 ■ Packaging a custom recommender 82
5.4 Recommending to anonymous users 83
Temporary users with PlusAnonymousUserDataModel 84
Aggregating anonymous users 85
5.5 Creating a web-enabled recommender 86
Packaging a WAR file 86 ■ Testing deployment 87
5.6 Updating and monitoring the recommender 88
5.7 Summary 89
6 Distributing recommendation computations 91
6.1 Analyzing the Wikipedia data set 92
Struggling with scale 93
Evaluating benefits and drawbacks of distributing computations 93
6.2 Designing a distributed item-based algorithm 95
Constructing a co-occurrence matrix 95 ■ Computing user vectors 96
Producing the recommendations 96 ■ Understanding the results 97
Towards a distributed implementation 98
6.3 Implementing a distributed algorithm with MapReduce 98
Introducing MapReduce 98
Translating to MapReduce: generating user vectors 99
Translating to MapReduce: calculating co-occurrence 100
Translating to MapReduce: rethinking matrix multiplication 101
Translating to MapReduce: matrix multiplication by partial products 102
Translating to MapReduce: making recommendations 105
6.4 Running MapReduces with Hadoop 107
Setting up Hadoop 107 ■ Running recommendations with Hadoop 108
Configuring mappers and reducers 110
6.5 Pseudo-distributing a recommender 110
6.6 Looking beyond first steps with recommendations 112
Running in the cloud 112
Imagining unconventional uses of recommendations 113
6.7 Summary 114
PART 2 CLUSTERING ....................................................115
7 Introduction to clustering 117
7.1 Clustering basics 118
7.2 Measuring the similarity of items 119
7.3 Hello World: running a simple clustering example 120
Creating the input 120 ■ Using Mahout clustering 122
Analyzing the output 125
7.4 Exploring distance measures 125
Euclidean distance measure 126 ■ Squared Euclidean distance measure 126
Manhattan distance measure 126 ■ Cosine distance measure 127
Tanimoto distance measure 128 ■ Weighted distance measure 128
7.5 Hello World again! Trying out various distance measures 129
7.6 Summary 129
8 Representing data 130
8.1 Visualizing vectors 131
Transforming data into vectors 132
Preparing vectors for use by Mahout 134
8.2 Representing text documents as vectors 135
Improving weighting with TF-IDF 136
Accounting for word dependencies with n-gram collocations 137
8.3 Generating vectors from documents 138
8.4 Improving quality of vectors using normalization 143
8.5 Summary 144
9 Clustering algorithms in Mahout 145
9.1 K-means clustering 146
All you need to know about k-means 147 ■ Running k-means clustering 148
Finding the perfect k using canopy clustering 155
Case study: clustering news articles using k-means 160
9.2 Beyond k-means: an overview of clustering techniques 163
Different kinds of clustering problems 163
Different clustering approaches 166
9.3 Fuzzy k-means clustering 168
Running fuzzy k-means clustering 168 ■ How fuzzy is too fuzzy? 170
Case study: clustering news articles using fuzzy k-means 170
9.4 Model-based clustering 171
Deficiencies of k-means 172 ■ Dirichlet clustering 173
Running a model-based clustering example 174
9.5 Topic modeling using latent Dirichlet allocation (LDA) 177
Understanding latent Dirichlet analysis 178 ■ TF-IDF vs. LDA 179
Tuning the parameters of LDA 179
Case study: finding topics in news documents 180
Applications of topic modeling 182
9.6 Summary 182
10 Evaluating and improving clustering quality 184
10.1 Inspecting clustering output 185
10.2 Analyzing clustering output 187
Distance measure and feature selection 188
Inter-cluster and intra-cluster distances 188
Mixed and overlapping clusters 191
10.3 Improving clustering quality 192
Improving document vector generation 192
Writing a custom distance measure 195
10.4 Summary 197
11 Taking clustering to production 198
11.1 Quick-start tutorial for running clustering on Hadoop 199
Running clustering on a local Hadoop cluster 199
Customizing Hadoop configurations 201
11.2 Tuning clustering performance 202
Avoiding performance pitfalls in CPU-bound operations 203
Avoiding performance pitfalls in I/O-bound operations 204
11.3 Batch and online clustering 205
Case study: online news clustering 206
Case study: clustering Wikipedia articles 207
11.4 Summary 209
12 Real-world applications of clustering 210
12.1 Finding similar users on Twitter 211
Data preprocessing and feature weighting 211
Avoiding common pitfalls in feature selection 212
12.2 Suggesting tags for artists on Last.fm 216
Tag suggestion using co-occurrence 216
Creating a dictionary of Last.fm artists 217
Converting Last.fm tags into Vectors with musicians as features 219
Running k-means over the Last.fm data 220
12.3 Analyzing the Stack Overflow data set 221
Parsing the Stack Overflow data set 222
Finding clustering problems in Stack Overflow 222
12.4 Summary 224
PART 3 CLASSIFICATION ...............................................225
13 Introduction to classification 227
13.1 Why use Mahout for classification? 228
13.2 The fundamentals of classification systems 229
Differences between classification, recommendation, and clustering 230
Applications of classification 231
13.3 How classification works 232
Models 234 ■ Training versus test versus production 234
Predictor variables versus target variable 234
Records, fields, and values 235
The four types of values for predictor variables 236
Supervised versus unsupervised learning 238
13.4 Work flow in a typical classification project 239
Workflow for stage 1: training the classification model 240
Workflow for stage 2: evaluating the classification model 245
Workflow for stage 3: using the model in production 245
13.5 Step-by-step simple classification example 245
The data and the challenge 246
Training a model to find color-fill: preliminary thinking 246
Choosing a learning algorithm to train the model 247
Improving performance of the color-fill classifier 250
13.6 Summary 254
14 Training a classifier 255
14.1 Extracting features to build a Mahout classifier 256
14.2 Preprocessing raw data into classifiable data 257
Transforming raw data 258 ■ Computational marketing example 258
14.3 Converting classifiable data into vectors 260
Representing data as a vector 260 ■ Feature hashing with Mahout APIs 261
14.4 Classifying the 20 newsgroups data set with SGD 265
Getting started: previewing the data set 266
Parsing and tokenizing features for the 20 newsgroups data 268
Training code for the 20 newsgroups data 268
14.5 Choosing an algorithm to train the classifier 273
Nonparallel but powerful: using SGD and SVM 274
The power of the naive classifier: using naive Bayes and complementary naive Bayes 275
Strength in elaborate structure: using random forests 276
14.6 Classifying the 20 newsgroups data with naive Bayes 276
Getting started: data extraction for naive Bayes 276
Training the naive Bayes classifier 278 ■ Testing a naive Bayes model 278
14.7 Summary 280
15 Evaluating and tuning a classifier 281
15.1 Classifier evaluation in Mahout 282
Getting rapid feedback 282 ■ Deciding what “good” means 282
Recognizing the difference in cost of errors 284
15.2 The classifier evaluation API 284
Computation of AUC 285 ■ Confusion matrices and entropy matrices 287
Computing average log likelihood 289 ■ Dissecting a model 290
Performance of the SGD classifier with 20 newsgroups 291
15.3 When classifiers go bad 295
Target leaks 295 ■ Broken feature extraction 298
15.4 Tuning for better performance 300
Tuning the problem 300 ■ Tuning the classifier 304
15.5 Summary 306
16 Deploying a classifier 307
16.1 Process for deployment in huge systems 308
Scope out the problem 308 ■ Optimize feature extraction as needed 309
Optimize vector encoding as needed 309
Deploy a scalable classifier service 310
16.2 Determining scale and speed requirements 310
How big is big? 310 ■ Balancing big versus fast 312
16.3 Building a training pipeline for large systems 313
Acquiring and retaining large-scale data 314
Denormalizing and downsampling 316 ■ Training pitfalls 318
Reading and encoding data at speed 320
16.4 Integrating a Mahout classifier 324
Plan ahead: key issues for integration 325 ■ Model serialization 330
16.5 Example: a Thrift-based classification server 332
Running the classification server 336 ■ Accessing the classifier service 338
16.6 Summary 340
17 Case study: Shop It To Me 341
17.1 Why Shop It To Me chose Mahout 342
What Shop It To Me does 342
Why Shop It To Me needed a classification system 342
Mahout outscales the rest 343
17.2 General structure of the email marketing system 344
17.3 Training the model 346
Defining the goal of the classification project 346 ■ Partitioning by time 348
Avoiding target leaks 348 ■ Learning algorithm tweaks 348
Feature vector encoding 349
17.4 Speeding up classification 352
Linear combination of feature vectors 353 ■ Linear expansion of model score 354
17.5 Summary 356
appendix A JVM tuning 359
appendix B Mahout math 362
appendix C Resources 367