[File attributes]:
File name: Data-Intensive Text Processing with MapReduce
File size: 3.98 MB
File format: PDF
Last updated: 2017-03-02 00:05:12
hadoop mapreduce
Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer
Draft of January 27, 2013
Contents
1 Introduction
1.1 Computing in the Clouds
1.2 Big Ideas
1.3 Why Is This Different?
1.4 What This Book Is Not
2 MapReduce Basics
2.1 Functional Programming Roots
2.2 Mappers and Reducers
2.3 The Execution Framework
2.4 Partitioners and Combiners
2.5 The Distributed File System
2.6 Hadoop Cluster Architecture
2.7 Summary
3 Basic MapReduce Algorithm Design
3.1 Local Aggregation
3.2 Pairs and Stripes
3.3 Computing Relative Frequencies
3.4 Secondary Sorting
3.5 Summary
4 Inverted Indexing for Text Retrieval
4.1 Web Crawling
4.2 Inverted Indexes
4.3 Inverted Indexing: Baseline Implementation
4.4 Inverted Indexing: Revised Implementation
4.5 Index Compression
4.6 What About Retrieval?
4.7 Summary and Additional Readings
5 Graph Algorithms
5.1 Graph Representations
5.2 Parallel Breadth-First Search
5.3 PageRank
5.4 Issues with Graph Processing
5.5 Summary and Additional Readings
6 Processing Relational Data
6.1 Relational Joins
7 EM Algorithms for Text Processing
7.1 Expectation Maximization
7.2 Hidden Markov Models
7.3 EM in MapReduce
7.4 Case Study: Word Alignment for Statistical Machine Translation
7.5 EM-Like Algorithms
7.6 Summary and Additional Readings
8 Closing Remarks
8.1 Limitations of MapReduce
8.2 Alternative Computing Paradigms
8.3 MapReduce and Beyond
Bibliography
User comments
- Nice share, still working through it
- Pretty good, a solid book on big data