Four Minute Papers: MapReduce
Four minute papers (inspired by fourminutebooks.com) aims to condense computing white papers down to a four minute summary.
Here goes nothing…four minute paper for MapReduce (original white paper).
What MapReduce is, at a high level: A programming model and an implementation for processing large data sets.
The problem MapReduce solves: Processing data sets that are too big for one machine.
How MapReduce solved that problem: MapReduce is a relatively simple framework for programmers to accomplish the complex task of scaling work in parallel over thousands of machines.
Implementation Details: End users only need to implement two functions for processing the data: map() and reduce(). The MapReduce framework handles the hard parts, such as partitioning data, scheduling work across many machines, and handling failures, among other complexities of distributed systems.
The data used by MapReduce must be in the form of key/value pairs. The entire MapReduce computation takes a set of input key/value pairs and produces a set of output key/value pairs. The map function, written by the user, takes one input key/value pair and outputs a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values that have the same key and passes them to the reduce function. The reduce function, also written by the user, accepts one key and all the values for that key and merges those values together.
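To make the map, group, and reduce flow concrete, here is a minimal single-machine sketch in Python that counts words. The names map_fn, reduce_fn, and run_mapreduce are hypothetical and chosen for illustration. A real MapReduce library distributes this work across many machines and handles failures, which this toy version does not attempt.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    """User-written map: one input k/v pair in, intermediate k/v pairs out."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> int:
    """User-written reduce: one key plus all of its values, merged into one."""
    return sum(counts)


def run_mapreduce(
    inputs: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], int],
) -> Dict[str, int]:
    """Toy stand-in for the framework: runs everything in one process."""
    # Map phase: collect intermediate values, grouped by intermediate key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in mapper(key, value):
            groups[inter_key].append(inter_value)
    # Reduce phase: call the reducer once per distinct intermediate key.
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: (document name, document contents) pairs.
docs = [
    ("a.txt", "the quick brown fox"),
    ("b.txt", "the lazy dog the end"),
]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

Here word_counts maps each word to its total count across both documents. The user only wrote map_fn and reduce_fn; everything in run_mapreduce is what the framework would take care of.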
Limitations of MapReduce: You must wait for all map and reduce steps to complete, meaning a job is only as fast as its slowest step.
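The effect of one slow step can be sketched with a simplified timing model. It assumes that all tasks within a phase run fully in parallel and that the reduce phase starts only after every map task has finished. The function name job_duration and the task durations are made up for illustration.

```python
from typing import Sequence


def job_duration(map_task_secs: Sequence[float],
                 reduce_task_secs: Sequence[float]) -> float:
    """Simplified model: each phase takes as long as its slowest task."""
    # Assumption: tasks in a phase run in parallel, phases run back to back.
    return max(map_task_secs) + max(reduce_task_secs)


# Hypothetical task durations in seconds.
balanced = job_duration([10, 10, 10], [5, 5])
with_straggler = job_duration([10, 10, 60], [5, 5])
```

Under these assumptions the balanced job takes 15 seconds, while a single 60 second map task stretches the whole job to 65 seconds even though every other task is unchanged.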
Criticism of MapReduce: Michael Stonebraker criticized MapReduce as a major step backwards, essentially saying that MapReduce is not novel and that DBMSs are better.
Use of MapReduce today: When MapReduce first came about, it was very popular, and many well-known products implemented it, for example Hadoop, CouchDB, and Riak.
Google’s MapReduce implementation was used at Google from 2003 to 2014, and has since been replaced by Cloud Dataflow/Apache Beam.