MapReduce: Simplified Data Processing on Large Clusters
Abstract
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as we show in this paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Key Concepts
Programming Model
MapReduce is inspired by the map and reduce primitives present in Lisp and many other functional languages. Users express their computation as two functions:
- Map: Takes an input key/value pair and produces a set of intermediate key/value pairs
- Reduce: Accepts an intermediate key and the set of values for that key, and merges them into a possibly smaller set of values (both signatures are sketched below)
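A minimal Python sketch of those two signatures, using the types the paper gives (map: (k1, v1) -> list(k2, v2); reduce: (k2, list(v2)) -> list(v2)); the aliases and names here are illustrative, not from the paper:

```python
from typing import Callable, Iterable, Tuple

# Types from the paper:
#   map    (k1, v1)       -> list((k2, v2))
#   reduce (k2, list(v2)) -> list(v2)
K1 = V1 = K2 = V2 = str  # illustrative; real keys/values are opaque byte strings

MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V2]]
```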
Execution Overview
- The MapReduce library splits the input files into M pieces (typically 16-64 MB each)
- The program is copied to worker machines
- One master assigns map and reduce tasks to workers
- Map workers process input splits and write intermediate files
- Reduce workers read the intermediate data, sort it by key, and write the final output files (these phases are mimicked in the toy driver below)
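These steps can be mimicked in-process to make the data flow concrete. This is a toy sketch, not the paper's implementation: there is no cluster, master, or GFS here, and the helper name run_mapreduce and the default R are mine:

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, R=3):
    """Toy single-process version of the paper's execution flow."""
    # Map phase: each input (key, value) pair produces intermediate pairs.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))

    # Partition phase: hash(key) mod R buckets, one per reduce task.
    buckets = [defaultdict(list) for _ in range(R)]
    for k, v in intermediate:
        buckets[hash(k) % R][k].append(v)

    # Reduce phase: each "reduce task" walks its bucket in key order.
    output = {}
    for bucket in buckets:
        for k in sorted(bucket):
            output[k] = list(reduce_fn(k, bucket[k]))
    return output
```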
Implementation
Master Data Structures
- Task state: Idle, in-progress, or completed
- Worker state: Identity of non-idle workers
- Intermediate file locations: For each completed map task, the locations and sizes of its R intermediate files, which are forwarded to reduce workers (a sketch of this bookkeeping follows)
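The paper describes this state but not concrete structures; a rough sketch of the per-task bookkeeping, with illustrative field names:

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional

class TaskState(Enum):
    IDLE = auto()
    IN_PROGRESS = auto()
    COMPLETED = auto()

@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None  # worker identity for non-idle tasks
    # For completed map tasks: locations (and sizes) of the R
    # intermediate files, forwarded to in-progress reduce tasks.
    intermediate_files: List[str] = field(default_factory=list)
```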
Fault Tolerance
- Worker failures: The master detects failures via periodic pings and reschedules the failed worker's tasks on other machines
- Master failures: The paper suggests periodic checkpoints of the master's state; its implementation at the time simply aborted the computation if the master died
- Semantics: When the map and reduce functions are deterministic, atomic commits of task output guarantee the same result as a non-faulting sequential execution of the whole program
Locality
- Input data locality: Schedule map tasks near input data
- Network bandwidth: Minimize data transfer over network
- GFS integration: Leverage GFS for data placement
Performance Optimizations
Partitioning Function
- Default: Hash partitioning (hash(key) mod R)
- Custom: User-defined partitioning for better load balancing
- Example: hash(Hostname(urlkey)) mod R, so all URLs from the same host end up in the same output file (sketched below)
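A sketch of both partitioners; the paper names hash(Hostname(urlkey)) mod R as its example, and the urlparse-based hostname extraction here is my stand-in:

```python
from urllib.parse import urlparse

def default_partition(key: str, R: int) -> int:
    # Default: hash(key) mod R
    return hash(key) % R

def host_partition(url: str, R: int) -> int:
    # Paper's example: hash(Hostname(urlkey)) mod R keeps all
    # URLs from one host in the same reduce partition / output file.
    return hash(urlparse(url).hostname or "") % R
```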
Combiner Function
- Purpose: Reduce network traffic by combining map outputs locally
- Semantics: Must be associative and commutative
- Example: The word count combiner sums local counts before they cross the network (see the sketch below)
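For word count, the combiner is typically the same code as the reducer, run on the map worker's local output. A sketch (the Counter-based aggregation is illustrative):

```python
from collections import Counter

def combine(pairs):
    """Sum (word, count) pairs locally before they are sent to reducers.

    Summation is associative and commutative, so applying it early
    does not change the final result.
    """
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

# e.g. 1000 ("the", 1) pairs collapse into a single ("the", 1000) pair
```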
Input Types
- Text input: Line-based input format
- Custom input: User-defined input formats
- Database input: Direct database integration
Use Cases
Word Count
The canonical MapReduce example:
- Map: Emit (word, 1) for each word
- Reduce: Sum the counts emitted for each word (sketched below)
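Against the signatures sketched earlier, word count is a few lines. Whitespace tokenization is a simplification, and the paper emits the string "1" where this sketch emits the integer 1:

```python
def wc_map(doc_name: str, contents: str):
    # Map: emit (word, 1) for every word in the document.
    for word in contents.split():
        yield (word, 1)

def wc_reduce(word: str, counts):
    # Reduce: sum all counts emitted for this word.
    yield sum(counts)
```

Fed through the toy run_mapreduce driver above, the input [("doc", "to be or not to be")] yields counts of 2 for "to" and "be" and 1 for the rest.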
Distributed Grep
- Map: Emit a line if it matches a supplied pattern
- Reduce: Identity function (pass-through)
Count of URL Access Frequency
- Map: Emit (URL, 1) for each access
- Reduce: Sum access counts per URL
Reverse Web-Link Graph
- Map: For each link, emit (target, source)
- Reduce: Concatenate all sources pointing at each target into a (target, list(sources)) pair (sketched below)
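A sketch, assuming HTML input and a deliberately crude regex for link extraction (a real pipeline would use a proper parser):

```python
import re

def links_map(source_url: str, html: str):
    # Map: emit (target, source) for every outgoing link on the page.
    for target in re.findall(r'href="([^"]+)"', html):
        yield (target, source_url)

def links_reduce(target: str, sources):
    # Reduce: collect all pages that link to this target.
    yield list(sources)
```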
Term-Vector per Host
- Map: Emit a (hostname, term vector) pair for each input document
- Reduce: Add the term vectors together per host, discarding infrequent terms
Inverted Index
- Map: For each word in document, emit (word, document)
- Reduce: Sort the document IDs for each word into a postings list (sketched below)
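A sketch of the inverted index pair; the set() deduplication (one posting per document) is my choice, not the paper's:

```python
def index_map(doc_id: str, contents: str):
    # Map: emit (word, document ID); set() keeps one posting per document.
    for word in set(contents.split()):
        yield (word, doc_id)

def index_reduce(word: str, doc_ids):
    # Reduce: sort the document IDs into a postings list.
    yield sorted(doc_ids)
```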
Distributed Sort
- Map: Extract the sort key from each record and emit (key, record)
- Reduce: Emit all pairs unchanged; global order comes from a range partitioning function plus the guarantee that each partition is processed in key order (sketched below)
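A sketch; the tab-delimited key extraction and the range partitioner's split points are illustrative (in practice split points would come from sampling the keys):

```python
from typing import List

def sort_map(key: str, record: str):
    # Map: extract the sort key; the record itself is the value.
    yield (record.split("\t", 1)[0], record)

def sort_reduce(key: str, records):
    # Reduce: identity; ordering comes from the framework's
    # per-partition sort on intermediate keys.
    yield from records

def range_partition(key: str, splits: List[str]) -> int:
    # Hypothetical range partitioner: partition i holds keys below
    # splits[i], so concatenating the R output files in partition
    # order gives a globally sorted result.
    for i, s in enumerate(splits):
        if key < s:
            return i
    return len(splits)
```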
Performance Characteristics
Scalability
- Near-linear scaling: Throughput grows roughly with the number of machines in the cluster
- Fault tolerance: Automatic recovery from failures
- Load balancing: Dynamic task assignment
Efficiency
- Network usage: Minimized through locality and combiners
- Storage: Efficient use of GFS
- CPU utilization: High utilization across cluster
Impact on Modern Systems
MapReduce influenced the development of:
- Apache Hadoop: Open-source MapReduce implementation
- Apache Spark: In-memory data processing framework
- Apache Flink: Stream processing framework
- Google Cloud Dataflow: Managed data processing service
Why It Matters for Software Engineering
Understanding MapReduce is crucial for:
- Big Data Processing: Understanding distributed data processing
- System Design: Designing scalable data processing systems
- Cloud Computing: Understanding managed data processing services
- Parallel Programming: Learning functional programming paradigms
The paper demonstrates how to build a simple yet powerful abstraction for distributed data processing that can handle the scale and complexity of modern big data applications.