MapReduce: Simplified Data Processing on Large Clusters

Research Paper

2004
Jeffrey Dean, Sanjay Ghemawat
distributed-systems, big-data, parallel-computing, Google, data-processing

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as we show in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Key Concepts

Programming Model

MapReduce is inspired by the map and reduce primitives present in Lisp and many other functional languages. Users express their computation as two functions:

  1. Map: Takes an input pair and produces a set of intermediate key/value pairs
  2. Reduce: Accepts an intermediate key and the set of values for that key, and merges them into a possibly smaller set of values; typically zero or one output value is produced per invocation (see the type sketch below)
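
Conceptually, the two functions have the types below. This is a minimal Python rendering of the paper's type notation (the paper's own examples use C++-style pseudocode):

```python
from typing import Callable, Iterable, Tuple, TypeVar

K1 = TypeVar("K1"); V1 = TypeVar("V1")  # input key/value types
K2 = TypeVar("K2"); V2 = TypeVar("V2")  # intermediate key/value types

# The paper's conceptual types:
#   map    (k1, v1)       -> list((k2, v2))
#   reduce (k2, list(v2)) -> list(v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V2]]
```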

Execution Overview

  1. The MapReduce library splits the input files into M pieces (typically 16-64 MB each)
  2. Copies of the program are started on the cluster machines
  3. One copy, the master, assigns the M map tasks and R reduce tasks to idle workers
  4. Map workers parse key/value pairs from their splits, apply the map function, and write intermediate pairs to local disk, partitioned into R regions
  5. Reduce workers read the intermediate data remotely, sort it by key, apply the reduce function, and write one output file per reduce partition (this flow is simulated in the sketch below)
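
To make the flow concrete, here is a minimal single-process simulation of these steps. `run_mapreduce`, its parameters, and the in-memory shuffle are illustrative stand-ins for the real distributed machinery:

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, R=4):
    """Simulate the MapReduce execution flow in one process (illustrative only)."""
    # Map phase: apply map_fn to each input record and partition the
    # intermediate pairs into R regions by hash(key) mod R (the default).
    regions = [defaultdict(list) for _ in range(R)]
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            regions[hash(k2) % R][k2].append(v2)
    # Reduce phase: each "reduce worker" sorts its keys, then applies
    # reduce_fn to every (key, list-of-values) group.
    output = []
    for region in regions:
        for k2 in sorted(region):
            output.extend((k2, v) for v in reduce_fn(k2, region[k2]))
    return output
```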

Implementation

Master Data Structures

  • Task state: For each map and reduce task, idle, in-progress, or completed
  • Worker identity: For each non-idle task, the identity of the worker machine executing it
  • Intermediate file locations: Locations and sizes of the R intermediate regions produced by each completed map task, forwarded to in-progress reduce tasks

Fault Tolerance

  • Worker failures: The master pings workers periodically; tasks on a failed worker are reset to idle and rescheduled (completed map tasks are re-executed because their output lives on the failed machine's local disk; completed reduce tasks are not, since their output is in the global file system)
  • Master failures: The paper notes that the master could checkpoint its state periodically, but since a single master is unlikely to fail, the implementation simply aborts the computation if it does
  • Semantics: When the map and reduce operators are deterministic, atomic commits of task outputs guarantee the same output as a non-faulting sequential execution

Locality

  • Input data locality: Schedule map tasks near input data
  • Network bandwidth: Minimize data transfer over network
  • GFS integration: Leverage GFS for data placement

Performance Optimizations

Partitioning Function

  • Default: Hash partitioning (hash(key) mod R)
  • Custom: User-defined partitioning for better load balancing
  • Example: hash(Hostname(urlkey)) mod R puts all URLs from the same host in the same output file (see the sketch below)
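
A sketch of both partitioners. `R` and the function names are illustrative, and Python's built-in `hash` is process-salted for strings, so a real system would use a stable hash:

```python
from urllib.parse import urlparse

R = 16  # number of reduce tasks (illustrative value)

def default_partition(key: str) -> int:
    # The paper's default: hash(key) mod R
    return hash(key) % R

def host_partition(url: str) -> int:
    # The paper's example: hash(Hostname(urlkey)) mod R, so all URLs
    # from the same host end up in the same output file
    return hash(urlparse(url).hostname) % R
```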

Combiner Function

  • Purpose: Reduce network traffic by combining map outputs locally
  • Semantics: Must be associative and commutative
  • Example: A word-count combiner sums local counts before they cross the network (sketched below)
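
A minimal sketch of such a combiner; the standalone function is illustrative (in the paper, the combiner is typically the same code as the reduce function, run on the map worker):

```python
from collections import Counter

def combine_word_counts(pairs):
    # Sum (word, 1) pairs locally on the map worker so that, e.g.,
    # thousands of ("the", 1) pairs cross the network as one ("the", N)
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts.items()
```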

Input Types

  • Text input: Line-based input format
  • Custom input: User-defined input formats
  • Database input: Records can also be read from a database or from in-memory data structures via the reader interface

Use Cases

Word Count

The canonical MapReduce example:

  • Map: Emit (word, 1) for each word
  • Reduce: Sum the counts for each word (see the sketch below)
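
In Python (the paper writes this example in C++-style pseudocode; names here are illustrative):

```python
def wc_map(doc_name, contents):
    # Map: emit (word, 1) for each word in the document
    for word in contents.split():
        yield word, 1

def wc_reduce(word, counts):
    # Reduce: sum all counts emitted for this word
    yield sum(counts)

# With the run_mapreduce sketch from the execution overview:
docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog")]
# run_mapreduce(docs, wc_map, wc_reduce) yields pairs such as
# ("the", 2), ("fox", 1), ... (ordering depends on the partitioning)
```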

Distributed Grep

  • Map: Emit matching lines
  • Reduce: Identity function that copies the matching lines to the output (sketched below)
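
A sketch, assuming a fixed regular-expression pattern (in practice the pattern is supplied by the user):

```python
import re

PATTERN = re.compile(r"ERROR")  # illustrative pattern

def grep_map(filename, line):
    # Emit the line only if it matches the pattern
    if PATTERN.search(line):
        yield filename, line

def grep_reduce(key, lines):
    # Identity: copy the intermediate lines straight to the output
    yield from lines
```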

Count of URL Access Frequency

  • Map: Process logs of page requests and emit (URL, 1) for each access
  • Reduce: Sum the access counts per URL, emitting (URL, total count)

Reverse Web-Link Graph

  • Map: For each link to a target URL found in a page named source, emit (target, source)
  • Reduce: Concatenate all source URLs associated with each target, emitting (target, list(sources)) (a sketch follows)
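
A sketch of the reverse web-link graph pair, assuming each input record is a (source page URL, list of outgoing link targets):

```python
def rwlg_map(source_url, outgoing_links):
    # For each link source -> target, emit (target, source)
    for target in outgoing_links:
        yield target, source_url

def rwlg_reduce(target, sources):
    # Concatenate all pages that link to this target
    yield list(sources)
```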

Term-Vector per Host

  • Map: Emit a (hostname, term vector) pair for each input document
  • Reduce: Add together the per-document term vectors for each host, discarding infrequent terms

Inverted Index

  • Map: For each word in a document, emit a (word, document ID) pair
  • Reduce: Sort and concatenate the document IDs for each word, emitting (word, list(document IDs)) (sketched below)
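
A sketch, deduplicating words per document and sorting each posting list as the paper describes:

```python
def index_map(doc_id, contents):
    # Emit (word, doc_id) once per distinct word in the document
    for word in set(contents.split()):
        yield word, doc_id

def index_reduce(word, doc_ids):
    # Sort the document IDs to form the posting list for this word
    yield sorted(doc_ids)
```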

Distributed Sort

  • Map: Extract the key from each record and emit a (key, record) pair
  • Reduce: Emit all pairs unchanged; the sort falls out of the partitioning function and the guarantee that keys within a partition are processed in increasing order

Performance Characteristics

Scalability

  • Near-linear scaling: Throughput grows roughly linearly with the number of machines
  • Fault tolerance: Automatic recovery from worker failures via task re-execution
  • Load balancing: Dynamic task assignment, plus backup (speculative) executions of straggler tasks near the end of a job

Efficiency

  • Network usage: Minimized through locality and combiners
  • Storage: Efficient use of GFS
  • CPU utilization: High utilization across cluster

Impact on Modern Systems

MapReduce influenced the development of:

  • Apache Hadoop: Open-source MapReduce implementation
  • Apache Spark: In-memory data processing framework
  • Apache Flink: Stream processing framework
  • Google Cloud Dataflow: Managed data processing service

Why It Matters for Software Engineering

Understanding MapReduce is crucial for:

  • Big Data Processing: Understanding distributed data processing
  • System Design: Designing scalable data processing systems
  • Cloud Computing: Understanding managed data processing services
  • Parallel Programming: Seeing how a restricted functional style (map and reduce primitives) enables automatic parallelization

The paper demonstrates how to build a simple yet powerful abstraction for distributed data processing that can handle the scale and complexity of modern big data applications.
