B-Trees vs LSM-Trees

Core Concept

intermediate
30-40 minutes
database-internalsstorage-enginesindexingdata-structuresperformancetrade-offs

Comparing write-optimized LSM-trees with read-optimized B-trees for database storage

B-Trees vs LSM-Trees

Overview

B-Trees and LSM-Trees (Log-Structured Merge Trees) are two fundamental data structures used in database storage engines, each optimized for different workload patterns. B-Trees excel at read-heavy workloads with their balanced structure and in-place updates, while LSM-Trees optimize for write-heavy scenarios through append-only operations and delayed merging.

This architectural choice affects everything from query performance to storage requirements, making it one of the most important decisions in database design. Systems like PostgreSQL and MySQL use B-Trees, while Cassandra, RocksDB, and LevelDB use LSM-Trees.

System Architecture Diagram

B-Trees: Read-Optimized Structure

Architecture

B-Trees maintain a balanced tree structure where all leaf nodes are at the same depth, similar to a well-organized filing system where every drawer is at the same height. This balanced structure ensures that searching for any piece of data takes the same amount of time, regardless of where it's located.

Key Characteristics:

B-Trees are balanced, meaning all paths from the root to any leaf node have the same length - like having a perfectly organized library where every book is exactly the same distance from the entrance. Keys are maintained in sorted order within nodes, making it easy to find what you're looking for. In-place updates mean modifications happen directly in existing pages, like editing a document in place rather than creating a new copy. Data is organized in fixed-size pages (typically 4KB-16KB), similar to how books are organized in chapters of consistent length.

B-Tree Implementation

B-Tree implementation involves managing nodes that can contain multiple keys and pointers to child nodes. Think of it like a sophisticated filing system where each folder can contain multiple documents and references to other folders. The tree maintains its balance through careful insertion and deletion operations that may require splitting or merging nodes when they become too full or too empty.

Search operations navigate down the tree by comparing the target key with keys in each node, similar to following directions in a multi-level building. Insert operations find the appropriate leaf node and add the new key, potentially causing the node to split if it becomes too full. Deletion operations remove keys and may require merging nodes or redistributing keys to maintain balance.

B-Tree Performance Characteristics

Read Performance:

B-Trees excel at read operations due to their balanced structure. Single key lookups take logarithmic time, meaning even with millions of records, you only need to check a small number of levels. Range scans are highly efficient because the sorted order allows sequential access to adjacent records. Sequential access is particularly fast because data is stored in sorted order, making it ideal for operations that need to process data in order.

Write Performance:

Write operations in B-Trees are more complex due to the need to maintain balance. Insert and update operations require logarithmic time to find the correct location and potentially split nodes. Delete operations also take logarithmic time and may require merging nodes. Write amplification is moderate because updates may require rewriting entire pages, but this is generally manageable compared to LSM-Trees.

LSM-Trees: Write-Optimized Structure

Architecture

LSM-Trees use a multi-level structure with periodic merging, similar to how a busy office manages incoming mail. New items are first placed in an in-memory buffer (like an inbox), then periodically sorted and filed into permanent storage levels. This approach optimizes for write performance by batching operations and deferring the expensive work of organizing data until later.

Key Characteristics:

LSM-Trees are write-optimized, meaning they prioritize fast writes over fast reads. They use append-only operations, like writing in a journal where you never erase anything, just add new entries. Periodic merging consolidates data from multiple levels, similar to how you might periodically organize scattered papers into proper files. This design trades read performance for write performance, making them ideal for write-heavy workloads.

LSM-Tree Implementation

LSM-Tree implementation involves managing multiple levels of sorted data structures. Think of it like a multi-tier filing system where new documents go into a temporary inbox, then get periodically organized into permanent filing cabinets. The system maintains a memory table for fast writes, then flushes data to disk in sorted files called SSTables.

Compaction is the process of merging multiple SSTables into fewer, larger files, similar to periodically consolidating scattered documents into organized folders. This process removes duplicate entries and organizes data more efficiently, but it requires significant I/O operations. The system uses bloom filters to quickly determine if a key might exist in a particular SSTable, reducing unnecessary disk reads.

LSM-Tree Performance Characteristics

Write Performance:

  • Insert: O(1) amortized (appends to memtable)
  • Very high write throughput
  • Lower write amplification than B-Trees initially

Read Performance:

  • Single key lookup: O(log n) worst case (check all levels)
  • May require multiple disk seeks
  • Bloom filters help reduce false lookups

Detailed Comparison

Write Amplification

B-Trees:

Original write: 1 page
Page split cascading up tree: 2-3 additional page writes
Total write amplification: 3-4x

LSM-Trees:

Initial write: 1x (memtable)
Level 0 flush: 1x  
Compaction L0→L1: 2x
Compaction L1→L2: 4x
...
Total write amplification: 10-50x (but spread over time)

Space Amplification

B-Trees:

  • Typically 1-2x due to page fragmentation
  • Internal nodes reduce utilization

LSM-Trees:

  • Can be 2-10x during compaction
  • Multiple copies of data exist temporarily

Read Amplification

B-Trees:

  • Point reads: 2-4 page reads (height of tree)
  • Range scans: Minimal additional reads

LSM-Trees:

  • Point reads: May check multiple levels (10+ files)
  • Bloom filters reduce false positives
  • Range scans can be expensive across levels

Performance Benchmarks

Write-Heavy Workload

Workload: 100M sequential inserts
B-Tree (InnoDB):     45,000 writes/sec
LSM-Tree (RocksDB):  180,000 writes/sec
LSM advantage:       4x higher throughput

Read-Heavy Workload

Workload: 1M random point reads
B-Tree (InnoDB):     85,000 reads/sec  
LSM-Tree (RocksDB):  35,000 reads/sec
B-Tree advantage:    2.4x higher throughput

Mixed Workload

Workload: 70% reads, 30% writes
B-Tree:              65,000 ops/sec
LSM-Tree:            90,000 ops/sec
Results vary by read/write ratio

Real-World Usage

B-Tree Systems

MySQL/InnoDB:

  • OLTP workloads
  • Strong consistency requirements
  • Point lookups and small range scans

PostgreSQL:

  • Complex queries with joins
  • Mixed workloads with tuning
  • ACID compliance critical

LSM-Tree Systems

Cassandra:

  • High write throughput
  • Time-series data
  • Distributed architecture

RocksDB/LevelDB:

  • Embedded database engine
  • SSD-optimized
  • Used by Facebook, LinkedIn

Apache HBase:

  • Big data analytics
  • Batch processing workloads
  • Column-family data model

Optimization Strategies

B-Tree Optimizations

Buffer Pool Management:

-- MySQL InnoDB buffer pool tuning
SET innodb_buffer_pool_size = '70%_of_RAM';
SET innodb_buffer_pool_instances = 8;
SET innodb_old_blocks_time = 1000;

Index Design:

  • Clustered index on primary key
  • Covering indexes for frequent queries
  • Partial indexes for filtered queries

LSM-Tree Optimizations

Compaction Strategies:

  • Size-tiered: Compact when level reaches size threshold
  • Leveled: Maintain non-overlapping levels
  • Hybrid: Combine both strategies

Bloom Filter Tuning:

# Configure Bloom filters to reduce read amplification
bloom_filter_bits_per_key = 10  # 1% false positive rate
block_based_table_options = {
    'filter_policy': rocksdb.BloomFilterPolicy(bloom_filter_bits_per_key),
    'cache_index_and_filter_blocks': True,
}

Write Buffer Management:

# Tune memtable size and count
options = rocksdb.Options()
options.write_buffer_size = 64 * 1024 * 1024  # 64MB
options.max_write_buffer_number = 3
options.min_write_buffer_number_to_merge = 2

Decision Framework

Choose B-Trees When:

  • Read-heavy workloads (>80% reads)
  • Point lookups are primary access pattern
  • Storage efficiency is critical
  • Consistent latency required
  • OLTP systems with complex queries

Choose LSM-Trees When:

  • Write-heavy workloads (>50% writes)
  • Sequential/batch writes are common
  • High write throughput required
  • Time-series data or append-only patterns
  • Big data/analytics workloads

Hybrid Approaches

Some modern systems use hybrid approaches:

MySQL X-Engine:

  • LSM for writes, B-Tree indexes for reads
  • Automatic data migration between structures

TiDB:

  • LSM for transaction log (TiKV)
  • B-Tree for SQL layer optimization

Understanding the trade-offs between B-Trees and LSM-Trees is crucial for choosing the right storage engine for your workload patterns and performance requirements.

Related Concepts

storage-engines
database-indexing
write-amplification
compaction

Used By

cassandrarocksdbleveldbmongodbmysqlpostgresql