B-Trees vs LSM-Trees
Core Concept
Comparing write-optimized LSM-trees with read-optimized B-trees for database storage
B-Trees vs LSM-Trees
Overview
B-Trees and LSM-Trees (Log-Structured Merge Trees) are two fundamental data structures used in database storage engines, each optimized for different workload patterns. B-Trees excel at read-heavy workloads with their balanced structure and in-place updates, while LSM-Trees optimize for write-heavy scenarios through append-only operations and delayed merging.
This architectural choice affects everything from query performance to storage requirements, making it one of the most important decisions in database design. Systems like PostgreSQL and MySQL use B-Trees, while Cassandra, RocksDB, and LevelDB use LSM-Trees.
System Architecture Diagram
B-Trees: Read-Optimized Structure
Architecture
B-Trees maintain a balanced tree structure where all leaf nodes are at the same depth, similar to a well-organized filing system where every drawer is at the same height. This balanced structure ensures that searching for any piece of data takes the same amount of time, regardless of where it's located.
Key Characteristics:
B-Trees are balanced, meaning all paths from the root to any leaf node have the same length - like having a perfectly organized library where every book is exactly the same distance from the entrance. Keys are maintained in sorted order within nodes, making it easy to find what you're looking for. In-place updates mean modifications happen directly in existing pages, like editing a document in place rather than creating a new copy. Data is organized in fixed-size pages (typically 4KB-16KB), similar to how books are organized in chapters of consistent length.
B-Tree Implementation
B-Tree implementation involves managing nodes that can contain multiple keys and pointers to child nodes. Think of it like a sophisticated filing system where each folder can contain multiple documents and references to other folders. The tree maintains its balance through careful insertion and deletion operations that may require splitting or merging nodes when they become too full or too empty.
Search operations navigate down the tree by comparing the target key with keys in each node, similar to following directions in a multi-level building. Insert operations find the appropriate leaf node and add the new key, potentially causing the node to split if it becomes too full. Deletion operations remove keys and may require merging nodes or redistributing keys to maintain balance.
B-Tree Performance Characteristics
Read Performance:
B-Trees excel at read operations due to their balanced structure. Single key lookups take logarithmic time, meaning even with millions of records, you only need to check a small number of levels. Range scans are highly efficient because the sorted order allows sequential access to adjacent records. Sequential access is particularly fast because data is stored in sorted order, making it ideal for operations that need to process data in order.
Write Performance:
Write operations in B-Trees are more complex due to the need to maintain balance. Insert and update operations require logarithmic time to find the correct location and potentially split nodes. Delete operations also take logarithmic time and may require merging nodes. Write amplification is moderate because updates may require rewriting entire pages, but this is generally manageable compared to LSM-Trees.
LSM-Trees: Write-Optimized Structure
Architecture
LSM-Trees use a multi-level structure with periodic merging, similar to how a busy office manages incoming mail. New items are first placed in an in-memory buffer (like an inbox), then periodically sorted and filed into permanent storage levels. This approach optimizes for write performance by batching operations and deferring the expensive work of organizing data until later.
Key Characteristics:
LSM-Trees are write-optimized, meaning they prioritize fast writes over fast reads. They use append-only operations, like writing in a journal where you never erase anything, just add new entries. Periodic merging consolidates data from multiple levels, similar to how you might periodically organize scattered papers into proper files. This design trades read performance for write performance, making them ideal for write-heavy workloads.
LSM-Tree Implementation
LSM-Tree implementation involves managing multiple levels of sorted data structures. Think of it like a multi-tier filing system where new documents go into a temporary inbox, then get periodically organized into permanent filing cabinets. The system maintains a memory table for fast writes, then flushes data to disk in sorted files called SSTables.
Compaction is the process of merging multiple SSTables into fewer, larger files, similar to periodically consolidating scattered documents into organized folders. This process removes duplicate entries and organizes data more efficiently, but it requires significant I/O operations. The system uses bloom filters to quickly determine if a key might exist in a particular SSTable, reducing unnecessary disk reads.
LSM-Tree Performance Characteristics
Write Performance:
- Insert: O(1) amortized (appends to memtable)
- Very high write throughput
- Lower write amplification than B-Trees initially
Read Performance:
- Single key lookup: O(log n) worst case (check all levels)
- May require multiple disk seeks
- Bloom filters help reduce false lookups
Detailed Comparison
Write Amplification
B-Trees:
Original write: 1 page
Page split cascading up tree: 2-3 additional page writes
Total write amplification: 3-4x
LSM-Trees:
Initial write: 1x (memtable)
Level 0 flush: 1x
Compaction L0→L1: 2x
Compaction L1→L2: 4x
...
Total write amplification: 10-50x (but spread over time)
Space Amplification
B-Trees:
- Typically 1-2x due to page fragmentation
- Internal nodes reduce utilization
LSM-Trees:
- Can be 2-10x during compaction
- Multiple copies of data exist temporarily
Read Amplification
B-Trees:
- Point reads: 2-4 page reads (height of tree)
- Range scans: Minimal additional reads
LSM-Trees:
- Point reads: May check multiple levels (10+ files)
- Bloom filters reduce false positives
- Range scans can be expensive across levels
Performance Benchmarks
Write-Heavy Workload
Workload: 100M sequential inserts
B-Tree (InnoDB): 45,000 writes/sec
LSM-Tree (RocksDB): 180,000 writes/sec
LSM advantage: 4x higher throughput
Read-Heavy Workload
Workload: 1M random point reads
B-Tree (InnoDB): 85,000 reads/sec
LSM-Tree (RocksDB): 35,000 reads/sec
B-Tree advantage: 2.4x higher throughput
Mixed Workload
Workload: 70% reads, 30% writes
B-Tree: 65,000 ops/sec
LSM-Tree: 90,000 ops/sec
Results vary by read/write ratio
Real-World Usage
B-Tree Systems
MySQL/InnoDB:
- OLTP workloads
- Strong consistency requirements
- Point lookups and small range scans
PostgreSQL:
- Complex queries with joins
- Mixed workloads with tuning
- ACID compliance critical
LSM-Tree Systems
Cassandra:
- High write throughput
- Time-series data
- Distributed architecture
RocksDB/LevelDB:
- Embedded database engine
- SSD-optimized
- Used by Facebook, LinkedIn
Apache HBase:
- Big data analytics
- Batch processing workloads
- Column-family data model
Optimization Strategies
B-Tree Optimizations
Buffer Pool Management:
-- MySQL InnoDB buffer pool tuning
SET innodb_buffer_pool_size = '70%_of_RAM';
SET innodb_buffer_pool_instances = 8;
SET innodb_old_blocks_time = 1000;
Index Design:
- Clustered index on primary key
- Covering indexes for frequent queries
- Partial indexes for filtered queries
LSM-Tree Optimizations
Compaction Strategies:
- Size-tiered: Compact when level reaches size threshold
- Leveled: Maintain non-overlapping levels
- Hybrid: Combine both strategies
Bloom Filter Tuning:
# Configure Bloom filters to reduce read amplification
bloom_filter_bits_per_key = 10 # 1% false positive rate
block_based_table_options = {
'filter_policy': rocksdb.BloomFilterPolicy(bloom_filter_bits_per_key),
'cache_index_and_filter_blocks': True,
}
Write Buffer Management:
# Tune memtable size and count
options = rocksdb.Options()
options.write_buffer_size = 64 * 1024 * 1024 # 64MB
options.max_write_buffer_number = 3
options.min_write_buffer_number_to_merge = 2
Decision Framework
Choose B-Trees When:
- Read-heavy workloads (>80% reads)
- Point lookups are primary access pattern
- Storage efficiency is critical
- Consistent latency required
- OLTP systems with complex queries
Choose LSM-Trees When:
- Write-heavy workloads (>50% writes)
- Sequential/batch writes are common
- High write throughput required
- Time-series data or append-only patterns
- Big data/analytics workloads
Hybrid Approaches
Some modern systems use hybrid approaches:
MySQL X-Engine:
- LSM for writes, B-Tree indexes for reads
- Automatic data migration between structures
TiDB:
- LSM for transaction log (TiKV)
- B-Tree for SQL layer optimization
Understanding the trade-offs between B-Trees and LSM-Trees is crucial for choosing the right storage engine for your workload patterns and performance requirements.