Apache Cassandra

System Architecture

advanced
50-70 minutes
cassandra, nosql, distributed-database, wide-column, high-availability, scalability

Distributed NoSQL database designed for high availability, linear scalability, and handling massive amounts of data across multiple data centers

Overview

Apache Cassandra is a distributed NoSQL database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It addresses the critical challenge of data distribution and consistency in distributed systems at massive scale.

Originally developed at Facebook and later open-sourced, Cassandra has become a standard choice for time-series data, IoT applications, and real-time analytics at companies such as Netflix, Instagram, and Spotify, where deployments manage petabytes of data. Its design centers on linear scalability and tunable, eventually consistent replication.

Key capabilities include:

  • Linear Scalability: Add nodes to increase capacity without downtime
  • High Availability: No single point of failure with multi-datacenter support
  • Tunable Consistency: Choose between strong and eventual consistency per operation
  • Time-Series Optimization: Efficient storage and querying of time-ordered data

Architecture & Core Components

System Architecture

[Diagram: system architecture]

Core Components

1. Node

  • Individual Cassandra server instance
  • Stores data in SSTables and commit logs
  • Participates in gossip protocol for cluster membership
  • Handles read/write operations for assigned token ranges

2. Keyspace

  • Top-level namespace (similar to database in RDBMS)
  • Defines replication strategy and factor
  • Contains multiple column families (tables)

3. Column Family (Table)

  • Collection of rows with similar structure
  • Defined by primary key and clustering columns
  • Stores data in SSTable format

4. Partitioner

  • Determines which node stores each row
  • Murmur3Partitioner: Default, evenly distributes data
  • RandomPartitioner: Legacy, uses MD5 hash
  • ByteOrderedPartitioner: Preserves sort order
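
The partitioner hashes the partition key into a token, and the token range determines which nodes own the row. As a minimal sketch of that mapping, the built-in token() function can be queried directly (this assumes the user_analytics.user_events table defined later in this guide):

-- Show the token the partitioner assigns to each partition key
SELECT user_id, token(user_id)
FROM user_analytics.user_events
LIMIT 10;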

5. Replication Strategy

  • SimpleStrategy: Single datacenter replication
  • NetworkTopologyStrategy: Multi-datacenter aware replication
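
SimpleStrategy places replicas around the ring with no awareness of racks or datacenters, so it is only suitable for single-datacenter development or test clusters; production clusters should use NetworkTopologyStrategy (see the keyspace example in the configuration section). A minimal sketch, with a hypothetical keyspace name:

-- Development / single-datacenter only
CREATE KEYSPACE dev_sandbox
WITH REPLICATION = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
};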

Data Model

[Diagram: data model]

Write Path

[Diagram: write path]

Read Path

[Diagram: read path]

Configuration & Deployment

Production Setup

1. Cluster Configuration

# cassandra.yaml
cluster_name: 'production-cluster'
num_tokens: 256
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2

# Network Configuration
listen_address: 10.0.1.100
rpc_address: 10.0.1.100
rpc_port: 9160              # legacy Thrift port; removed in Cassandra 4.0
native_transport_port: 9042

# Data Storage
data_file_directories:
    - /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
saved_caches_directory: /var/lib/cassandra/saved_caches

# Memory Configuration
# Heap sizing is not a cassandra.yaml setting; set it in cassandra-env.sh
# (or jvm.options), e.g. MAX_HEAP_SIZE="8G" and HEAP_NEWSIZE="200M"

2. Keyspace Creation

-- Create keyspace with replication strategy
CREATE KEYSPACE user_analytics 
WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': 3,
    'datacenter2': 2
};

-- Create table with appropriate partitioning
CREATE TABLE user_analytics.user_events (
    user_id UUID,
    event_time TIMESTAMP,
    event_type TEXT,
    event_data TEXT,
    PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
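
With this primary key, all events for a user land in a single partition and are clustered newest-first, so per-user time-range reads stay within one partition. A sketch of the intended access pattern (the UUID, TTL, and date values are illustrative):

-- Write an event that expires automatically after 90 days
INSERT INTO user_analytics.user_events (user_id, event_time, event_type, event_data)
VALUES (123e4567-e89b-12d3-a456-426614174000, toTimestamp(now()), 'page_view', '{"page": "/home"}')
USING TTL 7776000;

-- Read the most recent events for one user within a time window
SELECT event_time, event_type, event_data
FROM user_analytics.user_events
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
  AND event_time >= '2024-01-01'
  AND event_time < '2024-02-01'
LIMIT 100;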

3. Multi-Datacenter Setup

# Datacenter 1 - Primary
# cassandra-rackdc.properties
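# (this file is read by GossipingPropertyFileSnitch; set endpoint_snitch accordingly in cassandra.yaml)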
dc=datacenter1
rack=rack1

# Datacenter 2 - Secondary
# cassandra-rackdc.properties
dc=datacenter2
rack=rack1

Resource Requirements

Minimum Requirements

  • CPU: 4 cores (8+ recommended)
  • Memory: 8GB RAM (16GB+ recommended)
  • Storage: SSD with 100GB+ free space
  • Network: 1Gbps (10Gbps for high-throughput)

Production Recommendations

  • CPU: 16+ cores for high-throughput workloads
  • Memory: 32GB+ RAM (50% for heap, 50% for OS cache)
  • Storage: NVMe SSD with 1TB+ capacity
  • Network: 10Gbps+ with low latency

Performance Characteristics

The numbers in this section are indicative ranges only; actual throughput and latency depend heavily on hardware, data model, consistency level, and compaction load.

Throughput Metrics

Write Performance

  • Single Node: 10,000-50,000 writes/second
  • Cluster (10 nodes): 100,000-500,000 writes/second
  • Large Cluster (100+ nodes): 1M+ writes/second
  • Bulk Loading: 100MB-1GB/second per node

Read Performance

  • Single Node: 5,000-25,000 reads/second
  • Cluster (10 nodes): 50,000-250,000 reads/second
  • Point Queries: 1-5ms latency
  • Range Queries: 10-100ms latency

Latency Characteristics

Write Latency

  • P50: 1-3ms
  • P95: 5-15ms
  • P99: 20-100ms
  • P99.9: 100-500ms

Read Latency

  • P50: 1-5ms
  • P95: 10-50ms
  • P99: 50-200ms
  • P99.9: 200ms-1s

Scalability Patterns

Horizontal Scaling

  • Linear Scaling: Add nodes to increase capacity
  • No Downtime: Add/remove nodes without service interruption
  • Auto-Rebalancing: Automatic data redistribution
  • Multi-Datacenter: Global distribution support

Vertical Scaling

  • Memory: Increase heap size for larger caches
  • CPU: More cores for higher concurrency
  • Storage: Larger disks for more data retention
  • Network: Higher bandwidth for better throughput

Operational Considerations

Failure Modes

Node Failures

  • Single Node: Requests reroute to other replicas automatically; no data loss when RF > 1
  • Multiple Nodes: Depends on replication factor
  • Datacenter Failure: Cross-datacenter replication required
  • Network Partition: Split-brain scenarios possible

Data Corruption

  • SSTable Corruption: Automatic repair mechanisms
  • Commit Log Corruption: Data loss possible
  • Replication Inconsistency: Anti-entropy repair needed

Disaster Recovery

Backup Strategies

# Full backup
nodetool snapshot -t backup_$(date +%Y%m%d) keyspace_name

# Incremental backup
# Enable in cassandra.yaml
incremental_backups: true

# Restore from backup
# Stop node, restore SSTables, restart

Multi-Datacenter Recovery

  • Cross-Datacenter Replication: Automatic failover
  • Read Repair: Consistency maintenance
  • Hinted Handoff: Temporary failure handling
  • Anti-Entropy Repair: Long-term consistency

Maintenance Procedures

Node Addition

# 1. Start new node with auto_bootstrap: true
# 2. Monitor streaming progress
nodetool netstats

# 3. Verify cluster health
nodetool status
nodetool ring

Node Removal

# 1. Decommission node
nodetool decommission

# 2. Wait for streaming to complete
nodetool netstats

# 3. Stop and remove node

Schema Changes

-- Add column (non-breaking)
ALTER TABLE user_analytics.user_events 
ADD new_column TEXT;

-- Changing a column's type in place is not supported in modern Cassandra;
-- add a new column with the desired type and backfill it, or migrate to a new table
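
A minimal sketch of that approach (the column names below are hypothetical):

-- Introduce a new column with the desired type alongside the old one
ALTER TABLE user_analytics.user_events ADD event_data_v2 BLOB;

-- Backfill in application code or a batch job, then retire the old column
-- ALTER TABLE user_analytics.user_events DROP event_data;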

Production Best Practices

Configuration Tuning

Memory Optimization

# cassandra-env.sh (heap is configured via JVM options, not cassandra.yaml)
MAX_HEAP_SIZE="16G"    # roughly 50% of RAM, but keep below ~32GB
HEAP_NEWSIZE="4G"      # roughly 25% of heap (CMS collector); ignored with G1GC

# Keeping the heap under 32GB preserves compressed object pointers
JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

Write Optimization

# Batch size optimization
batch_size_warn_threshold_in_kb: 5
batch_size_fail_threshold_in_kb: 50

# Commit log optimization
commitlog_segment_size_in_mb: 32
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

Read Optimization

# Row cache (0 disables it, the safe default; enable only for small, hot, read-mostly rows)
row_cache_size_in_mb: 0
row_cache_save_period: 0

# Key cache for partition keys
key_cache_size_in_mb: 100
key_cache_save_period: 14400

Security Configuration

Authentication & Authorization

# cassandra.yaml: enable authentication and authorization
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer

-- CQL: create login roles (CREATE USER is a legacy alias for CREATE ROLE)
CREATE ROLE admin WITH PASSWORD = 'secure_password' AND LOGIN = true AND SUPERUSER = true;
CREATE ROLE app_user WITH PASSWORD = 'app_password' AND LOGIN = true;

Encryption

# Client-to-node encryption
client_encryption_options:
    enabled: true
    keystore: /etc/cassandra/keystore.jks
    keystore_password: keystore_password

# Node-to-node encryption
server_encryption_options:
    internode_encryption: all
    keystore: /etc/cassandra/keystore.jks
    keystore_password: keystore_password

Monitoring & Alerting

Essential Metrics

# Cluster health
nodetool status
nodetool ring
nodetool info

# Performance metrics
nodetool tpstats
nodetool tablestats    # formerly cfstats
nodetool proxyhistograms

Key Performance Indicators

  • Read/Write Latency: P95, P99 percentiles
  • Throughput: Operations per second
  • Disk Usage: Space utilization per node
  • Memory Usage: Heap and off-heap utilization
  • Network: Bandwidth and connection counts

Alerting Thresholds

  • High Latency: P95 > 100ms
  • Low Throughput: < 50% of baseline
  • Disk Space: > 80% utilization
  • Memory Pressure: > 85% heap usage
  • Node Down: Any node unavailable

Interview-Focused Content

Technology-Specific Questions

Architecture Deep-Dive

Q: Explain Cassandra's write path and how it ensures durability.

A: Cassandra's write path involves multiple steps:

  1. Client Request: The application sends the write to a coordinator node
  2. Replica Selection: The coordinator determines the replica nodes from the partitioner and replication strategy
  3. Replication: The coordinator forwards the write to all replicas in parallel
  4. Commit Log Write: Each replica appends the mutation to its commit log first for durability
  5. Memtable Update: Each replica then applies the write to its in-memory memtable
  6. Acknowledgment: The coordinator responds once the number of replicas required by the consistency level have acknowledged

The commit log ensures durability by persisting writes before acknowledgment, allowing recovery from crashes.

Consistency Models

Q: How does Cassandra handle consistency and what are the trade-offs?

A: Cassandra offers tunable consistency with these levels:

  • ONE: Single replica acknowledgment (fastest, least consistent)
  • QUORUM: Majority of replicas (balanced)
  • ALL: All replicas (slowest, most consistent)
  • LOCAL_QUORUM: Majority in local datacenter
  • EACH_QUORUM: Quorum in each datacenter

Trade-offs:

  • Speed vs Consistency: Lower consistency = faster operations
  • Availability vs Consistency: CAP theorem implications
  • Network Partitions: Higher consistency levels more susceptible to failures
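
In practice the consistency level is chosen per statement or per session. A minimal cqlsh sketch (assuming a replication factor of 3 in the local datacenter): with LOCAL_QUORUM for both writes and reads, R + W > RF, so a read is guaranteed to see the most recent acknowledged write in that datacenter.

-- cqlsh: apply a consistency level to subsequent statements in this session
CONSISTENCY LOCAL_QUORUM;

INSERT INTO user_analytics.user_events (user_id, event_time, event_type, event_data)
VALUES (uuid(), toTimestamp(now()), 'login', '{}');

SELECT * FROM user_analytics.user_events LIMIT 10;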

Operational Questions

Scaling Scenarios

Q: How would you scale a Cassandra cluster from 10 to 100 nodes?

A: Scaling approach:

  1. Capacity Planning: Calculate data distribution and token ranges
  2. Gradual Addition: Add nodes in batches (5-10 at a time)
  3. Monitor Streaming: Watch data movement and cluster health
  4. Load Testing: Verify performance after each batch
  5. Schema Optimization: Adjust replication factors if needed
  6. Network Planning: Ensure sufficient bandwidth for streaming

Key considerations:

  • Token Assignment: Use vnodes for even distribution
  • Streaming Throttle: Control network usage during rebalancing
  • Read Repair: Monitor consistency during scaling

Troubleshooting Scenarios

Q: Your Cassandra cluster is experiencing high read latency. How do you investigate?

A: Investigation approach:

  1. Identify Affected Nodes: Check nodetool proxyhistograms
  2. Analyze Query Patterns: Look for expensive queries or hotspots
  3. Check Resource Usage: CPU, memory, disk I/O
  4. Review Configuration: Cache settings, compaction strategy
  5. Monitor Compaction: Check for compaction backlog
  6. Network Analysis: Look for network bottlenecks

Common causes:

  • Hot Partitions: Uneven data distribution
  • Large Partitions: Too much data per partition
  • Compaction Pressure: SSTable count too high
  • Cache Misses: Insufficient row/key cache
  • Network Issues: High latency between nodes

Design Integration

System Architecture

Q: How would you design a real-time analytics system using Cassandra?

A: Architecture components:

  1. Data Ingestion: Kafka for high-throughput event streaming
  2. Storage Layer: Cassandra for time-series data storage
  3. Query Layer: Application servers with connection pooling
  4. Caching Layer: Redis for hot data caching
  5. Analytics Layer: Spark for batch processing

Data model:

-- Time-series events table, bucketed by day so a single event_type cannot become one huge hot partition
CREATE TABLE analytics.events (
    event_type TEXT,
    event_date DATE,
    event_time TIMESTAMP,
    user_id UUID,
    event_data TEXT,
    PRIMARY KEY ((event_type, event_date), event_time, user_id)
) WITH CLUSTERING ORDER BY (event_time DESC, user_id ASC);

Considerations:

  • Partition Design: Avoid hot partitions
  • TTL Strategy: Automatic data expiration
  • Compaction: TimeWindowCompactionStrategy for time-series
  • Read Patterns: Optimize for time-range queries
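
A hedged sketch of how the TTL and compaction considerations above translate into table options (the window size and TTL values are illustrative, not recommendations):

-- Time-window compaction plus automatic expiry for the events table
ALTER TABLE analytics.events
WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1'
}
AND default_time_to_live = 2592000;  -- 30 days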

Multi-Datacenter Design

Q: How would you design Cassandra for global deployment?

A: Multi-datacenter strategy:

  1. Geographic Distribution: Deploy in multiple regions
  2. Replication Strategy: NetworkTopologyStrategy with regional replication
  3. Consistency Levels: Use LOCAL_QUORUM for low latency
  4. Data Locality: Route reads to local datacenter
  5. Disaster Recovery: Cross-datacenter replication for RTO/RPO

Configuration:

CREATE KEYSPACE global_app 
WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'us-west': 3,
    'eu-west': 3,
    'asia-pacific': 3
};

Trade-off Analysis

Cassandra vs Alternatives

Q: When would you choose Cassandra over DynamoDB or MongoDB?

A: Choose Cassandra when:

  • Scale Requirements: Need to handle petabytes of data
  • Multi-Datacenter: Global deployment with local reads
  • Time-Series Data: Optimized for time-ordered data
  • Cost Control: Want to avoid vendor lock-in
  • Custom Tuning: Need fine-grained control over performance

Choose DynamoDB when:

  • Managed Service: Want fully managed solution
  • Predictable Workloads: Consistent access patterns
  • AWS Ecosystem: Already using AWS services
  • Auto-Scaling: Need automatic scaling without planning

Choose MongoDB when:

  • Complex Queries: Need rich query capabilities
  • Document Model: Natural fit for document data
  • ACID Transactions: Need strong consistency guarantees
  • Developer Experience: Prefer JSON-like data model

Real-World Scenarios

Production Case Studies

Netflix: Global Content Delivery

  • Scale: 100+ billion events per day
  • Architecture: Multi-datacenter Cassandra deployment
  • Use Case: User viewing history, recommendations
  • Challenges: Global consistency, low latency requirements
  • Solutions: Eventual consistency, local reads, anti-entropy repair

Instagram: User Activity Tracking

  • Scale: 500+ million users, billions of interactions
  • Architecture: Time-series optimized Cassandra
  • Use Case: User feeds, activity streams
  • Challenges: Hot partitions, write amplification
  • Solutions: Partition design, compaction tuning, caching

Failure Stories & Lessons

Configuration Mistakes

  • Scenario: Production cluster experiencing data loss
  • Root Cause: Incorrect replication factor configuration
  • Impact: 2 hours of data loss, service degradation
  • Lesson: Always validate the replication strategy before production deployment

Scaling Issues

  • Scenario: Cluster performance degradation after adding nodes
  • Root Cause: Insufficient network bandwidth for streaming
  • Impact: 50% performance drop during rebalancing
  • Lesson: Plan network capacity before scaling operations

Memory Pressure

  • Scenario: Cluster nodes crashing due to out-of-memory errors
  • Root Cause: Heap size too large, leaving insufficient memory for the OS page cache
  • Impact: Service unavailability, data inconsistency
  • Lesson: Balance heap size against OS cache requirements

This comprehensive guide covers Cassandra from basic concepts to advanced production deployment, providing the depth needed for both technical interviews and real-world system design scenarios.

Related Systems

dynamodb
hbase
mongodb
scylladb

Used By

netflix, instagram, spotify, uber, apple, ebay, github, reddit