Design Top K Heavy Hitters
System Design Challenge
What is Top K Heavy Hitters?
Top K Heavy Hitters is a distributed system that tracks and queries the most frequent items (Top K) across various time windows. It's similar to systems used by analytics platforms, recommendation engines, or real-time dashboards. The service provides real-time tracking, efficient querying, and scalable processing.
What makes a system like Top K Heavy Hitters distinctive is the combination of streaming data processing and probabilistic data structures to answer Top K queries in real time. Understanding this system prepares you for interview questions about similar analytics platforms, since the core design challenges (streaming processing, probabilistic algorithms, real-time queries, and scalability) remain the same.
Functional Requirements
Core (Interview Focused)
- Real-time Tracking: Track item frequencies in real-time streams.
- Top K Queries: Query the most frequent K items for any time window.
- Multiple Time Windows: Support different time windows (1min, 1hour, 1day).
- Item Counting: Maintain accurate counts for tracked items.
Out of Scope
- User authentication and authorization
- Historical data archival
- Item metadata and descriptions
- Real-time notifications
- Mobile app specific features
Non-Functional Requirements
Core (Interview Focused)
- Low latency: Sub-second response time for Top K queries.
- High throughput: Process millions of events per second.
- Scalability: Handle billions of unique items.
- Memory efficiency: Use minimal memory for large-scale tracking.
Out of Scope
- Data retention policies
- Compliance and privacy regulations
💡 Interview Tip: Focus on low latency, high throughput, and memory efficiency. Interviewers care most about streaming processing, probabilistic algorithms, and query optimization.
Core Entities
Entity | Key Attributes | Notes |
---|---|---|
Event | event_id, item_id, timestamp, metadata | Indexed by timestamp for time-based queries |
ItemCount | item_id, count, last_updated, time_window | Track item frequencies |
TopKResult | time_window, k_value, items, counts, timestamp | Cached Top K results |
TimeWindow | window_id, start_time, end_time, window_type | Define time windows |
Stream | stream_id, name, partition_count, status | Data stream information |
💡 Interview Tip: Focus on Event, ItemCount, and TopKResult as they drive streaming processing, counting, and query results.
Core APIs
Event Processing
POST /events { item_id, timestamp, metadata } – Process a new event
POST /events/batch { events[] } – Process multiple events
GET /events/{event_id} – Get event details
GET /events?item_id=&time_range=&limit= – Query events
Top K Queries
GET /topk?k=&time_window=&limit= – Get Top K items for time window
GET /topk/{item_id}?time_window= – Get item rank and count
GET /topk/history?item_id=&time_range= – Get item count history
POST /topk/precompute { time_windows[] } – Precompute Top K results
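To make the contract concrete, here is a hypothetical client interaction with the two core endpoints. The base URL, payload values, and response shape are illustrative assumptions, not part of the spec.

```python
import requests  # hypothetical client; the base URL is an assumption

BASE = "https://topk.example.com"

# Ingest a single event.
requests.post(f"{BASE}/events", json={
    "item_id": "video_42",
    "timestamp": 1700000000,
    "metadata": {"source": "web"},
})

# Fetch the 10 most frequent items over the last hour.
resp = requests.get(f"{BASE}/topk", params={"k": 10, "time_window": "1hour"})
print(resp.json())  # e.g. {"items": [...], "counts": [...], "time_window": "1hour"}
```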
Stream Management
POST /streams { name, partition_count } – Create a new stream
GET /streams/{stream_id} – Get stream details
PUT /streams/{stream_id}/status { status } – Update stream status
GET /streams?status=&limit= – List streams
High-Level Design
System Architecture Diagram
Key Components
- Stream Processor: Process incoming events in real-time
- Count Tracker: Track item frequencies using probabilistic data structures
- Top K Engine: Generate Top K results efficiently
- Query Service: Handle Top K queries with low latency
- Cache Layer: Cache Top K results for fast access
- Database: Persistent storage for events and results
Mapping Core Functional Requirements to Components
Functional Requirement | Responsible Components | Key Considerations |
---|---|---|
Real-time Tracking | Stream Processor, Count Tracker | High throughput, memory efficiency |
Top K Queries | Query Service, Top K Engine | Low latency, accurate results |
Multiple Time Windows | Count Tracker, Top K Engine | Time window management, data aggregation |
Item Counting | Count Tracker, Database | Accurate counting, storage efficiency |
Detailed Design
Stream Processor
Purpose: Process incoming events in real-time streams.
Key Design Decisions:
- Stream Partitioning: Partition streams for parallel processing
- Event Validation: Validate events before processing
- Batch Processing: Process events in batches for efficiency
- Error Handling: Handle processing errors gracefully
Algorithm: Stream processing
1. Receive event stream
2. Partition events by item_id
3. For each partition:
- Validate event data
- Extract item_id and timestamp
- Update count tracker
- Trigger Top K recalculation
4. Handle processing errors:
- Retry failed events
- Log error details
- Continue processing
5. Update stream statistics
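A minimal sketch of this flow, assuming 16 partitions, a plain per-partition Counter standing in for the probabilistic tracker described in the next section, and a simple deterministic hash for routing; the Top K recalculation trigger is omitted:

```python
import logging
from collections import Counter

NUM_PARTITIONS = 16
counts = [Counter() for _ in range(NUM_PARTITIONS)]  # one counter per partition

def partition_for(item_id: str) -> int:
    # Deterministic routing: the same item always lands on the same partition.
    # Python's built-in hash() is salted per process, so avoid it here.
    return sum(item_id.encode()) % NUM_PARTITIONS

def process_batch(events):
    failed = []
    for event in events:
        item_id = event.get("item_id")
        if not item_id or "timestamp" not in event:
            logging.warning("dropping malformed event: %r", event)
            failed.append(event)   # step 4: collect for retry, keep processing
            continue
        counts[partition_for(item_id)][item_id] += 1  # step 3: update counts
    return failed

failed = process_batch([
    {"item_id": "video_42", "timestamp": 1700000000},
    {"timestamp": 1700000001},  # missing item_id -> logged and skipped
])
```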
Count Tracker
Purpose: Track item frequencies using probabilistic data structures.
Key Design Decisions:
- Probabilistic Structures: Use Count-Min Sketch for approximate per-item frequencies (HyperLogLog estimates distinct-item cardinality, not per-item counts)
- Memory Efficiency: Minimize memory usage for large-scale tracking
- Accuracy Trade-offs: Balance accuracy with memory usage
- Time Windows: Support multiple time windows efficiently
Algorithm: Count-Min Sketch implementation
1. Initialize Count-Min Sketch with hash functions
2. For each event:
- Hash item_id with multiple hash functions
- Increment counters in sketch
- Update time window counters
3. For count queries:
- Hash item_id with same hash functions
- Return minimum counter value
4. Handle time window expiration:
- Reset expired counters
- Maintain active time windows
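A minimal Count-Min Sketch in Python following the steps above; the (epsilon, delta) sizing and the salted-MD5 hash family are standard choices but assumptions relative to the text:

```python
import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon: float = 0.001, delta: float = 0.01):
        # Standard sizing: estimates exceed the true count by at most
        # epsilon * N (N = total events) with probability >= 1 - delta.
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _buckets(self, item_id: str):
        # One bucket per row, derived by salting a single hash function.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item_id}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item_id: str, count: int = 1):
        for row, col in self._buckets(item_id):
            self.table[row][col] += count

    def estimate(self, item_id: str) -> int:
        # Collisions only inflate counters, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(self.table[row][col] for row, col in self._buckets(item_id))

sketch = CountMinSketch()
for item in ["a", "b", "a", "c", "a"]:
    sketch.add(item)
print(sketch.estimate("a"))  # 3, possibly higher under collisions
```

One sketch per active time window also keeps expiration simple: retiring a window amounts to discarding its sketch.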
Top K Engine
Purpose: Generate Top K results efficiently.
Key Design Decisions:
- Heap-based Algorithm: Use min-heap for Top K maintenance
- Incremental Updates: Update Top K incrementally
- Time Window Management: Handle multiple time windows
- Result Caching: Cache Top K results for performance
Algorithm: Top K generation
1. Maintain min-heap of size K
2. For each count update:
- Check if item is in heap
- If in heap:
- Update item count
- Reheapify if needed
- If not in heap:
- If heap not full:
- Add item to heap
- Else if count > min heap count:
- Replace min item
- Reheapify
3. Return heap contents as Top K
4. Cache result for time window
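A sketch of the fixed-size min-heap above, assuming count updates arrive as (item, new_count) pairs. Python's heapq lacks decrease-key, so in-heap updates rebuild the heap, an O(K) cost that is cheap for typical K:

```python
import heapq

class TopKHeap:
    """Fixed-size min-heap: the root is the weakest current Top K member."""

    def __init__(self, k: int):
        self.k = k
        self.heap = []      # (count, item) pairs
        self.members = {}   # item -> count, mirrors heap membership

    def offer(self, item: str, count: int):
        if item in self.members:
            # In heap: update the count, then restore the heap property.
            self.members[item] = count
            self.heap = [(c, i) for i, c in self.members.items()]
            heapq.heapify(self.heap)
        elif len(self.heap) < self.k:
            # Heap not full: admit unconditionally.
            heapq.heappush(self.heap, (count, item))
            self.members[item] = count
        elif count > self.heap[0][0]:
            # Beats the current minimum: evict it and admit the new item.
            _, evicted = heapq.heapreplace(self.heap, (count, item))
            del self.members[evicted]
            self.members[item] = count

    def top_k(self):
        return sorted(self.heap, reverse=True)  # highest count first

topk = TopKHeap(k=2)
for item, count in [("a", 5), ("b", 3), ("c", 4), ("a", 6)]:
    topk.offer(item, count)
print(topk.top_k())  # [(6, 'a'), (4, 'c')]
```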
Query Service
Purpose: Handle Top K queries with low latency.
Key Design Decisions:
- Query Optimization: Optimize queries for different time windows
- Result Caching: Cache frequent query results
- Query Routing: Route queries to appropriate servers
- Response Formatting: Format results efficiently
Algorithm: Top K query processing
1. Receive Top K query request
2. Parse query parameters:
- K value
- Time window
- Additional filters
3. Check cache for result
4. If not cached:
- Query count tracker
- Generate Top K result
- Cache result
5. Format response:
- Include item counts
- Include metadata
- Include query metadata
6. Return result to client
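A cache-aside sketch of this flow, assuming the TopKHeap from the previous section and the redis-py client; the cache key scheme and the 5-second TTL are illustrative:

```python
import json
import redis  # pip install redis; assumes a reachable Redis instance

cache = redis.Redis()

def get_top_k(k: int, time_window: str, engine) -> dict:
    key = f"topk:{time_window}:{k}"      # one cache entry per (window, K)
    cached = cache.get(key)              # step 3: check cache
    if cached:
        return json.loads(cached)
    result = {                           # step 4: compute on miss
        "time_window": time_window,
        "k": k,
        "items": engine.top_k()[:k],
    }
    cache.set(key, json.dumps(result), ex=5)  # short TTL keeps results fresh
    return result                        # step 6: hand back to the caller
```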
Database Design
Events Table
Field | Type | Description |
---|---|---|
event_id | VARCHAR(36) | Primary key |
item_id | VARCHAR(255) | Event item |
timestamp | TIMESTAMP | Event timestamp |
metadata | JSON | Additional event data |
created_at | TIMESTAMP | Creation timestamp |
Indexes:
- idx_item_timestamp on (item_id, timestamp) - Item event history
- idx_timestamp on (timestamp) - Time-based queries
Item Counts Table
Field | Type | Description |
---|---|---|
item_id | VARCHAR(255) | Item identifier |
time_window | VARCHAR(50) | Time window |
count | BIGINT | Item count |
last_updated | TIMESTAMP | Last update |
Indexes:
- idx_count on (count) - Count-based queries
- idx_last_updated on (last_updated) - Recent updates
Top K Results Table
Field | Type | Description |
---|---|---|
result_id | VARCHAR(36) | Primary key |
time_window | VARCHAR(50) | Time window |
k_value | INT | K value |
items | JSON | Top K items |
counts | JSON | Item counts |
created_at | TIMESTAMP | Result timestamp |
Indexes:
- idx_time_window_k on (time_window, k_value) - Result queries
- idx_created_at on (created_at) - Recent results
Scalability Considerations
Horizontal Scaling
- Stream Processor: Scale horizontally with stream partitioning
- Count Tracker: Use consistent hashing for data distribution (see the sketch after this list)
- Top K Engine: Scale Top K generation with distributed computing
- Database: Shard events by item_id
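A minimal consistent-hash ring for the Count Tracker, as one way to keep each item_id pinned to a node while limiting data movement when nodes join or leave; the virtual-node count is an assumption:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        # Many virtual positions per node smooth the key distribution
        # and limit reshuffling when the node set changes.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, item_id: str) -> str:
        # Walk clockwise to the first virtual node at or after the hash.
        idx = bisect.bisect(self.keys, self._hash(item_id)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["tracker-1", "tracker-2", "tracker-3"])
print(ring.node_for("video_42"))  # the same item always maps to the same node
```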
Caching Strategy
- Redis: Cache Top K results and count data
- Application Cache: Cache frequently accessed data
- Database Cache: Cache count and result data
Performance Optimization
- Connection Pooling: Efficient database connections
- Batch Processing: Batch event processing for efficiency
- Async Processing: Non-blocking event processing
- Resource Monitoring: Monitor CPU, memory, and network usage
Monitoring and Observability
Key Metrics
- Event Processing Rate: Events processed per second
- Query Latency: Average Top K query response time
- Memory Usage: Memory usage for count tracking
- System Health: CPU, memory, and disk usage
Alerting
- High Latency: Alert when query time exceeds threshold
- Memory Usage: Alert when memory usage exceeds limits
- Processing Errors: Alert when event processing fails
- System Errors: Alert on Top K generation failures
Trade-offs and Considerations
Accuracy vs. Memory
- Choice: Use probabilistic data structures for memory efficiency
- Reasoning: Balance between count accuracy and memory usage
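To make the trade-off concrete: with the standard Count-Min Sketch sizing of width w = ⌈e/ε⌉ and depth d = ⌈ln(1/δ)⌉, choosing ε = 0.001 and δ = 0.01 gives w = 2719 and d = 5, about 13,600 counters (roughly 110 KB at 8 bytes each) regardless of how many billions of distinct items the stream contains. Each estimate then overcounts by at most 0.1% of total stream volume with 99% probability.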
Latency vs. Throughput
- Choice: Optimize for throughput with batch processing
- Reasoning: High event volume requires efficient processing
Consistency vs. Availability
- Choice: Eventual consistency for count updates
- Reasoning: Count accuracy can tolerate slight delays
Common Interview Questions
Q: How would you handle memory constraints?
A: Use probabilistic data structures: a Count-Min Sketch keeps approximate per-item frequencies in a few fixed-size arrays, and HyperLogLog can track the number of distinct items, so memory stays small and bounded no matter how many unique items the stream contains.
Q: How do you ensure Top K accuracy?
A: The Count-Min Sketch's minimum-across-rows estimate never undercounts, and its error is bounded by the sketch's width and depth, so size those for the error you can tolerate. Keeping exact counts for the small set of items currently in the Top K heap makes the ranking itself precise.
Q: How would you scale this system globally?
A: Deploy regional processors, use geo-distributed databases, and implement data replication strategies.
Q: How do you handle time window management?
A: Use sliding windows built from fixed-size buckets, expire buckets as they age out of the window, and run automated cleanup, as sketched below.
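A minimal sliding-window counter along those lines, assuming 1-second buckets and lazy cleanup at read time:

```python
import time
from collections import Counter, deque

class SlidingWindowCounts:
    def __init__(self, window_seconds: int):
        self.window = window_seconds
        self.buckets = deque()  # (bucket_second, Counter) in time order

    def add(self, item_id: str, now=None):
        sec = int(now if now is not None else time.time())
        if not self.buckets or self.buckets[-1][0] != sec:
            self.buckets.append((sec, Counter()))  # open a new 1s bucket
        self.buckets[-1][1][item_id] += 1

    def counts(self, now=None) -> Counter:
        cutoff = (now if now is not None else time.time()) - self.window
        # Automated cleanup: drop buckets that aged out of the window.
        while self.buckets and self.buckets[0][0] <= cutoff:
            self.buckets.popleft()
        total = Counter()
        for _, bucket in self.buckets:
            total.update(bucket)
        return total

w = SlidingWindowCounts(window_seconds=60)
w.add("video_42", now=100)
w.add("video_42", now=130)
print(w.counts(now=150)["video_42"])  # 2; by now=165 the t=100 bucket expires
```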
Key Takeaways
- Probabilistic Data Structures: Essential for memory-efficient large-scale counting
- Stream Processing: Real-time event processing enables immediate Top K updates
- Query Optimization: Caching and efficient algorithms provide fast Top K queries
- Scalability: Horizontal scaling and partitioning are crucial for handling large-scale data
- Monitoring: Comprehensive monitoring ensures system reliability and performance