Design Top K Heavy Hitters

System Design Challenge

Medium

What is Top K Heavy Hitters?

Top K Heavy Hitters is a distributed system that tracks and queries the most frequent items (Top K) across various time windows. It's similar to systems used by analytics platforms, recommendation engines, or real-time dashboards. The service provides real-time tracking, efficient querying, and scalable processing.

What makes a system like Top K Heavy Hitters distinctive is its combination of streaming data processing with probabilistic data structures to answer Top K queries in real time. By understanding this system, you can tackle interview questions for similar analytics platforms, since the core design challenges remain the same: streaming processing, probabilistic algorithms, real-time queries, and scalability.


Functional Requirements

Core (Interview Focused)

  • Real-time Tracking: Track item frequencies in real-time streams.
  • Top K Queries: Query the most frequent K items for any time window.
  • Multiple Time Windows: Support different time windows (1min, 1hour, 1day).
  • Item Counting: Maintain accurate counts for tracked items.

Out of Scope

  • User authentication and authorization
  • Historical data archival
  • Item metadata and descriptions
  • Real-time notifications
  • Mobile app specific features

Non-Functional Requirements

Core (Interview Focused)

  • Low latency: Sub-second response time for Top K queries.
  • High throughput: Process millions of events per second.
  • Scalability: Handle billions of unique items.
  • Memory efficiency: Use minimal memory for large-scale tracking.

Out of Scope

  • Data retention policies
  • Compliance and privacy regulations

💡 Interview Tip: Focus on low latency, high throughput, and memory efficiency. Interviewers care most about streaming processing, probabilistic algorithms, and query optimization.


Core Entities

| Entity | Key Attributes | Notes |
|---|---|---|
| Event | event_id, item_id, timestamp, metadata | Indexed by timestamp for time-based queries |
| ItemCount | item_id, count, last_updated, time_window | Track item frequencies |
| TopKResult | time_window, k_value, items, counts, timestamp | Cached Top K results |
| TimeWindow | window_id, start_time, end_time, window_type | Define time windows |
| Stream | stream_id, name, partition_count, status | Data stream information |
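
As a sketch, the three interview-critical entities might map to types like these (only the field names come from the table above; the Python types are assumptions):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Event:
    event_id: str
    item_id: str
    timestamp: datetime
    metadata: dict = field(default_factory=dict)

@dataclass
class ItemCount:
    item_id: str
    count: int
    last_updated: datetime
    time_window: str          # e.g. "1min", "1hour", "1day"

@dataclass
class TopKResult:
    time_window: str
    k_value: int
    items: list[str]          # item_ids, ranked
    counts: list[int]         # parallel to items
    timestamp: datetime
```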

💡 Interview Tip: Focus on Event, ItemCount, and TopKResult as they drive streaming processing, counting, and query results.


Core APIs

Event Processing

  • POST /events { item_id, timestamp, metadata } – Process a new event
  • POST /events/batch { events[] } – Process multiple events
  • GET /events/{event_id} – Get event details
  • GET /events?item_id=&time_range=&limit= – Query events

Top K Queries

  • GET /topk?k=&time_window=&limit= – Get Top K items for time window (see the example below)
  • GET /topk/{item_id}?time_window= – Get item rank and count
  • GET /topk/history?item_id=&time_range= – Get item count history
  • POST /topk/precompute { time_windows[] } – Precompute Top K results
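
For instance, fetching the hourly Top 10 from the first endpoint might look like this (the host and exact response shape are illustrative assumptions, not part of the contract above):

```python
import requests

# Hypothetical client call against the /topk endpoint.
resp = requests.get(
    "https://api.example.com/topk",
    params={"k": 10, "time_window": "1hour"},
    timeout=2,
)
resp.raise_for_status()
for entry in resp.json()["items"]:
    print(entry["item_id"], entry["count"])
```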

Stream Management

  • POST /streams { name, partition_count } – Create a new stream
  • GET /streams/{stream_id} – Get stream details
  • PUT /streams/{stream_id}/status { status } – Update stream status
  • GET /streams?status=&limit= – List streams

High-Level Design

System Architecture

Events flow from the Stream Processor into the Count Tracker; the Top K Engine turns tracked counts into ranked results, which the Query Service serves through the Cache Layer, with the Database as persistent storage.

Key Components

  • Stream Processor: Process incoming events in real-time
  • Count Tracker: Track item frequencies using probabilistic data structures
  • Top K Engine: Generate Top K results efficiently
  • Query Service: Handle Top K queries with low latency
  • Cache Layer: Cache Top K results for fast access
  • Database: Persistent storage for events and results

Mapping Core Functional Requirements to Components

| Functional Requirement | Responsible Components | Key Considerations |
|---|---|---|
| Real-time Tracking | Stream Processor, Count Tracker | High throughput, memory efficiency |
| Top K Queries | Query Service, Top K Engine | Low latency, accurate results |
| Multiple Time Windows | Count Tracker, Top K Engine | Time window management, data aggregation |
| Item Counting | Count Tracker, Database | Accurate counting, storage efficiency |

Detailed Design

Stream Processor

Purpose: Process incoming events in real-time streams.

Key Design Decisions:

  • Stream Partitioning: Partition streams for parallel processing
  • Event Validation: Validate events before processing
  • Batch Processing: Process events in batches for efficiency
  • Error Handling: Handle processing errors gracefully

Algorithm: Stream processing

1. Receive event stream
2. Partition events by item_id
3. For each partition:
   - Validate event data
   - Extract item_id and timestamp
   - Update count tracker
   - Trigger Top K recalculation
4. Handle processing errors:
   - Retry failed events
   - Log error details
   - Continue processing
5. Update stream statistics
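
A minimal single-process sketch of steps 2–4, assuming events arrive as dicts (a production system would use a partitioned log such as Kafka, and the tracker's add method is a hypothetical interface):

```python
import hashlib
import logging

NUM_PARTITIONS = 8   # assumption; real systems use the stream's partition_count

def partition_for(item_id: str) -> int:
    # Stable hash so the same item always lands on the same partition.
    digest = hashlib.md5(item_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def process_batch(events: list, trackers: list) -> None:
    """Validate each event and route it to its partition's count tracker."""
    for event in events:
        try:
            item_id = event["item_id"]        # validate required fields
            timestamp = event["timestamp"]
        except KeyError as exc:
            logging.warning("dropping malformed event, missing %s", exc)
            continue                          # log the error, keep processing
        # 'add' is a hypothetical tracker interface (see Count Tracker below).
        trackers[partition_for(item_id)].add(item_id, timestamp)
```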

Count Tracker

Purpose: Track item frequencies using probabilistic data structures.

Key Design Decisions:

  • Probabilistic Structures: Use a Count-Min Sketch for approximate per-item frequencies (HyperLogLog estimates distinct-item cardinality, not frequencies)
  • Memory Efficiency: Minimize memory usage for large-scale tracking
  • Accuracy Trade-offs: Balance accuracy with memory usage
  • Time Windows: Support multiple time windows efficiently

Algorithm: Count-Min Sketch implementation

1. Initialize Count-Min Sketch with hash functions
2. For each event:
   - Hash item_id with multiple hash functions
   - Increment counters in sketch
   - Update time window counters
3. For count queries:
   - Hash item_id with same hash functions
   - Return minimum counter value
4. Handle time window expiration:
   - Reset expired counters
   - Maintain active time windows
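
A compact sketch of that structure (the width and depth defaults are illustrative; in practice they are derived from a target error ε and confidence δ, with w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉):

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: never undercounts, may overcount."""

    def __init__(self, width: int = 2048, depth: int = 5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item: str):
        # One independent hash per row, derived from a salted SHA-256.
        for i in range(self.depth):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in enumerate(self._hashes(item)):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions only ever inflate counters, so the row-wise
        # minimum is the tightest (and still safe) estimate.
        return min(self.table[row][col]
                   for row, col in enumerate(self._hashes(item)))
```

Because every update touches each row, collisions can only inflate counters; taking the minimum across rows bounds the overestimation.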

Top K Engine

Purpose: Generate Top K results efficiently.

Key Design Decisions:

  • Heap-based Algorithm: Use min-heap for Top K maintenance
  • Incremental Updates: Update Top K incrementally
  • Time Window Management: Handle multiple time windows
  • Result Caching: Cache Top K results for performance

Algorithm: Top K generation

1. Maintain min-heap of size K
2. For each count update:
   - Check if item is in heap
   - If in heap:
     - Update item count
     - Reheapify if needed
   - If not in heap:
     - If heap not full:
       - Add item to heap
     - Else if count > min heap count:
       - Replace min item
       - Reheapify
3. Return heap contents as Top K
4. Cache result for time window
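
A batch sketch of the heap logic above (the incremental variant additionally tracks heap membership so an in-place count update can trigger a re-heapify):

```python
import heapq

def top_k_from_counts(counts: dict, k: int) -> list:
    """Return the k most frequent items using a size-k min-heap."""
    heap = []                                  # (count, item); min at heap[0]
    for item, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, item))
        elif count > heap[0][0]:               # beats the current minimum
            heapq.heapreplace(heap, (count, item))
    # Highest counts first for presentation.
    return sorted(((item, c) for c, item in heap),
                  key=lambda pair: pair[1], reverse=True)
```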

Query Service

Purpose: Handle Top K queries with low latency.

Key Design Decisions:

  • Query Optimization: Optimize queries for different time windows
  • Result Caching: Cache frequent query results
  • Query Routing: Route queries to appropriate servers
  • Response Formatting: Format results efficiently

Algorithm: Top K query processing

1. Receive Top K query request
2. Parse query parameters:
   - K value
   - Time window
   - Additional filters
3. Check cache for result
4. If not cached:
   - Query count tracker
   - Generate Top K result
   - Cache result
5. Format response:
   - Include item counts
   - Include metadata
   - Include query metadata
6. Return result to client
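
An in-process sketch of that flow (engine.top_k is a hypothetical interface; at scale a shared cache such as Redis replaces the local dict, as discussed under Caching Strategy below):

```python
import time

CACHE_TTL_SECONDS = 5        # assumption: callers tolerate ~5s staleness
_cache: dict = {}            # key -> (cached_at, items)

def handle_top_k_query(k: int, time_window: str, engine) -> dict:
    """Parse parameters, check the cache, compute on miss, format."""
    key = f"{time_window}:{k}"                     # parsed query parameters
    cached = _cache.get(key)
    if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
        items = cached[1]                          # cache hit
    else:
        items = engine.top_k(k, time_window)       # hypothetical engine call
        _cache[key] = (time.time(), items)
    return {                                       # formatted response
        "time_window": time_window,
        "k": k,
        "items": items,
        "generated_at": time.time(),
    }
```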

Database Design

Events Table

| Field | Type | Description |
|---|---|---|
| event_id | VARCHAR(36) | Primary key |
| item_id | VARCHAR(255) | Event item |
| timestamp | TIMESTAMP | Event timestamp |
| metadata | JSON | Additional event data |
| created_at | TIMESTAMP | Creation timestamp |

Indexes:

  • idx_item_timestamp on (item_id, timestamp) - Item event history
  • idx_timestamp on (timestamp) - Time-based queries

Item Counts Table

| Field | Type | Description |
|---|---|---|
| item_id | VARCHAR(255) | Item identifier |
| time_window | VARCHAR(50) | Time window |
| count | BIGINT | Item count |
| last_updated | TIMESTAMP | Last update |

Indexes:

  • idx_count on (count) - Count-based queries
  • idx_last_updated on (last_updated) - Recent updates

Top K Results Table

| Field | Type | Description |
|---|---|---|
| result_id | VARCHAR(36) | Primary key |
| time_window | VARCHAR(50) | Time window |
| k_value | INT | K value |
| items | JSON | Top K items |
| counts | JSON | Item counts |
| created_at | TIMESTAMP | Result timestamp |

Indexes:

  • idx_time_window_k on (time_window, k_value) - Result queries
  • idx_created_at on (created_at) - Recent results

Scalability Considerations

Horizontal Scaling

  • Stream Processor: Scale horizontally with stream partitioning
  • Count Tracker: Use consistent hashing for data distribution (see the ring sketch after this list)
  • Top K Engine: Scale Top K generation with distributed computing
  • Database: Shard events by item_id
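
A minimal consistent-hash ring for routing item_ids to Count Tracker nodes (the virtual-node count is an arbitrary assumption; more vnodes give smoother balance):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map item_ids to nodes; adding a node moves only ~1/N of the keys."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each node appears vnodes times on the ring for better balance.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def node_for(self, item_id: str) -> str:
        # First ring position clockwise from the item's hash.
        idx = bisect.bisect(self._keys, self._hash(item_id)) % len(self._ring)
        return self._ring[idx][1]
```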

Caching Strategy

  • Redis: Cache Top K results and count data (see the sketch after this list)
  • Application Cache: Cache frequently accessed data
  • Database Cache: Cache count and result data
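
A cache-aside sketch against Redis (the host, key format, and 5-second TTL are assumptions; redis-py is the assumed client):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)    # assumed local instance

def cached_top_k(time_window: str, k: int, compute):
    """Serve Top K from Redis; recompute and store with a short TTL on miss."""
    key = f"topk:{time_window}:{k}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    result = compute(time_window, k)             # hypothetical compute callback
    r.setex(key, 5, json.dumps(result))          # 5s TTL: freshness vs load
    return result
```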

Performance Optimization

  • Connection Pooling: Efficient database connections
  • Batch Processing: Batch event processing for efficiency (see the sketch after this list)
  • Async Processing: Non-blocking event processing
  • Resource Monitoring: Monitor CPU, memory, and network usage
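
One way to realize the batching idea above (the queue, batch size, and flush timeout are illustrative assumptions; flush would apply the batch to the count tracker):

```python
import queue

events = queue.Queue()   # producers enqueue raw event dicts
BATCH_SIZE = 500
FLUSH_TIMEOUT = 0.1      # seconds; illustrative, tune per workload

def batch_worker(flush) -> None:
    """Group events into batches so per-event overhead is amortized."""
    batch = []
    while True:
        try:
            batch.append(events.get(timeout=FLUSH_TIMEOUT))
        except queue.Empty:
            pass                       # timed out: flush whatever we have
        if batch and (len(batch) >= BATCH_SIZE or events.empty()):
            flush(batch)
            batch = []
```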

Monitoring and Observability

Key Metrics

  • Event Processing Rate: Events processed per second
  • Query Latency: Average Top K query response time
  • Memory Usage: Memory usage for count tracking
  • System Health: CPU, memory, and disk usage

Alerting

  • High Latency: Alert when query time exceeds threshold
  • Memory Usage: Alert when memory usage exceeds limits
  • Processing Errors: Alert when event processing fails
  • System Errors: Alert on Top K generation failures

Trade-offs and Considerations

Accuracy vs. Memory

  • Choice: Use probabilistic data structures for memory efficiency
  • Reasoning: Balance between count accuracy and memory usage

Latency vs. Throughput

  • Choice: Optimize for throughput with batch processing
  • Reasoning: High event volume requires efficient processing

Consistency vs. Availability

  • Choice: Eventual consistency for count updates
  • Reasoning: Count accuracy can tolerate slight delays

Common Interview Questions

Q: How would you handle memory constraints?

A: Use probabilistic data structures. A Count-Min Sketch keeps approximate per-item counts in a fixed-size table regardless of how many distinct items appear, trading a small, tunable overestimation error for bounded memory; HyperLogLog can supplement it when you also need distinct-item cardinality.

Q: How do you ensure Top K accuracy?

A: Size the Count-Min Sketch for the accuracy you need: with width w and depth d, estimates overcount by at most εN (ε ≈ e/w) with probability at least 1 − δ (δ ≈ e^(−d)). Keep an exact min-heap of candidate heavy hitters alongside the sketch, and periodically validate hot items against exact counts in the database.

Q: How would you scale this system globally?

A: Deploy regional stream processors that each maintain local sketches, replicate events or aggregates asynchronously, and merge regional results for global queries. Count-Min Sketches with identical dimensions and hash functions merge by element-wise addition, which makes cross-region aggregation cheap.

Q: How do you handle time window management?

A: Keep counts in small fixed-size buckets (e.g., one per minute) and answer a window query by merging the buckets the window covers; buckets older than the largest supported window are dropped automatically, as in the sketch below.
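
A minimal sketch of that bucketing scheme, assuming events arrive roughly in time order (per-minute granularity and one-day retention are illustrative):

```python
import time
from collections import defaultdict, deque
from typing import Optional

class BucketedCounter:
    """Per-minute buckets; a window query merges the buckets it spans."""

    def __init__(self, max_minutes: int = 1440):    # retain one day of buckets
        self.max_minutes = max_minutes
        self.buckets = deque()                      # (minute, {item: count})

    def add(self, item: str, ts: Optional[float] = None) -> None:
        minute = int((ts if ts is not None else time.time()) // 60)
        if not self.buckets or self.buckets[-1][0] != minute:
            self.buckets.append((minute, defaultdict(int)))
            while len(self.buckets) > self.max_minutes:
                self.buckets.popleft()              # automated cleanup
        self.buckets[-1][1][item] += 1

    def window_counts(self, window_minutes: int) -> dict:
        """Merge all buckets inside the requested window."""
        cutoff = int(time.time() // 60) - window_minutes
        merged = defaultdict(int)
        for minute, bucket in self.buckets:
            if minute > cutoff:
                for item, count in bucket.items():
                    merged[item] += count
        return dict(merged)
```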


Key Takeaways

  1. Probabilistic Data Structures: Essential for memory-efficient large-scale counting
  2. Stream Processing: Real-time event processing enables immediate Top K updates
  3. Query Optimization: Caching and efficient algorithms provide fast Top K queries
  4. Scalability: Horizontal scaling and partitioning are crucial for handling large-scale data
  5. Monitoring: Comprehensive monitoring ensures system reliability and performance