Design Top K Heavy Hitters

System Design Challenge

Medium

What is Top K Heavy Hitters?

Top K Heavy Hitters is a distributed system that tracks and queries the most frequent items (Top K) across various time windows. It's similar to systems used by analytics platforms, recommendation engines, or real-time dashboards. The service provides real-time tracking, efficient querying, and scalable processing.

What makes a system like Top K Heavy Hitters distinctive is its combination of streaming data processing with probabilistic data structures to answer Top K queries in real time. By understanding this system, you can tackle interview questions for similar analytics platforms, since the core design challenges remain the same: streaming processing, probabilistic algorithms, real-time queries, and scalability.


Functional Requirements

Core (Interview Focused)

  • Real-time Tracking: Track item frequencies in real-time streams.
  • Top K Queries: Query the most frequent K items for any time window.
  • Multiple Time Windows: Support different time windows (1min, 1hour, 1day).
  • Item Counting: Maintain accurate counts for tracked items.

Out of Scope

  • User authentication and authorization
  • Historical data archival
  • Item metadata and descriptions
  • Real-time notifications
  • Mobile app specific features

Non-Functional Requirements

Core (Interview Focused)

  • Low latency: Sub-second response time for Top K queries.
  • High throughput: Process millions of events per second.
  • Scalability: Handle billions of unique items.
  • Memory efficiency: Use minimal memory for large-scale tracking.

Out of Scope

  • Data retention policies
  • Compliance and privacy regulations

💡 Interview Tip: Focus on low latency, high throughput, and memory efficiency. Interviewers care most about streaming processing, probabilistic algorithms, and query optimization.


Core Entities

| Entity | Key Attributes | Notes |
|---|---|---|
| Event | event_id, item_id, timestamp, metadata | Indexed by timestamp for time-based queries |
| ItemCount | item_id, count, last_updated, time_window | Track item frequencies |
| TopKResult | time_window, k_value, items, counts, timestamp | Cached Top K results |
| TimeWindow | window_id, start_time, end_time, window_type | Define time windows |
| Stream | stream_id, name, partition_count, status | Data stream information |
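
As a sketch, the three interview-critical entities might map to types like these (only the field names come from the table above; the Python types are assumptions):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Event:
    event_id: str
    item_id: str
    timestamp: datetime
    metadata: dict = field(default_factory=dict)

@dataclass
class ItemCount:
    item_id: str
    count: int
    last_updated: datetime
    time_window: str          # e.g. "1min", "1hour", "1day"

@dataclass
class TopKResult:
    time_window: str
    k_value: int
    items: list[str]          # item_ids, ranked
    counts: list[int]         # parallel to items
    timestamp: datetime
```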

💡 Interview Tip: Focus on Event, ItemCount, and TopKResult as they drive streaming processing, counting, and query results.


Core APIs

Event Processing

  • POST /events { item_id, timestamp, metadata } – Process a new event
  • POST /events/batch { events[] } – Process multiple events
  • GET /events/{event_id} – Get event details
  • GET /events?item_id=&time_range=&limit= – Query events

Top K Queries

  • GET /topk?k=&time_window=&limit= – Get Top K items for time window (see the example below)
  • GET /topk/{item_id}?time_window= – Get item rank and count
  • GET /topk/history?item_id=&time_range= – Get item count history
  • POST /topk/precompute { time_windows[] } – Precompute Top K results
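
For instance, fetching the hourly Top 10 from the first endpoint might look like this (the host and exact response shape are illustrative assumptions, not part of the contract above):

```python
import requests

# Hypothetical client call against the /topk endpoint.
resp = requests.get(
    "https://api.example.com/topk",
    params={"k": 10, "time_window": "1hour"},
    timeout=2,
)
resp.raise_for_status()
for entry in resp.json()["items"]:
    print(entry["item_id"], entry["count"])
```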

Stream Management

  • POST /streams { name, partition_count } – Create a new stream
  • GET /streams/{stream_id} – Get stream details
  • PUT /streams/{stream_id}/status { status } – Update stream status
  • GET /streams?status=&limit= – List streams

High-Level Design

System Architecture

Events flow from the Stream Processor into the Count Tracker; the Top K Engine turns tracked counts into ranked results, which the Query Service serves through the Cache Layer, with the Database as persistent storage.

Key Components

  • Stream Processor: Process incoming events in real-time
  • Count Tracker: Track item frequencies using probabilistic data structures
  • Top K Engine: Generate Top K results efficiently
  • Query Service: Handle Top K queries with low latency
  • Cache Layer: Cache Top K results for fast access
  • Database: Persistent storage for events and results

Mapping Core Functional Requirements to Components

| Functional Requirement | Responsible Components | Key Considerations |
|---|---|---|
| Real-time Tracking | Stream Processor, Count Tracker | High throughput, memory efficiency |
| Top K Queries | Query Service, Top K Engine | Low latency, accurate results |
| Multiple Time Windows | Count Tracker, Top K Engine | Time window management, data aggregation |
| Item Counting | Count Tracker, Database | Accurate counting, storage efficiency |

Detailed Design

Stream Processor

Purpose: Process incoming events in real-time streams.

Key Design Decisions:

  • Stream Partitioning: Partition streams for parallel processing
  • Event Validation: Validate events before processing
  • Batch Processing: Process events in batches for efficiency
  • Error Handling: Handle processing errors gracefully

Algorithm: Stream processing

1. Receive event stream
2. Partition events by item_id
3. For each partition:
   - Validate event data
   - Extract item_id and timestamp
   - Update count tracker
   - Trigger Top K recalculation
4. Handle processing errors:
   - Retry failed events
   - Log error details
   - Continue processing
5. Update stream statistics
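
A minimal single-process sketch of steps 2–4, assuming events arrive as dicts (a production system would use a partitioned log such as Kafka, and the tracker's add method is a hypothetical interface):

```python
import hashlib
import logging

NUM_PARTITIONS = 8   # assumption; real systems use the stream's partition_count

def partition_for(item_id: str) -> int:
    # Stable hash so the same item always lands on the same partition.
    digest = hashlib.md5(item_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def process_batch(events: list, trackers: list) -> None:
    """Validate each event and route it to its partition's count tracker."""
    for event in events:
        try:
            item_id = event["item_id"]        # validate required fields
            timestamp = event["timestamp"]
        except KeyError as exc:
            logging.warning("dropping malformed event, missing %s", exc)
            continue                          # log the error, keep processing
        # 'add' is a hypothetical tracker interface (see Count Tracker below).
        trackers[partition_for(item_id)].add(item_id, timestamp)
```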

Count Tracker

Purpose: Track item frequencies using probabilistic data structures.

Key Design Decisions:

  • Probabilistic Structures: Use a Count-Min Sketch for approximate per-item frequencies (HyperLogLog estimates distinct-item cardinality, not frequencies)
  • Memory Efficiency: Minimize memory usage for large-scale tracking
  • Accuracy Trade-offs: Balance accuracy with memory usage
  • Time Windows: Support multiple time windows efficiently

Algorithm: Count-Min Sketch implementation

1. Initialize Count-Min Sketch with hash functions
2. For each event:
   - Hash item_id with multiple hash functions
   - Increment counters in sketch
   - Update time window counters
3. For count queries:
   - Hash item_id with same hash functions
   - Return minimum counter value
4. Handle time window expiration:
   - Reset expired counters
   - Maintain active time windows
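
A compact sketch of that structure (the width and depth defaults are illustrative; in practice they are derived from a target error ε and confidence δ, with w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉):

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: never undercounts, may overcount."""

    def __init__(self, width: int = 2048, depth: int = 5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item: str):
        # One independent hash per row, derived from a salted SHA-256.
        for i in range(self.depth):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in enumerate(self._hashes(item)):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions only ever inflate counters, so the row-wise
        # minimum is the tightest (and still safe) estimate.
        return min(self.table[row][col]
                   for row, col in enumerate(self._hashes(item)))
```

Because every update touches each row, collisions can only inflate counters; taking the minimum across rows bounds the overestimation.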

Top K Engine

Purpose: Generate Top K results efficiently.

Key Design Decisions:

  • Heap-based Algorithm: Use min-heap for Top K maintenance
  • Incremental Updates: Update Top K incrementally
  • Time Window Management: Handle multiple time windows
  • Result Caching: Cache Top K results for performance

Algorithm: Top K generation

1. Maintain min-heap of size K
2. For each count update:
   - Check if item is in heap
   - If in heap:
     - Update item count
     - Reheapify if needed
   - If not in heap:
     - If heap not full:
       - Add item to heap
     - Else if count > min heap count:
       - Replace min item
       - Reheapify
3. Return heap contents as Top K
4. Cache result for time window
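
A batch sketch of the heap logic above (the incremental variant additionally tracks heap membership so an in-place count update can trigger a re-heapify):

```python
import heapq

def top_k_from_counts(counts: dict, k: int) -> list:
    """Return the k most frequent items using a size-k min-heap."""
    heap = []                                  # (count, item); min at heap[0]
    for item, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, item))
        elif count > heap[0][0]:               # beats the current minimum
            heapq.heapreplace(heap, (count, item))
    # Highest counts first for presentation.
    return sorted(((item, c) for c, item in heap),
                  key=lambda pair: pair[1], reverse=True)
```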

Query Service

Purpose: Handle Top K queries with low latency.

Key Design Decisions:

  • Query Optimization: Optimize queries for different time windows
  • Result Caching: Cache frequent query results
  • Query Routing: Route queries to appropriate servers
  • Response Formatting: Format results efficiently

Algorithm: Top K query processing

1. Receive Top K query request
2. Parse query parameters:
   - K value
   - Time window
   - Additional filters
3. Check cache for result
4. If not cached:
   - Query count tracker
   - Generate Top K result
   - Cache result
5. Format response:
   - Include item counts
   - Include metadata
   - Include query metadata
6. Return result to client
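
An in-process sketch of that flow (engine.top_k is a hypothetical interface; at scale a shared cache such as Redis replaces the local dict, as discussed under Caching Strategy below):

```python
import time

CACHE_TTL_SECONDS = 5        # assumption: callers tolerate ~5s staleness
_cache: dict = {}            # key -> (cached_at, items)

def handle_top_k_query(k: int, time_window: str, engine) -> dict:
    """Parse parameters, check the cache, compute on miss, format."""
    key = f"{time_window}:{k}"                     # parsed query parameters
    cached = _cache.get(key)
    if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
        items = cached[1]                          # cache hit
    else:
        items = engine.top_k(k, time_window)       # hypothetical engine call
        _cache[key] = (time.time(), items)
    return {                                       # formatted response
        "time_window": time_window,
        "k": k,
        "items": items,
        "generated_at": time.time(),
    }
```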

Database Design

Events Table

| Field | Type | Description |
|---|---|---|
| event_id | VARCHAR(36) | Primary key |
| item_id | VARCHAR(255) | Event item |
| timestamp | TIMESTAMP | Event timestamp |
| metadata | JSON | Additional event data |
| created_at | TIMESTAMP | Creation timestamp |

Indexes:

  • idx_item_timestamp on (item_id, timestamp) - Item event history
  • idx_timestamp on (timestamp) - Time-based queries

Item Counts Table

| Field | Type | Description |
|---|---|---|
| item_id | VARCHAR(255) | Item identifier |
| time_window | VARCHAR(50) | Time window |
| count | BIGINT | Item count |
| last_updated | TIMESTAMP | Last update |

Indexes:

  • idx_count on (count) - Count-based queries
  • idx_last_updated on (last_updated) - Recent updates

Top K Results Table

| Field | Type | Description |
|---|---|---|
| result_id | VARCHAR(36) | Primary key |
| time_window | VARCHAR(50) | Time window |
| k_value | INT | K value |
| items | JSON | Top K items |
| counts | JSON | Item counts |
| created_at | TIMESTAMP | Result timestamp |

Indexes:

  • idx_time_window_k on (time_window, k_value) - Result queries
  • idx_created_at on (created_at) - Recent results

Scalability Considerations

Horizontal Scaling

  • Stream Processor: Scale horizontally with stream partitioning
  • Count Tracker: Use consistent hashing for data distribution (see the ring sketch after this list)
  • Top K Engine: Scale Top K generation with distributed computing
  • Database: Shard events by item_id
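
A minimal consistent-hash ring for routing item_ids to Count Tracker nodes (the virtual-node count is an arbitrary assumption; more vnodes give smoother balance):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map item_ids to nodes; adding a node moves only ~1/N of the keys."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each node appears vnodes times on the ring for better balance.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def node_for(self, item_id: str) -> str:
        # First ring position clockwise from the item's hash.
        idx = bisect.bisect(self._keys, self._hash(item_id)) % len(self._ring)
        return self._ring[idx][1]
```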

Caching Strategy

  • Redis: Cache Top K results and count data (see the sketch after this list)
  • Application Cache: Cache frequently accessed data
  • Database Cache: Cache count and result data
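
A cache-aside sketch against Redis (the host, key format, and 5-second TTL are assumptions; redis-py is the assumed client):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)    # assumed local instance

def cached_top_k(time_window: str, k: int, compute):
    """Serve Top K from Redis; recompute and store with a short TTL on miss."""
    key = f"topk:{time_window}:{k}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    result = compute(time_window, k)             # hypothetical compute callback
    r.setex(key, 5, json.dumps(result))          # 5s TTL: freshness vs load
    return result
```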

Performance Optimization

  • Connection Pooling: Efficient database connections
  • Batch Processing: Batch event processing for efficiency (see the sketch after this list)
  • Async Processing: Non-blocking event processing
  • Resource Monitoring: Monitor CPU, memory, and network usage
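
One way to realize the batching idea above (the queue, batch size, and flush timeout are illustrative assumptions; flush would apply the batch to the count tracker):

```python
import queue

events = queue.Queue()   # producers enqueue raw event dicts
BATCH_SIZE = 500
FLUSH_TIMEOUT = 0.1      # seconds; illustrative, tune per workload

def batch_worker(flush) -> None:
    """Group events into batches so per-event overhead is amortized."""
    batch = []
    while True:
        try:
            batch.append(events.get(timeout=FLUSH_TIMEOUT))
        except queue.Empty:
            pass                       # timed out: flush whatever we have
        if batch and (len(batch) >= BATCH_SIZE or events.empty()):
            flush(batch)
            batch = []
```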

Monitoring and Observability

Key Metrics

  • Event Processing Rate: Events processed per second
  • Query Latency: Average Top K query response time
  • Memory Usage: Memory usage for count tracking
  • System Health: CPU, memory, and disk usage

Alerting

  • High Latency: Alert when query time exceeds threshold
  • Memory Usage: Alert when memory usage exceeds limits
  • Processing Errors: Alert when event processing fails
  • System Errors: Alert on Top K generation failures

Trade-offs and Considerations

Accuracy vs. Memory

  • Choice: Use probabilistic data structures for memory efficiency
  • Reasoning: Balance between count accuracy and memory usage

Latency vs. Throughput

  • Choice: Optimize for throughput with batch processing
  • Reasoning: High event volume requires efficient processing

Consistency vs. Availability

  • Choice: Eventual consistency for count updates
  • Reasoning: Count accuracy can tolerate slight delays

Common Interview Questions

Q: How would you handle memory constraints?

A: Use probabilistic data structures. A Count-Min Sketch keeps approximate per-item counts in a fixed-size table regardless of how many distinct items appear, trading a small, tunable overestimation error for bounded memory; HyperLogLog can supplement it when you also need distinct-item cardinality.

Q: How do you ensure Top K accuracy?

A: Size the Count-Min Sketch for the accuracy you need: with width w and depth d, estimates overcount by at most εN (ε ≈ e/w) with probability at least 1 − δ (δ ≈ e^(−d)). Keep an exact min-heap of candidate heavy hitters alongside the sketch, and periodically validate hot items against exact counts in the database.

Q: How would you scale this system globally?

A: Deploy regional stream processors that each maintain local sketches, replicate events or aggregates asynchronously, and merge regional results for global queries. Count-Min Sketches with identical dimensions and hash functions merge by element-wise addition, which makes cross-region aggregation cheap.

Q: How do you handle time window management?

A: Keep counts in small fixed-size buckets (e.g., one per minute) and answer a window query by merging the buckets the window covers; buckets older than the largest supported window are dropped automatically, as in the sketch below.
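
A minimal sketch of that bucketing scheme, assuming events arrive roughly in time order (per-minute granularity and one-day retention are illustrative):

```python
import time
from collections import defaultdict, deque
from typing import Optional

class BucketedCounter:
    """Per-minute buckets; a window query merges the buckets it spans."""

    def __init__(self, max_minutes: int = 1440):    # retain one day of buckets
        self.max_minutes = max_minutes
        self.buckets = deque()                      # (minute, {item: count})

    def add(self, item: str, ts: Optional[float] = None) -> None:
        minute = int((ts if ts is not None else time.time()) // 60)
        if not self.buckets or self.buckets[-1][0] != minute:
            self.buckets.append((minute, defaultdict(int)))
            while len(self.buckets) > self.max_minutes:
                self.buckets.popleft()              # automated cleanup
        self.buckets[-1][1][item] += 1

    def window_counts(self, window_minutes: int) -> dict:
        """Merge all buckets inside the requested window."""
        cutoff = int(time.time() // 60) - window_minutes
        merged = defaultdict(int)
        for minute, bucket in self.buckets:
            if minute > cutoff:
                for item, count in bucket.items():
                    merged[item] += count
        return dict(merged)
```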


Key Takeaways

  1. Probabilistic Data Structures: Essential for memory-efficient large-scale counting
  2. Stream Processing: Real-time event processing enables immediate Top K updates
  3. Query Optimization: Caching and efficient algorithms provide fast Top K queries
  4. Scalability: Horizontal scaling and partitioning are crucial for handling large-scale data
  5. Monitoring: Comprehensive monitoring ensures system reliability and performance