Job Scheduler

Design a distributed job scheduler with fault tolerance and scalability. Covers job queuing, scheduling algorithms, resource management, and handling millions of jobs with different priorities and dependencies.

What Is a Job Scheduler?

A Job Scheduler is a distributed system that manages and executes tasks across multiple machines. It’s similar to systems like Apache Airflow, Kubernetes CronJobs, or AWS Batch. The service provides job queuing, scheduling, resource management, and fault tolerance.

What sets a job scheduler apart is distributed job execution with fault tolerance and resource management. By understanding this design, you can tackle interview questions for similar distributed systems, since the core challenges (job queuing, scheduling algorithms, resource management, and fault tolerance) remain the same.


Functional Requirements

Core (Interview Focused)

  • Job Submission: Submit jobs with different priorities and requirements.
  • Job Scheduling: Schedule jobs based on priority, dependencies, and resources.
  • Job Execution: Execute jobs on available workers.
  • Job Monitoring: Track job status and progress.

Out of Scope

  • User authentication and authorization
  • Job result storage and retrieval
  • Job templates and workflows
  • Real-time job streaming
  • Mobile-app-specific features

Non-Functional Requirements

Core (Interview Focused)

  • High availability: 99.9% uptime for job scheduling.
  • Scalability: Handle millions of jobs and thousands of workers.
  • Fault tolerance: Handle worker failures and job retries.
  • Resource efficiency: Optimize resource utilization across workers.

Out of Scope

  • Data retention policies
  • Compliance and privacy regulations

💡 Interview Tip: Focus on high availability, scalability, and fault tolerance. Interviewers care most about job scheduling, resource management, and failure handling.


Core Entities

| Entity | Key Attributes | Notes |
|---|---|---|
| Job | job_id, name, priority, status, created_at, scheduled_at | Indexed by priority for scheduling |
| Worker | worker_id, status, capabilities, last_heartbeat | Track worker availability |
| JobQueue | queue_id, name, priority, max_workers | Manage job queues |
| JobExecution | execution_id, job_id, worker_id, start_time, end_time | Track job executions |
| Resource | resource_id, worker_id, type, capacity, usage | Track resource availability |

💡 Interview Tip: Focus on Job, Worker, and JobExecution as they drive scheduling, resource management, and fault tolerance.
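As a rough sketch of the three entities the tip highlights, the records might be modeled as dataclasses. Field names follow the tables above; the default statuses and the `dependencies` field (taken from the Jobs schema later in this doc) are illustrative assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Job:
    job_id: str
    name: str
    priority: int                      # assumed convention: lower = more urgent
    status: str = "pending"            # assumed initial state
    created_at: datetime = field(default_factory=datetime.utcnow)
    scheduled_at: Optional[datetime] = None
    dependencies: list = field(default_factory=list)   # job_ids this job waits on

@dataclass
class Worker:
    worker_id: str
    status: str = "idle"
    capabilities: set = field(default_factory=set)
    last_heartbeat: Optional[datetime] = None

@dataclass
class JobExecution:
    execution_id: str
    job_id: str
    worker_id: str
    status: str = "running"
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
```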


Core APIs

Job Management

  • POST /jobs { name, priority, requirements, dependencies } – Submit a new job
  • GET /jobs/{job_id} – Get job details
  • PUT /jobs/{job_id}/cancel – Cancel a job
  • GET /jobs?status=&priority=&limit= – List jobs with filters
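A client-side helper can validate the `POST /jobs` body before submission. This is a hedged sketch: the field names match the API above, but the validation rules (non-empty name, non-negative integer priority) are assumptions, not part of the spec:

```python
def build_job_request(name, priority=5, requirements=None, dependencies=None):
    """Build and validate the POST /jobs body (validation rules assumed)."""
    if not name:
        raise ValueError("job name is required")
    if not isinstance(priority, int) or priority < 0:
        raise ValueError("priority must be a non-negative integer")
    return {
        "name": name,
        "priority": priority,
        "requirements": requirements or {},
        "dependencies": dependencies or [],
    }
```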

Worker Management

  • POST /workers/register { capabilities, resources } – Register a new worker
  • GET /workers/{worker_id} – Get worker details
  • POST /workers/{worker_id}/heartbeat – Send worker heartbeat
  • GET /workers?status=&capabilities= – List workers with filters

Job Execution

  • POST /jobs/{job_id}/execute { worker_id } – Execute job on worker
  • GET /executions/{execution_id} – Get execution details
  • POST /executions/{execution_id}/complete { result, status } – Complete job execution
  • GET /executions?job_id=&worker_id=&status= – List executions with filters

High-Level Design

graph TB
    subgraph "Clients"
        WebUI[Admin Dashboard]
        API[API Clients]
        CLI[CLI Tools]
    end
    
    subgraph "API Layer"
        LB[Load Balancer]
        APIGateway[API Gateway<br/>+ Auth + Rate Limit]
    end
    
    subgraph "Core Services"
        JobService[Job<br/>Service]
        SchedulerService[Scheduler<br/>Service]
        ExecutionService[Execution<br/>Service]
        WorkerService[Worker<br/>Service]
        QueueService[Queue<br/>Service]
    end
    
    subgraph "Scheduling Engine"
        CronEngine[Cron<br/>Engine]
        TriggerEngine[Trigger<br/>Engine]
        DependencyResolver[Dependency<br/>Resolver]
        TimeZoneHandler[Timezone<br/>Handler]
    end
    
    subgraph "Job Processing"
        JobDispatcher[Job<br/>Dispatcher]
        ResourceManager[Resource<br/>Manager]
        RetryEngine[Retry<br/>Engine]
        HealthChecker[Health<br/>Checker]
    end
    
    subgraph "Worker Nodes"
        WorkerNode1[Worker Node 1<br/>Job Executor]
        WorkerNode2[Worker Node 2<br/>Job Executor]
        WorkerNodeN[Worker Node N<br/>Job Executor]
    end
    
    subgraph "Message Queue"
        JobQueue[Job Queue<br/>Redis/RabbitMQ]
        StatusQueue[Status Queue<br/>Kafka]
        NotificationQueue[Notification Queue<br/>SNS]
    end
    
    subgraph "Coordination"
        Coordinator[Cluster Coordinator<br/>etcd/Consul]
        LeaderElection[Leader Election<br/>Service]
        ServiceDiscovery[Service<br/>Discovery]
    end
    
    subgraph "Storage"
        JobDB[(Job Database<br/>PostgreSQL)]
        ExecutionDB[(Execution Logs<br/>PostgreSQL)]
        MetricsDB[(Metrics Store<br/>InfluxDB)]
        LogStorage[(Log Storage<br/>S3/ELK)]
    end
    
    subgraph "Monitoring"
        MetricsCollector[Metrics<br/>Collector]
        AlertManager[Alert<br/>Manager]
        Dashboard[Monitoring<br/>Dashboard]
    end
    
    %% Client connections
    WebUI --> LB
    API --> LB
    CLI --> LB
    LB --> APIGateway
    
    %% API to services
    APIGateway --> JobService
    APIGateway --> ExecutionService
    APIGateway --> WorkerService
    APIGateway --> QueueService
    
    %% Core service interactions
    JobService --> SchedulerService
    SchedulerService --> CronEngine
    SchedulerService --> TriggerEngine
    SchedulerService --> DependencyResolver
    
    %% Job execution flow
    SchedulerService --> JobDispatcher
    JobDispatcher --> JobQueue
    JobQueue --> WorkerNode1
    JobQueue --> WorkerNode2
    JobQueue --> WorkerNodeN
    
    %% Worker management
    WorkerService --> ResourceManager
    WorkerService --> HealthChecker
    WorkerNode1 --> StatusQueue
    WorkerNode2 --> StatusQueue
    WorkerNodeN --> StatusQueue
    
    %% Status and monitoring
    StatusQueue --> ExecutionService
    ExecutionService --> RetryEngine
    StatusQueue --> MetricsCollector
    MetricsCollector --> AlertManager
    
    %% Coordination
    SchedulerService --> Coordinator
    WorkerService --> ServiceDiscovery
    LeaderElection --> SchedulerService
    
    %% Data persistence
    JobService --> JobDB
    ExecutionService --> ExecutionDB
    MetricsCollector --> MetricsDB
    WorkerNode1 --> LogStorage
    WorkerNode2 --> LogStorage
    WorkerNodeN --> LogStorage
    
    %% Notifications
    AlertManager --> NotificationQueue
    RetryEngine --> NotificationQueue
    
    %% Monitoring dashboard
    Dashboard --> MetricsDB
    Dashboard --> ExecutionDB
    
    %% Styling
    classDef client fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef service fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef data fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    classDef queue fill:#fff8e1,stroke:#f57c00,stroke-width:2px
    classDef worker fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px
    classDef coordination fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef processing fill:#e0f2f1,stroke:#00695c,stroke-width:2px
    
    class WebUI,API,CLI client
    class LB,APIGateway api
    class JobService,SchedulerService,ExecutionService,WorkerService,QueueService service
    class JobDB,ExecutionDB,MetricsDB,LogStorage data
    class JobQueue,StatusQueue,NotificationQueue queue
    class WorkerNode1,WorkerNode2,WorkerNodeN worker
    class Coordinator,LeaderElection,ServiceDiscovery coordination
    class CronEngine,TriggerEngine,DependencyResolver,TimeZoneHandler,JobDispatcher,ResourceManager,RetryEngine,HealthChecker,MetricsCollector,AlertManager,Dashboard processing

Key Components

  • Job Scheduler: Manage job queuing and scheduling
  • Worker Manager: Manage worker registration and health
  • Resource Manager: Track and allocate resources
  • Execution Engine: Execute jobs on workers
  • Database: Persistent storage for jobs and executions
  • Message Queue: Decouple job submission from execution

Mapping Core Functional Requirements to Components

| Functional Requirement | Responsible Components | Key Considerations |
|---|---|---|
| Job Submission | Job Scheduler, Message Queue | High throughput, job validation |
| Job Scheduling | Job Scheduler, Resource Manager | Priority handling, resource allocation |
| Job Execution | Execution Engine, Worker Manager | Fault tolerance, resource management |
| Job Monitoring | Job Scheduler, Database | Real-time status, progress tracking |

Detailed Design

Job Scheduler

Purpose: Manage job queuing, scheduling, and resource allocation.

Key Design Decisions:

  • Priority Queues: Use priority queues for job scheduling
  • Dependency Resolution: Handle job dependencies and execution order
  • Resource Matching: Match jobs with available resources
  • Load Balancing: Distribute jobs across available workers

Algorithm: Job scheduling with priority

1. Receive job submission request
2. Validate job requirements and dependencies
3. Add job to priority queue based on priority
4. For each job in queue:
   - Check if dependencies are satisfied
   - Find available worker with required resources
   - If worker found:
     - Assign job to worker
     - Update job status to "running"
     - Remove from queue
5. Handle job timeouts and retries
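Steps 3 and 4 above can be sketched with Python's `heapq`. The convention that a lower number means higher priority, the FIFO tie-break, and the `find_worker` callback are assumptions for illustration:

```python
import heapq
import itertools

class PriorityScheduler:
    """Sketch of steps 3-4: priority queue plus dependency and worker checks."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # FIFO tie-break within a priority
        self.completed = set()              # job_ids whose execution finished

    def submit(self, job_id, priority, dependencies=()):
        # Lower number = higher priority (assumed convention).
        heapq.heappush(self._heap,
                       (priority, next(self._counter), job_id, tuple(dependencies)))

    def dispatch(self, find_worker):
        """Pop runnable jobs; `find_worker(job_id)` returns a worker_id or None."""
        assigned, deferred = [], []
        while self._heap:
            priority, seq, job_id, deps = heapq.heappop(self._heap)
            if not all(d in self.completed for d in deps):
                deferred.append((priority, seq, job_id, deps))   # deps not ready
                continue
            worker = find_worker(job_id)
            if worker is None:
                deferred.append((priority, seq, job_id, deps))   # no capacity
                continue
            assigned.append((job_id, worker))                    # job -> "running"
        for item in deferred:               # requeue jobs we could not place
            heapq.heappush(self._heap, item)
        return assigned
```

A job with unsatisfied dependencies or no matching worker simply goes back on the heap; step 5 (timeouts and retries) would run as a separate sweep.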

Worker Manager

Purpose: Manage worker registration, health monitoring, and capability tracking.

Key Design Decisions:

  • Heartbeat Mechanism: Monitor worker health with regular heartbeats
  • Capability Tracking: Track worker capabilities and resources
  • Failure Detection: Detect worker failures and handle gracefully
  • Resource Management: Track worker resource usage

Algorithm: Worker health monitoring

1. Worker sends heartbeat with status and resource usage
2. Update worker last_seen timestamp
3. Check worker health:
   - If no heartbeat arrives within the threshold window:
     - Mark the worker as "unhealthy"
     - Reassign its running jobs to other workers
4. Update worker capabilities and resources
5. Notify job scheduler of worker status changes
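A minimal sketch of steps 1-3, assuming a 30-second timeout (a tuning value, not prescribed by the design) and in-memory tracking of last-seen timestamps:

```python
import time

HEARTBEAT_TIMEOUT = 30.0   # seconds; assumed tuning value

class WorkerRegistry:
    """Track heartbeats and flag workers whose heartbeat has gone stale."""

    def __init__(self):
        self._last_seen = {}   # worker_id -> unix timestamp of last heartbeat

    def heartbeat(self, worker_id, now=None):
        # Step 2: update the worker's last-seen timestamp.
        self._last_seen[worker_id] = now if now is not None else time.time()

    def unhealthy_workers(self, now=None):
        # Step 3: any worker silent longer than the threshold is unhealthy;
        # the caller would then reassign its running jobs.
        now = now if now is not None else time.time()
        return [w for w, seen in self._last_seen.items()
                if now - seen > HEARTBEAT_TIMEOUT]
```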

Resource Manager

Purpose: Track and allocate resources across workers.

Key Design Decisions:

  • Resource Tracking: Track CPU, memory, and storage across workers
  • Resource Allocation: Allocate resources based on job requirements
  • Resource Optimization: Optimize resource utilization
  • Resource Limits: Enforce resource limits per worker

Algorithm: Resource allocation

1. Receive job with resource requirements
2. Find workers with available resources:
   - Check CPU availability
   - Check memory availability
   - Check storage availability
3. Select worker with best resource match
4. Allocate resources to job
5. Update worker resource usage
6. Monitor resource usage during execution
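Steps 2-3 amount to a best-fit search. This sketch assumes resources are expressed as simple numeric capacities (e.g. CPU cores, memory in GB); "best match" is interpreted as least leftover capacity, which is one reasonable choice among several (first-fit and worst-fit are also common):

```python
def pick_worker(workers, required):
    """Best-fit allocation: among workers that satisfy every requirement,
    pick the one with the least leftover capacity.

    workers:  dict mapping worker_id -> free resources, e.g. {"cpu": 4, "mem_gb": 8}
    required: dict of the job's requirements in the same units
    """
    best_id, best_slack = None, None
    for worker_id, free in workers.items():
        # Step 2: the worker must have enough of every required resource.
        if all(free.get(res, 0) >= amt for res, amt in required.items()):
            # Step 3: prefer the tightest fit to reduce fragmentation.
            slack = sum(free[res] - amt for res, amt in required.items())
            if best_slack is None or slack < best_slack:
                best_id, best_slack = worker_id, slack
    return best_id   # None means no worker can run the job right now
```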

Execution Engine

Purpose: Execute jobs on workers and handle execution lifecycle.

Key Design Decisions:

  • Job Execution: Execute jobs on assigned workers
  • Progress Tracking: Track job execution progress
  • Failure Handling: Handle job failures and retries
  • Result Collection: Collect and store job results

Algorithm: Job execution

1. Receive job execution request
2. Validate worker availability and resources
3. Start job execution on worker
4. Monitor execution progress:
   - Track execution time
   - Monitor resource usage
   - Handle worker failures
5. Collect job results
6. Update job status and execution record
7. Handle job retries if needed
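Step 7 is commonly implemented as exponential backoff with jitter. A sketch, with the retry count and base delay as assumed tuning values and a pluggable `sleep` so the retry logic stays testable:

```python
import random

MAX_RETRIES = 3    # assumed tuning values
BASE_DELAY = 2.0   # seconds

def run_with_retries(job_id, run_once, sleep=lambda s: None):
    """Run a job, retrying failures with exponential backoff plus full jitter.

    `run_once(job_id)` performs one execution attempt and raises on failure.
    Returns a result record mirroring the JobExecution status field.
    """
    for attempt in range(MAX_RETRIES + 1):
        try:
            result = run_once(job_id)
            return {"job_id": job_id, "status": "succeeded", "result": result}
        except Exception:
            if attempt == MAX_RETRIES:
                return {"job_id": job_id, "status": "failed", "result": None}
            # Full jitter spreads retries out so failed jobs do not stampede.
            sleep(random.uniform(0, BASE_DELAY * (2 ** attempt)))
```

In the real system the retry would also re-run worker selection, since the original worker may be the reason the attempt failed.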

Database Design

Jobs Table

| Field | Type | Description |
|---|---|---|
| job_id | VARCHAR(36) | Primary key |
| name | VARCHAR(255) | Job name |
| priority | INT | Job priority |
| status | VARCHAR(50) | Job status |
| requirements | JSON | Job requirements |
| dependencies | JSON | Job dependencies |
| created_at | TIMESTAMP | Creation timestamp |
| scheduled_at | TIMESTAMP | Scheduled execution time |

Indexes:

  • idx_priority_status on (priority, status) - Job scheduling
  • idx_scheduled_at on (scheduled_at) - Time-based scheduling

Workers Table

| Field | Type | Description |
|---|---|---|
| worker_id | VARCHAR(36) | Primary key |
| status | VARCHAR(50) | Worker status |
| capabilities | JSON | Worker capabilities |
| resources | JSON | Available resources |
| last_heartbeat | TIMESTAMP | Last heartbeat |

Indexes:

  • idx_status on (status) - Worker availability
  • idx_last_heartbeat on (last_heartbeat) - Health monitoring

Job Executions Table

| Field | Type | Description |
|---|---|---|
| execution_id | VARCHAR(36) | Primary key |
| job_id | VARCHAR(36) | Associated job |
| worker_id | VARCHAR(36) | Executing worker |
| status | VARCHAR(50) | Execution status |
| start_time | TIMESTAMP | Execution start |
| end_time | TIMESTAMP | Execution end |
| result | JSON | Execution result |

Indexes:

  • idx_job_id on (job_id) - Job execution history
  • idx_worker_id on (worker_id) - Worker execution history
  • idx_status on (status) - Execution status queries

Scalability Considerations

Horizontal Scaling

  • Job Scheduler: Scale horizontally with load balancers
  • Worker Manager: Use consistent hashing for worker partitioning
  • Resource Manager: Scale resource tracking with distributed systems
  • Execution Engine: Scale job execution with multiple workers

Caching Strategy

  • Redis: Cache job queues and worker status
  • Application Cache: Cache frequently accessed data
  • Database Cache: Cache job and execution data

Performance Optimization

  • Connection Pooling: Efficient database connections
  • Batch Processing: Batch job operations for efficiency
  • Async Processing: Non-blocking job processing
  • Resource Monitoring: Monitor CPU, memory, and network usage

Monitoring and Observability

Key Metrics

  • Job Throughput: Jobs processed per second
  • Execution Latency: Average job execution time
  • Worker Utilization: Percentage of workers actively executing jobs
  • System Health: CPU, memory, and disk usage

Alerting

  • High Latency: Alert when job execution time exceeds threshold
  • Worker Failures: Alert when worker failure rate increases
  • Queue Backlog: Alert when job queue grows too large
  • System Errors: Alert on job execution failures

Trade-offs and Considerations

Consistency vs. Availability

  • Choice: Eventual consistency for job status, strong consistency for resource allocation
  • Reasoning: Job status can tolerate slight delays, while resource allocation needs immediate accuracy.

Latency vs. Throughput

  • Choice: Optimize for throughput with batch processing
  • Reasoning: Job scheduling needs to handle high volumes efficiently

Resource Efficiency vs. Job Priority

  • Choice: Balance resource utilization with job priority
  • Reasoning: Optimize both resource usage and job execution order

Common Interview Questions

Q: How would you handle worker failures?

A: Use heartbeat monitoring, job reassignment, and retry mechanisms to handle worker failures gracefully.

Q: How do you ensure job execution order?

A: Use priority queues, dependency resolution, and resource allocation to ensure proper job execution order.
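Dependency resolution here is a topological sort. A sketch using Kahn's algorithm, assuming the dependency graph maps each job to its prerequisites and that every prerequisite also appears as a key:

```python
from collections import deque

def execution_order(deps):
    """Kahn's algorithm. `deps` maps job_id -> list of prerequisite job_ids.

    Returns a valid execution order, or raises ValueError on a cycle
    (a cycle means the submitted dependencies can never be satisfied).
    """
    indegree = {job: len(prereqs) for job, prereqs in deps.items()}
    dependents = {job: [] for job in deps}
    for job, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(job)     # reverse edge: p unblocks job

    ready = deque(sorted(j for j, n in indegree.items() if n == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for nxt in dependents[job]:       # completing `job` may unblock others
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order
```

In practice the scheduler interleaves this with priorities: among all currently unblocked jobs, the priority queue decides which runs first.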

Q: How would you scale this system globally?

A: Deploy regional job schedulers, use geo-distributed databases, and implement data replication strategies.

Q: How do you handle resource contention?

A: Use resource allocation algorithms, priority-based scheduling, and resource limits to handle resource contention.


Key Takeaways

  1. Job Scheduling: Priority queues and dependency resolution are essential for efficient job scheduling
  2. Resource Management: Resource tracking and allocation are crucial for optimal system performance
  3. Fault Tolerance: Heartbeat monitoring and job retries ensure system reliability
  4. Scalability: Horizontal scaling and partitioning are crucial for handling large-scale job processing
  5. Monitoring: Comprehensive monitoring ensures system reliability and performance