Job Scheduler

Design a distributed job scheduler with fault tolerance and scalability. Covers job queuing, scheduling algorithms, resource management, and handling millions of jobs with different priorities and dependencies.

What Is a Job Scheduler?

A Job Scheduler is a distributed system that manages and executes tasks across multiple machines. It’s similar to systems like Apache Airflow, Kubernetes CronJobs, or AWS Batch. The service provides job queuing, scheduling, resource management, and fault tolerance.

What sets a job scheduler apart is distributed job execution with fault tolerance and resource management. By understanding this design, you can tackle interview questions for similar distributed systems, since the core challenges (job queuing, scheduling algorithms, resource management, and fault tolerance) remain the same.


Functional Requirements

Core (Interview Focused)

  • Job Submission: Submit jobs with different priorities and requirements.
  • Job Scheduling: Schedule jobs based on priority, dependencies, and resources.
  • Job Execution: Execute jobs on available workers.
  • Job Monitoring: Track job status and progress.

Out of Scope

  • User authentication and authorization
  • Job result storage and retrieval
  • Job templates and workflows
  • Real-time job streaming
  • Mobile-app-specific features

Non-Functional Requirements

Core (Interview Focused)

  • High availability: 99.9% uptime for job scheduling.
  • Scalability: Handle millions of jobs and thousands of workers.
  • Fault tolerance: Handle worker failures and job retries.
  • Resource efficiency: Optimize resource utilization across workers.

Out of Scope

  • Data retention policies
  • Compliance and privacy regulations

💡 Interview Tip: Focus on high availability, scalability, and fault tolerance. Interviewers care most about job scheduling, resource management, and failure handling.


Core Entities

| Entity | Key Attributes | Notes |
|---|---|---|
| Job | job_id, name, priority, status, created_at, scheduled_at | Indexed by priority for scheduling |
| Worker | worker_id, status, capabilities, last_heartbeat | Track worker availability |
| JobQueue | queue_id, name, priority, max_workers | Manage job queues |
| JobExecution | execution_id, job_id, worker_id, start_time, end_time | Track job executions |
| Resource | resource_id, worker_id, type, capacity, usage | Track resource availability |

💡 Interview Tip: Focus on Job, Worker, and JobExecution as they drive scheduling, resource management, and fault tolerance.
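As a rough sketch of the three entities the tip highlights, the records might be modeled as dataclasses. Field names follow the tables above; the default statuses and the `dependencies` field (taken from the Jobs schema later in this doc) are illustrative assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Job:
    job_id: str
    name: str
    priority: int                      # assumed convention: lower = more urgent
    status: str = "pending"            # assumed initial state
    created_at: datetime = field(default_factory=datetime.utcnow)
    scheduled_at: Optional[datetime] = None
    dependencies: list = field(default_factory=list)   # job_ids this job waits on

@dataclass
class Worker:
    worker_id: str
    status: str = "idle"
    capabilities: set = field(default_factory=set)
    last_heartbeat: Optional[datetime] = None

@dataclass
class JobExecution:
    execution_id: str
    job_id: str
    worker_id: str
    status: str = "running"
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
```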


Core APIs

Job Management

  • POST /jobs { name, priority, requirements, dependencies } – Submit a new job
  • GET /jobs/{job_id} – Get job details
  • PUT /jobs/{job_id}/cancel – Cancel a job
  • GET /jobs?status=&priority=&limit= – List jobs with filters
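A client-side helper can validate the `POST /jobs` body before submission. This is a hedged sketch: the field names match the API above, but the validation rules (non-empty name, non-negative integer priority) are assumptions, not part of the spec:

```python
def build_job_request(name, priority=5, requirements=None, dependencies=None):
    """Build and validate the POST /jobs body (validation rules assumed)."""
    if not name:
        raise ValueError("job name is required")
    if not isinstance(priority, int) or priority < 0:
        raise ValueError("priority must be a non-negative integer")
    return {
        "name": name,
        "priority": priority,
        "requirements": requirements or {},
        "dependencies": dependencies or [],
    }
```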

Worker Management

  • POST /workers/register { capabilities, resources } – Register a new worker
  • GET /workers/{worker_id} – Get worker details
  • POST /workers/{worker_id}/heartbeat – Send worker heartbeat
  • GET /workers?status=&capabilities= – List workers with filters

Job Execution

  • POST /jobs/{job_id}/execute { worker_id } – Execute job on worker
  • GET /executions/{execution_id} – Get execution details
  • POST /executions/{execution_id}/complete { result, status } – Complete job execution
  • GET /executions?job_id=&worker_id=&status= – List executions with filters

High-Level Design

graph TB
    subgraph "Clients"
        WebUI[Admin Dashboard]
        API[API Clients]
        CLI[CLI Tools]
    end
    
    subgraph "API Layer"
        LB[Load Balancer]
        APIGateway[API Gateway<br/>+ Auth + Rate Limit]
    end
    
    subgraph "Core Services"
        JobService[Job<br/>Service]
        SchedulerService[Scheduler<br/>Service]
        ExecutionService[Execution<br/>Service]
        WorkerService[Worker<br/>Service]
        QueueService[Queue<br/>Service]
    end
    
    subgraph "Scheduling Engine"
        CronEngine[Cron<br/>Engine]
        TriggerEngine[Trigger<br/>Engine]
        DependencyResolver[Dependency<br/>Resolver]
        TimeZoneHandler[Timezone<br/>Handler]
    end
    
    subgraph "Job Processing"
        JobDispatcher[Job<br/>Dispatcher]
        ResourceManager[Resource<br/>Manager]
        RetryEngine[Retry<br/>Engine]
        HealthChecker[Health<br/>Checker]
    end
    
    subgraph "Worker Nodes"
        WorkerNode1[Worker Node 1<br/>Job Executor]
        WorkerNode2[Worker Node 2<br/>Job Executor]
        WorkerNodeN[Worker Node N<br/>Job Executor]
    end
    
    subgraph "Message Queue"
        JobQueue[Job Queue<br/>Redis/RabbitMQ]
        StatusQueue[Status Queue<br/>Kafka]
        NotificationQueue[Notification Queue<br/>SNS]
    end
    
    subgraph "Coordination"
        Coordinator[Cluster Coordinator<br/>etcd/Consul]
        LeaderElection[Leader Election<br/>Service]
        ServiceDiscovery[Service<br/>Discovery]
    end
    
    subgraph "Storage"
        JobDB[(Job Database<br/>PostgreSQL)]
        ExecutionDB[(Execution Logs<br/>PostgreSQL)]
        MetricsDB[(Metrics Store<br/>InfluxDB)]
        LogStorage[(Log Storage<br/>S3/ELK)]
    end
    
    subgraph "Monitoring"
        MetricsCollector[Metrics<br/>Collector]
        AlertManager[Alert<br/>Manager]
        Dashboard[Monitoring<br/>Dashboard]
    end
    
    %% Client connections
    WebUI --> LB
    API --> LB
    CLI --> LB
    LB --> APIGateway
    
    %% API to services
    APIGateway --> JobService
    APIGateway --> ExecutionService
    APIGateway --> WorkerService
    APIGateway --> QueueService
    
    %% Core service interactions
    JobService --> SchedulerService
    SchedulerService --> CronEngine
    SchedulerService --> TriggerEngine
    SchedulerService --> DependencyResolver
    
    %% Job execution flow
    SchedulerService --> JobDispatcher
    JobDispatcher --> JobQueue
    JobQueue --> WorkerNode1
    JobQueue --> WorkerNode2
    JobQueue --> WorkerNodeN
    
    %% Worker management
    WorkerService --> ResourceManager
    WorkerService --> HealthChecker
    WorkerNode1 --> StatusQueue
    WorkerNode2 --> StatusQueue
    WorkerNodeN --> StatusQueue
    
    %% Status and monitoring
    StatusQueue --> ExecutionService
    ExecutionService --> RetryEngine
    StatusQueue --> MetricsCollector
    MetricsCollector --> AlertManager
    
    %% Coordination
    SchedulerService --> Coordinator
    WorkerService --> ServiceDiscovery
    LeaderElection --> SchedulerService
    
    %% Data persistence
    JobService --> JobDB
    ExecutionService --> ExecutionDB
    MetricsCollector --> MetricsDB
    WorkerNode1 --> LogStorage
    WorkerNode2 --> LogStorage
    WorkerNodeN --> LogStorage
    
    %% Notifications
    AlertManager --> NotificationQueue
    RetryEngine --> NotificationQueue
    
    %% Monitoring dashboard
    Dashboard --> MetricsDB
    Dashboard --> ExecutionDB
    
    %% Styling
    classDef client fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef service fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef data fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    classDef queue fill:#fff8e1,stroke:#f57c00,stroke-width:2px
    classDef worker fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px
    classDef coordination fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef processing fill:#e0f2f1,stroke:#00695c,stroke-width:2px
    
    class WebUI,API,CLI client
    class LB,APIGateway api
    class JobService,SchedulerService,ExecutionService,WorkerService,QueueService service
    class JobDB,ExecutionDB,MetricsDB,LogStorage data
    class JobQueue,StatusQueue,NotificationQueue queue
    class WorkerNode1,WorkerNode2,WorkerNodeN worker
    class Coordinator,LeaderElection,ServiceDiscovery coordination
    class CronEngine,TriggerEngine,DependencyResolver,TimeZoneHandler,JobDispatcher,ResourceManager,RetryEngine,HealthChecker,MetricsCollector,AlertManager,Dashboard processing

Key Components

  • Job Scheduler: Manage job queuing and scheduling
  • Worker Manager: Manage worker registration and health
  • Resource Manager: Track and allocate resources
  • Execution Engine: Execute jobs on workers
  • Database: Persistent storage for jobs and executions
  • Message Queue: Decouple job submission from execution

Mapping Core Functional Requirements to Components

| Functional Requirement | Responsible Components | Key Considerations |
|---|---|---|
| Job Submission | Job Scheduler, Message Queue | High throughput, job validation |
| Job Scheduling | Job Scheduler, Resource Manager | Priority handling, resource allocation |
| Job Execution | Execution Engine, Worker Manager | Fault tolerance, resource management |
| Job Monitoring | Job Scheduler, Database | Real-time status, progress tracking |

Detailed Design

Job Scheduler

Purpose: Manage job queuing, scheduling, and resource allocation.

Key Design Decisions:

  • Priority Queues: Use priority queues for job scheduling
  • Dependency Resolution: Handle job dependencies and execution order
  • Resource Matching: Match jobs with available resources
  • Load Balancing: Distribute jobs across available workers

Algorithm: Job scheduling with priority

1. Receive job submission request
2. Validate job requirements and dependencies
3. Add job to priority queue based on priority
4. For each job in queue:
   - Check if dependencies are satisfied
   - Find available worker with required resources
   - If worker found:
     - Assign job to worker
     - Update job status to "running"
     - Remove from queue
5. Handle job timeouts and retries
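Steps 3 and 4 above can be sketched with Python's `heapq`. The convention that a lower number means higher priority, the FIFO tie-break, and the `find_worker` callback are assumptions for illustration:

```python
import heapq
import itertools

class PriorityScheduler:
    """Sketch of steps 3-4: priority queue plus dependency and worker checks."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # FIFO tie-break within a priority
        self.completed = set()              # job_ids whose execution finished

    def submit(self, job_id, priority, dependencies=()):
        # Lower number = higher priority (assumed convention).
        heapq.heappush(self._heap,
                       (priority, next(self._counter), job_id, tuple(dependencies)))

    def dispatch(self, find_worker):
        """Pop runnable jobs; `find_worker(job_id)` returns a worker_id or None."""
        assigned, deferred = [], []
        while self._heap:
            priority, seq, job_id, deps = heapq.heappop(self._heap)
            if not all(d in self.completed for d in deps):
                deferred.append((priority, seq, job_id, deps))   # deps not ready
                continue
            worker = find_worker(job_id)
            if worker is None:
                deferred.append((priority, seq, job_id, deps))   # no capacity
                continue
            assigned.append((job_id, worker))                    # job -> "running"
        for item in deferred:               # requeue jobs we could not place
            heapq.heappush(self._heap, item)
        return assigned
```

A job with unsatisfied dependencies or no matching worker simply goes back on the heap; step 5 (timeouts and retries) would run as a separate sweep.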

Worker Manager

Purpose: Manage worker registration, health monitoring, and capability tracking.

Key Design Decisions:

  • Heartbeat Mechanism: Monitor worker health with regular heartbeats
  • Capability Tracking: Track worker capabilities and resources
  • Failure Detection: Detect worker failures and handle gracefully
  • Resource Management: Track worker resource usage

Algorithm: Worker health monitoring

1. Worker sends heartbeat with status and resource usage
2. Update worker last_seen timestamp
3. Check worker health:
   - If no heartbeat arrives within the threshold window:
     - Mark the worker as "unhealthy"
     - Reassign its running jobs to other workers
4. Update worker capabilities and resources
5. Notify job scheduler of worker status changes
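A minimal sketch of steps 1-3, assuming a 30-second timeout (a tuning value, not prescribed by the design) and in-memory tracking of last-seen timestamps:

```python
import time

HEARTBEAT_TIMEOUT = 30.0   # seconds; assumed tuning value

class WorkerRegistry:
    """Track heartbeats and flag workers whose heartbeat has gone stale."""

    def __init__(self):
        self._last_seen = {}   # worker_id -> unix timestamp of last heartbeat

    def heartbeat(self, worker_id, now=None):
        # Step 2: update the worker's last-seen timestamp.
        self._last_seen[worker_id] = now if now is not None else time.time()

    def unhealthy_workers(self, now=None):
        # Step 3: any worker silent longer than the threshold is unhealthy;
        # the caller would then reassign its running jobs.
        now = now if now is not None else time.time()
        return [w for w, seen in self._last_seen.items()
                if now - seen > HEARTBEAT_TIMEOUT]
```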

Resource Manager

Purpose: Track and allocate resources across workers.

Key Design Decisions:

  • Resource Tracking: Track CPU, memory, and storage across workers
  • Resource Allocation: Allocate resources based on job requirements
  • Resource Optimization: Optimize resource utilization
  • Resource Limits: Enforce resource limits per worker

Algorithm: Resource allocation

1. Receive job with resource requirements
2. Find workers with available resources:
   - Check CPU availability
   - Check memory availability
   - Check storage availability
3. Select worker with best resource match
4. Allocate resources to job
5. Update worker resource usage
6. Monitor resource usage during execution
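Steps 2-3 amount to a best-fit search. This sketch assumes resources are expressed as simple numeric capacities (e.g. CPU cores, memory in GB); "best match" is interpreted as least leftover capacity, which is one reasonable choice among several (first-fit and worst-fit are also common):

```python
def pick_worker(workers, required):
    """Best-fit allocation: among workers that satisfy every requirement,
    pick the one with the least leftover capacity.

    workers:  dict mapping worker_id -> free resources, e.g. {"cpu": 4, "mem_gb": 8}
    required: dict of the job's requirements in the same units
    """
    best_id, best_slack = None, None
    for worker_id, free in workers.items():
        # Step 2: the worker must have enough of every required resource.
        if all(free.get(res, 0) >= amt for res, amt in required.items()):
            # Step 3: prefer the tightest fit to reduce fragmentation.
            slack = sum(free[res] - amt for res, amt in required.items())
            if best_slack is None or slack < best_slack:
                best_id, best_slack = worker_id, slack
    return best_id   # None means no worker can run the job right now
```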

Execution Engine

Purpose: Execute jobs on workers and handle execution lifecycle.

Key Design Decisions:

  • Job Execution: Execute jobs on assigned workers
  • Progress Tracking: Track job execution progress
  • Failure Handling: Handle job failures and retries
  • Result Collection: Collect and store job results

Algorithm: Job execution

1. Receive job execution request
2. Validate worker availability and resources
3. Start job execution on worker
4. Monitor execution progress:
   - Track execution time
   - Monitor resource usage
   - Handle worker failures
5. Collect job results
6. Update job status and execution record
7. Handle job retries if needed
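Step 7 is commonly implemented as exponential backoff with jitter. A sketch, with the retry count and base delay as assumed tuning values and a pluggable `sleep` so the retry logic stays testable:

```python
import random

MAX_RETRIES = 3    # assumed tuning values
BASE_DELAY = 2.0   # seconds

def run_with_retries(job_id, run_once, sleep=lambda s: None):
    """Run a job, retrying failures with exponential backoff plus full jitter.

    `run_once(job_id)` performs one execution attempt and raises on failure.
    Returns a result record mirroring the JobExecution status field.
    """
    for attempt in range(MAX_RETRIES + 1):
        try:
            result = run_once(job_id)
            return {"job_id": job_id, "status": "succeeded", "result": result}
        except Exception:
            if attempt == MAX_RETRIES:
                return {"job_id": job_id, "status": "failed", "result": None}
            # Full jitter spreads retries out so failed jobs do not stampede.
            sleep(random.uniform(0, BASE_DELAY * (2 ** attempt)))
```

In the real system the retry would also re-run worker selection, since the original worker may be the reason the attempt failed.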

Database Design

Jobs Table

| Field | Type | Description |
|---|---|---|
| job_id | VARCHAR(36) | Primary key |
| name | VARCHAR(255) | Job name |
| priority | INT | Job priority |
| status | VARCHAR(50) | Job status |
| requirements | JSON | Job requirements |
| dependencies | JSON | Job dependencies |
| created_at | TIMESTAMP | Creation timestamp |
| scheduled_at | TIMESTAMP | Scheduled execution time |

Indexes:

  • idx_priority_status on (priority, status) - Job scheduling
  • idx_scheduled_at on (scheduled_at) - Time-based scheduling

Workers Table

| Field | Type | Description |
|---|---|---|
| worker_id | VARCHAR(36) | Primary key |
| status | VARCHAR(50) | Worker status |
| capabilities | JSON | Worker capabilities |
| resources | JSON | Available resources |
| last_heartbeat | TIMESTAMP | Last heartbeat |

Indexes:

  • idx_status on (status) - Worker availability
  • idx_last_heartbeat on (last_heartbeat) - Health monitoring

Job Executions Table

| Field | Type | Description |
|---|---|---|
| execution_id | VARCHAR(36) | Primary key |
| job_id | VARCHAR(36) | Associated job |
| worker_id | VARCHAR(36) | Executing worker |
| status | VARCHAR(50) | Execution status |
| start_time | TIMESTAMP | Execution start |
| end_time | TIMESTAMP | Execution end |
| result | JSON | Execution result |

Indexes:

  • idx_job_id on (job_id) - Job execution history
  • idx_worker_id on (worker_id) - Worker execution history
  • idx_status on (status) - Execution status queries

Scalability Considerations

Horizontal Scaling

  • Job Scheduler: Scale horizontally with load balancers
  • Worker Manager: Use consistent hashing for worker partitioning
  • Resource Manager: Scale resource tracking with distributed systems
  • Execution Engine: Scale job execution with multiple workers

Caching Strategy

  • Redis: Cache job queues and worker status
  • Application Cache: Cache frequently accessed data
  • Database Cache: Cache job and execution data

Performance Optimization

  • Connection Pooling: Efficient database connections
  • Batch Processing: Batch job operations for efficiency
  • Async Processing: Non-blocking job processing
  • Resource Monitoring: Monitor CPU, memory, and network usage

Monitoring and Observability

Key Metrics

  • Job Throughput: Jobs processed per second
  • Execution Latency: Average job execution time
  • Worker Utilization: Percentage of workers actively executing jobs
  • System Health: CPU, memory, and disk usage

Alerting

  • High Latency: Alert when job execution time exceeds threshold
  • Worker Failures: Alert when worker failure rate increases
  • Queue Backlog: Alert when job queue grows too large
  • System Errors: Alert on job execution failures

Trade-offs and Considerations

Consistency vs. Availability

  • Choice: Eventual consistency for job status, strong consistency for resource allocation
  • Reasoning: Job status can tolerate slight delays, while resource allocation needs immediate accuracy.

Latency vs. Throughput

  • Choice: Optimize for throughput with batch processing
  • Reasoning: Job scheduling needs to handle high volumes efficiently

Resource Efficiency vs. Job Priority

  • Choice: Balance resource utilization with job priority
  • Reasoning: Optimize both resource usage and job execution order

Common Interview Questions

Q: How would you handle worker failures?

A: Use heartbeat monitoring, job reassignment, and retry mechanisms to handle worker failures gracefully.

Q: How do you ensure job execution order?

A: Use priority queues, dependency resolution, and resource allocation to ensure proper job execution order.
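Dependency resolution here is a topological sort. A sketch using Kahn's algorithm, assuming the dependency graph maps each job to its prerequisites and that every prerequisite also appears as a key:

```python
from collections import deque

def execution_order(deps):
    """Kahn's algorithm. `deps` maps job_id -> list of prerequisite job_ids.

    Returns a valid execution order, or raises ValueError on a cycle
    (a cycle means the submitted dependencies can never be satisfied).
    """
    indegree = {job: len(prereqs) for job, prereqs in deps.items()}
    dependents = {job: [] for job in deps}
    for job, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(job)     # reverse edge: p unblocks job

    ready = deque(sorted(j for j, n in indegree.items() if n == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for nxt in dependents[job]:       # completing `job` may unblock others
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order
```

In practice the scheduler interleaves this with priorities: among all currently unblocked jobs, the priority queue decides which runs first.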

Q: How would you scale this system globally?

A: Deploy regional job schedulers, use geo-distributed databases, and implement data replication strategies.

Q: How do you handle resource contention?

A: Use resource allocation algorithms, priority-based scheduling, and resource limits to handle resource contention.


Key Takeaways

  1. Job Scheduling: Priority queues and dependency resolution are essential for efficient job scheduling
  2. Resource Management: Resource tracking and allocation are crucial for optimal system performance
  3. Fault Tolerance: Heartbeat monitoring and job retries ensure system reliability
  4. Scalability: Horizontal scaling and partitioning are crucial for handling large-scale job processing
  5. Monitoring: Comprehensive monitoring ensures system reliability and performance