Design Job Scheduler

System Design Challenge

Difficulty: Medium
Time: 45-60 minutes
Tags: distributed-scheduling, task-queue, fault-tolerance, load-balancing

What is a Job Scheduler?

A Job Scheduler is a distributed system that manages and executes tasks across multiple machines. It's similar to systems like Apache Airflow, Kubernetes CronJobs, or AWS Batch. The service provides job queuing, scheduling, resource management, and fault tolerance.

Distributed job execution with fault tolerance and resource management is what this design exercise is really about. Once you understand a job scheduler, you can tackle interview questions for many similar distributed systems, since the core design challenges (job queuing, scheduling algorithms, resource management, and fault tolerance) remain the same.


Functional Requirements

Core (Interview Focused)

  • Job Submission: Submit jobs with different priorities and requirements.
  • Job Scheduling: Schedule jobs based on priority, dependencies, and resources.
  • Job Execution: Execute jobs on available workers.
  • Job Monitoring: Track job status and progress.

Out of Scope

  • User authentication and authorization
  • Job result storage and retrieval
  • Job templates and workflows
  • Real-time job streaming
  • Mobile app specific features

Non-Functional Requirements

Core (Interview Focused)

  • High availability: 99.9% uptime for job scheduling.
  • Scalability: Handle millions of jobs and thousands of workers.
  • Fault tolerance: Handle worker failures and job retries.
  • Resource efficiency: Optimize resource utilization across workers.

Out of Scope

  • Data retention policies
  • Compliance and privacy regulations

💡 Interview Tip: Focus on high availability, scalability, and fault tolerance. Interviewers care most about job scheduling, resource management, and failure handling.


Core Entities

| Entity | Key Attributes | Notes |
|---|---|---|
| Job | job_id, name, priority, status, created_at, scheduled_at | Indexed by priority for scheduling |
| Worker | worker_id, status, capabilities, last_heartbeat | Track worker availability |
| JobQueue | queue_id, name, priority, max_workers | Manage job queues |
| JobExecution | execution_id, job_id, worker_id, start_time, end_time | Track job executions |
| Resource | resource_id, worker_id, type, capacity, usage | Track resource availability |

💡 Interview Tip: Focus on Job, Worker, and JobExecution as they drive scheduling, resource management, and fault tolerance.


Core APIs

Job Management

  • POST /jobs { name, priority, requirements, dependencies } – Submit a new job
  • GET /jobs/{job_id} – Get job details
  • PUT /jobs/{job_id}/cancel – Cancel a job
  • GET /jobs?status=&priority=&limit= – List jobs with filters
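
As a concrete illustration, a client could drive the job-management endpoints above with Python's requests library. The base URL, field names, and response shape are assumptions based on this API sketch, not a real service:

```python
import requests

BASE_URL = "https://scheduler.example.com"  # hypothetical endpoint

# Submit a job with a priority and resource requirements.
# Field names follow the API sketch above; adjust to the real schema.
resp = requests.post(f"{BASE_URL}/jobs", json={
    "name": "nightly-report",
    "priority": 5,
    "requirements": {"cpu": 2, "memory_mb": 4096},
    "dependencies": [],
}, timeout=10)
resp.raise_for_status()
job_id = resp.json()["job_id"]  # assumes the response echoes back an id

# Poll the job until it reaches a terminal state.
status = requests.get(f"{BASE_URL}/jobs/{job_id}", timeout=10).json()["status"]
print(job_id, status)
```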

Worker Management

  • POST /workers/register { capabilities, resources } – Register a new worker
  • GET /workers/{worker_id} – Get worker details
  • POST /workers/{worker_id}/heartbeat – Send worker heartbeat
  • GET /workers?status=&capabilities= – List workers with filters

Job Execution

  • POST /jobs/{job_id}/execute { worker_id } – Execute job on worker
  • GET /executions/{execution_id} – Get execution details
  • POST /executions/{execution_id}/complete { result, status } – Complete job execution
  • GET /executions?job_id=&worker_id=&status= – List executions with filters
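
On the worker side, a hypothetical agent could register itself, send a heartbeat, and report a completed execution through the endpoints above (again, the base URL, identifiers, and payload shapes are assumptions drawn from this sketch):

```python
import requests

BASE_URL = "https://scheduler.example.com"  # hypothetical endpoint

# Register a worker and advertise its capabilities and resources.
worker = requests.post(f"{BASE_URL}/workers/register", json={
    "capabilities": ["python", "gpu"],
    "resources": {"cpu": 8, "memory_mb": 32768},
}, timeout=10).json()
worker_id = worker["worker_id"]

# Periodic heartbeat with current status and resource usage.
requests.post(f"{BASE_URL}/workers/{worker_id}/heartbeat",
              json={"status": "idle", "usage": {"cpu": 0.1}}, timeout=10)

# Report a finished execution back to the scheduler.
execution_id = "exec-123"  # hypothetical id received when the job was dispatched
requests.post(f"{BASE_URL}/executions/{execution_id}/complete",
              json={"status": "succeeded", "result": {"rows": 42}}, timeout=10)
```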

High-Level Design

System Architecture Diagram

Key Components

  • Job Scheduler: Manage job queuing and scheduling
  • Worker Manager: Manage worker registration and health
  • Resource Manager: Track and allocate resources
  • Execution Engine: Execute jobs on workers
  • Database: Persistent storage for jobs and executions
  • Message Queue: Decouple job submission from execution

Mapping Core Functional Requirements to Components

| Functional Requirement | Responsible Components | Key Considerations |
|---|---|---|
| Job Submission | Job Scheduler, Message Queue | High throughput, job validation |
| Job Scheduling | Job Scheduler, Resource Manager | Priority handling, resource allocation |
| Job Execution | Execution Engine, Worker Manager | Fault tolerance, resource management |
| Job Monitoring | Job Scheduler, Database | Real-time status, progress tracking |

Detailed Design

Job Scheduler

Purpose: Manage job queuing, scheduling, and resource allocation.

Key Design Decisions:

  • Priority Queues: Use priority queues for job scheduling
  • Dependency Resolution: Handle job dependencies and execution order
  • Resource Matching: Match jobs with available resources
  • Load Balancing: Distribute jobs across available workers

Algorithm: Job scheduling with priority

1. Receive job submission request
2. Validate job requirements and dependencies
3. Add job to priority queue based on priority
4. For each job in queue:
   - Check if dependencies are satisfied
   - Find available worker with required resources
   - If worker found:
     - Assign job to worker
     - Update job status to "running"
     - Remove from queue
5. Handle job timeouts and retries
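
A minimal in-memory sketch of this loop, using Python's heapq as the priority queue. The find_worker callback and the worker.assign method are stand-ins for the Resource Manager and Execution Engine, and a production scheduler would persist the queue rather than keep it in memory:

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker so equal priorities stay FIFO


class Scheduler:
    def __init__(self):
        self._queue = []   # min-heap of (-priority, seq, job)
        self._done = set() # job_ids whose dependencies count as satisfied

    def submit(self, job):
        # job: {"job_id", "priority", "dependencies", "requirements"}
        heapq.heappush(self._queue, (-job["priority"], next(_counter), job))

    def schedule_once(self, find_worker):
        """Pop ready jobs and assign them; re-queue jobs that cannot run yet."""
        deferred = []
        while self._queue:
            neg_prio, seq, job = heapq.heappop(self._queue)
            deps_ok = all(d in self._done for d in job.get("dependencies", []))
            worker = find_worker(job["requirements"]) if deps_ok else None
            if worker is None:
                deferred.append((neg_prio, seq, job))  # blocked on deps or capacity
                continue
            worker.assign(job)  # job status -> "running"
        for item in deferred:
            heapq.heappush(self._queue, item)

    def mark_done(self, job_id):
        self._done.add(job_id)
```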

Worker Manager

Purpose: Manage worker registration, health monitoring, and capability tracking.

Key Design Decisions:

  • Heartbeat Mechanism: Monitor worker health with regular heartbeats
  • Capability Tracking: Track worker capabilities and resources
  • Failure Detection: Detect worker failures and handle gracefully
  • Resource Management: Track worker resource usage

Algorithm: Worker health monitoring

1. Worker sends heartbeat with status and resource usage
2. Update worker last_seen timestamp
3. Check worker health:
   - If heartbeat missed for threshold time
   - Mark worker as "unhealthy"
   - Reassign running jobs to other workers
4. Update worker capabilities and resources
5. Notify job scheduler of worker status changes
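
A minimal in-memory sketch of this loop, assuming heartbeats arrive through the worker API above and that an on_worker_lost callback asks the scheduler to re-queue the lost worker's running jobs (both the callback and the 30-second threshold are illustrative choices):

```python
import time

HEARTBEAT_TIMEOUT = 30  # seconds without a heartbeat before a worker is unhealthy


class WorkerManager:
    def __init__(self, on_worker_lost):
        self._workers = {}                     # worker_id -> {"last_seen", "resources", "status"}
        self._on_worker_lost = on_worker_lost  # callback: reassign that worker's jobs

    def heartbeat(self, worker_id, resources):
        self._workers[worker_id] = {
            "last_seen": time.monotonic(),
            "resources": resources,
            "status": "healthy",
        }

    def sweep(self):
        """Run periodically: mark stale workers unhealthy and trigger reassignment."""
        now = time.monotonic()
        for worker_id, info in self._workers.items():
            if info["status"] == "healthy" and now - info["last_seen"] > HEARTBEAT_TIMEOUT:
                info["status"] = "unhealthy"
                self._on_worker_lost(worker_id)  # scheduler re-queues its running jobs
```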

Resource Manager

Purpose: Track and allocate resources across workers.

Key Design Decisions:

  • Resource Tracking: Track CPU, memory, and storage across workers
  • Resource Allocation: Allocate resources based on job requirements
  • Resource Optimization: Optimize resource utilization
  • Resource Limits: Enforce resource limits per worker

Algorithm: Resource allocation

1. Receive job with resource requirements
2. Find workers with available resources:
   - Check CPU availability
   - Check memory availability
   - Check storage availability
3. Select worker with best resource match
4. Allocate resources to job
5. Update worker resource usage
6. Monitor resource usage during execution
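
Step 3 is often implemented as best fit: among workers that can satisfy the request, pick the one with the least leftover capacity so large workers stay free for large jobs. A sketch under the assumption that requirements and free capacity are plain dicts with matching keys (cpu, memory_mb, disk_gb):

```python
def pick_worker(job_req, workers):
    """Return the worker with the tightest fit for the job's requirements."""
    best, best_slack = None, None
    for w in workers:
        free = w["free"]
        if any(free.get(k, 0) < v for k, v in job_req.items()):
            continue  # this worker cannot satisfy the request
        # Best fit: smallest leftover capacity across the requested dimensions.
        slack = sum(free.get(k, 0) - v for k, v in job_req.items())
        if best is None or slack < best_slack:
            best, best_slack = w, slack
    return best


def allocate(job_req, worker):
    for k, v in job_req.items():
        worker["free"][k] -= v  # reserve capacity for the job


def release(job_req, worker):
    for k, v in job_req.items():
        worker["free"][k] += v  # return capacity when the job finishes
```

A real Resource Manager would also need to make allocation atomic across concurrent scheduler instances, for example with a database transaction or a lease.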

Execution Engine

Purpose: Execute jobs on workers and handle execution lifecycle.

Key Design Decisions:

  • Job Execution: Execute jobs on assigned workers
  • Progress Tracking: Track job execution progress
  • Failure Handling: Handle job failures and retries
  • Result Collection: Collect and store job results

Algorithm: Job execution

1. Receive job execution request
2. Validate worker availability and resources
3. Start job execution on worker
4. Monitor execution progress:
   - Track execution time
   - Monitor resource usage
   - Handle worker failures
5. Collect job results
6. Update job status and execution record
7. Handle job retries if needed
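
Steps 3-7 can be wrapped in a retry loop such as the sketch below. The execute and record callables are placeholders for the worker RPC and the Job Executions table, and exponential backoff between attempts is one common (but not the only) retry policy:

```python
import time

MAX_RETRIES = 3


def run_with_retries(job, execute, record):
    """Execute a job, retrying on failure with exponential backoff.

    execute(job) runs the job and raises on failure;
    record(job_id, status, result) persists the outcome of each attempt.
    """
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            result = execute(job)
            record(job["job_id"], "succeeded", result)
            return result
        except Exception as exc:  # worker crash, timeout, non-zero exit, etc.
            record(job["job_id"], "failed", {"error": str(exc), "attempt": attempt})
            if attempt == MAX_RETRIES:
                raise
            time.sleep(2 ** attempt)  # back off before the next attempt
```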

Database Design

Jobs Table

| Field | Type | Description |
|---|---|---|
| job_id | VARCHAR(36) | Primary key |
| name | VARCHAR(255) | Job name |
| priority | INT | Job priority |
| status | VARCHAR(50) | Job status |
| requirements | JSON | Job requirements |
| dependencies | JSON | Job dependencies |
| created_at | TIMESTAMP | Creation timestamp |
| scheduled_at | TIMESTAMP | Scheduled execution time |

Indexes:

  • idx_priority_status on (priority, status) - Job scheduling
  • idx_scheduled_at on (scheduled_at) - Time-based scheduling
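
To show how these indexes earn their keep, here is the kind of query the scheduler might run to pick up due work, sketched with Python's built-in sqlite3 for brevity; both predicates and the sort order line up with idx_priority_status and idx_scheduled_at:

```python
import sqlite3

conn = sqlite3.connect("scheduler.db")  # stand-in for the production database


def fetch_due_jobs(limit=100):
    """Fetch pending jobs that are due, highest priority first."""
    return conn.execute(
        """
        SELECT job_id, name, priority
        FROM jobs
        WHERE status = 'pending'
          AND scheduled_at <= CURRENT_TIMESTAMP
        ORDER BY priority DESC, scheduled_at
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
```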

Workers Table

| Field | Type | Description |
|---|---|---|
| worker_id | VARCHAR(36) | Primary key |
| status | VARCHAR(50) | Worker status |
| capabilities | JSON | Worker capabilities |
| resources | JSON | Available resources |
| last_heartbeat | TIMESTAMP | Last heartbeat |

Indexes:

  • idx_status on (status) - Worker availability
  • idx_last_heartbeat on (last_heartbeat) - Health monitoring

Job Executions Table

| Field | Type | Description |
|---|---|---|
| execution_id | VARCHAR(36) | Primary key |
| job_id | VARCHAR(36) | Associated job |
| worker_id | VARCHAR(36) | Executing worker |
| status | VARCHAR(50) | Execution status |
| start_time | TIMESTAMP | Execution start |
| end_time | TIMESTAMP | Execution end |
| result | JSON | Execution result |

Indexes:

  • idx_job_id on (job_id) - Job execution history
  • idx_worker_id on (worker_id) - Worker execution history
  • idx_status on (status) - Execution status queries

Scalability Considerations

Horizontal Scaling

  • Job Scheduler: Run multiple scheduler instances behind a load balancer
  • Worker Manager: Partition workers across manager instances with consistent hashing
  • Resource Manager: Shard resource tracking by worker partition
  • Execution Engine: Add workers to grow job execution capacity

Caching Strategy

  • Redis: Cache job queues and worker status
  • Application Cache: Cache frequently accessed data
  • Database Cache: Cache job and execution data
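
As a sketch of the Redis layer (assuming a local Redis instance and the redis-py client; key names are illustrative): worker status lives in short-TTL hashes so entries for crashed workers expire on their own, and the pending-job queue is a sorted set scored by negated priority so the highest-priority job pops first.

```python
import redis

r = redis.Redis()  # assumes a local Redis instance and the redis-py package


def cache_worker_status(worker_id, status, cpu_used):
    # One hash per worker with a short TTL, so stale entries age out naturally.
    key = f"worker:{worker_id}"
    r.hset(key, mapping={"status": status, "cpu_used": cpu_used})
    r.expire(key, 30)


def enqueue_job(job_id, priority):
    # Sorted set as a priority queue: lower score pops first, so negate priority.
    r.zadd("jobs:pending", {job_id: -priority})


def pop_next_job():
    popped = r.zpopmin("jobs:pending")
    return popped[0][0].decode() if popped else None
```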

Performance Optimization

  • Connection Pooling: Efficient database connections
  • Batch Processing: Batch job operations for efficiency
  • Async Processing: Non-blocking job processing
  • Resource Monitoring: Monitor CPU, memory, and network usage
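
A buffered write is the simplest form of this batching: instead of updating the jobs table once per status change, the scheduler can queue changes and flush them in a single transaction. A minimal sketch with Python's built-in sqlite3 standing in for the real database (the jobs table matches the schema above):

```python
import sqlite3

conn = sqlite3.connect("scheduler.db")  # stand-in for the production database


def flush_status_updates(updates):
    """Apply buffered status changes in one transaction instead of one write per job.

    updates: list of (new_status, job_id) tuples collected by the scheduler.
    """
    with conn:  # one commit for the whole batch
        conn.executemany("UPDATE jobs SET status = ? WHERE job_id = ?", updates)
```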

Monitoring and Observability

Key Metrics

  • Job Throughput: Jobs processed per second
  • Execution Latency: Average job execution time
  • Worker Utilization: Percentage of workers actively executing jobs
  • System Health: CPU, memory, and disk usage

Alerting

  • High Latency: Alert when job execution time exceeds threshold
  • Worker Failures: Alert when worker failure rate increases
  • Queue Backlog: Alert when job queue grows too large
  • System Errors: Alert on job execution failures

Trade-offs and Considerations

Consistency vs. Availability

  • Choice: Eventual consistency for job status, strong consistency for resource allocation
  • Reasoning: Job status can tolerate slight delays, resource allocation needs immediate accuracy

Latency vs. Throughput

  • Choice: Optimize for throughput with batch processing
  • Reasoning: Job scheduling needs to handle high volumes efficiently

Resource Efficiency vs. Job Priority

  • Choice: Balance resource utilization with job priority
  • Reasoning: Optimize both resource usage and job execution order

Common Interview Questions

Q: How would you handle worker failures?

A: Use heartbeat monitoring, job reassignment, and retry mechanisms to handle worker failures gracefully.

Q: How do you ensure job execution order?

A: Use priority queues, dependency resolution, and resource allocation to ensure proper job execution order.

Q: How would you scale this system globally?

A: Deploy regional job schedulers, use geo-distributed databases, and implement data replication strategies.

Q: How do you handle resource contention?

A: Enforce per-worker resource limits, allocate by job priority when demand exceeds capacity, and queue or delay lower-priority jobs until resources free up.


Key Takeaways

  1. Job Scheduling: Priority queues and dependency resolution are essential for efficient job scheduling
  2. Resource Management: Resource tracking and allocation are crucial for optimal system performance
  3. Fault Tolerance: Heartbeat monitoring and job retries ensure system reliability
  4. Scalability: Horizontal scaling and partitioning are crucial for handling large-scale job processing
  5. Monitoring: Comprehensive monitoring ensures system reliability and performance