Distributed Locks

Core Concept

intermediate
25-35 minutes
distributed-systems, concurrency, synchronization, consensus, coordination, mutual-exclusion

Coordinating access to shared resources across distributed systems

Overview

Distributed locks are synchronization mechanisms that coordinate access to shared resources across multiple nodes in a distributed system. They ensure mutual exclusion, preventing race conditions and maintaining data consistency when multiple processes or services need to access the same resource concurrently.

Unlike traditional locks that work within a single process or machine, distributed locks operate across network boundaries and must handle failures, network partitions, and timing issues. They are essential for building reliable distributed systems where coordination is critical for correctness.

The main technical challenges this addresses include:

  • Mutual exclusion: Ensuring only one process accesses a resource at a time
  • Deadlock prevention: Avoiding situations where processes wait indefinitely
  • Fault tolerance: Handling node failures and network partitions gracefully
  • Performance: Minimizing coordination overhead and latency

Core Principles: Requirements and Properties

Essential Properties of Distributed Locks

Safety (Mutual Exclusion): At most one client can hold a lock at any given time. This is the fundamental guarantee that prevents race conditions.

Liveness (Deadlock-free): The system must eventually grant locks to waiting clients, and locks must be releasable even if the holder fails.

Fault Tolerance: The locking mechanism must continue to function despite node failures, network partitions, and other distributed system challenges.
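
To see what the safety and liveness properties mean in practice, here is a minimal sketch, assuming a hypothetical lockService that exposes the acquire/release interface used by the implementations later in this article:

// Two clients race for the same resource; safety says at most one may win.
async function demonstrateMutualExclusion(lockService) {
  const [a, b] = await Promise.all([
    lockService.acquireLock('orders:42', 'client-A', 5000),
    lockService.acquireLock('orders:42', 'client-B', 5000)
  ]);
  
  // Safety (mutual exclusion): both acquisitions must never succeed at once
  console.assert(!(a.acquired && b.acquired), 'mutual exclusion violated');
  
  // Liveness: once the winner releases (or its TTL expires),
  // the other client must eventually be able to acquire the lock
  if (a.acquired) await lockService.releaseLock(a.lockKey, a.lockValue);
  if (b.acquired) await lockService.releaseLock(b.lockKey, b.lockValue);
}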

The Challenge of Distributed Coordination

Unlike a local mutex, a distributed lock cannot rely on shared memory or a single clock: every acquisition and release is a network call that can be delayed, dropped, or observed at different times by different nodes, and the lock holder itself may crash or pause at any point. The implementation patterns below differ mainly in how they handle these failure modes.

[System architecture diagram]

Implementation Patterns

Redis-Based Distributed Locks

class RedisDistributedLock {
  constructor(redisClient, defaultTTL = 30000) {
    this.redis = redisClient;
    this.defaultTTL = defaultTTL;
  }
  
  async acquireLock(resource, clientId, ttl = this.defaultTTL) {
    const lockKey = `lock:${resource}`;
    const expiry = Date.now() + ttl;
    const lockValue = `${clientId}:${expiry}`;
    
    // Use SET with NX (only if not exists) and PX (expire time)
    const result = await this.redis.set(
      lockKey, 
      lockValue, 
      'PX', ttl,     // Expire in milliseconds
      'NX'           // Only set if key doesn't exist
    );
    
    if (result === 'OK') {
      return {
        acquired: true,
        lockKey,
        lockValue,
        expiresAt: expiry
      };
    }
    
    return { acquired: false };
  }
  
  async renewLock(lockKey, lockValue, newTTL) {
    // Lua script for atomic renewal
    const renewScript = `
      if redis.call('GET', KEYS[1]) == ARGV[1] then
        return redis.call('PEXPIRE', KEYS[1], ARGV[2])
      else
        return 0
      end
    `;
    
    const result = await this.redis.eval(
      renewScript, 
      1, 
      lockKey, 
      lockValue, 
      newTTL
    );
    
    return result === 1;
  }
  
  async releaseLock(lockKey, lockValue) {
    // Lua script for atomic release (only release if we own the lock)
    const releaseScript = `
      if redis.call('GET', KEYS[1]) == ARGV[1] then
        return redis.call('DEL', KEYS[1])
      else
        return 0
      end
    `;
    
    const result = await this.redis.eval(
      releaseScript, 
      1, 
      lockKey, 
      lockValue
    );
    
    return result === 1;
  }
  
  async withLock(resource, clientId, operation, options = {}) {
    const { ttl = this.defaultTTL, retryDelay = 100, maxRetries = 10 } = options;
    
    let retries = 0;
    
    while (retries < maxRetries) {
      const lock = await this.acquireLock(resource, clientId, ttl);
      
      if (lock.acquired) {
        // Set up lock renewal in case the operation runs longer than the TTL
        const renewalTimer = this.setupLockRenewal(lock);
        
        try {
          return await operation();
        } finally {
          // Stop renewing and release the lock whether the operation succeeded or failed
          clearInterval(renewalTimer);
          await this.releaseLock(lock.lockKey, lock.lockValue);
        }
      }
      
      // Wait before retrying
      await this.sleep(retryDelay * Math.pow(2, retries)); // Exponential backoff
      retries++;
    }
    
    throw new Error(`Failed to acquire lock for ${resource} after ${maxRetries} retries`);
  }
  
  setupLockRenewal(lock) {
    // Renew 5 seconds before expiry, but never more often than once per second
    const renewalDelay = Math.max(1000, lock.expiresAt - Date.now() - 5000);
    
    const timer = setInterval(async () => {
      const renewed = await this.renewLock(
        lock.lockKey, 
        lock.lockValue, 
        this.defaultTTL
      );
      
      if (!renewed) {
        // We no longer own the lock (expired or taken over), so stop renewing
        console.warn(`Failed to renew lock ${lock.lockKey}`);
        clearInterval(timer);
      }
    }, renewalDelay);
    
    return timer;
  }
  
  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage example
class InventoryService {
  constructor(redisClient, database) {
    this.lock = new RedisDistributedLock(redisClient);
    this.db = database;
  }
  
  async updateInventory(productId, quantityChange) {
    const clientId = `inventory-service-${process.pid}-${Date.now()}`;
    
    return await this.lock.withLock(
      `inventory:${productId}`,
      clientId,
      async () => {
        // Critical section - guaranteed exclusive access
        const current = await this.db.getInventory(productId);
        const newQuantity = current.quantity + quantityChange;
        
        if (newQuantity < 0) {
          throw new Error('Insufficient inventory');
        }
        
        return await this.db.updateInventory(productId, newQuantity);
      },
      { ttl: 10000, maxRetries: 5 }
    );
  }
}

Redlock Algorithm (Multi-Redis)

class RedlockDistributedLock {
  constructor(redisInstances, quorumSize = null) {
    this.redisInstances = redisInstances;
    this.quorumSize = quorumSize || Math.floor(redisInstances.length / 2) + 1;
    this.clockDriftFactor = 0.01; // 1% clock drift tolerance
  }
  
  async acquireLock(resource, clientId, ttl) {
    const startTime = Date.now();
    const lockValue = `${clientId}:${Date.now()}:${Math.random()}`;
    
    // Try to acquire lock on all Redis instances
    const promises = this.redisInstances.map(redis => 
      this.acquireSingleLock(redis, resource, lockValue, ttl)
    );
    
    const results = await Promise.allSettled(promises);
    const successful = results.filter(r => r.status === 'fulfilled' && r.value).length;
    
    const drift = (ttl * this.clockDriftFactor) + 2; // Add 2ms for network delay
    const validityTime = ttl - (Date.now() - startTime) - drift;
    
    if (successful >= this.quorumSize && validityTime > 0) {
      return {
        acquired: true,
        resource,
        lockValue,
        validityTime,
        acquiredInstances: successful
      };
    } else {
      // Failed to acquire majority, release any acquired locks
      await this.releaseLock(resource, lockValue);
      return { acquired: false };
    }
  }
  
  async acquireSingleLock(redis, resource, lockValue, ttl) {
    try {
      const result = await redis.set(
        `lock:${resource}`,
        lockValue,
        'PX', ttl,
        'NX'
      );
      return result === 'OK';
    } catch (error) {
      return false;
    }
  }
  
  async releaseLock(resource, lockValue) {
    const releaseScript = `
      if redis.call('GET', KEYS[1]) == ARGV[1] then
        return redis.call('DEL', KEYS[1])
      else
        return 0
      end
    `;
    
    const promises = this.redisInstances.map(redis => 
      redis.eval(releaseScript, 1, `lock:${resource}`, lockValue).catch(() => 0)
    );
    
    await Promise.allSettled(promises);
  }
  
  async withRedlock(resource, clientId, operation, ttl = 30000) {
    const lock = await this.acquireLock(resource, clientId, ttl);
    
    if (!lock.acquired) {
      throw new Error(`Failed to acquire Redlock for ${resource}`);
    }
    
    try {
      // Check if we still have enough time to perform operation
      if (lock.validityTime < 1000) {
        throw new Error('Not enough time left on lock');
      }
      
      const result = await operation();
      return result;
    } finally {
      await this.releaseLock(resource, lock.lockValue);
    }
  }
}
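
A brief usage sketch, assuming three independent Redis client connections to separate instances; the resource name, client ID, and helper function are illustrative:

// Hypothetical setup: redisA, redisB, redisC are clients for three separate Redis instances
const redlock = new RedlockDistributedLock([redisA, redisB, redisC]); // quorum = 2

async function settlePayment(paymentId) {
  return await redlock.withRedlock(
    `payment:${paymentId}`,
    `payment-service-${process.pid}`,
    async () => {
      // Critical section: runs only while a majority of instances granted the lock
      // and enough validity time remains
      return await processPaymentOnce(paymentId); // hypothetical helper
    },
    10000 // 10-second TTL
  );
}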

Database-Based Distributed Locks

class DatabaseDistributedLock {
  constructor(database) {
    this.db = database;
    this.initializeLockTable();
  }
  
  async initializeLockTable() {
    await this.db.query(`
      CREATE TABLE IF NOT EXISTS distributed_locks (
        resource_name VARCHAR(255) PRIMARY KEY,
        owner_id VARCHAR(255) NOT NULL,
        acquired_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        expires_at TIMESTAMP NOT NULL,
        INDEX idx_expires_at (expires_at)
      )
    `);
  }
  
  async acquireLock(resource, ownerId, ttlMs = 30000) {
    const expiresAt = new Date(Date.now() + ttlMs);
    
    try {
      // Try to insert new lock
      await this.db.query(
        `INSERT INTO distributed_locks (resource_name, owner_id, expires_at) 
         VALUES (?, ?, ?)`,
        [resource, ownerId, expiresAt]
      );
      
      return {
        acquired: true,
        resource,
        ownerId,
        expiresAt
      };
    } catch (error) {
      if (error.code === 'ER_DUP_ENTRY') {
        // Lock already exists, check if it's expired
        return await this.tryAcquireExpiredLock(resource, ownerId, expiresAt);
      }
      throw error;
    }
  }
  
  async tryAcquireExpiredLock(resource, ownerId, expiresAt) {
    // Try to update expired lock atomically
    const result = await this.db.query(
      `UPDATE distributed_locks 
       SET owner_id = ?, acquired_at = CURRENT_TIMESTAMP, expires_at = ?
       WHERE resource_name = ? AND expires_at < CURRENT_TIMESTAMP`,
      [ownerId, expiresAt, resource]
    );
    
    if (result.affectedRows > 0) {
      return {
        acquired: true,
        resource,
        ownerId,
        expiresAt
      };
    }
    
    return { acquired: false };
  }
  
  async releaseLock(resource, ownerId) {
    const result = await this.db.query(
      `DELETE FROM distributed_locks 
       WHERE resource_name = ? AND owner_id = ?`,
      [resource, ownerId]
    );
    
    return result.affectedRows > 0;
  }
  
  async renewLock(resource, ownerId, ttlMs) {
    const expiresAt = new Date(Date.now() + ttlMs);
    
    const result = await this.db.query(
      `UPDATE distributed_locks 
       SET expires_at = ? 
       WHERE resource_name = ? AND owner_id = ? AND expires_at > CURRENT_TIMESTAMP`,
      [expiresAt, resource, ownerId]
    );
    
    return result.affectedRows > 0;
  }
  
  // Cleanup expired locks periodically
  startCleanupTask(intervalMs = 60000) {
    setInterval(async () => {
      try {
        const result = await this.db.query(
          `DELETE FROM distributed_locks WHERE expires_at < CURRENT_TIMESTAMP`
        );
        
        if (result.affectedRows > 0) {
          console.log(`Cleaned up ${result.affectedRows} expired locks`);
        }
      } catch (error) {
        console.error('Lock cleanup failed:', error);
      }
    }, intervalMs);
  }
}
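
A short usage sketch, assuming a MySQL-style client whose query() result exposes affectedRows; the job name, owner ID, and work function are illustrative:

// Hypothetical usage: ensure only one worker runs a scheduled job at a time
const dbLock = new DatabaseDistributedLock(db);
dbLock.startCleanupTask(); // periodically purge expired lock rows

async function runNightlyReport() {
  const ownerId = `report-worker-${process.pid}`;
  const lock = await dbLock.acquireLock('job:nightly-report', ownerId, 60000);
  
  if (!lock.acquired) {
    return; // another worker already holds the job
  }
  
  try {
    await generateReport(); // hypothetical work; call renewLock() if it may exceed the TTL
  } finally {
    await dbLock.releaseLock('job:nightly-report', ownerId);
  }
}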

Consensus-Based Locks (Raft)

class RaftDistributedLock {
  constructor(raftCluster) {
    this.raft = raftCluster;
    this.locks = new Map(); // Local lock state
  }
  
  async acquireLock(resource, clientId, ttl = 30000) {
    const operation = {
      type: 'ACQUIRE_LOCK',
      resource,
      clientId,
      ttl,
      timestamp: Date.now()
    };
    
    try {
      // Submit operation to Raft cluster
      const result = await this.raft.submitOperation(operation);
      
      if (result.success) {
        // Set up local expiration
        setTimeout(() => {
          this.expireLock(resource, clientId);
        }, ttl);
        
        return {
          acquired: true,
          resource,
          clientId,
          term: result.term,
          index: result.index
        };
      }
      
      return { acquired: false, reason: result.reason };
    } catch (error) {
      return { acquired: false, error: error.message };
    }
  }
  
  async releaseLock(resource, clientId) {
    const operation = {
      type: 'RELEASE_LOCK',
      resource,
      clientId,
      timestamp: Date.now()
    };
    
    return await this.raft.submitOperation(operation);
  }
  
  // Apply committed operations from Raft
  applyOperation(operation) {
    switch (operation.type) {
      case 'ACQUIRE_LOCK':
        return this.applyLockAcquisition(operation);
      case 'RELEASE_LOCK':
        return this.applyLockRelease(operation);
      default:
        throw new Error(`Unknown operation type: ${operation.type}`);
    }
  }
  
  applyLockAcquisition(operation) {
    const { resource, clientId, ttl, timestamp } = operation;
    
    // Check if lock already exists and is not expired
    const existingLock = this.locks.get(resource);
    if (existingLock && existingLock.expiresAt > Date.now()) {
      return {
        success: false,
        reason: `Lock held by ${existingLock.clientId}`
      };
    }
    
    // Acquire lock
    this.locks.set(resource, {
      clientId,
      acquiredAt: timestamp,
      expiresAt: timestamp + ttl
    });
    
    return { success: true };
  }
  
  applyLockRelease(operation) {
    const { resource, clientId } = operation;
    
    const lock = this.locks.get(resource);
    if (!lock) {
      return { success: false, reason: 'Lock not found' };
    }
    
    if (lock.clientId !== clientId) {
      return { success: false, reason: 'Not lock owner' };
    }
    
    this.locks.delete(resource);
    return { success: true };
  }
  
  expireLock(resource, clientId) {
    const lock = this.locks.get(resource);
    if (lock && lock.clientId === clientId) {
      this.locks.delete(resource);
      console.log(`Lock expired for resource: ${resource}`);
    }
  }
}
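
A usage sketch, assuming a raftCluster client that exposes submitOperation() and routes operations through the current leader; the resource name and work function are illustrative:

// Hypothetical usage of the Raft-backed lock
const raftLock = new RaftDistributedLock(raftCluster);

async function rebalanceShards() {
  const clientId = `shard-manager-${process.pid}`;
  const lock = await raftLock.acquireLock('cluster:rebalance', clientId, 60000);
  
  if (!lock.acquired) {
    console.log(`Rebalance skipped: ${lock.reason || lock.error}`);
    return;
  }
  
  try {
    await performRebalance(); // hypothetical work protected by the lock
  } finally {
    await raftLock.releaseLock('cluster:rebalance', clientId);
  }
}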

Advanced Patterns and Considerations

Lock-Free Algorithms Alternative

// Sometimes you can avoid locks entirely using atomic operations
class LockFreeCounter {
  constructor(redis) {
    this.redis = redis;
  }
  
  // Atomic increment using Redis
  async increment(key, delta = 1) {
    return await this.redis.incrby(key, delta);
  }
  
  // Compare-and-swap pattern
  async compareAndSwap(key, expectedValue, newValue) {
    const script = `
      local current = redis.call('GET', KEYS[1])
      if current == ARGV[1] then
        redis.call('SET', KEYS[1], ARGV[2])
        return 1
      else
        return 0
      end
    `;
    
    const result = await this.redis.eval(script, 1, key, expectedValue, newValue);
    return result === 1;
  }
  
  // Optimistic concurrency control
  async updateWithVersion(key, updateFunction) {
    let retries = 0;
    const maxRetries = 10;
    
    while (retries < maxRetries) {
      // Get current value and version
      const current = await this.redis.hgetall(`${key}:versioned`);
      
      if (!current.value) {
        // Initialize if the key doesn't exist yet. HSETNX sets a single field,
        // so claim the version field first, then write the initial value.
        const initialValue = updateFunction(null);
        const created = await this.redis.hsetnx(`${key}:versioned`, 'version', 1);
        if (created) {
          await this.redis.hset(`${key}:versioned`, 'value', initialValue);
          return { value: initialValue, version: 1 };
        }
      } else {
        // Apply update function
        const newValue = updateFunction(current.value);
        const newVersion = parseInt(current.version) + 1;
        
        // Atomic update if version unchanged
        const script = `
          local current_version = redis.call('HGET', KEYS[1], 'version')
          if current_version == ARGV[1] then
            redis.call('HMSET', KEYS[1], 'value', ARGV[2], 'version', ARGV[3])
            return 1
          else
            return 0
          end
        `;
        
        const updated = await this.redis.eval(
          script, 1, `${key}:versioned`, 
          current.version, newValue, newVersion
        );
        
        if (updated === 1) {
          return { value: newValue, version: newVersion };
        }
      }
      
      retries++;
      await this.sleep(Math.random() * 100); // Random jitter
    }
    
    throw new Error('Failed to update after maximum retries');
  }
  
  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
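
For example, the increment and optimistic-update helpers can replace a lock entirely for simple counters and append-style updates (a sketch; the key names and entry format are illustrative):

// Hypothetical usage of the lock-free helpers
const counter = new LockFreeCounter(redisClient);

async function recordPageView(pageId) {
  // Plain atomic increment; no coordination needed at all
  await counter.increment(`views:${pageId}`);
}

async function appendAuditEntry(entry) {
  // Optimistic update: automatically retried if another writer bumps the version
  return await counter.updateWithVersion('audit:log', current => {
    const entries = current ? JSON.parse(current) : [];
    entries.push(entry);
    return JSON.stringify(entries);
  });
}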

Performance Monitoring and Debugging

class MonitoredDistributedLock {
  constructor(lockImplementation) {
    this.lock = lockImplementation;
    this.metrics = {
      acquisitions: 0,
      acquisitionFailures: 0,
      totalAcquisitionTime: 0,
      holdTimes: [],
      contentionEvents: 0
    };
  }
  
  async acquireLock(resource, clientId, ttl) {
    const startTime = Date.now();
    
    try {
      const result = await this.lock.acquireLock(resource, clientId, ttl);
      
      const acquisitionTime = Date.now() - startTime;
      this.metrics.totalAcquisitionTime += acquisitionTime;
      
      if (result.acquired) {
        this.metrics.acquisitions++;
        result._acquiredAt = Date.now();
      } else {
        this.metrics.acquisitionFailures++;
        this.metrics.contentionEvents++;
      }
      
      return result;
    } catch (error) {
      this.metrics.acquisitionFailures++;
      throw error;
    }
  }
  
  async releaseLock(lockInfo) {
    const result = await this.lock.releaseLock(lockInfo.resource, lockInfo.clientId);
    
    if (lockInfo._acquiredAt) {
      const holdTime = Date.now() - lockInfo._acquiredAt;
      this.metrics.holdTimes.push(holdTime);
      
      // Keep only recent hold times for memory efficiency
      if (this.metrics.holdTimes.length > 1000) {
        this.metrics.holdTimes = this.metrics.holdTimes.slice(-1000);
      }
    }
    
    return result;
  }
  
  getMetrics() {
    const avgAcquisitionTime = this.metrics.acquisitions > 0 ? 
      this.metrics.totalAcquisitionTime / this.metrics.acquisitions : 0;
    
    const avgHoldTime = this.metrics.holdTimes.length > 0 ?
      this.metrics.holdTimes.reduce((a, b) => a + b, 0) / this.metrics.holdTimes.length : 0;
    
    return {
      totalAcquisitions: this.metrics.acquisitions,
      acquisitionFailures: this.metrics.acquisitionFailures,
      successRate: this.metrics.acquisitions / 
        (this.metrics.acquisitions + this.metrics.acquisitionFailures),
      avgAcquisitionTime,
      avgHoldTime,
      contentionEvents: this.metrics.contentionEvents
    };
  }
}
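
A sketch of wrapping one of the implementations above and reading the metrics; the wrapped lock, resource names, and reporting interval are illustrative:

// Hypothetical: wrap the database-backed lock and report contention figures
const monitoredLock = new MonitoredDistributedLock(new DatabaseDistributedLock(db));

async function processQueueItem(itemId) {
  const lock = await monitoredLock.acquireLock(`queue:${itemId}`, 'worker-1', 15000);
  if (!lock.acquired) return;
  
  try {
    await handleItem(itemId); // hypothetical work
  } finally {
    // The wrapper expects the resource and the client/owner ID it was acquired with
    await monitoredLock.releaseLock({ ...lock, clientId: lock.ownerId });
  }
}

// Periodically log success rate, acquisition latency, and hold times
setInterval(() => console.log(monitoredLock.getMetrics()), 60000);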

Deep Dive Analysis

Comparison of Lock Implementations

Implementation   | Pros                                               | Cons                          | Best Use Case
Redis (single)   | Simple, fast, low latency                          | Single point of failure       | High-performance, low-stakes locking
Redlock          | No single point of failure; survives minority instance failures | Complex, higher latency       | Critical systems needing availability
Database         | ACID guarantees, familiar tooling                  | Slower, potential bottleneck  | Systems already using a database heavily
Consensus (Raft) | Strong consistency, fault tolerant                 | Complex implementation        | Mission-critical coordination
Lock-free        | No coordination overhead                           | Limited use cases             | High-performance counters and flags

Common Pitfalls and Solutions

1. Lock Holder Dies Without Releasing

// Problem: Process crashes while holding lock
// Solution: Always use TTL/expiration
const lock = await lockService.acquireLock('resource', 'client-id', 30000); // 30s TTL

2. Clock Skew Issues

// Problem: Different clocks on different machines
// Solution: Use relative timeouts and clock drift compensation
const CLOCK_DRIFT_FACTOR = 0.01; // 1% tolerance
const validityTime = ttl - (Date.now() - startTime) - (ttl * CLOCK_DRIFT_FACTOR);

3. Split-Brain Scenarios

// Problem: Network partition causes multiple "masters"
// Solution: Require majority consensus (quorum)
const quorumSize = Math.floor(nodes.length / 2) + 1;
if (successfulLocks >= quorumSize) {
  // Safe to proceed
}

Interview-Focused Content

Junior Level (2-4 YOE)

Q: What is a distributed lock and why do we need it?
A: A distributed lock coordinates access to shared resources across multiple processes or machines. We need it to prevent race conditions when multiple services try to modify the same data simultaneously, like preventing double-spending in payment systems or ensuring only one process handles a background job.

Q: What happens if a process holding a distributed lock crashes?
A: The lock could remain held forever, blocking other processes. This is why distributed locks use TTL (time-to-live) - the lock automatically expires after a set time. The holder can also renew the lock periodically while it's still working.

Q: What's the difference between a distributed lock and a regular lock?
A: Regular locks work within a single process/machine using shared memory. Distributed locks work across network boundaries and must handle network failures, different clocks, and node crashes. They're much more complex but necessary for coordinating between different services.

Senior Level (5-8 YOE)

Q: How would you implement a distributed lock using Redis?
A: Use the Redis SET command with NX (only if not exists) and PX (expiration time): SET lock:resource value PX 30000 NX. For release, use a Lua script to atomically check ownership and delete. Include lock renewal for long operations and proper error handling for network failures.

Q: What are the trade-offs between a single Redis instance and the Redlock algorithm?
A: A single Redis instance is simpler and faster but is a single point of failure. Redlock uses multiple Redis instances and requires majority consensus, providing better availability but with higher complexity and latency. Choose based on your consistency vs. availability requirements.

Q: How do you handle the "herd effect" when many processes are waiting for the same lock?
A: Strategies include (a backoff-and-jitter sketch follows this list):

  • Exponential backoff: Increase wait time between retries
  • Jittering: Add randomness to retry intervals
  • Queue-based locks: Implement fair ordering
  • Lock-free alternatives: Use atomic operations when possible
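
A minimal sketch of the first two strategies combined: exponential backoff with random jitter around a generic acquisition function (all names are illustrative):

// Retry with exponential backoff plus full jitter to avoid a thundering herd.
// `tryAcquire` is any async function returning { acquired: boolean }.
async function acquireWithBackoff(tryAcquire, { baseDelayMs = 100, maxRetries = 8 } = {}) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const lock = await tryAcquire();
    if (lock.acquired) return lock;
    
    // Wait a random time up to an exponentially growing cap
    const cap = baseDelayMs * Math.pow(2, attempt);
    const delay = Math.random() * cap;
    await new Promise(resolve => setTimeout(resolve, delay));
  }
  throw new Error('Failed to acquire lock after maximum retries');
}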

Staff+ Level (8+ YOE)

Q: Design a distributed locking system for a global, multi-region application.
A: Considerations:

  • Regional locks: Use regional lock managers to reduce latency
  • Global coordination: Implement hierarchical locking for cross-region resources
  • Consensus protocol: Use Raft or similar for lock manager coordination
  • Fallback mechanisms: Handle cross-region network partitions gracefully
  • Monitoring: Track lock contention and performance across regions

Q: How would you migrate from a centralized locking system to a distributed one?
A: Migration strategy:

  • Dual-write phase: Acquire locks in both old and new systems
  • Validation: Verify both systems grant/deny consistently
  • Gradual rollout: Start with non-critical resources
  • Rollback plan: Ability to revert if issues arise
  • Monitoring: Track lock performance and consistency during migration

Q: What are the implications of distributed locks for system observability?
A: Observability needs:

  • Lock contention metrics: Track wait times and failure rates
  • Distributed tracing: Follow lock acquisition across services
  • Deadlock detection: Monitor for circular dependencies
  • Performance impact: Measure coordination overhead
  • Business metrics: Connect lock behavior to business outcomes

Further Reading

Related Concepts

consensus-algorithms
eventual-consistency
distributed-transactions
leader-election