Distributed Locks

Core Concept

intermediate
25-35 minutes
distributed-systems, concurrency, synchronization, consensus, coordination, mutual-exclusion

Coordinating access to shared resources across distributed systems

Overview

Distributed locks are synchronization mechanisms that coordinate access to shared resources across multiple nodes in a distributed system. They ensure mutual exclusion, preventing race conditions and maintaining data consistency when multiple processes or services need to access the same resource concurrently.

Unlike traditional locks that work within a single process or machine, distributed locks operate across network boundaries and must handle failures, network partitions, and timing issues. They are essential for building reliable distributed systems where coordination is critical for correctness.

The main technical challenges this addresses include:

  • Mutual exclusion: Ensuring only one process accesses a resource at a time
  • Deadlock prevention: Avoiding situations where processes wait indefinitely
  • Fault tolerance: Handling node failures and network partitions gracefully
  • Performance: Minimizing coordination overhead and latency

Core Principles: Requirements and Properties

Essential Properties of Distributed Locks

Safety (Mutual Exclusion): At most one client can hold a lock at any given time. This is the fundamental guarantee that prevents race conditions.

Liveness (Deadlock-free): The system must eventually grant locks to waiting clients, and locks must be releasable even if the holder fails.

Fault Tolerance: The locking mechanism must continue to function despite node failures, network partitions, and other distributed system challenges.
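
To see what the safety and liveness properties mean in practice, here is a minimal sketch, assuming a hypothetical lockService that exposes the acquire/release interface used by the implementations later in this article:

// Two clients race for the same resource; safety says at most one may win.
async function demonstrateMutualExclusion(lockService) {
  const [a, b] = await Promise.all([
    lockService.acquireLock('orders:42', 'client-A', 5000),
    lockService.acquireLock('orders:42', 'client-B', 5000)
  ]);
  
  // Safety (mutual exclusion): both acquisitions must never succeed at once
  console.assert(!(a.acquired && b.acquired), 'mutual exclusion violated');
  
  // Liveness: once the winner releases (or its TTL expires),
  // the other client must eventually be able to acquire the lock
  if (a.acquired) await lockService.releaseLock(a.lockKey, a.lockValue);
  if (b.acquired) await lockService.releaseLock(b.lockKey, b.lockValue);
}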

The Challenge of Distributed Coordination

Unlike a local mutex, a distributed lock cannot rely on shared memory or a single clock: every acquisition and release is a network call that can be delayed, dropped, or observed at different times by different nodes, and the lock holder itself may crash or pause at any point. The implementation patterns below differ mainly in how they handle these failure modes.

[System architecture diagram]

Implementation Patterns

Redis-Based Distributed Locks

class RedisDistributedLock {
  constructor(redisClient, defaultTTL = 30000) {
    this.redis = redisClient;
    this.defaultTTL = defaultTTL;
  }
  
  async acquireLock(resource, clientId, ttl = this.defaultTTL) {
    const lockKey = `lock:${resource}`;
    const expiry = Date.now() + ttl;
    const lockValue = `${clientId}:${expiry}`;
    
    // Use SET with NX (only if not exists) and PX (expire time)
    const result = await this.redis.set(
      lockKey, 
      lockValue, 
      'PX', ttl,     // Expire in milliseconds
      'NX'           // Only set if key doesn't exist
    );
    
    if (result === 'OK') {
      return {
        acquired: true,
        lockKey,
        lockValue,
        expiresAt: expiry
      };
    }
    
    return { acquired: false };
  }
  
  async renewLock(lockKey, lockValue, newTTL) {
    // Lua script for atomic renewal
    const renewScript = `
      if redis.call('GET', KEYS[1]) == ARGV[1] then
        return redis.call('PEXPIRE', KEYS[1], ARGV[2])
      else
        return 0
      end
    `;
    
    const result = await this.redis.eval(
      renewScript, 
      1, 
      lockKey, 
      lockValue, 
      newTTL
    );
    
    return result === 1;
  }
  
  async releaseLock(lockKey, lockValue) {
    // Lua script for atomic release (only release if we own the lock)
    const releaseScript = `
      if redis.call('GET', KEYS[1]) == ARGV[1] then
        return redis.call('DEL', KEYS[1])
      else
        return 0
      end
    `;
    
    const result = await this.redis.eval(
      releaseScript, 
      1, 
      lockKey, 
      lockValue
    );
    
    return result === 1;
  }
  
  async withLock(resource, clientId, operation, options = {}) {
    const { ttl = this.defaultTTL, retryDelay = 100, maxRetries = 10 } = options;
    
    let retries = 0;
    
    while (retries < maxRetries) {
      const lock = await this.acquireLock(resource, clientId, ttl);
      
      if (lock.acquired) {
        // Set up lock renewal in case the operation runs longer than the TTL
        const renewalTimer = this.setupLockRenewal(lock);
        
        try {
          return await operation();
        } finally {
          // Stop renewing and release the lock whether the operation succeeded or failed
          clearInterval(renewalTimer);
          await this.releaseLock(lock.lockKey, lock.lockValue);
        }
      }
      
      // Wait before retrying
      await this.sleep(retryDelay * Math.pow(2, retries)); // Exponential backoff
      retries++;
    }
    
    throw new Error(`Failed to acquire lock for ${resource} after ${maxRetries} retries`);
  }
  
  setupLockRenewal(lock) {
    // Renew 5 seconds before expiry, but never more often than once per second
    const renewalDelay = Math.max(1000, lock.expiresAt - Date.now() - 5000);
    
    const timer = setInterval(async () => {
      const renewed = await this.renewLock(
        lock.lockKey, 
        lock.lockValue, 
        this.defaultTTL
      );
      
      if (!renewed) {
        // We no longer own the lock (expired or taken over), so stop renewing
        console.warn(`Failed to renew lock ${lock.lockKey}`);
        clearInterval(timer);
      }
    }, renewalDelay);
    
    return timer;
  }
  
  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage example
class InventoryService {
  constructor(redisClient, database) {
    this.lock = new RedisDistributedLock(redisClient);
    this.db = database;
  }
  
  async updateInventory(productId, quantityChange) {
    const clientId = `inventory-service-${process.pid}-${Date.now()}`;
    
    return await this.lock.withLock(
      `inventory:${productId}`,
      clientId,
      async () => {
        // Critical section - guaranteed exclusive access
        const current = await this.db.getInventory(productId);
        const newQuantity = current.quantity + quantityChange;
        
        if (newQuantity < 0) {
          throw new Error('Insufficient inventory');
        }
        
        return await this.db.updateInventory(productId, newQuantity);
      },
      { ttl: 10000, maxRetries: 5 }
    );
  }
}

Redlock Algorithm (Multi-Redis)

class RedlockDistributedLock {
  constructor(redisInstances, quorumSize = null) {
    this.redisInstances = redisInstances;
    this.quorumSize = quorumSize || Math.floor(redisInstances.length / 2) + 1;
    this.clockDriftFactor = 0.01; // 1% clock drift tolerance
  }
  
  async acquireLock(resource, clientId, ttl) {
    const startTime = Date.now();
    const lockValue = `${clientId}:${Date.now()}:${Math.random()}`;
    
    // Try to acquire lock on all Redis instances
    const promises = this.redisInstances.map(redis => 
      this.acquireSingleLock(redis, resource, lockValue, ttl)
    );
    
    const results = await Promise.allSettled(promises);
    const successful = results.filter(r => r.status === 'fulfilled' && r.value).length;
    
    const drift = (ttl * this.clockDriftFactor) + 2; // Add 2ms for network delay
    const validityTime = ttl - (Date.now() - startTime) - drift;
    
    if (successful >= this.quorumSize && validityTime > 0) {
      return {
        acquired: true,
        resource,
        lockValue,
        validityTime,
        acquiredInstances: successful
      };
    } else {
      // Failed to acquire majority, release any acquired locks
      await this.releaseLock(resource, lockValue);
      return { acquired: false };
    }
  }
  
  async acquireSingleLock(redis, resource, lockValue, ttl) {
    try {
      const result = await redis.set(
        `lock:${resource}`,
        lockValue,
        'PX', ttl,
        'NX'
      );
      return result === 'OK';
    } catch (error) {
      return false;
    }
  }
  
  async releaseLock(resource, lockValue) {
    const releaseScript = `
      if redis.call('GET', KEYS[1]) == ARGV[1] then
        return redis.call('DEL', KEYS[1])
      else
        return 0
      end
    `;
    
    const promises = this.redisInstances.map(redis => 
      redis.eval(releaseScript, 1, `lock:${resource}`, lockValue).catch(() => 0)
    );
    
    await Promise.allSettled(promises);
  }
  
  async withRedlock(resource, clientId, operation, ttl = 30000) {
    const lock = await this.acquireLock(resource, clientId, ttl);
    
    if (!lock.acquired) {
      throw new Error(`Failed to acquire Redlock for ${resource}`);
    }
    
    try {
      // Check if we still have enough time to perform operation
      if (lock.validityTime < 1000) {
        throw new Error('Not enough time left on lock');
      }
      
      const result = await operation();
      return result;
    } finally {
      await this.releaseLock(resource, lock.lockValue);
    }
  }
}
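
A brief usage sketch, assuming three independent Redis client connections to separate instances; the resource name, client ID, and helper function are illustrative:

// Hypothetical setup: redisA, redisB, redisC are clients for three separate Redis instances
const redlock = new RedlockDistributedLock([redisA, redisB, redisC]); // quorum = 2

async function settlePayment(paymentId) {
  return await redlock.withRedlock(
    `payment:${paymentId}`,
    `payment-service-${process.pid}`,
    async () => {
      // Critical section: runs only while a majority of instances granted the lock
      // and enough validity time remains
      return await processPaymentOnce(paymentId); // hypothetical helper
    },
    10000 // 10-second TTL
  );
}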

Database-Based Distributed Locks

class DatabaseDistributedLock {
  constructor(database) {
    this.db = database;
    this.initializeLockTable();
  }
  
  async initializeLockTable() {
    await this.db.query(`
      CREATE TABLE IF NOT EXISTS distributed_locks (
        resource_name VARCHAR(255) PRIMARY KEY,
        owner_id VARCHAR(255) NOT NULL,
        acquired_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        expires_at TIMESTAMP NOT NULL,
        INDEX idx_expires_at (expires_at)
      )
    `);
  }
  
  async acquireLock(resource, ownerId, ttlMs = 30000) {
    const expiresAt = new Date(Date.now() + ttlMs);
    
    try {
      // Try to insert new lock
      await this.db.query(
        `INSERT INTO distributed_locks (resource_name, owner_id, expires_at) 
         VALUES (?, ?, ?)`,
        [resource, ownerId, expiresAt]
      );
      
      return {
        acquired: true,
        resource,
        ownerId,
        expiresAt
      };
    } catch (error) {
      if (error.code === 'ER_DUP_ENTRY') {
        // Lock already exists, check if it's expired
        return await this.tryAcquireExpiredLock(resource, ownerId, expiresAt);
      }
      throw error;
    }
  }
  
  async tryAcquireExpiredLock(resource, ownerId, expiresAt) {
    // Try to update expired lock atomically
    const result = await this.db.query(
      `UPDATE distributed_locks 
       SET owner_id = ?, acquired_at = CURRENT_TIMESTAMP, expires_at = ?
       WHERE resource_name = ? AND expires_at < CURRENT_TIMESTAMP`,
      [ownerId, expiresAt, resource]
    );
    
    if (result.affectedRows > 0) {
      return {
        acquired: true,
        resource,
        ownerId,
        expiresAt
      };
    }
    
    return { acquired: false };
  }
  
  async releaseLock(resource, ownerId) {
    const result = await this.db.query(
      `DELETE FROM distributed_locks 
       WHERE resource_name = ? AND owner_id = ?`,
      [resource, ownerId]
    );
    
    return result.affectedRows > 0;
  }
  
  async renewLock(resource, ownerId, ttlMs) {
    const expiresAt = new Date(Date.now() + ttlMs);
    
    const result = await this.db.query(
      `UPDATE distributed_locks 
       SET expires_at = ? 
       WHERE resource_name = ? AND owner_id = ? AND expires_at > CURRENT_TIMESTAMP`,
      [expiresAt, resource, ownerId]
    );
    
    return result.affectedRows > 0;
  }
  
  // Cleanup expired locks periodically
  startCleanupTask(intervalMs = 60000) {
    setInterval(async () => {
      try {
        const result = await this.db.query(
          `DELETE FROM distributed_locks WHERE expires_at < CURRENT_TIMESTAMP`
        );
        
        if (result.affectedRows > 0) {
          console.log(`Cleaned up ${result.affectedRows} expired locks`);
        }
      } catch (error) {
        console.error('Lock cleanup failed:', error);
      }
    }, intervalMs);
  }
}
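
A short usage sketch, assuming a MySQL-style client whose query() result exposes affectedRows; the job name, owner ID, and work function are illustrative:

// Hypothetical usage: ensure only one worker runs a scheduled job at a time
const dbLock = new DatabaseDistributedLock(db);
dbLock.startCleanupTask(); // periodically purge expired lock rows

async function runNightlyReport() {
  const ownerId = `report-worker-${process.pid}`;
  const lock = await dbLock.acquireLock('job:nightly-report', ownerId, 60000);
  
  if (!lock.acquired) {
    return; // another worker already holds the job
  }
  
  try {
    await generateReport(); // hypothetical work; call renewLock() if it may exceed the TTL
  } finally {
    await dbLock.releaseLock('job:nightly-report', ownerId);
  }
}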

Consensus-Based Locks (Raft)

class RaftDistributedLock {
  constructor(raftCluster) {
    this.raft = raftCluster;
    this.locks = new Map(); // Local lock state
  }
  
  async acquireLock(resource, clientId, ttl = 30000) {
    const operation = {
      type: 'ACQUIRE_LOCK',
      resource,
      clientId,
      ttl,
      timestamp: Date.now()
    };
    
    try {
      // Submit operation to Raft cluster
      const result = await this.raft.submitOperation(operation);
      
      if (result.success) {
        // Set up local expiration
        setTimeout(() => {
          this.expireLock(resource, clientId);
        }, ttl);
        
        return {
          acquired: true,
          resource,
          clientId,
          term: result.term,
          index: result.index
        };
      }
      
      return { acquired: false, reason: result.reason };
    } catch (error) {
      return { acquired: false, error: error.message };
    }
  }
  
  async releaseLock(resource, clientId) {
    const operation = {
      type: 'RELEASE_LOCK',
      resource,
      clientId,
      timestamp: Date.now()
    };
    
    return await this.raft.submitOperation(operation);
  }
  
  // Apply committed operations from Raft
  applyOperation(operation) {
    switch (operation.type) {
      case 'ACQUIRE_LOCK':
        return this.applyLockAcquisition(operation);
      case 'RELEASE_LOCK':
        return this.applyLockRelease(operation);
      default:
        throw new Error(`Unknown operation type: ${operation.type}`);
    }
  }
  
  applyLockAcquisition(operation) {
    const { resource, clientId, ttl, timestamp } = operation;
    
    // Check if lock already exists and is not expired
    const existingLock = this.locks.get(resource);
    if (existingLock && existingLock.expiresAt > Date.now()) {
      return {
        success: false,
        reason: `Lock held by ${existingLock.clientId}`
      };
    }
    
    // Acquire lock
    this.locks.set(resource, {
      clientId,
      acquiredAt: timestamp,
      expiresAt: timestamp + ttl
    });
    
    return { success: true };
  }
  
  applyLockRelease(operation) {
    const { resource, clientId } = operation;
    
    const lock = this.locks.get(resource);
    if (!lock) {
      return { success: false, reason: 'Lock not found' };
    }
    
    if (lock.clientId !== clientId) {
      return { success: false, reason: 'Not lock owner' };
    }
    
    this.locks.delete(resource);
    return { success: true };
  }
  
  expireLock(resource, clientId) {
    const lock = this.locks.get(resource);
    if (lock && lock.clientId === clientId) {
      this.locks.delete(resource);
      console.log(`Lock expired for resource: ${resource}`);
    }
  }
}
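
A usage sketch, assuming a raftCluster client that exposes submitOperation() and routes operations through the current leader; the resource name and work function are illustrative:

// Hypothetical usage of the Raft-backed lock
const raftLock = new RaftDistributedLock(raftCluster);

async function rebalanceShards() {
  const clientId = `shard-manager-${process.pid}`;
  const lock = await raftLock.acquireLock('cluster:rebalance', clientId, 60000);
  
  if (!lock.acquired) {
    console.log(`Rebalance skipped: ${lock.reason || lock.error}`);
    return;
  }
  
  try {
    await performRebalance(); // hypothetical work protected by the lock
  } finally {
    await raftLock.releaseLock('cluster:rebalance', clientId);
  }
}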

Advanced Patterns and Considerations

Lock-Free Algorithms Alternative

// Sometimes you can avoid locks entirely using atomic operations
class LockFreeCounter {
  constructor(redis) {
    this.redis = redis;
  }
  
  // Atomic increment using Redis
  async increment(key, delta = 1) {
    return await this.redis.incrby(key, delta);
  }
  
  // Compare-and-swap pattern
  async compareAndSwap(key, expectedValue, newValue) {
    const script = `
      local current = redis.call('GET', KEYS[1])
      if current == ARGV[1] then
        redis.call('SET', KEYS[1], ARGV[2])
        return 1
      else
        return 0
      end
    `;
    
    const result = await this.redis.eval(script, 1, key, expectedValue, newValue);
    return result === 1;
  }
  
  // Optimistic concurrency control
  async updateWithVersion(key, updateFunction) {
    let retries = 0;
    const maxRetries = 10;
    
    while (retries < maxRetries) {
      // Get current value and version
      const current = await this.redis.hgetall(`${key}:versioned`);
      
      if (!current.value) {
        // Initialize if the key doesn't exist yet. HSETNX sets a single field,
        // so claim the version field first, then write the initial value.
        const initialValue = updateFunction(null);
        const created = await this.redis.hsetnx(`${key}:versioned`, 'version', 1);
        if (created) {
          await this.redis.hset(`${key}:versioned`, 'value', initialValue);
          return { value: initialValue, version: 1 };
        }
      } else {
        // Apply update function
        const newValue = updateFunction(current.value);
        const newVersion = parseInt(current.version) + 1;
        
        // Atomic update if version unchanged
        const script = `
          local current_version = redis.call('HGET', KEYS[1], 'version')
          if current_version == ARGV[1] then
            redis.call('HMSET', KEYS[1], 'value', ARGV[2], 'version', ARGV[3])
            return 1
          else
            return 0
          end
        `;
        
        const updated = await this.redis.eval(
          script, 1, `${key}:versioned`, 
          current.version, newValue, newVersion
        );
        
        if (updated === 1) {
          return { value: newValue, version: newVersion };
        }
      }
      
      retries++;
      await this.sleep(Math.random() * 100); // Random jitter
    }
    
    throw new Error('Failed to update after maximum retries');
  }
  
  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
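
For example, the increment and optimistic-update helpers can replace a lock entirely for simple counters and append-style updates (a sketch; the key names and entry format are illustrative):

// Hypothetical usage of the lock-free helpers
const counter = new LockFreeCounter(redisClient);

async function recordPageView(pageId) {
  // Plain atomic increment; no coordination needed at all
  await counter.increment(`views:${pageId}`);
}

async function appendAuditEntry(entry) {
  // Optimistic update: automatically retried if another writer bumps the version
  return await counter.updateWithVersion('audit:log', current => {
    const entries = current ? JSON.parse(current) : [];
    entries.push(entry);
    return JSON.stringify(entries);
  });
}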

Performance Monitoring and Debugging

class MonitoredDistributedLock {
  constructor(lockImplementation) {
    this.lock = lockImplementation;
    this.metrics = {
      acquisitions: 0,
      acquisitionFailures: 0,
      totalAcquisitionTime: 0,
      holdTimes: [],
      contentionEvents: 0
    };
  }
  
  async acquireLock(resource, clientId, ttl) {
    const startTime = Date.now();
    
    try {
      const result = await this.lock.acquireLock(resource, clientId, ttl);
      
      const acquisitionTime = Date.now() - startTime;
      this.metrics.totalAcquisitionTime += acquisitionTime;
      
      if (result.acquired) {
        this.metrics.acquisitions++;
        result._acquiredAt = Date.now();
      } else {
        this.metrics.acquisitionFailures++;
        this.metrics.contentionEvents++;
      }
      
      return result;
    } catch (error) {
      this.metrics.acquisitionFailures++;
      throw error;
    }
  }
  
  async releaseLock(lockInfo) {
    const result = await this.lock.releaseLock(lockInfo.resource, lockInfo.clientId);
    
    if (lockInfo._acquiredAt) {
      const holdTime = Date.now() - lockInfo._acquiredAt;
      this.metrics.holdTimes.push(holdTime);
      
      // Keep only recent hold times for memory efficiency
      if (this.metrics.holdTimes.length > 1000) {
        this.metrics.holdTimes = this.metrics.holdTimes.slice(-1000);
      }
    }
    
    return result;
  }
  
  getMetrics() {
    const avgAcquisitionTime = this.metrics.acquisitions > 0 ? 
      this.metrics.totalAcquisitionTime / this.metrics.acquisitions : 0;
    
    const avgHoldTime = this.metrics.holdTimes.length > 0 ?
      this.metrics.holdTimes.reduce((a, b) => a + b, 0) / this.metrics.holdTimes.length : 0;
    
    return {
      totalAcquisitions: this.metrics.acquisitions,
      acquisitionFailures: this.metrics.acquisitionFailures,
      successRate: this.metrics.acquisitions / 
        (this.metrics.acquisitions + this.metrics.acquisitionFailures),
      avgAcquisitionTime,
      avgHoldTime,
      contentionEvents: this.metrics.contentionEvents
    };
  }
}
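
A sketch of wrapping one of the implementations above and reading the metrics; the wrapped lock, resource names, and reporting interval are illustrative:

// Hypothetical: wrap the database-backed lock and report contention figures
const monitoredLock = new MonitoredDistributedLock(new DatabaseDistributedLock(db));

async function processQueueItem(itemId) {
  const lock = await monitoredLock.acquireLock(`queue:${itemId}`, 'worker-1', 15000);
  if (!lock.acquired) return;
  
  try {
    await handleItem(itemId); // hypothetical work
  } finally {
    // The wrapper expects the resource and the client/owner ID it was acquired with
    await monitoredLock.releaseLock({ ...lock, clientId: lock.ownerId });
  }
}

// Periodically log success rate, acquisition latency, and hold times
setInterval(() => console.log(monitoredLock.getMetrics()), 60000);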

Deep Dive Analysis

Comparison of Lock Implementations

Implementation   | Pros                                               | Cons                          | Best Use Case
Redis (single)   | Simple, fast, low latency                          | Single point of failure       | High-performance, low-stakes locking
Redlock          | No single point of failure; survives minority instance failures | Complex, higher latency       | Critical systems needing availability
Database         | ACID guarantees, familiar tooling                  | Slower, potential bottleneck  | Systems already using a database heavily
Consensus (Raft) | Strong consistency, fault tolerant                 | Complex implementation        | Mission-critical coordination
Lock-free        | No coordination overhead                           | Limited use cases             | High-performance counters and flags

Common Pitfalls and Solutions

1. Lock Holder Dies Without Releasing

// Problem: Process crashes while holding lock
// Solution: Always use TTL/expiration
const lock = await lockService.acquireLock('resource', 'client-id', 30000); // 30s TTL

2. Clock Skew Issues

// Problem: Different clocks on different machines
// Solution: Use relative timeouts and clock drift compensation
const CLOCK_DRIFT_FACTOR = 0.01; // 1% tolerance
const validityTime = ttl - (Date.now() - startTime) - (ttl * CLOCK_DRIFT_FACTOR);

3. Split-Brain Scenarios

// Problem: Network partition causes multiple "masters"
// Solution: Require majority consensus (quorum)
const quorumSize = Math.floor(nodes.length / 2) + 1;
if (successfulLocks >= quorumSize) {
  // Safe to proceed
}

Interview-Focused Content

Junior Level (2-4 YOE)

Q: What is a distributed lock and why do we need it?
A: A distributed lock coordinates access to shared resources across multiple processes or machines. We need it to prevent race conditions when multiple services try to modify the same data simultaneously, like preventing double-spending in payment systems or ensuring only one process handles a background job.

Q: What happens if a process holding a distributed lock crashes?
A: The lock could remain held forever, blocking other processes. This is why distributed locks use TTL (time-to-live) - the lock automatically expires after a set time. The holder can also renew the lock periodically while it's still working.

Q: What's the difference between a distributed lock and a regular lock?
A: Regular locks work within a single process/machine using shared memory. Distributed locks work across network boundaries and must handle network failures, different clocks, and node crashes. They're much more complex but necessary for coordinating between different services.

Senior Level (5-8 YOE)

Q: How would you implement a distributed lock using Redis?
A: Use the Redis SET command with NX (only if not exists) and PX (expiration time): SET lock:resource value PX 30000 NX. For release, use a Lua script to atomically check ownership and delete. Include lock renewal for long operations and proper error handling for network failures.

Q: What are the trade-offs between a single Redis instance and the Redlock algorithm?
A: A single Redis instance is simpler and faster but is a single point of failure. Redlock uses multiple Redis instances and requires majority consensus, providing better availability but with higher complexity and latency. Choose based on your consistency vs. availability requirements.

Q: How do you handle the "herd effect" when many processes are waiting for the same lock?
A: Strategies include (a backoff-and-jitter sketch follows this list):

  • Exponential backoff: Increase wait time between retries
  • Jittering: Add randomness to retry intervals
  • Queue-based locks: Implement fair ordering
  • Lock-free alternatives: Use atomic operations when possible
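
A minimal sketch of the first two strategies combined: exponential backoff with random jitter around a generic acquisition function (all names are illustrative):

// Retry with exponential backoff plus full jitter to avoid a thundering herd.
// `tryAcquire` is any async function returning { acquired: boolean }.
async function acquireWithBackoff(tryAcquire, { baseDelayMs = 100, maxRetries = 8 } = {}) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const lock = await tryAcquire();
    if (lock.acquired) return lock;
    
    // Wait a random time up to an exponentially growing cap
    const cap = baseDelayMs * Math.pow(2, attempt);
    const delay = Math.random() * cap;
    await new Promise(resolve => setTimeout(resolve, delay));
  }
  throw new Error('Failed to acquire lock after maximum retries');
}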

Staff+ Level (8+ YOE)

Q: Design a distributed locking system for a global, multi-region application.
A: Considerations:

  • Regional locks: Use regional lock managers to reduce latency
  • Global coordination: Implement hierarchical locking for cross-region resources
  • Consensus protocol: Use Raft or similar for lock manager coordination
  • Fallback mechanisms: Handle cross-region network partitions gracefully
  • Monitoring: Track lock contention and performance across regions

Q: How would you migrate from a centralized locking system to a distributed one?
A: Migration strategy:

  • Dual-write phase: Acquire locks in both old and new systems
  • Validation: Verify both systems grant/deny consistently
  • Gradual rollout: Start with non-critical resources
  • Rollback plan: Ability to revert if issues arise
  • Monitoring: Track lock performance and consistency during migration

Q: What are the implications of distributed locks for system observability?
A: Observability needs:

  • Lock contention metrics: Track wait times and failure rates
  • Distributed tracing: Follow lock acquisition across services
  • Deadlock detection: Monitor for circular dependencies
  • Performance impact: Measure coordination overhead
  • Business metrics: Connect lock behavior to business outcomes

Further Reading

Related Concepts

consensus-algorithms
eventual-consistency
distributed-transactions
leader-election