Distributed Locks
Core Concept
Coordinating access to shared resources across distributed systems
Overview
Distributed locks are synchronization mechanisms that coordinate access to shared resources across multiple nodes in a distributed system. They ensure mutual exclusion, preventing race conditions and maintaining data consistency when multiple processes or services need to access the same resource concurrently.
Unlike traditional locks that work within a single process or machine, distributed locks operate across network boundaries and must handle failures, network partitions, and timing issues. They are essential for building reliable distributed systems where coordination is critical for correctness.
The main technical challenges this addresses include:
- Mutual exclusion: Ensuring only one process accesses a resource at a time
- Deadlock prevention: Avoiding situations where processes wait indefinitely
- Fault tolerance: Handling node failures and network partitions gracefully
- Performance: Minimizing coordination overhead and latency
Core Principles: Requirements and Properties
Essential Properties of Distributed Locks
Safety (Mutual Exclusion): At most one client can hold a lock at any given time. This is the fundamental guarantee that prevents race conditions.
Liveness (Deadlock-free): The system must eventually grant locks to waiting clients, and locks must be releasable even if the holder fails.
Fault Tolerance: The locking mechanism must continue to function despite node failures, network partitions, and other distributed system challenges.
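To see why the safety property demands an atomic acquire, compare a naive check-then-set sequence with a single atomic command. This is a minimal sketch assuming an ioredis-style client passed in as redis:

// BROKEN: check-then-set is not atomic. Two clients can both observe
// the key as absent and both "acquire" the lock, violating safety.
async function naiveAcquire(redis, resource, clientId) {
  const holder = await redis.get(`lock:${resource}`);
  if (holder === null) {
    // Another client may set the key between the GET and this SET
    await redis.set(`lock:${resource}`, clientId);
    return true;
  }
  return false;
}

// SAFE: a single atomic command decides the race on the server
async function atomicAcquire(redis, resource, clientId, ttlMs) {
  // NX: only set if absent; PX: expire, so a crashed holder can't block forever
  const result = await redis.set(`lock:${resource}`, clientId, 'PX', ttlMs, 'NX');
  return result === 'OK';
}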
The Challenge of Distributed Coordination
Inside a single process, a mutex can rely on shared memory and one clock. Across machines, neither exists: messages are delayed or lost, clocks drift apart, and a lock holder can crash without releasing. Each implementation pattern below makes a different trade-off against these failure modes.
Implementation Patterns
Redis-Based Distributed Locks
class RedisDistributedLock {
constructor(redisClient, defaultTTL = 30000) {
this.redis = redisClient;
this.defaultTTL = defaultTTL;
}
async acquireLock(resource, clientId, ttl = this.defaultTTL) {
const lockKey = `lock:${resource}`;
const expiry = Date.now() + ttl;
const lockValue = `${clientId}:${expiry}`;
// Use SET with NX (only if not exists) and PX (expire time)
const result = await this.redis.set(
lockKey,
lockValue,
'PX', ttl, // Expire in milliseconds
'NX' // Only set if key doesn't exist
);
if (result === 'OK') {
return {
acquired: true,
lockKey,
lockValue,
expiresAt: expiry
};
}
return { acquired: false };
}
async renewLock(lockKey, lockValue, newTTL) {
// Lua script for atomic renewal
const renewScript = `
if redis.call('GET', KEYS[1]) == ARGV[1] then
return redis.call('PEXPIRE', KEYS[1], ARGV[2])
else
return 0
end
`;
const result = await this.redis.eval(
renewScript,
1,
lockKey,
lockValue,
newTTL
);
return result === 1;
}
async releaseLock(lockKey, lockValue) {
// Lua script for atomic release (only release if we own the lock)
const releaseScript = `
if redis.call('GET', KEYS[1]) == ARGV[1] then
return redis.call('DEL', KEYS[1])
else
return 0
end
`;
const result = await this.redis.eval(
releaseScript,
1,
lockKey,
lockValue
);
return result === 1;
}
async withLock(resource, clientId, operation, options = {}) {
const { ttl = this.defaultTTL, retryDelay = 100, maxRetries = 10 } = options;
let retries = 0;
while (retries < maxRetries) {
const lock = await this.acquireLock(resource, clientId, ttl);
if (lock.acquired) {
// Set up lock renewal in case the operation outlives the TTL
const renewalHandle = this.setupLockRenewal(lock);
try {
  return await operation();
} finally {
  // Stop renewing and release the lock whether the operation
  // succeeded or threw
  clearInterval(renewalHandle);
  await this.releaseLock(lock.lockKey, lock.lockValue);
}
}
// Wait before retrying
await this.sleep(retryDelay * Math.pow(2, retries)); // Exponential backoff
retries++;
}
throw new Error(`Failed to acquire lock for ${resource} after ${maxRetries} retries`);
}
setupLockRenewal(lock) {
  // Schedule renewals so each one fires well before the current TTL expires
  const renewalPeriod = Math.max(1000, lock.expiresAt - Date.now() - 5000);
  const handle = setInterval(async () => {
    const renewed = await this.renewLock(
      lock.lockKey,
      lock.lockValue,
      this.defaultTTL
    );
    if (!renewed) {
      // The lock expired or was taken over; stop renewing
      console.warn(`Failed to renew lock ${lock.lockKey}`);
      clearInterval(handle);
    }
  }, renewalPeriod);
  return handle;
}
sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
}
// Usage example
class InventoryService {
constructor(redisClient, database) {
this.lock = new RedisDistributedLock(redisClient);
this.db = database;
}
async updateInventory(productId, quantityChange) {
const clientId = `inventory-service-${process.pid}-${Date.now()}`;
return await this.lock.withLock(
`inventory:${productId}`,
clientId,
async () => {
// Critical section - guaranteed exclusive access
const current = await this.db.getInventory(productId);
const newQuantity = current.quantity + quantityChange;
if (newQuantity < 0) {
throw new Error('Insufficient inventory');
}
return await this.db.updateInventory(productId, newQuantity);
},
{ ttl: 10000, maxRetries: 5 }
);
}
}
Redlock Algorithm (Multi-Redis)
class RedlockDistributedLock {
constructor(redisInstances, quorumSize = null) {
this.redisInstances = redisInstances;
this.quorumSize = quorumSize || Math.floor(redisInstances.length / 2) + 1;
this.clockDriftFactor = 0.01; // 1% clock drift tolerance
}
async acquireLock(resource, clientId, ttl) {
const startTime = Date.now();
const lockValue = `${clientId}:${Date.now()}:${Math.random()}`;
// Try to acquire lock on all Redis instances
const promises = this.redisInstances.map(redis =>
this.acquireSingleLock(redis, resource, lockValue, ttl)
);
const results = await Promise.allSettled(promises);
const successful = results.filter(r => r.status === 'fulfilled' && r.value).length;
const drift = (ttl * this.clockDriftFactor) + 2; // Add 2ms for network delay
const validityTime = ttl - (Date.now() - startTime) - drift;
if (successful >= this.quorumSize && validityTime > 0) {
return {
acquired: true,
resource,
lockValue,
validityTime,
acquiredInstances: successful
};
} else {
// Failed to acquire majority, release any acquired locks
await this.releaseLock(resource, lockValue);
return { acquired: false };
}
}
async acquireSingleLock(redis, resource, lockValue, ttl) {
try {
const result = await redis.set(
`lock:${resource}`,
lockValue,
'PX', ttl,
'NX'
);
return result === 'OK';
} catch (error) {
return false;
}
}
async releaseLock(resource, lockValue) {
const releaseScript = `
if redis.call('GET', KEYS[1]) == ARGV[1] then
return redis.call('DEL', KEYS[1])
else
return 0
end
`;
const promises = this.redisInstances.map(redis =>
redis.eval(releaseScript, 1, `lock:${resource}`, lockValue).catch(() => 0)
);
await Promise.allSettled(promises);
}
async withRedlock(resource, clientId, operation, ttl = 30000) {
const lock = await this.acquireLock(resource, clientId, ttl);
if (!lock.acquired) {
throw new Error(`Failed to acquire Redlock for ${resource}`);
}
try {
// Check if we still have enough time to perform operation
if (lock.validityTime < 1000) {
throw new Error('Not enough time left on lock');
}
const result = await operation();
return result;
} finally {
await this.releaseLock(resource, lock.lockValue);
}
}
}
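A usage sketch for the class above, assuming three independent ioredis instances; the hostnames and the processSettlement function are placeholders:

const Redis = require('ioredis');

const redlock = new RedlockDistributedLock([
  new Redis({ host: 'redis-a.internal' }),
  new Redis({ host: 'redis-b.internal' }),
  new Redis({ host: 'redis-c.internal' }),
]); // quorum defaults to 2 of 3

async function settlePayment(paymentId) {
  return redlock.withRedlock(`payment:${paymentId}`, `worker-${process.pid}`, async () => {
    // At most one worker settles this payment while a majority of
    // instances agree the lock is held and the validity window remains
    return processSettlement(paymentId); // hypothetical business logic
  }, 10000);
}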
Database-Based Distributed Locks
class DatabaseDistributedLock {
constructor(database) {
this.db = database;
// Fire-and-forget table creation; in production, prefer running this
// as a migration so the first acquire can't race table setup
this.initializeLockTable();
}
async initializeLockTable() {
await this.db.query(`
CREATE TABLE IF NOT EXISTS distributed_locks (
resource_name VARCHAR(255) PRIMARY KEY,
owner_id VARCHAR(255) NOT NULL,
acquired_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
expires_at TIMESTAMP NOT NULL,
INDEX idx_expires_at (expires_at)
)
`);
}
async acquireLock(resource, ownerId, ttlMs = 30000) {
const expiresAt = new Date(Date.now() + ttlMs);
try {
// Try to insert new lock
await this.db.query(
`INSERT INTO distributed_locks (resource_name, owner_id, expires_at)
VALUES (?, ?, ?)`,
[resource, ownerId, expiresAt]
);
return {
acquired: true,
resource,
ownerId,
expiresAt
};
} catch (error) {
if (error.code === 'ER_DUP_ENTRY') {
// Lock already exists, check if it's expired
return await this.tryAcquireExpiredLock(resource, ownerId, expiresAt);
}
throw error;
}
}
async tryAcquireExpiredLock(resource, ownerId, expiresAt) {
// Try to update expired lock atomically
const result = await this.db.query(
`UPDATE distributed_locks
SET owner_id = ?, acquired_at = CURRENT_TIMESTAMP, expires_at = ?
WHERE resource_name = ? AND expires_at < CURRENT_TIMESTAMP`,
[ownerId, expiresAt, resource]
);
if (result.affectedRows > 0) {
return {
acquired: true,
resource,
ownerId,
expiresAt
};
}
return { acquired: false };
}
async releaseLock(resource, ownerId) {
const result = await this.db.query(
`DELETE FROM distributed_locks
WHERE resource_name = ? AND owner_id = ?`,
[resource, ownerId]
);
return result.affectedRows > 0;
}
async renewLock(resource, ownerId, ttlMs) {
const expiresAt = new Date(Date.now() + ttlMs);
const result = await this.db.query(
`UPDATE distributed_locks
SET expires_at = ?
WHERE resource_name = ? AND owner_id = ? AND expires_at > CURRENT_TIMESTAMP`,
[expiresAt, resource, ownerId]
);
return result.affectedRows > 0;
}
// Cleanup expired locks periodically
startCleanupTask(intervalMs = 60000) {
setInterval(async () => {
try {
const result = await this.db.query(
`DELETE FROM distributed_locks WHERE expires_at < CURRENT_TIMESTAMP`
);
if (result.affectedRows > 0) {
console.log(`Cleaned up ${result.affectedRows} expired locks`);
}
} catch (error) {
console.error('Lock cleanup failed:', error);
}
}, intervalMs);
}
}
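A usage sketch: guarding a scheduled job so only one service instance runs it. This assumes a mysql2-style connection pool; runReport is a hypothetical job function:

const dbLock = new DatabaseDistributedLock(pool);
dbLock.startCleanupTask(); // reap expired rows every 60s

async function runNightlyReport() {
  const ownerId = `reporter-${process.pid}`;
  const lock = await dbLock.acquireLock('job:nightly-report', ownerId, 10 * 60 * 1000);
  if (!lock.acquired) {
    return; // another instance already holds the job
  }
  try {
    await runReport();
  } finally {
    await dbLock.releaseLock('job:nightly-report', ownerId);
  }
}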
Consensus-Based Locks (Raft)
class RaftDistributedLock {
constructor(raftCluster) {
this.raft = raftCluster;
this.locks = new Map(); // Local lock state
}
async acquireLock(resource, clientId, ttl = 30000) {
const operation = {
type: 'ACQUIRE_LOCK',
resource,
clientId,
ttl,
timestamp: Date.now()
};
try {
// Submit operation to Raft cluster
const result = await this.raft.submitOperation(operation);
if (result.success) {
// Local expiration is a best-effort convenience on this node; the
// authoritative expiry check happens when operations are applied
setTimeout(() => {
this.expireLock(resource, clientId);
}, ttl);
return {
acquired: true,
resource,
clientId,
term: result.term,
index: result.index
};
}
return { acquired: false, reason: result.reason };
} catch (error) {
return { acquired: false, error: error.message };
}
}
async releaseLock(resource, clientId) {
const operation = {
type: 'RELEASE_LOCK',
resource,
clientId,
timestamp: Date.now()
};
return await this.raft.submitOperation(operation);
}
// Apply committed operations from Raft
applyOperation(operation) {
switch (operation.type) {
case 'ACQUIRE_LOCK':
return this.applyLockAcquisition(operation);
case 'RELEASE_LOCK':
return this.applyLockRelease(operation);
default:
throw new Error(`Unknown operation type: ${operation.type}`);
}
}
applyLockAcquisition(operation) {
const { resource, clientId, ttl, timestamp } = operation;
// Check if the lock exists and is unexpired, comparing against the
// operation's replicated timestamp (not local time) so every replica
// applies the entry deterministically
const existingLock = this.locks.get(resource);
if (existingLock && existingLock.expiresAt > timestamp) {
return {
success: false,
reason: `Lock held by ${existingLock.clientId}`
};
}
// Acquire lock
this.locks.set(resource, {
clientId,
acquiredAt: timestamp,
expiresAt: timestamp + ttl
});
return { success: true };
}
applyLockRelease(operation) {
const { resource, clientId } = operation;
const lock = this.locks.get(resource);
if (!lock) {
return { success: false, reason: 'Lock not found' };
}
if (lock.clientId !== clientId) {
return { success: false, reason: 'Not lock owner' };
}
this.locks.delete(resource);
return { success: true };
}
expireLock(resource, clientId) {
const lock = this.locks.get(resource);
if (lock && lock.clientId === clientId) {
this.locks.delete(resource);
console.log(`Lock expired for resource: ${resource}`);
}
}
}
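How this class gets wired up depends entirely on the Raft library; the following is a sketch assuming a hypothetical raftCluster that exposes submitOperation and an onCommit hook invoked on every replica:

const lockService = new RaftDistributedLock(raftCluster);

// Every replica applies committed entries in the same order, so the
// lock table stays identical across the cluster
raftCluster.onCommit(op => lockService.applyOperation(op));

async function becomeLeader(nodeId) {
  const lock = await lockService.acquireLock('leader-election', nodeId, 15000);
  return lock.acquired; // true: act as leader until expiry or release
}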
Advanced Patterns and Considerations
Lock-Free Algorithms Alternative
// Sometimes you can avoid locks entirely using atomic operations
class LockFreeCounter {
constructor(redis) {
this.redis = redis;
}
// Atomic increment using Redis
async increment(key, delta = 1) {
return await this.redis.incrby(key, delta);
}
// Compare-and-swap pattern
async compareAndSwap(key, expectedValue, newValue) {
const script = `
local current = redis.call('GET', KEYS[1])
if current == ARGV[1] then
redis.call('SET', KEYS[1], ARGV[2])
return 1
else
return 0
end
`;
const result = await this.redis.eval(script, 1, key, expectedValue, newValue);
return result === 1;
}
// Optimistic concurrency control
async updateWithVersion(key, updateFunction) {
let retries = 0;
const maxRetries = 10;
while (retries < maxRetries) {
// Get current value and version
const current = await this.redis.hgetall(`${key}:versioned`);
if (!current.value) {
  // Initialize atomically if the key doesn't exist yet; HSETNX sets only
  // a single field, so a small Lua script creates both fields at once
  const initial = { value: updateFunction(null), version: 1 };
  const createScript = `
    if redis.call('EXISTS', KEYS[1]) == 0 then
      redis.call('HSET', KEYS[1], 'value', ARGV[1], 'version', ARGV[2])
      return 1
    else
      return 0
    end
  `;
  const created = await this.redis.eval(
    createScript, 1, `${key}:versioned`,
    initial.value, initial.version
  );
  if (created === 1) return initial;
} else {
// Apply update function
const newValue = updateFunction(current.value);
const newVersion = parseInt(current.version) + 1;
// Atomic update if version unchanged
const script = `
local current_version = redis.call('HGET', KEYS[1], 'version')
if current_version == ARGV[1] then
redis.call('HMSET', KEYS[1], 'value', ARGV[2], 'version', ARGV[3])
return 1
else
return 0
end
`;
const updated = await this.redis.eval(
script, 1, `${key}:versioned`,
current.version, newValue, newVersion
);
if (updated === 1) {
return { value: newValue, version: newVersion };
}
}
retries++;
await this.sleep(Math.random() * 100); // Random jitter
}
throw new Error('Failed to update after maximum retries');
}
sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
}
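A short usage sketch for the compare-and-swap pattern above, claiming a job exactly once; the key names and processJob are illustrative:

const counter = new LockFreeCounter(redis);

async function claimAndRunJob(jobId) {
  // Whichever worker wins the CAS transitions the job to 'running';
  // every other worker sees the swap fail and skips the job
  const won = await counter.compareAndSwap(`job:${jobId}:state`, 'pending', 'running');
  if (won) {
    await processJob(jobId); // hypothetical worker function
  }
}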
Performance Monitoring and Debugging
class MonitoredDistributedLock {
constructor(lockImplementation) {
this.lock = lockImplementation;
this.metrics = {
acquisitions: 0,
acquisitionFailures: 0,
totalAcquisitionTime: 0,
holdTimes: [],
contentionEvents: 0
};
}
async acquireLock(resource, clientId, ttl) {
const startTime = Date.now();
try {
const result = await this.lock.acquireLock(resource, clientId, ttl);
const acquisitionTime = Date.now() - startTime;
if (result.acquired) {
  this.metrics.acquisitions++;
  // Count time against successful attempts only, so the average
  // isn't skewed by failures the denominator never sees
  this.metrics.totalAcquisitionTime += acquisitionTime;
  result._acquiredAt = Date.now();
} else {
  this.metrics.acquisitionFailures++;
  this.metrics.contentionEvents++;
}
return result;
} catch (error) {
this.metrics.acquisitionFailures++;
throw error;
}
}
async releaseLock(lockInfo) {
  // Assumes the wrapped implementation releases by (resource, clientId);
  // adapt this call for implementations keyed by lockKey/lockValue
  const result = await this.lock.releaseLock(lockInfo.resource, lockInfo.clientId);
if (lockInfo._acquiredAt) {
const holdTime = Date.now() - lockInfo._acquiredAt;
this.metrics.holdTimes.push(holdTime);
// Keep only recent hold times for memory efficiency
if (this.metrics.holdTimes.length > 1000) {
this.metrics.holdTimes = this.metrics.holdTimes.slice(-1000);
}
}
return result;
}
getMetrics() {
const avgAcquisitionTime = this.metrics.acquisitions > 0 ?
this.metrics.totalAcquisitionTime / this.metrics.acquisitions : 0;
const avgHoldTime = this.metrics.holdTimes.length > 0 ?
this.metrics.holdTimes.reduce((a, b) => a + b, 0) / this.metrics.holdTimes.length : 0;
return {
totalAcquisitions: this.metrics.acquisitions,
acquisitionFailures: this.metrics.acquisitionFailures,
successRate: this.metrics.acquisitions /
(this.metrics.acquisitions + this.metrics.acquisitionFailures),
avgAcquisitionTime,
avgHoldTime,
contentionEvents: this.metrics.contentionEvents
};
}
}
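A sketch of how the wrapper might be used, with a periodic health check over the collected metrics; the thresholds are illustrative:

const monitored = new MonitoredDistributedLock(new RedisDistributedLock(redisClient));

setInterval(() => {
  const m = monitored.getMetrics();
  // Rising contention or slow acquisitions usually surface here first
  if (m.successRate < 0.9 || m.avgAcquisitionTime > 500) {
    console.warn('Lock health degraded:', m);
  }
}, 30000);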
Deep Dive Analysis
Comparison of Lock Implementations
| Implementation | Pros | Cons | Best Use Case |
|---|---|---|---|
| Redis (single) | Simple, fast, low latency | Single point of failure | High-performance, lower-stakes locking |
| Redlock | No single point of failure; tolerates crash failures of a minority of instances | Complex, higher latency, relies on timing assumptions | Critical systems needing availability |
| Database | ACID guarantees, familiar tooling | Slower, potential bottleneck | Systems already using a DB heavily |
| Consensus (Raft) | Strong consistency, fault tolerant | Complex implementation | Mission-critical coordination |
| Lock-free | No coordination overhead | Limited use cases | High-performance counters/flags |
Common Pitfalls and Solutions
1. Lock Holder Dies Without Releasing
// Problem: Process crashes while holding lock
// Solution: Always use TTL/expiration
const lock = await lockService.acquireLock('resource', 'client-id', 30000); // 30s TTL
2. Clock Skew Issues
// Problem: Different clocks on different machines
// Solution: Use relative timeouts and clock drift compensation
const CLOCK_DRIFT_FACTOR = 0.01; // 1% tolerance
const validityTime = ttl - (Date.now() - startTime) - (ttl * CLOCK_DRIFT_FACTOR);
3. Split-Brain Scenarios
// Problem: Network partition causes multiple "masters"
// Solution: Require majority consensus (quorum)
const quorumSize = Math.floor(nodes.length / 2) + 1;
if (successfulLocks >= quorumSize) {
// Safe to proceed
}
Interview-Focused Content
Junior Level (2-4 YOE)
Q: What is a distributed lock and why do we need it? A: A distributed lock coordinates access to shared resources across multiple processes or machines. We need it to prevent race conditions when multiple services try to modify the same data simultaneously, like preventing double-spending in payment systems or ensuring only one process handles a background job.
Q: What happens if a process holding a distributed lock crashes? A: The lock could remain held forever, blocking other processes. This is why distributed locks use TTL (time-to-live) - the lock automatically expires after a set time. The holder can also renew the lock periodically while it's still working.
Q: What's the difference between a distributed lock and a regular lock? A: Regular locks work within a single process/machine using shared memory. Distributed locks work across network boundaries and must handle network failures, different clocks, and node crashes. They're much more complex but necessary for coordinating between different services.
Senior Level (5-8 YOE)
Q: How would you implement a distributed lock using Redis?
A: Use the Redis SET command with NX (only set if the key does not exist) and PX (expiration in milliseconds): SET lock:resource value PX 30000 NX. For release, use a Lua script to atomically check ownership and delete. Include lock renewal for long operations and proper error handling for network failures.
Q: What are the trade-offs between single Redis vs Redlock algorithm? A: Single Redis is simpler and faster but has a single point of failure. Redlock uses multiple Redis instances and requires majority consensus, providing better availability but with higher complexity and latency. Choose based on your consistency vs availability requirements.
Q: How do you handle the "herd effect" when many processes are waiting for the same lock? A: Strategies include (a backoff sketch follows this list):
- Exponential backoff: Increase wait time between retries
- Jittering: Add randomness to retry intervals
- Queue-based locks: Implement fair ordering
- Lock-free alternatives: Use atomic operations when possible
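A minimal sketch combining the first two strategies, full-jitter exponential backoff; the lock parameter is any implementation from this article:

async function acquireWithBackoff(lock, resource, clientId, maxRetries = 8) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const result = await lock.acquireLock(resource, clientId, 30000);
    if (result.acquired) return result;
    // Full jitter: sleep a random duration up to an exponentially growing
    // cap, so waiters spread out instead of retrying in lockstep
    const cap = Math.min(10000, 100 * 2 ** attempt);
    await new Promise(resolve => setTimeout(resolve, Math.random() * cap));
  }
  throw new Error(`Could not acquire ${resource} after ${maxRetries} attempts`);
}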
Staff+ Level (8+ YOE)
Q: Design a distributed locking system for a global, multi-region application. A: Considerations:
- Regional locks: Use regional lock managers to reduce latency
- Global coordination: Implement hierarchical locking for cross-region resources
- Consensus protocol: Use Raft or similar for lock manager coordination
- Fallback mechanisms: Handle cross-region network partitions gracefully
- Monitoring: Track lock contention and performance across regions
Q: How would you migrate from a centralized locking system to a distributed one? A: Migration strategy (a dual-write sketch follows this list):
- Dual-write phase: Acquire locks in both old and new systems
- Validation: Verify both systems grant/deny consistently
- Gradual rollout: Start with non-critical resources
- Rollback plan: Ability to revert if issues arise
- Monitoring: Track lock performance and consistency during migration
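A sketch of the dual-write phase, assuming both systems expose the same acquire/release interface; metrics is a hypothetical stats client:

async function dualAcquire(oldLock, newLock, resource, clientId, ttl) {
  const oldResult = await oldLock.acquireLock(resource, clientId, ttl);
  const newResult = await newLock.acquireLock(resource, clientId, ttl);

  // Log disagreements for the validation phase; the legacy system
  // stays authoritative until the new one proves consistent
  if (oldResult.acquired !== newResult.acquired) {
    metrics.increment('lock_migration.disagreement');
    if (newResult.acquired) {
      await newLock.releaseLock(resource, clientId);
    }
  }
  return oldResult;
}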
Q: What are the implications of distributed locks for system observability? A: Observability needs:
- Lock contention metrics: Track wait times and failure rates
- Distributed tracing: Follow lock acquisition across services
- Deadlock detection: Monitor for circular dependencies
- Performance impact: Measure coordination overhead
- Business metrics: Connect lock behavior to business outcomes