Design Google Docs

System Design Challenge

Difficulty: hard
Time: 45-60 minutes
Tags: real-time-collaboration, operational-transformation, websockets, conflict-resolution, document-storage


What is Google Docs?

Google Docs is a real-time collaborative document editing platform that allows multiple users to edit documents simultaneously. It's similar to Microsoft Word Online, Notion, or Confluence. The service provides real-time collaboration, conflict resolution, and document management.

Real-time collaborative editing with conflict resolution is what makes systems like Google Docs unique. The core design challenges (operational transformation, conflict resolution, real-time sync, and consistency) are the same across collaborative platforms, so understanding Google Docs prepares you for interview questions about similar systems.


Functional Requirements

Core (Interview Focussed)

  • Real-time Collaboration: Multiple users can edit documents simultaneously.
  • Conflict Resolution: Handle conflicts when users edit the same text.
  • Document Management: Create, save, and manage documents.
  • User Presence: Show which users are currently editing.

Out of Scope

  • User authentication and authorization
  • Document sharing and permissions
  • Comment and suggestion system
  • Document templates and formatting
  • Mobile app specific features

Non-Functional Requirements

Core (Interview Focussed)

  • Low latency: Sub-second response time for edits.
  • Consistency: Ensure all users see the same document state.
  • Scalability: Handle thousands of concurrent users per document.
  • Reliability: Maintain document integrity during network issues.

Out of Scope

  • Data retention policies
  • Compliance and privacy regulations

💡 Interview Tip: Focus on low latency, consistency, and scalability. Interviewers care most about operational transformation, conflict resolution, and real-time synchronization.


Core Entities

Entity | Key Attributes | Notes
Document | document_id, title, content, created_at, modified_at | Indexed by user_id for fast queries
User | user_id, username, email | User account information
Operation | operation_id, document_id, user_id, operation_type, content, position, timestamp | Track document operations
Cursor | cursor_id, document_id, user_id, position, timestamp | Track user cursor positions
Session | session_id, document_id, user_id, connected_at | Track user sessions

💡 Interview Tip: Focus on Document, Operation, and Cursor as they drive real-time collaboration and conflict resolution.


Core APIs

Document Management

  • POST /documents { title, content } – Create a new document
  • GET /documents/{document_id} – Get document content
  • PUT /documents/{document_id} { content } – Update document content
  • DELETE /documents/{document_id} – Delete a document

Real-time Collaboration

  • POST /documents/{document_id}/operations { operation_type, content, position } – Apply operation
  • GET /documents/{document_id}/operations?since= – Get operations since timestamp
  • POST /documents/{document_id}/cursor { position } – Update cursor position
  • GET /documents/{document_id}/cursors – Get all user cursors

User Presence

  • POST /documents/{document_id}/join – Join document session
  • POST /documents/{document_id}/leave – Leave document session
  • GET /documents/{document_id}/users – Get active users

High-Level Design

[Figure: System architecture diagram]

Key Components

  • Document Service: Manage document CRUD operations
  • Operation Service: Handle document operations and transformations
  • Real-time Service: Manage WebSocket connections and real-time updates
  • Conflict Resolution Service: Resolve conflicts using operational transformation
  • Presence Service: Track user presence and cursors
  • Database: Persistent storage for documents and operations

Mapping Core Functional Requirements to Components

Functional Requirement | Responsible Components | Key Considerations
Real-time Collaboration | Real-time Service, Operation Service | WebSocket connections, operation broadcasting
Conflict Resolution | Conflict Resolution Service | Operational transformation, conflict detection
Document Management | Document Service, Database | CRUD operations, data persistence
User Presence | Presence Service, Real-time Service | Cursor tracking, user status
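Cursor tracking interacts with operations: a stored cursor position goes stale as soon as another user inserts or deletes text before it. A minimal sketch of how the Presence Service might shift a cursor when an operation is broadcast, assuming character offsets and range deletes. The function name and the tie rule (an insert exactly at the cursor pushes it right) are illustrative choices, not part of the original design.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Insert:
    pos: int   # character offset where text is inserted
    text: str

@dataclass(frozen=True)
class Delete:
    pos: int     # first character removed
    length: int  # number of characters removed

Operation = Union[Insert, Delete]

def transform_cursor(cursor: int, op: Operation) -> int:
    """Return the cursor's new offset after `op` is applied to the document."""
    if isinstance(op, Insert):
        # Text inserted at or before the cursor pushes it right (assumed tie rule).
        return cursor + len(op.text) if op.pos <= cursor else cursor
    end = op.pos + op.length
    if cursor >= end:
        # The whole deleted range lies before the cursor.
        return cursor - op.length
    if cursor > op.pos:
        # The cursor sat inside the deleted range: collapse it to the start.
        return op.pos
    return cursor  # the deletion happened after the cursor
```

For example, a cursor at offset 5 moves to 8 after someone inserts "abc" at offset 2, and a cursor inside a deleted range lands at the start of that range.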

Detailed Design

Operation Service

Purpose: Handle document operations and apply operational transformation.

Key Design Decisions:

  • Operation Types: Insert, delete, and format operations
  • Operational Transformation: Transform operations to resolve conflicts
  • Operation Ordering: Ensure operations are applied in correct order
  • Operation Persistence: Store operations for recovery and replay

Algorithm: Operational transformation

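1. Receive an operation from a user, together with the document revision it was based on
2. Assign the operation a sequence number
3. Transform the operation against concurrent operations (those applied after the revision it was based on):
   - For insert: shift the position right by text inserted before it and left by text deleted before it
   - For delete: shift the range the same way, shrinking it if part of it was already deleted
4. Apply the transformed operation to the in-memory document state
5. Broadcast the transformed operation to all connected users
6. Store the operation in the database
7. Update the stored document content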
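The steps above can be sketched as a small single-process class. This is a minimal illustration under several assumptions: operations are insert-only (deletes follow the same pattern), each client sends the revision its edit was based on (`base_revision`), an in-memory list stands in for the operations table, and broadcasting is a list of callbacks. Concurrent inserts at the same offset are ordered by a hypothetical `site` (author) ID so every replica breaks the tie the same way.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

@dataclass(frozen=True)
class Insert:
    pos: int    # character offset
    text: str
    site: str   # author ID, used only to break ties deterministically

def transform(op: Insert, applied: Insert) -> Insert:
    """Rewrite `op` so it can be applied after the concurrent `applied` op."""
    if applied.pos < op.pos or (applied.pos == op.pos and applied.site < op.site):
        return replace(op, pos=op.pos + len(applied.text))
    return op

class OperationService:
    def __init__(self, text: str = "") -> None:
        self.text = text
        self.log: List[Insert] = []  # stand-in for the operations table
        self.listeners: List[Callable[[int, Insert], None]] = []

    @property
    def revision(self) -> int:
        # Sequence number of the latest operation = length of the log.
        return len(self.log)

    def submit(self, op: Insert, base_revision: int) -> int:
        # Transform against every operation the client had not seen yet.
        for applied in self.log[base_revision:]:
            op = transform(op, applied)
        # Apply to the current document state.
        self.text = self.text[:op.pos] + op.text + self.text[op.pos:]
        # Record the operation; its position in the log is its sequence number.
        self.log.append(op)
        seq = self.revision
        # Broadcast the transformed operation to every listener.
        for listener in self.listeners:
            listener(seq, op)
        return seq
```

Usage: two users insert at offset 1 of "ab" against the same base revision; the second insert is transformed and lands after the first, so the document becomes "aXYb". The sketch records an operation before broadcasting it, so a listener never sees an operation that has not been logged.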

Real-time Service

Purpose: Manage WebSocket connections and broadcast real-time updates.

Key Design Decisions:

  • WebSocket Connections: Maintain persistent connections for real-time updates
  • Connection Management: Handle connection drops and reconnections
  • Message Broadcasting: Broadcast operations to all connected users
  • Connection Scaling: Scale WebSocket connections horizontally

Algorithm: Real-time operation broadcasting

1. User connects to the document via WebSocket
2. Send the current document state to the user
3. Send any operations applied since the user's last sync
4. When an operation is received:
   - Apply operational transformation
   - Broadcast the transformed operation to all connected users
   - Store the operation in the database
5. Handle connection drops gracefully
6. When a user reconnects, replay the operations they missed (everything after the last revision they received)
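A minimal sketch of the connect, broadcast, and reconnect steps, assuming each client remembers the last revision it received and sends it when it reconnects. Real WebSocket handling is replaced by a per-connection outbox list, and `DocumentRoom`, `Connection`, and the message tuples are hypothetical names. Operations are treated as opaque values because transformation happens in the Operation Service; the stored snapshot is assumed to be the text at revision 0, whereas a real service would snapshot periodically and replay only the operations after the latest snapshot.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

@dataclass
class Connection:
    user_id: str
    # Stands in for socket.send(): messages are appended here instead.
    outbox: List[Tuple[str, Any]] = field(default_factory=list)

class DocumentRoom:
    def __init__(self, snapshot: str) -> None:
        self.snapshot = snapshot      # document text at revision 0
        self.ops: List[Any] = []      # ops[i] has revision i + 1
        self.connections: Dict[str, Connection] = {}

    def connect(self, conn: Connection, last_revision: Optional[int] = None) -> None:
        """Steps 1-3 and 6: attach a client and bring it up to date."""
        self.connections[conn.user_id] = conn
        if last_revision is None:
            # New client: send the snapshot, then every op recorded after it.
            conn.outbox.append(("snapshot", self.snapshot))
            last_revision = 0
        for rev in range(last_revision + 1, len(self.ops) + 1):
            conn.outbox.append(("op", (rev, self.ops[rev - 1])))

    def disconnect(self, user_id: str) -> None:
        """Step 5: drop the socket; the client keeps its last revision."""
        self.connections.pop(user_id, None)

    def publish(self, op: Any) -> int:
        """Step 4: record an already-transformed op and fan it out."""
        self.ops.append(op)
        rev = len(self.ops)
        for conn in self.connections.values():
            conn.outbox.append(("op", (rev, op)))
        return rev
```

A client that drops after revision 1 and reconnects with `last_revision=1` receives only the operations it missed, not the whole document again.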

Conflict Resolution Service

Purpose: Resolve conflicts using operational transformation algorithms.

Key Design Decisions:

  • Transformation Rules: Define how operations transform against each other
  • Conflict Detection: Detect concurrent operations (generated against the same document revision)
  • Resolution Strategy: Break ties deterministically, e.g. order two inserts at the same position by user ID
  • Consistency Guarantees: Ensure all users see consistent document state

Algorithm: Conflict resolution

1. Detect concurrent operations A and B (both created against the same document revision)
2. Apply operational transformation:
   - Transform operation A against operation B, producing A'
   - Transform operation B against operation A, producing B'
3. Apply the transformed operations to the document
4. Ensure convergence: applying A then B' must yield the same document as applying B then A'
5. Broadcast resolved operations to all users
6. Maintain document consistency
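What the transformation rules must guarantee is convergence, known in the OT literature as transformation property TP1: for concurrent operations A and B, applying A then B' yields the same text as applying B then A'. A minimal sketch, assuming character-level operations (insert a string, delete one character) and ties broken by a hypothetical `site` ID, that defines one possible set of rules and checks TP1 exhaustively on a short document. With a single server ordering all operations, as in this design, TP1 is generally the property that matters; fully peer-to-peer OT needs a stronger property (TP2) as well.

```python
from dataclasses import dataclass, replace
from itertools import product
from typing import Optional, Union

@dataclass(frozen=True)
class Ins:
    pos: int
    text: str
    site: str

@dataclass(frozen=True)
class Del:
    pos: int   # deletes the single character at this offset
    site: str

Op = Union[Ins, Del]

def apply(doc: str, op: Optional[Op]) -> str:
    if op is None:  # no-op, e.g. a delete that the other side already did
        return doc
    if isinstance(op, Ins):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]

def transform(op: Op, other: Op) -> Optional[Op]:
    """Return `op` rewritten so it applies after the concurrent `other`."""
    if isinstance(other, Ins):
        shift = len(other.text)
        if isinstance(op, Ins):
            # Same offset: the lower site ID goes first (deterministic tie-break).
            if other.pos < op.pos or (other.pos == op.pos and other.site < op.site):
                return replace(op, pos=op.pos + shift)
            return op
        # A delete moves right when text is inserted at or before its target.
        return replace(op, pos=op.pos + shift) if other.pos <= op.pos else op
    # `other` deleted one character.
    if other.pos < op.pos:
        return replace(op, pos=op.pos - 1)
    if isinstance(op, Del) and other.pos == op.pos:
        return None  # both deleted the same character; keep only one deletion
    return op

def converges(doc: str, a: Op, b: Op) -> bool:
    """TP1: A then B' must equal B then A'."""
    return apply(apply(doc, a), transform(b, a)) == apply(apply(doc, b), transform(a, b))

# Every insert/delete position on "abc" for two different sites.
ops_a = [Ins(p, "X", "s1") for p in range(4)] + [Del(p, "s1") for p in range(3)]
ops_b = [Ins(p, "YY", "s2") for p in range(4)] + [Del(p, "s2") for p in range(3)]
```

Checking `converges("abc", a, b)` for every pair in `product(ops_a, ops_b)` exercises all four operation-type combinations, including the tie cases (same offset, same deleted character).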

Database Design

Documents Table

Field | Type | Description
document_id | VARCHAR(36) | Primary key
title | VARCHAR(255) | Document title
content | TEXT | Document content
created_at | TIMESTAMP | Creation timestamp
modified_at | TIMESTAMP | Last modification

Indexes:

  • idx_created_at on (created_at) - Time-based queries
  • idx_modified_at on (modified_at) - Recent documents

Operations Table

Field | Type | Description
operation_id | VARCHAR(36) | Primary key
document_id | VARCHAR(36) | Associated document
user_id | VARCHAR(36) | Operation author
operation_type | VARCHAR(50) | Type of operation
content | TEXT | Operation content
position | INT | Operation position
timestamp | TIMESTAMP | Operation timestamp

Indexes:

  • idx_document_timestamp on (document_id, timestamp) - Document operations
  • idx_user_timestamp on (user_id, timestamp) - User operations

Cursors Table

Field | Type | Description
cursor_id | VARCHAR(36) | Primary key
document_id | VARCHAR(36) | Associated document
user_id | VARCHAR(36) | Cursor owner
position | INT | Cursor position
timestamp | TIMESTAMP | Cursor timestamp

Indexes:

  • idx_document_user on (document_id, user_id) - User cursors
  • idx_timestamp on (timestamp) - Recent cursors
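The three tables and their indexes can be tried out locally with Python's built-in `sqlite3` module. This is a sketch for illustration only: SQLite stands in for whatever database runs in production, and the `seq` column is a hypothetical addition that stores the sequence number assigned by the Operation Service, because wall-clock timestamps from different machines are not a reliable order. The query shows how a reconnecting client's missed operations could be fetched.

```python
import sqlite3

SCHEMA = """
CREATE TABLE documents (
    document_id VARCHAR(36) PRIMARY KEY,
    title       VARCHAR(255),
    content     TEXT,
    created_at  TIMESTAMP,
    modified_at TIMESTAMP
);
CREATE INDEX idx_created_at  ON documents (created_at);
CREATE INDEX idx_modified_at ON documents (modified_at);

CREATE TABLE operations (
    operation_id   VARCHAR(36) PRIMARY KEY,
    document_id    VARCHAR(36) REFERENCES documents (document_id),
    seq            INTEGER NOT NULL,  -- hypothetical per-document sequence number
    user_id        VARCHAR(36),
    operation_type VARCHAR(50),
    content        TEXT,
    position       INT,
    timestamp      TIMESTAMP,
    UNIQUE (document_id, seq)         -- also creates an index for ops_since()
);
CREATE INDEX idx_document_timestamp ON operations (document_id, timestamp);
CREATE INDEX idx_user_timestamp     ON operations (user_id, timestamp);

CREATE TABLE cursors (
    cursor_id   VARCHAR(36) PRIMARY KEY,
    document_id VARCHAR(36) REFERENCES documents (document_id),
    user_id     VARCHAR(36),
    position    INT,
    timestamp   TIMESTAMP
);
CREATE INDEX idx_document_user ON cursors (document_id, user_id);
CREATE INDEX idx_timestamp     ON cursors (timestamp);
"""

def ops_since(conn: sqlite3.Connection, document_id: str, after_seq: int) -> list:
    """Operations a reconnecting client missed, in application order."""
    return conn.execute(
        "SELECT seq, operation_type, content, position FROM operations "
        "WHERE document_id = ? AND seq > ? ORDER BY seq",
        (document_id, after_seq),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO documents VALUES ('d1', 'Notes', '', '2024-01-01', '2024-01-01')")
conn.executemany(
    "INSERT INTO operations VALUES (?, 'd1', ?, 'u1', 'insert', ?, ?, '2024-01-01')",
    [("op1", 1, "Hi", 0), ("op2", 2, "!", 2)],
)
```

A client that last saw sequence number 1 would receive only the second operation.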

Scalability Considerations

Horizontal Scaling

  • Real-time Service: Scale WebSocket connections with load balancers
  • Operation Service: Use consistent hashing for document partitioning
  • Database: Shard operations by document_id
  • Presence Service: Use distributed cache for user presence
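Consistent hashing lets every node agree on which Operation Service instance owns a document without a central lookup, so all edits to one document meet in one place for ordering and transformation. A minimal sketch with virtual nodes; the server names, the use of MD5 as the ring hash, and 100 virtual nodes per server are illustrative assumptions.

```python
import bisect
import hashlib
from typing import Dict, List

def _hash(key: str) -> int:
    # MD5 is used only for an even spread of keys, not for security.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self._ring: List[int] = []        # sorted virtual-node hashes
        self._owner: Dict[int, str] = {}  # virtual-node hash -> server name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            h = _hash(f"{node}#{i}")
            bisect.insort(self._ring, h)
            self._owner[h] = node

    def remove(self, node: str) -> None:
        for i in range(self.vnodes):
            h = _hash(f"{node}#{i}")
            self._ring.remove(h)
            del self._owner[h]

    def node_for(self, document_id: str) -> str:
        # First virtual node clockwise from the document's hash, wrapping around.
        idx = bisect.bisect(self._ring, _hash(document_id)) % len(self._ring)
        return self._owner[self._ring[idx]]
```

The property that matters when a server leaves: only the documents it owned move, and every other document keeps its owner.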

Caching Strategy

  • Redis: Cache document content and recent operations
  • Application Cache: Cache user sessions and cursors
  • CDN: Cache static document assets

Performance Optimization

  • Connection Pooling: Efficient database connections
  • Batch Processing: Batch operations for efficiency
  • Async Processing: Non-blocking operation processing
  • Resource Monitoring: Monitor CPU, memory, and network usage
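One common form of batching in editors is coalescing a burst of keystrokes into a single operation before it is sent or broadcast. A minimal sketch, assuming insert-only operations and merging only consecutive inserts by the same user that continue exactly where the previous one ended; batching database writes is an equally valid reading of the bullet above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Insert:
    user_id: str
    pos: int
    text: str

def coalesce(ops: List[Insert]) -> List[Insert]:
    """Merge runs of typing into fewer, larger inserts; order is preserved."""
    merged: List[Insert] = []
    for op in ops:
        last: Optional[Insert] = merged[-1] if merged else None
        if (last is not None and last.user_id == op.user_id
                and op.pos == last.pos + len(last.text)):
            # `op` continues exactly where `last` ended: append its text.
            merged[-1] = Insert(last.user_id, last.pos, last.text + op.text)
        else:
            merged.append(op)
    return merged
```

Three keystrokes "H", "i", "!" typed at offsets 0, 1, 2 become a single insert of "Hi!" at offset 0, which yields the same document with one broadcast instead of three.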

Monitoring and Observability

Key Metrics

  • Operation Latency: Average operation processing time
  • Concurrent Users: Number of users per document
  • Conflict Rate: Percentage of operations with conflicts
  • System Health: CPU, memory, and network usage

Alerting

  • High Latency: Alert when operation time exceeds threshold
  • Connection Drops: Alert when WebSocket connections drop frequently
  • Conflict Spike: Alert when conflict rate increases
  • System Errors: Alert on operation failures

Trade-offs and Considerations

Consistency vs. Availability

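  • Choice: Consistency for document state: the server imposes a single order of operations per document, and every client converges to it
  • Reasoning: Document consistency is critical for collaborative editing; clients still apply their own edits locally first, so the guarantee is convergence rather than every client seeing every edit at the same instant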

Latency vs. Throughput

  • Choice: Optimize for latency with real-time processing
  • Reasoning: Real-time collaboration requires immediate operation application

Storage vs. Performance

  • Choice: Store operations for recovery and replay
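  • Reasoning: Keeping every operation costs more storage than keeping only the latest content, but it enables crash recovery, replaying missed operations to reconnecting clients, and rebuilding earlier versions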

Common Interview Questions

Q: How would you handle network partitions?

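A: Clients keep editing locally during a partition and queue their operations together with the last revision they saw. When the partition heals, the client sends the queue and the server transforms each operation against everything it applied in the meantime, while the client transforms incoming server operations against its own queue, so both sides converge to the same document.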
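A minimal client-side sketch of that merge, assuming insert-only operations, a deterministic tie-break on a hypothetical `site` ID, and a server that delivers its operations in sequence-number order. `OfflineClient` is a hypothetical name, and acknowledgements of the client's own operations are left out for brevity.

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: str  # tie-break: the lower site ID goes first at the same offset

def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` to apply after the concurrent `other`."""
    if other.pos < op.pos or (other.pos == op.pos and other.site < op.site):
        return replace(op, pos=op.pos + len(other.text))
    return op

def apply(text: str, op: Insert) -> str:
    return text[:op.pos] + op.text + text[op.pos:]

class OfflineClient:
    def __init__(self, text: str, revision: int, site: str) -> None:
        self.text = text          # local view, including unsent edits
        self.revision = revision  # last server revision incorporated
        self.site = site
        self.pending: List[Insert] = []  # local edits the server has not acked

    def edit(self, pos: int, text: str) -> None:
        op = Insert(pos, text, self.site)
        self.text = apply(self.text, op)  # optimistic local apply
        self.pending.append(op)

    def on_server_op(self, remote: Insert) -> None:
        """Merge one server op that is concurrent with everything in `pending`."""
        rebased: List[Insert] = []
        for local in self.pending:
            # Rebase the local op over the remote one, and carry the remote
            # op forward past the local one, using the same starting state.
            rebased.append(transform(local, remote))
            remote = transform(remote, local)
        self.pending = rebased
        self.text = apply(self.text, remote)
        self.revision += 1
```

After reconnecting, the client sends `pending` with `revision` as its base; applied to the server's text, the rebased queue produces exactly the client's local text.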

Q: How do you ensure operation ordering?

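A: The server assigns each operation a per-document sequence number, which defines the authoritative order. Clients send the revision their operation was based on, and the server transforms it against the operations applied since then. Timestamps are useful for display and auditing but should not decide the order, because client and server clocks drift.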

Q: How would you scale this system globally?

A: Deploy regional WebSocket servers, use geo-distributed databases, and implement data replication strategies.

Q: How do you handle large documents?

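A: Split the document into chunks so an edit loads and rewrites only the chunk it touches, exchange incremental operations rather than full document content, and keep storage efficient by persisting only the chunks that changed.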
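A minimal sketch of the lookup chunking needs, assuming the document is stored as an ordered list of text chunks; `locate` and `insert` are hypothetical helpers. A document-wide character offset from an operation is mapped to a chunk index and an offset inside that chunk using prefix sums and binary search, so only one chunk is rewritten per edit.

```python
import bisect
from itertools import accumulate
from typing import List, Tuple

def locate(chunks: List[str], offset: int) -> Tuple[int, int]:
    """Map a document-wide offset to (chunk index, offset inside that chunk)."""
    if not chunks:
        raise ValueError("document has no chunks")
    # ends[i] is the document offset just past chunk i (prefix sums of lengths).
    ends = list(accumulate(len(c) for c in chunks))
    if not 0 <= offset <= ends[-1]:
        raise IndexError("offset outside document")
    # First chunk whose end lies strictly after the offset; an offset equal to
    # the document length belongs to the end of the last chunk.
    idx = min(bisect.bisect_right(ends, offset), len(chunks) - 1)
    start = ends[idx - 1] if idx > 0 else 0
    return idx, offset - start

def insert(chunks: List[str], offset: int, text: str) -> None:
    """Apply an insert by rewriting only the chunk that contains `offset`."""
    idx, local = locate(chunks, offset)
    chunks[idx] = chunks[idx][:local] + text + chunks[idx][local:]
```

Recomputing the prefix sums on every call is O(number of chunks); a real system would more likely maintain them (or a balanced tree of chunk lengths) incrementally.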


Key Takeaways

  1. Operational Transformation: Essential for resolving conflicts in real-time collaborative editing
  2. Real-time Communication: WebSocket connections enable immediate operation broadcasting
  3. Conflict Resolution: Transformation rules plus a deterministic tie-break let every user converge to the same document
  4. Scalability: Horizontal scaling and partitioning are crucial for handling concurrent users
  5. Monitoring: Comprehensive monitoring ensures system reliability and performance