Apache Kafka
System Architecture
Distributed streaming platform designed for high-throughput, real-time data pipelines and event-driven architectures
Overview
Apache Kafka is a distributed streaming platform designed to handle high-throughput, fault-tolerant, real-time data feeds. It addresses the critical challenge of building scalable, resilient data pipelines and event-driven architectures in distributed systems at enterprise scale.
Originally developed by LinkedIn in 2010 and open-sourced in 2011, Kafka has become the standard for stream processing and event streaming at companies like Netflix, Uber, Airbnb, and Spotify. It processes trillions of events daily and is designed for horizontal scalability, durability, and low-latency streaming.
Key capabilities include:
- Distributed commit log: Immutable, ordered sequence of records with configurable retention
- High throughput: Millions of messages per second with millisecond-level latency
- Horizontal scalability: Linear scaling across multiple brokers and partitions
- Fault tolerance: Replication and automatic failover with configurable durability guarantees
- Stream processing: Native integration with Kafka Streams and external processors
Architecture & Core Components
Core Components
1. Kafka Broker
- Message storage: Manages topic partitions and log segments
- Replication: Handles leader-follower replication for fault tolerance
- Client coordination: Manages producer and consumer connections
- Metadata management: Coordinates with ZooKeeper for cluster state
2. Topics and Partitions
- Topic: Logical category for organizing messages
- Partition: Physical subdivision enabling parallelism and scalability
- Log segments: Files containing actual message data with indexes
- Replication factor: Number of copies across different brokers
3. Producers
- Message publishing: Sends records to specific topics and partitions
- Partitioning: Determines target partition using key or round-robin
- Batching: Groups messages for efficient network utilization
- Acknowledgment: Configurable durability guarantees (acks=0,1,all)
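These mechanics map directly onto the client API. A minimal sketch, assuming a reachable broker at kafka1:9092 and an existing user-events topic:
import org.apache.kafka.clients.producer.*;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka1:9092");
props.put("acks", "all"); // wait for all in-sync replicas before acknowledging
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

try (Producer<String, String> producer = new KafkaProducer<>(props)) {
    // Records with the same key hash to the same partition, preserving per-key order
    producer.send(new ProducerRecord<>("user-events", "user123", "{\"event\":\"login\"}"),
        (metadata, exception) -> {
            if (exception != null) {
                exception.printStackTrace(); // failure after the client's internal retries
            } else {
                System.out.printf("partition=%d offset=%d%n", metadata.partition(), metadata.offset());
            }
        });
}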
4. Consumers
- Message consumption: Reads records from assigned partitions
- Consumer groups: Load balancing and fault tolerance mechanism
- Offset management: Tracks consumption progress per partition
- Rebalancing: Automatic partition reassignment on group changes
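A minimal consumer-group sketch tying these pieces together (broker address, group name, and topic are assumptions):
import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka1:9092");
props.put("group.id", "example-group"); // consumers sharing a group.id divide the partitions
props.put("enable.auto.commit", "false"); // commit only after records are processed
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("user-events")); // triggers group join and partition assignment
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> r : records) {
            System.out.printf("partition=%d offset=%d key=%s%n", r.partition(), r.offset(), r.key());
        }
        consumer.commitSync(); // advance the group's offsets for this batch
    }
}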
5. ZooKeeper (Legacy/KRaft)
- Cluster coordination: Manages broker membership and leader election
- Configuration storage: Stores cluster and topic configurations
- Consumer coordination: Tracks consumer group membership (legacy)
- KRaft mode: ZooKeeper replacement using Kafka's internal consensus
Data Flow & Processing Model
System Architecture Diagram
Kafka Protocol & Wire Format
- Binary protocol: Efficient, versioned protocol over TCP
- Message format: Headers, key, value, timestamp, and metadata
- Compression: GZIP, Snappy, LZ4, ZSTD for network efficiency
- Batching: Multiple messages per request for throughput optimization
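The producer API exposes these record fields directly; a small sketch (the header name is illustrative):
import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.charset.StandardCharsets;

// topic, partition (null = let the partitioner decide from the key), timestamp, key, value, headers
ProducerRecord<String, String> record = new ProducerRecord<>(
    "user-events", null, System.currentTimeMillis(), "user123", "{\"event\":\"click\"}", null);
record.headers().add("trace-id", "abc-123".getBytes(StandardCharsets.UTF_8));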
Configuration & Deployment
Production Broker Configuration
Core Broker Settings
# server.properties - Production Configuration
broker.id=1
listeners=PLAINTEXT://0.0.0.0:9092,SSL://0.0.0.0:9093
advertised.listeners=PLAINTEXT://kafka-broker-1:9092,SSL://kafka-broker-1:9093
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka
# Log Configuration
log.dirs=/kafka/logs
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
# Log Retention
log.retention.hours=168
# Size-based retention disabled here; time-based retention governs cleanup
# (a retention.bytes equal to segment.bytes would keep only ~one segment)
log.retention.bytes=-1
log.segment.bytes=1073741824
log.cleanup.policy=delete
# Replication
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
auto.create.topics.enable=false
# Performance
num.replica.fetchers=4
replica.fetch.max.bytes=1048576
group.initial.rebalance.delay.ms=3000
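To verify that a rolled-out configuration actually took effect, the effective settings can be read back over the Admin API. A sketch, assuming broker.id=1 as above and an enclosing method declared with "throws Exception":
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.config.ConfigResource;
import java.util.List;
import java.util.Properties;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092");
try (Admin admin = Admin.create(props)) {
    ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
    Config config = admin.describeConfigs(List.of(broker)).all().get().get(broker);
    System.out.println(config.get("min.insync.replicas").value()); // expect "2"
}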
JVM Configuration
# kafka-server-start.sh JVM settings
export KAFKA_HEAP_OPTS="-Xmx6g -Xms6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true"
# Java 8-style GC logging shown; on Java 11+ use -Xlog:gc*:file=/var/log/kafka/kafkaServer-gc.log:time,tags:filecount=10,filesize=100M
export KAFKA_GC_LOG_OPTS="-Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M"
Topic Configuration
Production Topic Creation
# Create topic with optimal settings
kafka-topics.sh --create \
--bootstrap-server kafka1:9092,kafka2:9092,kafka3:9092 \
--topic user-events \
--partitions 12 \
--replication-factor 3 \
--config min.insync.replicas=2 \
--config cleanup.policy=delete \
--config retention.ms=604800000 \
--config segment.ms=86400000 \
--config compression.type=lz4
# High-throughput topic configuration (batch.size and linger.ms are producer
# settings, not topic configs, so they are tuned on the producer instead)
kafka-configs.sh --alter \
--bootstrap-server kafka1:9092 \
--entity-type topics \
--entity-name high-volume-events \
--add-config compression.type=lz4,max.message.bytes=1048576
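Topics can also be created programmatically through the Admin API; a sketch mirroring the CLI flags above (broker address assumed, enclosing method declared with "throws Exception"):
import org.apache.kafka.clients.admin.*;
import java.util.List;
import java.util.Map;

try (Admin admin = Admin.create(Map.<String, Object>of(
        AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092"))) {
    NewTopic topic = new NewTopic("user-events", 12, (short) 3)
        .configs(Map.of(
            "min.insync.replicas", "2",
            "retention.ms", "604800000",
            "compression.type", "lz4"));
    admin.createTopics(List.of(topic)).all().get(); // throws if the topic already exists
}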
Docker & Kubernetes Deployment
Docker Compose Setup
# docker-compose.yml
version: '3.8'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    volumes:
      - zk-data:/var/lib/zookeeper/data
      - zk-logs:/var/lib/zookeeper/log
  kafka1:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka1:9092
      # Only two brokers in this example, so internal topics cannot use replication factor 3
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_HEAP_OPTS: "-Xmx2g -Xms2g"
    volumes:
      - kafka1-data:/var/lib/kafka/data
  kafka2:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 2
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka2:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
    volumes:
      - kafka2-data:/var/lib/kafka/data
volumes:
  zk-data:
  zk-logs:
  kafka1-data:
  kafka2-data:
Kubernetes Deployment
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: production-cluster
spec:
  kafka:
    version: 3.5.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.5"
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 500Gi
          class: fast-ssd
    resources:
      requests:
        memory: 8Gi
        cpu: 2
      limits:
        memory: 8Gi
        cpu: 4
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
      class: fast-ssd
Security Configuration
SSL/TLS Setup
# Broker SSL configuration
listeners=SSL://0.0.0.0:9093
ssl.keystore.location=/etc/kafka/ssl/kafka.server.keystore.jks
ssl.keystore.password=password
ssl.key.password=password
ssl.truststore.location=/etc/kafka/ssl/kafka.server.truststore.jks
ssl.truststore.password=password
ssl.client.auth=required
# SASL Authentication (PLAIN shown for brevity; prefer SCRAM-SHA-512 in production.
# security.inter.broker.protocol=SASL_SSL requires a matching SASL_SSL listener.)
sasl.enabled.mechanisms=PLAIN
sasl.mechanism.inter.broker.protocol=PLAIN
security.inter.broker.protocol=SASL_SSL
Performance Characteristics
Throughput Metrics
- Single broker: 100K-1M messages/sec depending on message size and configuration
- Cluster throughput: Linear scaling with additional brokers and partitions
- Batch processing: 10M+ messages/sec with optimized batching and compression
- Network utilization: Can saturate 10Gb/s network links with proper tuning
Latency Characteristics
Scenario | P50 | P95 | P99 | P99.9
-------------------------|---------|---------|---------|--------
End-to-end (acks=1) | 2-5ms | 5-15ms | 15-30ms | 30-100ms
End-to-end (acks=all) | 5-10ms | 10-25ms | 25-50ms | 50-200ms
Producer only (acks=1) | 1-2ms | 2-5ms | 5-15ms | 15-50ms
Consumer fetch | 1-3ms | 3-8ms | 8-20ms | 20-80ms
Cross-DC replication | 50-200ms| 200-500ms| 500ms-1s| 1-3s
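Client-side latency can be sampled directly from the producer's completion callback; a rough sketch (the configured producer instance and topic name are assumed):
// Measures broker acknowledgment latency as observed by the producer
long start = System.nanoTime();
producer.send(new ProducerRecord<>("user-events", "key", "value"), (metadata, e) -> {
    long micros = (System.nanoTime() - start) / 1_000;
    System.out.printf("ack latency: %d us%n", micros);
});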
Resource Utilization Patterns
Memory Usage
- Heap memory: 6-8GB for high-throughput brokers
- Page cache: Linux page cache critical for performance (32-64GB recommended)
- Off-heap: Minimal, mostly for compression and network buffers
- Consumer memory: Proportional to partition count and fetch size
CPU Patterns
- Network I/O: High CPU for compression/decompression
- Disk I/O: Sequential writes are CPU-efficient
- Replication: Additional CPU for inter-broker communication
- GC pressure: Well-tuned G1GC keeps pauses under 20ms
Storage Patterns
- Sequential writes: 90%+ of disk operations are sequential
- Log compaction: Additional I/O for compacted topics
- Retention: Regular cleanup based on time/size policies
- RAID configuration: RAID 10 recommended for production
Scalability Patterns
- Horizontal scaling: Add brokers and increase partition count
- Partition scaling: More partitions = higher parallelism
- Consumer scaling: Scale consumers up to partition count
- Cross-region: Multi-region clusters for global scaling
Operational Considerations
Failure Modes & Detection
Broker Failures
Symptoms:
- Partition leadership changes
- Consumer rebalancing
- Producer retries and timeouts
- Under-replicated partitions
Detection:
# Monitor broker status
kafka-broker-api-versions.sh --bootstrap-server kafka1:9092
# Check under-replicated partitions
kafka-topics.sh --bootstrap-server kafka1:9092 --describe --under-replicated-partitions
# Monitor JMX metrics (requires JMX enabled on the broker, e.g. JMX_PORT=9999)
kafka-run-class.sh kafka.tools.JmxTool \
--object-name kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions \
--jmx-url service:jmx:rmi:///jndi/rmi://kafka1:9999/jmxrmi
ZooKeeper Issues
Symptoms:
- Metadata operation failures
- Consumer group coordination problems
- Topic creation/deletion failures
- Leader election delays
Detection:
# ZooKeeper health check ("ruok" must be whitelisted via 4lw.commands.whitelist on ZooKeeper 3.5+)
echo "ruok" | nc zk1 2181
# Check ZooKeeper logs
tail -f /var/log/zookeeper/zookeeper.log | grep -i error
Network Partitions
Symptoms:
- Split-brain scenarios
- Inconsistent metadata views
- Consumer lag spikes
- Producer timeout errors
Detection:
# Monitor inter-broker connectivity
kafka-log-dirs.sh --bootstrap-server kafka1:9092 --describe
# Check consumer lag
kafka-consumer-groups.sh --bootstrap-server kafka1:9092 --describe --group mygroup
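Lag can also be computed programmatically as log-end offset minus committed offset. A sketch for a hypothetical group mygroup, assuming an enclosing method declared with "throws Exception":
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.HashMap;
import java.util.Map;

try (Admin admin = Admin.create(Map.<String, Object>of(
        AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092"))) {
    Map<TopicPartition, OffsetAndMetadata> committed =
        admin.listConsumerGroupOffsets("mygroup").partitionsToOffsetAndMetadata().get();
    Map<TopicPartition, OffsetSpec> latest = new HashMap<>();
    committed.keySet().forEach(tp -> latest.put(tp, OffsetSpec.latest()));
    Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
        admin.listOffsets(latest).all().get();
    committed.forEach((tp, om) ->
        System.out.printf("%s lag=%d%n", tp, ends.get(tp).offset() - om.offset()));
}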
Disaster Recovery
Backup Strategies
# Topic metadata backup
kafka-topics.sh --bootstrap-server kafka1:9092 --list > topics-backup.txt
# Consumer offset backup
kafka-consumer-groups.sh --bootstrap-server kafka1:9092 --all-groups --describe > offsets-backup.txt
# Configuration backup
kafka-configs.sh --bootstrap-server kafka1:9092 --entity-type topics --describe > topic-configs.txt
Cross-Region Replication
# MirrorMaker 2.0 configuration
cat > mm2.properties << EOF
clusters = primary, backup
primary.bootstrap.servers = kafka1-primary:9092
backup.bootstrap.servers = kafka1-backup:9092
primary->backup.enabled = true
primary->backup.topics = user-events, orders, payments
backup->primary.enabled = false
replication.factor = 3
checkpoints.topic.replication.factor = 3
heartbeats.topic.replication.factor = 3
offset-syncs.topic.replication.factor = 3
EOF
# Start MirrorMaker
connect-mirror-maker.sh mm2.properties
Point-in-Time Recovery
# Stop consumers
kafka-consumer-groups.sh --bootstrap-server kafka1:9092 --group mygroup --reset-offsets --to-datetime 2023-12-01T10:00:00.000 --execute
# Restore from snapshot (if using tiered storage)
kafka-log-dirs.sh --bootstrap-server kafka1:9092 --describe --json | jq '.logDirs'
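The same timestamp-to-offset mapping that --to-datetime relies on is available in the consumer API; a sketch for one assumed partition of user-events:
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka1:9092");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    TopicPartition tp = new TopicPartition("user-events", 0);
    consumer.assign(List.of(tp)); // manual assignment: no group membership needed
    long ts = Instant.parse("2023-12-01T10:00:00Z").toEpochMilli();
    OffsetAndTimestamp found = consumer.offsetsForTimes(Map.of(tp, ts)).get(tp);
    if (found != null) {           // null means no record at or after ts
        consumer.seek(tp, found.offset());
    }
}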
Maintenance Procedures
Rolling Upgrades
# 1. Upgrade one broker at a time
systemctl stop kafka
# Install new version
systemctl start kafka
# Wait for replication to catch up
# 2. Monitor during upgrade
kafka-topics.sh --bootstrap-server kafka1:9092 --describe --under-replicated-partitions
# 3. Verify cluster health
kafka-broker-api-versions.sh --bootstrap-server kafka1:9092
Partition Rebalancing
# Generate a reassignment plan; the output prints both the current and the
# proposed assignment, so copy the proposed JSON into reassignment.json
kafka-reassign-partitions.sh --bootstrap-server kafka1:9092 --topics-to-move-json-file topics.json --broker-list "1,2,3,4" --generate
# Execute reassignment
kafka-reassign-partitions.sh --bootstrap-server kafka1:9092 --reassignment-json-file reassignment.json --execute
# Monitor progress
kafka-reassign-partitions.sh --bootstrap-server kafka1:9092 --reassignment-json-file reassignment.json --verify
Log Compaction Management
# Force a segment roll so the active segment becomes eligible for compaction
# (temporary setting; revert it afterwards, since tiny segments hurt performance)
kafka-configs.sh --bootstrap-server kafka1:9092 --entity-type topics --entity-name compacted-topic --alter --add-config segment.ms=100
# Monitor compaction progress (partition names look like "compacted-topic-0")
kafka-log-dirs.sh --bootstrap-server kafka1:9092 --describe | tail -1 | jq '.brokers[].logDirs[].partitions[] | select(.partition | startswith("compacted-topic-"))'
Troubleshooting Guide
High Consumer Lag
# Identify lagging consumers
kafka-consumer-groups.sh --bootstrap-server kafka1:9092 --describe --group mygroup
# Check consumer thread dumps
jstack $(pgrep -f kafka.tools.ConsoleConsumer)
# Monitor consumer metrics over JMX (point --jmx-url at the consumer application's JVM)
kafka-run-class.sh kafka.tools.JmxTool --object-name kafka.consumer:type=consumer-fetch-manager-metrics,client-id=* --attributes records-lag-max
Producer Performance Issues
# Producer metrics
kafka-run-class.sh kafka.tools.JmxTool --object-name kafka.producer:type=producer-metrics,client-id=*
# Network and batch metrics
kafka-run-class.sh kafka.tools.JmxTool --object-name kafka.producer:type=producer-topic-metrics,client-id=*,topic=*
Broker Performance Issues
# Disk I/O monitoring
iostat -x 1
# Network monitoring
iftop -i eth0
# JVM monitoring
jstat -gc $(pgrep -f kafka.Kafka) 5s
Production Best Practices
Configuration Tuning
Producer Optimization
// High-throughput producer configuration
Properties props = new Properties();
props.put("bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092");
props.put("acks", "1"); // Balance between throughput and durability
props.put("retries", 3);
props.put("batch.size", 65536); // 64KB batches
props.put("linger.ms", 5); // Wait up to 5ms for batching
props.put("buffer.memory", 67108864); // 64MB buffer
props.put("compression.type", "lz4"); // Fast compression
props.put("max.in.flight.requests.per.connection", 5);
Consumer Optimization
// High-throughput consumer configuration
Properties props = new Properties();
props.put("bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092");
props.put("group.id", "high-throughput-consumers");
props.put("enable.auto.commit", "false"); // Manual commit for better control
props.put("fetch.min.bytes", 1024); // Wait for at least 1KB
props.put("fetch.max.wait.ms", 500); // Max wait 500ms
props.put("max.partition.fetch.bytes", 1048576); // 1MB per partition
props.put("session.timeout.ms", 30000);
props.put("heartbeat.interval.ms", 10000);
Monitoring Setup
Essential Metrics
# Broker metrics
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
# Producer metrics
kafka.producer:type=producer-metrics,client-id=*,attribute=record-send-rate
kafka.producer:type=producer-metrics,client-id=*,attribute=batch-size-avg
# Consumer metrics
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,attribute=records-lag-max
kafka.consumer:type=consumer-coordinator-metrics,client-id=*,attribute=commit-latency-avg
Alerting Rules
# Prometheus alerting rules (metric names depend on the exporter in use)
groups:
  - name: kafka
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
      - alert: KafkaConsumerLag
        expr: kafka_consumer_lag_max > 10000
        for: 3m
      - alert: KafkaBrokerDown
        expr: kafka_brokers < 3
        for: 1m
      - alert: KafkaLowProducerThroughput
        expr: kafka_producer_record_send_rate < 1000
        for: 5m
Capacity Planning
Sizing Guidelines
# Calculate storage requirements
# Daily volume = Messages/day × Average message size × Replication factor
# Example: 10M messages/day × 1KB × 3 replicas = 30GB/day
# Partition count estimation
# Partitions = Max(Target throughput / Partition throughput, Max consumers)
# Example: 100MB/s target ÷ 10MB/s per partition = 10 partitions minimum
# Memory requirements
# Broker heap: 6-8GB for high throughput
# Page cache: 32-64GB for optimal performance
# Producer buffer: 64MB default, scale with throughput
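These rules of thumb can be captured in a small helper; the function names and inputs below are illustrative, not a standard formula:
// Hypothetical sizing helpers encoding the estimates above
static int partitionCount(double targetMBps, double perPartitionMBps, int maxConsumers) {
    int byThroughput = (int) Math.ceil(targetMBps / perPartitionMBps);
    return Math.max(byThroughput, maxConsumers); // never fewer partitions than consumers
}

static double dailyStorageGiB(long messagesPerDay, int avgMessageBytes, int replicas) {
    return messagesPerDay * (double) avgMessageBytes * replicas / (1024.0 * 1024 * 1024);
}

// partitionCount(100, 10, 8)           -> 10
// dailyStorageGiB(10_000_000, 1024, 3) -> ~28.6 (the ~30 GB/day example above)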
Scaling Triggers
- Consumer lag > 10,000 messages consistently
- Disk utilization > 80%
- Network bandwidth > 80%
- CPU utilization > 70%
- Memory pressure (high GC frequency)
Integration Patterns
Kafka Streams Integration
// Kafka Streams topology for real-time processing
StreamsBuilder builder = new StreamsBuilder();
KStream<String, UserEvent> userEvents = builder.stream("user-events");
// Real-time aggregation
KTable<Windowed<String>, Long> userCounts = userEvents
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
.count(Materialized.as("user-counts-store"));
// Filter and route
userEvents
.filter((key, event) -> event.getType().equals("purchase"))
.to("purchase-events");
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
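For pipelines where duplicates are unacceptable, Streams can run with processing.guarantee=exactly_once_v2 (requires brokers 2.5 or newer), which wraps consumption, state-store updates, and output production in a single Kafka transaction.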
Schema Registry Integration
// Avro producer with Schema Registry
Properties props = new Properties();
props.put("bootstrap.servers", "kafka1:9092");
props.put("schema.registry.url", "http://schema-registry:8081");
props.put("key.serializer", KafkaAvroSerializer.class);
props.put("value.serializer", KafkaAvroSerializer.class);
Producer<String, GenericRecord> producer = new KafkaProducer<>(props);
// Send an Avro message ("schema" is a previously parsed org.apache.avro.Schema)
GenericRecord record = new GenericData.Record(schema);
record.put("userId", "user123");
record.put("eventType", "click");
producer.send(new ProducerRecord<>("user-events", "user123", record));
Security Best Practices
ACL Configuration
# Create ACLs for topic access
kafka-acls.sh --bootstrap-server kafka1:9092 \
--add \
--allow-principal User:producer-service \
--operation Write \
--topic user-events
kafka-acls.sh --bootstrap-server kafka1:9092 \
--add \
--allow-principal User:consumer-service \
--operation Read \
--topic user-events \
--group consumer-group-1
Network Security
# SSL client configuration
security.protocol=SSL
ssl.truststore.location=/etc/kafka/ssl/client.truststore.jks
ssl.truststore.password=password
ssl.keystore.location=/etc/kafka/ssl/client.keystore.jks
ssl.keystore.password=password
ssl.key.password=password
Interview-Focused Content
Technology-Specific Questions
Junior Level (2-4 YOE)
Q: What are the key differences between Kafka and traditional message queues?
A: Kafka differs from traditional message queues in several fundamental ways:
- Log-based storage: Kafka stores messages in an immutable, append-only log vs. queue-based deletion
- Multiple consumers: Multiple consumer groups can read the same messages independently
- Persistence: Messages are persisted to disk and can be replayed vs. deleted after consumption
- Partitioning: Built-in horizontal scaling through partitions vs. single queue scaling limitations
- Ordering guarantees: Per-partition ordering vs. global ordering in queues
- Pull-based consumption: Consumers pull messages vs. brokers pushing messages
Q: Explain how Kafka ensures message durability and what the acks parameter does.
A: Kafka ensures durability through replication and the acks parameter:
- acks=0: Producer doesn't wait for acknowledgment (highest throughput, lowest durability)
- acks=1: Producer waits for leader replica acknowledgment (balanced)
- acks=all/-1: Producer waits for all in-sync replicas (highest durability, lower throughput)
Additional durability factors:
- Replication factor: Number of replicas across different brokers
- min.insync.replicas: Minimum replicas that must acknowledge writes
- unclean.leader.election.enable=false: Prevents data loss by barring out-of-sync replicas from becoming leader
Mid-Level (4-8 YOE)
Q: How would you handle consumer lag in a high-throughput Kafka deployment?
A: Consumer lag resolution involves systematic diagnosis and optimization:
1. Identify Root Cause:
- Monitor consumer metrics (records-lag-max, records-consumed-rate)
- Check consumer thread utilization and GC pressure
- Analyze message processing time per record
2. Scaling Strategies:
- Horizontal scaling: Add more consumer instances (up to partition count)
- Partition increase: Add partitions to existing topics for more parallelism
- Batch processing: Increase fetch.min.bytes and max.partition.fetch.bytes
3. Processing Optimization:
- Async processing: Decouple message consumption from processing
- Batching: Process multiple records together
- Threading: Use thread pools for CPU-intensive processing
4. Configuration Tuning:
props.put("fetch.min.bytes", 1024); // Wait for more data
props.put("fetch.max.wait.ms", 500); // Don't wait too long
props.put("max.partition.fetch.bytes", 1MB); // Larger fetch sizes
props.put("enable.auto.commit", false); // Manual commit control
Q: Describe Kafka's leader election process and what happens during broker failures.
A: Kafka's leader election ensures high availability:
Normal Operations:
- Each partition has one leader and multiple followers
- Leaders handle all reads/writes for their partitions
- Followers continuously replicate from leaders
- ZooKeeper (or KRaft) tracks in-sync replica (ISR) sets
Broker Failure Process:
- Detection: ZooKeeper detects broker failure via session timeout
- ISR Update: Failed broker removed from ISR for its partitions
- Leader Election: New leader chosen from remaining ISR members
- Client Update: Producers/consumers get metadata updates
- Replication: New leader continues serving while rebuilding ISR
Split-Brain Prevention:
- Requires majority of ZooKeeper nodes for coordination
- min.insync.replicas prevents writes with insufficient replicas
- unclean.leader.election.enable=false prevents data loss
Senior Level (8+ YOE)
Q: Design a Kafka-based event streaming architecture for a financial trading platform handling millions of trades per second.
A: Financial trading platform architecture:
Requirements Analysis:
- Ultra-low latency (sub-millisecond)
- High throughput (millions of trades/sec)
- Strong consistency and durability
- Regulatory compliance and audit trails
- Global distribution with regional failover
Architecture Design:
Market Data  → Kafka Cluster (Region 1) → Risk Engine
     ↓                    ↓                    ↓
Trade Orders → Stream Processing        → Position Management
     ↓                    ↓                    ↓
Execution    → Cross-Region Replication → Audit/Compliance
Implementation:
- Topic Design:
- market-data: 100+ partitions, 1-day retention
- trade-orders: partitioned by instrument, 7-day retention
- executions: compacted topic for position tracking
- Performance Optimization:
- Dedicated network interfaces for Kafka
- NVMe SSD storage with RAID 10
- Kernel bypass networking (DPDK)
- CPU affinity for Kafka threads
- Durability Configuration:
- acks=all with min.insync.replicas=2
- Replication factor=3 within region
- Cross-region replication for disaster recovery
- Security & Compliance:
- mTLS for all communications
- ACLs for service-level access control
- Encryption at rest for audit logs
- Immutable log retention for compliance
Operational Questions
Q: Your Kafka cluster is experiencing sudden throughput degradation. Walk through your investigation process.
A: Systematic performance investigation:
1. Immediate Assessment:
# Check broker health
kafka-broker-api-versions.sh --bootstrap-server kafka1:9092
# Monitor key metrics
kafka-run-class.sh kafka.tools.JmxTool \
--object-name kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
# Check under-replicated partitions
kafka-topics.sh --bootstrap-server kafka1:9092 --describe --under-replicated-partitions
2. Resource Analysis:
- CPU: High GC pressure, thread contention
- Memory: Heap utilization, page cache effectiveness
- Disk I/O: Sequential vs. random I/O patterns
- Network: Bandwidth saturation, packet loss
3. Configuration Review:
- Producer/consumer batch settings
- Replication and acknowledgment configurations
- Log retention and compaction settings
- JVM tuning parameters
4. Application Analysis:
- Message size distribution
- Producer sending patterns
- Consumer processing latency
- Topic/partition access patterns
Q: How would you handle a ZooKeeper outage affecting your Kafka cluster?
A: ZooKeeper outage response depends on the scope:
Partial ZooKeeper Outage (Minority nodes down):
- Cluster continues operating normally
- Monitor remaining nodes for increased load
- Replace failed nodes from snapshots
- No immediate action required for Kafka
Total ZooKeeper Outage:
- Immediate Impact:
- Existing producers/consumers continue working
- No new topic/partition operations possible
- Consumer group rebalancing fails
- No leader elections possible
- Response Actions:
- Restart ZooKeeper ensemble from latest snapshots
- Verify ZooKeeper data integrity
- Monitor Kafka broker logs for connection recovery
- Validate consumer group status
- Prevention Strategies:
- Deploy ZooKeeper across availability zones
- Regular ZooKeeper snapshot backups
- Monitor ZooKeeper disk space and performance
- Consider migration to KRaft mode (ZooKeeper-less)
Design Integration
Q: How would you integrate Kafka with a microservices architecture for event-driven communication?
A: Event-driven microservices integration:
Event Design Patterns:
- Event Sourcing: Store all state changes as events
- CQRS: Separate command and query models
- Saga Pattern: Distributed transaction coordination
- Outbox Pattern: Ensure database and event consistency
Implementation Architecture:
Service A → [Database] → [Outbox Table] → Kafka → Service B
                                ↓
[Local Events] ← [Event Store] ← [Event Bus] ← [Event Processor]
Technical Implementation:
// Event publishing with transactional outbox
@Transactional
public void processOrder(Order order) {
// Business logic
orderRepository.save(order);
// Publish event
OutboxEvent event = new OutboxEvent(
"OrderCreated",
order.getId(),
orderToJson(order)
);
outboxRepository.save(event);
}
// Background processor publishes to Kafka
@Scheduled(fixedDelay = 1000)
public void publishOutboxEvents() {
List<OutboxEvent> events = outboxRepository.findUnpublished();
for (OutboxEvent event : events) {
kafkaTemplate.send("order-events", event.getPayload());
outboxRepository.markPublished(event.getId());
}
}
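Note that this pattern is at-least-once: a crash between the Kafka send and markPublished republishes the event on restart, so downstream consumers must deduplicate, e.g. on the event ID.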
Considerations:
- Schema evolution with Schema Registry
- Event versioning and backward compatibility
- Dead letter queues for failed processing
- Monitoring and observability across services
Trade-off Analysis
Q: When would you choose Kafka over other messaging systems like RabbitMQ or Amazon SQS?
A: Messaging system selection criteria:
Choose Kafka when:
- High throughput: Need to process millions of messages/second
- Event streaming: Require real-time stream processing capabilities
- Event sourcing: Building event-driven architectures
- Multiple consumers: Multiple systems need the same data
- Replay capability: Need to reprocess historical events
- Durability: Require long-term message persistence
Choose RabbitMQ when:
- Complex routing: Need sophisticated message routing logic
- Request-response patterns: Synchronous communication patterns
- Message TTL: Require per-message expiration
- Priority queues: Need message prioritization
- Smaller scale: Lower throughput requirements (<100K messages/sec)
Choose Amazon SQS when:
- Managed service: Want fully managed infrastructure
- Simple use cases: Basic queue-based communication
- Cost sensitivity: Pay-per-use pricing model preferred
- AWS ecosystem: Deep integration with other AWS services
- Minimal operations: Don't want to manage infrastructure
Specific Scenarios:
- Financial trading: Kafka (low latency, high throughput)
- IoT sensor data: Kafka (volume, stream processing)
- Task queues: RabbitMQ (complex routing, reliability)
- Serverless workflows: SQS (managed, event-driven)
Troubleshooting Scenarios
Q: A consumer group is stuck and not processing messages despite messages being available. How do you diagnose and fix this?
A: Consumer group troubleshooting:
1. Initial Diagnosis:
# Check consumer group status
kafka-consumer-groups.sh --bootstrap-server kafka1:9092 \
--describe --group stuck-group
# Look for common issues
# - Consumer lag increasing
# - No active consumers
# - Uneven partition assignment
2. Common Root Causes:
Consumer Crash/Exit:
# Check consumer instance logs
tail -f /var/log/consumer-app.log | grep -i error
# Verify consumer heartbeats
kafka-consumer-groups.sh --bootstrap-server kafka1:9092 \
--describe --group stuck-group --verbose
Rebalancing Loop:
# Monitor rebalancing events
kafka-consumer-groups.sh --bootstrap-server kafka1:9092 \
--describe --group stuck-group
# Check session timeout configuration
# session.timeout.ms vs heartbeat.interval.ms ratio
Processing Deadlock:
# Thread dump of consumer application
jstack $(pgrep -f consumer-app)
# Look for blocked threads or resource contention
3. Resolution Steps:
# Reset offsets if necessary
kafka-consumer-groups.sh --bootstrap-server kafka1:9092 \
--group stuck-group --reset-offsets --to-latest --execute --topic mytopic
# Restart consumer instances
systemctl restart consumer-service
# Monitor recovery
watch "kafka-consumer-groups.sh --bootstrap-server kafka1:9092 --describe --group stuck-group"
4. Prevention:
- Implement proper exception handling
- Configure appropriate timeouts
- Monitor consumer lag and processing time
- Use circuit breakers for external dependencies