Elasticsearch
System Architecture
Distributed search and analytics engine built on Apache Lucene for real-time search, logging, and data analytics
Overview
Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene, designed for horizontal scalability, reliability, and real-time search capabilities. It addresses the critical challenge of providing fast, relevant search and complex analytics across large volumes of structured and unstructured data.
Originally created by Shay Banon and first released in 2010, Elasticsearch has become a de facto standard for search, logging, and analytics at companies such as GitHub, Netflix, Uber, and Stack Overflow. Production deployments routinely index billions of documents and petabytes of data while maintaining sub-second query response times, with a design that emphasizes high availability, distributed operation, and operational simplicity.
Key capabilities include:
- Full-text search: Advanced text analysis with relevance scoring and fuzzy matching (see the example after this list)
- Real-time analytics: Complex aggregations and time-series analysis at scale
- Distributed architecture: Automatic sharding, replication, and cluster management
- Schema flexibility: Dynamic mapping with support for complex nested data structures
- Near real-time indexing: Documents available for search within seconds of ingestion
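As a minimal illustration of the first capability, a match query with fuzziness tolerates typos in the search term (the articles index and title field here are hypothetical):
GET articles/_search
{
  "query": {
    "match": {
      "title": {
        "query": "elasticsearc",
        "fuzziness": "AUTO"
      }
    }
  }
}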
Architecture & Core Components
System Architecture
[System architecture diagram]
Core Components
1. Cluster and Nodes
- Cluster: Collection of nodes storing data and providing search capabilities
- Master node: Manages cluster metadata, index creation, and shard allocation
- Data node: Stores data and executes search and aggregation operations
- Coordinating node: Routes requests and merges results from data nodes
- Ingest node: Preprocesses documents before indexing
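A node's assigned roles can be confirmed at runtime; a quick check with the cat nodes API:
GET _cat/nodes?v&h=name,node.role,master,heap.percent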
2. Indices and Shards
- Index: Logical collection of documents with similar characteristics
- Shard: Physical subdivision of an index for horizontal scaling (see the example after this list)
- Primary shard: Original shard that handles write operations
- Replica shard: Copy of primary shard for redundancy and read scaling
- Segment: Immutable Lucene index containing subset of shard data
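Shard count is fixed per index at creation time (replica count can be changed later); a minimal sketch with a hypothetical orders index:
PUT orders
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}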
3. Documents and Mapping
- Document: JSON object stored in an index with unique ID
- Mapping: Schema definition specifying field types and analysis settings
- Field types: Text, keyword, numeric, date, boolean, geo, nested, object
- Dynamic mapping: Automatic field type detection and mapping creation
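A small sketch of dynamic mapping in action, assuming a brand-new index named events:
POST events/_doc
{
  "user": "alice",
  "age": 34,
  "joined": "2023-12-01T10:00:00Z"
}
# Inspect the field types Elasticsearch inferred (text + keyword subfield, long, date)
GET events/_mapping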
4. Search and Query Engine
- Query DSL: JSON-based query language for complex search operations
- Analyzers: Text processing pipeline for tokenization and normalization (see the _analyze sketch after this list)
- Scoring: Relevance calculation using BM25 (the default since version 5.0) or classic TF-IDF similarity
- Aggregations: Real-time analytics and data summarization framework
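The _analyze API makes the analysis pipeline visible; for example, running the standard analyzer by hand:
GET _analyze
{
  "analyzer": "standard",
  "text": "The Quick Brown Foxes"
}
# Returns the lowercased tokens: the, quick, brown, foxes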
Data Flow & Indexing Process
[Diagram: data flow and indexing process]
Lucene Integration
- Inverted index: Core data structure for fast text search
- Segment merging: Background optimization for search performance
- Translog: Write-ahead log for durability and crash recovery
- Memory management: Java heap, off-heap caches, and OS page cache
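Both segments and the translog can be inspected on a live cluster; a sketch using the logs-* pattern from later examples:
# Lucene segments per shard
GET _cat/segments/logs-*?v
# Translog size and operation counts per index
GET logs-*/_stats/translog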
Configuration & Deployment
Production Cluster Configuration
Node Configuration
# elasticsearch.yml - Master Node
cluster.name: production-cluster
node.name: master-node-1
node.roles: [master]
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
# Network settings
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
# Discovery settings
discovery.seed_hosts: ["master-1", "master-2", "master-3"]
cluster.initial_master_nodes: ["master-node-1", "master-node-2", "master-node-3"]
# Memory settings
bootstrap.memory_lock: true
indices.memory.index_buffer_size: 10%
indices.memory.min_index_buffer_size: 48mb
# Security
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true
Data Node Configuration
# elasticsearch.yml - Data Node
cluster.name: production-cluster
node.name: data-node-1
node.roles: [data, ingest]
# Storage configuration
path.data: ["/data1/elasticsearch", "/data2/elasticsearch"]
cluster.routing.allocation.same_shard.host: false
# Index settings
indices.queries.cache.size: 10%
indices.requests.cache.size: 1%
indices.fielddata.cache.size: 30%
# Performance tuning
thread_pool.write.queue_size: 1000
thread_pool.search.queue_size: 1000
JVM Configuration
# jvm.options
-Xms31g
-Xmx31g
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
-XX:+DisableExplicitGC
-Djava.io.tmpdir=/tmp
-Dlog4j2.disable.jmx=true
Index Templates and Mappings
Production Index Template
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"refresh_interval": "10s",
"index.codec": "best_compression",
"index.lifecycle.name": "logs-policy"
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date",
"format": "date_optional_time||epoch_millis"
},
"level": {
"type": "keyword"
},
"message": {
"type": "text",
"analyzer": "standard",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 512
}
}
},
"service": {
"type": "keyword"
},
"host": {
"properties": {
"name": {"type": "keyword"},
"ip": {"type": "ip"}
}
}
}
}
}
}
Custom Analyzers
{
"settings": {
"analysis": {
"analyzer": {
"custom_search_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"custom_stemmer",
"custom_synonyms"
]
}
},
"filter": {
"custom_stemmer": {
"type": "stemmer",
"language": "english"
},
"custom_synonyms": {
"type": "synonym",
"synonyms": [
"fast,quick,rapid",
"big,large,huge"
]
}
}
}
}
}
Docker & Kubernetes Deployment
Docker Compose Setup
# docker-compose.yml
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - cluster.name=docker-cluster
      - node.name=es01
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
      - xpack.security.enabled=false
    volumes:
      - esdata:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
      - "9300:9300"
    ulimits:
      memlock:
        soft: -1
        hard: -1
  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
volumes:
  esdata:
Kubernetes Deployment
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: production-es
spec:
  version: 8.11.0
  nodeSets:
  - name: master
    count: 3
    config:
      node.roles: ["master"]
      xpack.security.authc:
        anonymous:
          enabled: false
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: ES_JAVA_OPTS
            value: "-Xms4g -Xmx4g"
          resources:
            requests:
              memory: 8Gi
              cpu: 2
            limits:
              memory: 8Gi
              cpu: 4
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
        storageClassName: fast-ssd
  - name: data
    count: 6
    config:
      node.roles: ["data", "ingest"]
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: ES_JAVA_OPTS
            value: "-Xms31g -Xmx31g"
          resources:
            requests:
              memory: 64Gi
              cpu: 4
            limits:
              memory: 64Gi
              cpu: 8
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Ti
        storageClassName: fast-ssd
Security Configuration
Authentication and Authorization
# Security settings
xpack.security.enabled: true
xpack.security.enrollment.enabled: true
xpack.security.authc.realms:
  native:
    native1:
      order: 0
  ldap:
    ldap1:
      order: 1
      url: "ldap://ldap.company.com:389"
      bind_dn: "cn=admin,dc=company,dc=com"
# Role-based access control
xpack.security.authz.run_as_enabled: true
TLS Configuration
# Transport layer security
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
# HTTP layer security
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: elastic-certificates.p12
Performance Characteristics
Search Performance Metrics
- Query latency: Sub-100ms for simple queries, <1s for complex aggregations
- Indexing throughput: 10K-100K documents/sec per node, depending on document size and mapping complexity
- Search throughput: 1K-10K queries/sec per node based on complexity
- Index size: Typically 1.5-3x raw data size depending on mappings
Latency Characteristics
Query Type           | P50        | P95       | P99       | P99.9
---------------------|------------|-----------|-----------|-----------
Term query           | 5-15ms     | 15-50ms   | 50-100ms  | 100-300ms
Match query          | 10-30ms    | 30-100ms  | 100-200ms | 200-500ms
Complex aggregation  | 50-200ms   | 200ms-1s  | 1-5s      | 5-20s
Large result set     | 100-500ms  | 500ms-2s  | 2-10s     | 10-30s
Cross-cluster search | 100-300ms  | 300ms-1s  | 1-3s      | 3-10s
Resource Utilization Patterns
Memory Usage
- Heap memory: 50% of available RAM, max 31GB for compressed OOPs
- Page cache: Remaining RAM for Lucene segment caching
- Field data: In-memory data structures for aggregations and sorting
- Query cache: Cached query results for frequently accessed data
CPU Patterns
- Search operations: CPU-intensive for text analysis and scoring
- Indexing: CPU-intensive for document analysis and Lucene operations
- Merging: Background CPU usage for segment optimization
- GC overhead: 5-10% with properly tuned garbage collection
Storage Patterns
- Index growth: 1.5-3x raw data size depending on analysis and mappings
- Segment files: Multiple small files per shard requiring fast I/O
- Translog: Sequential writes for durability
- Snapshots: Point-in-time backups to object storage
Scalability Patterns
- Horizontal scaling: Add nodes and increase shard count
- Index partitioning: Time-based indices for data lifecycle management (see the query sketch after this list)
- Cross-cluster replication: Multi-region deployment for global scaling
- Hot-warm-cold architecture: Tiered storage for cost optimization
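Time-based indices and remote clusters compose at query time; a cross-cluster search sketch, where dr-cluster is a hypothetical remote-cluster alias:
GET logs-2023.12.*,dr-cluster:logs-2023.12.*/_search
{
  "query": { "match_all": {} }
}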
Operational Considerations
Failure Modes & Detection
Node Failures
Symptoms:
- Unassigned shards
- Increased search latency
- Indexing failures
- Cluster state changes
Detection:
# Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"
# Monitor node status
curl -X GET "localhost:9200/_cat/nodes?v"
# Check shard allocation
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"
Memory Issues
Symptoms:
- OutOfMemoryError exceptions
- Long GC pauses
- Circuit breaker trips
- Slow query performance
Detection:
# Monitor JVM memory
curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"
# Check circuit breakers
curl -X GET "localhost:9200/_nodes/stats/breaker?pretty"
# Field data usage
curl -X GET "localhost:9200/_nodes/stats/indices/fielddata?pretty"
Split Brain Scenarios
Symptoms:
- Multiple master nodes
- Inconsistent cluster state
- Data inconsistency
- Write conflicts
Detection:
# Check master nodes
curl -X GET "localhost:9200/_cat/master?v"
# Monitor cluster state
curl -X GET "localhost:9200/_cluster/state?filter_path=master_node,nodes"
Disaster Recovery
Snapshot and Restore
# Register snapshot repository
PUT _snapshot/my_backup
{
"type": "s3",
"settings": {
"bucket": "elasticsearch-backups",
"region": "us-east-1",
"base_path": "production-cluster"
}
}
# Create snapshot
PUT _snapshot/my_backup/snapshot_1
{
"indices": "logs-*,metrics-*",
"ignore_unavailable": true,
"include_global_state": false,
"metadata": {
"taken_by": "backup-service",
"taken_because": "daily-backup"
}
}
# Restore from snapshot
POST _snapshot/my_backup/snapshot_1/_restore
{
"indices": "logs-2023-12-*",
"ignore_unavailable": true,
"index_settings": {
"index.number_of_replicas": 1
}
}
Cross-Cluster Replication
# On the backup (follower) cluster, register the primary cluster as a remote
PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.primary_cluster.seeds": [
      "primary-es-1:9300",
      "primary-es-2:9300"
    ]
  }
}
# Still on the backup cluster, create a follower index that replicates the leader
PUT logs-replica/_ccr/follow
{
  "remote_cluster": "primary_cluster",
  "leader_index": "logs-primary"
}
Point-in-Time Recovery
# Create point-in-time for search
POST logs-*/_pit?keep_alive=1m
# Search with PIT
POST _search
{
"pit": {
"id": "pit_id_here",
"keep_alive": "1m"
},
"query": {
"range": {
"@timestamp": {
"gte": "2023-12-01T00:00:00",
"lte": "2023-12-01T23:59:59"
}
}
}
}
Maintenance Procedures
Index Lifecycle Management
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "50gb",
"max_age": "7d",
"max_docs": 100000000
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"allocate": {
"number_of_replicas": 0
},
"forcemerge": {
"max_num_segments": 1
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"allocate": {
"include": {
"box_type": "cold"
}
}
}
},
"delete": {
"min_age": "90d"
}
}
}
}
Cluster Maintenance
# Rolling restart procedure
# 1. Disable shard allocation
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "primaries"
}
}
# 2. Stop indexing, restart node
# 3. Re-enable allocation
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "all"
}
}
# 4. Wait for cluster recovery
GET _cluster/health?wait_for_status=green&timeout=30s
Troubleshooting Guide
Performance Issues
# Identify slow queries
GET _nodes/stats/indices/search?level=shards
# Check index statistics
GET logs-*/_stats
# Monitor thread pools
GET _nodes/stats/thread_pool
# Hot threads analysis
GET _nodes/hot_threads?threads=3&interval=500ms
Memory Problems
# Clear field data cache
POST _cache/clear?fielddata=true
# Clear query cache
POST _cache/clear?query=true
# Reload secure settings (note: there is no API to force a GC)
POST _nodes/reload_secure_settings
Indexing Issues
# Check indexing rate
GET _cat/indices?v&s=docs.count:desc
# Monitor refresh intervals
GET _stats/refresh
# Check merge statistics
GET _cat/segments?v
Production Best Practices
Index Design and Optimization
Mapping Optimization
{
"mappings": {
"properties": {
"timestamp": {
"type": "date",
"format": "date_optional_time||epoch_millis"
},
"message": {
"type": "text",
"analyzer": "english",
"index_options": "offsets",
"fields": {
"raw": {
"type": "keyword",
"ignore_above": 1024
}
}
},
"tags": {
"type": "keyword",
"doc_values": false,
"index": true
},
"metadata": {
"type": "object",
"enabled": false
}
}
}
}
Performance Tuning
# Bulk indexing optimization
PUT _index_template/bulk_template
{
  "index_patterns": ["bulk-*"],
  "template": {
    "settings": {
      "refresh_interval": "30s",
      "number_of_replicas": 0,
      "index.translog.durability": "async",
      "index.translog.sync_interval": "30s"
    }
  }
}
# Search optimization
PUT logs-*/_settings
{
"index.queries.cache.enabled": true,
"index.requests.cache.enable": true,
"index.refresh_interval": "10s"
}
Query Optimization
Efficient Query Patterns
{
"query": {
"bool": {
"filter": [
{
"term": {
"status": "error"
}
},
{
"range": {
"@timestamp": {
"gte": "now-1h"
}
}
}
],
"must": [
{
"match": {
"message": "database connection"
}
}
]
}
},
"aggs": {
"error_count_by_service": {
"terms": {
"field": "service.keyword",
"size": 10
}
}
}
}
Search Templates
{
"script": {
"lang": "mustache",
"source": {
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"gte": "{{start_time}}",
"lte": "{{end_time}}"
}
}
}
],
"must": [
{
"match": {
"{{search_field}}": "{{search_term}}"
}
}
]
}
}
}
}
}
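To execute it, the template is stored as a script and invoked with parameters; a sketch assuming it was saved under the hypothetical id log_search (via POST _scripts/log_search):
# Invoke the stored template with concrete parameters
GET logs-*/_search/template
{
  "id": "log_search",
  "params": {
    "start_time": "now-1h",
    "end_time": "now",
    "search_field": "message",
    "search_term": "timeout"
  }
}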
Monitoring Setup
Essential Metrics
# Cluster health metrics
cluster.status
cluster.active_primary_shards
cluster.active_shards
cluster.unassigned_shards
# Node metrics
node.jvm.mem.heap_used_percent
node.process.cpu.percent
node.fs.io_stats.total.read_operations
node.fs.io_stats.total.write_operations
# Index metrics
index.docs.count
index.size_in_bytes
index.refresh.total_time_in_millis
index.search.query_total
Alerting Configuration
# Watcher alert example
PUT _watcher/watch/cluster_health_watch
{
"trigger": {
"schedule": {
"interval": "30s"
}
},
"input": {
"http": {
"request": {
"host": "localhost",
"port": 9200,
"path": "/_cluster/health"
}
}
},
"condition": {
"compare": {
"payload.status": {
"eq": "red"
}
}
},
"actions": {
"send_email": {
"email": {
"to": ["admin@company.com"],
"subject": "Elasticsearch Cluster Alert",
"body": "Cluster status is RED"
}
}
}
}
Capacity Planning
Sizing Guidelines
# Shard sizing calculation
# Optimal shard size: 20-40GB
# Max shards per node: 20 × heap_size_gb
# Example: 31GB heap → max 620 shards per node
# Storage estimation
# Index size = raw_data_size × (1 + replica_count) × mapping_overhead
# Example: 100GB × 2 (1 replica) × 1.3 (30% overhead) = 260GB
# Memory requirements
# Heap: 50% of RAM, max 31GB
# Remaining RAM: OS page cache for Lucene segments
# Example: 64GB RAM → 31GB heap + 33GB page cache
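Actual shard sizes can then be compared against the 20-40GB target:
GET _cat/shards?v&h=index,shard,prirep,store&s=store:desc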
Scaling Triggers
- Search latency P95 > 1 second
- Heap utilization > 75%
- CPU utilization > 80%
- Disk utilization > 85%
- Queue rejections > 0
Integration Patterns
Application Integration
# Python Elasticsearch client (elasticsearch-py 8.x)
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Connection with retry and timeout
es = Elasticsearch(
    ['https://es-node1:9200', 'https://es-node2:9200'],
    basic_auth=('username', 'password'),
    verify_certs=True,
    retry_on_timeout=True,
    max_retries=3,
    request_timeout=30
)

# Bulk indexing
def bulk_index_documents(documents):
    actions = [
        {"_index": "logs-2023-12", "_source": doc}
        for doc in documents
    ]
    bulk(es, actions, chunk_size=1000, request_timeout=60)

# Search with a date-range filter and an hourly histogram aggregation
def search_logs(query, start_time, end_time):
    body = {
        "query": {
            "bool": {
                "must": [{"match": {"message": query}}],
                "filter": [
                    {"range": {"@timestamp": {"gte": start_time, "lte": end_time}}}
                ]
            }
        },
        "aggs": {
            "logs_over_time": {
                "date_histogram": {"field": "@timestamp", "calendar_interval": "1h"}
            }
        }
    }
    return es.search(index="logs-*", body=body)
ELK Stack Integration
# Logstash configuration
input {
beats {
port => 5044
}
}
filter {
if [fields][service] == "web" {
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
}
date {
match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch1:9200", "elasticsearch2:9200"]
index => "logs-%{+YYYY.MM.dd}"
template_name => "logs"
template => "/etc/logstash/templates/logs.json"
}
}
Security Best Practices
Authentication and Authorization
# Create roles
POST _security/role/logs_reader
{
"indices": [
{
"names": ["logs-*"],
"privileges": ["read", "view_index_metadata"]
}
]
}
# Create users
POST _security/user/log_analyst
{
"password": "secure_password",
"roles": ["logs_reader"],
"full_name": "Log Analyst",
"email": "analyst@company.com"
}
Field and Document Level Security
# Field level security
POST _security/role/sensitive_data_access
{
"indices": [
{
"names": ["user-data-*"],
"privileges": ["read"],
"field_security": {
"grant": ["*"],
"except": ["ssn", "credit_card"]
}
}
]
}
# Document level security
POST _security/role/regional_access
{
"indices": [
{
"names": ["sales-*"],
"privileges": ["read"],
"query": {
"term": {
"region": "{{user.metadata.region}}"
}
}
}
]
}
Interview-Focused Content
Technology-Specific Questions
Junior Level (2-4 YOE)
Q: What's the difference between a primary shard and a replica shard in Elasticsearch?
A: Primary and replica shards serve different purposes in Elasticsearch:
Primary Shard:
- Original shard containing the actual data
- Handles all write operations (index, update, delete)
- One primary shard per partition of data
- Cannot be changed after index creation
Replica Shard:
- Exact copy of a primary shard
- Provides redundancy for fault tolerance
- Can handle read operations (search, get)
- Can be added or removed dynamically
- Automatically promoted to primary if original primary fails
Example: An index with 3 primary shards and 1 replica will have 6 total shards (3 primary + 3 replica).
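A quick way to verify this on a cluster, with a hypothetical demo-index:
PUT demo-index
{
  "settings": { "number_of_shards": 3, "number_of_replicas": 1 }
}
# Lists 6 rows: 3 primaries (prirep = p) and 3 replicas (prirep = r)
GET _cat/shards/demo-index?v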
Q: How does Elasticsearch achieve near real-time search?
A: Elasticsearch achieves near real-time search through several mechanisms:
- In-memory indexing: Documents are first written to memory (index buffer)
- Refresh operation: Periodically (default 1s) moves documents from memory to searchable segments
- Translog: Write-ahead log ensures durability before refresh
- Segment creation: New segments become immediately searchable
- Background merging: Optimizes segments for better search performance
This process makes documents searchable within seconds of ingestion while maintaining durability.
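Refresh behavior is controllable per request and per index; a brief sketch (the index name logs-app is hypothetical):
# Make a write visible to search before the request returns (slower writes)
PUT logs-app/_doc/1?refresh=wait_for
{
  "message": "deploy finished",
  "@timestamp": "2023-12-01T10:00:00Z"
}
# Trade search freshness for indexing throughput
PUT logs-app/_settings
{
  "index": { "refresh_interval": "30s" }
}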
Mid-Level (4-8 YOE)
Q: How would you optimize an Elasticsearch cluster experiencing slow search performance?
A: Search performance optimization involves multiple approaches:
1. Index Level Optimization:
- Mapping optimization: Use appropriate field types, disable unnecessary features
- Analyzer tuning: Choose efficient analyzers for your use case
- Shard sizing: Keep shards between 20-40GB for optimal performance
2. Query Optimization:
- Use filters over queries: Filters are cached and faster
- Avoid expensive operations: Script queries, wildcard queries on large datasets
- Optimize aggregations: Use composite aggregations for large cardinality
3. Hardware/Configuration:
- Memory allocation: 50% heap, rest for page cache
- SSD storage: Fast storage for segment files
- CPU optimization: Sufficient cores for search threads
4. Monitoring and Diagnostics:
# Identify slow queries
GET _nodes/hot_threads
GET _cat/thread_pool?v&h=name,active,queue,rejected
# Check index statistics
GET _cat/indices?v&h=index,search.query_time&s=search.query_time:desc
Q: Explain Elasticsearch's distributed search execution process.
A: Distributed search in Elasticsearch follows a two-phase process:
Query Phase:
- Coordination: Coordinating node receives search request
- Broadcast: Query sent to all relevant shards (primary or replica)
- Local execution: Each shard executes query locally
- Scoring: Local relevance scores calculated
- Priority queue: Each shard returns top N document IDs and scores
Fetch Phase:
- Global scoring: Coordinating node merges and re-ranks results
- Document retrieval: Fetch actual documents from relevant shards
- Response assembly: Combine documents and return to client
Optimization:
- Use _source filtering to reduce network overhead (sketch below)
- Implement proper routing for single-shard queries
- Use search templates for frequently executed queries
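A minimal example of the first point, returning only the fields a caller needs:
GET logs-*/_search
{
  "_source": ["@timestamp", "service", "message"],
  "query": { "match": { "message": "timeout" } }
}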
Senior Level (8+ YOE)
Q: Design an Elasticsearch architecture for a multi-tenant SaaS application handling 100TB of log data with strict data isolation requirements.
A: Multi-tenant logging architecture design:
Requirements Analysis:
- 100TB data volume requires distributed storage
- Data isolation prevents tenant data leakage
- Search performance across tenant boundaries
- Cost optimization through tiered storage
- Compliance and audit requirements
Architecture Design:
Application Layer → Load Balancer → Coordinating Nodes
                                            ↓
Hot Data Nodes  ← ILM →  Warm Data Nodes  ← ILM →  Cold Data Nodes
       ↓                        ↓                         ↓
Fast SSD Storage          Standard SSD             Object Storage
Implementation Strategy:
- Index Strategy:
  - Index per tenant: logs-{tenant-id}-{date}
  - Time-based rollover for lifecycle management
  - Index templates for consistent mapping
- Security Implementation:
  - Document Level Security (DLS) based on tenant ID
  - Role-based access control per tenant
  - API key authentication with tenant scoping
- Performance Optimization:
  - Dedicated coordinating nodes for search routing
  - Hot-warm-cold architecture for cost optimization
  - Cross-cluster search for historical data
- Operational Considerations:
  - Automated backup and restore per tenant
  - Monitoring and alerting per tenant metrics
  - Capacity planning based on tenant growth
Operational Questions
Q: Your Elasticsearch cluster shows "red" status with unassigned shards. Walk through your troubleshooting process.
A: Red cluster troubleshooting methodology:
1. Immediate Assessment:
# Check overall cluster health
GET _cluster/health?level=indices
# Identify unassigned shards
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason
# Check node availability
GET _cat/nodes?v
2. Common Root Causes:
Node Failure:
# Check if nodes left the cluster
GET _cluster/state/nodes
# Verify disk space and memory
GET _nodes/stats/fs,jvm
Shard Allocation Issues:
# Check allocation explain
GET _cluster/allocation/explain
# Review allocation settings
GET _cluster/settings?include_defaults=true
3. Resolution Steps:
For Missing Nodes:
- Restart failed nodes if possible
- If data is lost, consider allocating replicas as primaries
For Allocation Issues:
# Enable allocation if disabled
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}
# Force allocation if necessary (data loss risk)
POST _cluster/reroute
{
"commands": [
{
"allocate_empty_primary": {
"index": "my-index",
"shard": 0,
"node": "node-1",
"accept_data_loss": true
}
}
]
}
Q: How do you handle capacity planning for an Elasticsearch cluster with unpredictable growth patterns?
A: Dynamic capacity planning approach:
1. Monitoring Strategy:
- Track growth rates: daily, weekly, monthly patterns
- Monitor resource utilization trends
- Set up predictive alerting based on growth rate
2. Elastic Scaling Architecture:
# Hot-warm-cold with automatic tier movement
PUT _ilm/policy/dynamic_logs_policy
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "50gb",
"max_age": "1d"
}
}
},
"warm": {
"min_age": "3d",
"actions": {
"allocate": {
"number_of_replicas": 0
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"searchable_snapshot": {
"snapshot_repository": "cold_repository"
}
}
}
}
}
}
3. Automated Scaling:
- Kubernetes HPA for coordinating nodes
- Auto-scaling data nodes based on storage utilization
- Scheduled scaling for predictable load patterns
4. Cost Optimization:
- Use searchable snapshots for long-term retention
- Implement compression and forcemerge in warm phase
- Archive to cheaper storage tiers automatically
Design Integration
Q: How would you integrate Elasticsearch into a microservices architecture for centralized logging and monitoring?
A: Microservices logging integration design:
Architecture Overview:
Microservice A → Filebeat → Logstash → Elasticsearch ← Kibana / Grafana
Microservice B → Fluentd → Kafka → Logstash → Elasticsearch
Microservice C → Direct API calls → ES Ingest Nodes
Implementation Strategy:
- Log Collection:
  - Sidecar pattern with Filebeat for container logs
  - Structured logging with correlation IDs
  - Service mesh integration for automatic metadata
- Processing Pipeline:
# Logstash configuration
filter {
if [kubernetes][container][name] == "api-service" {
grok {
match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:trace_id}\] %{GREEDYDATA:msg}" }
}
mutate {
add_field => { "service_type" => "api" }
}
}
}
- Index Strategy:
  - Service-specific indices: logs-{service}-{date}
  - Common mapping for correlation across services
  - Retention policies based on service criticality
- Observability Integration:
  - Correlation with metrics (Prometheus) and traces (Jaeger)
  - Alerting based on log patterns and anomalies
  - Dashboard creation for service-specific insights
Trade-off Analysis
Q: When would you choose Elasticsearch over other search solutions like Solr or database-based search?
A: Search technology selection criteria:
Choose Elasticsearch when:
- Real-time analytics: Need complex aggregations and visualizations
- Scalability: Horizontal scaling requirements
- Developer experience: REST API and JSON-based queries
- Ecosystem: Integration with Logstash, Kibana, Beats
- Cloud-native: Containerized deployments and auto-scaling
Choose Solr when:
- Advanced search features: Faceting, highlighting, spell checking
- Document-centric: Traditional search applications
- SQL support: Need SQL-like query interface
- Mature deployments: Existing Solr expertise and infrastructure
Choose Database Search when:
- Simple search: Basic text search within existing data
- ACID requirements: Strong consistency guarantees
- Small scale: Limited data volume and query complexity
- Cost sensitivity: Avoiding additional infrastructure
Specific Scenarios:
- E-commerce product search: Elasticsearch (real-time updates, faceting)
- Academic paper search: Solr (complex text analysis, relevance tuning)
- Internal document search: Database FTS (simple, existing infrastructure)
- Log analytics: Elasticsearch (time-series data, real-time dashboards)
Troubleshooting Scenarios
Q: Users report that search results are missing recent documents. How do you investigate and resolve this issue?
A: Missing recent documents troubleshooting:
1. Initial Investigation:
# Check refresh interval and last refresh time
GET logs-*/_settings?filter_path=*.settings.index.refresh_interval
GET logs-*/_stats/refresh
# Verify document count trends
GET _cat/indices/logs-*?v&s=creation.date:desc
2. Common Root Causes:
Refresh Configuration:
# Check whether refresh is disabled (refresh_interval: -1)
GET logs-*/_settings?filter_path=*.settings.index.refresh_interval
# Manual refresh to test
POST logs-*/_refresh
Indexing Pipeline Issues:
# Check indexing rate
GET _stats/indexing
# Monitor ingest pipeline statistics
GET _nodes/stats/ingest
Routing/Shard Issues:
# Verify document routing
GET logs-*/_search
{
"query": {
"range": {
"@timestamp": {
"gte": "now-5m"
}
}
}
}
3. Resolution Steps:
Immediate Fix:
# Force refresh if needed
POST _refresh
# Check for stuck operations
GET _cat/pending_tasks?v
Long-term Solution:
- Adjust refresh interval based on requirements
- Monitor indexing pipeline health
- Set up alerting for indexing delays
- Implement proper error handling in applications
4. Prevention:
- Regular monitoring of indexing metrics
- Proper refresh interval configuration
- Health checks for indexing pipeline
- Documentation of troubleshooting procedures