Elasticsearch

System Architecture

Difficulty: advanced
Reading time: 40-55 minutes
Tags: elasticsearch, search, analytics, lucene, distributed-systems, logging

Distributed search and analytics engine built on Apache Lucene for real-time search, logging, and data analytics


Overview

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene, designed for horizontal scalability, reliability, and real-time search capabilities. It addresses the critical challenge of providing fast, relevant search and complex analytics across large volumes of structured and unstructured data.

Originally developed by Shay Banon and first released in 2010, Elasticsearch has become a standard choice for search, logging, and analytics at companies such as GitHub, Netflix, Uber, and Stack Overflow. Production deployments routinely index billions of documents and petabytes of data while serving sub-second queries, with a design focused on high availability, distributed operation, and operational simplicity.

Key capabilities include:

  • Full-text search: Advanced text analysis with relevance scoring and fuzzy matching
  • Real-time analytics: Complex aggregations and time-series analysis at scale
  • Distributed architecture: Automatic sharding, replication, and cluster management
  • Schema flexibility: Dynamic mapping with support for complex nested data structures
  • Near real-time indexing: Documents available for search within seconds of ingestion
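
As a quick illustration of these capabilities, the sketch below (elasticsearch-py 8.x; host and index name are placeholders) indexes a JSON document and immediately searches it back with a full-text query:

# Minimal quick-start sketch; connection details are placeholders
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document; the mapping for new fields is created dynamically
es.index(index="app-logs", document={"level": "ERROR", "message": "database connection refused"})

# Force a refresh so the document is searchable right away, then run a match query
es.indices.refresh(index="app-logs")
resp = es.search(index="app-logs", query={"match": {"message": "database"}})
print(resp["hits"]["total"]["value"])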

Architecture & Core Components

System Architecture

System Architecture Diagram

Core Components

1. Cluster and Nodes

  • Cluster: Collection of nodes storing data and providing search capabilities
  • Master node: Manages cluster metadata, index creation, and shard allocation
  • Data node: Stores data and executes search and aggregation operations
  • Coordinating node: Routes requests and merges results from data nodes
  • Ingest node: Preprocesses documents before indexing
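
To see how these roles are laid out on a running cluster, the _cat/nodes API reports each node's roles; a minimal sketch with the Python client (connection details are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder; add basic_auth/ca_certs for a secured cluster

# One row per node: name, role letters (m=master, d=data, i=ingest, ...), elected-master marker, heap usage
for node in es.cat.nodes(h="name,node.role,master,heap.percent", format="json"):
    print(node["name"], node["node.role"], node["master"], node["heap.percent"])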

2. Indices and Shards

  • Index: Logical collection of documents with similar characteristics
  • Shard: Physical subdivision of an index for horizontal scaling
  • Primary shard: Original shard that handles write operations
  • Replica shard: Copy of primary shard for redundancy and read scaling
  • Segment: Immutable Lucene index containing subset of shard data

3. Documents and Mapping

  • Document: JSON object stored in an index with unique ID
  • Mapping: Schema definition specifying field types and analysis settings
  • Field types: Text, keyword, numeric, date, boolean, geo, nested, object
  • Dynamic mapping: Automatic field type detection and mapping creation
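
Dynamic mapping is easy to observe: index a document into a new index and fetch the mapping Elasticsearch inferred for it. A small sketch (index name and fields are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Indexing into a non-existent index creates it and infers field types
es.index(index="orders-demo", id="1", document={
    "order_id": 1001,                       # inferred as long
    "total": 19.99,                         # inferred as float
    "created_at": "2023-12-01T10:00:00Z",   # inferred as date
    "note": "gift wrap requested"           # inferred as text with a .keyword sub-field
})

print(es.indices.get_mapping(index="orders-demo"))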

4. Search and Query Engine

  • Query DSL: JSON-based query language for complex search operations
  • Analyzers: Text processing pipeline for tokenization and normalization
  • Scoring: Relevance ranking using BM25 (the default scoring model, which superseded classic TF-IDF)
  • Aggregations: Real-time analytics and data summarization framework

Data Flow & Indexing Process

System Architecture Diagram

Lucene Integration

  • Inverted index: Core data structure for fast text search
  • Segment merging: Background optimization for search performance
  • Translog: Write-ahead log for durability and crash recovery
  • Memory management: Java heap, off-heap caches, and OS page cache
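
The relationship between the index buffer, refresh, translog, and segments can be exercised directly: refresh turns buffered documents into new searchable segments, while flush performs a Lucene commit and trims the translog. A sketch (index name is illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

es.index(index="translog-demo", document={"msg": "hello"})

# Refresh: write the in-memory buffer into a new searchable segment
es.indices.refresh(index="translog-demo")

# Flush: fsync segments to disk (Lucene commit) and clear the translog
es.indices.flush(index="translog-demo")

# Inspect the Lucene segments backing each shard
for seg in es.cat.segments(index="translog-demo", format="json"):
    print(seg["segment"], seg["docs.count"], seg["size"])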

Configuration & Deployment

Production Cluster Configuration

Node Configuration

# elasticsearch.yml - Master Node
cluster.name: production-cluster
node.name: master-node-1
node.roles: [master]
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch

# Network settings
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

# Discovery settings
discovery.seed_hosts: ["master-1", "master-2", "master-3"]
cluster.initial_master_nodes: ["master-node-1", "master-node-2", "master-node-3"]

# Memory settings
bootstrap.memory_lock: true
indices.memory.index_buffer_size: 10%
indices.memory.min_index_buffer_size: 48mb

# Security
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true

Data Node Configuration

# elasticsearch.yml - Data Node
cluster.name: production-cluster
node.name: data-node-1
node.roles: [data, ingest]

# Storage configuration
path.data: ["/data1/elasticsearch", "/data2/elasticsearch"]
cluster.routing.allocation.same_shard.host: false

# Index settings
indices.queries.cache.size: 10%
indices.requests.cache.size: 1%
indices.fielddata.cache.size: 30%

# Performance tuning
thread_pool.write.queue_size: 1000
thread_pool.search.queue_size: 1000

JVM Configuration

# jvm.options
-Xms31g
-Xmx31g
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:+UnlockExperimentalVMOptions
-XX:G1MixedGCCountTarget=4
-XX:G1MixedGCLiveThresholdPercent=90
-XX:+DisableExplicitGC
-Djava.io.tmpdir=/tmp
-Dlog4j2.disable.jmx=true

Index Templates and Mappings

Production Index Template

{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "10s",
      "index.codec": "best_compression",
      "index.lifecycle.name": "logs-policy"
    },
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date",
          "format": "date_optional_time||epoch_millis"
        },
        "level": {
          "type": "keyword"
        },
        "message": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 512
            }
          }
        },
        "service": {
          "type": "keyword"
        },
        "host": {
          "properties": {
            "name": {"type": "keyword"},
            "ip": {"type": "ip"}
          }
        }
      }
    }
  }
}
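
The template body above is registered under a name with the composable index template API so that every new index matching logs-* picks it up. A trimmed sketch with the Python client (template and index names are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Register a (trimmed) version of the template above under the name "logs-template"
es.indices.put_index_template(
    name="logs-template",
    index_patterns=["logs-*"],
    priority=100,
    template={
        "settings": {"number_of_shards": 3, "number_of_replicas": 1},
        "mappings": {"properties": {"@timestamp": {"type": "date"}, "level": {"type": "keyword"}}},
    },
)

# Any index created afterwards that matches logs-* inherits these settings and mappings
es.indices.create(index="logs-2023-12-01")
print(es.indices.get_mapping(index="logs-2023-12-01"))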

Custom Analyzers

{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_stemmer",
            "custom_synonyms"
          ]
        }
      },
      "filter": {
        "custom_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "custom_synonyms": {
          "type": "synonym",
          "synonyms": [
            "fast,quick,rapid",
            "big,large,huge"
          ]
        }
      }
    }
  }
}
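
The _analyze API shows how this analyzer tokenizes input, which is useful for verifying stemming and synonym expansion before indexing real data. A sketch, assuming the analysis settings above were applied to an index named search-demo:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Assumes an index "search-demo" was created with the analysis settings above
resp = es.indices.analyze(
    index="search-demo",
    analyzer="custom_search_analyzer",
    text="The QUICK database was fast",
)

# Expect lowercased, stemmed tokens plus synonym expansions of "quick"/"fast"
print([t["token"] for t in resp["tokens"]])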

Docker & Kubernetes Deployment

Docker Compose Setup

# docker-compose.yml
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - cluster.name=docker-cluster
      - node.name=es01
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
      - xpack.security.enabled=false
    volumes:
      - esdata:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
      - "9300:9300"
    ulimits:
      memlock:
        soft: -1
        hard: -1

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

volumes:
  esdata:

Kubernetes Deployment

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: production-es
spec:
  version: 8.11.0
  nodeSets:
  - name: master
    count: 3
    config:
      node.roles: ["master"]
      xpack.security.authc:
        anonymous:
          enabled: false
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: ES_JAVA_OPTS
            value: "-Xms4g -Xmx4g"
          resources:
            requests:
              memory: 8Gi
              cpu: 2
            limits:
              memory: 8Gi
              cpu: 4
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
        storageClassName: fast-ssd
  
  - name: data
    count: 6
    config:
      node.roles: ["data", "ingest"]
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: ES_JAVA_OPTS
            value: "-Xms31g -Xmx31g"
          resources:
            requests:
              memory: 64Gi
              cpu: 4
            limits:
              memory: 64Gi
              cpu: 8
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Ti
        storageClassName: fast-ssd

Security Configuration

Authentication and Authorization

# Security settings
xpack.security.enabled: true
xpack.security.enrollment.enabled: true
xpack.security.authc.realms:
  native:
    native1:
      order: 0
  ldap:
    ldap1:
      order: 1
      url: "ldap://ldap.company.com:389"
      bind_dn: "cn=admin,dc=company,dc=com"

# Role-based access control
xpack.security.authz.run_as_enabled: true

TLS Configuration

# Transport layer security
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12

# HTTP layer security
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: elastic-certificates.p12
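
Clients connecting to a TLS-enabled cluster need the CA that signed the HTTP-layer certificates; a minimal sketch with the Python client (certificate path and credentials are placeholders):

from elasticsearch import Elasticsearch

# ca_certs points at the CA used to sign the HTTP-layer certificates
es = Elasticsearch(
    "https://es-node1:9200",
    ca_certs="/etc/elasticsearch/certs/http_ca.crt",
    basic_auth=("elastic", "changeme"),
    verify_certs=True,
)
print(es.info()["version"]["number"])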

Performance Characteristics

Search Performance Metrics

  • Query latency: Sub-100ms for simple queries, <1s for complex aggregations
  • Indexing throughput: 10K-100K documents/sec per node, depending on document size and mapping complexity
  • Search throughput: 1K-10K queries/sec per node based on complexity
  • Index size: Typically 1.5-3x raw data size depending on mappings

Latency Characteristics

Query Type              | P50       | P95       | P99       | P99.9
------------------------|-----------|-----------|-----------|-----------
Term query              | 5-15ms    | 15-50ms   | 50-100ms  | 100-300ms
Match query             | 10-30ms   | 30-100ms  | 100-200ms | 200-500ms
Complex aggregation     | 50-200ms  | 200ms-1s  | 1-5s      | 5-20s
Large result set        | 100-500ms | 500ms-2s  | 2-10s     | 10-30s
Cross-cluster search    | 100-300ms | 300ms-1s  | 1-3s      | 3-10s

Resource Utilization Patterns

Memory Usage

  • Heap memory: 50% of available RAM, max 31GB for compressed OOPs
  • Page cache: Remaining RAM for Lucene segment caching
  • Field data: In-memory data structures for aggregations and sorting
  • Query cache: Cached query results for frequently accessed data

CPU Patterns

  • Search operations: CPU-intensive for text analysis and scoring
  • Indexing: CPU-intensive for document analysis and Lucene operations
  • Merging: Background CPU usage for segment optimization
  • GC overhead: 5-10% with properly tuned garbage collection

Storage Patterns

  • Index growth: 1.5-3x raw data size depending on analysis and mappings
  • Segment files: Multiple small files per shard requiring fast I/O
  • Translog: Sequential writes for durability
  • Snapshots: Point-in-time backups to object storage

Scalability Patterns

  • Horizontal scaling: Add nodes and increase shard count
  • Index partitioning: Time-based indices for data lifecycle management
  • Cross-cluster replication: Multi-region deployment for global scaling
  • Hot-warm-cold architecture: Tiered storage for cost optimization

Operational Considerations

Failure Modes & Detection

Node Failures

Symptoms:

  • Unassigned shards
  • Increased search latency
  • Indexing failures
  • Cluster state changes

Detection:

# Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"

# Monitor node status
curl -X GET "localhost:9200/_cat/nodes?v"

# Check shard allocation
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"

Memory Issues

Symptoms:

  • OutOfMemoryError exceptions
  • Long GC pauses
  • Circuit breaker trips
  • Slow query performance

Detection:

# Monitor JVM memory
curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"

# Check circuit breakers
curl -X GET "localhost:9200/_nodes/stats/breaker?pretty"

# Field data usage
curl -X GET "localhost:9200/_nodes/stats/indices/fielddata?pretty"

Split Brain Scenarios

Symptoms:

  • Multiple master nodes
  • Inconsistent cluster state
  • Data inconsistency
  • Write conflicts

Detection:

# Check master nodes
curl -X GET "localhost:9200/_cat/master?v"

# Monitor cluster state
curl -X GET "localhost:9200/_cluster/state?filter_path=master_node,nodes"

Disaster Recovery

Snapshot and Restore

# Register snapshot repository
PUT _snapshot/my_backup
{
  "type": "s3",
  "settings": {
    "bucket": "elasticsearch-backups",
    "region": "us-east-1",
    "base_path": "production-cluster"
  }
}

# Create snapshot
PUT _snapshot/my_backup/snapshot_1
{
  "indices": "logs-*,metrics-*",
  "ignore_unavailable": true,
  "include_global_state": false,
  "metadata": {
    "taken_by": "backup-service",
    "taken_because": "daily-backup"
  }
}

# Restore from snapshot
POST _snapshot/my_backup/snapshot_1/_restore
{
  "indices": "logs-2023-12-*",
  "ignore_unavailable": true,
  "index_settings": {
    "index.number_of_replicas": 1
  }
}

Cross-Cluster Replication

# On the follower (backup) cluster: register the leader cluster as a remote
PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.leader_cluster.seeds": [
      "primary-es-1:9300",
      "primary-es-2:9300"
    ]
  }
}

# Create a follower index on the local cluster that replicates the leader index
PUT logs-replica/_ccr/follow
{
  "remote_cluster": "leader_cluster",
  "leader_index": "logs-primary"
}

Point-in-Time Recovery

# Create point-in-time for search
POST logs-*/_pit?keep_alive=1m

# Search with PIT
POST _search
{
  "pit": {
    "id": "pit_id_here",
    "keep_alive": "1m"
  },
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2023-12-01T00:00:00",
        "lte": "2023-12-01T23:59:59"
      }
    }
  }
}

Maintenance Procedures

Index Lifecycle Management

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "7d",
            "max_docs": 100000000
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "number_of_replicas": 0
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "include": {
              "box_type": "cold"
            }
          }
        }
      },
      "delete": {
        "min_age": "90d"
      }
    }
  }
}
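
Rollover in the hot phase only takes effect when the policy is attached to an index template and the first index is bootstrapped with a write alias. A sketch of that wiring (policy, template, and alias names are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Register a trimmed version of the lifecycle policy shown above
es.ilm.put_lifecycle(name="logs-policy", policy={
    "phases": {
        "hot": {"actions": {"rollover": {"max_size": "50gb", "max_age": "7d"}}},
        "delete": {"min_age": "90d", "actions": {"delete": {}}},
    }
})

# Attach the policy and rollover alias to future logs-* indices via a template
es.indices.put_index_template(
    name="logs-ilm-template",
    index_patterns=["logs-*"],
    template={"settings": {
        "index.lifecycle.name": "logs-policy",
        "index.lifecycle.rollover_alias": "logs",
    }},
)

# Bootstrap index carrying the write alias that rollover advances
es.indices.create(index="logs-000001", aliases={"logs": {"is_write_index": True}})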

Cluster Maintenance

# Rolling restart procedure
# 1. Disable shard allocation
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# 2. Stop indexing, restart node

# 3. Re-enable allocation
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}

# 4. Wait for cluster recovery
GET _cluster/health?wait_for_status=green&timeout=30s

Troubleshooting Guide

Performance Issues

# Identify slow queries
GET _nodes/stats/indices/search?level=shards

# Check index statistics
GET logs-*/_stats

# Monitor thread pools
GET _nodes/stats/thread_pool

# Hot threads analysis
GET _nodes/hot_threads?threads=3&interval=500ms

Memory Problems

# Clear field data cache
POST _cache/clear?fielddata=true

# Clear query cache
POST _cache/clear?query=true

# Note: there is no API to force garbage collection; if heap pressure persists
# after clearing caches, reduce query/aggregation load or restart the affected node

Indexing Issues

# Check indexing rate
GET _cat/indices?v&s=docs.count:desc

# Monitor refresh intervals
GET _stats/refresh

# Check merge statistics
GET _cat/segments?v

Production Best Practices

Index Design and Optimization

Mapping Optimization

{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "date_optional_time||epoch_millis"
      },
      "message": {
        "type": "text",
        "analyzer": "english",
        "index_options": "offsets",
        "fields": {
          "raw": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      },
      "tags": {
        "type": "keyword",
        "doc_values": false,
        "index": true
      },
      "metadata": {
        "type": "object",
        "enabled": false
      }
    }
  }
}

Performance Tuning

# Bulk indexing optimization
PUT _template/bulk_template
{
  "index_patterns": ["bulk-*"],
  "settings": {
    "refresh_interval": "30s",
    "number_of_replicas": 0,
    "index.translog.durability": "async",
    "index.translog.sync_interval": "30s"
  }
}

# Search optimization (dynamic index settings; index.queries.cache.enabled is
# static and can only be set at index creation or on a closed index)
PUT logs-*/_settings
{
  "index.requests.cache.enable": true,
  "index.refresh_interval": "10s"
}

Query Optimization

Efficient Query Patterns

{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "status": "error"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-1h"
            }
          }
        }
      ],
      "must": [
        {
          "match": {
            "message": "database connection"
          }
        }
      ]
    }
  },
  "aggs": {
    "error_count_by_service": {
      "terms": {
        "field": "service.keyword",
        "size": 10
      }
    }
  }
}

Search Templates

{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "@timestamp": {
                  "gte": "{{start_time}}",
                  "lte": "{{end_time}}"
                }
              }
            }
          ],
          "must": [
            {
              "match": {
                "{{search_field}}": "{{search_term}}"
              }
            }
          ]
        }
      }
    }
  }
}
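
A template like this is stored as a mustache script and invoked with concrete parameters at search time. A sketch (script id and parameter values are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Store the mustache template shown above under an id
es.put_script(id="time_window_search", script={
    "lang": "mustache",
    "source": {
        "query": {
            "bool": {
                "filter": [{"range": {"@timestamp": {"gte": "{{start_time}}", "lte": "{{end_time}}"}}}],
                "must": [{"match": {"{{search_field}}": "{{search_term}}"}}],
            }
        }
    },
})

# Execute the stored template with parameters
resp = es.search_template(index="logs-*", id="time_window_search", params={
    "start_time": "now-1h",
    "end_time": "now",
    "search_field": "message",
    "search_term": "timeout",
})
print(resp["hits"]["total"])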

Monitoring Setup

Essential Metrics

# Cluster health metrics
cluster.status
cluster.active_primary_shards
cluster.active_shards
cluster.unassigned_shards

# Node metrics
node.jvm.mem.heap_used_percent
node.process.cpu.percent
node.fs.io_stats.total.read_operations
node.fs.io_stats.total.write_operations

# Index metrics
index.docs.count
index.size_in_bytes
index.refresh.total_time_in_millis
index.search.query_total

Alerting Configuration

# Watcher alert example
PUT _watcher/watch/cluster_health_watch
{
  "trigger": {
    "schedule": {
      "interval": "30s"
    }
  },
  "input": {
    "http": {
      "request": {
        "host": "localhost",
        "port": 9200,
        "path": "/_cluster/health"
      }
    }
  },
  "condition": {
    "compare": {
      "payload.status": {
        "eq": "red"
      }
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "to": ["admin@company.com"],
        "subject": "Elasticsearch Cluster Alert",
        "body": "Cluster status is RED"
      }
    }
  }
}

Capacity Planning

Sizing Guidelines

# Shard sizing calculation
# Optimal shard size: 20-40GB
# Max shards per node: 20 × heap_size_gb
# Example: 31GB heap → max 620 shards per node

# Storage estimation
# Index size = raw_data_size × (1 + replica_count) × mapping_overhead
# Example: 100GB × 2 (1 replica) × 1.3 (30% overhead) = 260GB

# Memory requirements
# Heap: 50% of RAM, max 31GB
# Remaining RAM: OS page cache for Lucene segments
# Example: 64GB RAM → 31GB heap + 33GB page cache
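
These rules of thumb are straightforward to encode; the sketch below turns the comments above into a small sizing calculator (the inputs are examples, not recommendations):

# Back-of-the-envelope sizing based on the guidelines above

def estimate_cluster(raw_data_gb, replicas=1, mapping_overhead=1.3,
                     ram_per_node_gb=64, target_shard_gb=30):
    heap_gb = min(ram_per_node_gb // 2, 31)       # 50% of RAM, capped at 31GB
    page_cache_gb = ram_per_node_gb - heap_gb     # remainder left to the OS page cache
    total_index_gb = raw_data_gb * (1 + replicas) * mapping_overhead
    shard_count = max(1, round(total_index_gb / target_shard_gb))
    max_shards_per_node = 20 * heap_gb            # guideline: ~20 shards per GB of heap
    return {
        "total_index_gb": total_index_gb,
        "shard_count": shard_count,
        "heap_gb": heap_gb,
        "page_cache_gb": page_cache_gb,
        "max_shards_per_node": max_shards_per_node,
    }

# Example: 100GB raw data, 1 replica, 30% overhead -> ~260GB on disk
print(estimate_cluster(raw_data_gb=100))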

Scaling Triggers

  • Search latency P95 > 1 second
  • Heap utilization > 75%
  • CPU utilization > 80%
  • Disk utilization > 85%
  • Queue rejections > 0

Integration Patterns

Application Integration

# Python Elasticsearch client
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk, scan

# Connection with retry and timeout
es = Elasticsearch(
    ['https://es-node1:9200', 'https://es-node2:9200'],
    basic_auth=('username', 'password'),
    verify_certs=True,
    retry_on_timeout=True,
    max_retries=3,
    request_timeout=30
)

# Bulk indexing
def bulk_index_documents(documents):
    actions = [
        {
            "_index": "logs-2023-12",
            "_source": doc
        }
        for doc in documents
    ]
    
    bulk(es, actions, chunk_size=1000, request_timeout=60)

# Search with aggregations
def search_logs(query, start_time, end_time):
    body = {
        "query": {
            "bool": {
                "must": [
                    {"match": {"message": query}}
                ],
                "filter": [
                    {
                        "range": {
                            "@timestamp": {
                                "gte": start_time,
                                "lte": end_time
                            }
                        }
                    }
                ]
            }
        },
        "aggs": {
            "logs_over_time": {
                "date_histogram": {
                    "field": "@timestamp",
                    "interval": "1h"
                }
            }
        }
    }
    
    return es.search(index="logs-*", body=body)

ELK Stack Integration

# Logstash configuration
input {
  beats {
    port => 5044
  }
}

filter {
  if [fields][service] == "web" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch1:9200", "elasticsearch2:9200"]
    index => "logs-%{+YYYY.MM.dd}"
    template_name => "logs"
    template => "/etc/logstash/templates/logs.json"
  }
}

Security Best Practices

Authentication and Authorization

# Create roles
POST _security/role/logs_reader
{
  "indices": [
    {
      "names": ["logs-*"],
      "privileges": ["read", "view_index_metadata"]
    }
  ]
}

# Create users
POST _security/user/log_analyst
{
  "password": "secure_password",
  "roles": ["logs_reader"],
  "full_name": "Log Analyst",
  "email": "analyst@company.com"
}

Field and Document Level Security

# Field level security
POST _security/role/sensitive_data_access
{
  "indices": [
    {
      "names": ["user-data-*"],
      "privileges": ["read"],
      "field_security": {
        "grant": ["*"],
        "except": ["ssn", "credit_card"]
      }
    }
  ]
}

# Document level security
POST _security/role/regional_access
{
  "indices": [
    {
      "names": ["sales-*"],
      "privileges": ["read"],
      "query": {
        "term": {
          "region": "{{user.metadata.region}}"
        }
      }
    }
  ]
}

Interview-Focused Content

Technology-Specific Questions

Junior Level (2-4 YOE)

Q: What's the difference between a primary shard and a replica shard in Elasticsearch?

A: Primary and replica shards serve different purposes in Elasticsearch:

Primary Shard:

  • Original shard containing the actual data
  • Handles all write operations (index, update, delete)
  • One primary shard per partition of data
  • Cannot be changed after index creation

Replica Shard:

  • Exact copy of a primary shard
  • Provides redundancy for fault tolerance
  • Can handle read operations (search, get)
  • Can be added or removed dynamically
  • Automatically promoted to primary if original primary fails

Example: An index with 3 primary shards and 1 replica will have 6 total shards (3 primary + 3 replica).
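
This is easy to confirm against a running cluster: create an index with 3 primaries and 1 replica and count what _cat/shards reports. A sketch (index name and connection are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

es.indices.create(index="shard-demo", settings={
    "number_of_shards": 3,
    "number_of_replicas": 1,
})

shards = es.cat.shards(index="shard-demo", h="shard,prirep,state", format="json")
print(len(shards))                                      # 6 shards total
print(sum(1 for s in shards if s["prirep"] == "p"))     # 3 primaries
print(sum(1 for s in shards if s["prirep"] == "r"))     # 3 replicas (unassigned on a single node)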

Q: How does Elasticsearch achieve near real-time search?

A: Elasticsearch achieves near real-time search through several mechanisms:

  1. In-memory indexing: Documents are first written to memory (index buffer)
  2. Refresh operation: Periodically (default 1s) moves documents from memory to searchable segments
  3. Translog: Write-ahead log ensures durability before refresh
  4. Segment creation: New segments become immediately searchable
  5. Background merging: Optimizes segments for better search performance

This process makes documents searchable within seconds of ingestion while maintaining durability.
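
The effect of the refresh interval is easy to demonstrate: a freshly indexed document is returned by a real-time GET but not by search until a refresh happens. A sketch (index name is illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

es.indices.create(index="nrt-demo", settings={"refresh_interval": "30s"})
es.index(index="nrt-demo", id="1", document={"msg": "not yet searchable"})

# GET by ID is real-time (served via the translog), but search only sees refreshed segments
print(es.get(index="nrt-demo", id="1")["found"])    # True
print(es.count(index="nrt-demo")["count"])          # 0 before refresh

es.indices.refresh(index="nrt-demo")
print(es.count(index="nrt-demo")["count"])          # 1 after refresh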

Mid-Level (4-8 YOE)

Q: How would you optimize an Elasticsearch cluster experiencing slow search performance?

A: Search performance optimization involves multiple approaches:

1. Index Level Optimization:

  • Mapping optimization: Use appropriate field types, disable unnecessary features
  • Analyzer tuning: Choose efficient analyzers for your use case
  • Shard sizing: Keep shards between 20-40GB for optimal performance

2. Query Optimization:

  • Use filters over queries: Filters are cached and faster
  • Avoid expensive operations: Script queries, wildcard queries on large datasets
  • Optimize aggregations: Use composite aggregations for large cardinality

3. Hardware/Configuration:

  • Memory allocation: 50% heap, rest for page cache
  • SSD storage: Fast storage for segment files
  • CPU optimization: Sufficient cores for search threads

4. Monitoring and Diagnostics:

# Identify slow queries
GET _nodes/hot_threads
GET _cat/thread_pool?v&h=name,active,queue,rejected

# Check index statistics
GET _cat/indices?v&s=search.query_time_in_millis:desc

Q: Explain Elasticsearch's distributed search execution process.

A: Distributed search in Elasticsearch follows a two-phase process:

Query Phase:

  1. Coordination: Coordinating node receives search request
  2. Broadcast: Query sent to all relevant shards (primary or replica)
  3. Local execution: Each shard executes query locally
  4. Scoring: Local relevance scores calculated
  5. Priority queue: Each shard returns top N document IDs and scores

Fetch Phase:

  1. Global scoring: Coordinating node merges and re-ranks results
  2. Document retrieval: Fetch actual documents from relevant shards
  3. Response assembly: Combine documents and return to client

Optimization:

  • Use _source filtering to reduce network overhead
  • Implement proper routing for single-shard queries
  • Use search templates for frequently executed queries
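
Custom routing is the concrete form of the single-shard optimization: documents indexed with the same routing value land on one shard, and searches that pass the same value query only that shard. A sketch (index and routing values are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Route all of tenant-42's documents to a single shard
es.index(index="events", routing="tenant-42",
         document={"tenant": "tenant-42", "action": "login"})
es.indices.refresh(index="events")

# Searching with the same routing value touches only that shard instead of all of them
resp = es.search(index="events", routing="tenant-42",
                 query={"term": {"tenant.keyword": "tenant-42"}})
print(resp["hits"]["total"]["value"])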

Senior Level (8+ YOE)

Q: Design an Elasticsearch architecture for a multi-tenant SaaS application handling 100TB of log data with strict data isolation requirements.

A: Multi-tenant logging architecture design:

Requirements Analysis:

  • 100TB data volume requires distributed storage
  • Data isolation prevents tenant data leakage
  • Search performance across tenant boundaries
  • Cost optimization through tiered storage
  • Compliance and audit requirements

Architecture Design:

Application Layer → Load Balancer → Coordinating Nodes
                                         ↓
Hot Data Nodes ← ILM → Warm Data Nodes ← ILM → Cold Data Nodes
     ↓                      ↓                      ↓
 Fast SSD Storage      Standard SSD          Object Storage

Implementation Strategy:

  1. Index Strategy:
    • Index per tenant: logs-{tenant-id}-{date}
    • Time-based rollover for lifecycle management
    • Index templates for consistent mapping
  2. Security Implementation:
    • Document Level Security (DLS) based on tenant ID
    • Role-based access control per tenant
    • API key authentication with tenant scoping (see the sketch after this list)
  3. Performance Optimization:
    • Dedicated coordinating nodes for search routing
    • Hot-warm-cold architecture for cost optimization
    • Cross-cluster search for historical data
  4. Operational Considerations:
    • Automated backup and restore per tenant
    • Monitoring and alerting per tenant metrics
    • Capacity planning based on tenant growth
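
On the security side, tenant-scoped API keys can carry an index-restricted role descriptor (optionally with a document-level query) so an application token can only reach its own tenant's data. A hedged sketch (key name, index pattern, and field names are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder; security APIs require a secured, authenticated cluster

# API key restricted to one tenant's indices, with a document-level filter
key = es.security.create_api_key(
    name="tenant-acme-logs",
    expiration="30d",
    role_descriptors={
        "acme_logs_reader": {
            "indices": [{
                "names": ["logs-acme-*"],
                "privileges": ["read", "view_index_metadata"],
                "query": {"term": {"tenant_id": "acme"}},
            }]
        }
    },
)
print(key["id"], key["api_key"])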

Operational Questions

Q: Your Elasticsearch cluster shows "red" status with unassigned shards. Walk through your troubleshooting process.

A: Red cluster troubleshooting methodology:

1. Immediate Assessment:

# Check overall cluster health
GET _cluster/health?level=indices

# Identify unassigned shards
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason

# Check node availability
GET _cat/nodes?v

2. Common Root Causes:

Node Failure:

# Check if nodes left the cluster
GET _cluster/state/nodes

# Verify disk space and memory
GET _nodes/stats/fs,jvm

Shard Allocation Issues:

# Check allocation explain
GET _cluster/allocation/explain

# Review allocation settings
GET _cluster/settings?include_defaults=true

3. Resolution Steps:

For Missing Nodes:

  • Restart failed nodes if possible
  • If data is lost, consider allocating replicas as primaries

For Allocation Issues:

# Enable allocation if disabled
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}

# Force allocation if necessary (data loss risk)
POST _cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "my-index",
        "shard": 0,
        "node": "node-1",
        "accept_data_loss": true
      }
    }
  ]
}

Q: How do you handle capacity planning for an Elasticsearch cluster with unpredictable growth patterns?

A: Dynamic capacity planning approach:

1. Monitoring Strategy:

  • Track growth rates: daily, weekly, monthly patterns
  • Monitor resource utilization trends
  • Set up predictive alerting based on growth rate

2. Elastic Scaling Architecture:

# Hot-warm-cold with automatic tier movement
PUT _ilm/policy/dynamic_logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "allocate": {
            "number_of_replicas": 0
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "cold_repository"
          }
        }
      }
    }
  }
}

3. Automated Scaling:

  • Kubernetes HPA for coordinating nodes
  • Auto-scaling data nodes based on storage utilization
  • Scheduled scaling for predictable load patterns

4. Cost Optimization:

  • Use searchable snapshots for long-term retention
  • Implement compression and forcemerge in warm phase
  • Archive to cheaper storage tiers automatically

Design Integration

Q: How would you integrate Elasticsearch into a microservices architecture for centralized logging and monitoring?

A: Microservices logging integration design:

Architecture Overview:

Microservice A → Filebeat → Logstash → Elasticsearch
Microservice B → Fluentd  → Kafka → Logstash → Elasticsearch
Microservice C → Direct API calls → Elasticsearch ingest nodes
Elasticsearch  → Kibana / Grafana (dashboards and alerting)

Implementation Strategy:

  1. Log Collection:
    • Sidecar pattern with Filebeat for container logs
    • Structured logging with correlation IDs
    • Service mesh integration for automatic metadata
  2. Processing Pipeline:
# Logstash configuration
filter {
  if [kubernetes][container][name] == "api-service" {
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:trace_id}\] %{GREEDYDATA:msg}" }
    }
    mutate {
      add_field => { "service_type" => "api" }
    }
  }
}
  3. Index Strategy:
    • Service-specific indices: logs-{service}-{date}
    • Common mapping for correlation across services
    • Retention policies based on service criticality
  4. Observability Integration:
    • Correlation with metrics (Prometheus) and traces (Jaeger)
    • Alerting based on log patterns and anomalies
    • Dashboard creation for service-specific insights

Trade-off Analysis

Q: When would you choose Elasticsearch over other search solutions like Solr or database-based search?

A: Search technology selection criteria:

Choose Elasticsearch when:

  • Real-time analytics: Need complex aggregations and visualizations
  • Scalability: Horizontal scaling requirements
  • Developer experience: REST API and JSON-based queries
  • Ecosystem: Integration with Logstash, Kibana, Beats
  • Cloud-native: Containerized deployments and auto-scaling

Choose Solr when:

  • Advanced search features: Faceting, highlighting, spell checking
  • Document-centric: Traditional search applications
  • SQL support: Need SQL-like query interface
  • Mature deployments: Existing Solr expertise and infrastructure

Choose Database Search when:

  • Simple search: Basic text search within existing data
  • ACID requirements: Strong consistency guarantees
  • Small scale: Limited data volume and query complexity
  • Cost sensitivity: Avoiding additional infrastructure

Specific Scenarios:

  • E-commerce product search: Elasticsearch (real-time updates, faceting)
  • Academic paper search: Solr (complex text analysis, relevance tuning)
  • Internal document search: Database FTS (simple, existing infrastructure)
  • Log analytics: Elasticsearch (time-series data, real-time dashboards)

Troubleshooting Scenarios

Q: Users report that search results are missing recent documents. How do you investigate and resolve this issue?

A: Missing recent documents troubleshooting:

1. Initial Investigation:

# Check refresh interval and last refresh time
GET logs-*/_settings?filter_path=*.settings.index.refresh_interval
GET logs-*/_stats/refresh

# Verify document count trends
GET _cat/indices/logs-*?v&s=creation.date:desc

2. Common Root Causes:

Refresh Configuration:

# Check if refresh is disabled
GET logs-*/_settings/index.refresh_interval

# Manual refresh to test
POST logs-*/_refresh

Indexing Pipeline Issues:

# Check indexing rate
GET _stats/indexing

# Monitor ingest pipelines
GET _nodes/stats/ingest

Routing/Shard Issues:

# Verify document routing
GET logs-*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-5m"
      }
    }
  }
}

3. Resolution Steps:

Immediate Fix:

# Force refresh if needed
POST _refresh

# Check for stuck operations
GET _cat/pending_tasks?v

Long-term Solution:

  • Adjust refresh interval based on requirements
  • Monitor indexing pipeline health
  • Set up alerting for indexing delays
  • Implement proper error handling in applications

4. Prevention:

  • Regular monitoring of indexing metrics
  • Proper refresh interval configuration
  • Health checks for indexing pipeline
  • Documentation of troubleshooting procedures

Related Systems

  • Apache Solr
  • OpenSearch
  • Amazon CloudWatch
  • Splunk

Used By

Elastic, GitHub, Netflix, Uber, LinkedIn, Stack Overflow, Wikipedia