Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching

Local LLM inference on resource-constrained edge devices faces a severe performance bottleneck. This paper addresses it with distributed prompt caching: nearby devices cooperatively cache and share intermediate model states so that work done for one prompt can be reused for similar prompts on other devices.

Core Idea

This paper speeds up local LLM inference on weak edge devices by letting multiple nearby devices share cached intermediate model states (e.g., KV cache / internal activations) for similar prompts, so each device doesn’t have to recompute everything from scratch. It also supports partial prompt matches, so even if only the beginning of a prompt matches something previously seen, the device can still reuse part of the work.

Why it matters: edge devices (like Raspberry Pi-class hardware) often have limited CPU/RAM and no GPU, making LLM latency painful. The paper shows that cooperative caching over a wireless network can dramatically cut user-perceived latency (especially Time to First Token).

Technical Details

Key innovations (in plain terms)

  • Distributed prompt caching:
    Instead of keeping prompt caches only on the local device, devices in the same network cooperate: if Device A already processed a similar prompt, Device B can request the cached internal state and continue generation from there.
  • Partial matching for prompt similarity:
    Real prompts aren’t identical; they often share a prefix (e.g., same system prompt, same instruction template, same few-shot examples). The system can reuse cached state for the longest matching prefix, reducing compute even when prompts differ later.
  • Catalog using Bloom filters to avoid wasted network calls:
    Asking every peer “do you have cache for this prefix?” can create lots of wireless overhead. The paper introduces a catalog: a compact probabilistic index (Bloom filter) that helps decide whether a peer likely has the needed cached state before contacting it.
    • Bloom filter explained: a memory-efficient set membership structure that can say “possibly present” or “definitely not present.” It can have false positives (you might contact a peer unnecessarily) but no false negatives (you won’t miss a cache that exists, assuming updates are correct).
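To make the catalog idea concrete, here is a minimal Bloom filter sketch (not the paper's implementation; sizes, hash count, and the "prefix:" signature strings are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: answers 'possibly present' or 'definitely not present'."""

    def __init__(self, size_bits=1024, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item: str):
        # Derive num_hashes bit positions from salted SHA-256 hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False => definitely absent; True => possibly present (false positives allowed).
        return all(self.bits[pos] for pos in self._positions(item))

# A peer advertises which cached prefix signatures it holds:
catalog = BloomFilter()
catalog.add("prefix:system-prompt-v1")
print(catalog.might_contain("prefix:system-prompt-v1"))  # True
print(catalog.might_contain("prefix:unseen"))            # almost certainly False
```

The point of the structure: a device checks a peer's compact catalog locally and only spends a wireless round trip when the answer is "possibly present."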

Important architecture / flow

  1. Prompt arrives on an edge device.
  2. Device computes a signature for prompt prefixes (e.g., token-prefix identifiers).
  3. It consults a local catalog of peers (Bloom-filter-based) to find which peer is likely to have cached state for that prefix.
  4. If a candidate exists, it fetches the intermediate state (e.g., KV cache up to token k).
  5. The device resumes inference from token k+1 instead of recomputing tokens 1..k.
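The five steps above can be sketched as a toy simulation. Everything here (the whitespace "tokenizer," the peer names, the fake KV-cache dicts, and counting tokens as a stand-in for compute cost) is a hypothetical sketch, not the paper's system:

```python
def tokenize(prompt: str) -> list:
    return prompt.split()  # stand-in for a real tokenizer

# Simulated peer caches: prefix signature -> cached intermediate state.
peer_caches = {
    "peer-B": {
        ("You", "are", "a", "helpful", "assistant."): {"kv_upto": 5},
    },
}

def find_longest_cached_prefix(tokens):
    """Steps 2-3: probe from the longest prefix down to find a likely holder."""
    for k in range(len(tokens), 0, -1):
        sig = tuple(tokens[:k])
        for peer, cache in peer_caches.items():
            if sig in cache:  # a real system consults the Bloom-filter catalog first
                return peer, k
    return None, 0

def run_inference(prompt):
    """Returns how many prompt tokens this device must actually compute."""
    tokens = tokenize(prompt)
    peer, k = find_longest_cached_prefix(tokens)
    if peer:
        # Step 4: fetch the cached state (e.g., KV cache) for tokens 1..k.
        state = peer_caches[peer][tuple(tokens[:k])]
        # Step 5: resume from token k+1; only the tail is recomputed.
        return len(tokens) - k
    return len(tokens)  # cold start: recompute everything

# This prompt shares the 5-token system-prompt prefix held by peer-B:
print(run_inference("You are a helpful assistant. Summarize this email."))  # 3
print(run_inference("Totally different prompt here"))                        # 4
```

The cold-start path computes all tokens; the cache-hit path computes only the non-matching tail, which is where the latency savings come from.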

Novel approaches introduced

  • Treating LLM internal states as shareable artifacts across devices (not just final outputs).
  • Combining partial prefix reuse with network-efficient discovery (Bloom-filter catalogs) to make distributed caching practical on low-bandwidth/variable wireless links.
  • Measuring improvements with user-facing latency metrics:
    • TTFT (Time to First Token): how quickly the first token appears (perceived responsiveness).
    • TTLT (Time to Last Token): total time to finish the full response.
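Both metrics can be measured from a token stream with a simple timer; the generator below is an illustrative stand-in for model decoding (the 0.05 s "prefill" delay and token counts are made up):

```python
import time

def measure_latency(token_stream):
    """Compute TTFT and TTLT (in seconds) for an iterable that yields tokens."""
    start = time.monotonic()
    ttft = None
    for _token in token_stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
    ttlt = time.monotonic() - start          # stream exhausted: last token
    return ttft, ttlt

def fake_generator():
    time.sleep(0.05)   # simulated prefill before the first token
    yield "Hello"
    for _ in range(3):
        time.sleep(0.01)  # simulated per-token decode
        yield "world"

ttft, ttlt = measure_latency(fake_generator())
```

TTFT is dominated by prefill, which is exactly the work that cached prefixes eliminate; that is why the paper's largest gains show up in TTFT.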

How This Relates to Interviews

System design relevance

  • Distributed caching design: cache keys, eviction, consistency, hit/miss behavior, and tradeoffs between compute savings vs network cost.
  • Edge + wireless constraints: intermittent connectivity, limited bandwidth, higher latency, and energy constraints—all common in real-world systems.
  • Probabilistic data structures: Bloom filters are a classic interview topic for scalable membership checks and reducing network chatter.

Common interview scenarios where this applies

  • Designing a shared cache across multiple nodes (e.g., CDN edge nodes, microservices, mobile/IoT swarms).
  • Reducing latency for expensive computation via memoization and reuse of intermediate results (not only final outputs).
  • Building a discovery layer (“who has what?”) that avoids broadcast storms—similar to:
    • service discovery with hints
    • cache directory protocols
    • distributed indexes

Key concepts to understand

  • KV cache / intermediate states: in transformer inference, caching attention keys/values for previous tokens avoids recomputing attention for the entire prefix.
  • Prefix-based caching: reuse is strongest when prompts share a long common prefix (system prompts, templates, few-shot examples).
  • Bloom filter tradeoffs: memory vs false positives; why false positives are acceptable when the alternative is expensive probing of every peer.
  • Latency metrics: TTFT vs TTLT and why TTFT often matters most for UX.
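Prefix-based reuse hinges on one quantity: the number of leading tokens two prompts share, which upper-bounds how much cached KV state transfers over. A small sketch (whitespace tokenization and the example prompts are illustrative):

```python
def shared_prefix_len(a_tokens, b_tokens):
    """Number of leading tokens two prompts share; bounds KV-cache reuse."""
    k = 0
    for x, y in zip(a_tokens, b_tokens):
        if x != y:
            break
        k += 1
    return k

# Two prompts built from the same instruction template:
p1 = "System: answer concisely. Q: What is a Bloom filter?".split()
p2 = "System: answer concisely. Q: Explain KV caching.".split()
print(shared_prefix_len(p1, p2))  # 4 -> KV entries for 4 leading tokens are reusable
```

In practice the shared prefix is often the system prompt plus few-shot examples, i.e., most of the prompt, which is why partial matching pays off.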

Key Takeaways

  • Sharing intermediate LLM states across nearby edge devices can massively reduce latency, especially TTFT (reported ~93% average reduction in the paper’s setup).
  • Partial prompt matching is crucial because real prompts rarely match exactly; prefix reuse captures the common “prompt boilerplate.”
  • Communication overhead can erase gains unless discovery is efficient—Bloom-filter catalogs reduce unnecessary wireless requests.
  • Probabilistic indexing is a practical systems tool: accept occasional wasted fetch attempts (false positives) to avoid constant network probing.
  • Practical applications:
    • Local assistants running on multiple home/office devices that cooperate (phones, TVs, Raspberry Pis).
    • Industrial/retail IoT deployments where devices share common instruction prompts and can pool compute without cloud dependence.
    • Any edge inference stack where templates and repeated prefixes dominate requests (customer support scripts, form filling, guided workflows).