Core Idea
This paper speeds up local LLM inference on weak edge devices by letting multiple nearby devices share cached intermediate model states (e.g., KV cache / internal activations) for similar prompts, so each device doesn’t have to recompute everything from scratch. It also supports partial prompt matches, so even if only the beginning of a prompt matches something previously seen, the device can still reuse part of the work.
Why it matters: edge devices (like Raspberry Pi-class hardware) often have limited CPU/RAM and no GPU, making LLM latency painful. The paper shows that cooperative caching over a wireless network can dramatically cut user-perceived latency (especially Time to First Token).
Technical Details
Key innovations (in plain terms)
- Distributed prompt caching:
Instead of keeping prompt caches only on the local device, devices in the same network cooperate: if Device A already processed a similar prompt, Device B can request the cached internal state and continue generation from there.
- Partial matching for prompt similarity:
Real prompts aren't identical; they often share a prefix (e.g., the same system prompt, instruction template, or few-shot examples). The system can reuse cached state for the longest matching prefix, reducing compute even when prompts differ later.
- Catalog using Bloom filters to avoid wasted network calls:
Asking every peer "do you have cache for this prefix?" can create lots of wireless overhead. The paper introduces a catalog: a compact probabilistic index (a Bloom filter) that helps decide whether a peer likely has the needed cached state before contacting it.
- Bloom filter explained: a memory-efficient set-membership structure that can say "possibly present" or "definitely not present." It can have false positives (you might contact a peer unnecessarily) but no false negatives (you won't miss a cache that exists, assuming updates are correct).
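The catalog idea can be sketched with a minimal Bloom filter. This is an illustrative toy, not the paper's implementation: the class, the bit/hash counts, and the `prefix:...` key format are all assumptions made for the example.

```python
import hashlib

class BloomFilter:
    """Compact set-membership index: answers 'possibly present' or 'definitely not'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive several bit positions from slices of a single SHA-256 digest.
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i : 4 * i + 4], "big")
            yield chunk % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # May return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# A peer advertises which prompt-prefix signatures it has cached; others
# consult the (small, cheaply shared) filter before issuing a radio request.
catalog = BloomFilter()
catalog.add("prefix:system-prompt-v1")
catalog.might_contain("prefix:system-prompt-v1")  # True
catalog.might_contain("prefix:never-seen")        # almost certainly False
```

The tradeoff is visible directly: a hit only means "worth asking that peer," while a miss is a guaranteed skip, which is exactly the asymmetry that saves wireless round trips.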
Important architecture / flow
- Prompt arrives on an edge device.
- Device computes a signature for prompt prefixes (e.g., token-prefix identifiers).
- It consults a local catalog of peers (Bloom-filter-based) to find which peer is likely to have cached state for that prefix.
- If a candidate exists, it fetches the intermediate state (e.g., KV cache up to token k).
- The device resumes inference from token k+1 instead of recomputing tokens 1..k.
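The flow above can be sketched end to end. Everything here is an assumption for illustration: the `might_contain` catalog interface, the toy prefix signature, and the `fetch_kv`/`model_step` callbacks standing in for the real transfer and inference steps. (Probing every prefix length, as done here, is the naive approach; a real system would hash at fixed chunk boundaries.)

```python
def longest_cached_prefix(tokens, peer_catalogs):
    """Find the longest token prefix some peer likely has cached.

    peer_catalogs maps peer_id -> an object with might_contain(signature),
    e.g. a Bloom-filter catalog. Returns (prefix_length, peer_id) or (0, None).
    """
    # Check longer prefixes first so we reuse as much prior work as possible.
    for k in range(len(tokens), 0, -1):
        signature = "|".join(map(str, tokens[:k]))  # toy prefix signature
        for peer_id, catalog in peer_catalogs.items():
            if catalog.might_contain(signature):
                return k, peer_id
    return 0, None

def run_inference(tokens, peer_catalogs, fetch_kv, model_step):
    """fetch_kv(peer, prefix) retrieves cached state; model_step(kv, token)
    advances the model by one token. Both are stand-ins for real components."""
    k, peer = longest_cached_prefix(tokens, peer_catalogs)
    kv_cache = fetch_kv(peer, tokens[:k]) if peer else []
    # Resume from token k+1: only the unmatched suffix is recomputed locally.
    for token in tokens[k:]:
        kv_cache = model_step(kv_cache, token)
    return kv_cache
```

Note that a Bloom-filter false positive is harmless here: the fetch to that peer simply fails and the device falls back to computing the prefix itself.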
Novel approaches introduced
- Treating LLM internal states as shareable artifacts across devices (not just final outputs).
- Combining partial prefix reuse with network-efficient discovery (Bloom-filter catalogs) to make distributed caching practical on low-bandwidth/variable wireless links.
- Measuring improvements with user-facing latency metrics:
- TTFT (Time to First Token): how quickly the first token appears (perceived responsiveness).
- TTLT (Time to Last Token): total time to finish the full response.
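Both metrics can be measured by wrapping a streaming generator; this is a generic sketch (the `fake_stream` generator and its delays are placeholders, not the paper's benchmark harness):

```python
import time

def generate_with_metrics(token_stream):
    """Consume a token generator and record TTFT and TTLT in seconds."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        tokens.append(token)
    ttlt = time.perf_counter() - start  # full response finished
    return tokens, ttft, ttlt

def fake_stream():
    for t in ["Hello", ",", " world"]:
        time.sleep(0.01)  # stand-in for per-token compute
        yield t

tokens, ttft, ttlt = generate_with_metrics(fake_stream())
```

Prefix reuse attacks TTFT directly: prompt prefill dominates the time before the first token, so skipping tokens 1..k shrinks exactly the interval users perceive as "the model is thinking."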
How This Relates to Interviews
System design relevance
- Distributed caching design: cache keys, eviction, consistency, hit/miss behavior, and tradeoffs between compute savings vs network cost.
- Edge + wireless constraints: intermittent connectivity, limited bandwidth, higher latency, and energy constraints—all common in real-world systems.
- Probabilistic data structures: Bloom filters are a classic interview topic for scalable membership checks and reducing network chatter.
Common interview scenarios where this applies
- Designing a shared cache across multiple nodes (e.g., CDN edge nodes, microservices, mobile/IoT swarms).
- Reducing latency for expensive computation via memoization and reuse of intermediate results (not only final outputs).
- Building a discovery layer (“who has what?”) that avoids broadcast storms—similar to:
- service discovery with hints
- cache directory protocols
- distributed indexes
Key concepts to understand
- KV cache / intermediate states: in transformer inference, caching attention keys/values for previous tokens avoids recomputing attention for the entire prefix.
- Prefix-based caching: reuse is strongest when prompts share a long common prefix (system prompts, templates, few-shot examples).
- Bloom filter tradeoffs: memory vs false positives; why false positives are acceptable when the alternative is expensive probing of every peer.
- Latency metrics: TTFT vs TTLT and why TTFT often matters most for UX.
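The KV-cache concept in the first bullet can be illustrated with single-head scaled dot-product attention in plain Python. Dimensions and values are toy numbers; real inference keeps one such cache per layer and per attention head.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one new token's query
    over the cached keys/values of all previous tokens."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)                      # subtract max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]  # softmax over cached positions
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Keys/values for tokens 1..k are cached (computed locally or fetched from
# a peer); token k+1 only computes its own projections and one attention pass.
cache = {"keys": [[1.0, 0.0], [0.0, 1.0]], "values": [[2.0, 0.0], [0.0, 2.0]]}
k_new, v_new, q_new = [1.0, 1.0], [1.0, 1.0], [1.0, 0.0]
cache["keys"].append(k_new)
cache["values"].append(v_new)
out = attend(q_new, cache["keys"], cache["values"])  # tokens 1..k never recomputed
```

This is why the cached state is worth shipping over the network: without it, generating even the first new token would require re-running attention (and every layer's projections) over the entire shared prefix.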
Key Takeaways
- Sharing intermediate LLM states across nearby edge devices can massively reduce latency, especially TTFT (reported ~93% average reduction in the paper’s setup).
- Partial prompt matching is crucial because real prompts rarely match exactly; prefix reuse captures the common “prompt boilerplate.”
- Communication overhead can erase gains unless discovery is efficient—Bloom-filter catalogs reduce unnecessary wireless requests.
- Probabilistic indexing is a practical systems tool: accept occasional wasted fetch attempts (false positives) to avoid constant network probing.
- Practical applications:
- Local assistants running on multiple home/office devices that cooperate (phones, TVs, Raspberry Pis).
- Industrial/retail IoT deployments where devices share common instruction prompts and can pool compute without cloud dependence.
- Any edge inference stack where templates and repeated prefixes dominate requests (customer support scripts, form filling, guided workflows).