May 6, 2026

IC-Cache: Efficient Large Language Model Serving via In-context Caching

In-Context Caching system: leverage caching to let a small LLM serve like a giant.

Venue: SOSP 2025
Link: ACM DL

Topic:

end-to-end ML <-> system co-designs.
caching
offloading

Summary

In LLM serving, caching helps a lot. Use caching to let small LLMs work like a much higher-performance model.

Leverage historical request-response pairs as in-context examples to enable live LLM capability augmentation at scale.

-> empowers smaller LLMs to produce higher-quality responses by imitating the behavior of larger models, thereby adaptively offloading traffic from them.

result: system that significantly reduces serving costs and latency.

Background

We use caching in serving LLMs, but we are not doing it correctly.

Situation:
1. Serving LLMs at scale is challenging, due to their substantial resource demands and high latency.
2. Service providers rely on complex/overprovisioned infrastructure to maintain responsiveness.
Observation: over 70% of user requests to LLMs have semantically similar counterparts, suggesting the potential for knowledge transfer among requests (= cacheable).
Problem: naively caching & reusing responses doesn’t work well.

Major providers already support context caching mechanisms to reduce deployment costs. But,
- exact match: yields low hit rates - KV caches
- similarity-based match: quality degradation (risk unrelated replies)
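
A toy sketch of the two failure modes above. The bag-of-words `embed` function, the single cached entry, and the 0.5 threshold are all assumptions for illustration (a real system would use a sentence encoder); none of this is from the paper.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical cache: request text -> stored response.
cache = {"how do i reverse a list in python": "Use my_list[::-1] or my_list.reverse()."}

def exact_lookup(query):
    """Exact match: only hits when the text is identical."""
    return cache.get(query.lower())

def similar_lookup(query, threshold=0.5):
    """Similarity match: reuse the response of the closest cached request."""
    q = embed(query)
    best_key = max(cache, key=lambda k: cosine(q, embed(k)))
    if cosine(q, embed(best_key)) >= threshold:
        return cache[best_key]
    return None

# A rephrased request misses the exact cache (low hit rate) ...
print(exact_lookup("how can i reverse a list in python"))
# ... while a different question hits the similarity cache and gets the
# list answer back verbatim (the quality-degradation risk).
print(similar_lookup("how do i reverse a string in python"))
```

Reusing the cached response verbatim is what breaks here; the same pair could still be useful as an in-context example.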

Goal

We need to cache to utilize similarity, but in a smart way.

Key Idea

Let small LLMs imitate & exceed the problem-solving abilities of their larger counterparts.

by leveraging historical request-response pairs from larger models as in-context examples
by enabling selective offloading of requests to small LLMs, to reduce cost and latency.

Goals

response quality <-> latency <-> system throughput

must efficiently identify high-utility examples.
Request offloading to smaller LLMs must ensure response quality: optimize efficiency <-> quality tradeoff
- aggressively offloading requests -> degrades response quality.
- should consider: request complexity, example utility, and current serving load.
managing the example cache under data & load dynamics: requires optimizing example quality & efficiency at scale.

Solutions

give relevant examples with the request: efficiently select similar, high-utility examples and prepend them to the new request’s input.
route requests across LLMs with various specs: find the right one to balance response quality & serving load.
offline cost-aware cache replay mechanism: refines example quality offline to maximize online cache utility and efficiency.
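
A minimal sketch of the first solution, prepending selected request-response pairs to the new request. The prompt layout ("Example i / Request / Response") and the function name `build_augmented_prompt` are my own assumptions; the notes do not record the paper's actual template.

```python
def build_augmented_prompt(examples, request):
    """Prepend (request, response) pairs as in-context examples.

    `examples` is a list of (request_text, response_text) tuples,
    assumed to be already selected and ordered by usefulness.
    """
    parts = []
    for i, (req, resp) in enumerate(examples, start=1):
        parts.append(f"Example {i}\nRequest: {req}\nResponse: {resp}")
    # The new request goes last so the model continues from "Response:".
    parts.append(f"Request: {request}\nResponse:")
    return "\n\n".join(parts)

prompt = build_augmented_prompt(
    [("Summarize: cats sleep a lot.", "Cats sleep often.")],
    "Summarize: dogs bark at strangers.",
)
print(prompt)
```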

Design

IC-Cache serves as a complementary layer: between existing serving systems (ex. vLLM) <-> LLM applications.

selecting helpful examples at scale
adaptively offloading requests among LLMs
managing examples to improve efficiency & utility.

Example Selector

: selects high-utility request-response pairs from history as examples to augment LLMs

2-stage example selection mechanism to balance selection efficiency and quality.
1. pre-selects a small subset of examples with high relevance to ensure scalability (ex. cosine similarity).
  - but this alone can’t secure response quality.
  - cluster cached examples offline into K groups.
2. use a lightweight proxy model to estimate their end-to-end utility.
  - estimate the example’s end-to-end helpfulness.
  - use user feedback.
  - updated asynchronously.
  - little overhead (<1%)
The number of examples should be adapted per query: the Example Selector uses a dynamic utility threshold to filter out low-impact examples.
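
A sketch of the two-stage selection under stated assumptions: stage 1 is a plain top-k cosine search (the offline clustering into K groups is omitted), and `proxy_utility` is a hypothetical stand-in for the lightweight proxy model that simply blends relevance with a stored feedback score. The threshold is a fixed parameter here, whereas the notes say it is dynamic.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def preselect(query_vec, cache, k):
    """Stage 1: cheap relevance filter, keep the k most similar examples."""
    ranked = sorted(cache, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]

def proxy_utility(query_vec, ex):
    """Stage 2 stand-in (assumption): relevance blended with user feedback."""
    return 0.5 * cosine(query_vec, ex["vec"]) + 0.5 * ex["feedback"]

def select_examples(query_vec, cache, k=2, utility_threshold=0.6):
    """Return preselected examples whose estimated utility passes the threshold."""
    scored = [(proxy_utility(query_vec, ex), ex) for ex in preselect(query_vec, cache, k)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for utility, ex in scored if utility >= utility_threshold]

cache = [
    {"id": "a", "vec": [1.0, 0.0], "feedback": 0.9},  # relevant and well rated
    {"id": "b", "vec": [0.9, 0.1], "feedback": 0.1},  # relevant but poorly rated
    {"id": "c", "vec": [0.0, 1.0], "feedback": 1.0},  # well rated but unrelated
]
print([ex["id"] for ex in select_examples([1.0, 0.0], cache)])
```

Example "b" passes stage 1 on similarity alone but is dropped in stage 2, which is the point of "this alone can't secure response quality". Because of the threshold, the number of returned examples varies per query.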

Request Router: MAB-based request router

: routes new requests to the most suitable LLM, such as small models augmented with examples, or larger models, based on request complexity and the current serving load.

A lightweight, bandit-based request router that jointly considers the request and selected examples to route requests to LLMs of varying capabilities.
modeling the routing decision
- as a contextual multi-armed bandit (MAB) problem, a lightweight and data-efficient approach often used in online recommendation systems.
- aiming to maximize cumulative rewards such as maximizing response quality.
tracks the Exponential Moving Average (EMA) of the system serving load over time.
- EMA below the desired operational threshold: prioritize response quality.
- EMA exceeds the threshold: the router triggers a feedback controller that computes a corrective bias (using the hyperbolic tangent (tanh) function), reducing the selection scores of high-cost models and favoring more efficient, lower-cost alternatives to relieve system pressure.
lightweight control mechanism adjusts routing preferences without modifying or retraining the underlying Request Router, effectively decoupling overload management from the core routing logic.
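
A sketch of the load-aware bias described above, kept separate from the router scores as the notes describe. The exact formula is not recorded in these notes, so the tanh of the normalized overload, the `gain` and `alpha` parameters, and the cost-weighted subtraction are all assumptions. The bandit itself is out of scope here; `scores` stands for whatever it outputs.

```python
import math

class LoadAwareBias:
    """Tracks an EMA of serving load and derives a corrective bias in [0, 1)."""

    def __init__(self, threshold, alpha=0.5, gain=1.0):
        self.threshold = threshold  # desired operational load
        self.alpha = alpha          # EMA smoothing factor
        self.gain = gain            # how sharply the bias reacts (assumption)
        self.ema = 0.0

    def observe(self, load):
        self.ema = self.alpha * load + (1 - self.alpha) * self.ema
        return self.ema

    def bias(self):
        overload = self.ema - self.threshold
        if overload <= 0:
            return 0.0  # under the threshold: leave router scores untouched
        return math.tanh(self.gain * overload / self.threshold)

def route(scores, costs, controller):
    """Pick the model with the best score after penalizing costly models.

    scores: model -> router score (e.g. from the bandit)
    costs:  model -> relative serving cost in [0, 1]
    """
    b = controller.bias()
    adjusted = {m: s - b * costs[m] for m, s in scores.items()}
    return max(adjusted, key=adjusted.get)

scores = {"large": 0.9, "small": 0.8}
costs = {"large": 1.0, "small": 0.2}
ctl = LoadAwareBias(threshold=0.7)
ctl.observe(0.4)
print(route(scores, costs, ctl))   # light load: quality wins
for _ in range(3):
    ctl.observe(2.0)
print(route(scores, costs, ctl))   # sustained overload: cheaper model wins
```

Since tanh saturates, the penalty stays bounded even under extreme load, and the router's own scores never need retraining.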

Example Manager: for Management

: manages the caching of requests over time (e.g., eviction) and opportunistically improves example quality (e.g., by asynchronously replaying requests and storing the best response).

selectively retains & evicts examples to maintain a bounded cache size while respecting privacy.
it ranks examples by their potential gain, G(e), and stops replaying examples whose potential gain falls below a cut-off.
cut-off is determined online
an online cache management policy that evicts low-utility examples.
IC-Cache removes personally identifiable information using the widely adopted tool spaCy
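
A sketch of the two management decisions above: which examples to replay and which to evict. The `gain` and `utility` fields are hypothetical precomputed numbers, and the cut-off is passed in as a parameter even though the notes say it is determined online. PII scrubbing is left out.

```python
def plan_replay(examples, cutoff):
    """Return ids to replay, highest potential gain G(e) first.

    Examples whose gain is below `cutoff` are not replayed.
    """
    ranked = sorted(examples, key=lambda e: e["gain"], reverse=True)
    return [e["id"] for e in ranked if e["gain"] >= cutoff]

def evict(examples, capacity):
    """Keep at most `capacity` examples, dropping the lowest-utility ones."""
    ranked = sorted(examples, key=lambda e: e["utility"], reverse=True)
    return ranked[:capacity]

examples = [
    {"id": "a", "gain": 0.9, "utility": 0.3},
    {"id": "b", "gain": 0.2, "utility": 0.8},
    {"id": "c", "gain": 0.5, "utility": 0.6},
]
print(plan_replay(examples, cutoff=0.4))
print([e["id"] for e in evict(examples, capacity=2)])
```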

Serving Workflow

The Example Retriever retrieves the most helpful request-response pairs (e.g., based on relevance and quality) from the cache to serve as in-context examples.
The new request, now augmented with these examples, is passed to the Request Router, which determines the appropriate LLM to handle the request.
The selected model processes the request and generates a response, following the usual LLM generation process. The response is then delivered to the user.
Finally, the Example Manager may add the request-response pair to the cache, depending on application-specific requirements (e.g., removing sensitive information), improve example quality via cost-aware replay, and evict stale or low-quality entries over time.
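
The four workflow steps can be sketched as one function that takes each component as a callable. All names (`serve`, `select`, `route`, `admit`) and the "Q:/A:" prompt layout are illustrative assumptions; replay and eviction are omitted.

```python
def serve(request, cache, select, route, models, admit):
    """One pass through the workflow; returns (model_name, response)."""
    examples = select(request, cache)        # 1. retrieve in-context examples
    model_name = route(request, examples)    # 2. pick an LLM for this request
    prompt = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples) + f"Q: {request}\nA:"
    response = models[model_name](prompt)    # 3. generate as usual
    if admit(request, response):             # 4. optionally cache the new pair
        cache.append((request, response))
    return model_name, response

# Stub components, just to show the data flow.
cache = [("hi", "hello")]
models = {
    "small": lambda prompt: "small-model answer",
    "large": lambda prompt: "large-model answer",
}
name, resp = serve(
    "hey",
    cache,
    select=lambda req, c: list(c),
    route=lambda req, ex: "small" if ex else "large",
    models=models,
    admit=lambda req, resp: True,
)
print(name, resp, len(cache))
```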

Implementation

3 popular LLM serving frameworks: HuggingFace Runtime, vLLM, LangChain.

Evaluation

on millions of realistic requests

improves LLM serving throughput by 1.4–5.9x
reduces latency by 28–71% without hurting response quality.

prefix caching
Semantic caching
Retrieval-Augmented Generation systems

improves the reliability of LLM outputs by integrating knowledge retrieved from external sources.
- cons: RAG relies on long external sources and is vulnerable to out-of-domain or low-quality documents.
- lacking the compositional reasoning captured in LLM responses.
- IC-Cache complements RAG by generating cached queries with RAG to incorporate external knowledge.

Contributions

Makes caching for LLM serving usable with some heuristic thresholds. Offloads work to small LLMs -> uses current resources better.

Inputs

Retrieval-Augmented Generation systems

improves the reliability of LLM outputs by integrating knowledge retrieved from external, static sources.
In-context learning (ICL): LLMs can learn from high-quality examples, enabling on-the-fly knowledge transfer and skill imitation.
- Ceil: trains an example selector to pick examples from external documents.
Approach to solve latency-throughput problem
1. optimizing resource efficiency: resource allocation.
2. work on user-perceived latency: better request scheduling.
Maximizing end-to-end efficiency hinges on
1. improving the utility of individual examples to enable more effective per-request offloading
2. maintaining a high-utility example pool for offloading various requests under capacity & privacy constraints.
Example replay: the same request is queried multiple times, and the best response is selected for reuse.
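
Example replay amounts to best-of-n sampling. A minimal sketch, where `generate` and `score` are hypothetical callables (a model call and a quality judge) and `n=3` is an arbitrary choice:

```python
def replay_best(request, generate, score, n=3):
    """Query the same request n times and keep the highest-scoring response."""
    candidates = [generate(request, attempt) for attempt in range(n)]
    return max(candidates, key=lambda response: score(request, response))

# Stubs: each attempt returns a longer answer, and longer is scored as better.
best = replay_best(
    "explain caching",
    generate=lambda req, attempt: req + "!" * attempt,
    score=lambda req, resp: len(resp),
)
print(best)
```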

Questions

What about the overhead of prepending examples to the small LLM’s input? Doesn’t it harm performance?

Prepending examples slightly increases prefill time because the input is longer, while decoding time remains largely unaffected; the total is still far lower than that of large models.

Meeting Notes

caching: even similar (not identical) requests can improve performance
Designing Systems
1. Goal abstraction
  - interface
  - performance
2. Problem
  - Design goal
  - Trade off
  - Workloads
Principles
1. HoL -> preemption
2. Concurrency Control -> transaction
3. Too many workloads -> make system adaptive/customize
4. Central entity for communications.
5. commutativity -> scalability

Thoughts.

So caching similar data (not exactly the same data) works.
The weak point was “how” to implement that.
therefore, figuring out the exact bottlenecks is important.

For example, in this case, bottlenecks were not just caching similar data itself but how to collect that info and share among multiple LLMs efficiently.

And for efficiency itself, a heuristic threshold does work well. Caladan also used a similar one.

This work also uses a centralized Router for control + heuristic thresholds to manage control signals. This is similar to Caladan, which uses a centralized IOkernel for control + heuristic thresholds to manage control signals.

To add control in a new scenario, having a leader component and giving it the rules to control might be a common way.

Anyway, the biggest gain from this work for me is that I learned caching similar data also works.

By the way, I’m not sure this would work outside of AI model semantics.

Intrinsically, AI doesn’t have a clear borderline in the data it uses, since it’s continuously learning and evolving with provided inputs.

But a traditional system is different.

“Caching similar data” doesn’t make sense there. Being conservative and caching less necessary data might be possible, but in a traditional system, cache performance is decided only by hit/miss. There is no function to “learn” or “evolve” from provided data.

I think caching similar data works only in the LLM context.

But their idea of “utilizing current resources well” is still interesting.

If we use caching + offloading, maybe we can offload some GPU work to the CPU?

Hmm, but isn’t the GPU usually used for “strong computation power” like parallelism, not just the kind of speed that a cache can improve?

Can we apply this to memory?: it’s already doing caching well with its tiered architecture.