IC-Cache: Efficient Large Language Model Serving via In-context Caching
An in-context caching system: leverage caching so that a small LLM can serve like a giant one.
Venue: SOSP 2025
Link: ACM DL
Topic:
- end-to-end ML <-> system co-designs.
- caching
- offloading
Summary
In LLM serving, caching helps a lot: use it to let small LLMs perform like much more capable models.
Leverage historical request-response pairs as in-context examples to enable live LLM capability augmentation at scale.
-> empowers smaller LLMs to produce higher-quality responses by imitating the behavior of larger models, thereby adaptively offloading traffic from them.
result: system that significantly reduces serving costs and latency.
Background
We already use caching when serving LLMs, but we are not doing it correctly.
- Situation:
- serving LLMs at scale is challenging due to their substantial resource demands and high latency.
- Service providers rely on complex/overprovisioned infrastructure to maintain responsiveness.
- Observation: over 70% of user requests to LLMs have semantically similar counterparts, suggesting the potential for knowledge transfer among requests, i.e., they are cacheable.
- Problem: naively caching & reusing responses doesn’t work well.
- major providers already support context caching mechanisms to reduce deployment costs, but:
- exact match (e.g., KV caches): yields low hit rates
- similarity-based match: degrades quality (risks unrelated replies)
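To make the two failure modes concrete, here is a toy sketch of both lookup styles. Everything in it is an assumption for illustration: the bag-of-words `embed`, the `0.5` threshold, and the one-entry cache are stand-ins, not anything from the paper or from a real provider's cache.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption; a real system would use a sentence encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical response cache with a single entry.
cache = {"how do i reverse a list in python": "Use my_list[::-1] or my_list.reverse()."}

def exact_lookup(request: str):
    """Exact match: any rewording of the request is a miss."""
    return cache.get(request.lower())

def similar_lookup(request: str, threshold: float = 0.5):
    """Similarity match: returns the cached response of the closest request above the threshold."""
    q = embed(request)
    best_key = max(cache, key=lambda k: cosine(q, embed(k)), default=None)
    if best_key is not None and cosine(q, embed(best_key)) >= threshold:
        return cache[best_key]
    return None
```

With this cache, `exact_lookup("how to reverse a list in python")` misses even though the intent is the same, while `similar_lookup("how do i sort a list in python")` hits and returns the answer about reversing, which is the unrelated-reply risk noted above.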
Goal
We need to cache to utilize similarity, but in a smart way.
Key Idea
Let small LLMs imitate & exceed the problem-solving abilities of their larger counterparts.
- by leveraging historical request-response pairs from larger models as in-context examples
- by enabling selective offloading of requests to small LLMs, to reduce cost and latency.
Goals
response quality <-> latency <-> system throughput
- must efficiently identify high-utility examples.
- Request offloading to smaller LLMs must ensure response quality: optimize efficiency <-> quality tradeoff
- aggressively offloading requests -> degrade response quality.
- should consider: request complexity, example utility, and current serving load.
- managing the example cache under data & load dynamics: requires optimizing example quality & efficiency at scale.
Solutions
- supply relevant examples with the request: efficiently select similar, high-utility examples and prepend them to the new request’s input.
- route requests across LLMs with various specs: find the right one to balance response quality and serving load.
- offline cost-aware cache replay mechanism: refines example quality offline to maximize online cache utility and efficiency.
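A minimal sketch of the first point, putting selected examples in front of the new request. The `Example` type, the function name, and the "Request:/Response:" template are my own assumptions; the paper's actual prompt format is not recorded in these notes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    request: str   # a historical request
    response: str  # the response it received (e.g., from a larger model)


def build_augmented_prompt(new_request: str, examples: List[Example]) -> str:
    """Prepend cached request-response pairs as in-context examples (hypothetical template)."""
    parts = []
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i}\nRequest: {ex.request}\nResponse: {ex.response}")
    # The new request goes last so the model continues from its open "Response:" slot.
    parts.append(f"Request: {new_request}\nResponse:")
    return "\n\n".join(parts)
```

For instance, `build_augmented_prompt("Sort a list in Python", [Example("Reverse a list in Python", "Use my_list[::-1].")])` yields one worked example followed by the open request.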
Design
IC-Cache serves as a complementary layer between existing serving systems (e.g., vLLM) and LLM applications.
- selecting helpful examples at scale
- adaptively offloading requests among LLMs
- managing examples to improve efficiency & utility.
Example Selector
: selects high-utility request-response pairs from history as examples to augment LLMs
- 2-stage example selection mechanism that balances selection efficiency and quality.
- pre-selects a small subset of examples with high relevance to ensure scalability (ex. cosine similarity).
- but this alone can’t secure response quality.
- cluster cached examples offline into K groups.
- use a lightweight proxy model to estimate their end-to-end utility.
- estimate the example’s end-to-end helpfulness.
- use user feedback.
- updated asynchronously.
- little overhead(<1%)
- the number of examples should be adapted per query: the Example Selector uses a dynamic utility threshold to filter out low-impact examples.
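The two stages above can be sketched roughly as follows. This is a simplification under assumptions: `proxy_utility` stands in for the lightweight proxy model, the threshold is passed in as a fixed number rather than adapted dynamically, and the offline clustering into K groups is left out. All names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CachedExample:
    request: str
    response: str
    embedding: Sequence[float]  # assumed to be precomputed when the example is cached


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_examples(
    query: str,
    query_embedding: Sequence[float],
    cache: List[CachedExample],
    proxy_utility: Callable[[str, CachedExample], float],
    preselect_k: int = 8,
    utility_threshold: float = 0.5,
) -> List[CachedExample]:
    # Stage 1: cheap relevance filter, keep the k most similar cached examples.
    by_relevance = sorted(
        cache, key=lambda ex: cosine(query_embedding, ex.embedding), reverse=True
    )
    candidates = by_relevance[:preselect_k]
    # Stage 2: score only the survivors with the (hypothetical) proxy model and
    # drop low-impact ones, so the number of examples varies per query.
    scored = [(proxy_utility(query, ex), ex) for ex in candidates]
    kept = [pair for pair in scored if pair[0] >= utility_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in kept]
```

The point of the split is that the expensive estimate runs on at most `preselect_k` candidates, no matter how large the cache grows.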
Request Router: MAB-based request router
: routes new requests to the most suitable LLM, such as small models augmented with examples, or larger models, based on request complexity and the current serving load.
- A lightweight, bandit-based request router that jointly considers the request and selected examples to route requests to LLMs of varying capabilities.
- modeling the routing decision
- as a contextual multi-armed bandit (MAB) problem, a lightweight and data-efficient approach often used in online recommendation systems.
- aiming to maximize cumulative reward, such as response quality.
- tracks the Exponential Moving Average (EMA) of the system serving load over time.
- EMA below the desired operational threshold: prioritize response quality.
- EMA above the threshold: the router triggers a feedback controller that computes a corrective bias (using the hyperbolic tangent, tanh, function), which reduces the selection scores of high-cost models and favors more efficient, lower-cost alternatives to relieve system pressure.
- this lightweight control mechanism adjusts routing preferences without modifying or retraining the underlying Request Router, effectively decoupling overload management from the core routing logic.
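A rough sketch of the load-aware part, assuming the bandit has already produced a score per model. The class name, the `alpha`/`gain`/`max_bias` parameters, the cost-proportional penalty, and the load unit are all my assumptions; the notes only record that an EMA of load is tracked and that a tanh-based bias is applied once it exceeds the threshold.

```python
import math
from typing import Dict, Optional


class LoadAwareBias:
    """Tracks an EMA of serving load and turns overload into a bounded penalty."""

    def __init__(self, load_threshold: float, alpha: float = 0.5,
                 gain: float = 1.0, max_bias: float = 1.0) -> None:
        self.load_threshold = load_threshold
        self.alpha = alpha        # EMA smoothing factor (assumed value)
        self.gain = gain          # how sharply the bias reacts to excess load
        self.max_bias = max_bias  # tanh saturates, so the bias never exceeds this
        self.ema: Optional[float] = None

    def observe(self, load: float) -> None:
        if self.ema is None:
            self.ema = load
        else:
            self.ema = self.alpha * load + (1 - self.alpha) * self.ema

    def bias(self) -> float:
        if self.ema is None or self.ema <= self.load_threshold:
            return 0.0  # under the threshold: leave quality-driven scores untouched
        return self.max_bias * math.tanh(self.gain * (self.ema - self.load_threshold))


def route(scores: Dict[str, float], costs: Dict[str, float], ctrl: LoadAwareBias) -> str:
    """Pick the model with the best score after penalizing costly models under load.

    `scores` stands in for the bandit's output; `costs` are assumed normalized to [0, 1].
    """
    b = ctrl.bias()
    adjusted = {m: s - b * costs[m] for m, s in scores.items()}
    return max(adjusted, key=adjusted.get)
```

The bandit scores are never modified in place, which mirrors the decoupling noted above: the controller only shifts the scores at decision time.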
Example Manager: cache management
: manages the caching of requests over time (e.g., eviction) and opportunistically improves example quality (e.g., by asynchronously replaying requests and storing the best response).
- selectively retains & evicts examples to maintain a bounded cache size / while respecting privacy.
- it ranks examples by their potential gain, G(e), and stops replaying examples whose potential gain falls below a cut-off.
- cut-off is determined online
- an online cache management policy that evicts low-utility examples.
- IC-Cache removes personally identifiable information using the widely adopted tool spaCy
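A small sketch of the two management decisions, under assumptions: `potential_gain` is a stand-in for G(e) (its definition is not recorded in these notes), `utility` is a hypothetical running estimate, and the cut-off is passed in rather than determined online. PII removal is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    request: str
    response: str
    utility: float         # hypothetical estimate of how helpful this example is
    potential_gain: float  # stand-in for G(e), the expected benefit of replaying it


def replay_candidates(entries: List[Entry], cutoff: float) -> List[Entry]:
    """Rank by potential gain and stop at the first entry below the cut-off."""
    ranked = sorted(entries, key=lambda e: e.potential_gain, reverse=True)
    selected = []
    for e in ranked:
        if e.potential_gain < cutoff:
            break  # everything after this has even lower gain
        selected.append(e)
    return selected


def evict_to_capacity(entries: List[Entry], capacity: int) -> List[Entry]:
    """Keep the cache bounded by dropping the lowest-utility entries."""
    return sorted(entries, key=lambda e: e.utility, reverse=True)[:capacity]
```

Replay spends offline compute only where the expected improvement is worth it, while eviction keeps the online cache small.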
Serving Workflow
- The Example Retriever retrieves the most helpful request-response pairs (e.g., based on relevance and quality) from the cache to serve as in-context examples.
- The new request, now augmented with these examples, is passed to the Request Router, which determines the appropriate LLM to handle the request.
- The selected model processes the request and generates a response, following the usual LLM generation process. The response is then delivered to the user.
- Finally, the Example Manager may add the request-response pair to the cache, depending on application-specific requirements (e.g., removing sensitive information), improve example quality via cost-aware replay, and evict stale or low-quality entries over time.
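The four steps above can be wired together roughly like this. It is only a skeleton: `select`, `route`, `models`, and `admit` are injected placeholders for the components described earlier, and the prompt template is an assumption.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (request, response)


def serve(
    request: str,
    cache: List[Pair],
    select: Callable[[str, List[Pair]], List[Pair]],
    route: Callable[[str, List[Pair]], str],
    models: Dict[str, Callable[[str], str]],
    admit: Callable[[str, str], bool],
) -> str:
    # 1. Retrieve helpful request-response pairs from the cache.
    examples = select(request, cache)
    prompt = "".join(f"Request: {q}\nResponse: {a}\n\n" for q, a in examples)
    prompt += f"Request: {request}\nResponse:"
    # 2. The router sees the request together with its examples.
    model_name = route(request, examples)
    # 3. The chosen model generates the response as usual.
    response = models[model_name](prompt)
    # 4. The manager decides whether the new pair enters the cache
    #    (e.g., after application-specific filtering).
    if admit(request, response):
        cache.append((request, response))
    return response
```

Replay and eviction are left out because they run asynchronously, off the request path.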
Implementation
3 popular LLM serving frameworks: HuggingFace Runtime, vLLM, LangChain.
Evaluation
on millions of realistic requests
- improves LLM serving throughput by 1.4–5.9x
- reduces latency by 28–71% without hurting response quality.
Related Works
- prefix caching
- Semantic caching
- Retrieval-Augmented Generation systems
- improves the reliability of LLM outputs by integrating knowledge retrieved from external sources.
- cons: RAG relies on long external sources and is vulnerable to out-of-domain or low-quality documents.
- lacking the compositional reasoning captured in LLM responses.
- IC-Cache complements RAG by generating cached queries with RAG to incorporate external knowledge.
Contributions
Makes caching usable for LLM serving with some heuristic thresholds. Offloads work to small LLMs -> utilizes current resources better.
Inputs
- Retrieval-Augmented Generation systems
- improves the reliability of LLM outputs by integrating knowledge retrieved from external, static sources.
- In-context learning (ICL):
LLMs can learn from high-quality examples, enabling on-the-fly knowledge transfer and skill imitation.
- Ceil: trains an example selector to pick examples from external documents.
- Approach to solve latency-throughput problem
- optimizing resource efficiency: resource allocation.
- work on user-perceived latency: better request scheduling.
- Maximizing end-to-end efficiency hinges on
- improving the utility of individual examples to enable more effective per-request offloading
- maintaining a high-utility example pool for offloading various requests under capacity & privacy constraints.
- Example replay: same request is queried multiple times, and the best response is selected for reuse.
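Example replay as described in the last bullet amounts to a best-of-n selection. A minimal sketch, where `generate` and `score` are hypothetical callables (a sampling LLM call and a quality estimator) and `n=3` is an arbitrary default:

```python
from typing import Callable


def replay_best(
    request: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 3,
) -> str:
    """Query the same request n times and keep the highest-scoring response."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(request) for _ in range(n)]
    return max(candidates, key=lambda resp: score(request, resp))
```

The returned response would then replace the cached one for that request, so later requests imitate the best answer seen rather than the first.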
Questions
- what about the overhead of prepending examples for the small LLM? doesn’t it harm performance?
- prepending examples slightly increases prefilling time due to the longer input (decoding time remains largely unaffected), but the total is still far lower than that of large models.
Meeting Notes
- caching: even similar (not identical) requests can improve performance
- Designing Systems
- Goal abstraction
- interface
- performance
- Problem
- Design goal
- Trade off
- Workloads
- Goal abstraction
- Principles
- HoL -> preemption
- Concurrency Control -> transaction
- Too many workloads -> make system adaptive/customize
- Central entity for communications.
- commutativity -> scalability
Thoughts.
So caching similar data (not exactly the same data) works.
The weak point was “how” to implement it.
Therefore, figuring out the exact bottlenecks is important.
For example, in this case, the bottleneck was not caching similar data itself but how to collect that information and share it among multiple LLMs efficiently.
And for efficiency itself, a heuristic threshold does work well. Caladan also used a similar one.
This work also uses a centralized Router for control plus a heuristic threshold to manage control signals.
This is similar to Caladan, which uses a centralized IOkernel for control signals plus a heuristic threshold to manage them.
To add control for a new scenario, having a leader component and giving it the rules to control with might be a common approach.
Anyway, the biggest gain from this work for me is learning that caching similar data also works.
By the way, I’m not sure this would work outside of AI model semantics.
Intrinsically, AI doesn’t have a clear boundary on the data it uses, since it’s continuously learning and evolving with the provided inputs.
But a traditional system is different.
“Caching similar data” doesn’t make sense there. Being conservative and caching less-necessary data might be possible, but in a traditional system, cache performance is decided only by hits and misses. There is no function to “learn” or “evolve” from the provided data.
I think caching similar data works only in the LLM context.
But their idea of “utilizing current resources well” is still interesting.
If we use caching + offloading, maybe we can offload some GPU work to the CPU?
Hmm, but isn’t the GPU usually used for “strong computation power” such as parallelism, not just for the speed that a cache can help with?
Can we apply this to memory? It’s already doing caching well with its tiered architecture.