Jenga: Effective Memory Management for Serving LLM with Heterogeneity
a memory allocation framework for managing heterogeneous embeddings by leveraging layer properties
Venue: SOSP 2025
Link: doi[https://dl.acm.org/doi/pdf/10.1145/3731569.3764823]
Topic: LLMs, memory management, leverage condition(trend) changed (homo -> hetero), add layer
Summary
No memory allocation framework for manage heterogenous embeddings(rising in modern LLMs).
Jenga: memory allocation framework for managing heterogeneous embeddings by leveraging layer properties.
reduces memory fragmentation enables customizable caching policies for different types of embeddings.
higher GPU memory utilization and increased serving throughput across a variety of models and scenarios.
Background
Problem: GPU memory capacity is primary bottleneck to high throughput
- to improve serving throughput: batching multiple requests in parallel.
- GPU memory needs grows linearly with the batch size: each request needs to cache its entire prefix Need memory management which increase serving throughput & reduce costs.
Priorworks/ PagedAttention: the most widely used memory management algorithm for LLM serving
Leverages homogeneous-style management Divides the memory into fixed-size pages & allocates pages to requests dynamically based on token count in the request (request length) assume
- fixed-size embeddings: embedding(KV cache) sizes are the same for different tokens & layers.
- Full attention: each layer requires the same number of pages.
Limitation: struggle with modern LLMs (heterogeneous)
: because of the growing heterogeneity in the sizes of model’s internal embeddings & attention mechanisms(token dependencies to generate a new token).
- Recent models often have embeddings with different sizes.
- not all layers use the same attention mechanism: number of pages needed for each layers is different.
Drawback with modern LLMS: memory under-utlization
uniformly partitioning memory using PagedAttention leads to a throughput reduction of up to 2.16X compared to an oracle-guided memory partition because the under-utilization of memory limits the maximum batch size that can be used. = too conservative. not smartly leveraging resources.
Requirements: memory management handles modern LLMs (heterogeneous)
In modern heterogeneous LLMs, memory consumption
- varies significantly across layers
- is highly dependent on request length these heterogeneity necessitates allocating KV cache of different sizes for different tokens.
-> we need a heterogeneity-aware memory management system to address the variability in both
- embedding size
- attention mechanism
-> we need dynamic memory exchange among different layers at runtime
Challenges
1. Should reduce memory fragmentation because LLM can’t use that.
as most LLMs only have 2-3 possible allocation sizes, one for each layer type, fragments often can’t satisfy any allocation sizes.
2. Unpredictable memory usage from varying attention mechanisms across layers
different attention machanisms complicate GPU memory partitioning for prefix caching.
1. Cache hit dependency
we need to carefully balance cache sizes across different types for a good overall cache hit rate
the model’s cache hit rate is bounded by the layer type with the worst rate. cache hit length(max number of tokens in the common prefix) is limited by the layer with the shortest hit length. Because the cache miss of any one layer will require all layers to recompute that token.
we need to customize an evict policy for each layer type to enable effective prefix caching
However, to reach similar cache hit length, different attention mechanisms need to cache a different number of token.
- the optimal memory partition across layers depends heavily on request length & can change during runtime.
2. Trade-off between increasing cache hits & batch size
caching more token -> cache hits increase caching more token -> uses more memory -> limits batch size & reduce processing speed per token.
Design
a memory management framework specifically designed for heterogeneous KV caches
Key Idea: unified layer property interface that abstracts the behavior of each attention mechanism.
memory-related behaviors (ex. possible allocation sizes & page access patterns) can be identified in advance. Models these behaviors with layer properties. enable advanced memory management based on the layer properties.
What to monitor?
- page size (i.e., per-token KV memory footprint)
- active pages required for future token generation
- valid prefix cache hit lengths
Components
- Global LCM allocator
- Layer-specific allocators
- Layer-specific prefix cache managers
how to do this at runtime? : use factors(memory related behaviors) which can be identified in advance
Challenge 1. reduce memory fragmentation: 2-level memory allocator
Based on all possible embedding sizes, JENGA minimizes fragmentation by introducing a 2-level memory allocator.
Bottom layer: LCM Allocator
large, fixed-size pages. to mimimize internal fragmentation of large pages, pick a compatible large page size so that it is a mulitple of each embedding size.
“LCM allocator” that chooses the page size as least common multiple of all possible token embedding sizes. leverage the least common multiple of embedding sizes to optimize memory usage
LCM allocator can be seen as a particular case of a slab memory allocator.
Top layer : Customized allocators for each specific layer type
different-size embeddings divided from the large pages. Jenga allocates small pages within a single large page to the same request whenever feasible. This ensures that large pages can be returned to the large-page allocator once the request is completed.
Challenge 2. how to balance cache hit rate among layers
Based on the page access pattern of different attention mechanisms,
customizes the cache hit & eviction policies for various attention mechanisms maintains a balanced cache hit rate among layers.
proposes
- a common-prefix predictor to guide which tokens to cache: temporal locality
- a prefix cache simulator to find the best trade-off between cache hit and batch size.
Jenga maintain 2 pools
Jenga uses all memory not allocated to running requests to maintain the “cached page pool” and adds pages to the pool immediately after they are freed by running requests.
- common page pool: separately cache pages highly likely to be reused in future requests,
- regular cached page pool: size to about 1000 pages is enough for avoiding the cache miss due to maximized batch size
Customized Cache eviction
LRU based - last access time
- priortize cache eviction for unimportant pages
- maintain balanced eviction across layers
- To allocate space for new pages, JENGA only evicts pages from the regular cached page pool.
Customized Cache Hit
to unify: Upon receiving a new request, Jenga invokes get_possible_prefix for each layer type to identify valid prefixes. The longest common prefix valid across all layers is selected as the cache hit prefix. Jenga matches incoming request prefixes using pages from both pools.
Common Prefix Prediction
if a request A shares a common prefix with a recent request B, and the interval between their arrivals is within one minute, there is an 81% probability that this common prefix exactly matches the prefix shared between request B and an even earlier request C. this indicates that GPU can simultaneously hold all requests arriving between B and A.
Effects
reduces fragmentation & acheives higher cache hit rates.
Implementation
implement in vLLM in a wide variety of models and use cases: from heterogeneous attention layers within the same model, to KV caches of draft and target models in speculative decoding, and to vision embeddings of VLMs.
Evaluation
improves GPU memory utilization by up to 83% <- avoid fragmentation with 2 layer memory allocator serving throughput by up to 2.16× (1.46× on average). <- enable batching more efficient.
Compared to vLLM, a state-of-the-art LLM serving engine, Jenga improves memory utilization by up to 79.6%, and achieves up to 4.92× higher throughput (1.80× on average) without impacting end-to-end latency. Jenga also enables the full 10M context length of Llama 4 on one 8 × H 200 GPU node (vLLM reaches only 3.6M).
Related Works
VMbased solutions still require a prefix caching system to decide which pages to cache in the physical memory for KV cache: jenga uses temporal locality.
Contribution
- Observation: “heterogenious” trends in LLM
-
Added layer to control this: used slab-like way.
Inputs
1. embeddings
numerical representations of data — such as words, sentences, or images — in a continuous vector space.
2. prefix caching
let a new request reuse KV caches of an early request with common prefix
3. How a Slab Allocator Works
In OS memory management, a slab allocator works like this:
You know ahead of time what types of objects you’ll allocate (e.g., inodes, file descriptors, mutexes) Each object type has a fixed, known size You carve memory into slabs — chunks sized exactly for one object type When you need an object, you grab a pre-sized slot — no fragmentation, no waste
Questions
-
- Should we really need LCM? why not just use 4KB?
- I think it is necessary to use LCM. seems like the page size need to “larger than embedding size”. Unless , they might just have used smallest least common divisor or 4KB. oh wait. I think this is just abstraction. actually, chunk from Layer-specific allocators works like a page, which is just embedding size. LCM allocator is just abstraction to manage in higher level, clean way. —
Meeting Notes
-
- Come up with new ideas
- Does the paper completely solve the problem?