Jenga: Effective Memory Management for Serving LLM with Heterogeneity

a memory allocation framework for managing heterogeneous embeddings by leveraging layer properties

Featured image

Venue: SOSP 2025
Link: doi[https://dl.acm.org/doi/pdf/10.1145/3731569.3764823]

Topic: LLMs, memory management, leverage condition(trend) changed (homo -> hetero), add layer


Summary

No memory allocation framework for manage heterogenous embeddings(rising in modern LLMs).

Jenga: memory allocation framework for managing heterogeneous embeddings by leveraging layer properties.

reduces memory fragmentation enables customizable caching policies for different types of embeddings.

higher GPU memory utilization and increased serving throughput across a variety of models and scenarios.


Background

Problem: GPU memory capacity is primary bottleneck to high throughput

Priorworks/ PagedAttention: the most widely used memory management algorithm for LLM serving

Leverages homogeneous-style management Divides the memory into fixed-size pages & allocates pages to requests dynamically based on token count in the request (request length) assume

  1. fixed-size embeddings: embedding(KV cache) sizes are the same for different tokens & layers.
  2. Full attention: each layer requires the same number of pages.

Limitation: struggle with modern LLMs (heterogeneous)

: because of the growing heterogeneity in the sizes of model’s internal embeddings & attention mechanisms(token dependencies to generate a new token).

  1. Recent models often have embeddings with different sizes.
  2. not all layers use the same attention mechanism: number of pages needed for each layers is different.

Drawback with modern LLMS: memory under-utlization

uniformly partitioning memory using PagedAttention leads to a throughput reduction of up to 2.16X compared to an oracle-guided memory partition because the under-utilization of memory limits the maximum batch size that can be used. = too conservative. not smartly leveraging resources.

Requirements: memory management handles modern LLMs (heterogeneous)

In modern heterogeneous LLMs, memory consumption

-> we need a heterogeneity-aware memory management system to address the variability in both

-> we need dynamic memory exchange among different layers at runtime

Challenges

1. Should reduce memory fragmentation because LLM can’t use that.

as most LLMs only have 2-3 possible allocation sizes, one for each layer type, fragments often can’t satisfy any allocation sizes.

2. Unpredictable memory usage from varying attention mechanisms across layers

different attention machanisms complicate GPU memory partitioning for prefix caching.

1. Cache hit dependency

we need to carefully balance cache sizes across different types for a good overall cache hit rate

the model’s cache hit rate is bounded by the layer type with the worst rate. cache hit length(max number of tokens in the common prefix) is limited by the layer with the shortest hit length. Because the cache miss of any one layer will require all layers to recompute that token.

we need to customize an evict policy for each layer type to enable effective prefix caching

However, to reach similar cache hit length, different attention mechanisms need to cache a different number of token.

2. Trade-off between increasing cache hits & batch size

caching more token -> cache hits increase caching more token -> uses more memory -> limits batch size & reduce processing speed per token.


Design

a memory management framework specifically designed for heterogeneous KV caches

Key Idea: unified layer property interface that abstracts the behavior of each attention mechanism.

memory-related behaviors (ex. possible allocation sizes & page access patterns) can be identified in advance. Models these behaviors with layer properties. enable advanced memory management based on the layer properties.

What to monitor?

  1. page size (i.e., per-token KV memory footprint)
  2. active pages required for future token generation
  3. valid prefix cache hit lengths

Components

  1. Global LCM allocator
  2. Layer-specific allocators
  3. Layer-specific prefix cache managers

Challenge 1. reduce memory fragmentation: 2-level memory allocator

Based on all possible embedding sizes, JENGA minimizes fragmentation by introducing a 2-level memory allocator.

Bottom layer: LCM Allocator

large, fixed-size pages. to mimimize internal fragmentation of large pages, pick a compatible large page size so that it is a mulitple of each embedding size.

“LCM allocator” that chooses the page size as least common multiple of all possible token embedding sizes. leverage the least common multiple of embedding sizes to optimize memory usage

LCM allocator can be seen as a particular case of a slab memory allocator.

Top layer : Customized allocators for each specific layer type

different-size embeddings divided from the large pages. Jenga allocates small pages within a single large page to the same request whenever feasible. This ensures that large pages can be returned to the large-page allocator once the request is completed.

Challenge 2. how to balance cache hit rate among layers

Based on the page access pattern of different attention mechanisms,

customizes the cache hit & eviction policies for various attention mechanisms maintains a balanced cache hit rate among layers.

proposes

Jenga maintain 2 pools

Jenga uses all memory not allocated to running requests to maintain the “cached page pool” and adds pages to the pool immediately after they are freed by running requests.

Customized Cache eviction

LRU based - last access time

  1. priortize cache eviction for unimportant pages
  2. maintain balanced eviction across layers
  3. To allocate space for new pages, JENGA only evicts pages from the regular cached page pool.

Customized Cache Hit

to unify: Upon receiving a new request, Jenga invokes get_possible_prefix for each layer type to identify valid prefixes. The longest common prefix valid across all layers is selected as the cache hit prefix. Jenga matches incoming request prefixes using pages from both pools.

Common Prefix Prediction

if a request A shares a common prefix with a recent request B, and the interval between their arrivals is within one minute, there is an 81% probability that this common prefix exactly matches the prefix shared between request B and an even earlier request C. this indicates that GPU can simultaneously hold all requests arriving between B and A.


Effects

reduces fragmentation & acheives higher cache hit rates.


Implementation

implement in vLLM in a wide variety of models and use cases: from heterogeneous attention layers within the same model, to KV caches of draft and target models in speculative decoding, and to vision embeddings of VLMs.


Evaluation

improves GPU memory utilization by up to 83% <- avoid fragmentation with 2 layer memory allocator serving throughput by up to 2.16× (1.46× on average). <- enable batching more efficient.

Compared to vLLM, a state-of-the-art LLM serving engine, Jenga improves memory utilization by up to 79.6%, and achieves up to 4.92× higher throughput (1.80× on average) without impacting end-to-end latency. Jenga also enables the full 10M context length of Llama 4 on one 8 × H 200 GPU node (vLLM reaches only 3.6M).


Related Works

VMbased solutions still require a prefix caching system to decide which pages to cache in the physical memory for KV cache: jenga uses temporal locality.


Contribution

  1. Observation: “heterogenious” trends in LLM
  2. Added layer to control this: used slab-like way.

Inputs

1. embeddings

numerical representations of data — such as words, sentences, or images — in a continuous vector space.

2. prefix caching

let a new request reuse KV caches of an early request with common prefix

3. How a Slab Allocator Works

In OS memory management, a slab allocator works like this:

You know ahead of time what types of objects you’ll allocate (e.g., inodes, file descriptors, mutexes) Each object type has a fixed, known size You carve memory into slabs — chunks sized exactly for one object type When you need an object, you grab a pre-sized slot — no fragmentation, no waste


Questions

  1. Should we really need LCM? why not just use 4KB?
    I think it is necessary to use LCM. seems like the page size need to “larger than embedding size”. Unless , they might just have used smallest least common divisor or 4KB. oh wait. I think this is just abstraction. actually, chunk from Layer-specific allocators works like a page, which is just embedding size. LCM allocator is just abstraction to manage in higher level, clean way. —

Meeting Notes

  1. Come up with new ideas
    Does the paper completely solve the problem?

Thoughts.

1. To control something at runtime scale, we need to watch some factor can be pre-determined