cache_ext: Customizing the Page Cache with eBPF
OS page cache evicition policy: only 1 -> supports multiple policy, FLEXIBILITY
Venue: SOSP 2025 doi
Topic: cache eviction, flexibility, add layer, eBPF
Summary
OS page cache evicition policy: one-size-fits-all -> supports multiple policy, FLEXIBILITY provide interface so that programmers don’t need to modify kernel to use different policy
Contribution
achieve compatibility with current linux environment + multiple page eviction strategies (even app characteristics aware) by leveraging eBPF’s kfuncs & struct_ops
Background
- OS page cache reduces excessive accesses to storage.
- OS page cache eviction policy is: one-size-fits-all (approximate LRU) -> performs poor
- Applying different eviction policies is difficult in the page cache: modifying kernel code is complex.
Goal: allows developers to customize the page cache without modifying the kernel
flexible eBPF-based framework for the Linux page cache
Challenges
Scalability
custom page cache policies must run with low overhead. modern storage devices support millions of IOPS with low latency
Flexibility
caching algorithms are very diverse & use complex data structures
Isolation & sharing
different applications’ policies do not interfere with each other
Security
dealing with invalid references which could lead to kernel crashes or security breaches
Design
High Level
- allows users to run custom eviction policy functions (implemented as eBPF functions in the kernel)
- The policy functions
- are triggered by particular events (e.g., folio eviction, access, admission)
- operate on a user-specified number of variable-sized eviction lists (store pointers to folios managed by the policy)
- The policy functions determine which folios to admit or evict to and from the lists based on metadata (e.g., access frequency, recency, which thread accessed the folio), which is stored in eBPF maps.
- Upon eviction, cache_ext runs a user-defined eviction function to propose a set of eviction candidates for the kernel to evict.
Solve
Scalability: custom page cache policies must run with low overhead
eBPF-based policies run in the kernel: avoids expensive & frequent synchronization between the kernel and userspace.
Flexibility: should supports caching algorithms: very diverse & use complex data structures
allows applications to define
- one or more variable-sized lists of pages
- a set of policy functions (e.g., admission, eviction) that operate on these lists, which can be used to express a wide range of eviction policies.
Isolation & Sharing: different applications’ policies do not interfere with each other
identify cgroups as a natural isolation boundary. each cgroup implement its own eviction policy without interfering with other cgroups
Security: dealing with invalid references which could lead to kernel crashes or security breaches
maintains a registry of valid page references, which is used to validate the page references returned by the user-defined policies.
Implementation
- takes advantage of eBPF: a Linux (and Windows) supported runtime that allows safely running application code inside the kernel.
- by implementing policies in-kernel, cache_ext can delegate all policy decisions to custom policies and enforce them directly.
- consider multi-tenant isolation & application-informed policies.
- used struct_ops and new kfunc APIs in accordance with modern eBPF practices.
Policy Functions
allows applications to define custom eviction policies as policy functions
- policy functions: a set of eBPF programs that trace caching events and determine which folios to evict from the page cache.
- triggered by five events: policy initialization, request for eviction, folio admission, folio access, folio removal.
- implemented using eBPF’s struct_ops kernel interface
Application-Informed Eviction
- applications to use eviction algorithms tailored to their design.
- tag the pages, and differentiate priority
Application-Informed Admission Filter
act like a bouncer. if something doesn’t need to be cached, (large, streaming) then don’t save that to cache
Evaluation
it is beneficial for applications to customize the page cache to match their workloads’ unique properties,
- achieve up to 70% higher throughput
- 58% lower tail latency.
Related Work
eBPF policy customization in Linux
P2Cache
can’t support eviction policies that don’t rely on LRU queue or require multiple queues.
PageFlex
not eviction: swapping & prefetching
FetchBPF
not eviction: customize prefetching policies
sched_ext: eBPF framework for custom Linux scheduling policies
more performant for process scheduling than page eviction.
- number of processes on a machine < pages in memory
- process scheduling occurs less frequently than page cache events (e.g., page access).
Page cache customization: Extensible kernels
allow applications to customize kernel interfaces and policies.
- ex) VINO, SPIN These OS designs never achieved widespread use.
Page cache customization: Extensible file systems
lack of compatibility: wouldn’t work with existing Linux or legacy file systems.
- ex) ACFS, XN, libOS, Bento
Userspace caches
: simply implement their own userspace cache, and bypass the OS page cache with direct I/O
- ex) TriCache many popular data systems still rely on the page cache
Linux: not very meaningful
add customization options to its LRU policy: add new MGLRU policy to replace the old one MGLRU sometimes underperforms the default LRU algorithm. -> no golden rule. we need more granularity following the workloads.
Inputs
no one-size-fits-all policy that improves all workloads
customization is necessary to maximize performance.
cgroups
a Linux mechanism that isolate resource usage for groups of processes
2 ways to access page cache
Pages in the page cache can be accessed through memory mappings (in which case the page table entry access bit will be set) or through file-based interfaces (e.g., read()).
folio
page container that can hold multiple pages
LRU is notoriously bad for scan-like access patterns.
madvise & fadvise
don’t work well sometimes: their actual behavior is highly dependent on the kernel implementation, can yield unexpected results.
sandbox: a restricted environment where code can run safely without affecting the rest of the system
eBPF
allows userspace functions to run in a sandbox within the Linux kernel in a safe and controlled manner.
- how: eBPF runs “userspace-written logic in the kernel” by compiling it into restricted bytecode, verifying it for safety, and executing it in a tightly controlled sandbox with limited memory access and predefined hook points.
usage
observability, security, scheduling, I/O acceleration, customizing page cache behavior, customizing memory management policies such as prefetching
eBPF programs characteristics
eBPF programs are verified by the kernel before they can be run, ensuring, for example, that the programs do not contain illegal memory accesses, and that they will terminate within a fixed number of instructions.
struct_ops: exposes an interface of function-pointer callbacks to userspace.
struct_ops makes it much easier to introduce new user-defined policies by minimizing complex verifier changes.
- has been used for: TCP congestion control algorithms, FUSE BPF filesystems, correcting HID device behavior, packet scheduling, and by sched_ext, an eBPF scheduling framework.
kfuncs: eBPF programs interact with the kernel via kfuncs
- kfuncs: do not necessarily have a stable interface
- helper: eBPF programs also interact with the kernel through helper functions, which are considered a stable interface
- kfuncs are simpler to implement and are more flexible.
Userspace-offload overhead
can be expensive. not true : userspace == fast
many of selection policies can be implemeted using linked lists
either exactly or approximately using linked lists, where the policy iterates over one or more lists and evicts items based on a calculated per-item score.
cgroup isolation
common pattern of deploying modern applications via containers: isolate each application in its own memory cgroup.
Eviction policies
S3-FIFO, Least Hit Density, Multi-Generational LRU, LFU, MRU, FIFO
Questions
customization can be dangerous when it happens with arbitary situation
maybe we can use it with memory allocation (something requires multiple policies)
it’s decades ago maybe there would be some changes that we can examine.
Thoughts.
there is no “one-size-fits-all” policy
there is no one-size-fits-all policy that improves all workloads – customization is necessary to maximize performance. -> maybe this one also helps our ideas too. only controlling core and let cache and membw freely is not okay.
Better finishing OS textbook… my file system knowledge is not enough
the kernel provides madvise() and fadvise() system calls. These interfaces allow userspace applications to give hints to the kernel about how to handle certain ranges of memory or files.
But I think I read that in most case LRU works in real world situation. HHmmmmmmmm
with chatgpt, now there is literally no excuse that saying “I couldn’t understand the paper”.. hahaha
Finding a blank space in academia & doing good implemention
so it’s not like new idea. everyone knows linux’s single LRU eviction strategy, and there were tries with eBPF. But they failed handling heterogeniousity or not handling page cache eviction.
so they implemented the layer to link eBPF and customed eviction policy management, kinda from scratch using struct_ops and kfuncs.
eBPF is relatively new tech. And i guess it would be quite recent that LRU really doesn’t severely work well. (actually i still a bit doubt but anyway…) Guess they succeed to catch the right timing and made a good implementation.