Caladan: Mitigating Interference at Microsecond Timescales

A dedicated scheduler core with fast core allocation reacts to interference in microseconds, without relying on slow hardware partitioning


Venue: OSDI 2020
Link: USENIX OSDI '20

Topic: Hardware resource partitioning reacts too slowly (tens of seconds) to interference between latency-critical and best-effort tasks. Caladan uses a dedicated scheduler core that monitors fine-grained control signals and adjusts core allocations in microseconds.


Summary

Data centers co-locate multiple tasks on one machine to raise utilization, but interference between those tasks causes spikes in tail latency. Hardware partitioning falls short in both of its forms: static partitioning wastes utilization, and dynamic partitioning takes seconds to converge. Caladan replaces partitioning with core allocation driven by real-time control signals, so it can react to interference in microseconds.


Background

The interference problem

Types of CPU interference

  1. Hyperthreading interference (within one physical core): execution unit contention.
  2. Memory bandwidth interference: affects all cores on the same physical CPU.
  3. LLC interference: affects all cores on the same physical CPU.
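
The three types differ mainly in how far the interference reaches. The sketch below illustrates that scope on a made-up machine; the `TOPOLOGY` table, the `affected_cpus` helper, and the interference-type names are hypothetical and only restate the list above, treating "physical CPU" as a socket.

```python
# Hypothetical topology: logical CPU id -> (socket id, physical core id).
TOPOLOGY = {
    0: (0, 0), 1: (0, 0),  # two hyperthreads sharing physical core 0
    2: (0, 1), 3: (0, 1),  # two hyperthreads sharing physical core 1
    4: (1, 2), 5: (1, 2),  # a second socket with one physical core
}


def affected_cpus(kind, cpu, topology=TOPOLOGY):
    """Return the other logical CPUs that a noisy task on `cpu` can disturb."""
    socket, phys = topology[cpu]
    if kind == "hyperthreading":
        # Only the sibling hyperthreads on the same physical core are affected.
        return sorted(c for c, loc in topology.items()
                      if loc == (socket, phys) and c != cpu)
    if kind in ("memory_bandwidth", "llc"):
        # Every other logical CPU on the same socket is affected.
        return sorted(c for c, (s, _) in topology.items()
                      if s == socket and c != cpu)
    raise ValueError(f"unknown interference type: {kind}")


print(affected_cpus("hyperthreading", 0))    # sibling hyperthread only
print(affected_cpus("memory_bandwidth", 0))  # rest of the socket
```

On this toy layout, hyperthreading interference from CPU 0 reaches only CPU 1, while memory bandwidth or LLC interference reaches CPUs 1, 2, and 3.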

Limitation of partitioning


Key Idea

Core allocation as the isolation mechanism

Two challenges

  1. Sensitivity: many interference types → need signals that accurately detect each type.
  2. Scalability: must gather signals and adjust allocations faster than interference builds up.

Design

Dedicated scheduler core

Scheduler workflow

  1. Collect fine-grained measurements: memory bandwidth usage, request processing times.
  2. Detect interference: memory bandwidth saturation, hyperthreading contention.
  3. Adjust core allocations: grant more cores to latency-critical (LC) tasks experiencing queueing delays.
  4. Take cores away from best-effort (BE) tasks until the interference is resolved.
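
The workflow above can be sketched as one pass of a control loop. This is a minimal illustration under stated assumptions, not Caladan's implementation: the `Task` fields, the `scheduler_step` function, and both thresholds are hypothetical, measurements are passed in as plain numbers instead of being read from hardware, and hyperthreading contention is left out to keep the sketch short.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
BW_SATURATION_GBPS = 50.0    # total memory bandwidth treated as saturated
QUEUE_DELAY_LIMIT_US = 10.0  # queueing delay that triggers a core grant


@dataclass
class Task:
    name: str
    latency_critical: bool       # True for LC, False for BE
    cores: int                   # cores currently allocated
    queue_delay_us: float = 0.0  # measured queueing delay
    mem_bw_gbps: float = 0.0     # measured memory bandwidth use


def scheduler_step(tasks, idle_cores):
    """Run one pass over the tasks and return the new idle-core count."""
    # Steps 1-2: combine the measurements and check for bandwidth saturation.
    total_bw = sum(t.mem_bw_gbps for t in tasks)
    if total_bw >= BW_SATURATION_GBPS:
        # Step 4: take one core from the BE task using the most bandwidth.
        be_tasks = [t for t in tasks if not t.latency_critical and t.cores > 0]
        if be_tasks:
            worst = max(be_tasks, key=lambda t: t.mem_bw_gbps)
            worst.cores -= 1
            idle_cores += 1
    # Step 3: grant one core to each LC task that is seeing queueing delay.
    for t in tasks:
        if (t.latency_critical and t.queue_delay_us > QUEUE_DELAY_LIMIT_US
                and idle_cores > 0):
            t.cores += 1
            idle_cores -= 1
    return idle_cores


tasks = [
    Task("lc-server", True, cores=2, queue_delay_us=25.0, mem_bw_gbps=5.0),
    Task("be-batch", False, cores=4, mem_bw_gbps=48.0),
]
idle = scheduler_step(tasks, idle_cores=0)
print([(t.name, t.cores) for t in tasks], idle)
```

In this example the combined bandwidth crosses the hypothetical saturation threshold, so the BE task loses one core, and the LC task, which is seeing queueing delay, picks that core up in the same pass. A real scheduler would repeat this step continuously on the dedicated core.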

Controllers

KSCHED: Fast and Scalable Scheduling

KSCHED workflow


Insights


Meeting Notes

(to be filled)