Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters

Co-optimize resource allocation and parallelization plan for heterogeneous GPU training with accurate simulation

Venue: SOSP 2025
Link: ACM DL

Topic: Heterogeneous and geo-distributed GPU clusters offer more resources per job but introduce challenges: vast configuration spaces, inaccurate simulators, and frameworks that don’t support heterogeneous parallelization plans. Sailor addresses all three.


Summary

Using heterogeneous GPU types and geo-distributed resources gives model developers access to more GPUs per job, which can improve training throughput. But current systems can’t use heterogeneous resources efficiently because they:

  1. Don’t co-optimize resource allocation with the job’s parallelization plan.
  2. Rely on inaccurate simulators for throughput/memory estimation.
  3. Use training frameworks (e.g., Megatron-LM) that don’t support heterogeneous parallelization plans.

Sailor introduces a profiler, configuration planner, simulator, and extended distributed training framework that jointly solve all three problems.


Background

Heterogeneous and geo-distributed resources

Existing system limitations

  1. No co-optimization: current systems optimize resource allocation and the parallelization plan independently, which leads to suboptimal configurations.
  2. Inaccurate simulators: existing memory footprint and iteration time estimates miss critical factors.
  3. Inflexible frameworks: Megatron-LM and similar systems are slow to reconfigure and don’t support different microbatch sizes per GPU, which heterogeneous plans require.
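
To see why per-GPU microbatch sizes matter, here is a minimal sketch, not taken from the paper: if data-parallel replicas run on GPUs of different speeds, giving each replica the same share of the batch makes the fast GPUs wait for the slow ones. Splitting the batch in proportion to speed roughly equalizes step time. The function name `split_batch`, the GPU labels, and the speed numbers are all hypothetical, and Sailor's actual assignment logic may differ.

```python
def split_batch(global_batch: int, speeds: dict[str, float]) -> dict[str, int]:
    """Split a global batch across replicas in proportion to their speed.

    `speeds` maps a replica name to an assumed relative throughput
    (samples/sec). Uses the largest-remainder method so the integer
    sizes always sum to `global_batch`.
    """
    total_speed = sum(speeds.values())
    # Ideal (fractional) share of the batch for each replica.
    ideal = {name: global_batch * s / total_speed for name, s in speeds.items()}
    # Start from the floor of each share.
    sizes = {name: int(share) for name, share in ideal.items()}
    leftover = global_batch - sum(sizes.values())
    # Hand the remaining samples to the replicas with the largest remainders.
    by_remainder = sorted(ideal, key=lambda n: ideal[n] - sizes[n], reverse=True)
    for name in by_remainder[:leftover]:
        sizes[name] += 1
    return sizes


# Hypothetical example: one GPU is assumed to be 3x faster than the other.
sizes = split_batch(16, {"fast_gpu": 3.0, "slow_gpu": 1.0})
print(sizes)
```

With an even split, the slow GPU here would take about twice as long per step as it does with the proportional split, and the fast GPU would sit idle in the meantime.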

Key Idea

Co-optimize resource allocation + parallelization plan

Accurate simulation

Heterogeneous framework


Design

  1. Profiler: collects job information, compute node specs, and network bandwidth.
  2. Configuration planner: navigates the search space and recommends configurations that optimize a user-defined objective (maximum throughput or minimum cost) under constraints (budget, minimum throughput).
  3. Simulator: accurately models iteration time and memory footprint for any given configuration.
  4. Distributed training framework: extends Megatron-LM with support for heterogeneous configurations, fault tolerance, and elasticity.
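
The planner/simulator split above can be illustrated with a toy sketch. This is an assumption-laden illustration of co-optimizing allocation and parallelization, not Sailor's algorithm: it brute-forces every combination of GPU counts and pipeline depth, asks a deliberately crude `simulate` function for throughput and cost, and keeps the best configuration within a budget. All names (`GpuType`, `simulate`, `plan`), GPU labels, and numbers are hypothetical. The only borrowed fact is the standard pipeline-bubble estimate, where useful time is m / (m + p - 1) for m microbatches and p stages.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class GpuType:
    name: str
    samples_per_sec: float   # assumed per-GPU compute rate (made-up)
    mem_gb: float
    dollars_per_hour: float
    available: int


def simulate(counts, gpus, pp, model_mem_gb, microbatches):
    """Toy stand-in for a simulator: (throughput, cost), or None if invalid."""
    total = sum(counts)
    if total == 0 or total % pp != 0:
        return None
    used = [g for g, c in zip(gpus, counts) if c > 0]
    # Memory check: assume model state is split evenly across pipeline stages.
    if model_mem_gb / pp > min(g.mem_gb for g in used):
        return None
    # Pipeline bubble: useful fraction of time is m / (m + pp - 1).
    efficiency = microbatches / (microbatches + pp - 1)
    raw_rate = sum(g.samples_per_sec * c for g, c in zip(gpus, counts))
    cost = sum(g.dollars_per_hour * c for g, c in zip(gpus, counts))
    return efficiency * raw_rate, cost


def plan(gpus, model_mem_gb, budget_per_hour, microbatches=8, max_pp=8):
    """Jointly search GPU counts and pipeline depth; maximize throughput."""
    best = None
    for counts in product(*(range(g.available + 1) for g in gpus)):
        for pp in range(1, max_pp + 1):
            result = simulate(counts, gpus, pp, model_mem_gb, microbatches)
            if result is None:
                continue
            throughput, cost = result
            if cost > budget_per_hour:
                continue  # violates the budget constraint
            if best is None or throughput > best["throughput"]:
                best = {
                    "counts": {g.name: c for g, c in zip(gpus, counts)},
                    "pp": pp,
                    "throughput": throughput,
                    "cost": cost,
                }
    return best


# Hypothetical cluster: a few large GPUs and more small ones.
gpus = [
    GpuType("gpu_big", 10.0, 80.0, 4.0, available=4),
    GpuType("gpu_small", 4.0, 32.0, 1.5, available=8),
]
best = plan(gpus, model_mem_gb=100.0, budget_per_hour=20.0)
print(best)
```

Because the allocation and the pipeline depth are searched together, the planner can pick a mixed allocation that only becomes feasible at a deeper pipeline, which an allocate-first, parallelize-second approach would miss. A real search space is far too large for brute force, and a real simulator has to model much more than this one does.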

Evaluation


Limitations


Insights


Meeting Notes