Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters
Co-optimize resource allocation and parallelization plan for heterogeneous GPU training with accurate simulation
Venue: SOSP 2025
Link: ACM DL
Topic: Heterogeneous and geo-distributed GPU clusters offer more resources per job but introduce challenges: vast configuration spaces, inaccurate simulators, and frameworks that don’t support heterogeneous parallelization plans. Sailor addresses all three.
Summary
Using heterogeneous GPU types and geo-distributed resources gives model developers access to more GPUs per job, which can improve training throughput. But current systems can’t use heterogeneous resources efficiently because they:
- Don’t co-optimize resource allocation with the job’s parallelization plan.
- Rely on inaccurate simulators for throughput/memory estimation.
- Use training frameworks (e.g., Megatron-LM) that don’t support heterogeneous parallelization plans.
Sailor introduces a profiler, configuration planner, simulator, and extended distributed training framework that jointly solve all three problems.
Background
Heterogeneous and geo-distributed resources
- Different GPU types (A100, V100, etc.) and different inter-node bandwidths across zones.
- More GPUs per job → higher throughput, but requires carefully planned parallelization.
Existing system limitations
- No co-optimization: current systems optimize resource allocation and parallelization plan independently → suboptimal configurations.
- Inaccurate simulators: existing memory footprint and iteration time estimates miss critical factors.
- Inflexible frameworks: Megatron-LM and similar systems are slow to reconfigure and don’t support different microbatch sizes per GPU, which heterogeneous plans require.
Key Idea
Co-optimize resource allocation + parallelization plan
- Co-optimization creates a large search space → addressed with:
  - Heuristic pruning: prune based on memory footprint, GPU capacity, and scalability constraints.
  - Dynamic programming: reuse performance estimates for subproblems to avoid redundant computation.
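The combination of pruning and dynamic programming can be illustrated with a minimal sketch. It is not Sailor's planner: the per-layer memory and time constants, the `fits` check, and the `plan` function are all hypothetical and chosen only to show the two ideas. The sketch assigns consecutive layers to pipeline stages with fixed GPU types, rejects any stage that would exceed GPU memory, and memoizes the best result for each (layers left, stage) subproblem.

```python
from functools import lru_cache

# Hypothetical numbers for illustration only; not taken from the paper.
GPU_MEM_GB = {"A100": 40, "V100": 16}
LAYER_MEM_GB = 3.0                             # assumed footprint per layer
LAYER_TIME_MS = {"A100": 10.0, "V100": 25.0}   # assumed compute time per layer


def fits(gpu: str, layers: int, tp: int) -> bool:
    """Heuristic pruning: reject a stage whose per-GPU footprint exceeds capacity."""
    return layers * LAYER_MEM_GB / tp <= GPU_MEM_GB[gpu]


def plan(num_layers: int, stage_gpus: tuple, tp: int = 1):
    """Return (bottleneck_ms, layers_per_stage) minimizing the slowest stage,
    or None when no assignment fits in memory."""

    @lru_cache(maxsize=None)          # dynamic programming: reuse subproblem results
    def best(layers_left: int, stage: int):
        gpu = stage_gpus[stage]
        if stage == len(stage_gpus) - 1:
            # The last stage must take every remaining layer.
            if layers_left < 1 or not fits(gpu, layers_left, tp):
                return None
            return (layers_left * LAYER_TIME_MS[gpu] / tp, (layers_left,))
        result = None
        stages_after = len(stage_gpus) - stage - 1
        # Leave at least one layer for each later stage.
        for n in range(1, layers_left - stages_after + 1):
            if not fits(gpu, n, tp):
                break                 # more layers only need more memory
            rest = best(layers_left - n, stage + 1)
            if rest is None:
                continue
            bottleneck = max(n * LAYER_TIME_MS[gpu] / tp, rest[0])
            if result is None or bottleneck < result[0]:
                result = (bottleneck, (n,) + rest[1])
        return result

    return best(num_layers, 0)


print(plan(12, ("A100", "V100")))   # (90.0, (9, 3))
print(plan(30, ("V100", "V100")))   # None: no split fits in 16 GB per stage
```

In the first call the faster A100 stage receives 9 of the 12 layers, because the V100 stage is both slower per layer and limited to 5 layers by memory.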
Accurate simulation
- Collect more signals for estimation (job information, compute node specs, and network bandwidth, as gathered by the profiler).
- Use a principled formula to model iteration time and memory footprint for any configuration.
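To make "a formula for iteration time and memory footprint" concrete, here is a rough sketch under stated assumptions. It is not Sailor's model: it uses a common approximation for synchronous pipeline parallelism (one pipeline fill plus steady-state time bounded by the slowest stage) and the widely cited 16 bytes per parameter for mixed-precision Adam. The function names and parameters are hypothetical.

```python
def iteration_time_ms(stage_ms, num_microbatches, p2p_ms=0.0, allreduce_ms=0.0):
    """Approximate one training iteration of a synchronous pipeline.

    stage_ms: forward+backward time of one microbatch on each stage.
    p2p_ms: assumed transfer time between adjacent stages.
    allreduce_ms: assumed gradient synchronization time at the end.
    """
    # Pipeline fill: the first microbatch traverses every stage once.
    fill = sum(stage_ms) + p2p_ms * (len(stage_ms) - 1)
    # Steady state: the remaining microbatches are paced by the slowest stage.
    steady = (num_microbatches - 1) * max(stage_ms)
    return fill + steady + allreduce_ms


def memory_gb(params_billion, activation_gb, bytes_per_param=16):
    """Approximate footprint: model state plus activations, in GB (1e9 bytes).

    16 bytes/param assumes mixed-precision Adam: fp16 weights and gradients
    (2 + 2) plus fp32 weights, momentum, and variance (4 + 4 + 4).
    """
    return params_billion * bytes_per_param + activation_gb


print(iteration_time_ms([90.0, 75.0], 8, p2p_ms=2.0, allreduce_ms=30.0))  # 827.0
print(memory_gb(1.5, 4.0))                                                # 28.0
```

The heterogeneous case shows up in `max(stage_ms)`: a single slow stage sets the pace for every microbatch after the first, which is why an estimate that ignores per-stage differences can be far off.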
Heterogeneous framework
- Extends Megatron-LM to support heterogeneous parallelization plans, different microbatch sizes per GPU, fast reconfiguration, fault tolerance, and elasticity.
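Why per-GPU microbatch sizes matter can be sketched as follows. This is an illustration, not the framework's actual mechanism: the `split_batch` helper and the throughput figures are hypothetical. It divides a global batch across replicas in proportion to their measured throughput, so that fast and slow GPUs finish a step at roughly the same time.

```python
def split_batch(global_batch: int, samples_per_sec: dict) -> dict:
    """Split a global batch across replicas proportionally to throughput."""
    total = sum(samples_per_sec.values())
    ideal = {k: global_batch * v / total for k, v in samples_per_sec.items()}
    sizes = {k: int(x) for k, x in ideal.items()}   # round down first
    leftover = global_batch - sum(sizes.values())
    # Hand the remaining samples to the largest fractional remainders.
    by_remainder = sorted(ideal, key=lambda k: ideal[k] - sizes[k], reverse=True)
    for k in by_remainder[:leftover]:
        sizes[k] += 1
    return sizes


print(split_batch(64, {"a100-0": 300.0, "v100-0": 100.0}))
# {'a100-0': 48, 'v100-0': 16}
```

With equal sizes (32 each), the slower replica would take three times as long per step as the faster one, leaving the faster GPU idle for most of the iteration.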
Design
- Profiler: collects job information, compute node specs, and network bandwidth.
- Configuration planner: navigates the search space, recommends configurations optimizing a user-defined objective (max throughput or min cost) under constraints (budget, min throughput).
- Simulator: accurately models iteration time and memory footprint for any given configuration.
- Distributed training framework: adds support for heterogeneous configs, fault tolerance, and elasticity to Megatron-LM.
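The planner's final selection step, choosing by objective under constraints, can be sketched as below. This assumes the candidate configurations have already been scored by a simulator; the `Candidate` type, the `choose` function, and the example numbers are hypothetical and not the paper's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    throughput: float      # e.g. iterations/sec, as estimated by a simulator
    cost_per_hour: float   # e.g. USD/hour for the allocated resources


def choose(cands, objective, max_cost=None, min_throughput=None):
    """Pick the best feasible candidate, or None if the constraints exclude all."""
    feasible = [
        c for c in cands
        if (max_cost is None or c.cost_per_hour <= max_cost)
        and (min_throughput is None or c.throughput >= min_throughput)
    ]
    if not feasible:
        return None
    if objective == "max_throughput":
        return max(feasible, key=lambda c: c.throughput)
    if objective == "min_cost":
        return min(feasible, key=lambda c: c.cost_per_hour)
    raise ValueError(f"unknown objective: {objective}")


cands = [
    Candidate("8xA100", 10.0, 32.0),
    Candidate("4xA100+8xV100", 9.0, 20.0),
    Candidate("8xV100", 4.0, 12.0),
]
print(choose(cands, "max_throughput", max_cost=25.0).name)    # 4xA100+8xV100
print(choose(cands, "min_cost", min_throughput=5.0).name)     # 4xA100+8xV100
```

In this made-up example the mixed configuration wins under both objectives once a constraint is applied, which is the kind of outcome a planner restricted to homogeneous allocations would never consider.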
Evaluation
- Evaluated on heterogeneous and geo-distributed clusters.
- Outperforms baselines in throughput and cost efficiency.
Limitations
- Heterogeneity may prevent use of high-performance collective communication libraries.
- Geo-distributed networks are prone to unpredictable jitter and packet loss.
- Parallelization strategies are optimized for homogeneous settings → may need new strategies for heterogeneous-optimized collectives.
Insights
- Leverages properties of the ML training domain to keep the search tractable:
  - Predictable workloads → accurate simulation.
  - Domain knowledge (memory footprint constraints) → prune the search space.
  - Throughput-only objective → simplifies optimization.
- Solid evaluation with many baselines and a well-structured motivation section.
Meeting Notes
- Existing hardware limitations are fundamental → sometimes need to modify hardware assumptions.
- Even with the same hardware, finding the right use case creates new value.
- Taking app feedback ≠ application modification.
- Fairness as a goal: showed a 21.6% improvement, which is not necessarily high enough to be the primary contribution.
- Difficult to improve performance through resource reallocation alone in general data centers, but certain domains may benefit.