January 28, 2026

Robust LLM Training Infrastructure at ByteDance

Automated failure diagnosis and recovery for large-scale LLM training — minimize unproductive time across 16K+ GPUs

Venue: SOSP 2025

Topic: Large-scale LLM training is inherently unstable — failures are frequent and hard to detect. Diagnosis and recovery currently take hours to days. This paper presents an automated framework that minimizes unproductive training time.

Summary

Training large language models at 16K+ GPU scale is routinely interrupted by failures. Current practice: timeout-based detection, manual diagnosis, and full-job rescheduling — resulting in hours to days of wasted GPU time. Three types of failures make this hard: implicit failures (hard to detect), ultra-large scale (not enough spares for naive recovery), and continuously evolving user code (interacts with failure patterns).

Key insight: prioritize rapid isolation over precise root cause localization — finding the exact faulty machine is slow; quickly excluding suspected machines and resuming is faster.

Background

Why large-scale LLM training is unstable

Implicit failures: silent corruption, hangs — hard to detect without active monitoring.
Ultra-large scale: 16K+ GPUs → not enough spare machines to simply replace any failed component.
Months-long training runs: user code evolves continuously → failures interact with code updates.

Current failure handling

Timeouts and process termination to identify faulty machines → significant waste of GPU cycles.
Fail-stop → diagnose → reassume: entire procedure takes hours to days.

Key Idea

Three principles

1. Rapid isolation, not precise localization

Precise failure pinpointing leaves vast GPU fleets idle during diagnosis.
Instead: lightweight real-time detection + hierarchical stop-time diagnostics.
- Quickly single out suspect machines with minimal overhead.
- Data-driven clustering of runtime stack traces to isolate suspected machines.
- Over-evict suspects rather than chasing exact root causes.

2. Control variability during recovery

Automated fault tolerance framework with machine fault detection.
Code rollback for rapid verification — new code changes are merged with deterministic failures via a lazy update approach.
Exploits the inevitability and high frequency of failures as natural rollback trigger points.

3. Controlled and rapid recovery

Pre-provisioned warm standbys: execute self-checks before being assigned → avoid full job rescheduling.
Checkpointing module: fast backup and restart from recent checkpoint.

Design

Architecture

Control Plane:
- Robust Controller: orchestrates the automated failure mitigation framework.
- Runtime Analyzer: addresses job hangs and performance degradations.
Data Plane: resides within each training pod — monitors health and execution state.

Interesting Points

Optimization techniques and parallelization strategies derived from small-scale testing are often suboptimal at large scale.
Lazy update approach turns inevitable failures into a useful mechanism: merge code updates at failure/restart boundaries.

Assessment

Strengths:

Handles 16K+ GPU scale with fault tolerance and redundancy.
Step-by-step error detection procedure with empirical validation.
Detailed technical implementation.

Weaknesses:

Abstract is light on contributions.
Limited novelty compared to prior work on distributed fault tolerance.

Meeting Notes

(to be filled)