Robust LLM Training Infrastructure at ByteDance

Automated failure diagnosis and recovery for large-scale LLM training — minimize unproductive time across 16K+ GPUs


Venue: SOSP 2025

Topic: Large-scale LLM training is inherently unstable — failures are frequent and hard to detect. Diagnosis and recovery currently take hours to days. This paper presents an automated framework that minimizes unproductive training time.


Summary

Training large language models at 16K+ GPU scale is routinely interrupted by failures. Current practice relies on timeout-based detection, manual diagnosis, and full-job rescheduling, which wastes hours to days of GPU time. Three challenges make this hard: implicit failures (hard to detect), ultra-large scale (not enough spares for naive recovery), and continuously evolving user code (code changes interact with failure patterns).

Key insight: prioritize rapid isolation over precise root-cause localization. Pinpointing the exact faulty machine is slow; excluding every suspected machine and resuming the job right away is faster.
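To make the isolation-first idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Cluster` type, the `isolate_and_resume` function, and the machine names are all hypothetical, and the sketch assumes a suspect list is already available from some monitoring layer.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical view of a training job: active machines plus warm spares."""
    active: set[str]
    spares: list[str] = field(default_factory=list)


def isolate_and_resume(cluster: Cluster, suspects: list[str]) -> set[str]:
    """Evict every suspected machine at once, then backfill from spares.

    No attempt is made to find the single faulty machine; a healthy suspect
    may be evicted too, which is the cost traded for a fast restart.
    """
    evicted = cluster.active & set(suspects)
    cluster.active -= evicted
    # Backfill one spare per evicted machine while spares last.
    for _ in range(len(evicted)):
        if not cluster.spares:
            break
        cluster.active.add(cluster.spares.pop())
    return evicted


cluster = Cluster(active={"m0", "m1", "m2", "m3"}, spares=["s0", "s1"])
evicted = isolate_and_resume(cluster, ["m1", "m2"])
```

The evicted machines can be diagnosed offline while the job keeps training, so diagnosis time no longer counts as unproductive training time.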


Background

Why large-scale LLM training is unstable

  1. Implicit failures: silent corruption, hangs — hard to detect without active monitoring.
  2. Ultra-large scale: 16K+ GPUs → not enough spare machines to simply replace any failed component.
  3. Months-long training runs: user code evolves continuously → failures interact with code updates.
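As an illustration of what active monitoring for hangs could look like, the following is a minimal sketch of a per-rank progress watchdog. It is an assumption for illustration, not the mechanism described in the paper: the `ProgressWatchdog` class, its methods, and the 30-second threshold are all hypothetical. It only covers hangs; silent corruption needs a different kind of check.

```python
import time


class ProgressWatchdog:
    """Flags ranks whose training step counter has stopped advancing."""

    def __init__(self, stall_seconds, clock=time.monotonic):
        self.stall_seconds = stall_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_progress = {}  # rank -> (latest step, time that step was first seen)

    def report(self, rank, step):
        """Record a heartbeat; only a strictly larger step counts as progress."""
        prev = self.last_progress.get(rank)
        if prev is None or step > prev[0]:
            self.last_progress[rank] = (step, self.clock())

    def stalled_ranks(self):
        """Return ranks with no progress for longer than stall_seconds."""
        now = self.clock()
        return sorted(
            rank
            for rank, (_, seen_at) in self.last_progress.items()
            if now - seen_at > self.stall_seconds
        )


# Usage with a fake clock: rank 1 stops advancing after step 1.
fake_now = [0.0]
watchdog = ProgressWatchdog(stall_seconds=30, clock=lambda: fake_now[0])
watchdog.report(rank=0, step=1)
watchdog.report(rank=1, step=1)
fake_now[0] = 20.0
watchdog.report(rank=0, step=2)
fake_now[0] = 40.0
stalled = watchdog.stalled_ranks()
```

Unlike a single job-level timeout, a per-rank check like this names the ranks that stopped making progress, which gives a suspect list to act on.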

Current failure handling


Key Idea

Three principles

1. Rapid isolation, not precise localization

2. Control variability during recovery

3. Controlled and rapid recovery


Design

Architecture


Interesting Points


Assessment

Strengths:

Weaknesses:


Meeting Notes

(to be filled)