Training Frameworks: ZeRO, FSDP, and the Memory Math That Gets You Hired
A practitioner's guide to distributed training frameworks: the memory formulas, parallelism strategies, and failure-mode reasoning that ML infra interviews actually test. Covers DDP, FSDP, DeepSpeed ZeRO, 3D parallelism, and fault tolerance.