Daily Tech Feed: From the Labs

Deep dives into foundational AI and ML research papers

30: The Megatron Problem

Every competitive frontier model going forward is sparse — a Mixture-of-Experts architecture where each token activates only a fraction of the total parameters. That decoupling of parameter count from per-token compute sounds like a free lunch. The engineering...

Show Notes

The Megatron Problem — Show Notes

DTF:FTL Episode 0030 | March 12, 2026

Every competitive frontier model going forward is sparse. Mixture-of-Experts architectures decouple parameter count from per-token compute — but training them at scale creates coupled constraints across memory, communication, and computation that dense models never had. NVIDIA's Megatron Core team published the full engineering receipt: 88 pages, 42 figures, production-tested on clusters of thousands of GPUs.
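The sparsity the episode describes comes from top-k routing: a router scores every expert for each token, but only the k highest-scoring experts actually run. A minimal sketch of that mechanism (the expert count, scores, and k below are illustrative assumptions, not values from the paper):

```python
# Minimal top-k expert routing, the mechanism behind MoE sparsity.
# All sizes and scores here are hypothetical, for illustration only.
import math

def top_k_route(logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = [math.exp(l) for l in logits]
    z = sum(probs)
    probs = [p / z for p in probs]           # softmax over all experts
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    w = sum(probs[i] for i in top)
    return [(i, probs[i] / w) for i in top]  # (expert_id, gate weight)

# One token's router scores over 8 experts, routed to its top 2:
print(top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2], k=2))
```

Only the selected experts' weights ever touch this token, which is why per-token compute stays flat as the expert count grows; the engineering pain the paper addresses is moving tokens to the right experts across thousands of GPUs.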


Why It Matters

MoE is not a research curiosity. DeepSeek-V3, Qwen3, Mixtral, and most frontier models in active development are sparse. The question was never whether MoE architectures were theoretically superior — the question was whether anyone could actually train them efficiently at scale. This paper answers that question with production numbers: 1,233 TFLOPS per GPU on GB300 for a 685-billion-parameter model, roughly 50 percent of theoretical hardware peak. The framework is open source. Any serious lab can now train competitive sparse models. The moat just got narrower.
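To make the parameter/compute decoupling concrete, here is a back-of-the-envelope count for a hypothetical MoE layer (all sizes below are illustrative assumptions, not figures from the paper or from any of the models named above):

```python
# Illustrative sketch of how MoE decouples total parameter count from
# per-token compute. Numbers are hypothetical, not from the paper.

def moe_param_counts(n_experts, top_k, expert_params, shared_params):
    """Total vs. per-token-active parameters for one MoE layer."""
    total = shared_params + n_experts * expert_params
    active = shared_params + top_k * expert_params  # only top_k experts run
    return total, active

# Hypothetical layer: 64 experts of 100M params each, tokens routed to the
# top 4, plus 50M params (attention, norms) shared by every token.
total, active = moe_param_counts(n_experts=64, top_k=4,
                                 expert_params=100_000_000,
                                 shared_params=50_000_000)
print(f"total: {total/1e9:.2f}B, active per token: {active/1e9:.2f}B")
# In this sketch each token touches about 7% of the layer's parameters.
```

The same arithmetic is why a model can carry hundreds of billions of parameters while paying dense-model FLOP costs per token; the memory, however, must still hold all of them, which is where the paper's parallelism work comes in.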


Primary Source

  • Paper: Scalable Training of Mixture-of-Experts Models with Megatron Core — https://arxiv.org/abs/2603.07685
  • Megatron-LM GitHub: https://github.com/NVIDIA/Megatron-LM
  • Megatron-Core (within Megatron-LM): https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core

Models Referenced

  • DeepSeek-V3 Technical Report: https://arxiv.org/abs/2412.19437
  • DeepSeek-V3 GitHub: https://github.com/deepseek-ai/DeepSeek-V3
  • Qwen3 Technical Report / Blog: https://qwenlm.github.io/blog/qwen3/
  • Qwen3 GitHub: https://github.com/QwenLM/Qwen3
  • Mixtral of Experts (MoE paper, Mistral AI): https://arxiv.org/abs/2401.04088

MoE Foundations

  • Sparsely-Gated Mixture-of-Experts (Shazeer et al., 2017): https://arxiv.org/abs/1701.06538
  • Switch Transformer (Google, 2021): https://arxiv.org/abs/2101.03961
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (Google): https://arxiv.org/abs/2112.06905
  • Expert Choice Routing (Zhou et al., 2022): https://arxiv.org/abs/2202.09368

Parallelism and Training Infrastructure

  • Megatron-LM: Training Multi-Billion Parameter Language Models (original paper): https://arxiv.org/abs/1909.08053
  • Efficient Large-Scale Language Model Training on GPU Clusters (Narayanan et al., 2021): https://arxiv.org/abs/2104.04473
  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding: https://arxiv.org/abs/2006.16668
  • FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models: https://arxiv.org/abs/2201.12023

Compute Primitives

  • FlashAttention-2 (Dao, 2023): https://arxiv.org/abs/2307.08691
  • Grouped GEMM (cutlass): https://github.com/NVIDIA/cutlass
  • NVIDIA CUDA Graphs (developer blog): https://developer.nvidia.com/blog/cuda-graphs/

Hardware

  • NVIDIA GB200 NVL72 architecture overview: https://www.nvidia.com/en-us/data-center/gb200-nvl72/
  • NVIDIA Blackwell GPU architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/

Related Reading

  • Scaling Laws for Neural Language Models (Kaplan et al.): https://arxiv.org/abs/2001.08361
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale: https://arxiv.org/abs/2201.05596

DTF:FTL — Dispatches from the edge. New episodes daily. All content AI-assisted; factual claims sourced from cited papers.