DeepSeek - Why All The Fuss?
Innovations
DeepSeek R1:
- Training with FP8 (inherited from the V3 base model; see the FP8 sketch under V3 below)
- Training without SFT (R1-Zero): pure reinforcement learning with rule-based rewards, no supervised warm-up (see the GRPO sketch after this list)
- Training data and scaling from non-human, model-generated sources
- Distillation of reasoning into small language models (SLMs) (see the distillation sketch after this list)
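R1-Zero's headline result is that reasoning can emerge from reinforcement learning alone, with simple rule-based rewards (answer correctness, output format) instead of a learned reward model or any SFT stage. Below is a minimal sketch of the two pieces at the core of that recipe: a toy rule-based reward, and GRPO's group-relative advantage, which replaces a critic with a per-prompt group baseline. The reward rules and function names are my own illustration, not DeepSeek's code.

```python
import torch

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: a small bonus for using the expected
    <answer> tags (format reward) plus a large bonus for the correct
    answer (accuracy reward). Illustrative rules, not DeepSeek's."""
    reward = 0.0
    if "<answer>" in completion and "</answer>" in completion:
        reward += 0.1  # format reward
        answer = completion.split("<answer>")[1].split("</answer>")[0].strip()
        if answer == reference_answer:
            reward += 1.0  # accuracy reward
    return reward

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO's core trick: instead of training a value network (critic),
    sample a group of G completions per prompt and normalize each
    completion's reward against the group's mean and std."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# One prompt, G = 4 sampled completions:
completions = [
    "<answer>42</answer>",
    "<answer>41</answer>",
    "no tags at all",
    "I think <answer>42</answer>",
]
rewards = torch.tensor([rule_based_reward(c, "42") for c in completions])
print(group_relative_advantages(rewards))  # positive for the correct samples
```

Because the baseline comes from sampling a group of completions per prompt, there is no value network to train at all, which is a large part of why the recipe is cheap.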
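The R1 paper also reports that this reasoning transfers to small language models through plain distillation: the teacher generates reasoning traces (the paper reports roughly 800K curated samples), and the student is supervised fine-tuned on them, with no RL on the student at all. A rough sketch of that loop using Hugging Face APIs; the student checkpoint name and the placeholder trace are illustrative, not the exact models or data DeepSeek used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Student: a small base model. The checkpoint name is an example,
# not necessarily one of the bases DeepSeek distilled into.
student_name = "Qwen/Qwen2.5-1.5B"
tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)

# Step 1 (offline): the teacher (R1) generates long reasoning traces
# for a large prompt pool.
traces = [
    "Question: What is 6 * 7?\n<think>6 * 7 = 42</think>\n<answer>42</answer>",
]  # placeholder; real traces are teacher-generated

# Step 2: ordinary supervised fine-tuning of the student on those traces
# with the next-token prediction loss -- no RL, no logit matching.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
student.train()
for text in traces:
    batch = tok(text, return_tensors="pt")
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```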
DeepSeek V3:
- FP8 training at this scale for the first time (accelerated training, reduced GPU memory usage; see the sketch after this list)
- Efficient cross-node all-to-all communication kernels that fully utilize InfiniBand (IB) and NVLink bandwidths (see the sketch after this list)
- Meticulously optimized memory footprint, making it possible to train DeepSeek-V3 without costly tensor parallelism
- Training on 14.8T high-quality and diverse tokens
- Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training (sketches of both after this list)
- SemiAnalysis estimates DeepSeek's total capex at closer to $1.6B (https://semianalysis.com/2025/01/31/deepseek-debates/)
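On FP8: the V3 paper describes this as the first validation of FP8 mixed-precision training on a model of this scale. The sketch below shows only the core numeric step, scaled quantization into the e4m3 format, which stores one byte per element instead of BF16's two; the real system adds fine-grained tile/block-wise scaling and higher-precision accumulation. This is my simplification, not DeepSeek's kernels (requires PyTorch >= 2.1).

```python
import torch

E4M3_MAX = 448.0  # largest representable magnitude in float8_e4m3fn

def quantize_fp8(x: torch.Tensor):
    """Per-tensor scaled quantization into FP8 (e4m3)."""
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)  # 1 byte per element
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor):
    return x_fp8.to(torch.float32) * scale

x = torch.randn(4096, 4096)
x_fp8, scale = quantize_fp8(x)
x_hat = dequantize_fp8(x_fp8, scale)

print(x.element_size(), x_fp8.element_size())  # 4 bytes -> 1 byte per element
print((x - x_hat).abs().mean())                # quantization error
```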
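On the communication kernels: MoE training is dominated by two all-to-all exchanges per layer, dispatching each token to the ranks hosting its experts and combining the results back. DeepSeek's contribution is hand-tuned kernels that overlap these exchanges with compute across IB and NVLink; the sketch below only demonstrates the underlying collective via torch.distributed, with equal-sized chunks for simplicity (the function name and shapes are mine).

```python
import torch
import torch.distributed as dist

def moe_dispatch(tokens: torch.Tensor) -> torch.Tensor:
    """Exchange equal-sized chunks of tokens between all ranks:
    chunk i of our input goes to rank i, and chunk i of the output
    arrives from rank i. The combine step is a second all-to-all."""
    output = torch.empty_like(tokens)
    dist.all_to_all_single(output, tokens)
    return output

# Run with: torchrun --nproc_per_node=4 this_file.py
if __name__ == "__main__":
    dist.init_process_group("nccl")  # all-to-all needs NCCL (one GPU per rank)
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)
    # Each rank holds `world` chunks of 2 tokens with hidden size 8,
    # filled with its own rank id so we can see where chunks came from.
    tokens = torch.full((world * 2, 8), float(rank), device="cuda")
    received = moe_dispatch(tokens)
    print(f"rank {rank} got chunks from:", received[::2, 0].tolist())
    dist.destroy_process_group()
```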
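On MLA: instead of caching full per-head keys and values, MLA down-projects the hidden state into a small shared latent vector, caches only that, and reconstructs K and V on the fly, shrinking the KV cache from 2 × d_model per token down to d_latent. A minimal sketch of that idea follows; it omits the paper's decoupled RoPE key path and causal masking, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLASketch(nn.Module):
    """Minimal sketch of Multi-head Latent Attention's key idea:
    compress K/V into a small shared latent and cache only the latent.
    Omits RoPE's decoupled key path and causal masking for brevity."""
    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)  # compress
        self.up_k = nn.Linear(d_latent, d_model, bias=False)     # rebuild K
        self.up_v = nn.Linear(d_latent, d_model, bias=False)     # rebuild V
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, h, latent_cache=None):
        B, T, _ = h.shape
        c_kv = self.down_kv(h)  # (B, T, d_latent) -- this is all we cache
        if latent_cache is not None:
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        q = self.q_proj(h).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.up_k(c_kv).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.up_v(c_kv).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), c_kv  # return the latent cache for the next step

mla = MLASketch()
out, cache = mla(torch.randn(2, 16, 1024))
print(cache.shape)  # (2, 16, 128): 128 floats/token cached instead of 2 * 1024
```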
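On DeepSeekMoE: the architecture uses many fine-grained routed experts plus shared experts that every token always passes through, so only a small fraction of the parameters is active per token. A toy sketch of that routing follows, with a naive per-token loop for clarity; V3's auxiliary-loss-free load balancing and expert parallelism are omitted, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

class MoESketch(nn.Module):
    """Toy top-k MoE layer in the spirit of DeepSeekMoE: many small routed
    experts plus a shared expert that is always active."""
    def __init__(self, d_model=512, d_ff=256, n_experts=16, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)       # k experts per token
        weights = weights / weights.sum(-1, keepdim=True)    # renormalize gates
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):                           # naive loop for clarity
            for w, e in zip(weights[t], idx[t]):
                routed[t] += w * self.experts[int(e)](x[t])
        return self.shared(x) + routed                       # shared expert always on

moe = MoESketch()
y = moe(torch.randn(10, 512))  # each token activates 4 of 16 routed experts
```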
Just a day after Christmas in 2024, DeepSeek released a model that caused ripples. Barely a month later, they released another one, and this time roughly $1 trillion was erased from the stock market. So what is DeepSeek, and what are these two models? Let’s get into the details. But before that, here’s a picture of me in Hangzhou.