DeepSeek - Why All The Fuss?

less than 1 minute read

Innovations

DeepSeek R1:

  • Training with FP8
  • Training without SFT (R1-Zero)
  • Training data generated and scaled from non-human sources
  • Distillation into small language models (SLMs)
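Distillation here means training a small student model to match a large teacher's output distribution. A minimal sketch of the classic soft-label (KL-divergence) objective, as an illustration only, not DeepSeek's actual recipe:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.
    A simplified stand-in for a distillation objective; DeepSeek's
    pipeline also distills from generated reasoning traces."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

When the student's logits match the teacher's, the loss is zero; any mismatch yields a positive loss, so minimizing it pulls the student toward the teacher's distribution.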

DeepSeek V3:

  • FP8 training at scale for the first time (accelerated training, reduced GPU memory usage)
  • Efficient cross-node all-to-all communication kernels that fully utilize InfiniBand (IB) and NVLink bandwidths
  • Meticulously optimized memory footprint, making it possible to train DeepSeek-V3 without costly tensor parallelism
  • Training on 14.8T high-quality and diverse tokens
  • Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training
  • A SemiAnalysis post claims capex closer to $1.6B (https://semianalysis.com/2025/01/31/deepseek-debates/)
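The core idea behind FP8 training is storing tensors in an 8-bit floating-point format (E4M3, whose largest finite value is 448) together with a scale factor. A toy sketch of symmetric per-tensor scaling, deliberately simplified: real E4M3 rounds onto a non-uniform floating-point grid, and DeepSeek-V3's kernels use finer-grained (block-wise) scaling.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def quantize_sketch(values):
    """Scale a list of floats into FP8's representable range and round.
    Rounding to integers here is a simplification of true E4M3 rounding."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / FP8_E4M3_MAX
    return [round(v / scale) for v in values], scale

def dequantize_sketch(quants, scale):
    """Recover approximate full-precision values from quantized ones."""
    return [q * scale for q in quants]
```

The per-tensor maximum fixes the scale, so after dequantizing, each value is off by at most half a quantization step, which is the trade that buys halved memory and faster matrix multiplies.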

Just a day after Christmas in 2024, DeepSeek released a model that caused ripples. Barely a month later, they released another model; this time, the model erased $1 trillion from the stock market. So what is DeepSeek, and what are these two models? Let's get into the details. But before that, here's a picture of me in Hangzhou.

