DeepSeek - Why All The Fuss?
Innovations
DeepSeek R1:
- Training with FP8 (inherited from the V3 base model; see the FP8 sketch under V3 below)
- Training without SFT (R1-Zero): pure reinforcement learning with rule-based rewards, no supervised warm-up (see the GRPO sketch after this list)
- Training data and scaling from non-human, model-generated sources
- Distillation of reasoning into small language models (SLMs) (see the distillation sketch after this list)
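R1-Zero's headline result is that reasoning can emerge from reinforcement learning alone, with simple rule-based rewards (answer correctness, output format) instead of a learned reward model or any SFT stage. Below is a minimal sketch of the two pieces at the core of that recipe: a toy rule-based reward, and GRPO's group-relative advantage, which replaces a critic with a per-prompt group baseline. The reward rules and function names are my own illustration, not DeepSeek's code.

```python
import torch

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: a small bonus for using the expected
    <answer> tags (format reward) plus a large bonus for the correct
    answer (accuracy reward). Illustrative rules, not DeepSeek's."""
    reward = 0.0
    if "<answer>" in completion and "</answer>" in completion:
        reward += 0.1  # format reward
        answer = completion.split("<answer>")[1].split("</answer>")[0].strip()
        if answer == reference_answer:
            reward += 1.0  # accuracy reward
    return reward

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO's core trick: instead of training a value network (critic),
    sample a group of G completions per prompt and normalize each
    completion's reward against the group's mean and std."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# One prompt, G = 4 sampled completions:
completions = [
    "<answer>42</answer>",
    "<answer>41</answer>",
    "no tags at all",
    "I think <answer>42</answer>",
]
rewards = torch.tensor([rule_based_reward(c, "42") for c in completions])
print(group_relative_advantages(rewards))  # positive for the correct samples
```

Because the baseline comes from sampling a group of completions per prompt, there is no value network to train at all, which is a large part of why the recipe is cheap.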
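The R1 paper also reports that this reasoning transfers to small language models through plain distillation: the teacher generates reasoning traces (the paper reports roughly 800K curated samples), and the student is supervised fine-tuned on them, with no RL on the student at all. A rough sketch of that loop using Hugging Face APIs; the student checkpoint name and the placeholder trace are illustrative, not the exact models or data DeepSeek used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Student: a small base model. The checkpoint name is an example,
# not necessarily one of the bases DeepSeek distilled into.
student_name = "Qwen/Qwen2.5-1.5B"
tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)

# Step 1 (offline): the teacher (R1) generates long reasoning traces
# for a large prompt pool.
traces = [
    "Question: What is 6 * 7?\n<think>6 * 7 = 42</think>\n<answer>42</answer>",
]  # placeholder; real traces are teacher-generated

# Step 2: ordinary supervised fine-tuning of the student on those traces
# with the next-token prediction loss -- no RL, no logit matching.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
student.train()
for text in traces:
    batch = tok(text, return_tensors="pt")
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```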
DeepSeek V3:
- FP8 training at this scale for the first time (accelerated training, reduced GPU memory usage; see the sketch after this list)
- Efficient cross-node all-to-all communication kernels that fully utilize InfiniBand (IB) and NVLink bandwidths (see the sketch after this list)
- Meticulously optimized memory footprint, making it possible to train DeepSeek-V3 without costly tensor parallelism
- Training on 14.8T high-quality and diverse tokens
- Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training (sketches of both after this list)
- SemiAnalysis estimates DeepSeek's total capex at closer to $1.6B (https://semianalysis.com/2025/01/31/deepseek-debates/)
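On FP8: the V3 paper describes this as the first validation of FP8 mixed-precision training on a model of this scale. The sketch below shows only the core numeric step, scaled quantization into the e4m3 format, which stores one byte per element instead of BF16's two; the real system adds fine-grained tile/block-wise scaling and higher-precision accumulation. This is my simplification, not DeepSeek's kernels (requires PyTorch >= 2.1).

```python
import torch

E4M3_MAX = 448.0  # largest representable magnitude in float8_e4m3fn

def quantize_fp8(x: torch.Tensor):
    """Per-tensor scaled quantization into FP8 (e4m3)."""
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)  # 1 byte per element
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor):
    return x_fp8.to(torch.float32) * scale

x = torch.randn(4096, 4096)
x_fp8, scale = quantize_fp8(x)
x_hat = dequantize_fp8(x_fp8, scale)

print(x.element_size(), x_fp8.element_size())  # 4 bytes -> 1 byte per element
print((x - x_hat).abs().mean())                # quantization error
```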
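On the communication kernels: MoE training is dominated by two all-to-all exchanges per layer, dispatching each token to the ranks hosting its experts and combining the results back. DeepSeek's contribution is hand-tuned kernels that overlap these exchanges with compute across IB and NVLink; the sketch below only demonstrates the underlying collective via torch.distributed, with equal-sized chunks for simplicity (the function name and shapes are mine).

```python
import torch
import torch.distributed as dist

def moe_dispatch(tokens: torch.Tensor) -> torch.Tensor:
    """Exchange equal-sized chunks of tokens between all ranks:
    chunk i of our input goes to rank i, and chunk i of the output
    arrives from rank i. The combine step is a second all-to-all."""
    output = torch.empty_like(tokens)
    dist.all_to_all_single(output, tokens)
    return output

# Run with: torchrun --nproc_per_node=4 this_file.py
if __name__ == "__main__":
    dist.init_process_group("nccl")  # all-to-all needs NCCL (one GPU per rank)
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)
    # Each rank holds `world` chunks of 2 tokens with hidden size 8,
    # filled with its own rank id so we can see where chunks came from.
    tokens = torch.full((world * 2, 8), float(rank), device="cuda")
    received = moe_dispatch(tokens)
    print(f"rank {rank} got chunks from:", received[::2, 0].tolist())
    dist.destroy_process_group()
```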
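On MLA: instead of caching full per-head keys and values, MLA down-projects the hidden state into a small shared latent vector, caches only that, and reconstructs K and V on the fly, shrinking the KV cache from 2 × d_model per token down to d_latent. A minimal sketch of that idea follows; it omits the paper's decoupled RoPE key path and causal masking, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLASketch(nn.Module):
    """Minimal sketch of Multi-head Latent Attention's key idea:
    compress K/V into a small shared latent and cache only the latent.
    Omits RoPE's decoupled key path and causal masking for brevity."""
    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)  # compress
        self.up_k = nn.Linear(d_latent, d_model, bias=False)     # rebuild K
        self.up_v = nn.Linear(d_latent, d_model, bias=False)     # rebuild V
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, h, latent_cache=None):
        B, T, _ = h.shape
        c_kv = self.down_kv(h)  # (B, T, d_latent) -- this is all we cache
        if latent_cache is not None:
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        q = self.q_proj(h).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.up_k(c_kv).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.up_v(c_kv).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), c_kv  # return the latent cache for the next step

mla = MLASketch()
out, cache = mla(torch.randn(2, 16, 1024))
print(cache.shape)  # (2, 16, 128): 128 floats/token cached instead of 2 * 1024
```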
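On DeepSeekMoE: the architecture uses many fine-grained routed experts plus shared experts that every token always passes through, so only a small fraction of the parameters is active per token. A toy sketch of that routing follows, with a naive per-token loop for clarity; V3's auxiliary-loss-free load balancing and expert parallelism are omitted, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

class MoESketch(nn.Module):
    """Toy top-k MoE layer in the spirit of DeepSeekMoE: many small routed
    experts plus a shared expert that is always active."""
    def __init__(self, d_model=512, d_ff=256, n_experts=16, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)       # k experts per token
        weights = weights / weights.sum(-1, keepdim=True)    # renormalize gates
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):                           # naive loop for clarity
            for w, e in zip(weights[t], idx[t]):
                routed[t] += w * self.experts[int(e)](x[t])
        return self.shared(x) + routed                       # shared expert always on

moe = MoESketch()
y = moe(torch.randn(10, 512))  # each token activates 4 of 16 routed experts
```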
Just a day after Christmas in 2024, DeepSeek released a model that caused ripples. Barely a month later, they released another one, and this time roughly $1 trillion was erased from the stock market. So what is DeepSeek, and what are these two models? Let’s get into the details. But before that, here’s a picture of me in Hangzhou.