Some believe DeepSeek is so efficient that we no longer need more compute, and that its model improvements have left the industry with massive overcapacity. Jevons Paradox is closer to reality: the efficiency gains have increased demand, which has already pushed up H100 and H200 pricing.
DeepSeek and High-Flyer have roughly 50,000 GPUs: a mix of H20s, H800s, A100s and H100s. DeepSeek has about $1.3 billion in AI servers.
DeepSeek has hired in China based on capability and curiosity, recruiting from top universities like PKU and Zhejiang. Hires are given flexibility and access to tens of thousands of GPUs with no usage limits. DeepSeek offers salaries of over $1.3 million USD for promising candidates, more than competing big Chinese tech companies and AI labs like Moonshot. They have ~150 employees and are growing fast.
The $6M pre-training number is nowhere near the actual total amount spent on the model.
Multi-Head Latent Attention (MLA), a key DeepSeek breakthrough, took several months to develop and consumed an entire team's man-hours and substantial GPU hours.
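The core idea can be shown in a minimal sketch (the dimensions and layer names below are illustrative assumptions, not DeepSeek's actual configuration): instead of caching full per-head keys and values, each token is compressed into a small shared latent vector, only that latent is cached, and keys and values are projected back up from it.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Minimal sketch of the low-rank KV-compression idea behind
    Multi-Head Latent Attention. Sizes are toy values for illustration."""

    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        # Compress each token to a small latent; only this is cached.
        self.w_down = nn.Linear(d_model, d_latent)
        # Expand the cached latent back into per-head keys and values.
        self.w_k_up = nn.Linear(d_latent, d_model)
        self.w_v_up = nn.Linear(d_latent, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, t, _ = x.shape
        latent = self.w_down(x)  # (b, t, d_latent) -> this is the KV cache
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_k_up(latent).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_v_up(latent).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.w_o(out.transpose(1, 2).reshape(b, t, -1))
```

In this toy setup, caching a 128-dim latent per token instead of 2×1024 dims of keys and values shrinks the KV cache by roughly 16x, which is where the inference savings come from.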
DeepSeek V3 beats the performance of OpenAI's GPT-4o, which was released in May 2024.
DeepSeek R1 matches OpenAI o1. DeepSeek used the OpenAI o1 API to quickly train on its toughest questions and correct answers. It was easier to catch up on the newer AI reasoning models. However, the richer companies with more resources will scale reasoning even further, and there are many more gains to be had from reasoning improvements and test-time compute.
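Mechanically, that kind of API distillation is simple. Here is a hedged sketch (the prompts and file name are illustrative, and this is not a claim about DeepSeek's actual pipeline): query the stronger model on hard questions, then save the prompt/answer pairs as supervised fine-tuning data for the student model.

```python
import json
from openai import OpenAI  # official OpenAI Python client

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative hard prompts; a real pipeline would harvest thousands.
hard_questions = [
    "Prove that sqrt(2) is irrational.",
    "How many primes are there below 100, and how do you know?",
]

with open("distill_data.jsonl", "w") as f:
    for q in hard_questions:
        resp = client.chat.completions.create(
            model="o1",  # the stronger "teacher" reasoning model
            messages=[{"role": "user", "content": q}],
        )
        answer = resp.choices[0].message.content
        # Save prompt/answer pairs as supervised targets for the
        # smaller "student" model's fine-tuning run.
        f.write(json.dumps({"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": answer},
        ]}) + "\n")
```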
Dario Amodei, CEO of Anthropic, says that algorithmic advancements are even faster and can yield a 10x improvement. For GPT-3-quality inference, prices have fallen 1200x. The DeepSeek difference was that a Chinese company made the AI cost improvement.
AI costs will likely fall another 5x by the end of 2025. The cost leader could be any one of several competitors.
Google’s Gemini 2.0 Flash Thinking is considerably cheaper than DeepSeek R1.
DeepSeek's improvements will be copied by Western labs almost immediately.
DeepSeek V3 uses Multi-Token Prediction (MTP) at a scale not seen before. Added prediction modules predict the next few tokens instead of a single token. This improves model performance during training, and the extra modules can be discarded during inference. This algorithmic innovation delivered better performance with less compute.
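A toy sketch of the idea (a simplification: DeepSeek's actual MTP modules are full sequential transformer blocks, and the names here are illustrative): extra heads are trained to predict tokens two or more positions ahead, their losses are added to the usual next-token loss, and the extra heads are simply dropped at inference.

```python
import torch.nn as nn
import torch.nn.functional as F

class MTPHead(nn.Module):
    """Toy multi-token-prediction wrapper: one extra head per future
    offset. Only main_head is needed (or kept) at inference time."""

    def __init__(self, d_model, vocab_size, n_future=2):
        super().__init__()
        self.main_head = nn.Linear(d_model, vocab_size)
        self.future_heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def loss(self, hidden, targets):
        # hidden: (batch, seq, d_model); targets: (batch, seq) token ids
        total = F.cross_entropy(
            self.main_head(hidden[:, :-1]).flatten(0, 1),
            targets[:, 1:].flatten(),
        )
        for k, head in enumerate(self.future_heads, start=2):
            # Head k predicts the token k positions ahead; its loss is
            # an auxiliary training signal on top of next-token loss.
            logits = head(hidden[:, :-k]).flatten(0, 1)
            total = total + F.cross_entropy(logits, targets[:, k:].flatten())
        return total
```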
DeepSeek V3 is a mixture-of-experts (MoE) model: one large model made up of many smaller models that specialize in different things.
DeepSeek used a gating network that routes tokens to the right experts in a balanced way without hurting model performance, as the sketch below illustrates.
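A minimal top-k gated MoE layer looks like this (sizes and the bias heuristic are illustrative assumptions; DeepSeek V3's auxiliary-loss-free balancing is more refined than this sketch):

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal mixture-of-experts layer with top-k gating. The per-expert
    bias nudges routing toward underused experts, a simplified stand-in
    for the auxiliary-loss-free balancing DeepSeek describes."""

    def __init__(self, d_model=512, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)
        # Routing-only bias; a training loop would lower it for
        # over-used experts and raise it for under-used ones.
        self.register_buffer("load_bias", torch.zeros(n_experts))
        self.top_k = top_k

    def forward(self, x):
        flat = x.reshape(-1, x.shape[-1])            # (tokens, d_model)
        scores = self.gate(flat)                     # (tokens, n_experts)
        # The bias affects which experts are picked, not mixing weights.
        topk = (scores + self.load_bias).topk(self.top_k, dim=-1).indices
        weights = torch.softmax(scores.gather(-1, topk), dim=-1)
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (topk == e).nonzero(as_tuple=True)
            if token_idx.numel():
                out[token_idx] += weights[token_idx, slot].unsqueeze(-1) \
                    * expert(flat[token_idx])
        return out.view_as(x)
```

Only top_k of the experts run for each token, so active compute per token is a small fraction of total parameters; DeepSeek V3 activates roughly 37B of its 671B parameters per token.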
DeepSeek was not allowed, under OpenAI's terms of service, to use the o1 API to train another model. KYC (Know Your Customer) checks and other means will be used to stop distillation training.
Analysis: Which Big Tech Companies Win?
DeepSeek has more innovations to release and is a significant competitor, especially with strong backing from the Chinese government and Chinese banks.
Estimating share of the future AI market: winners will need complete application offerings and the ability to win the trust and loyalty of a large set of customers.
This can be estimated by asking what you will have open or around you, and which AI will be used from that device or system. Is it in your bot, your glasses, or your phone? Or your car, your home system, or your social media?
What is open, and which company do you have the relationship with? If you are on X all the time, it would be Grok. Where is your Jiminy Cricket? Who is the confidant and advisor you are always talking to? Maybe it will be your Neuralink? Which company will you trust?
Amazon could be a winner by making many roughly equal, very good models available on AWS.