The distilled versions of DeepSeek are not as good as the full model. They are vastly inferior, and other models outperform them handily. Running the full model with a 16K-or-greater context window is possible for about $2,000, at about 4 tokens per second, on the machine below.
Machine Specs
AMD EPYC 7702
512GB DDR4-2400 ECC in 32GB 2×4 DIMMs
Gigabyte MZ32-AR0 Single Socket Mobo
Typical storage is a 4TB mirror of U.2 NVMe drives, but that is pulled right now for the storage redo, leaving the 512GB NVMe boot mirror pair.
100GbE Mellanox ConnectX-4
Proxmox 8.3.3
4× MSI Ventus RTX 3090 GPUs (no SLI)
Corsair 1500W PSU
Rig Frame
Corsair H170i Elite XT 420mm water cooler; works for SP3 with a bracket.
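A quick sanity check on the ~4 tokens per second figure: CPU-side token generation is memory-bandwidth bound, so the DDR4-2400 spec above sets a hard ceiling. The active-parameter count (~37B for the MoE model) and the ~4.5-bit quantization level are assumptions on my part, not figures from this build:

```python
# Back-of-envelope ceiling for CPU token generation (memory-bandwidth bound).
channels = 8                 # EPYC 7702 has 8 memory channels
transfers_per_s = 2400e6     # DDR4-2400
bytes_per_transfer = 8       # 64-bit channel width
bandwidth = channels * transfers_per_s * bytes_per_transfer   # bytes/s

# Assumptions: MoE model with ~37B active params/token at ~4.5 bits/weight.
active_params = 37e9
bytes_per_token = active_params * 4.5 / 8

ceiling = bandwidth / bytes_per_token     # theoretical tokens/s upper bound
print(f"{bandwidth/1e9:.1f} GB/s -> ceiling ~{ceiling:.1f} tok/s")
```

That gives roughly 153.6 GB/s and a ceiling around 7 tokens per second, so an observed ~4 tok/s (real-world efficiency, NUMA effects, KV-cache traffic) is plausible.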
LXC container settings. The container runs Docker, and Docker runs the Ollama/Open WebUI stack.
120 CPUs (SMT threads; recommend backing off by 8 to keep peak temps down about 4°C)
496GB RAM
unprivileged container
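The Ollama/Open WebUI stack inside the container might look like the following two `docker run` commands. This is a minimal sketch: the image names are the published Ollama and Open WebUI images, but the ports, volume names, and GPU flags are illustrative and assume the NVIDIA container toolkit is already working inside the LXC:

```shell
# Ollama server with GPU access (assumes NVIDIA container toolkit configured)
docker run -d --name ollama \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

# Open WebUI, pointed at the Ollama API on the host
docker run -d --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

With that up, the web interface is reachable on port 3000 and models are pulled through Ollama as usual.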
Upgrading the GPUs from RTX 3090s to RTX 5080s can improve the speed.
RTX 5090: 21,760 CUDA cores, 32GB GDDR7 memory, 575W TGP ($1999)
RTX 5080: 10,752 CUDA cores, 16GB GDDR7 memory, 360W TGP ($999)
RTX 5070 Ti: 8,960 CUDA cores, 16GB GDDR7 memory, 300W TGP ($799)
RTX 5070: 6,144 CUDA cores, 12GB GDDR7 memory, 250W TGP ($549)