Taming the Beast: Running DeepSeek V3-0324 Locally
After hearing good things about this model, I tried to get it running locally, and succeeded.
For those unaware, DeepSeek V3-0324 is an LLM developed by the Chinese company DeepSeek.
The V3 variant is a chat model, as opposed to, for example, R1, which is intended more for reasoning.
The GGUF Q8 version of this model is around 700 GB in size, much larger than the 120B (Q8) models that I have been running up to now. However, DeepSeek is a MoE (Mixture of Experts) model with only 37 billion active parameters, meaning it can be run at reasonable speeds on systems without enough VRAM to hold it.
Crudely speaking, MoEs trade space for compute: they tend to be much larger than “dense” models, but if you can fit them into RAM, they run much faster.
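To make that trade-off concrete, here is a rough back-of-the-envelope sketch in Python, using DeepSeek V3's published parameter counts (671B total, 37B active per token) and an assumed one byte per weight for Q8; actual GGUF file sizes will differ a little:

```python
# Back-of-the-envelope numbers for the space-vs-compute trade-off.
# Parameter counts are DeepSeek V3's published figures; the bytes-per-weight
# value is an approximation for a Q8 quant, not an exact GGUF size.

total_params = 671e9    # total parameters across all experts
active_params = 37e9    # parameters activated per token

bytes_per_weight_q8 = 1.0   # roughly 8 bits per weight for a Q8 quant

total_size_gb = total_params * bytes_per_weight_q8 / 1e9
active_size_gb = active_params * bytes_per_weight_q8 / 1e9

print(f"Weights to hold in memory: ~{total_size_gb:.0f} GB")   # ~671 GB
print(f"Weights touched per token: ~{active_size_gb:.0f} GB")  # ~37 GB
```

You still have to hold the whole thing somewhere, but each token only exercises a small slice of it, which is why spilling into system RAM remains tolerable.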
My AI rig currently has an RTX A6000 (Ampere) and two RTX 3090s, for a total of 96 GB of VRAM, plus 256 GB of DDR4 RAM in 8-channel mode and a Threadripper 3975WX CPU.
This gives me a total of 352 GB of combined VRAM and system RAM, which is just enough to run the Q3_K_XL quant of this model.
I am running Unsloth's GGUF quants, available from HERE.
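As a rough sanity check that this quant fits (assuming an effective ~3.5 bits per weight for the Q3_K_XL mix, which is only an approximation):

```python
# Very rough fit check for the Q3_K_XL quant. The effective bits-per-weight
# figure is an assumption; the actual Unsloth GGUF will differ in size.

total_params = 671e9
assumed_bits_per_weight = 3.5          # rough average for a Q3_K_XL mix

weights_gb = total_params * assumed_bits_per_weight / 8 / 1e9
available_gb = 96 + 256                # VRAM + system RAM

print(f"Estimated weights: ~{weights_gb:.0f} GB")    # ~294 GB
print(f"Available memory:   {available_gb} GB")      # 352 GB
```

That leaves only a few tens of GB for the KV cache, the OS and everything else, which explains a lot of what follows.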
Q3 is not a particularly high quant, but for a model this large it should still be usable.
What follows are my experiences with setting up and evaluating the model.
Firstly, I ensured that both KoboldCPP and SillyTavern were updated to their latest releases (1.93 for KoboldCPP and 1.13 for SillyTavern). A recent version of KoboldCPP, in particular, is needed to load MoE models like this one.
The model itself is loaded in the same way as any GGUF model.
The Unsloth GGUF that I downloaded has 62 layers; I was able to offload 17 of them to the GPUs' VRAM, using a tensor split of 0.9,2.0,1.0.
Please note that in later versions of KoboldCPP, the GPU order has changed to follow PCIe bus ID, which, for me, was different from the previous load order (please see the issue I opened HERE on GitHub). This meant that I had to change the order of the tensor split.
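For intuition, this is roughly how I understand the split ratio to map the 17 offloaded layers onto the three cards; it assumes a simple proportional assignment, which is a simplification of what KoboldCPP actually does internally:

```python
# Sketch of how a tensor split divides the offloaded layers across GPUs.
# This assumes layers are assigned roughly in proportion to the split values,
# which is a simplification of what KoboldCPP/llama.cpp actually does.

offloaded_layers = 17
split = [0.9, 2.0, 1.0]    # listed in PCIe bus ID order on newer builds

total = sum(split)
shares = [offloaded_layers * s / total for s in split]
print([round(s, 1) for s in shares])   # -> [3.9, 8.7, 4.4] layers, roughly
```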
Once I had reordered the split, the model loaded successfully and KoboldCPP started, but it failed with a “CUDA error: out of memory” whenever I started inference. This was caused, I believe, by the context size being too large.
I don't know whether MoE models in general require more memory for context, or just DeepSeek.
After several days of tinkering, I managed to come up with some settings that worked well for my system.
The main issue for me was the “BLAS Batch Size” setting under the Hardware tab in KoboldCPP. This was set to 512; I changed it to 64.
The BLAS batch size has a huge impact on the speed of prompt processing, and this makes an even bigger difference for MoE models due to how they work. A batch size of 64 makes for VERY slow prompt processing, but once prompt processing is done, generation is very fast.
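As a crude illustration of why the batch size matters so much for long prompts (this just counts batches; it does not model the per-batch cost, which is where the MoE expert-weight overhead really shows up):

```python
import math

# Counting prompt-processing batches at different BLAS batch sizes. This
# only counts passes; the cost per pass (including streaming the expert
# weights for an MoE) is what actually dominates the wall-clock time.

prompt_tokens = 32768
for blas_batch in (512, 256, 128, 64):
    passes = math.ceil(prompt_tokens / blas_batch)
    print(f"batch {blas_batch:>3}: {passes:>4} forward passes")
```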
I wanted to have at least 32k of context, since this is the minimum that I am used to dealing with. I believe DeepSeek supports up to 128k of context, which is huge, but I haven't verified this.
I also had to set “Quantize KV Cache” to “8-Bit” (the default was FP16) to prevent OOM errors.
I set the context to 32k in SillyTavern as well, to match KoboldCPP.
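For intuition about why this helps, here is a generic KV-cache size estimate; the layer count and per-token cache width below are placeholder assumptions (DeepSeek's attention compresses its cache, and I haven't verified how KoboldCPP stores it), so only the FP16-to-8-bit halving should be taken literally:

```python
# Generic KV-cache estimate. The layer count and per-token cache width are
# placeholder assumptions, not verified figures for DeepSeek V3 under
# KoboldCPP; the takeaway is simply that 8-bit halves the FP16 footprint.

n_layers      = 61        # assumed transformer layer count
cache_per_tok = 1024      # assumed cached values per layer per token
context       = 32768

for name, bytes_per_val in (("FP16", 2), ("8-Bit", 1)):
    size_gb = n_layers * cache_per_tok * context * bytes_per_val / 1e9
    print(f"{name}: ~{size_gb:.1f} GB KV cache at {context} tokens")
```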
My settings now look like this:
| KoboldCPP v1.93 Setting (for DeepSeek) | Value |
| --- | --- |
| GPU Layers | 17/62 |
| Context Size | 32768 |
| Use ContextShift | On |
| Use FlashAttention | On |
| Tensor Split | 0.9,2.0,1.0 |
| BLAS Batch Size | 64 |
| Quantize KV Cache | 8-Bit |
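For anyone who prefers launching from the command line rather than the GUI, the same configuration should look something like the sketch below. The flag names and the 0/1/2 KV-cache codes are my understanding of recent KoboldCPP builds, and the model path is a placeholder, so double-check against `--help` before relying on it:

```python
# Launching KoboldCPP with (approximately) the settings from the table
# above, as a sketch. The flag names are my reading of recent KoboldCPP
# releases -- verify them with `python koboldcpp.py --help` first.
import subprocess

cmd = [
    "python", "koboldcpp.py",
    "--model", "path/to/DeepSeek-V3-0324-Q3_K_XL.gguf",  # placeholder path
    "--usecublas",                 # CUDA backend for the three GPUs
    "--gpulayers", "17",
    "--contextsize", "32768",
    "--tensor_split", "0.9", "2.0", "1.0",
    "--blasbatchsize", "64",
    "--flashattention",
    "--quantkv", "1",              # 0 = FP16, 1 = 8-bit, 2 = 4-bit
]
subprocess.run(cmd, check=True)
```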
As a result of all of this, I have managed to get DeepSeek V3-0324 Q3_K_XL running on my system with 32k context.
Almost all of my VRAM and System RAM is used up with these settings!
Prompt processing is very slow with the batch size of 64, and I am currently looking into reducing the number of layers offloaded to the GPUs to free up memory, so that I can increase the batch size to at least 128.
However, the token generation is much faster!
With context set to 2k I was getting 8.51 tokens per second!
2k is obviously too low to be useful, and as I increased the context size and the context filled up, token generation rates dropped to about 4-5 tokens per second. At 32k with some of the context filled, I am getting around 1.5 tokens per second, and this includes the slow prompt processing times.
This is not ideal, but it is still faster than what I would get with a dense model.
So, how good is DeepSeek compared to the 120B dense models that I have been using?
Subjectively, the quality of the writing seems far superior. The model seems to follow the prompt better, and characters in creative writing exercises have a lot more personality and feel more unique than with other models.
Even though prompt processing is slower with my settings, token generation is much faster than I get with the dense models.
I am limited by my available system RAM at the moment (I should have used a Q2 quant, not a Q3 one), and I am considering a RAM upgrade to 512 GB.
This presents an interesting question: 512 GB of DDR4 RAM (at 3200 MHz) would cost about the same as another power supply and another 3090 for my current rig, so which would be better?
For running dense models, the extra 3090 would be much more useful. With 120 GB of VRAM I could run 120B models at Q8 almost entirely in VRAM (maybe with a little spilling over into system RAM at high context sizes) at very high speeds. However, for running MoE models, I would be far better off with the 512 GB of system RAM: I could probably run a Q5 of DeepSeek, with much higher context and without quantising the KV cache, and I could use the GPUs just for prompt processing, further increasing speed.
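A ballpark comparison of the two options (the bits-per-weight figures are rough assumptions, and I am ignoring KV-cache and context overhead):

```python
# Ballpark comparison of the two upgrade paths. Bits-per-weight values are
# rough averages for the quant formats, and KV-cache/context overhead is
# ignored, so these are order-of-magnitude figures only.

def model_size_gb(params_billions, bits_per_weight):
    return params_billions * bits_per_weight / 8

# Option 1: third 3090 -> 120 GB VRAM (plus the existing 256 GB RAM)
dense_120b_q8 = model_size_gb(120, 8.5)    # ~128 GB, mostly fits in VRAM

# Option 2: 512 GB RAM upgrade -> 608 GB total with the existing 96 GB VRAM
deepseek_q5 = model_size_gb(671, 5.5)      # ~461 GB, fits with room to spare

print(f"120B dense at ~Q8: ~{dense_120b_q8:.0f} GB")
print(f"DeepSeek at ~Q5:   ~{deepseek_q5:.0f} GB")
```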
So, the question is: are MoE models generally better than dense models?
I don’t think there is a hard and fast answer to this.
Some argue that MoE models are inferior to dense models in terms of quality, and that they are only popular because they can be run on cheaper hardware (system RAM and CPU compute are a lot cheaper than VRAM); on this view, if you can run a large dense model, that is the better option.
Others might argue that dense models have hit, or will hit, a wall, and that Mixture of Experts models are a necessary step to keep improving model quality.
I have come across a formula on Reddit that seems to provide a rough estimate of the performance of an MoE model compared to a dense model. It is:
sqrt(active parameters * total parameters)
This would seem to indicate that DeepSeek, with its 671 billion total parameters and 37 billion active, would be roughly equivalent to a 160B dense model. Based on the few days that I have spent working with it, this seems accurate.
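Plugging the numbers in:

```python
from math import sqrt

# The Reddit rule of thumb applied to DeepSeek V3
# (parameter counts in billions).
active_b, total_b = 37, 671
print(f"~{sqrt(active_b * total_b):.0f}B dense-equivalent")   # ~158B
```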
The model feels powerful, and, subjectively, it feels superior to 120B models, but not six times better.
Time will tell whether MoE-type models become the future or not, but even if they dominate the high-end LLM space, dense models (including distills of MoEs) will likely still have a place in the sub-70B space for a long time to come.