AI: Assembly and setup of AI Inference Rig
I have finally finished building my AI inference rig!
The system specs are:
Threadripper PRO 3975WX CPU,
2x ASUS TUF 3090 GPUs (these have two PCIe power connectors instead of three, which is important!),
256 GB of DDR4-3200 RAM,
Corsair 1500 W PSU.
I had quite a few more issues than I was expecting when setting up the AI rig.
The main lesson I learned is that the Threadripper PRO CPUs are very sensitive to mounting pressure!
After I had assembled the system, it failed to boot, and it seemed like the issues were CPU-related.
After reseating the CPU and reapplying the thermal paste, everything seemed fine, except for one RAM stick not showing up, which I fixed by simply reseating it too.
Other than the CPU issues, the build went more or less ok.
I was planning to use Linux Mint for the software side, but I had issues getting this to work (mostly driver and CUDA issues).
I eventually went with Pop!_OS (another Ubuntu-based Linux distro) and this worked a lot better. I prefer the interface of Mint (it is more similar to Windows), but Pop!_OS seems to have better NVIDIA/CUDA support, which I need for this build.
I spent some time setting up the software side of the system.
Eventually, I got the CUDA 12 version of KoboldCPP working properly. I had to update some drivers and the CUDA toolkit for this to work, since Pop!_OS comes with 11.5 by default.
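As a rough illustration (not the exact steps I followed), a quick way to confirm which driver and CUDA toolkit versions are actually installed is something like the small Python helper below; it only assumes that nvidia-smi and nvcc are on the PATH.

```python
# Quick sanity check that the NVIDIA driver and CUDA toolkit versions line up.
# Illustrative helper only -- not part of the KoboldCPP setup itself.
import subprocess

def run(cmd):
    """Run a command and return its stdout, or None if the binary is missing."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None

driver = run(["nvidia-smi", "--query-gpu=driver_version,name,memory.total", "--format=csv"])
toolkit = run(["nvcc", "--version"])

print("Driver / GPUs:\n", driver or "nvidia-smi not found")
print("CUDA toolkit:\n", toolkit or "nvcc not found (toolkit not installed or not on PATH)")
```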
I had no issues enabling network access for Kobold, but I did have some problems customising RDP/xrdp for remote admin of the server. This took some time to fix.
I also installed Stable Diffusion on the system, but I will probably use it in CPU-only mode, to save VRAM for LLM inference.
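For reference, CPU-only generation with the Hugging Face diffusers library (one possible backend, not necessarily the one I will settle on) looks roughly like the sketch below; the model ID is just an example.

```python
# Minimal sketch of CPU-only Stable Diffusion using the diffusers library.
# Keeping the pipeline on the CPU leaves both 3090s free for LLM inference.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # example model id
    torch_dtype=torch.float32,          # CPU inference wants fp32, not fp16
)
pipe = pipe.to("cpu")                   # never touches GPU VRAM

image = pipe("a photo of a server rack in a home office",
             num_inference_steps=25).images[0]
image.save("test.png")
```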
I have not fully tested or optimised the system yet, but initial results are good.
First, the RAM bandwidth, which was something I was worried about, seems to be closer to 150 GB/s. This is extremely good!
Based on my post HERE, I was estimating that the memory bandwidth of the CPU would be around 140 GB/s maximum; in actual fact, it is around 146-148 GB/s, which is even higher than I had hoped.
This means it was worth getting the more expensive 3975WX CPU rather than the 3945WX (due to the memory bandwidth limitations mentioned in the post above).
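As a rough back-of-the-envelope check (my own ballpark figures, not from the post above): the 3975WX has eight channels of DDR4-3200, so the theoretical peak works out as follows, and the measured ~147 GB/s is around 72% of it.

```python
# Back-of-the-envelope memory bandwidth for the Threadripper PRO 3975WX
# (8-channel DDR4-3200). Rough figures for illustration only.
channels = 8
transfer_rate_mt_s = 3200          # DDR4-3200: 3200 mega-transfers per second
bus_width_bytes = 8                # 64-bit channel = 8 bytes per transfer

theoretical_gb_s = channels * transfer_rate_mt_s * bus_width_bytes / 1000
measured_gb_s = 147                # roughly what the system achieves in practice

print(f"Theoretical peak: {theoretical_gb_s:.1f} GB/s")      # ~204.8 GB/s
print(f"Measured:         {measured_gb_s} GB/s "
      f"({measured_gb_s / theoretical_gb_s:.0%} of peak)")   # ~72% of peak
```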
I have tested just a few models and quants so far, and the results have been decent, but not spectacular.
Before building this system, I was hoping for between 1 and 2 tk/s when running a 120b model at Q8 with full context (32k for most models).
It seems, based on very preliminary testing, that I am getting about 1.23 tk/s. This is a little slower than I was hoping for, but it is a total figure, including context and prompt processing (which were not included in the token generation estimates in the post I made above).
In practice, these values are actually quite usable, and I am satisfied with the results.
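To show roughly where the 1-2 tk/s estimate comes from, here is a very crude bandwidth-bound calculation. All numbers are my own rounded assumptions, and it ignores prompt processing and GPU time entirely.

```python
# Very rough, bandwidth-bound estimate of generation speed for a ~120b model
# at Q8 with partial GPU offload. Ballpark assumptions for illustration only.
model_size_gb = 125          # ~120B params at Q8 (~1 byte/param plus overhead)
vram_gb = 2 * 24             # two 3090s
cpu_side_gb = model_size_gb - vram_gb   # weights left in system RAM
ram_bw_gb_s = 147            # measured system RAM bandwidth

# Each generated token has to stream the CPU-resident weights from RAM once,
# so RAM bandwidth sets a ceiling on generation speed (GPU layers are much faster).
seconds_per_token = cpu_side_gb / ram_bw_gb_s
print(f"CPU-side weights: {cpu_side_gb} GB")
print(f"Bandwidth-bound ceiling: ~{1 / seconds_per_token:.1f} tokens/s")   # ~1.9 tok/s
```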
It does seem, however, that as context fills up, token generation rates (and, in particular, prompt processing rates) drop substantially. I will need to do more testing on this, but I suspect that adding a third 3090 to the system will be worthwhile at some point in the future. However, I shouldn't need any more upgrades beyond that.
Running a 70b Q5 on my current dev machine (Ryzen 5950X and a single 3090, with 128 GB DDR4 RAM) gets me about 0.94 tokens per second. On my new system, that same model runs at least 5.3 tokens per second, which is a lot faster.
I intend to do a lot more testing with the AI rig in the future, and hopefully post my results here. The power of this system should allow me to explore some very interesting concepts.
I could potentially even run Falcon 180b at slow, but reasonable, speeds!