AI: Estimated token generation rates for selected model sizes (and algorithm to calculate same)
Following on from my previous post on AI, I have come up with a very rudimentary algorithm to try to estimate the token generation rates that my AI rig will be able to produce when it is built.
Algorithm
The algorithm that I am using here is simple.
The token generation rate is, very roughly, the memory bandwidth divided by the model size in gigabytes.
So, if the model is running entirely on the GPU (with a bandwidth of 936 GB/s) and the model size is 120 GB, then the token generation speed would be:
936 / 120 = 7.8 tk/s
When running on the CPU only, assuming a RAM bandwidth of 140 GB/s (see the graph in my previous post HERE):
140 / 120 ≈ 1.17 tk/s
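As a sketch, this rule of thumb fits in a tiny Python function (the function name is my own, for illustration):

```python
def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Very rough estimate: memory bandwidth divided by model size."""
    return bandwidth_gb_s / model_size_gb

# GPU only: 936 GB/s bandwidth, 120 GB model
print(round(tokens_per_second(936, 120), 2))   # 7.8
# CPU only: 140 GB/s RAM bandwidth, same model
print(round(tokens_per_second(140, 120), 2))   # 1.17
```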
These numbers do NOT take into account the context size of the model (which can be a lot for higher-context models) or the time spent on prompt processing (which can also be significant). In addition, they don't take into account the overhead involved in, for example, combining multiple GPUs together (i.e., 48 GB of VRAM split over two 3090s is not going to perform as well as a single 48 GB GPU).
However, it seems that, as a general rule of thumb, these numbers work.
But what about hybrid systems? Where the processing is occurring on both the GPU and CPU?
This is where I had to make some assumptions, and I don't yet know how well these assumptions hold up in the real world, if at all.
To generate the table below, I simply combined the GPU bandwidth and CPU bandwidth speeds together into an average bandwidth speed based on the percentage of the model that is on the GPU vs the CPU.
For example, assuming a 127.9 GB model and 48 GB of VRAM:
37.5% of the model is on the GPU, and 62.5% of the model is on the CPU.
So, the average bandwidth is:
VRAM bandwidth * (percentage of model on GPU / 100) + system RAM bandwidth * (percentage of model on CPU / 100)
Or:
936 * (37.5/100) + 140 * (62.5/100) ≈ 438.7 GB/s
Now, to get the tk/s, simply divide this by the model size:
438.7 / 127.9 ≈ 3.43 tk/s
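The hybrid estimate can be sketched the same way, deriving the GPU/CPU split from the VRAM size (again, the names are mine, not from any library):

```python
def hybrid_tokens_per_second(model_size_gb: float, vram_gb: float,
                             gpu_bw_gb_s: float, ram_bw_gb_s: float) -> float:
    """Weight the two bandwidths by the fraction of the model on each device."""
    gpu_frac = min(vram_gb / model_size_gb, 1.0)  # cap at 1 if the model fits in VRAM
    avg_bw = gpu_bw_gb_s * gpu_frac + ram_bw_gb_s * (1.0 - gpu_frac)
    return avg_bw / model_size_gb

# 127.9 GB model, 48 GB VRAM, 936 GB/s GPU, 140 GB/s RAM:
print(round(hybrid_tokens_per_second(127.9, 48, 936, 140), 2))   # 3.43
```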
Again, this is very approximate, and I won’t know how close it actually is to being accurate until the system is actually built, but I felt that this algorithm, and the data below, could be useful for some people.
Table
For the Table below, I have tested four conditions:
GPU+CPU, and CPU only (no GPU), each at a RAM bandwidth of 140 GB/s and of 100 GB/s.
The reason I have chosen to test two assumed RAM bandwidth speeds is, first of all, that RAM bandwidth is the most important factor in determining the performance of an AI system, and second, that there can be a huge variation in the actual, real-world RAM bandwidth of a system (see my previous post).
NB: The following table assumes that the system has access to 48 GB of VRAM. The numbers will of course be different if more or less processing is being done on the GPU (although the numbers for CPU only will not change).
| Model | Size (GB) | 100 GB/s RAM, GPU+CPU | 100 GB/s RAM, CPU only | 140 GB/s RAM, GPU+CPU | 140 GB/s RAM, CPU only |
|---|---|---|---|---|---|
| 180b Q8 | 193 | 1.59 tk/s | 0.51 tk/s | 1.75 tk/s | 0.72 tk/s |
| 180b Q6 | 150 | 2.45 tk/s | 0.66 tk/s | 2.63 tk/s | 0.93 tk/s |
| 120b Q8 | 127.9 | 3.23 tk/s | 0.78 tk/s | 3.43 tk/s | 1.09 tk/s |
| 120b Q6 | 98.8 | 5.12 tk/s | 1.01 tk/s | 5.33 tk/s | 1.41 tk/s |
| 70b Q8 | 73.6 | 8.76 tk/s | 1.35 tk/s | 8.95 tk/s | 1.90 tk/s |
| 70b Q6 | 57.0 | 14.10 tk/s | 1.75 tk/s | 14.21 tk/s | 2.45 tk/s |
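For what it's worth, the whole table can be regenerated with a short script (model sizes and bandwidths are copied from the table above; a few cells may differ in the last digit because the script rounds rather than truncates):

```python
# Model sizes in GB, taken from the table above.
MODELS = [("180b Q8", 193), ("180b Q6", 150), ("120b Q8", 127.9),
          ("120b Q6", 98.8), ("70b Q8", 73.6), ("70b Q6", 57.0)]
VRAM_GB, GPU_BW = 48, 936  # 48 GB of VRAM at 936 GB/s

def estimate(size_gb: float, ram_bw: float, use_gpu: bool) -> float:
    """Estimated tk/s: effective bandwidth divided by model size."""
    if use_gpu:
        gpu_frac = min(VRAM_GB / size_gb, 1.0)
        bw = GPU_BW * gpu_frac + ram_bw * (1.0 - gpu_frac)
    else:
        bw = ram_bw
    return bw / size_gb

for name, size in MODELS:
    cells = [estimate(size, ram_bw, use_gpu)
             for ram_bw in (100, 140) for use_gpu in (True, False)]
    print(f"{name} | {size} GB | " + " | ".join(f"{c:.2f} tk/s" for c in cells))
```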
Conclusion
The numbers above look very good. With a 120b Q8 model and 140 GB/s of RAM bandwidth, the estimate is 3.43 tk/s! This, as I said, doesn't take into account prompt processing time, context, and other real-world considerations, but even as a rule of thumb, it is encouraging.
It will be interesting to see how well these numbers match up (If at all!) once I get the actual system built.