October 12 2024

AI: Estimated token generation rates for selected model sizes (and algorithm to calculate same)

Following on from my previous post on AI, I have come up with a very rudimentary algorithm to try to estimate the token generation rates that my AI rig will be able to produce when it is built.

Algorithm

The algorithm that I am using here is simple.

The token generation rate is, very roughly, the memory bandwidth divided by the model size in gigabytes.

So, if the model is running entirely on the GPU (With a bandwidth of 936 GB/s) and the model size is 120GB, then the token generation speed would be:

936/120= 7.8 tk/s

When running on the CPU, assuming a RAM bandwidth of 140 GB/s (See the graph in my previous post HERE), it would be:

140/120= 1.17 tk/s
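The rule of thumb above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `tokens_per_second` is a hypothetical name chosen for this example, and the bandwidth figures are the assumed values used in this post, not measurements.

```python
def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough estimate: memory bandwidth (GB/s) divided by model size (GB)."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumed figures from the post: 936 GB/s VRAM, 140 GB/s system RAM, 120 GB model.
gpu_rate = tokens_per_second(936, 120)
cpu_rate = tokens_per_second(140, 120)

print(f"GPU only: {gpu_rate:.2f} tk/s")
print(f"CPU only: {cpu_rate:.2f} tk/s")
```

Like the formula itself, this ignores context size, prompt processing, and multi-GPU overhead.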

These numbers do NOT take into account the context size of the model (Which can be a lot, for higher context models) or the time spent on prompt processing (Which can also be significant). In addition, they don’t take into account the overhead involved in, for example, combining multiple GPUs (i.e., 48 GB of VRAM split over two 3090s is not going to perform as well as a single 48 GB GPU).

However, it seems that, as a general rule of thumb, these numbers work.

But what about hybrid systems, where the processing is occurring on both the GPU and CPU?

This is where I had to make some assumptions, and I don’t yet know how well these assumptions hold up in the real world, if at all.

To generate the table below, I simply combined the GPU bandwidth and CPU bandwidth into a single average bandwidth, weighted by the percentage of the model that is on the GPU vs the CPU.

For example, assuming a 127.9 GB model, and 48 GB of VRAM:

37.5% of the model is on the GPU (48/127.9), 62.5% of the model is on the CPU.

So, the average bandwidth is:

VRAM bandwidth * (percentage of model on GPU /100) + system RAM bandwidth * (percentage of model on CPU /100)

Or:

936*(37.5/100) + 140*(62.5/100) = 438.5 GB/s (438.73 GB/s using the unrounded percentages)

Now, to get the tk/s, simply divide this by the model size:

438.73/127.9=3.43 tk/s.
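The hybrid estimate can be written out as a short Python sketch. The function name `hybrid_tokens_per_second` and its parameter names are hypothetical, chosen for this example, and it rests on the same untested assumption as the text: that the effective bandwidth is a simple weighted average of VRAM and system RAM bandwidth.

```python
def hybrid_tokens_per_second(
    model_size_gb: float,
    vram_gb: float,
    gpu_bandwidth_gb_s: float,
    ram_bandwidth_gb_s: float,
) -> float:
    """Rough tk/s estimate for a model split between VRAM and system RAM."""
    # Fraction of the model that fits in VRAM (capped at 100%).
    gpu_fraction = min(vram_gb / model_size_gb, 1.0)
    cpu_fraction = 1.0 - gpu_fraction

    # Assumption from the post: bandwidths combine as a weighted average.
    average_bandwidth = (
        gpu_bandwidth_gb_s * gpu_fraction + ram_bandwidth_gb_s * cpu_fraction
    )
    return average_bandwidth / model_size_gb


# Worked example from the post: 127.9 GB model, 48 GB VRAM, 936 and 140 GB/s.
rate = hybrid_tokens_per_second(127.9, 48, 936, 140)
print(f"{rate:.2f} tk/s")
```

Capping the GPU fraction at 1.0 means a model that fits entirely in VRAM falls back to the plain GPU-only formula.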

Again, this is very approximate, and I won’t know how close it actually is to being accurate until the system is actually built, but I felt that this algorithm, and the data below, could be useful for some people.

Table

For the table below, I have estimated four conditions:

CPU and GPU, and CPU only (No GPU) for both a RAM bandwidth of 140 GB/s and a RAM bandwidth of 100 GB/s.

The reason why I have chosen two assumed RAM bandwidth speeds is because, first of all, RAM bandwidth is the most important factor in determining the performance of an AI system, and second of all, there can be a huge variation in the actual, real world, RAM bandwidth of a system (See my previous post).

NB: The following table assumes that the system has access to 48 GB of VRAM with a bandwidth of 936 GB/s. The numbers will of course be different if more or less processing is being done on the GPU (Although the numbers for CPU only will not change).

Model	Size (GB)	100 GB/s RAM GPU+CPU	100 GB/s RAM CPU ONLY	140 GB/s RAM GPU+CPU	140 GB/s RAM CPU ONLY
180b Q8	193	1.59 tk/s	0.51 tk/s	1.75 tk/s	0.72 tk/s
180b Q6	150	2.45 tk/s	0.66 tk/s	2.63 tk/s	0.93 tk/s
120b Q8	127.9	3.23 tk/s	0.78 tk/s	3.43 tk/s	1.09 tk/s
120b Q6	98.8	5.12 tk/s	1.01 tk/s	5.33 tk/s	1.41 tk/s
70b Q8	73.6	8.76 tk/s	1.35 tk/s	8.95 tk/s	1.90 tk/s
70b Q6	57.0	14.10 tk/s	1.75 tk/s	14.21 tk/s	2.45 tk/s
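As a rough check, the table can be regenerated with a short Python script. This is a sketch under the post's assumptions (48 GB of VRAM at 936 GB/s, weighted-average bandwidth); the names `estimate`, `build_rows`, and `MODELS` are hypothetical, introduced only for this example. The table values appear to be truncated rather than rounded to two decimals, so the printed output, which rounds, may differ by 0.01 in the last digit.

```python
GPU_BANDWIDTH = 936.0  # GB/s, assumed VRAM bandwidth from the post
VRAM_GB = 48.0         # GB, assumed total VRAM from the post

# Model names and sizes (GB) as listed in the table.
MODELS = [
    ("180b Q8", 193.0),
    ("180b Q6", 150.0),
    ("120b Q8", 127.9),
    ("120b Q6", 98.8),
    ("70b Q8", 73.6),
    ("70b Q6", 57.0),
]


def estimate(model_gb: float, ram_bandwidth: float, vram_gb: float = VRAM_GB) -> float:
    """Weighted-average bandwidth divided by model size (rough tk/s)."""
    gpu_fraction = min(vram_gb / model_gb, 1.0)
    average = GPU_BANDWIDTH * gpu_fraction + ram_bandwidth * (1.0 - gpu_fraction)
    return average / model_gb


def build_rows():
    """Return [name, size, 100 hybrid, 100 CPU, 140 hybrid, 140 CPU] per model."""
    rows = []
    for name, size in MODELS:
        row = [name, size]
        for ram_bandwidth in (100.0, 140.0):
            row.append(estimate(size, ram_bandwidth))               # GPU+CPU
            row.append(estimate(size, ram_bandwidth, vram_gb=0.0))  # CPU only
        rows.append(row)
    return rows


for name, size, *rates in build_rows():
    print(name, size, *(f"{r:.2f} tk/s" for r in rates), sep="\t")
```

Changing `VRAM_GB` or `GPU_BANDWIDTH` shows how the GPU+CPU columns shift for a different card, while the CPU-only columns stay the same.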

Conclusion

The numbers above look very good. With a 120b Q8 model and 140 GB/s RAM bandwidth, the estimate is 3.43 tk/s! This, as I said, doesn’t take into account prompt processing time, context, and other real world considerations, but even as a rule of thumb, this is encouraging.

It will be interesting to see how well these numbers match up (If at all!) once I get the actual system built.

2 Comments

Name
October 30, 2025 @ 3:41 pm

Hello, the information in this article is absolutely wrong. If your inference was really memory bound and you could fit the model into a single GPU, you could get away with:

(2 * n * 1000) / (b_w)

where

n – number of billion parameters
b_w – GPU bandwidth in GB/s

Taking your example:

> So, if the model is running entirely on the GPU (With a bandwidth of 936 GB/s) and the model size is 120GB, then the token generation speed would be:

we would have

(2 * 120 * 1000) / (936) = 256.410

which again, would be close if the model didn’t need to be split across several GPUs and the card was 100% memory bound – this depends on the model as much as on the card

- PhoenixGames
  October 30, 2025 @ 4:00 pm
  
  Thank you for your reply.
  
  The numbers and algorithms posted above were always rough rules of thumb. Having tested them in real world conditions, they are certainly not entirely accurate; however, they are not entirely inaccurate either.
  
  I was running 120b Q8 models with two 3090s (The rest of the model being on the CPU) and I was getting anything from 1-2 tk/s depending on the context, etc.
  
  The table above shows 3.43 tk/s for 140 GB/s RAM bandwidth, which is noticeably higher than I was getting in reality, but again, it’s a rough estimate.
  
  The formula I used (memory bandwidth / model size) was based on internet searches, reddit, etc. I admit, not the most reliable sources, but I couldn’t find anything better at the time. Can you tell me where you got yours?
  
  Another problem that I have noticed is that you are dividing by the GPU bandwidth, so, as the GPU bandwidth increases, the token generation rate decreases, which, surely, should be the opposite?
  
  For example, you said:
  
  (2 * 120 * 1000) / (936) = 256.410
  
  So, for a GPU with a bandwidth of 936 GB/s, we would get 256 tk/s.
  
  Ok, let’s now assume that we roughly double the bandwidth, to 1800 GB/s. We would expect to get about twice the tokens, right? But with your algorithm:
  
  (2*120*1000)/(1800), we get: 133.33
  
  So instead of doubling, the tk/s is halving? Which can’t be right surely, unless I’m missing something?
