2 Comments

  1. Name
    October 30, 2025 @ 3:41 pm

    Hello, the information in this article is absolutely wrong. If your inference were really memory-bound and you could fit the model on a single GPU, you could get away with:

    (2 * n * 1000) / (b_w)

    where

    n – number of parameters, in billions
    b_w – GPU bandwidth in GB/s

    Taking your example:

    > So, if the model is running entirely on the GPU (With a bandwidth of 936 GB/s) and the model size is 120GB, then the token generation speed would be:

    we would have

    (2 * 120 * 1000) / (936) = 256.410

    which, again, would be close if the model didn’t need to be split across several GPUs and the card were 100% memory-bound – this depends on the model as much as on the card.

    • PhoenixGames
      October 30, 2025 @ 4:00 pm

      Thank you for your reply.

      The numbers and formulas posted above were always rough rules of thumb. Having tested them in real-world conditions, I can say they are certainly not entirely accurate, but they are not entirely inaccurate either.

      I was running 120B Q8 models on two 3090s (with the rest of the model on the CPU), and I was getting anywhere from 1 to 2 tk/s, depending on the context, etc.

      The graph above shows 3.43 tk/s for 140 GB/s of RAM bandwidth, which is noticeably higher than what I was getting in reality, but again, it’s a rough estimate.

      The formula I used (memory bandwidth / model size) was based on internet searches, Reddit, etc. I admit these are not the most reliable sources, but I couldn’t find anything better at the time. Can you tell me where you got yours?
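
      As a minimal sketch of that rule of thumb, assuming every generated token requires reading the whole model from memory once, the estimate can be written as below. The function name and the example inputs (936 GB/s and 120 GB, taken from the quoted article example) are only illustrative, and the result is an upper-bound style estimate, not a measurement.

```python
def estimate_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough rule of thumb: tokens/s ~ memory bandwidth / model size.

    Assumes generation is fully memory-bound and that the whole model
    is read from memory once per generated token (an assumption).
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # Example inputs from the quoted article text: 936 GB/s, 120 GB model.
    estimate = estimate_tokens_per_second(936, 120)
    print(f"{estimate:.2f} tk/s")
```

      With this form, doubling the bandwidth doubles the estimate, which is the scaling one would expect from a memory-bound workload.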

      Another problem I have noticed is that you are dividing by the GPU bandwidth, so as the GPU bandwidth increases, the token generation rate decreases. Surely it should be the opposite?

      For example, you said:

      (2 * 120 * 1000) / (936) = 256.410

      So, for a GPU with a bandwidth of 936 GB/s, we would get 256 tk/s.

      OK, let’s now assume that we roughly double the bandwidth, to 1800 GB/s. We would expect to get roughly twice the tokens, right? But with your formula:

      (2*120*1000)/(1800), we get: 133.33

      So instead of roughly doubling, the tk/s roughly halves? That surely can’t be right, unless I’m missing something?
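
      To make the inverse scaling concrete, here is a small sketch that evaluates the commenter's expression at both bandwidths. It also tries one possible reading, which is an assumption and not something the comment states: that the factor 2 is bytes per parameter (16-bit weights) and the factor 1000 converts seconds to milliseconds, so the result would be milliseconds per token, not tokens per second. The function names are hypothetical.

```python
def commenter_formula(n_billion_params: float, bandwidth_gb_s: float) -> float:
    """The expression from the comment: (2 * n * 1000) / b_w."""
    return (2 * n_billion_params * 1000) / bandwidth_gb_s


def ms_per_token_to_tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to a rate in tokens/s.

    Only meaningful if the expression above really is a latency in ms,
    which is an assumption made here for illustration.
    """
    return 1000.0 / ms_per_token


if __name__ == "__main__":
    for bandwidth in (936, 1800):
        value = commenter_formula(120, bandwidth)
        rate = ms_per_token_to_tokens_per_second(value)
        print(f"{bandwidth} GB/s: formula = {value:.2f}, "
              f"as ms/token that is {rate:.2f} tk/s")
```

      Under that reading, the values 256.41 and 133.33 would correspond to about 3.90 and 7.50 tk/s, so the rate does rise with bandwidth. Whether that is what the commenter meant is not clear from the comment.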
