AI: Memory Bandwidth Comparison for Selected DDR4 CPUs
Introduction
As I mentioned in my previous post (HERE), I have purchased a Threadripper PRO 3955WX CPU for the purposes of building an LLM inference machine.
However, I have since discovered a serious issue with using some Threadripper and Epyc CPUs (including the 3955wx) for this purpose.
The issue is that these CPUs have only 2 CCDs (Core Chiplet Dies), which substantially reduces the achievable memory bandwidth.
The issue is discussed here:
AMD Epyc 7002 Rome CPU’s with Half Memory Bandwidth. (Serve the Home)
And here:
Comparing Threadripper 7000 memory bandwidth for all models. (Reddit, r/Threadripper)
There are also many other sources confirming the results for various generations and models of chips.
The fundamental issue is that reaching the maximum bandwidth of 8-channel RAM (about 200 GB/s) requires not just 8 supported channels, but 8 CCDs as well.
Only extremely expensive CPUs have 8 CCDs.
The cheaper CPUs have only 2, which effectively limits their RAM bandwidth to quad-channel speeds (around 80-100 GB/s) or slightly more.
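For reference, the "about 200 GB/s" figure can be reproduced with a short calculation. This is a minimal sketch, assuming the standard 64-bit (8-byte) DDR4 channel width and DDR4-3200 modules; the function name is only illustrative.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal units).

    Assumes each channel moves `bus_bytes` bytes per transfer
    (8 bytes for a standard 64-bit DDR4 channel).
    """
    return channels * mega_transfers_per_s * bus_bytes / 1000


# Compare dual-, quad- and eight-channel DDR4-3200 configurations.
for channels in (2, 4, 8):
    print(f"{channels}-channel DDR4-3200: {peak_bandwidth_gbs(channels, 3200):.1f} GB/s")
```

Under these assumptions, 8 channels give 204.8 GB/s and 4 channels give 102.4 GB/s, which is why a 2-CCD part that tops out at 80-100 GB/s behaves roughly like a quad-channel system.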
Ordinarily, this would not be a problem, since very few use cases come close to saturating the memory bandwidth like that.
The issue is that Large Language Model inference is one of those use cases (Computational Fluid Dynamics is another). Not only that, but for LLMs, VRAM/RAM bandwidth is the single most important factor determining the performance of the system.
This means that the 3955wx which I bought is going to be extremely sub-optimal for this purpose, and needs to be replaced.
The problem is that it is extremely difficult to find actual, real-world values for RAM bandwidth online.
Most sources quote only the theoretical maximum bandwidth, which is about 200 GB/s in 8-channel mode. They make no reference to the CCD bandwidth issue, which can cause users to make bad purchasing decisions.
After spending some time researching this issue, I have decided to prepare a small graph showing my results, with sources, so that hopefully other users will be more informed than I was.
Data
I am focusing entirely on DDR4 CPUs here, since DDR5 is out of my budget at this point.
I am open to expanding this graph if I can come across more data in the future.
The HTML Graph is quite squashed, so I have uploaded an image instead (Click for full image):
HTML Graph:
CPU | Read Speed | Write Speed | Copy Speed | CCDs | RAM Type | RAM Channels | RAM Speed (MT/s) | Sources |
Ryzen 9 5950x | 54 GB/s | 54 GB/s | – | 2 | DDR4 | 2 | 3600 | Link |
Threadripper 3960x | 95 GB/s | 93 GB/s | 101 GB/s | 4 | DDR4 | 4 | 3200 | Link |
Threadripper 3970x | 96 GB/s | 98 GB/s | 102 GB/s | 4 | DDR4 | 4 | 3200 | Link |
Threadripper Pro 3955wx | 82 GB/s | 51 GB/s | 94 GB/s | 2 | DDR4 | 8 | 3200 | Link |
Threadripper Pro 3975wx | 137-139 GB/s | 102 GB/s | 137 GB/s | 4 | DDR4 | 8 | 3200 | Link Link Link |
Threadripper Pro 3995wx (64C/128t) | – | 149 GB/s | – | 8 | DDR4 | 8 | 3200 | Link |
Epyc 7302p | 115 GB/s | 85 GB/s | 128 GB/s | 4 | DDR4 | 8 | 2933 | Link |
Epyc 7443 | 136 GB/s | – | – | 4 | DDR4 | 8 | 3200 | Link |
Epyc 7551p | 154 GB/s | 156 GB/s | 145 GB/s | – | DDR4 | 8 | 2600 | Link |
2xEpyc 7302p (Dual CPU) | 219 GB/s | – | – | 4×2 | DDR4 | 8×2 | 2400 | Link |
Threadripper Pro 5995wx | 160 GB/s | 171 GB/s | – | 8 | DDR4 | 8 | 3200 | Link Link |
Threadripper Pro 5975wx | 147 GB/s | 171 GB/s | – | 4 | DDR4 | 8 | 3200 | Link |
Threadripper Pro 5965wx | 147 GB/s | 172 GB/s | – | 4 | DDR4 | 8 | 3200 | Link |
Analysis
At my price point, it seems that 4 CCDs with 8 channels is basically the best that can be achieved. This means that the actual real-world speeds are in the region of 150 GB/s read and 100 GB/s write.
It seems, based on my research, that there is no inherent difference between Threadripper/Pro and Epyc CPUs that have the same number of CCDs.
I.e., a 4-CCD Threadripper Pro will have similar read/write speeds to a 4-CCD Epyc.
This conflicts with some users who claim that the Epyc series has superior RAM bandwidth even with the same number of CCDs. I have come across multiple sources, for example, indicating that the 7302p, with 4 CCDs, can reach 200 GB/s of RAM bandwidth. Based on my research, I believe that this is incorrect.
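One way to compare the rows in the table is to express each measured read speed as a fraction of the theoretical peak for that configuration. The sketch below does this for a few single-socket, 8-channel rows copied from the table above (using the low end of the 137-139 GB/s range for the 3975wx). It assumes a 64-bit (8-byte) channel width; the helper names are only illustrative.

```python
# (cpu, measured read GB/s, CCDs, RAM channels, RAM speed in MT/s),
# values taken from the table in the Data section.
ROWS = [
    ("Threadripper Pro 3955wx", 82, 2, 8, 3200),
    ("Threadripper Pro 3975wx", 137, 4, 8, 3200),
    ("Epyc 7302p", 115, 4, 8, 2933),
    ("Epyc 7443", 136, 4, 8, 3200),
    ("Threadripper Pro 5975wx", 147, 4, 8, 3200),
    ("Threadripper Pro 5995wx", 160, 8, 8, 3200),
]


def peak_gbs(channels, mt_per_s, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s, assuming 8 bytes per transfer per channel."""
    return channels * mt_per_s * bus_bytes / 1000


def read_efficiency(measured_gbs, channels, mt_per_s):
    """Measured read bandwidth as a fraction of the theoretical peak."""
    return measured_gbs / peak_gbs(channels, mt_per_s)


for cpu, read, ccds, channels, speed in ROWS:
    eff = read_efficiency(read, channels, speed)
    print(f"{cpu:26s} {ccds} CCDs  {read:4d} GB/s  {eff:.0%} of peak")
```

Under these assumptions, the theoretical peak at the DDR4-2933 speed listed for the 7302p row is about 187.7 GB/s, so a measured 200 GB/s would not be possible at that RAM speed, and at DDR4-3200 it would require almost 100% of the 204.8 GB/s peak.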
For my budget, a CPU like the Epyc 7302p probably would have been a better choice.
It is (substantially) less expensive than the Threadripper, but has similar RAM bandwidth, due to its 4 CCDs.
However, I eventually chose to go with a Threadripper PRO 3975WX.
There were several reasons for this.
Firstly, the Threadripper PRO is a substantially more powerful CPU.
It has a higher clock speed, a higher boost clock, and twice as many cores (32 vs 16), and comfortably outperforms the Epyc in both single- and multi-core applications:
Comparison between Threadripper PRO 3975WX and Epyc 7302.
CPU speed and core count are not priorities for LLM inference; however, they do make a difference, and more/faster cores are still preferable.
The motherboards for the Threadripper PRO are also better suited to my use case. They have more PCI-E x16 slots, and seem to have better support for peripherals, including graphics cards, etc. They also seem to be cheaper, at least in my location.
Secondly, I did not want to have to return my motherboard in addition to my CPU, since, unlike the CPU, it is a bulky item.
The Threadripper PRO 3975 is quite a bit more expensive than the 3955 and (especially) the 7302, but it was just about within budget.
Conclusion
It is vitally important that users buying hardware for LLMs and other memory-bandwidth-intensive applications be aware of the bandwidth limits imposed by the number of CCDs. Otherwise, the performance of CPU-based inference can be substantially lower than expected.
In my case, for example, the 3955wx would have had a read speed of 82 GB/s, compared with 137 GB/s for the 3975!
Of course, if the inference is mainly happening on the GPU, the CPU is far less important, but I intend to run large models (70b, 120b, etc.), which means that these memory bandwidth limitations will have a substantial effect on the performance of the system for its intended use case.
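To give a rough sense of what this difference means in practice, a common rule of thumb (an assumption added here, not a measurement from this post) is that generating each token of a dense model requires reading all of the weights from memory once, so bandwidth divided by model size gives an upper bound on tokens per second. The sketch below assumes a 70b model quantized to about 4 bits (0.5 bytes) per parameter, and ignores the KV cache, compute time and any quantization overhead.

```python
def est_tokens_per_s(bandwidth_gbs: float, params_billions: float, bytes_per_param: float) -> float:
    """Rough upper bound on decode speed for a memory-bound dense model.

    Assumes every weight is read from RAM once per generated token.
    """
    model_gb = params_billions * bytes_per_param  # approximate size of the weights in GB
    return bandwidth_gbs / model_gb


# Read speeds for the 3955wx and 3975wx, taken from the table above.
for cpu, read_gbs in (("3955wx", 82), ("3975wx", 137)):
    print(f"{cpu}: at most ~{est_tokens_per_s(read_gbs, 70, 0.5):.1f} tokens/s")
```

Under these assumptions, the ceiling is roughly 2.3 tokens/s on the 3955wx versus 3.9 tokens/s on the 3975wx. Real-world results will be lower.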