AI: Estimated token generation rates for selected model sizes (and algorithm to calculate same)
Following on from my previous post on AI, I have come up with a very rudimentary algorithm to try to estimate the token generation rates that my AI rig will be able to produce when it is built.
Algorithm
The algorithm that I am using here is simple.
The token generation rate is, very roughly, the memory bandwidth divided by the model size in gigabytes.
So, if the model is running entirely on the GPU (with a bandwidth of 936 GB/s) and the model size is 120 GB, then the token generation speed would be:
936 / 120 = 7.8 tk/s
When running on the CPU only, assuming a RAM bandwidth of 140 GB/s (see the graph in my previous post HERE):
140 / 120 ≈ 1.17 tk/s
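As a sketch, this rule of thumb fits in a tiny Python function (the function name is my own, for illustration):

```python
def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Very rough estimate: memory bandwidth divided by model size."""
    return bandwidth_gb_s / model_size_gb

# GPU only: 936 GB/s bandwidth, 120 GB model
print(round(tokens_per_second(936, 120), 2))   # 7.8
# CPU only: 140 GB/s RAM bandwidth, same model
print(round(tokens_per_second(140, 120), 2))   # 1.17
```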
These numbers do NOT take into account the context size of the model (which can be a lot for higher-context models) or the time spent on prompt processing (which can also be significant). In addition, they don't take into account the overhead involved in, for example, combining multiple GPUs together (i.e., 48 GB of VRAM split over two 3090s is not going to perform as well as a single 48 GB GPU).
However, it seems that, as a general rule of thumb, these numbers work.
But what about hybrid systems? Where the processing is occurring on both the GPU and CPU?
This is where I had to make some assumptions, and I don't yet know how well these assumptions hold up in the real world, if at all.
To generate the table below, I simply combined the GPU bandwidth and CPU bandwidth speeds together into an average bandwidth speed based on the percentage of the model that is on the GPU vs the CPU.
For example, assuming a 127.9 GB model and 48 GB of VRAM:
37.5% of the model is on the GPU, and 62.5% of the model is on the CPU.
So, the average bandwidth is:
VRAM bandwidth * (percentage of model on GPU / 100) + system RAM bandwidth * (percentage of model on CPU / 100)
Or:
936 * (37.5/100) + 140 * (62.5/100) ≈ 438.7 GB/s
Now, to get the tk/s, simply divide this by the model size:
438.7 / 127.9 ≈ 3.43 tk/s
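The hybrid estimate can be sketched the same way, deriving the GPU/CPU split from the VRAM size (again, the names are mine, not from any library):

```python
def hybrid_tokens_per_second(model_size_gb: float, vram_gb: float,
                             gpu_bw_gb_s: float, ram_bw_gb_s: float) -> float:
    """Weight the two bandwidths by the fraction of the model on each device."""
    gpu_frac = min(vram_gb / model_size_gb, 1.0)  # cap at 1 if the model fits in VRAM
    avg_bw = gpu_bw_gb_s * gpu_frac + ram_bw_gb_s * (1.0 - gpu_frac)
    return avg_bw / model_size_gb

# 127.9 GB model, 48 GB VRAM, 936 GB/s GPU, 140 GB/s RAM:
print(round(hybrid_tokens_per_second(127.9, 48, 936, 140), 2))   # 3.43
```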
Again, this is very approximate, and I won’t know how close it actually is to being accurate until the system is actually built, but I felt that this algorithm, and the data below, could be useful for some people.
Table
For the Table below, I have tested four conditions:
GPU+CPU, and CPU only (no GPU), each at a RAM bandwidth of 140 GB/s and of 100 GB/s.
The reason I have chosen to test two assumed RAM bandwidth speeds is, first of all, that RAM bandwidth is the most important factor in determining the performance of an AI system, and second, that there can be a huge variation in the actual, real-world RAM bandwidth of a system (see my previous post).
NB: The following table assumes that the system has access to 48 GB of VRAM. The numbers will of course be different if more or less processing is being done on the GPU (although the numbers for CPU only will not change).
| Model | Size (GB) | 100 GB/s RAM, GPU+CPU | 100 GB/s RAM, CPU only | 140 GB/s RAM, GPU+CPU | 140 GB/s RAM, CPU only |
|---|---|---|---|---|---|
| 180b Q8 | 193 | 1.59 tk/s | 0.51 tk/s | 1.75 tk/s | 0.72 tk/s |
| 180b Q6 | 150 | 2.45 tk/s | 0.66 tk/s | 2.63 tk/s | 0.93 tk/s |
| 120b Q8 | 127.9 | 3.23 tk/s | 0.78 tk/s | 3.43 tk/s | 1.09 tk/s |
| 120b Q6 | 98.8 | 5.12 tk/s | 1.01 tk/s | 5.33 tk/s | 1.41 tk/s |
| 70b Q8 | 73.6 | 8.76 tk/s | 1.35 tk/s | 8.95 tk/s | 1.90 tk/s |
| 70b Q6 | 57.0 | 14.10 tk/s | 1.75 tk/s | 14.21 tk/s | 2.45 tk/s |
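For what it's worth, the whole table can be regenerated with a short script (model sizes and bandwidths are copied from the table above; a few cells may differ in the last digit because the script rounds rather than truncates):

```python
# Model sizes in GB, taken from the table above.
MODELS = [("180b Q8", 193), ("180b Q6", 150), ("120b Q8", 127.9),
          ("120b Q6", 98.8), ("70b Q8", 73.6), ("70b Q6", 57.0)]
VRAM_GB, GPU_BW = 48, 936  # 48 GB of VRAM at 936 GB/s

def estimate(size_gb: float, ram_bw: float, use_gpu: bool) -> float:
    """Estimated tk/s: effective bandwidth divided by model size."""
    if use_gpu:
        gpu_frac = min(VRAM_GB / size_gb, 1.0)
        bw = GPU_BW * gpu_frac + ram_bw * (1.0 - gpu_frac)
    else:
        bw = ram_bw
    return bw / size_gb

for name, size in MODELS:
    cells = [estimate(size, ram_bw, use_gpu)
             for ram_bw in (100, 140) for use_gpu in (True, False)]
    print(f"{name} | {size} GB | " + " | ".join(f"{c:.2f} tk/s" for c in cells))
```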
Conclusion
The numbers above look very good. With a 120b Q8 model and 140 GB/s of RAM bandwidth, the estimate is 3.43 tk/s! This, as I said, doesn't take into account prompt processing time, context, and other real-world considerations, but even as a rule of thumb, it is encouraging.
It will be interesting to see how well these numbers match up (If at all!) once I get the actual system built.