Experiments with Hunyuan Video
I have spent most of this week working on video generation with Hunyuan video.
This has been a fascinating exploration into a new and rapidly developing sub-field of the AI industry.
AI video generation looks set to take the world by storm in the coming years!
Hunyuan Video allows the generation of short clips (usually less than 10 seconds) from a plain text prompt.
From my testing, longer clips are possible, but the quality drops off sharply; I have generated clips of up to 60 seconds (at 320×240).
My current setup is posted HERE, but in short, I have two RTX 3090 GPUs installed.
Hunyuan Video at BF16 is 25.6 GB.
Unlike LLMs (Large Language Models), video and image generation models cannot be split across GPUs in the same way, so it is not possible to run Hunyuan at BF16 on a 24 GB card; instead, I have been using it at FP8.
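As a rough sanity check on those numbers, the weight footprint scales linearly with the bytes used per parameter. Here is a minimal sketch; the ~12.8B parameter count is simply back-calculated from the 25.6 GB BF16 figure, and the Q8_0 bits-per-weight value is an approximation.

```python
# Rough weight-only VRAM estimate; this ignores activations, the CLIP text
# encoders and the VAE, so real usage is noticeably higher.
BF16_SIZE_GB = 25.6                  # stated size of Hunyuan Video at BF16
PARAMS_BILLION = BF16_SIZE_GB / 2    # BF16 = 2 bytes per parameter -> ~12.8B

def weights_gb(bits_per_weight: float) -> float:
    """Approximate weight size in GB at a given precision."""
    return PARAMS_BILLION * bits_per_weight / 8

print(f"BF16 (16 bits):   {weights_gb(16):.1f} GB")   # ~25.6 GB
print(f"FP8  (8 bits):    {weights_gb(8):.1f} GB")    # ~12.8 GB
print(f"Q8_0 (~8.5 bits): {weights_gb(8.5):.1f} GB")  # ~13.6 GB (GGUF stores per-block scales)
```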
What is possible, as I discovered, is to offload the CLIP and VAE data from the first GPU to either the CPU or another GPU. This saves some VRAM, allowing for higher resolutions and/or more frames, but it is not “splitting” the model, just offloading some data.
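ComfyUI does this through its MultiGPU nodes (more on my workflow below), but the same offload-rather-than-split idea can be sketched outside ComfyUI with the diffusers library. This is only an illustration of the concept, not my actual setup; the repository ID and the availability of HunyuanVideoPipeline in your diffusers version are assumptions.

```python
# Sketch: let diffusers move each component (text encoders, transformer, VAE)
# onto the GPU only while it is needed, and back to CPU afterwards.
# This is offloading, not splitting: the transformer itself must still fit
# on one card when it runs, which is why I use FP8 / GGUF in practice.
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.utils import export_to_video

pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",  # assumed diffusers-format repo
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # components leave VRAM when not in use

frames = pipe(
    prompt="A male warrior with a battle-scarred face stands in a war-torn futuristic city.",
    height=640,
    width=928,
    num_frames=201,
    num_inference_steps=30,  # illustrative; not a tuned value
).frames[0]

export_to_video(frames, "warrior.mp4", fps=24)
```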
It is also possible to use GGUF format instead of .safetensors. GGUF files for Hunyuan are available HERE.
Running Hunyuan at BF16 would require about 45 GB of VRAM on one card.
I am using ComfyUI to run Hunyuan, and having tried many workflows, my current one uses GGUFDisTorchMultiGPU nodes, as well as DualCLIPLoaderGGUFMultiGPU and VAELoaderMultiGPU, to offload as much data as possible onto the other card and the CPU.
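A quick way to confirm that the offloading nodes are actually putting data on the second card (rather than silently keeping everything on the first) is to watch free VRAM per GPU while the workflow runs; nvidia-smi works, or a few lines of PyTorch:

```python
# Print free / total VRAM for every visible GPU; run this alongside the
# workflow to see what has actually been moved onto cuda:1.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # values in bytes
    name = torch.cuda.get_device_name(i)
    print(f"cuda:{i} ({name}): {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```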
The quality I am getting is quite good.
With my setup, it seems that I can generate a 201-frame video at 928×640. At 24 frames per second, that’s about 8 seconds of video. This takes about 22 minutes to generate, with SageAttention 1 and TeaCache optimisation enabled.
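For reference, the relationship between frame count, frame rate and clip length is simple arithmetic. The sketch below also notes the 4k + 1 frame-count pattern, which, as I understand it, comes from Hunyuan's VAE compressing time by a factor of 4 (hence 201 rather than a round 200); the other frame counts are illustrative, not my exact settings.

```python
# Clip length from frame count and frame rate.
def clip_seconds(num_frames: int, fps: int = 24) -> float:
    return num_frames / fps

print(clip_seconds(201))    # ~8.4 s - my usual 928x640 renders
print(clip_seconds(73))     # ~3.0 s - roughly the short example clips
print(clip_seconds(1441))   # ~60 s  - the long low-resolution test

# Frame counts of the form 4k + 1 (73, 129, 201, 1441, ...) line up with the
# VAE's 4x temporal compression: latent frames = (num_frames - 1) // 4 + 1.
```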
An example of the quality of content (a short, 3-second video) that I can produce is here:
The prompt for the above was:
“A male warrior with long hair and a battle-scarred face wearing a bulletproof vest, holding a large rifle, stands in a war-torn futuristic city. The camera pans around him, as explosions and gunfire rage.
Masterpiece, best quality, dynamic scene.”
What is worthy of note, other than the good quality, is that Hunyuan Video seems to have no issue creating realistic-looking firearms. I have not used any other video AI models before, but I have used Stable Diffusion to generate images, and many of those models are unable to create firearms, likely due to censorship. Without getting too far into the realms of politics, it is interesting that a model created by Tencent, a Chinese company, is actually less censored than many models available in the West.
Here is another example, another 3-second video:
Another example, this one an 8-second video using the same prompt:
The longer clip seems to have slightly lower quality. It is hard to tell, but the background in particular appears to become less clear, more blurry and indistinct in longer clips.
If I again take the same prompt, but this time increase the length to 60 seconds (reducing the resolution to 320×240 to allow it to fit in memory), I get this:
Clearly, the quality here is extremely low, but I don’t think that is just because of the lower resolution; I think the main reason is the clip length. For example, if I generate a 3-second clip (using the same prompt as above) at 320×640, I get this:
This, despite its low resolution, is a much better-looking video. It seems, then, that regardless of resolution, usable clip length is limited to about 8-10 seconds.
While I have tried various quantisations, workflows and other options, there is still a lot of experimentation to do.
In particular, I would like to experiment with the guidance and step settings, as well as the various VAE Decode options (tile size, etc.).
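For anyone trying the same thing outside ComfyUI, those knobs map onto something like the sketch below, reusing the pipe from the earlier offloading example. These are not settings I have validated, and enable_tiling() is only a stand-in for the tile-size options on ComfyUI's tiled VAE Decode node.

```python
# Sketch of the settings I still want to sweep: sampling steps, guidance,
# and a tiled VAE decode to keep the decode step within VRAM.
from diffusers import HunyuanVideoPipeline

def generate(pipe: HunyuanVideoPipeline, prompt: str,
             steps: int = 30, guidance: float = 6.0):
    pipe.vae.enable_tiling()        # decode latents in tiles rather than all at once
    return pipe(
        prompt=prompt,
        num_frames=201,
        height=640,
        width=928,
        num_inference_steps=steps,  # fewer steps = faster, possibly softer detail
        guidance_scale=guidance,    # how strongly the prompt is followed
    ).frames[0]
```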
From what I have read, running the model at Q8 does not substantially impair the quality.