LTX-Video Locally: Facts and Myths Debunked. Tips Included.
LTX-Video is a very fast, DiT-based video generation model that can generate high-quality videos on consumer hardware, and in some respects the results are comparable to professional, subscription-based generative video models.
NVIDIA GPU Requirements
After some experiments, I would say the minimum VRAM for effective use is 24GB. You can run it on 16GB without too much trouble (as I did), and you could probably run it on 12GB with much more trouble. This is still pretty good considering the speed and requirements of other open models (not to mention the results, which, in the case of LTX-Video, are comparatively good).
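If you are not sure where your card lands, checking the installed VRAM is a one-liner with PyTorch, which ComfyUI already requires. A minimal sketch:

```python
# Quick VRAM check with PyTorch (already a ComfyUI dependency).
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA device detected")
```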
Installation
Installation is easy. Download the models and workflows for ComfyUI.
- Download the LTX-Video model from https://huggingface.co/Lightricks/LTX-Video/tree/main and put it in your ComfyUI checkpoints folder (either ComfyUI\models\checkpoints by default, or your dedicated folder from extra_model_paths.yaml if you are sharing the models).
- Download the text encoder(s) from https://huggingface.co/comfyanonymous/flux_text_encoders/tree/main and put them in \models\clip (if you are running Flux models, you probably already have them).
- Download and test the official workflows, or try your luck with mine (sandner-art GitHub LTX-Video workflows). A scripted download sketch follows below.
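If you prefer the command line, the same downloads can be scripted with the huggingface_hub client. A minimal sketch, assuming the filenames shown (verify the current ones on the repo pages) and that you run it from the folder containing ComfyUI:

```python
# Minimal download sketch using huggingface_hub instead of the web UI.
# Filenames are assumptions -- check the repo pages for the current versions.
from huggingface_hub import hf_hub_download

# LTX-Video checkpoint -> ComfyUI/models/checkpoints
hf_hub_download(
    repo_id="Lightricks/LTX-Video",
    filename="ltx-video-2b-v0.9.safetensors",   # assumed filename
    local_dir="ComfyUI/models/checkpoints",
)

# T5 text encoder -> ComfyUI/models/clip
hf_hub_download(
    repo_id="comfyanonymous/flux_text_encoders",
    filename="t5xxl_fp8_e4m3fn.safetensors",    # fp8 variant; use fp16 if you have the VRAM
    local_dir="ComfyUI/models/clip",
)
```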
LTX Video, Right at Home: Tips, Tricks, and Truth
Can LTX-Video generate high-quality videos in real-time?
No. Unless you have a very flexible definition of "high-quality." Or work on a Bond villain's supercomputer.
Is LTX-Video Fast?
Yes. Even with a high number of steps (100+), it is still probably the fastest model you can run locally.
Is the Output Worth It?
The quality sits somewhere between a certain overhyped subscription service, where it takes days to generate a video and then it fails, and another subscription-based service, which is probably the best currently on the market. The quality is getting close to a certain model that was recently deployed on Civitai. So for local generation experiments, the answer is yes.
Myth 1: LTX Video can generate realistic videos of people
While this is true for closeups, if there is more motion in the image or a character is far from the camera, all hell breaks loose. With added resolution and steps you can partially solve this, but you risk a severe case of the Motionless Movie (read further).
Myth 2: LTX Video can create long videos
The suggested maximum is 257 frames. However, the longer the video, the more you risk artifacts and abominations, which may occur at random. I suggest lengths of 4-7 seconds (96+1 to 168+1 frames); with some subjects you can get away with more.
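LTX-Video expects frame counts of the form 8·k + 1, which is where the 96+1 and 168+1 above come from. A small hedged helper, assuming 24 fps, to turn a target duration into a valid frame count:

```python
# Hedged helper: LTX-Video frame counts take the form 8*k + 1
# (97, 121, ..., up to the suggested 257). Convert seconds -> frames at 24 fps.
def valid_frame_count(seconds: float, fps: int = 24, max_frames: int = 257) -> int:
    raw = round(seconds * fps)
    frames = (raw // 8) * 8 + 1   # snap down to a multiple of 8, then add 1
    return min(max(frames, 9), max_frames)

print(valid_frame_count(4))   # 97
print(valid_frame_count(7))   # 169
print(valid_frame_count(12))  # capped at 257
```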
Myth 3: The Bigger the Resolution the Better
The recommended resolution is under 720 x 1280 (and should be divisible by 32). This is nice. However, lower resolutions tend to create livelier scenes; at large resolutions the output will often move only slightly or not at all. A good starting resolution is 512 x 768.
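To avoid off-by-a-few resolutions, here is a small hedged helper that snaps width and height to multiples of 32 and flags anything above the recommended 720 x 1280 budget (thresholds taken from the guidance above):

```python
# Hedged helper: snap a requested resolution to multiples of 32 and warn
# when the pixel budget exceeds the recommended 720 x 1280.
def snap_resolution(width: int, height: int, multiple: int = 32) -> tuple[int, int]:
    snap = lambda v: max(multiple, round(v / multiple) * multiple)
    w, h = snap(width), snap(height)
    if w * h > 720 * 1280:
        print(f"Warning: {w}x{h} exceeds the recommended 720x1280 pixel budget")
    return w, h

print(snap_resolution(512, 768))   # (512, 768) -- a good starting point
print(snap_resolution(510, 770))   # (512, 768)
```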
Troubleshooting Common Issues and Errors
The Case of a Pretentious Prompt
How to prompt: in an image-to-video situation, try running the input image through a multimodal (vision) LLM. Then work with that description, altering it into a prompt that fits a video. In my tests, this approach produced the most consistent results.
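A minimal sketch of that approach with a local vision model through the ollama Python client; the model name (llava) and the input filename are assumptions, and my ArtAgents utility mentioned later in this article wraps the same idea:

```python
# Sketch: describe the img2vid input image with a local vision LLM, then
# reuse the description as the basis of the video prompt.
# Assumes a local Ollama server with a vision model pulled (e.g. `ollama pull llava`).
import ollama

response = ollama.chat(
    model="llava",                       # assumed model name
    messages=[{
        "role": "user",
        "content": (
            "Describe this image in one paragraph, then suggest how the scene "
            "could move as a short, single-shot video clip (camera and subject motion)."
        ),
        "images": ["input_frame.png"],   # your img2vid input image
    }],
)
print(response["message"]["content"])
```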
The Case of a Motionless Movie
This one is nasty. It often happens that the output video barely moves, especially in image-to-video generations. This is very frustrating, because sometimes it seemingly just won't budge. Try changing the seed, the resolution, or the length of the video. Changing the prompt may or may not help. Try this workflow:
If you pass the input image through VideoHelperSuite (install it with the Manager), the success rate of avoiding a Motionless Movie seems much higher. Note that the process does not visibly change the input image (no visible quality downgrade). It also works at higher resolutions. You can test the workflow ltxvideo_I2V-motionfix.json from the sandner-art GitHub; the workflow has comments explaining its function.
The Case of a Switching Slideshow
This occurs in img2vid: the input image stays motionless and is then switched to a scene, often from another horrible reality. You can avoid this by changing the prompt, seed, length, or resolution. Try the motion-fixing workflow mentioned in the previous chapter. Your best bet is to use an LLM to adjust your prompt into more natural language and to avoid terms suggesting a scene switch or a movie cut (or several separate camera movements).
You may use any good online advanced LLM chat or use my free open utility ArtAgents (choose the agent 'Video') for your next local Ollama adventure.
The Case of a Slithering Spectre
When you have quite good camera movement and composition in the output video, but the subjects start to transform midway or are not well defined, try changing the number of steps (even way over 100) to get better details. You may also try testing CFG values in the range of 2-5, starting at the middle ground of 3.5. Some subjects render better at lower CFG values; I got the best results with CFG 5 in image-to-video, for whatever reason. Higher CFG values tend to overbake the image.
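If you want to isolate the effect of steps and CFG without rebuilding a ComfyUI graph each run, a parameter sweep can also be scripted with the diffusers LTX pipelines. A hedged sketch, assuming a recent diffusers release that ships LTXImageToVideoPipeline and enough VRAM for bfloat16; check the names and arguments against your installed version:

```python
# Hedged sketch: sweep CFG values on a fixed seed with the diffusers
# LTXImageToVideoPipeline (an alternative to ComfyUI for quick A/B tests).
import torch
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = LTXImageToVideoPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("input_frame.png")
prompt = "A slow dolly shot of the subject, natural motion, consistent details"

for cfg in (2.0, 3.5, 5.0):
    frames = pipe(
        image=image,
        prompt=prompt,
        width=512,
        height=768,
        num_frames=97,                 # 4 seconds at 24 fps (8*k + 1)
        num_inference_steps=100,
        guidance_scale=cfg,
        generator=torch.Generator("cuda").manual_seed(42),  # fixed seed for a fair comparison
    ).frames[0]
    export_to_video(frames, f"spectre_cfg_{cfg}.mp4", fps=24)
```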
Conclusion
It does not produce 24 FPS HQ videos at 768x512 resolution faster than they can be watched, at least not on normal hardware. But it is so fast that it is setting its own standard for local video generation. If this is the direction video generators are going, I am all for it.
Resources and References
- Comfy LTXV support https://blog.comfy.org/ltxv-day-1-comfyui/
- LTXV https://huggingface.co/Lightricks/LTX-Video
- LTXV official documentation https://www.lightricks.com/ltxv-documentation
- .webp viewer (QVIEW) https://interversehq.com/qview/download/
- ArtAgents for prompt adjustments https://github.com/sandner-art/ArtAgents