LTX-Video Locally: Facts and Myths Debunked. Tips Included.

The LTX-Video model producing fast, high-resolution outputs in ComfyUI

LTX-Video is a very fast, DiT-based video generation model that can generate high-quality videos on consumer hardware; in some aspects, the results are comparable to professional, subscription-based generative video models.

NVIDIA GPU Requirements

After some experiments, I would say the minimum VRAM for effective use is 24GB. You can run it on 16GB without too much trouble (as I did), and you could probably run it on 12GB with considerably more trouble. This is still pretty good considering the speed and requirements of other open models (not to mention the results, which, in the case of LTX-Video, are comparatively good).

Installation

Installation is easy: download the models and the workflows for ComfyUI.

LTX Video, Right at Home: Tips, Tricks, and Truth

Can LTX-Video generate high-quality videos in real-time?

No. Unless you have a very flexible definition of "high-quality." Or work on a Bond villain's supercomputer.

Is LTX-Video Fast?

Yes. Even with a high number of steps (100+), it is still probably the fastest model you can run locally.

Is the Output Worth It?

The quality sits somewhere between a certain overhyped subscription service, where it takes days to generate a video that then fails, and another subscription-based service that is probably the best currently on the market. It is getting close to a certain model that was recently deployed on Civitai. So for local generation experiments, the answer is yes.

Myth 1: LTX Video can generate realistic videos of people

While this is true for close-ups, if there is more motion in the image or a character is far from the camera, all hell breaks loose. With added resolution and steps, you can partially solve this, but you risk a severe case of Motionless Movie (read further). I would say you can still get quite nice details in medium shots (see examples).

Myth 2: LTX Video can create long videos

The suggested maximum is 257 frames. However, the longer the video, the greater the risk of artifacts and abominations, which may appear at random. I suggest lengths of 4-7 seconds (96+1 to 168+1 frames); with some subjects you can get away with more.
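The frame counts above follow the usual "seconds × fps + 1" pattern (the extra frame is the starting one). A minimal sketch of the arithmetic, assuming the 24 fps that the numbers above imply:

```python
# Frame-count helper: LTX-Video clip lengths are (seconds * fps) + 1 frames.
# 24 fps is an assumption that matches the 96+1 / 168+1 figures above.
FPS = 24

def frames_for(seconds: float, fps: int = FPS) -> int:
    """Return the frame count for a clip of the given length in seconds."""
    return int(seconds * fps) + 1

for s in (4, 5, 6, 7):
    print(f"{s} s -> {frames_for(s)} frames")  # 97, 121, 145, 169
```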

Myth 3: The Bigger the Resolution the Better

The recommended resolution is under 720 x 1280 (the dimensions should be divisible by 32), which is nice. However, lower resolutions tend to create livelier scenes: at large resolutions, the scene will often move only slightly or not at all (you can remedy this to a large extent; try the motion-fixing workflows). The starting resolution is 512 x 768.
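If you experiment with non-default sizes, a tiny helper that snaps a target resolution down to the nearest multiple of 32 saves some trial and error. This is just an illustrative sketch, not part of any official tooling:

```python
# Snap a requested width/height down to a multiple of 32, as LTX-Video expects.
def snap_down_to_32(value: int) -> int:
    """Round down to the nearest multiple of 32 (minimum 32)."""
    return max(32, (value // 32) * 32)

print(snap_down_to_32(720), snap_down_to_32(1280))  # 704 1280
print(snap_down_to_32(512), snap_down_to_32(768))   # 512 768
```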

Myth 4: I Can Prompt LTX Video Like It's 2022 and Get Hollywood Results

Similar to Flux or SD 3.5, you can get away with simple prompts in some scenes. For higher-quality outputs, however, you need automatic prompt extension in the form of LLM nodes, or prompt expansion using an online or offline chat, unless you are a compulsive novelist.

The content and wording of such text additions must be relevant to the input image and must not interfere with the intended movements of the subjects or the camera (read more in 'The Case of a Pretentious Prompt' section). Sometimes you need to manually adjust such prompts and remove problematic parts. There is no general advice here, as it depends on the specific scene you are describing: in some cases, elements like 'wind' or 'lightning' may cause trouble, while in other scenes such descriptions cause no issues.

This is not specific to LTX Video; other video/image models often have similar behavior.
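For the offline-chat route, a local Ollama instance works well for this kind of prompt expansion. The sketch below only illustrates the idea; the "llama3" model name is an assumption, so swap in whatever chat model you have pulled locally:

```python
# Expand a terse idea into a detailed video prompt with a local Ollama model.
# Requires the `ollama` Python package and a running Ollama server;
# "llama3" is an assumed model name -- use any chat model you have pulled.
import ollama

SYSTEM = (
    "Rewrite the user's idea as a single, detailed video-generation prompt. "
    "Describe the scene, subjects, movements, camera, and lighting in natural "
    "language. Stay under roughly 200 words and do not describe scene cuts."
)

def expand_prompt(idea: str, model: str = "llama3") -> str:
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": idea},
        ],
    )
    return response["message"]["content"]

print(expand_prompt("a red-haired woman walking through a village at night"))
```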

Myth 5: Everything I Write in the Prompt Will Happen in the Movie, Just Like in a Movie.

No. The model does not recognize many camera movements and cannot reproduce them reliably. This may change when motion LoRAs or fine-tunings are introduced, if ever. However, this entry is not meant to discourage you from experimenting with camera settings or cinematography jargon. Just bear in mind that, like with any generative model, results may vary by changing the seed. Currently, there is no motion model that is 'reliable' in this sense.

Myth 6: Creating Videos is Easy and Fast—I Can Just Use the Same Prompt for Any Scene and Modify the Details.

No, myth debunked. (Well, technically it's true, but you know what I mean.)

Troubleshooting Common Issues and Errors

Start with main action in a single sentence
Add specific details about movements and gestures
Describe character/object appearances precisely
Include background and environment details
Specify camera angles and movements
Describe lighting and colors
Note any changes or sudden events
Lightricks, makers of LTX-Video, on prompt structure (as you can see, such a prompt would not be easy for a human to create without any AI help)

The Case of a Pretentious Prompt

How to prompt: in an image-to-video situation, try running the input image through a multimodal (vision) LLM. Then work with that description, altering the prompt to fit a video. In my tests, this approach produced the most consistent results.

Example (DO NOT): a village, woman, ((smiling)), red hair, (moon:2)

Example (DO): A detailed and vivid tranquil village scene under the soft glow of a full moon. In the heart of the village, there is a charming cobblestone path leading to a quaint cottage. Standing in front of the cottage is a woman with vibrant red hair, her face illuminated by the moonlight, revealing a warm and genuine smile. Describe the surroundings, the woman's appearance, and the overall atmosphere of the scene. Include sensory details such as the sounds, scents, and the gentle breeze that adds to the serene ambiance.

Note that the recommended prompt length for LTX-Video is around 200 words (the limit is either arbitrary or perhaps depends on the text encoder used in the workflow). Therefore, the final prompt for a video could look like the following (...though both prompts could actually work):

Real-life footage video of a tranquil village bathed in the soft glow of a full moon. The video begins with a slow camera movement, gently panning over charming cottages and cobblestone paths. The focal point is a woman with vibrant red hair, walking gracefully along the path leading to a quaint cottage. Her dress flows gently with her movements, and her warm, genuine smile is beautifully illuminated by the moonlight.

In the case of img2vid, of course, the input image should fit the description above.
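Putting the two previous points together, here is a hedged sketch of describing the img2vid input image with a local vision model and checking the rough 200-word budget. The "llava" model name and the image path are assumptions; use whatever multimodal model you run locally:

```python
# Describe an img2vid input image with a local multimodal model via Ollama,
# then check the rough ~200-word budget mentioned above.
# "llava" and the image path are placeholders used for illustration.
import ollama

def describe_image(path: str, model: str = "llava") -> str:
    response = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": ("Describe this image as the opening shot of a video: "
                        "scene, subjects, their movements, camera, and lighting."),
            "images": [path],
        }],
    )
    return response["message"]["content"]

description = describe_image("village_input.png")
words = len(description.split())
print(description)
print(f"{words} words" + (" -- consider trimming" if words > 200 else ""))
```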

The Case of a Motionless Movie

This one is nasty. It often happens that the output video barely moves, especially in image-to-video generations. This is very frustrating, because sometimes it seemingly just won't budge. Try changing the seed, the resolution, or the length of the video. Changing the prompt may or may not help. Try this workflow:

If you pass the input image through VideoHelperSuite (install it with the Manager), the success rate for avoiding the Motionless Movie seems much better. Note that the process does not visibly change the input image (no visible quality downgrade), and it also works at higher resolutions. You can test the workflow ltxvideo_I2V-motionfix.json from the sandner-art GitHub; the workflow has comments explaining its function.

The Case of a Switching Slideshow

This occurs in img2vid: the input image stays motionless and then switches to a different scene, often from another horrible reality. You can avoid this by changing the prompt, seed, length, or resolution, or by trying the motion-fixing workflow mentioned in the previous chapter. Your best bet is to use an LLM to rewrite your prompt in more natural language, avoiding terms that suggest a scene switch or a movie cut (or several separate camera movements).

You may use any good online LLM chat, or use my free open utility ArtAgents (choose the 'Video' agent) for your next local Ollama adventure.

The Case of a Slithering Spectre

When you have quite good camera movement and composition in the output video, but the subjects start to transform midway or are not well defined, try changing the number of steps (even to way over 100) to get better details. You may also try testing CFG values in the range of 2-5, starting at the middle ground of 3.5. Some subjects render better at lower CFG values, although I got the best results with CFG 5 in image-to-video, for whatever reason. Higher CFG values tend to overbake the image.
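If you prefer to test CFG values systematically rather than clicking through the UI, you can queue runs against ComfyUI's HTTP API using a workflow exported via "Save (API Format)". The node id and file name below are assumptions that depend on your particular workflow, so check your exported JSON:

```python
# Sweep CFG values 2-5 by queueing the same API-format workflow repeatedly.
# The node id "3", the "cfg" input name, and the file name are assumptions --
# check the JSON you exported with "Save (API Format)".
import json
import requests

COMFY_URL = "http://127.0.0.1:8188/prompt"
SAMPLER_NODE_ID = "3"  # id of the sampler node in your exported workflow

with open("ltxv_i2v_api.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

for cfg in (2.0, 3.0, 3.5, 4.0, 5.0):
    workflow[SAMPLER_NODE_ID]["inputs"]["cfg"] = cfg
    requests.post(COMFY_URL, json={"prompt": workflow}).raise_for_status()
    print(f"queued run with CFG {cfg}")
```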

Other Tips

CLIP Text Encoding Takes Up Too Much VRAM

Using text encoders to process prompts takes up VRAM, which may cause very slow rendering of the entire video (or image) or other issues. One option is to force CLIP onto the CPU with the 'Force/Set CLIP Device' node set to CPU. However, the node pack this comes from installs a lot of other components, which you may or may not want; it may not behave as expected on your configuration, and the parsing can be quite slow, depending on the length of the text prompt.

I use a "dry run" trick: first render a 1-5 step video just to parse the prompt (as a bonus, the frames of a 5-step test run give you a vague idea of the camera movement, and the rendering is reasonably fast). Then I render a higher-step version while the prompt is already parsed (you could make this even more effective by lowering the resolution and length for the test run, but changing the values back and forth can become impractical, and you lose the test-run function). Alternatively, you may start the generation, cancel the run once it has parsed the prompt, and run it again. This works only until you change the prompt again.
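The same API approach can automate the dry-run trick: queue a throwaway low-step pass so the text encoding gets done and cached, then queue the full render with the identical prompt. As before, the node id and file name are assumptions based on a workflow saved in API format:

```python
# Automate the "dry run": a 4-step pass to get the prompt parsed and cached,
# followed by the real high-step render. Node id "3" is an assumption.
import json
import requests

COMFY_URL = "http://127.0.0.1:8188/prompt"
SAMPLER_NODE_ID = "3"

with open("ltxv_i2v_api.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

for steps in (4, 100):  # dry run first, then the real render
    workflow[SAMPLER_NODE_ID]["inputs"]["steps"] = steps
    requests.post(COMFY_URL, json={"prompt": workflow}).raise_for_status()
    print(f"queued {steps}-step run")
```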

Using Animated .webp Format

Some example workflows produce animated .webp output. You can use the open-source QVIEW app to preview .webp files. I have found the format useful, so I have kept it in some workflows (just be aware that a quality setting under 90 really lowers the quality of the video). Some examples also include .mp4 outputs via VideoHelperSuite nodes.
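If you ever need the individual frames of an animated .webp (for example, to re-encode them elsewhere), Pillow can read them. A minimal sketch; the file name is a placeholder:

```python
# Extract frames from an animated .webp output using Pillow.
# "output.webp" is a placeholder file name.
from PIL import Image, ImageSequence

with Image.open("output.webp") as anim:
    print(f"{getattr(anim, 'n_frames', 1)} frames")
    for i, frame in enumerate(ImageSequence.Iterator(anim)):
        frame.convert("RGB").save(f"frame_{i:04d}.png")
```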

Image to Video Low Steps

As I have stated before, more steps help fix details in the scene (note that this can also change the movement). In an image-to-video workflow, you may try starting with 10 steps for simple scenes and achieve some results if you are not aiming for photorealism.

PixArt Encoders

Instead of 't5xxl', you may experiment with 'PixArt' text encoders (there is an example in the workflows). Although the output often has a somewhat SDXL vibe, it seems to have better prompt adherence in conjunction with the LTXV model. It also produces better video at a lower number of steps (compared with the same number of steps using t5xxl).

PixArt encoders will probably run slower on your machine, but again you may use the "dry run" trick: let the encoder parse the prompt, cancel the current run, and then run it again; the video should render several times faster.

Conclusion

It does not produce 24 FPS HQ videos at 768x512 faster than they can be watched, at least not on normal hardware. But it is so fast that it is setting its own standard for local video generation. If this is the direction video generators are heading, I am all for it.

