LTX-2 Technical Guide: The New Standard for Local Audio-Video Generation
Released in January 2026, LTX-2 is a major step forward for local video generation. Developed by Lightricks, this 19-billion-parameter model uses a joint diffusion architecture that produces high-fidelity video and audio in sync. Unlike modular approaches that add audio after the video is generated, LTX-2 processes both streams within a single transformer, keeping them temporally coherent for clips of up to 20 seconds.
An Asymmetric DiT Architecture for Audio-Video Diffusion
At the core of LTX-2 lies an asymmetric dual-stream Diffusion Transformer (DiT). Of its 19 billion parameters, 14 billion are dedicated to video processing and 5 billion to audio, linked via bidirectional cross-attention layers.
This structure enables constant interaction: the video stream queries audio latents and vice versa, ensuring precise frame-level alignment without temporal drift. Text conditioning is handled by the Gemma-3-12B encoder. This design choice decouples semantic understanding from media generation, facilitating future optimizations and fine-tuning. For further technical details, refer to the LTX-2 arXiv paper.
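The exchange between the two streams can be illustrated with a minimal sketch. It is not the LTX-2 implementation: it assumes a single attention head, no learned projections, and that video and audio tokens share the same width, and the names `cross_attend` and `bidirectional_exchange` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query token takes a softmax-weighted average of the context tokens.

    Assumption: queries and context share the same width and are used
    directly as Q, K and V (a real DiT applies learned projections first).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[i] for w, c in zip(weights, context)) for i in range(dim)]
        )
    return attended


def bidirectional_exchange(video_tokens, audio_tokens):
    """Video queries audio, audio queries video, each with a residual add."""
    video_from_audio = cross_attend(video_tokens, audio_tokens)
    audio_from_video = cross_attend(audio_tokens, video_tokens)
    new_video = [
        [x + d for x, d in zip(tok, upd)]
        for tok, upd in zip(video_tokens, video_from_audio)
    ]
    new_audio = [
        [x + d for x, d in zip(tok, upd)]
        for tok, upd in zip(audio_tokens, audio_from_video)
    ]
    return new_video, new_audio


# Toy latents: 3 video tokens and 2 audio tokens, width 2.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0]]
new_video, new_audio = bidirectional_exchange(video, audio)
print(len(new_video), len(new_audio))
```

Both directions read the same pre-update latents, so neither stream is privileged; each keeps its own token count and only gains information from the other.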
What Is the Real-World Performance on RTX 5090 and High-End Hardware?
Deploying LTX-2 requires careful resource management, particularly for VRAM. Observed performance varies significantly with the chosen pipeline (1-stage, 2-stage, or multi-tile), the scheduler, the step count, and ComfyUI overhead.
Technical Comparison of Quantization Formats
| Version | Quantization | Estimated VRAM Usage | Observed Time (720p, ~6 s clip, RTX 5090) |
|---|---|---|---|
| LTX-2 Dev | BF16 (Full) | ~32–38 GB | ~100s+ |
| LTX-2 FP8 | FP8 (E4M3) | ~16–27 GB | ~50s |
| LTX-2 NVFP4 | NVFP4 (4-bit) | ~10–20 GB | ~40–66s |
| LTX-2 Distilled | BF16/FP8 | ~13–19 GB | ~2–5s (Preview) |
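Note on NVFP4 format: NVFP4 targets NVIDIA Blackwell (50-series) GPUs, whose Tensor Cores support 4-bit floating point natively. Ada (40-series) GPUs accelerate FP8 but lack native FP4 support, so they should not be expected to see the same gains. In theory, the format offers up to a 3x speedup over BF16. However, community benchmarks on the RTX 5090 show a range of 40 to 66 seconds for a 6-second video, which overlaps the ~50 s observed for FP8 and highlights the impact of software configuration and Triton kernels.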
Official Lightricks models are available on Hugging Face. Note that the ltx-2-19b-dev-fp4.safetensors file corresponds to the NVFP4 format.
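As a rough cross-check of the VRAM column, the weight footprint alone can be estimated from the parameter count and the bits per weight. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes every parameter is stored at the nominal width, ignores per-block scale metadata for NVFP4, and excludes activations, the text encoder, and any offloading.

```python
# Nominal storage width per weight for each format (assumption: uniform width).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}

# Parameter counts quoted in this guide.
TOTAL_PARAMS = 19e9
VIDEO_PARAMS = 14e9
AUDIO_PARAMS = 5e9


def weight_footprint_gb(num_params, bits_per_weight):
    """Weights-only size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    total = weight_footprint_gb(TOTAL_PARAMS, bits)
    video = weight_footprint_gb(VIDEO_PARAMS, bits)
    audio = weight_footprint_gb(AUDIO_PARAMS, bits)
    print(f"{name}: total {total:.1f} GB (video {video:.1f} GB, audio {audio:.1f} GB)")
```

The gap between these weights-only figures and the ranges in the table comes from everything the estimate leaves out, which is why the same checkpoint can report quite different usage depending on the workflow.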
The Case for Community GGUF Conversions
GGUF is not officially supported by Lightricks for LTX-2; the available files are community-driven conversions. On a machine equipped with an RTX 5090, GGUF provides no performance benefit over native FP8 or NVFP4, because dequantizing the weights at inference time adds CPU/GPU latency.
Modality-CFG Innovation: Refining Audio-Visual Coherence
Modality-CFG is one of LTX-2’s strongest differentiators. This technique allows for independent adjustment of text influence (st) and cross-modal guidance (sm).
Based on technical documentation and user feedback, the following values are commonly reported starting points rather than universal rules:
- Video Stream: A setting of st=3 and sm=3 is often recommended to maintain prompt fidelity without sacrificing motion fluidity.
- Audio Stream: Higher text guidance (st=7) favors speech intelligibility, while sm=3 preserves synchronization with the visual action.
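To make the two scales concrete, here is a minimal sketch of how a guided prediction could be combined from three model outputs. The formula is an assumption modeled on standard classifier-free guidance with two independent terms; the exact formulation used by LTX-2 is defined in its paper and may differ. The function name `modality_cfg` and its arguments are hypothetical.

```python
# Starting points quoted above (not universal rules).
VIDEO_GUIDANCE = {"s_t": 3.0, "s_m": 3.0}
AUDIO_GUIDANCE = {"s_t": 7.0, "s_m": 3.0}


def modality_cfg(full, no_text, no_cross_modal, s_t, s_m):
    """Combine three predictions for one stream, element by element.

    full           -- prediction with text and the other modality present
    no_text        -- prediction with the text prompt dropped
    no_cross_modal -- prediction with the other modality dropped
    Assumed form: with s_t = s_m = 1 the result is simply `full`; larger
    scales push further along each conditioning direction.
    """
    return [
        f + (s_t - 1.0) * (f - nt) + (s_m - 1.0) * (f - nm)
        for f, nt, nm in zip(full, no_text, no_cross_modal)
    ]


# Toy two-element predictions for the video stream.
full = [1.0, 2.0]
no_text = [0.5, 1.0]
no_cross_modal = [0.8, 2.5]
guided_video = modality_cfg(full, no_text, no_cross_modal, **VIDEO_GUIDANCE)
print(guided_video)
```

Keeping the two terms separate is what lets the audio stream use stronger text guidance than the video stream while both share the same cross-modal scale.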
To explore these settings, we recommend using the ComfyUI-LTXVideo integration via the official repository.
IC-LoRA Pipelines and Structural Control
For advanced users, LTX-2 supports IC-LoRA pipelines, enabling video-to-video transformations with precise control. These adapters (Canny, Depth, Pose) inject structural constraints to guide generation while maintaining the source image’s coherence. To deepen your knowledge of these workflows, consult our advanced ComfyUI guides.

FAQ
Can LTX-2 run with less than 16 GB of VRAM?
Running with less than 16 GB is possible but highly restrictive. It typically requires weight streaming (offloading weights to system RAM), which can increase latency by 50% to 200% depending on PCIe bandwidth.
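For a sense of scale, the quoted overhead range can be applied to a baseline time. The sketch below is simple arithmetic, not a benchmark: the 50-second baseline is the FP8 figure from the comparison table, and the helper name `offload_time_range` is hypothetical.

```python
def offload_time_range(base_seconds, min_overhead=0.5, max_overhead=2.0):
    """Return (best, worst) generation time once weight streaming is enabled.

    Overheads are fractions of the baseline: 0.5 means +50%, 2.0 means +200%.
    """
    best = base_seconds * (1.0 + min_overhead)
    worst = base_seconds * (1.0 + max_overhead)
    return best, worst


best, worst = offload_time_range(50.0)
print(f"Estimated time with offloading: {best:.0f} s to {worst:.0f} s")
```

Where a given machine lands in that range depends mostly on how quickly weights can be moved between system RAM and the GPU.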
What are the exact specifications of the generated audio?
LTX-2 natively generates 24 kHz stereo audio via a modified HiFi-GAN vocoder. While high quality for foley or speech, it is not 48 kHz professional-grade audio.
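The practical difference from 48 kHz audio is easiest to see as numbers. The following sketch only does arithmetic on the figures above (24 kHz, stereo, clips of up to 20 seconds); the helper names are illustrative.

```python
SAMPLE_RATE_HZ = 24_000
CHANNELS = 2  # stereo


def total_samples(duration_s, sample_rate_hz=SAMPLE_RATE_HZ, channels=CHANNELS):
    """Number of individual sample values across all channels."""
    return int(round(duration_s * sample_rate_hz)) * channels


def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


print(total_samples(20))    # sample values in a 20-second stereo clip
print(nyquist_hz(24_000))   # upper frequency limit at 24 kHz
print(nyquist_hz(48_000))   # upper frequency limit at 48 kHz
```

At 24 kHz, content above 12 kHz cannot be represented, which matters little for speech but is audible on cymbals, sibilance, and other high-frequency material.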
Where can I download the official models?
Official weights and configuration files are hosted on the LTX-2 Hugging Face Model Card.
The capabilities of LTX-2, while revolutionary for local creation, demand a sophisticated understanding of the interaction between hardware and quantization formats. The rapid evolution of optimization kernels, such as SageAttention, promises substantial performance gains in the coming months for RTX 5090 owners.
