LTX-2 Technical Guide: NVFP4, ComfyUI, and RTX 5090 Optimization

Released in January 2026, LTX-2 marks a significant step forward for local video generation. Developed by Lightricks, this 19-billion-parameter model uses a joint diffusion architecture that produces high-fidelity video together with synchronized audio. Unlike modular approaches that add audio after the video has been generated, LTX-2 handles both streams within a single transformer, which keeps them temporally coherent for clips of up to 20 seconds.

An Asymmetric DiT Architecture for Audio-Video Diffusion

At the core of LTX-2 lies an asymmetric dual-stream Diffusion Transformer (DiT). Of its 19 billion parameters, 14 billion are dedicated to video processing and 5 billion to audio, linked via bidirectional cross-attention layers.

This structure enables constant interaction: the video stream queries audio latents and vice versa, ensuring precise frame-level alignment without temporal drift. Text conditioning is handled by the Gemma-3-12B encoder. This design choice decouples semantic understanding from media generation, facilitating future optimizations and fine-tuning. For further technical details, refer to the LTX-2 arXiv paper.
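
To make the bidirectional exchange concrete, here is a minimal pure-Python sketch of one cross-attention step between two token streams. It is an illustration only, not the LTX-2 implementation: it uses a single head, omits the learned query/key/value projections, and assumes both streams share the same width (the real model is asymmetric). All function names and toy sizes are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head attention: every query token takes a weighted mix of context tokens.

    Learned Q/K/V projections are omitted, so tokens act as their own keys and values.
    """
    dim = len(queries[0])
    mixed = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        mixed.append([sum(w * tok[j] for w, tok in zip(weights, context)) for j in range(dim)])
    return mixed


def bidirectional_cross_attention(video, audio):
    """Update both streams from each other's pre-update tokens, with residual adds."""
    video_from_audio = cross_attend(video, audio)  # video queries audio
    audio_from_video = cross_attend(audio, video)  # audio queries video
    new_video = [[x + d for x, d in zip(tok, upd)] for tok, upd in zip(video, video_from_audio)]
    new_audio = [[x + d for x, d in zip(tok, upd)] for tok, upd in zip(audio, audio_from_video)]
    return new_video, new_audio


# Toy latents: 3 video tokens and 2 audio tokens of width 4 (hypothetical sizes).
video = [[0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
audio = [[1.0, 0.0, 1.0, 0.0], [0.2, 0.2, 0.2, 0.2]]
new_video, new_audio = bidirectional_cross_attention(video, audio)
print(len(new_video), len(new_audio))  # token counts are unchanged: 3 2
```

Because each direction reads the other stream's tokens before either is updated, neither modality is treated as a fixed conditioning signal for the other.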

What Is the Real-World Performance on RTX 5090 and High-End Hardware?

Deploying LTX-2 requires rigorous resource management, particularly regarding VRAM. Observed performance varies significantly based on the chosen pipeline (1-stage, 2-stage, or multi-tile), the scheduler, the step count, and the ComfyUI overhead.

Technical Comparison of Quantization Formats

Version	Quantization	Estimated VRAM Usage	Observed Generation Time (720p, ~6 s clip, RTX 5090)
LTX-2 Dev	BF16 (Full)	~32–38 GB	~100s+
LTX-2 FP8	FP8 (E4M3)	~16–27 GB	~50s
LTX-2 NVFP4	NVFP4 (4-bit)	~10–20 GB	~40–66s
LTX-2 Distilled	BF16/FP8	~13–19 GB	~2–5s (Preview)
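
The VRAM ranges in the table can be sanity-checked with a weights-only estimate. The sketch below multiplies the 19 billion parameters by a nominal number of bits per weight. It assumes every weight is stored in the named format, and it assumes NVFP4 costs about 4.5 bits per weight (4-bit values plus one 8-bit scale per 16-value block). Real checkpoints may keep some layers in higher precision, and activations, the text encoder, and the VAE come on top.

```python
# Parameter counts from the article; bit widths are nominal storage sizes.
VIDEO_PARAMS = 14e9
AUDIO_PARAMS = 5e9
TOTAL_PARAMS = VIDEO_PARAMS + AUDIO_PARAMS

BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,
    # Assumption: 4-bit values plus one 8-bit scale per 16-value block -> 4.5 bits.
    "NVFP4": 4.0 + 8.0 / 16.0,
}


def weight_memory_gb(num_params, bits_per_weight):
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8.0 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights alone come to roughly 38 GB in BF16, 19 GB in FP8, and 10.7 GB in NVFP4, which is in line with the upper BF16 bound and the lower NVFP4 bound in the table.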

Note on the NVFP4 format: NVFP4 is hardware-accelerated on NVIDIA Blackwell (RTX 50-series) GPUs. Ada (RTX 40-series) GPUs have native FP8 support but no FP4 tensor cores, so the speedup does not carry over to them. In theory, the format offers up to a 3x speedup over BF16. In practice, community benchmarks on the RTX 5090 report 40 to 66 seconds for a 6-second video, which shows how much software configuration and Triton kernels affect the result.

The NVFP4 format introduces a noticeable quality loss on diffusion models such as LTX-2. To choose the right format and optimizations, refer to our dedicated guide.

Official Lightricks models are available on Hugging Face. Note that the ltx-2-19b-dev-fp4.safetensors file corresponds to the NVFP4 format.
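
One way to check what a downloaded checkpoint actually contains is to read its safetensors header, which lists the stored dtype of every tensor without loading any weights. The sketch below relies only on the documented file layout (an 8-byte little-endian header length followed by a JSON header). The helper names are hypothetical, and the dtype labels you will see depend on how the file was exported.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian size
        return json.loads(f.read(header_len))


def dtype_histogram(header):
    """Count tensors per stored dtype, skipping the optional __metadata__ entry."""
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )
```

For example, `dtype_histogram(read_safetensors_header("ltx-2-19b-dev-fp4.safetensors"))` returns a count of tensors per dtype, which shows at a glance how much of the file is stored in a low-precision format.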

The Case for Community GGUF Conversions

The GGUF format is not officially supported by Lightricks for LTX-2; the GGUF files in circulation are community-driven conversions. On a machine equipped with an RTX 5090, GGUF provides no performance benefit over native FP8 or NVFP4, because dequantizing the weights at inference time introduces additional CPU/GPU latency.
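
The dequantization cost comes from block-quantized storage: weights are kept as small integer codes plus a per-block scale and must be expanded back to floats before they can be used. The sketch below imitates an 8-bit block scheme in the style of GGUF's Q8_0 (32 values sharing one scale). It is a simplified illustration with hypothetical function names; real GGUF kernels store the scale in half precision and run on the GPU.

```python
def quantize_block(values):
    """Quantize one block of floats to int8 codes with a single shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale, codes):
    """The extra expansion step needed before the weights can enter a matmul."""
    return [scale * c for c in codes]


block = [((i * 37) % 64 - 32) / 10.0 for i in range(32)]  # 32 toy weights
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.5f}, max round-trip error={max_error:.5f}")
```

Formats with native hardware support skip this expansion step, which is why GGUF tends to pay off mainly when VRAM, not speed, is the limiting factor.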

Modality-CFG Innovation: Refining Audio-Visual Coherence

Modality-CFG is one of LTX-2’s strongest differentiators. This technique allows independent adjustment of the text guidance scale (s_t) and the cross-modal guidance scale (s_m).

Based on technical documentation and user feedback, here are commonly recommended starting points (not universal rules) for balancing your generations:

Video Stream: A setting of s_t=3 and s_m=3 is often recommended to maintain prompt fidelity without sacrificing motion fluidity.
Audio Stream: Higher text guidance (s_t=7) favors speech intelligibility, while s_m=3 preserves synchronization with the visual action.
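
The two scales can be read as weights on two separate guidance directions. The sketch below shows one plausible form of such a two-term rule, in which the fully conditioned prediction is pushed away from the prediction without text and from the prediction without the other modality. This formula is an assumption for illustration; the exact equation, including which scale value means "no guidance", should be taken from the LTX-2 paper. The function name and toy numbers are hypothetical.

```python
def modality_cfg(full, no_text, no_cross, s_t, s_m):
    """Combine three denoiser predictions for one stream.

    full     : prediction with text and the other modality both present
    no_text  : prediction with the text prompt dropped
    no_cross : prediction with the other modality dropped
    Assumed rule: full + s_t * (full - no_text) + s_m * (full - no_cross)
    """
    return [
        f + s_t * (f - nt) + s_m * (f - nc)
        for f, nt, nc in zip(full, no_text, no_cross)
    ]


# Toy predictions for a 2-element latent (hypothetical numbers).
full, no_text, no_cross = [1.0, 2.0], [0.5, 2.0], [0.0, 1.0]
video_guided = modality_cfg(full, no_text, no_cross, s_t=3.0, s_m=3.0)
audio_guided = modality_cfg(full, no_text, no_cross, s_t=7.0, s_m=3.0)
print(video_guided, audio_guided)
```

Raising the text scale from 3 to 7 only changes components where the text actually alters the prediction, which matches the intuition that stronger text guidance helps speech content without disturbing synchronization.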

To explore these settings, we recommend using the ComfyUI-LTXVideo integration via the official repository.

IC-LoRA Pipelines and Structural Control

For advanced users, LTX-2 supports IC-LoRA pipelines, enabling video-to-video transformations with precise control. These adapters (Canny, Depth, Pose) inject structural constraints to guide generation while maintaining the source image’s coherence. To deepen your knowledge of these workflows, consult our advanced ComfyUI guides.

FAQ

Can LTX-2 run with less than 16 GB of VRAM?

Running with less than 16 GB is possible but highly restrictive. It typically requires weight streaming (offloading weights to system RAM), which can increase latency by 50% to 200% depending on PCIe bandwidth.
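
As a rough worked example of that range, the sketch below applies a 50% to 200% increase to a baseline generation time. The 50-second baseline is borrowed from the FP8 row of the comparison table purely for illustration; a card with less than 16 GB of VRAM would start from a slower baseline. The function name is hypothetical.

```python
def offload_latency_range(base_seconds, min_increase=0.5, max_increase=2.0):
    """Apply the quoted 50%-200% latency increase to a baseline time."""
    return base_seconds * (1.0 + min_increase), base_seconds * (1.0 + max_increase)


low, high = offload_latency_range(50.0)  # illustrative ~50 s baseline
print(f"{low:.0f}-{high:.0f} s")  # prints "75-150 s"
```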

What are the exact specifications of the generated audio?

LTX-2 natively generates 24 kHz stereo audio via a modified HiFi-GAN vocoder. While high quality for foley or speech, it is not 48 kHz professional-grade audio.
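
The practical difference between 24 kHz and 48 kHz follows from simple arithmetic: the highest frequency a signal can represent is half its sample rate. The short sketch below computes the sample counts for a maximum-length 20-second clip and the frequency ceiling at both rates. The helper name is hypothetical.

```python
SAMPLE_RATE_HZ = 24_000   # from the article
CHANNELS = 2              # stereo
MAX_CLIP_SECONDS = 20     # maximum clip length quoted in the article


def audio_stats(sample_rate_hz, channels, seconds):
    """Sample counts and Nyquist limit for an uncompressed PCM clip."""
    frames = sample_rate_hz * seconds
    return {
        "frames_per_channel": frames,
        "total_samples": frames * channels,
        "nyquist_hz": sample_rate_hz // 2,
    }


print(audio_stats(SAMPLE_RATE_HZ, CHANNELS, MAX_CLIP_SECONDS))
print(audio_stats(48_000, CHANNELS, MAX_CLIP_SECONDS)["nyquist_hz"])
```

At 24 kHz the output tops out at 12 kHz, which covers speech and most foley, whereas 48 kHz audio extends to 24 kHz.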

Where can I download the official models?

Official weights and configuration files are hosted on the LTX-2 Hugging Face Model Card.

The capabilities of LTX-2, while revolutionary for local creation, demand a sophisticated understanding of the interaction between hardware and quantization formats. The rapid evolution of optimization kernels, such as SageAttention, promises substantial performance gains in the coming months for RTX 5090 owners.
