ComfyUI: which format to choose, BF16, FP16, FP8, or GGUF?

Running artificial intelligence models locally is getting more and more accessible, especially thanks to interfaces like ComfyUI, which make it easy to orchestrate complex workflows for image generation, text processing, or semantic analysis. Whether you use Flux, Stable Diffusion, LLaMA 3, HiDream or CLIP, it’s essential to choose the right AI model format to strike the best balance between quality, speed, and memory usage.
Today, several formats coexist in ComfyUI: BF16, FP16, FP8 and quantized GGUF variants like Q2, Q4, Q5, Q6 or Q8. Each of these formats offers its own pros and cons depending on your hardware, especially how much VRAM you have, and on the task type: image generation, text processing, multi-model execution, and more.
The goal of this guide is to explain in detail which format to choose in ComfyUI based on your GPU, with a focus on two common user profiles:
- Users with an RTX 5090 and 32 GB of VRAM
- Users with an RTX 5070 Ti and 16 GB of VRAM
For GPUs with 8 GB of VRAM, apply the same logic, scaled down to the VRAM usage you actually measure.
Each case has specific needs in terms of performance, stability, and compatibility in ComfyUI. Your choice of AI model format (BF16, FP16, FP8, or GGUF) has a direct impact on GPU load, generation time, and final output quality. The crucial pitfall to avoid is exceeding available VRAM, which forces swapping between VRAM, system RAM, and SSD: execution times skyrocket and errors may occur. The objective? Find the right balance for your setup.
Understanding the formats available in ComfyUI
In ComfyUI, your choice of AI model format (whether BF16, FP16, FP8, or a quantized GGUF format) directly affects workflow performance. Each format is a different numerical representation of model weights, influencing precision, VRAM consumption, and inference speed.
Floating-point formats: BF16, FP16 and FP8
Floating-point formats are the closest to a model’s original FP32 (32-bit float) training weights. They maintain very good precision, but their memory footprint is higher than quantized formats.
BF16: BFloat16, precision for modern models
BF16 is a 16-bit format with a reduced mantissa (the significant digits that define effective precision) but the same exponent range as FP32, making it very well suited to modern AI models like LLaMA 3. It performs particularly well on recent GPUs with native BF16 hardware support (Ampere and newer, including Ada Lovelace and Blackwell cards such as the RTX 5090).

In ComfyUI, BF16 is mostly used for large language models (LLMs) and high-fidelity diffusion models where numerical stability is critical.
➡️ Pros: very good precision, excellent compatibility with complex ComfyUI workflows.
➡️ Cons: higher VRAM usage than quantized formats.
Note: not all models are available in BF16. Some weights must be converted manually from FP32 using specific tools.
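For reference, such a conversion only takes a few lines with PyTorch and the safetensors library; the snippet below is a minimal sketch and the file names are placeholders.
```python
# Minimal sketch: convert FP32 weights to BF16 (file names are placeholders).
# Assumes PyTorch and the safetensors package are installed.
import torch
from safetensors.torch import load_file, save_file

state_dict = load_file("model_fp32.safetensors")   # original FP32 checkpoint
bf16_dict = {
    name: t.to(torch.bfloat16) if t.is_floating_point() else t
    for name, t in state_dict.items()              # leave integer buffers untouched
}
save_file(bf16_dict, "model_bf16.safetensors")     # roughly half the original file size
```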
FP16: the standard format in ComfyUI for heavy models
FP16, or Half Precision Float, is the most commonly used format in ComfyUI to run heavy models like Stable Diffusion XL, HiDream i1, Flux, or high-definition VAE. It offers a solid trade-off between precision and performance.
Most ComfyUI workflows are optimized for FP16, especially for image decoding, vector generation, or CLIP encoding steps.
➡️ Pros: very well supported, fast on GPU, good precision.
➡️ Cons: significant VRAM load (16 GB+ for some models; ideally 24–32 GB for the newest models).
Example: running SDXL with VAE + UNet + T5 in FP16 can use between 18 and 24 GB of VRAM, depending on batch size and image dimensions.
FP8: an increasingly common format
FP8 is a lighter (8-bit) format increasingly used in practice. It can reduce memory consumption while speeding up certain stages, as long as the model supports it.
In ComfyUI, FP8 is appearing more and more often and is gradually becoming a speed-first default for some blocks.
➡️ Pros: very fast, minimal VRAM usage.
➡️ Cons: potential quality loss on sensitive models.
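To make the trade-off concrete, here is a minimal sketch (assuming a PyTorch build recent enough to expose torch.float8_e4m3fn) of the usual pattern: weights are stored in FP8 to halve memory versus FP16, then upcast at compute time, which is where the small precision loss comes from.
```python
# Sketch: store weights in FP8 (1 byte each) and upcast when they are actually used.
# Assumes torch >= 2.1, which exposes the float8_e4m3fn dtype.
import torch

w_fp16 = torch.randn(4096, 4096, dtype=torch.float16)
w_fp8 = w_fp16.to(torch.float8_e4m3fn)          # half the memory of FP16 storage
print(w_fp16.element_size(), "->", w_fp8.element_size(), "bytes per weight")

w_restored = w_fp8.to(torch.float16)            # upcast just before compute
print("max round-trip error:", (w_fp16 - w_restored).abs().max().item())
```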
Quantized GGUF formats: Q4, Q5, Q6, Q8
The GGUF format is the successor to GGML, designed to run quantized (size-reduced) models on CPU or GPU with minimal resources. Heavily used with llama.cpp, GGUF is now well integrated into ComfyUI.
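For a quantized LLM, one common way to sanity-check a GGUF file before wiring it into a ComfyUI workflow is llama-cpp-python; the sketch below uses an example file name and illustrative parameters.
```python
# Sketch: run a quantized GGUF LLM with llama-cpp-python (file name is an example).
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-8b-instruct.Q6_K.gguf",  # example Q6_K build
    n_gpu_layers=-1,                             # -1 offloads all layers to the GPU
    n_ctx=4096,                                  # context window size
)
out = llm("Rewrite this prompt as a cinematic scene description:", max_tokens=64)
print(out["choices"][0]["text"])
```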
Q8_0: the best quality among quantized formats
Q8_0 offers precision close to FP16 while reducing memory usage. It’s particularly useful for loading large AI models or when a workflow chains multiple models.
➡️ Pros: excellent precision, smaller size, easy to run on CPU/GPU.
➡️ Cons: slightly slower than FP16 on high-end GPUs.
Q6_K and Q5_K: the sweet spot for mixed workflows
Q6_K and Q5_K are very popular in ComfyUI when you want to load several models in the same pipeline while maintaining acceptable output quality.
- Q6_K lets fairly large models run within only 10–12 GB of VRAM.
- Q5_K fits lighter workflows or secondary LLMs (assistant, rewriting…).
➡️ Pros: lightweight, efficient, good ComfyUI integration.
➡️ Cons: some loss of nuance for complex contextual tasks.
Q4_K: for tests or small models
The very light Q4_K format is often used for prototypes, compatibility tests, or very constrained setups (CPU-only or 6–8 GB VRAM GPUs). On an RTX 5090 it’s not relevant unless the goal is to load 5–6 models in parallel for comparative tests or a complex workflow.
Summary of formats available in ComfyUI
Format | Precision | VRAM usage |
---|---|---|
BF16 | Very high | High |
FP16 | High | Medium/High |
FP8 | Medium | Very low |
GGUF Q8 | Good | Low/Medium |
GGUF Q6 | Medium to good | Low |
GGUF Q4 | Medium to low | Very low |
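To put rough numbers on this table, the weight footprint is approximately parameter count × bits per weight. The sketch below uses approximate average bits-per-weight values for the llama.cpp quant blocks and an illustrative 8-billion-parameter model; activations, attention caches, and batch size come on top of these weight-only figures.
```python
# Rough weight-only footprint per format (bits-per-weight values are approximate averages).
APPROX_BITS_PER_WEIGHT = {
    "BF16/FP16": 16.0,
    "FP8": 8.0,
    "GGUF Q8_0": 8.5,   # 32-weight blocks: 32 data bytes + a 2-byte scale
    "GGUF Q6_K": 6.6,
    "GGUF Q5_K": 5.5,
    "GGUF Q4_K": 4.5,
}

def weight_gb(params_billions: float, fmt: str) -> float:
    bits = params_billions * 1e9 * APPROX_BITS_PER_WEIGHT[fmt]
    return bits / 8 / 1024**3

for fmt in APPROX_BITS_PER_WEIGHT:
    print(f"8B-parameter model in {fmt}: ~{weight_gb(8, fmt):.1f} GB of weights")
```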
Which criteria matter when choosing the right format in ComfyUI?
To choose the best ComfyUI format, BF16, FP16, FP8 or GGUF, it’s not enough to look at raw GPU power. The right choice also depends on several key factors tied to your AI workflow, your models, and how you manage available VRAM.
1. Available VRAM
The decisive factor for picking an AI model format in ComfyUI is your GPU’s video memory. The more VRAM you have, the more heavy models you can load in high precision (FP16 or BF16), and the less you need to rely on quantized GGUF builds.
- A card with 32 GB VRAM like the RTX 5090 lets you load several models in FP16 without saturation. Beware: recent models can be very large.
- Conversely, an RTX 4060 with 8 GB VRAM should lean on Q6_K or Q5_K, or even Q4_K, to avoid out-of-memory crashes.
If you use ComfyUI for image generation (Flux Kontext, HiDream, Stable Diffusion XL, HiRes Fix, LoRA), each module (UNet, CLIP, VAE, T5) can take 2–8 GB of VRAM in FP16. That’s why you should pick the right format per component. Note: some models can be offloaded to the CPU, especially text models.
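As a simple illustration of that CPU offloading note (a generic PyTorch sketch, not a specific ComfyUI node), a text encoder can live in system RAM and only visit the GPU for the encoding step:
```python
# Sketch: keep a text encoder in system RAM and move it to the GPU only when needed.
# The text_encoder argument stands in for whichever encoder module your workflow loads.
import torch

def encode_prompt(text_encoder: torch.nn.Module, tokens: torch.Tensor) -> torch.Tensor:
    text_encoder.to("cuda")                      # borrow VRAM for this step only
    with torch.no_grad():
        embeddings = text_encoder(tokens.to("cuda"))
    text_encoder.to("cpu")                       # give the VRAM back to the UNet/VAE
    torch.cuda.empty_cache()                     # release the cached blocks
    return embeddings
```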
2. The model family you use in ComfyUI
Not all models react the same way to quantization or reduced precision.
Model type | Recommended format by context |
---|---|
Stable Diffusion (UNet) | FP16 or BF16 (prioritize image quality) |
CLIP / Vision encoder | GGUF Q6_K or FP8 (if available) |
LLM (LLaMA, Mistral) | GGUF Q8_0 if quality matters, Q6_K for VRAM-optimized runs |
T5 / text encoders | Q6_K or FP8 (fast, fairly precise, low memory) |
VAE / image decoding | FP16 to avoid visual artifacts |
Poorly quantized LLMs (e.g., rushed Q5_K releases) can suffer coherence loss. Always check qualitative benchmarks on Hugging Face or community forums before integrating them into your ComfyUI pipeline.
3. Task type: image, text, or multi-model pipeline
Your use case in ComfyUI directly influences which format to favor. Here’s a quick summary by application category:
- High-definition image generation (Flux, Stable Diffusion, HiDream): use FP16 or BF16 for UNet and VAE, since precision is essential to avoid blur or artifacts; quantize secondary models like CLIP or T5.
- Text interpreters (LLaMA, T5, Phi-3): favor GGUF Q8_0 for quality or Q6_K for a RAM/perf balance.
- Mixed workflows with multiple models (e.g., CLIP + LoRA + T5 + Diffusion): use a format mix, FP16 for critical models, GGUF for secondary LLMs, and FP8 if available for encoders.
4. How optimized your ComfyUI workflow is
Some ComfyUI nodes help manage GPU memory better, notably by isolating heavy blocks or using optimized variants (xformers nodes, CPU offloading, dynamic batching).
For each model, ask yourself:
- Is it active throughout the pipeline or only at the input?
- Is it used in batch or iteratively?
- Can parts be moved to CPU with no significant performance loss?
Example: a T5 XXL used to rephrase a prompt doesn’t necessarily need FP16. A Q6_K build is enough and frees up to 4 GB of VRAM, which you can reallocate to image generation.
5. How much precision you actually need
Finally, decide whether you’re aiming for maximum output fidelity or a performance-friendly compromise. For professional production (illustration, prompt engineering, commercial content), BF16 or FP16 is recommended. For prototyping, tests, or internal tools, a GGUF Q6_K build often delivers sufficient quality with a huge performance and VRAM win.
Tip: adopt a modular approach in your ComfyUI workflows. Use heavyweight formats for critical blocks, and lighten everything else with quantized models.
ComfyUI recommendations for an RTX 5090 (32 GB VRAM)
With an RTX 5090 and 32 GB of VRAM, you can fully leverage ComfyUI and its advanced features. This high-end configuration lets you run multiple AI models, FP16, BF16, or quantized GGUF, simultaneously without typical memory limits. Even so, choosing the right format per pipeline block remains crucial to optimize performance and quality.
Newer models and workflows are increasingly memory-hungry; even with an RTX 5090, you’ll want to optimize memory management, especially for video generation.
BF16 and FP16: prioritize these for critical models
With 32 GB VRAM, you can rely on BF16 and FP16, which are closest to original FP32 weights. These formats are ideal for image diffusion models (UNet, VAE, HiRes), top-tier LLMs (LLaMA 3), and encoder models like CLIP or T5 when maximum precision is required.
- FP16 is the most broadly compatible default in ComfyUI. It delivers fast generation with minimal quality compromise, especially in complex visual flows.
- BF16 benefits some recent models optimized for this format: it preserves the same dynamic range as FP32 (wider than FP16) while halving memory compared to FP32.
💡 Practical example: a ComfyUI workflow combining SDXL in FP16, a CLIP interrogator, a VAE in FP16, and a T5 text interpreter runs without GPU saturation, even for 1024×1024 images.
FP8: increasingly present
On an RTX 5090 with hardware support for FP8, it can be tempting to switch some secondary AI blocks to this format to reduce GPU load. In theory, FP8 enables very fast and lightweight execution, ideal for tasks such as:
- CLIP encoding for prompt inputs
- Seed text generation with T5
- NLP pre-processing
Use FP8 only if the model and ComfyUI nodes are compatible.
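A quick way to check the FP8 prerequisites is sketched below; note that it only tests whether PyTorch exposes a float8 dtype and whether the GPU reports a recent compute capability, not whether a given model or ComfyUI node actually accepts FP8.
```python
# Sketch: check whether this PyTorch build and GPU are plausible candidates for FP8.
# This does not guarantee that a specific model or ComfyUI node supports FP8.
import torch

has_fp8_dtype = hasattr(torch, "float8_e4m3fn")
capability = torch.cuda.get_device_capability(0) if torch.cuda.is_available() else (0, 0)
has_fp8_hardware = capability >= (8, 9)   # Ada Lovelace and newer expose FP8 tensor cores

print(f"float8 dtype available: {has_fp8_dtype}")
print(f"compute capability: {capability}, FP8 tensor cores likely: {has_fp8_hardware}")
```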
GGUF Q8 or Q6: ideal for LLMs or multi-model execution
Even if you have ample VRAM to run a model in FP16, it’s often smart to pick a quantized GGUF build, especially Q8_0 or Q6_K, to free memory for other ComfyUI blocks. This allows you to run, in parallel:
- An LLM (e.g., LLaMA 3) in Q6_K
- An image generation pipeline in FP16
- A text encoder (T5, Phi) in Q6_K
- A CLIP interrogator in Q8_0
With this structure, you maximize efficiency without sacrificing semantic coherence or generation quality. Community benchmarks show that quality differences between FP16 and Q8_0 are often negligible in real-world usage.
Concrete example:
A ComfyUI pipeline with HiDream i1 in FP16 + LLaMA 3 GGUF Q8_0 + CLIP GGUF Q6_K + quantized T5 + SDXL can fit in 26–28 GB VRAM, leaving room for batch size or upscalers.
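Budgeting this kind of pipeline is simple arithmetic; the per-module figures below are illustrative estimates rather than measurements, but they show how the pieces add up against a 32 GB card:
```python
# Illustrative VRAM budget for a multi-model ComfyUI pipeline (numbers are rough estimates).
budget_gb = {
    "Main diffusion model (FP16)": 14.0,
    "LLaMA 3 8B (GGUF Q8_0)": 8.0,
    "CLIP (GGUF Q6_K)": 1.0,
    "T5 text encoder (quantized)": 3.5,
    "VAE (FP16)": 0.3,
}

total = sum(budget_gb.values())
headroom = 32 - total
print(f"Estimated total: {total:.1f} GB, headroom on a 32 GB card: {headroom:.1f} GB")
```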
Per-module recommendations in ComfyUI (with RTX 5090)
AI workflow block | Recommended format | Reason |
---|---|---|
Stable Diffusion XL | FP16 / BF16 | Visual fidelity, stable high-res generation |
VAE / image decoding | FP16 | Best final render, avoids banding |
UNet (HiRes, HiDream) | FP16 | Most stable and fast on GPU |
CLIP interrogator | GGUF Q6_K or FP16 | Can be quantized with no visible loss |
T5 / LLaMA / Mistral | GGUF Q8_0 / Q6_K | Great size/perf ratio, excellent ComfyUI support |
Secondary assistant | Q5_K | Enough for dialogue or basic rewrites |
Summary: what strategy to adopt with the RTX 5090?
With 32 GB VRAM on an RTX 5090, you can build very flexible ComfyUI workflows. Recommended strategy:
- Use FP16 or BF16 for primary models (image or critical LLMs).
- Prefer GGUF Q6_K or Q8_0 for support/assistant models.
- Consider FP8 only for very specific, compatible modules.
- Don’t waste VRAM: smart format segmentation lets you add tools without slowing the pipeline.
ComfyUI recommendations for an RTX 5070 Ti (16 GB VRAM)
With an RTX 5070 Ti and 16 GB VRAM, you’ll make trade-offs between GPU performance and on-board memory. This setup is sufficient for advanced ComfyUI workflows, provided you choose formats wisely. Unlike an RTX 5090, your headroom is tighter, and GGUF becomes essential to balance quality, speed, and memory usage.
FP16 and BF16: use sparingly
While FP16 (or BF16) is sometimes viable on an RTX 5070 Ti, keep it strategic and targeted. Some FP16 models can take 8 GB+ VRAM and others exceed 16 GB, making them impractical in ComfyUI with only 16 GB.
Using FP16 everywhere in a ComfyUI pipeline with 16 GB VRAM often causes memory saturation or crashes.
✅ However, some critical blocks like UNet, VAE, or a single main Stable Diffusion model can stay in FP16 if no other heavyweight components run in parallel.
Practical tip: disable LoRAs, upscalers, or extra encoders if you run a main model in FP16.
GGUF Q6_K and Q5_K: recommended for most workflows
Quantized GGUF formats are a perfect fit for a mid-range setup like the RTX 5070 Ti, especially Q6_K and Q5_K variants:
- GGUF Q6_K runs core AI while leaving enough VRAM for other pipeline modules.
- GGUF Q5_K is useful for support models like T5 or CLIP when top precision isn’t required.
Concrete example: a combo of Stable Diffusion 1.5 in FP16 + LLaMA 3 in Q6_K + CLIP Q5_K runs smoothly on 16 GB VRAM, maintaining good generation quality and decent fluidity.
Pro tip: favor well-tested quantized builds on Hugging Face to avoid corrupt weights or CUDA errors in ComfyUI.
FP8: often the pragmatic choice here
With an RTX 5070 Ti, the tighter memory budget means you'll often lean toward FP8 builds of heavier models. If that still isn't enough, fall back to a quantized GGUF variant.
Per-module recommendations in ComfyUI (with RTX 5070 Ti)
AI pipeline block | Recommended format | Reason |
---|---|---|
Stable Diffusion 1.5 | FP16 | High visual quality without exceeding ~6–7 GB |
Stable Diffusion XL | GGUF Q6_K or SDXL-Light | Too heavy in FP16 alone; use optimized/light variants |
CLIP | Q5_K | Accurate enough, very lightweight |
T5 / text encoder | Q6_K | Good balance for rephrasing or context assist |
LLaMA 3 / Mistral | Q6_K or Q5_K | Q6_K if quality first, Q5_K for lighter footprint |
VAE | FP16 | Native format, compatible and modest usage |
Optimal ComfyUI strategy with 16 GB VRAM
To choose the ideal ComfyUI format with an RTX 5070 Ti, follow this strategy:
- Use FP16 only for major blocks (e.g., UNet or VAE) and avoid duplicates or overly complex pipelines.
- Use GGUF Q6_K for primary LLMs and Q5_K for assistants/encoders.
- Avoid ultra-hungry upscalers like ESRGAN/RealESRGAN in FP16; prefer quantized or CPU-friendly versions.
- Adopt a modular loading strategy to load models dynamically when using ComfyUI interactively.
💡 If you want to push performance further, combine a dynamic VRAM cache with partial CPU offloading.
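As a sketch of that idea (generic PyTorch, not a specific ComfyUI node), you can check the remaining VRAM before loading the next block and explicitly release a block once it has done its job:
```python
# Sketch: check free VRAM before loading the next block, and release a block when done.
import gc
import torch

def free_vram_gb() -> float:
    free_bytes, _total = torch.cuda.mem_get_info()   # (free, total) in bytes
    return free_bytes / 1024**3

def release(module: torch.nn.Module) -> None:
    module.to("cpu")          # or delete the module entirely if it will not be reused
    gc.collect()
    torch.cuda.empty_cache()  # return cached blocks to the driver

# Example: only load a heavy upscaler if at least ~4 GB of VRAM is still free.
if torch.cuda.is_available() and free_vram_gb() > 4:
    print("enough headroom for the next block")
```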
Conclusion: which format should you choose in ComfyUI based on your GPU?

Picking the right AI model format in ComfyUI, BF16, FP16, FP8, or GGUF, is decisive for stability, generation quality, and performance. Whether you use image diffusion models, CLIP, or encoders like T5, each format has its place, as long as you tailor it to your GPU configuration.
If you use an RTX 5090 with 32 GB VRAM:
- Choose FP16 or BF16 for the main models (Stable Diffusion XL, HiDream, LLaMA 3…).
- Use GGUF Q6_K or Q8_0 for secondary models (CLIP, T5, assistants).
- FP8 remains a viable option with some models and heavy workflows.
- Fully exploit available VRAM to run heavy, multi-model workflows without compromise.
If you have an RTX 5070 Ti with 16 GB VRAM:
- Reserve FP16 for critical modules like UNet or VAE and keep their number limited.
- Favor quantized GGUF Q6_K or Q5_K for LLMs, T5 and CLIP.
- Switch to FP8 if you’re running out of VRAM.
- Adopt a modular structure to avoid memory saturation.
Whether you’re experienced or just starting with ComfyUI, the secret to a smooth, efficient pipeline is smart management of AI model formats. The right format in the right place guarantees fast execution, faithful results, and better use of your hardware.
Your comments enrich our articles, so don’t hesitate to share your thoughts! Sharing on social media helps us a lot. Thank you for your support!