Ollama and BF16: real behavior and support limitations

Support for the BF16 (bfloat16) format has become an important topic for Ollama users who want to get the most out of their modern GPUs. On paper, Ollama can download and run models published in BF16, in safetensors or GGUF format. In practice, however, the reality is more nuanced.
What happens when you run commands such as:
ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:BF16
or:
ollama run hf.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF:BF16
We will see why Ollama does not preserve the native BF16 format, what the consequences are for your AI performance and accuracy, and what improvements could be introduced.
Downloading a BF16 model with Ollama
When you run ollama pull or ollama run pointing to a BF16 file, Ollama does download the artifact as published on Hugging Face. For example, in the case of Qwen3-Coder-30B-A3B-Instruct-GGUF:BF16, it is indeed a GGUF file encoded in bfloat16 that is retrieved locally. The same applies to Mistral-Small-3.2-24B-Instruct-2506-GGUF:BF16.
At this stage, it looks as if the model will run in BF16, which would let you take advantage of the BF16 hardware support available on GPUs such as the RTX 40xx, RTX 50xx or the Nvidia H100.
Automatic conversion to FP16 during import
The reality is different: during import, Ollama applies an automatic conversion to FP16. In other words, even if you specify a BF16 model, the execution always takes place in FP16.
This limitation has been confirmed several times by the team and the community in GitHub discussions:
- Importing BF16/FP32 models converts to FP16 (#9944)
- BF16 models run as FP16 without native BF16 support (#4670)
- Bug when importing BF16 GGUF models (#9343)
In short, Ollama does not yet provide an option to keep a model in native BF16 or FP32.
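If you want to check what a GGUF file actually stores, the gguf Python package maintained in the llama.cpp project (pip install gguf) can list the tensor types. A minimal sketch, where the file name is only a placeholder:

from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("model-bf16.gguf")  # placeholder: any GGUF file on disk

# Count tensors per storage type (BF16, F16, F32, ...)
counts = Counter(t.tensor_type.name for t in reader.tensors)
for dtype, n in sorted(counts.items()):
    print(f"{dtype}: {n} tensors")

Running it on the file as published on Hugging Face and on the corresponding blob in Ollama’s model store (by default under ~/.ollama/models/blobs) lets you verify which types survive the import.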
Consequences of this conversion
This automatic conversion is not trivial and has several direct implications:
- Loss of true BF16 support: modern GPUs have hardware units optimized for BF16 computation. By switching to FP16, Ollama deprives the user of these optimizations.
- Impact on accuracy and numerical stability: BF16 and FP16 both use 16 bits per weight, so the memory footprint does not change, but BF16 keeps the same exponent range as FP32 while FP16 does not. Converting to FP16 can therefore clip large values and degrade certain sensitive computations (long reasoning chains, complex math); see the short sketch after this list.
- Forced uniformity: this strategy probably simplifies Ollama’s maintenance pipeline, but it prevents advanced users from choosing the format best suited to their hardware and needs.
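The trade-off is easy to see with a few lines of PyTorch (a minimal sketch, assuming only that the torch package is installed): BF16 keeps FP32’s exponent range but has fewer mantissa bits, while FP16 does the opposite:

import torch

x = torch.tensor(3.0e38)        # near the top of FP32's range
print(x.to(torch.bfloat16))     # still finite: BF16 keeps FP32's 8 exponent bits
print(x.to(torch.float16))      # inf: FP16 tops out around 65504

y = torch.tensor(1.0009765625)  # 1 + 2**-10, the smallest FP16 step above 1.0
print(y.to(torch.float16))      # represented exactly: FP16 has 10 mantissa bits
print(y.to(torch.bfloat16))     # rounds to 1.0: BF16 has only 7 mantissa bits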
Concrete examples with Qwen3 and Mistral
Take the case of the Qwen3-Coder-30B-A3B-Instruct model available in GGUF BF16. When you run:
ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:BF16
the model is downloaded in BF16 but converted to FP16 during import. The same process applies with:
ollama run hf.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF:BF16
In both cases, you never actually run a model in native BF16, despite the explicit mention in the command.
Possible improvements
- A future update of Ollama (or of its llama.cpp backend) could add a flag to ollama create or ollama run to keep a model in native BF16 or FP32.
- Better documentation: official communication about this limitation remains limited, which fuels confusion in the community.
BF16 support in brief
BF16 support in Ollama is partial and misleading: files downloaded in BF16 are automatically converted to FP16. This directly impacts your AI performance and accuracy, especially if you have a recent GPU capable of exploiting BF16 natively.
Until an option to preserve the original format is added, you should assume that any BF16 model run through Ollama is in fact executed in FP16.
Alternatives to truly run BF16 models
While waiting for Ollama to provide native BF16 support, some users turn to alternative solutions that make it possible to take advantage of this format without forced conversion to FP16. Here are the main options:
Using llama.cpp directly
Ollama relies on llama.cpp as its inference engine, so it is possible to bypass Ollama and run BF16 models directly with llama.cpp.
- llama.cpp partially supports the BF16 format, although compatibility depends on the model and the GPU.
- This allows you to preserve the weights as distributed, without intermediate conversion.
- Example: if the GGUF file itself is stored in BF16 (for instance one produced with convert_hf_to_gguf.py --outtype bf16, or a :BF16 variant downloaded from Hugging Face), running ./llama-cli -m model-bf16.gguf executes it with the weights kept in BF16, provided the build and GPU support it; a scripted equivalent in Python is sketched below.
⚠️ full BF16 support in llama.cpp is evolving rapidly, but it is not yet perfect for all models and GPUs.
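If you prefer to script this rather than call the CLI, the llama-cpp-python bindings wrap the same engine and load GGUF tensors in the type they are stored in; whether the BF16 math then runs on the GPU still depends on the build and the hardware. A minimal sketch, with a placeholder model path:

from llama_cpp import Llama

# Point model_path at a GGUF whose tensors are stored in BF16 (placeholder name).
llm = Llama(model_path="model-bf16.gguf", n_gpu_layers=-1, n_ctx=4096)

out = llm("Summarize what bfloat16 is in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])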
Switching to other compatible runtimes
Other open source runtimes offer more explicit support for BF16:
- vLLM: widely used for serving, it supports FP32, FP16 and BF16 weights explicitly through its dtype setting, with fine-grained control over modern GPUs.
- Transformers + PyTorch: loading a BF16 model with PyTorch uses the hardware acceleration directly, for example with torch_dtype=torch.bfloat16 (see the sketch after this list).
- Text Generation Inference (TGI): Hugging Face’s solution is designed for production use and handles BF16 more explicitly on recent Nvidia GPUs.
These solutions are often more complex to set up than Ollama, but they allow you to fully exploit the BF16 capabilities of your hardware.
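For the Transformers + PyTorch route, here is a minimal sketch; it assumes the transformers, torch and accelerate packages, a BF16-capable GPU, and uses a placeholder model name:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-bf16-model"  # placeholder: any checkpoint published in BF16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # keep the weights in BF16, no FP16 conversion
    device_map="auto",           # place the layers on the available GPU(s)
)

inputs = tokenizer("Explain bfloat16 in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

vLLM exposes the same choice through its dtype setting, for example dtype="bfloat16" when constructing an LLM object or --dtype bfloat16 on its server.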
Waiting for Ollama’s evolution
The community has already raised this limitation in several GitHub issues, including:
- Importing BF16/FP32 models converts to FP16 (#9944)
- BF16 models run as FP16 without native BF16 support (#4670)
The Ollama team is therefore aware of the problem. A future update could introduce a flag like --keep-bf16 or --native-dtype, allowing users to choose the format to preserve during import.
How to run a BF16 model today?
If you want to truly run a BF16 model today, you need to turn to llama.cpp, vLLM, or PyTorch. Ollama remains convenient for its simplicity and integration with ready-to-use models, but its current pipeline forces BF16 → FP16 conversion, which limits the value of the format for your AI performance and accuracy.
FAQ about Ollama and BF16 support
Does Ollama really run models in BF16?
No. Even if you launch a model labeled BF16, Ollama automatically converts the weights to FP16 during import. Inference is therefore done in FP16. See issue #9944
Why does Ollama not keep the native BF16 format?
The import pipeline was designed to unify weights and simplify compatibility with the internal engine. This avoids managing multiple data types, but it deprives users of the hardware benefits of BF16.
What is the impact on my AI performance and accuracy?
- Loss of the BF16 hardware optimizations available on RTX 40xx and H100 GPUs
- No memory savings either way: both formats use 16 bits per weight, but FP16 has a much narrower dynamic range
- Reduced accuracy or numerical stability in some cases (long reasoning, sensitive calculations)
Is there an option to keep native BF16 or FP32 in Ollama?
No. At present, no option such as --keep-bf16 exists. It is a frequently requested feature, but it is not yet available. See issue #4670
How to really run a BF16 model?
- llama.cpp with a GGUF file stored in BF16 (if supported by the build and the GPU)
- vLLM or Text Generation Inference (TGI)
- PyTorch with torch_dtype=torch.bfloat16
Will the problem be fixed in a future version?
Probably. The community has raised the need several times, and it is possible that an option to preserve native BF16/FP32 will arrive in a future version of Ollama.