This guide walks through importing a GGUF, PyTorch or Safetensors model.
GGUF models and select Safetensors models can be imported directly into Ollama.
## Import GGUF
A binary GGUF file can be imported directly into Ollama through a Modelfile.

### Step 1: Write a `Modelfile`
Start by creating a `Modelfile`. This file is the blueprint for your model, specifying weights, parameters, prompt templates and more.
```dockerfile
FROM /path/to/file.gguf
```

For example:

```dockerfile
FROM ./mistral-7b-v0.1.Q4_0.gguf
```
(Optional) many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the `TEMPLATE` instruction in the `Modelfile`:
```dockerfile
FROM ./mistral-7b-v0.1.Q4_0.gguf
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
```
### Step 2: Create the Ollama model

Finally, create a model from your `Modelfile`:

```
ollama create example -f Modelfile
```

### Step 3: Run your model

Next, test the model with `ollama run`:

```
ollama run example "What is your favourite condiment?"
```

## Import Safetensors

If the model being imported is one of these architectures, it can be imported directly into Ollama through a Modelfile:

- LlamaForCausalLM
- MistralForCausalLM
- GemmaForCausalLM

```dockerfile
FROM /path/to/safetensors/directory
```

The model is then created and run the same way as a GGUF import (see the example at the end of this section).

For architectures not directly convertible by Ollama, see llama.cpp's [guide](https://github.com/ggerganov/llama.cpp/blob/master/README.md#prepare-and-quantize) on conversion. After conversion, see [Import GGUF](#import-gguf).
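A minimal sketch of that create-and-run step for a supported architecture (the model name `mysafetensors` is illustrative):

```shell
# build a model from the Modelfile above, then chat with it
ollama create mysafetensors -f Modelfile
ollama run mysafetensors "Hello!"
```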
## Automatic Quantization

> [!NOTE]
> Automatic quantization requires v0.1.35 or higher.

Ollama is capable of quantizing FP16 or FP32 models to any of the supported quantizations with the `-q/--quantize` flag in `ollama create`. For example, starting from a `Modelfile` that references an FP16 model:

```dockerfile
FROM /path/to/my/gemma/f16/model
```

```shell
$ ollama create -q Q4_K_M mymodel
transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:735e246cc1abfd06e9cdcf95504d6789a6cd1ad7577108a70d9902fef503c1bd
creating new layer sha256:0853f0ad24e5865173bbf9ffcc7b0f5d56b66fd690ab1009867e45e7d2c4db0f
writing manifest
success
```

### Supported Quantizations

<details>
<summary>Legacy Quantization</summary>

- `Q4_0`
- `Q4_1`
- `Q5_0`
- `Q5_1`
- `Q8_0`

</details>

<details>
<summary>K-means Quantization</summary>

- `Q3_K_S`
- `Q3_K_M`
- `Q3_K_L`
- `Q4_K_S`
- `Q4_K_M`
- `Q5_K_S`
- `Q5_K_M`
- `Q6_K`

</details>

> [!NOTE]
> Activation-aware Weight Quantization (i.e. IQ) is not currently supported for automatic quantization; however, you can still import the quantized model into Ollama, see [Import GGUF](#import-gguf).

## Importing (PyTorch & Safetensors)

> Importing from PyTorch and Safetensors is a longer process than importing from GGUF. Improvements that make it easier are a work in progress.

### Setup

First, clone the `ollama/ollama` repo:

```
git clone git@github.com:ollama/ollama.git ollama
cd ollama
```

and then fetch its `llama.cpp` submodule:

```shell
git submodule init
git submodule update llm/llama.cpp
```

Next, install the Python dependencies:

```
python3 -m venv llm/llama.cpp/.venv
source llm/llama.cpp/.venv/bin/activate
pip install -r llm/llama.cpp/requirements.txt
```

Then build the `quantize` tool:

```
make -C llm/llama.cpp quantize
```

### Clone the HuggingFace repository (optional)

If the model is currently hosted in a HuggingFace repository, first clone that repository to download the raw model.

Install [Git LFS](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage), verify it's installed, and then clone the model's repository:

```
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 model
```

### Convert the model

> Note: some model architectures require using specific convert scripts. For example, Qwen models require running `convert-hf-to-gguf.py` instead of `convert.py`
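The exact conversion command depends on the `llama.cpp` revision vendored at `llm/llama.cpp`; as a sketch, assuming it provides `convert.py` and that the model was cloned into `./model` as above (the output file names are illustrative):

```shell
# convert the PyTorch/Safetensors weights to an FP16 GGUF file
python llm/llama.cpp/convert.py ./model --outtype f16 --outfile converted.bin

# optionally, quantize the converted file with the quantize tool built during Setup
llm/llama.cpp/quantize converted.bin quantized.bin q4_0
```

The resulting file can then be imported with a `Modelfile` that points at it (for example `FROM ./converted.bin`), as described in [Import GGUF](#import-gguf).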
## Template Detection

When a `Modelfile` does not define a template, `ollama create` attempts to autodetect an appropriate prompt template for the model:

```shell
$ ollama create mymodel
transferring model data
using autodetected template gemma-instruct
creating new layer sha256:baa2a0edc27d19cc6b7537578a9a7ba1a4e3214dc185ed5ae43692b319af7b84
creating new layer sha256:ba66c3309914dbef07e5149a648fd1877f030d337a4f240d444ea335008943cb
writing manifest
success
```

Defining a template in the Modelfile will disable this feature, which may be useful if you want to use a different template than the autodetected one.

## Publishing your model (optional)

To publish your model, you'll need an [ollama.com](https://ollama.com) account. Add your Ollama public key to your [Ollama account](https://ollama.com/settings/keys), then copy your model to your username's namespace:

```
ollama cp example <your username>/example
```

> Note: model names may only contain lowercase letters, digits, and the characters `.`, `-`, and `_`.

Then push the model:

```
ollama push <your username>/example
```
After publishing, your model will be available at `https://ollama.com/<your username>/example`.
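To verify, the published model can be pulled and run by its full name (substitute your actual username):

```shell
# downloads the model from ollama.com if it isn't already present locally
ollama run <your username>/example "What is your favourite condiment?"
```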
## Quantization reference
The quantization options are as follows (from highest to lowest levels of quantization). An example of applying one of them with `ollama create` follows the list.

> Note: some architectures such as Falcon do not support K quants.
- `q2_K`
- `q3_K`
- `q3_K_S`
- `q3_K_M`
- `q3_K_L`
- `q4_0` (recommended)
- `q4_1`
- `q4_K`
- `q4_K_S`
- `q4_K_M`
- `q5_0`
- `q5_1`
- `q5_K`
- `q5_K_S`
- `q5_K_M`
- `q6_K`
- `q8_0`
- `f16`
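For example, combined with the `-q/--quantize` flag described in [Automatic Quantization](#automatic-quantization), and reusing the quantization level from that example (assume the `Modelfile` points at an FP16 model):

```shell
# create the model, quantizing the FP16 weights at import time
ollama create example -f Modelfile -q Q4_K_M
```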