@@ -15,7 +15,7 @@ FROM ./mistral-7b-v0.1.Q4_0.gguf
(Optional) many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the `TEMPLATE` instruction in the `Modelfile`:

```
-FROM ./q4_0.bin
+FROM ./mistral-7b-v0.1.Q4_0.gguf
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
```
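For reference, the `Modelfile` above is used with the same `ollama create` / `ollama run` flow shown elsewhere in this doc; a minimal usage sketch, assuming the file is saved as `Modelfile` and reusing the doc's own `example` model name:

```
ollama create example -f Modelfile
ollama run example "What is your favourite condiment?"
```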
...
@@ -37,55 +37,69 @@ ollama run example "What is your favourite condiment?"
## Importing (PyTorch & Safetensors)
-### Supported models
-
-Ollama supports a set of model architectures, with support for more coming soon:
-
-- Llama & Mistral
-- Falcon & RW
-- BigCode
-
-To view a model's architecture, check the `config.json` file in its HuggingFace repo. You should see an entry under `architectures` (e.g. `LlamaForCausalLM`).
-
-### Step 1: Clone the HuggingFace repository (optional)
-
-If the model is currently hosted in a HuggingFace repository, first clone that repository to download the raw model.
+> Importing from PyTorch and Safetensors is a longer process than importing from GGUF. Improvements that make it easier are a work in progress.
+
+### Setup
+
+First, clone the `ollama/ollama` repo:
+
+```
+git clone git@github.com:ollama/ollama.git ollama
+cd ollama
+```
+
+and then fetch its `llama.cpp` submodule:
+
+Then build the `quantize` tool:
+
+```
+make -C llm/llama.cpp quantize
+```
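A minimal sketch of the submodule fetch mentioned above, assuming the standard `git submodule` workflow against the `llm/llama.cpp` path used by the `make` command (the Python requirements step is likewise an assumption, needed later by the convert script):

```
git submodule init
git submodule update llm/llama.cpp

# install the convert script's Python dependencies (path assumed)
python3 -m pip install -r llm/llama.cpp/requirements.txt
```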
-### Step 2: Convert and quantize to a `.bin` file (optional, for PyTorch and Safetensors)
-
-If the model is in PyTorch or Safetensors format, a [Docker image](https://hub.docker.com/r/ollama/quantize) with the tooling required to convert and quantize models is available.
-
-Next, to convert and quantize your model, run:
-
-```
-docker run --rm -v .:/model ollama/quantize -q q4_0 /model
-```
+If the model is currently hosted in a HuggingFace repository, first clone that repository to download the raw model.
+
+Install [Git LFS](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage), verify it's installed, and then clone the model's repository:
+
+```
+git lfs install
+git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 model
+```
-This will output two files into the directory:
-
-- `f16.bin`: the model converted to GGUF
-- `q4_0.bin` the model quantized to a 4-bit quantization (Ollama will use this file to create the Ollama model)
-
-### Step 3: Write a `Modelfile`
+### Convert the model
+
+> Note: some model architectures require using specific convert scripts. For example, Qwen models require running `convert-hf-to-gguf.py` instead of `convert.py`

(Optional) many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the `TEMPLATE` instruction in the `Modelfile`:

Next, create a `Modelfile` for your model:

```
-FROM ./q4_0.bin
+FROM quantized.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
```
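A minimal sketch of the convert and quantize commands this step refers to, assuming the stock `llama.cpp` scripts fetched during setup and illustrative file names (`./model` from the Git LFS clone above, `quantized.bin` matching the `FROM` line):

```
# convert the cloned HuggingFace weights to GGUF (use convert-hf-to-gguf.py where the note above requires it)
python llm/llama.cpp/convert.py ./model --outtype f16 --outfile converted.bin

# quantize with the tool built during setup
llm/llama.cpp/quantize converted.bin quantized.bin q4_0
```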
...
@@ -149,47 +163,3 @@ The quantization options are as follow (from highest highest to lowest levels of
- `q6_K`
- `q8_0`
- `f16`

-## Manually converting & quantizing models
-
-### Prerequisites
-
-Start by cloning the `llama.cpp` repo to your machine in another directory:
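The compression level is just the final argument to the `quantize` tool, so picking a different option from the list above is a one-word change; for example (file names illustrative):

```
# a larger, higher-quality quantization than q4_0
llm/llama.cpp/quantize converted.bin quantized.bin q6_K
```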