# Import a model

This guide walks through importing a GGUF, PyTorch or Safetensors model.

## Importing (GGUF)

### Step 1: Write a `Modelfile`

Start by creating a `Modelfile`. This file is the blueprint for your model, specifying weights, parameters, prompt templates and more.

```
FROM ./mistral-7b-v0.1.Q4_0.gguf
```

(Optional) Many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the `TEMPLATE` instruction in the `Modelfile`:

```
FROM ./mistral-7b-v0.1.Q4_0.gguf
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
```

### Step 2: Create the Ollama model

Next, create a model from your `Modelfile`:

```
ollama create example -f Modelfile
```

### Step 3: Run your model

Finally, test the model with `ollama run`:

```
ollama run example "What is your favourite condiment?"
```

## Importing (PyTorch & Safetensors)

### Supported models

Ollama supports a set of model architectures, with support for more coming soon:

- Llama & Mistral
- Falcon & RW
- GPT-NeoX
- BigCode

To view a model's architecture, check the `config.json` file in its HuggingFace repo. You should see an entry under `architectures` (e.g. `LlamaForCausalLM`).
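
For example, once the repository has been cloned locally (see Step 1 below), a quick way to check is to search `config.json` for that field (a convenience sketch; viewing the file on HuggingFace works just as well):

```
grep -A 2 '"architectures"' config.json
```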

### Step 1: Clone the HuggingFace repository (optional)

If the model is currently hosted in a HuggingFace repository, first clone that repository to download the raw model.

```
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
cd Mistral-7B-Instruct-v0.1
```

### Step 2: Convert and quantize to a `.bin` file (optional, for PyTorch and Safetensors)

If the model is in PyTorch or Safetensors format, a [Docker image](https://hub.docker.com/r/ollama/quantize) with the tooling required to convert and quantize models is available.

First, install [Docker](https://www.docker.com/get-started/).

Next, to convert and quantize your model, run:

```
docker run --rm -v .:/model ollama/quantize -q q4_0 /model
```
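
Note: some Docker versions require an absolute host path for the `-v` bind mount; if the command above is rejected, the same command can be written as:

```
docker run --rm -v $(pwd):/model ollama/quantize -q q4_0 /model
```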

This will output two files into the directory:

- `f16.bin`: the model converted to GGUF
- `q4_0.bin`: the model quantized to 4 bits (we will use this file to create the Ollama model)

### Step 3: Write a `Modelfile`

Next, create a `Modelfile` for your model:

```
FROM ./q4_0.bin
```

(Optional) Many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the `TEMPLATE` instruction in the `Modelfile`:

```
FROM ./q4_0.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
```

### Step 4: Create the Ollama model

Next, create a model from your `Modelfile`:

```
ollama create example -f Modelfile
```

### Step 5: Run your model

Finally, test the model with `ollama run`:

```
ollama run example "What is your favourite condiment?"
```

## Publishing your model (optional – early alpha)

Publishing models is in early alpha. If you'd like to publish your model to share with others, follow these steps:

1. Create [an account](https://ollama.ai/signup)
2. Run `cat ~/.ollama/id_ed25519.pub` to view your Ollama public key. Copy this to the clipboard.
3. Add your public key to your [Ollama account](https://ollama.ai/settings/keys)

Next, copy your model to your username's namespace:

```
ollama cp example <your username>/example
```

Then push the model:

```
ollama push <your username>/example
```

After publishing, your model will be available at `https://ollama.ai/<your username>/example`.
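
Others can then pull and run the published model by name (substitute your actual username):

```
ollama pull <your username>/example
ollama run <your username>/example
```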

## Quantization reference

The quantization options are as follows (from highest to lowest level of quantization). Note: some architectures, such as Falcon, do not support K quants.

- `q2_K`
- `q3_K`
- `q3_K_S`
- `q3_K_M`
- `q3_K_L`
- `q4_0` (recommended)
- `q4_1`
- `q4_K`
- `q4_K_S`
- `q4_K_M`
- `q5_0`
- `q5_1`
- `q5_K`
- `q5_K_S`
- `q5_K_M`
- `q6_K`
- `q8_0`
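
For example, swapping the `-q` value passed to the quantize Docker image shown earlier should produce a `q5_K_M` file instead of the default `q4_0` (a sketch, subject to the K-quant caveat above):

```
docker run --rm -v .:/model ollama/quantize -q q5_K_M /model
```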

## Manually converting & quantizing models

### Prerequisites

Start by cloning the `llama.cpp` repo to your machine in another directory:

```
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
```

Next, install the Python dependencies:

```
pip install -r requirements.txt
```

Finally, build the `quantize` tool:

```
make quantize
```

### Convert the model

Run the correct conversion script for your model architecture:

```shell
# LlamaForCausalLM or MistralForCausalLM
python convert.py <path to model directory>

# FalconForCausalLM
python convert-falcon-hf-to-gguf.py <path to model directory>

# GPTNeoXForCausalLM
python convert-gptneox-hf-to-gguf.py <path to model directory>

# GPTBigCodeForCausalLM
python convert-starcoder-hf-to-gguf.py <path to model directory>
```

### Quantize the model

```
quantize <path to model dir>/ggml-model-f32.bin <path to model dir>/q4_0.bin q4_0
```
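
As an illustrative example, if the converted model lives in a sibling directory named `../Mistral-7B-Instruct-v0.1` and `quantize` was built in the current `llama.cpp` checkout, the invocation would look like:

```
./quantize ../Mistral-7B-Instruct-v0.1/ggml-model-f32.bin ../Mistral-7B-Instruct-v0.1/q4_0.bin q4_0
```

The resulting `q4_0.bin` can then be referenced from a `Modelfile` with `FROM`, as in Step 3 of the PyTorch & Safetensors instructions above.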