---
title: FAQ
---

## How can I upgrade Ollama?

Ollama on macOS and Windows will automatically download updates. Click on the taskbar or menubar item and then click "Restart to update" to apply the update. Updates can also be installed by downloading the latest version [manually](https://ollama.com/download/).

On Linux, re-run the install script:

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

## How can I view the logs?

Review the [Troubleshooting](./troubleshooting) docs for more about using logs.
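
On Linux, when Ollama is installed as the systemd service set up by the install script, one quick way to check recent server logs is `journalctl`. This is a minimal example assuming the service is named `ollama`:

```shell
# show the ollama service journal, jumping to the most recent entries
journalctl -e -u ollama
```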

## Is my GPU compatible with Ollama?

Please refer to the [GPU docs](./gpu).

## How can I specify the context window size?

By default, Ollama uses a context window size of 2048 tokens.

This can be overridden with the `OLLAMA_CONTEXT_LENGTH` environment variable. For example, to set the default context window to 8K, use:

```shell
OLLAMA_CONTEXT_LENGTH=8192 ollama serve
```

To change this when using `ollama run`, use `/set parameter`:

```shell
/set parameter num_ctx 4096
```

When using the API, specify the `num_ctx` parameter:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 4096
  }
}'
```

## How can I tell if my model was loaded onto the GPU?

Use the `ollama ps` command to see what models are currently loaded into memory.

```shell
ollama ps
```

<Info>

**Output**:

```
NAME        ID            SIZE    PROCESSOR   UNTIL
llama3:70b  bcfb190ca3a7  42 GB   100% GPU    4 minutes from now
```
</Info>

The `Processor` column will show which memory the model was loaded into:

- `100% GPU` means the model was loaded entirely into the GPU
- `100% CPU` means the model was loaded entirely in system memory
- `48%/52% CPU/GPU` means the model was loaded partially onto both the GPU and into system memory

## How do I configure Ollama server?

The Ollama server can be configured with environment variables.

### Setting environment variables on Mac

If Ollama is run as a macOS application, environment variables should be set using `launchctl`:

1. For each environment variable, call `launchctl setenv`.

   ```bash
   launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
   ```

2. Restart the Ollama application.

### Setting environment variables on Linux

If Ollama is run as a systemd service, environment variables should be set using `systemctl`:

1. Edit the systemd service by calling `systemctl edit ollama.service`. This will open an editor.

2. For each environment variable, add an `Environment` line under the `[Service]` section:

   ```ini
   [Service]
   Environment="OLLAMA_HOST=0.0.0.0:11434"
   ```

3. Save and exit.

4. Reload `systemd` and restart Ollama:

   ```shell
   systemctl daemon-reload
   systemctl restart ollama
   ```

### Setting environment variables on Windows

On Windows, Ollama inherits your user and system environment variables.

1. First, quit Ollama by clicking on it in the taskbar.

2. Start the Settings (Windows 11) or Control Panel (Windows 10) application and search for _environment variables_.

3. Click on _Edit environment variables for your account_.

4. Edit or create a new variable for your user account for `OLLAMA_HOST`, `OLLAMA_MODELS`, etc.

5. Click OK/Apply to save.

6. Start the Ollama application from the Windows Start menu.

## How do I use Ollama behind a proxy?

Ollama pulls models from the Internet and may require a proxy server to access the models. Use `HTTPS_PROXY` to redirect outbound requests through the proxy. Ensure the proxy certificate is installed as a system certificate. Refer to the section above for how to use environment variables on your platform.
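
For example, on Linux you could start the server from a shell with the proxy variable set (the proxy address below is only a placeholder):

```shell
# placeholder proxy address; replace with your own proxy
HTTPS_PROXY=https://proxy.example.com:3128 ollama serve
```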

<Note>
  Avoid setting `HTTP_PROXY`. Ollama does not use HTTP for model pulls, only
  HTTPS. Setting `HTTP_PROXY` may interrupt client connections to the server.
</Note>

### How do I use Ollama behind a proxy in Docker?

The Ollama Docker container image can be configured to use a proxy by passing `-e HTTPS_PROXY=https://proxy.example.com` when starting the container.

Alternatively, the Docker daemon can be configured to use a proxy. Instructions are available for Docker Desktop on [macOS](https://docs.docker.com/desktop/settings/mac/#proxies), [Windows](https://docs.docker.com/desktop/settings/windows/#proxies), and [Linux](https://docs.docker.com/desktop/settings/linux/#proxies), and Docker [daemon with systemd](https://docs.docker.com/config/daemon/systemd/#httphttps-proxy).

Ensure the certificate is installed as a system certificate when using HTTPS. This may require a new Docker image when using a self-signed certificate.

```dockerfile
FROM ollama/ollama
COPY my-ca.pem /usr/local/share/ca-certificates/my-ca.crt
RUN update-ca-certificates
```

Build and run this image:

```shell
docker build -t ollama-with-ca .
docker run -d -e HTTPS_PROXY=https://my.proxy.example.com -p 11434:11434 ollama-with-ca
```

## Does Ollama send my prompts and answers back to ollama.com?

No. Ollama runs locally, and conversation data does not leave your machine.

## How can I expose Ollama on my network?

Ollama binds to 127.0.0.1 on port 11434 by default. Change the bind address with the `OLLAMA_HOST` environment variable.
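
For example, to listen on all interfaces when starting the server directly from a shell:

```shell
# listen on all interfaces on the default port
OLLAMA_HOST=0.0.0.0:11434 ollama serve
```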

Refer to the section [above](#how-do-i-configure-ollama-server) for how to set environment variables on your platform.

## How can I use Ollama with a proxy server?

Ollama runs an HTTP server and can be exposed using a proxy server such as Nginx. To do so, configure the proxy to forward requests and optionally set required headers (if not exposing Ollama on the network). For example, with Nginx:

```nginx
server {
    listen 80;
    server_name example.com;  # Replace with your domain or IP
    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host localhost:11434;
    }
}
```

## How can I use Ollama with ngrok?

Ollama can be accessed through a range of tunneling tools. For example, with ngrok:

```shell
ngrok http 11434 --host-header="localhost:11434"
```

## How can I use Ollama with Cloudflare Tunnel?

To use Ollama with Cloudflare Tunnel, use the `--url` and `--http-host-header` flags:

```shell
cloudflared tunnel --url http://localhost:11434 --http-host-header="localhost:11434"
```

## How can I allow additional web origins to access Ollama?

Ollama allows cross-origin requests from `127.0.0.1` and `0.0.0.0` by default. Additional origins can be configured with `OLLAMA_ORIGINS`.

For browser extensions, you'll need to explicitly allow the extension's origin pattern. Set `OLLAMA_ORIGINS` to include `chrome-extension://*`, `moz-extension://*`, and `safari-web-extension://*` if you wish to allow all browser extensions access, or specific extensions as needed:

```shell
# Allow all Chrome, Firefox, and Safari extensions
OLLAMA_ORIGINS=chrome-extension://*,moz-extension://*,safari-web-extension://* ollama serve
```

Refer to the section [above](#how-do-i-configure-ollama-server) for how to set environment variables on your platform.

## Where are models stored?

- macOS: `~/.ollama/models`
- Linux: `/usr/share/ollama/.ollama/models`
- Windows: `C:\Users\%username%\.ollama\models`

### How do I set them to a different location?

If a different directory needs to be used, set the environment variable `OLLAMA_MODELS` to the chosen directory.
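
For example, when starting the server from a shell (the path below is only an illustration):

```shell
# hypothetical path; use any directory the server can read and write
OLLAMA_MODELS=/data/ollama/models ollama serve
```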

<Note>
  On Linux using the standard installer, the `ollama` user needs read and write access to the specified directory. To assign the directory to the `ollama` user run `sudo chown -R ollama:ollama <directory>`.
</Note>

Refer to the section [above](#how-do-i-configure-ollama-server) for how to set environment variables on your platform.

## How can I use Ollama in Visual Studio Code?

There is already a large collection of plugins available for VS Code as well as other editors that leverage Ollama. See the list of [extensions & plugins](https://github.com/ollama/ollama#extensions--plugins) at the bottom of the main repository readme.

## How do I use Ollama with GPU acceleration in Docker?

The Ollama Docker container can be configured with GPU acceleration on Linux or Windows (with WSL2). This requires the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit). See [ollama/ollama](https://hub.docker.com/r/ollama/ollama) for more details.
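
As a sketch of a typical invocation once the toolkit is installed (the container and volume names are up to you):

```shell
# pass all GPUs through and store models in a named volume called "ollama"
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```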

GPU acceleration is not available for Docker Desktop on macOS due to the lack of GPU passthrough and emulation.

## Why is networking slow in WSL2 on Windows 10?

This can impact both installing Ollama and downloading models.

Open `Control Panel > Networking and Internet > View network status and tasks` and click on `Change adapter settings` on the left panel. Find the `vEthernet (WSL)` adapter, right click and select `Properties`.
Click on `Configure` and open the `Advanced` tab. Search through each of the properties until you find `Large Send Offload Version 2 (IPv4)` and `Large Send Offload Version 2 (IPv6)`. _Disable_ both of these properties.

## How can I preload a model into Ollama to get faster response times?

If you are using the API, you can preload a model by sending the Ollama server an empty request. This works with both the `/api/generate` and `/api/chat` API endpoints.

To preload the mistral model using the generate endpoint, use:

```shell
curl http://localhost:11434/api/generate -d '{"model": "mistral"}'
```

To use the chat completions endpoint, use:

```shell
curl http://localhost:11434/api/chat -d '{"model": "mistral"}'
```

To preload a model using the CLI, use the command:

```shell
ollama run llama3.2 ""
```

## How do I keep a model loaded in memory or make it unload immediately?

By default, models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you're making numerous requests to the LLM. If you want to immediately unload a model from memory, use the `ollama stop` command:

```shell
ollama stop llama3.2
```

If you're using the API, use the `keep_alive` parameter with the `/api/generate` and `/api/chat` endpoints to set the amount of time that a model stays in memory. The `keep_alive` parameter can be set to:

- a duration string (such as "10m" or "24h")
- a number in seconds (such as 3600)
- any negative number which will keep the model loaded in memory (e.g. -1 or "-1m")
- '0' which will unload the model immediately after generating a response

For example, to preload a model and leave it in memory, use:

```shell
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": -1}'
```

To unload the model and free up memory, use:

```shell
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": 0}'
```

Alternatively, you can change the amount of time all models stay loaded in memory by setting the `OLLAMA_KEEP_ALIVE` environment variable when starting the Ollama server. The `OLLAMA_KEEP_ALIVE` variable accepts the same values as the `keep_alive` parameter described above. Refer to the section explaining [how to configure the Ollama server](#how-do-i-configure-ollama-server) to correctly set the environment variable.
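
For example, to keep models loaded for 24 hours when starting the server from a shell:

```shell
# keep models in memory for 24 hours after their last use
OLLAMA_KEEP_ALIVE=24h ollama serve
```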

The `keep_alive` API parameter with the `/api/generate` and `/api/chat` API endpoints will override the `OLLAMA_KEEP_ALIVE` setting.

## How do I manage the maximum number of requests the Ollama server can queue?

If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded. You can adjust how many requests may be queued by setting `OLLAMA_MAX_QUEUE`.
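
For example, to raise the queue limit to 1024 (an illustrative value) when starting the server from a shell:

```shell
# queue up to 1024 requests before returning 503
OLLAMA_MAX_QUEUE=1024 ollama serve
```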

## How does Ollama handle concurrent requests?

Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing.

If there is insufficient available memory to load a newly requested model while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference, new models must be able to fit completely in VRAM to allow concurrent model loads.

Parallel request processing for a given model multiplies the effective context size by the number of parallel requests. For example, a 2K context with 4 parallel requests results in an 8K context and additional memory allocation.

The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms:

- `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 \* the number of GPUs or 3 for CPU inference.
- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory.
- `OLLAMA_MAX_QUEUE` - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512.
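
For example, to start the server from a shell with two concurrently loaded models and four parallel requests per model (illustrative values):

```shell
# illustrative values; actual limits still depend on available memory
OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=4 ollama serve
```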

Note: Windows with Radeon GPUs currently default to 1 model maximum due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6.2 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPUs' VRAM.

## How does Ollama load models on multiple GPUs?

When loading a new model, Ollama evaluates the required VRAM for the model against what is currently available. If the model will entirely fit on any single GPU, Ollama will load the model on that GPU. This typically provides the best performance as it reduces the amount of data transferring across the PCI bus during inference. If the model does not fit entirely on one GPU, then it will be spread across all the available GPUs.

## How can I enable Flash Attention?

Flash Attention is a feature of most modern models that can significantly reduce memory usage as the context size grows. To enable Flash Attention, set the `OLLAMA_FLASH_ATTENTION` environment variable to `1` when starting the Ollama server.
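
For example, when starting the server from a shell:

```shell
# enable Flash Attention for this server instance
OLLAMA_FLASH_ATTENTION=1 ollama serve
```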

## How can I set the quantization type for the K/V cache?

The K/V context cache can be quantized to significantly reduce memory usage when Flash Attention is enabled.

To use a quantized K/V cache with Ollama, set the following environment variable:

- `OLLAMA_KV_CACHE_TYPE` - The quantization type for the K/V cache. Default is `f16`.
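
For example, to start the server with an 8-bit quantized K/V cache (Flash Attention enabled as well, since it is required):

```shell
# Flash Attention is required for K/V cache quantization
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
```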

<Note>
  Currently this is a global option - meaning all models will run with the
  specified quantization type.
</Note>

The currently available K/V cache quantization types are:

- `f16` - high precision and memory usage (default).
- `q8_0` - 8-bit quantization, uses approximately 1/2 the memory of `f16` with a very small loss in precision, this usually has no noticeable impact on the model's quality (recommended if not using f16).
- `q4_0` - 4-bit quantization, uses approximately 1/4 the memory of `f16` with a small-medium loss in precision that may be more noticeable at higher context sizes.

How much the cache quantization impacts the model's response quality will depend on the model and the task. Models that have a high GQA count (e.g. Qwen2) may see a larger impact on precision from quantization than models with a low GQA count.

You may need to experiment with different quantization types to find the best balance between memory usage and quality.

## Where can I find my Ollama Public Key?

Your **Ollama Public Key** is the public part of the key pair that lets your local Ollama instance talk to [ollama.com](https://ollama.com).

You'll need it to:
* Push models to Ollama
* Pull private models from Ollama to your machine
* Run models hosted in [Ollama Cloud](https://ollama.com/cloud)

### How to Add the Key

* **Sign in via the Settings page** in the **Mac** and **Windows** apps

* **Sign in via the CLI**

```shell
ollama signin
```

* **Manually copy & paste** the key on the **Ollama Keys** page:
[https://ollama.com/settings/keys](https://ollama.com/settings/keys)

### Where the Ollama Public Key lives

| OS      | Path to `id_ed25519.pub`                      |
| :------ | :-------------------------------------------- |
| macOS   | `~/.ollama/id_ed25519.pub`                     |
| Linux   | `/usr/share/ollama/.ollama/id_ed25519.pub`     |
| Windows | `C:\Users\<username>\.ollama\id_ed25519.pub`   |

<Note>
  Replace &lt;username&gt; with your actual Windows user name.
</Note>

## How can I stop Ollama from starting when I log in to my computer?

Ollama for Windows and macOS registers as a login item during installation. You can disable this if you prefer not to have Ollama start automatically. Ollama will respect this setting across upgrades unless you uninstall the application.

**Windows**
- In `Task Manager`, go to the `Startup apps` tab, search for `ollama`, then click `Disable`.

**macOS**
- Open `Settings` and search for "Login Items", find the `Ollama` entry under "Allow in the Background", then click the slider to disable it.