# FAQ

## How can I upgrade Ollama?

Ollama on macOS and Windows will automatically download updates. Click on the taskbar or menubar item and then click "Restart to update" to apply the update. Updates can also be installed by downloading the latest version [manually](https://ollama.com/download/).

On Linux, re-run the install script:

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

## How can I view the logs?

Review the [Troubleshooting](./troubleshooting.md) docs for more about using logs.

## Is my GPU compatible with Ollama?

Please refer to the [GPU docs](./gpu.md).

## How can I specify the context window size?

By default, Ollama uses a context window size of 2048 tokens.

To change this when using `ollama run`, use `/set parameter`:

```
/set parameter num_ctx 4096
```

When using the API, specify the `num_ctx` parameter:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 4096
  }
}'
```

## How can I tell if my model was loaded onto the GPU?

Use the `ollama ps` command to see what models are currently loaded into memory.

```shell
ollama ps
NAME      	ID          	SIZE 	PROCESSOR	UNTIL
llama3:70b	bcfb190ca3a7	42 GB	100% GPU 	4 minutes from now
```

The `Processor` column shows which memory the model was loaded into:
* `100% GPU` means the model was loaded entirely into the GPU
* `100% CPU` means the model was loaded entirely in system memory
* `48%/52% CPU/GPU` means the model was loaded partially onto both the GPU and into system memory
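
If you are scripting against the server, loaded models can also be listed over the API with the `/api/ps` endpoint (a quick check; the exact response fields may vary between versions):

```shell
# List the models currently loaded into memory as JSON
curl http://localhost:11434/api/ps
```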

## How do I configure Ollama server?

The Ollama server can be configured with environment variables.

### Setting environment variables on Mac

If Ollama is run as a macOS application, environment variables should be set using `launchctl`:

1. For each environment variable, call `launchctl setenv`.

    ```bash
    launchctl setenv OLLAMA_HOST "0.0.0.0"
    ```

2. Restart the Ollama application.
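
To confirm that a variable is registered, you can read it back with `launchctl getenv` (an optional verification step):

```shell
launchctl getenv OLLAMA_HOST
# Prints the value, e.g. 0.0.0.0 if set as in the example above
```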

### Setting environment variables on Linux

If Ollama is run as a systemd service, environment variables should be set using `systemctl`:

1. Edit the systemd service by calling `systemctl edit ollama.service`. This will open an editor.

2. For each environment variable, add an `Environment` line under the `[Service]` section:

    ```ini
    [Service]
    Environment="OLLAMA_HOST=0.0.0.0"
    ```

3. Save and exit.

4. Reload `systemd` and restart Ollama:

   ```bash
   systemctl daemon-reload
   systemctl restart ollama
   ```
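
To confirm the override was picked up after restarting, you can inspect the unit's environment (an optional check):

```shell
# Show the effective unit file, including the drop-in created by `systemctl edit`
systemctl cat ollama.service

# Or print just the Environment= settings
systemctl show ollama.service --property=Environment
```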

### Setting environment variables on Windows

On Windows, Ollama inherits your user and system environment variables.

1. First, quit Ollama by clicking on it in the taskbar.

2. Start the Settings (Windows 11) or Control Panel (Windows 10) application and search for _environment variables_.

3. Click on _Edit environment variables for your account_.

4. Edit or create a new variable for your user account for `OLLAMA_HOST`, `OLLAMA_MODELS`, etc.

5. Click OK/Apply to save.

6. Start the Ollama application from the Windows Start menu.
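
If you prefer the command line, user-scoped variables can also be set with `setx` from a terminal before restarting Ollama (an alternative sketch for steps 2-5 above; `setx` only affects newly started programs):

```shell
setx OLLAMA_HOST "0.0.0.0"
```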

## How do I use Ollama behind a proxy?

Ollama is compatible with proxy servers if `HTTP_PROXY` or `HTTPS_PROXY` is configured. When using either variable, ensure it is set where `ollama serve` can access the value. When using `HTTPS_PROXY`, ensure the proxy certificate is installed as a system certificate. Refer to the section above for how to set environment variables on your platform.
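
For a quick test from a terminal, the variable can also be set inline for a single `ollama serve` session (a minimal sketch; the proxy URL is a placeholder):

```shell
HTTPS_PROXY=https://proxy.example.com ollama serve
```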

### How do I use Ollama behind a proxy in Docker?

The Ollama Docker container image can be configured to use a proxy by passing `-e HTTPS_PROXY=https://proxy.example.com` when starting the container.

Alternatively, the Docker daemon can be configured to use a proxy. Instructions are available for Docker Desktop on [macOS](https://docs.docker.com/desktop/settings/mac/#proxies), [Windows](https://docs.docker.com/desktop/settings/windows/#proxies), and [Linux](https://docs.docker.com/desktop/settings/linux/#proxies), and Docker [daemon with systemd](https://docs.docker.com/config/daemon/systemd/#httphttps-proxy).

Ensure the certificate is installed as a system certificate when using HTTPS. This may require a new Docker image when using a self-signed certificate.

```dockerfile
FROM ollama/ollama
COPY my-ca.pem /usr/local/share/ca-certificates/my-ca.crt
RUN update-ca-certificates
```

Build and run this image:

```shell
docker build -t ollama-with-ca .
docker run -d -e HTTPS_PROXY=https://my.proxy.example.com -p 11434:11434 ollama-with-ca
```

## Does Ollama send my prompts and answers back to ollama.com?

No. Ollama runs locally, and conversation data does not leave your machine.

## How can I expose Ollama on my network?

Ollama binds to 127.0.0.1 on port 11434 by default. Change the bind address with the `OLLAMA_HOST` environment variable.

Refer to the section [above](#how-do-i-configure-ollama-server) for how to set environment variables on your platform.
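
For example, to listen on all interfaces for a one-off session started from a terminal (use your platform's method above for a persistent change):

```shell
OLLAMA_HOST=0.0.0.0:11434 ollama serve
```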

## How can I use Ollama with a proxy server?

Ollama runs an HTTP server and can be exposed using a proxy server such as Nginx. To do so, configure the proxy to forward requests and optionally set the required headers (if not exposing Ollama on the network). For example, with Nginx:

```
server {
    listen 80;
    server_name example.com;  # Replace with your domain or IP
    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host localhost:11434;
    }
}
```
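
Once the proxy is in place, clients can send requests to it instead of port 11434 directly. For example (the domain is a placeholder):

```shell
curl http://example.com/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?"
}'
```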

## How can I use Ollama with ngrok?

Ollama can be accessed using a range of tunneling tools. For example, with ngrok:

```shell
ngrok http 11434 --host-header="localhost:11434"
```

## How can I use Ollama with Cloudflare Tunnel?

To use Ollama with Cloudflare Tunnel, use the `--url` and `--http-host-header` flags:

```shell
cloudflared tunnel --url http://localhost:11434 --http-host-header="localhost:11434"
```

## How can I allow additional web origins to access Ollama?

Ollama allows cross-origin requests from `127.0.0.1` and `0.0.0.0` by default. Additional origins can be configured with `OLLAMA_ORIGINS`.

Refer to the section [above](#how-do-i-configure-ollama-server) for how to set environment variables on your platform.
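
For example, on macOS an additional origin could be allowed with `launchctl` before restarting the app (the origin shown is a placeholder):

```shell
launchctl setenv OLLAMA_ORIGINS "https://app.example.com"
```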

## Where are models stored?

- macOS: `~/.ollama/models`
- Linux: `/usr/share/ollama/.ollama/models`
- Windows: `C:\Users\%username%\.ollama\models`

### How do I set them to a different location?

If a different directory needs to be used, set the environment variable `OLLAMA_MODELS` to the chosen directory.

Refer to the section [above](#how-do-i-configure-ollama-server) for how to set environment variables on your platform.
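
For example, to point a one-off server session at another directory (the path is a placeholder; make sure it is readable and writable by the user running Ollama):

```shell
OLLAMA_MODELS=/data/ollama/models ollama serve
```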

## How can I use Ollama in Visual Studio Code?

There is already a large collection of plugins available for VSCode as well as other editors that leverage Ollama. See the list of [extensions & plugins](https://github.com/ollama/ollama#extensions--plugins) at the bottom of the main repository readme.

## How do I use Ollama with GPU acceleration in Docker?

The Ollama Docker container can be configured with GPU acceleration in Linux or Windows (with WSL2). This requires the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit). See [ollama/ollama](https://hub.docker.com/r/ollama/ollama) for more details.
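
As a reference, the commonly documented invocation looks like the following (see the Docker Hub page above for the authoritative, up-to-date command):

```shell
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```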

GPU acceleration is not available for Docker Desktop in macOS due to the lack of GPU passthrough and emulation.

## Why is networking slow in WSL2 on Windows 10?

This can impact both installing Ollama and downloading models.

Open `Control Panel > Networking and Internet > View network status and tasks` and click on `Change adapter settings` in the left panel. Find the `vEthernet (WSL)` adapter, right-click it and select `Properties`. Click on `Configure` and open the `Advanced` tab. Search through each of the properties until you find `Large Send Offload Version 2 (IPv4)` and `Large Send Offload Version 2 (IPv6)`. *Disable* both of these properties.

## How can I preload a model into Ollama to get faster response times?

If you are using the API, you can preload a model by sending the Ollama server an empty request. This works with both the `/api/generate` and `/api/chat` API endpoints.

To preload the mistral model using the generate endpoint, use:
```shell
curl http://localhost:11434/api/generate -d '{"model": "mistral"}'
```

To use the chat completions endpoint, use:
```shell
curl http://localhost:11434/api/chat -d '{"model": "mistral"}'
```

To preload a model using the CLI, use the command:
```shell
ollama run llama3.1 ""
```

## How do I keep a model loaded in memory or make it unload immediately?

By default, models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you are making numerous requests to the LLM. You may, however, want to free up the memory before the 5 minutes have elapsed, or keep the model loaded indefinitely. Use the `keep_alive` parameter with either the `/api/generate` or `/api/chat` API endpoint to control how long the model stays in memory.

The `keep_alive` parameter can be set to:
* a duration string (such as "10m" or "24h")
* a number in seconds (such as 3600)
* any negative number, which will keep the model loaded in memory (e.g. -1 or "-1m")
* 0, which will unload the model immediately after generating a response

For example, to preload a model and leave it in memory use:
```shell
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": -1}'
```

To unload the model and free up memory use:
```shell
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": 0}'
```

Alternatively, you can change how long all models stay loaded in memory by setting the `OLLAMA_KEEP_ALIVE` environment variable when starting the Ollama server. The `OLLAMA_KEEP_ALIVE` variable accepts the same types of values as the `keep_alive` parameter described above. Refer to the section explaining [how to configure the Ollama server](#how-do-i-configure-ollama-server) to correctly set the environment variable.
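
For example, to keep models loaded for 24 hours when launching the server from a terminal (an inline illustration; use your platform's method from the configuration section for a persistent setting):

```shell
OLLAMA_KEEP_ALIVE=24h ollama serve
```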

If you wish to override the `OLLAMA_KEEP_ALIVE` setting, use the `keep_alive` API parameter with the `/api/generate` or `/api/chat` API endpoints.

## How do I manage the maximum number of requests the Ollama server can queue?

If too many requests are sent to the server, it will respond with a 503 error indicating that it is overloaded. You can adjust how many requests may be queued by setting `OLLAMA_MAX_QUEUE`.
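
For example, to allow a larger queue when starting the server from a terminal (the value shown is only an illustration):

```shell
OLLAMA_MAX_QUEUE=1024 ollama serve
```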

## How does Ollama handle concurrent requests?

Ollama supports two levels of concurrent processing.  If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time.  For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing.

If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded.  As prior models become idle, one or more will be unloaded to make room for the new model.  Queued requests will be processed in order.  When using GPU inference new models must be able to completely fit in VRAM to allow concurrent model loads.

Parallel request processing for a given model increases the context size by a factor of the number of parallel requests. For example, a 2K context with 4 parallel requests will result in an 8K context and additional memory allocation.

The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms:

- `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory.  The default is 3 * the number of GPUs or 3 for CPU inference.
- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time.  The default will auto-select either 4 or 1 based on available memory.
- `OLLAMA_MAX_QUEUE` - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512.
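
As an illustration, the three settings could be combined when starting the server from a terminal (the values are examples, not recommendations):

```shell
OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_QUEUE=512 ollama serve
```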

Note: Windows with Radeon GPUs currently defaults to a maximum of 1 model due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6.2 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPUs' VRAM.

## How does Ollama load models on multiple GPUs?

Installing multiple GPUs of the same brand can be a great way to increase your available VRAM to load larger models. When you load a new model, Ollama evaluates the required VRAM for the model against what is currently available. If the model will entirely fit on any single GPU, Ollama will load the model on that GPU. This typically provides the best performance as it reduces the amount of data transferring across the PCI bus during inference. If the model does not fit entirely on one GPU, then it will be spread across all the available GPUs.