documentation for stopping a model (#6766)

Commit 5804cf17, authored Sep 18, 2024 by Patrick Devine; committed by GitHub on Sep 18, 2024

Showing with 105 additions and 4 deletions

README.md +12 -0

docs/api.md +85 -0

docs/faq.md +8 -4

--- a/README.md
+++ b/README.md
@@ -197,6 +197,18 @@ ollama show llama3.1
 ollama list
 ```
+### List which models are currently loaded
+```
+ollama ps
+```
+### Stop a model which is currently running
+```
+ollama stop llama3.1
+```
 ### Start Ollama
 `ollama serve` is used when you want to start ollama without running the desktop application.

--- a/docs/api.md
+++ b/docs/api.md
@@ -407,6 +407,33 @@ A single JSON object is returned:
 }
 ```
+#### Unload a model
+If an empty prompt is provided and the `keep_alive` parameter is set to `0`, a model will be unloaded from memory.
+##### Request
+```shell
+curl http://localhost:11434/api/generate -d '{
+  "model": "llama3.1",
+  "keep_alive": 0
+}'
+```
+##### Response
+A single JSON object is returned:
+```json
+{
+  "model": "llama3.1",
+  "created_at": "2024-09-12T03:54:03.516566Z",
+  "response": "",
+  "done": true,
+  "done_reason": "unload"
+}
+```
 ## Generate a chat completion
 ```shell
@@ -736,6 +763,64 @@ curl http://localhost:11434/api/chat -d '{
 }
 ```
+#### Load a model
+If the messages array is empty, the model will be loaded into memory.
+##### Request
+```
+curl http://localhost:11434/api/chat -d '{
+  "model": "llama3.1",
+  "messages": []
+}'
+```
+##### Response
+```json
+{
+  "model": "llama3.1",
+  "created_at":"2024-09-12T21:17:29.110811Z",
+  "message": {
+    "role": "assistant",
+    "content": ""
+  },
+  "done_reason": "load",
+  "done": true
+}
+```
+#### Unload a model
+If the messages array is empty and the `keep_alive` parameter is set to `0`, a model will be unloaded from memory.
+##### Request
+```
+curl http://localhost:11434/api/chat -d '{
+  "model": "llama3.1",
+  "messages": [],
+  "keep_alive": 0
+}'
+```
+##### Response
+A single JSON object is returned:
+```json
+{
+  "model": "llama3.1",
+  "created_at":"2024-09-12T21:33:17.547535Z",
+  "message": {
+    "role": "assistant",
+    "content": ""
+  },
+  "done_reason": "unload",
+  "done": true
+}
+```
 ## Create a Model
 ```shell

--- a/docs/faq.md
+++ b/docs/faq.md
@@ -237,9 +237,13 @@ ollama run llama3.1 ""
 ## How do I keep a model loaded in memory or make it unload immediately?
-By default models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you are making numerous requests to the LLM. You may, however, want to free up the memory before the 5 minutes have elapsed or keep the model loaded indefinitely. Use the `keep_alive` parameter with either the `/api/generate` and `/api/chat` API endpoints to control how long the model is left in memory.
+By default models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you're making numerous requests to the LLM. If you want to immediately unload a model from memory, use the `ollama stop` command:
-The `keep_alive` parameter can be set to:
+```shell
+ollama stop llama3.1
+```
+If you're using the API, use the `keep_alive` parameter with the `/api/generate` and `/api/chat` endpoints to set the amount of time that a model stays in memory. The `keep_alive` parameter can be set to:
 * a duration string (such as "10m" or "24h")
 * a number in seconds (such as 3600)
 * any negative number which will keep the model loaded in memory (e.g. -1 or "-1m")
@@ -255,9 +259,9 @@ To unload the model and free up memory use:
 curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "keep_alive": 0}'
 ```
-Alternatively, you can change the amount of time all models are loaded into memory by setting the `OLLAMA_KEEP_ALIVE` environment variable when starting the Ollama server. The `OLLAMA_KEEP_ALIVE` variable uses the same parameter types as the `keep_alive` parameter types mentioned above. Refer to section explaining [how to configure the Ollama server](#how-do-i-configure-ollama-server) to correctly set the environment variable.
+Alternatively, you can change the amount of time all models are loaded into memory by setting the `OLLAMA_KEEP_ALIVE` environment variable when starting the Ollama server. The `OLLAMA_KEEP_ALIVE` variable uses the same parameter types as the `keep_alive` parameter types mentioned above. Refer to the section explaining [how to configure the Ollama server](#how-do-i-configure-ollama-server) to correctly set the environment variable.
-If you wish to override the `OLLAMA_KEEP_ALIVE` setting, use the `keep_alive` API parameter with the `/api/generate` or `/api/chat` API endpoints.
+The `keep_alive` API parameter with the `/api/generate` and `/api/chat` API endpoints will override the `OLLAMA_KEEP_ALIVE` setting.
 ## How do I manage the maximum number of requests the Ollama server can queue?
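
The load and unload requests documented in this change can be wrapped in a small script. The following is a minimal sketch, not part of the commit: it assumes bash, `curl`, and `sed` are available, that the server listens on the default address used in the examples above, and that model names contain no characters needing JSON escaping. The function names and the `OLLAMA_URL` variable are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Ollama). They only wrap the request
# bodies shown in the docs/api.md and docs/faq.md changes above.

# Base URL of the Ollama server. OLLAMA_URL is an assumed variable name.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Body for /api/generate that unloads a model: no prompt, keep_alive set to 0.
unload_body() {
  printf '{"model": "%s", "keep_alive": 0}' "$1"
}

# Body for /api/chat that loads a model: an empty messages array.
load_body() {
  printf '{"model": "%s", "messages": []}' "$1"
}

# Read a JSON response on stdin and print its done_reason value.
# Uses sed so the sketch does not depend on jq.
done_reason() {
  sed -n 's/.*"done_reason"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Send the unload request and print the done_reason the server reports.
unload_model() {
  curl -s "$OLLAMA_URL/api/generate" -d "$(unload_body "$1")" | done_reason
}

# Send the load request and print the done_reason the server reports.
load_model() {
  curl -s "$OLLAMA_URL/api/chat" -d "$(load_body "$1")" | done_reason
}
```

With a server running, `unload_model llama3.1` should print `unload` and `load_model llama3.1` should print `load`, going by the example responses in the diff. From the command line, `ollama stop llama3.1` remains the simpler way to unload a model.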