docs: add docs for docs.ollama.com (#12805)

Unverified Commit 3d99d977 authored Oct 28, 2025 by Parth Sareen Committed by GitHub Oct 28, 2025
20 changed files
--- a/docs/api/authentication.mdx
+++ b/docs/api/authentication.mdx
+---
+title: Authentication
+---
+No authentication is required when accessing Ollama's API locally via `http://localhost:11434`.
+Authentication is required for the following:
+* Running cloud models via ollama.com
+* Publishing models
+* Downloading private models
+Ollama supports two authentication methods:
+* **Signing in**: sign in from your local installation, and Ollama will automatically take care of authenticating requests to ollama.com when running commands
+* **API keys**: API keys for programmatic access to ollama.com's API
+## Signing in
+To sign in to ollama.com from your local installation of Ollama, run:
+```
+ollama signin
+```
+Once signed in, Ollama will automatically authenticate commands as required:
+```
+ollama run gpt-oss:120b-cloud
+```
+Similarly, when accessing a local API endpoint that requires cloud access, Ollama will automatically authenticate the request:
+```shell
+curl http://localhost:11434/api/generate -d '{
+  "model": "gpt-oss:120b-cloud",
+  "prompt": "Why is the sky blue?"
+}'
+```
+## API keys
+For direct access to ollama.com's API served at `https://ollama.com/api`, authentication via API keys is required.
+First, create an [API key](https://ollama.com/settings/keys), then set the `OLLAMA_API_KEY` environment variable:
+```shell
+export OLLAMA_API_KEY=your_api_key
+```
+Then use the API key in the Authorization header:
+```shell
+curl https://ollama.com/api/generate \
+  -H "Authorization: Bearer $OLLAMA_API_KEY" \
+  -d '{
+    "model": "gpt-oss:120b",
+    "prompt": "Why is the sky blue?",
+    "stream": false
+  }'
+```
+API keys don't currently expire; however, you can revoke them at any time in your [API keys settings](https://ollama.com/settings/keys).
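As a rough sketch of the API key flow described above, the following Python builds the same request as the `curl` example using only the standard library. The helper names `build_generate_request` and `send` are hypothetical and not part of any Ollama SDK; the endpoint, header, and body fields are taken from the example above.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
OLLAMA_CLOUD_URL = "https://ollama.com/api/generate"


def build_generate_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request with a Bearer token."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_CLOUD_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body (hypothetical helper, needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

In practice the key would be read from the environment, for example `build_generate_request(os.environ["OLLAMA_API_KEY"], "gpt-oss:120b", "Why is the sky blue?")`, so that it never appears in source code.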
--- a/docs/api/errors.mdx
+++ b/docs/api/errors.mdx
+---
+title: Errors
+---
+## Status codes
+Endpoints return appropriate HTTP status codes based on the success or failure of the request in the HTTP status line (e.g. `HTTP/1.1 200 OK` or `HTTP/1.1 400 Bad Request`). Common status codes are:
+- `200`: Success
+- `400`: Bad Request (missing parameters, invalid JSON, etc.)
+- `404`: Not Found (model doesn't exist, etc.)
+- `429`: Too Many Requests (e.g. when a rate limit is exceeded)
+- `500`: Internal Server Error
+- `502`: Bad Gateway (e.g. when a cloud model cannot be reached)
+## Error messages
+Errors are returned as `application/json`, with the error message in the `error` property:
+```json
+{
+  "error": "the model failed to generate a response"
+}
+```
+## Errors that occur while streaming
+If an error occurs mid-stream, the error will be returned as an object in the `application/x-ndjson` format with an `error` property. Since the response has already started, the status code of the response will not be changed.
+```json
+{"model":"gemma3","created_at":"2025-10-26T17:21:21.196249Z","response":" Yes","done":false}
+{"model":"gemma3","created_at":"2025-10-26T17:21:21.207235Z","response":".","done":false}
+{"model":"gemma3","created_at":"2025-10-26T17:21:21.219166Z","response":"I","done":false}
+{"model":"gemma3","created_at":"2025-10-26T17:21:21.231094Z","response":"can","done":false}
+{"error":"an error was encountered while running the model"}
+```
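Because the status code cannot change once streaming has started, a client has to inspect each line for an `error` property. The following is a minimal sketch of that check, assuming the stream has already been split into lines; `StreamError`, `iter_chunks`, and `collect_response` are hypothetical names used only for illustration.

```python
import json
from typing import Iterable, Iterator, Optional


class StreamError(RuntimeError):
    """Hypothetical exception raised when a streamed line carries an `error` property."""


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed NDJSON chunks, raising StreamError on an error object."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between chunks
        chunk = json.loads(line)
        if "error" in chunk:
            # The HTTP status was already sent, so the body is the only signal.
            raise StreamError(chunk["error"])
        yield chunk


def collect_response(lines: Iterable[str]) -> tuple[str, Optional[str]]:
    """Return (text received so far, error message or None)."""
    parts: list[str] = []
    try:
        for chunk in iter_chunks(lines):
            parts.append(chunk.get("response", ""))
    except StreamError as exc:
        return "".join(parts), str(exc)
    return "".join(parts), None
```

Returning the partial text alongside the error lets the caller decide whether to show what arrived before the failure or discard it.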
--- a/docs/api/index.mdx
+++ b/docs/api/index.mdx
--- a/docs/api/openai-compatibility.mdx
+++ b/docs/api/openai-compatibility.mdx
-# OpenAI compatibility
+---
+title: OpenAI compatibility
+---
-> [!NOTE]
+Ollama provides compatibility with parts of the [OpenAI API](https://platform.openai.com/docs/api-reference) to help connect existing applications to Ollama.
-> OpenAI compatibility is experimental and is subject to major adjustments including breaking changes. For fully-featured access to the Ollama API, see the Ollama [Python library](https://github.com/ollama/ollama-python), [JavaScript library](https://github.com/ollama/ollama-js) and [REST API](https://github.com/ollama/ollama/blob/main/docs/api.md).
-Ollama provides experimental compatibility with parts of the [OpenAI API](https://platform.openai.com/docs/api-reference) to help connect existing applications to Ollama.
 ## Usage
@@ -100,49 +99,50 @@ except Exception as e:
 ### OpenAI JavaScript library
 ```javascript
-import OpenAI from 'openai'
+import OpenAI from "openai";
 const openai = new OpenAI({
-  baseURL: 'http://localhost:11434/v1/',
+  baseURL: "http://localhost:11434/v1/",
  // required but ignored
-  apiKey: 'ollama',
+  apiKey: "ollama",
-})
+});
 const chatCompletion = await openai.chat.completions.create({
-    messages: [{ role: 'user', content: 'Say this is a test' }],
+  messages: [{ role: "user", content: "Say this is a test" }],
-    model: 'llama3.2',
+  model: "llama3.2",
-})
+});
 const response = await openai.chat.completions.create({
-    model: "llava",
+  model: "llava",
-    messages: [
+  messages: [
+    {
+      role: "user",
+      content: [
+        { type: "text", text: "What's in this image?" },
        {
-        role: "user",
+          type: "image_url",
-        content: [
+          image_url:
-            { type: "text", text: "What's in this image?" },
+            "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAG0AAABmCAYAAADBPx+VAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAA3VSURBVHgB7Z27r0zdG8fX743i1bi1ikMoFMQloXRpKFFIqI7LH4BEQ+NWIkjQuSWCRIEoULk0gsK1kCBI0IhrQVT7tz/7zZo888yz1r7MnDl7z5xvsjkzs2fP3uu71nNfa7lkAsm7d++Sffv2JbNmzUqcc8m0adOSzZs3Z+/XES4ZckAWJEGWPiCxjsQNLWmQsWjRIpMseaxcuTKpG/7HP27I8P79e7dq1ars/yL4/v27S0ejqwv+cUOGEGGpKHR37tzJCEpHV9tnT58+dXXCJDdECBE2Ojrqjh071hpNECjx4cMHVycM1Uhbv359B2F79+51586daxN/+pyRkRFXKyRDAqxEp4yMlDDzXG1NPnnyJKkThoK0VFd1ELZu3TrzXKxKfW7dMBQ6bcuWLW2v0VlHjx41z717927ba22U9APcw7Nnz1oGEPeL3m3p2mTAYYnFmMOMXybPPXv2bNIPpFZr1NHn4HMw0KRBjg9NuRw95s8PEcz/6DZELQd/09C9QGq5RsmSRybqkwHGjh07OsJSsYYm3ijPpyHzoiacg35MLdDSIS/O1yM778jOTwYUkKNHWUzUWaOsylE00MyI0fcnOwIdjvtNdW/HZwNLGg+sR1kMepSNJXmIwxBZiG8tDTpEZzKg0GItNsosY8USkxDhD0Rinuiko2gfL/RbiD2LZAjU9zKQJj8RDR0vJBR1/Phx9+PHj9Z7REF4nTZkxzX4LCXHrV271qXkBAPGfP/atWvu/PnzHe4C97F48eIsRLZ9+3a3f/9+87dwP1JxaF7/3r17ba+5l4EcaVo0lj3SBq5kGTJSQmLWMjgYNei2GPT1MuMqGTDEFHzeQSP2wi/jGnkmPJ/nhccs44jvDAxpVcxnq0F6eT8h4ni/iIWpR5lPyA6ETkNXoSukvpJAD3AsXLiwpZs49+fPn5ke4j10TqYvegSfn0OnafC+Tv9ooA/JPkgQysqQNBzagXY55nO/oa1F7qvIPWkRL12WRpMWUvpVDYmxAPehxWSe8ZEXL20sadYIozfmNch4QJPAfeJgW3rNsnzphBKNJM2KKODo1rVOMRYik5ETy3ix4qWNI81qAAirizgMIc+yhTytx0JWZuNI03qsrgWlGtwjoS9XwgUhWGyhUaRZZQNNIEwCiXD16tXcAHUs79co0vSD8rrJCIW98pzvxpAWyyo3HYwqS0+H0BjStClcZJT5coMm6D2LOF8TolGJtK9fvyZpyiC5ePFi9nc/oJU4eiEP0jVoAnHa9wyJycITMP78+eMeP37sXrx44d6+fdt6f82aNdkx1pg9e3Zb5W+RSRE+n+VjksQWifvVaTKFhn5O8my63K8Qabdv33b379/PiAP//vuvW7BggZszZ072/+TJk91YgkafPn166zXB1rQHFvouAWHq9z3SEevSUerqCn2/dDCeta2jxYbr69evk4MHDyY7d+7MjhMnTiTPnz9Pfv/+nfQT2ggpO2dMF8cghuoM7Ygj5iWCqRlGFml0QC/ftGmTmzt3rmsaKDsgBSPh0/8yPeLLBihLkOKJc0jp8H8vUzcxIA1k6QJ/c78tWEyj5P3o4u9+jywNPdJi5rAH9x0KHcl4Hg570eQp3+vHXGyrmEeigzQsQsjavXt38ujRo44LQuDDhw+TW7duRS1HGgMxhNXHgflaNTOsHyKvHK5Ijo2jbFjJBQK9YwFd6RVMzfgRBmEfP37suBBm/p49e1qjEP2mwTViNRo0VJWH1deMXcNK08uUjVUu7s/zRaL+oLNxz1bpANco4npUgX4G2eFbpDFyQoQxojBCpEGSytmOH8qrH5Q9vuzD6ofQylkCUmh8DBAr+q8JCyVN
tWQIidKQE9wNtLSQnS4jDSsxNHogzFuQBw4cyM61UKVsjfr3ooBkPSqqQHesUPWVtzi9/vQi1T+rJj7WiTz4Pt/l3LxUkr5P2VYZaZ4URpsE+st/dujQoaBBYokbrz/8TJNQYLSonrPS9kUaSkPeZyj1AWSj+d+VBoy1pIWVNed8P0Ll/ee5HdGRhrHhR5GGN0r4LGZBaj8oFDJitBTJzIZgFcmU0Y8ytWMZMzJOaXUSrUs5RxKnrxmbb5YXO9VGUhtpXldhEUogFr3IzIsvlpmdosVcGVGXFWp2oU9kLFL3dEkSz6NHEY1sjSRdIuDFWEhd8KxFqsRi1uM/nz9/zpxnwlESONdg6dKlbsaMGS4EHFHtjFIDHwKOo46l4TxSuxgDzi+rE2jg+BaFruOX4HXa0Nnf1lwAPufZeF8/r6zD97WK2qFnGjBxTw5qNGPxT+5T/r7/7RawFC3j4vTp09koCxkeHjqbHJqArmH5UrFKKksnxrK7FuRIs8STfBZv+luugXZ2pR/pP9Ois4z+TiMzUUkUjD0iEi1fzX8GmXyuxUBRcaUfykV0YZnlJGKQpOiGB76x5GeWkWWJc3mOrK6S7xdND+W5N6XyaRgtWJFe13GkaZnKOsYqGdOVVVbGupsyA/l7emTLHi7vwTdirNEt0qxnzAvBFcnQF16xh/TMpUuXHDowhlA9vQVraQhkudRdzOnK+04ZSP3DUhVSP61YsaLtd/ks7ZgtPcXqPqEafHkdqa84X6aCeL7YWlv6edGFHb+ZFICPlljHhg0bKuk0CSvVznWsotRu433alNdFrqG45ejoaPCaUkWERpLXjzFL2Rpllp7PJU2a/v7Ab8N05/9t27Z16KUqoFGsxnI9EosS2niSYg9SpU6B4JgTrvVW1flt1sT+0ADIJU2maXzcUTraGCRaL1Wp9rUMk16PMom8QhruxzvZIegJjFU7LLCePfS8uaQdPny4jTTL0dbee5mYokQsXTIWNY46kuMbnt8Kmec+LGWtOVIl9cT1rCB0V8WqkjAsRwta93TbwNYoGKsUSChN44lgBNCoHLHzquYKrU6qZ8lolCIN0Rh6cP0Q3U6I6IXILYOQI513hJaSKAorFpuHXJNfVlpRtmYBk1Su1obZr5dnKAO+L10Hrj3WZW+E3qh6IszE37F6EB+68mGpvKm4eb9bFrlzrok7fvr0Kfv727dvWRmdVTJHw0qiiCUSZ6wCK+7XL/AcsgNyL74DQQ730sv78Su7+t/A36MdY0sW5o40ahslXr58aZ5HtZB8GH64m9EmMZ7FpYw4T6QnrZfgenrhFxaSiSGXtPnz57e9TkNZLvTjeqhr734CNtrK41L40sUQckmj1lGKQ0rC37x544r8eNXRpnVE3ZZY7zXo8NomiO0ZUCj2uHz58rbXoZ6gc0uA+F6ZeKS/jhRDUq8MKrTho9fEkihMmhxtBI1DxKFY9XLpVcSkfoi8JGnToZO5sU5aiDQIW716ddt7ZLYtMQlhECdBGXZZMWldY5BHm5xgAroWj4C0hbYkSc/jBmggIrXJWlZM6pSETsEPGqZOndr2uuuR5rF169a2HoHPdurUKZM4CO1WTPqaDaAd+GFGKdIQkxAn9RuEWcTRyN2KSUgiSgF5aWzPTeA/lN5rZubMmR2bE4SIC4nJoltgAV/dVefZm72AtctUCJU2CMJ327hxY9t7EHbkyJFseq+EJSY16RPo3Dkq1kkr7+q0bNmyDuLQcZBEPYmHVdOBiJyIlrRDq41YPWfXOxUysi5fvtyaj+2BpcnsUV/oSoEMOk2CQGlr4ckhBwaetBhjCwH0ZHtJROPJkyc7UjcYLDjmrH7ADTEBXFfOYmB0k9oYBOjJ8b4aOYSe7QkKcYhFlq3QYLQhSidNmtS2RATwy8YOM3EQJsUjKiaWZ+vZToUQgzhkHXudb/PW5YMHD9yZM2faPsMwoc7RciYJXbGuBqJ1UIGKKLv915jsvgtJxCZDubdXr165
mzdvtr1Hz5LONA8jrUwKPqsmVesKa49S3Q4WxmRPUEYdTjgiUcfUwLx589ySJUva3oMkP6IYddq6HMS4o55xBJBUeRjzfa4Zdeg56QZ43LhxoyPo7Lf1kNt7oO8wWAbNwaYjIv5lhyS7kRf96dvm5Jah8vfvX3flyhX35cuX6HfzFHOToS1H4BenCaHvO8pr8iDuwoUL7tevX+b5ZdbBair0xkFIlFDlW4ZknEClsp/TzXyAKVOmmHWFVSbDNw1l1+4f90U6IY/q4V27dpnE9bJ+v87QEydjqx/UamVVPRG+mwkNTYN+9tjkwzEx+atCm/X9WvWtDtAb68Wy9LXa1UmvCDDIpPkyOQ5ZwSzJ4jMrvFcr0rSjOUh+GcT4LSg5ugkW1Io0/SCDQBojh0hPlaJdah+tkVYrnTZowP8iq1F1TgMBBauufyB33x1v+NWFYmT5KmppgHC+NkAgbmRkpD3yn9QIseXymoTQFGQmIOKTxiZIWpvAatenVqRVXf2nTrAWMsPnKrMZHz6bJq5jvce6QK8J1cQNgKxlJapMPdZSR64/UivS9NztpkVEdKcrs5alhhWP9NeqlfWopzhZScI6QxseegZRGeg5a8C3Re1Mfl1ScP36ddcUaMuv24iOJtz7sbUjTS4qBvKmstYJoUauiuD3k5qhyr7QdUHMeCgLa1Ear9NquemdXgmum4fvJ6w1lqsuDhNrg1qSpleJK7K3TF0Q2jSd94uSZ60kK1e3qyVpQK6PVWXp2/FC3mp6jBhKKOiY2h3gtUV64TWM6wDETRPLDfSakXmH3w8g9Jlug8ZtTt4kVF0kLUYYmCCtD/DrQ5YhMGbA9L3ucdjh0y8kOHW5gU/VEEmJTcL4Pz/f7mgoAbYkAAAAAElFTkSuQmCC",
-            {
-            type: "image_url",
-            image_url: "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAG0AAABmCAYAAADBPx+VAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAA3VSURBVHgB7Z27r0zdG8fX743i1bi1ikMoFMQloXRpKFFIqI7LH4BEQ+NWIkjQuSWCRIEoULk0gsK1kCBI0IhrQVT7tz/7zZo888yz1r7MnDl7z5xvsjkzs2fP3uu71nNfa7lkAsm7d++Sffv2JbNmzUqcc8m0adOSzZs3Z+/XES4ZckAWJEGWPiCxjsQNLWmQsWjRIpMseaxcuTKpG/7HP27I8P79e7dq1ars/yL4/v27S0ejqwv+cUOGEGGpKHR37tzJCEpHV9tnT58+dXXCJDdECBE2Ojrqjh071hpNECjx4cMHVycM1Uhbv359B2F79+51586daxN/+pyRkRFXKyRDAqxEp4yMlDDzXG1NPnnyJKkThoK0VFd1ELZu3TrzXKxKfW7dMBQ6bcuWLW2v0VlHjx41z717927ba22U9APcw7Nnz1oGEPeL3m3p2mTAYYnFmMOMXybPPXv2bNIPpFZr1NHn4HMw0KRBjg9NuRw95s8PEcz/6DZELQd/09C9QGq5RsmSRybqkwHGjh07OsJSsYYm3ijPpyHzoiacg35MLdDSIS/O1yM778jOTwYUkKNHWUzUWaOsylE00MyI0fcnOwIdjvtNdW/HZwNLGg+sR1kMepSNJXmIwxBZiG8tDTpEZzKg0GItNsosY8USkxDhD0Rinuiko2gfL/RbiD2LZAjU9zKQJj8RDR0vJBR1/Phx9+PHj9Z7REF4nTZkxzX4LCXHrV271qXkBAPGfP/atWvu/PnzHe4C97F48eIsRLZ9+3a3f/9+87dwP1JxaF7/3r17ba+5l4EcaVo0lj3SBq5kGTJSQmLWMjgYNei2GPT1MuMqGTDEFHzeQSP2wi/jGnkmPJ/nhccs44jvDAxpVcxnq0F6eT8h4ni/iIWpR5lPyA6ETkNXoSukvpJAD3AsXLiwpZs49+fPn5ke4j10TqYvegSfn0OnafC+Tv9ooA/JPkgQysqQNBzagXY55nO/oa1F7qvIPWkRL12WRpMWUvpVDYmxAPehxWSe8ZEXL20sadYIozfmNch4QJPAfeJgW3rNsnzphBKNJM2KKODo1rVOMRYik5ETy3ix4qWNI81qAAirizgMIc+yhTytx0JWZuNI03qsrgWlGtwjoS9XwgUhWGyhUaRZZQNNIEwCiXD16tXcAHUs79co0vSD8rrJCIW98pzvxpAWyyo3HYwqS0+H0BjStClcZJT5coMm6D2LOF8TolGJtK9fvyZpyiC5ePFi9nc/oJU4eiEP0jVoAnHa9wyJycITMP78+eMeP37sXrx44d6+fdt6f82aNdkx1pg9e3Zb5W+RSRE+n+VjksQWifvVaTKFhn5O8my63K8Qabdv33b379/PiAP//vuvW7BggZszZ072/+TJk91YgkafPn166zXB1rQHFvouAWHq9z3SEevSUerqCn2/dDCeta2jxYbr69evk4MHDyY7d+7MjhMnTiTPnz9Pfv/+nfQT2ggpO2dMF8cghuoM7Ygj5iWCqRlGFml0QC/ftGmTmzt3rmsaKDsgBSPh0/8yPeLLBihLkOKJc0jp8H8vUzcxIA1k6QJ/c78tWEyj5P3o4u9+jywNPdJi5rAH9x0KHcl4Hg570eQp3+vHXGyrmEeigzQsQsjavXt38ujRo44LQuDDhw+TW7duRS1HGgMxhNXHgflaNTOsHyKvHK5Ijo2jbFjJBQK9YwFd6RVMzfgRBmEfP37suBBm/p49e1qjEP2mwTViNRo0VJWH1deMXcNK08uUjVUu7s/zRaL+oLNxz1bpANco4npUgX4G2eFbpDFyQoQxojBCpEGSytmOH8qrH5Q9vuzD6ofQylkCUmh8D
BAr+q8JCyVNtWQIidKQE9wNtLSQnS4jDSsxNHogzFuQBw4cyM61UKVsjfr3ooBkPSqqQHesUPWVtzi9/vQi1T+rJj7WiTz4Pt/l3LxUkr5P2VYZaZ4URpsE+st/dujQoaBBYokbrz/8TJNQYLSonrPS9kUaSkPeZyj1AWSj+d+VBoy1pIWVNed8P0Ll/ee5HdGRhrHhR5GGN0r4LGZBaj8oFDJitBTJzIZgFcmU0Y8ytWMZMzJOaXUSrUs5RxKnrxmbb5YXO9VGUhtpXldhEUogFr3IzIsvlpmdosVcGVGXFWp2oU9kLFL3dEkSz6NHEY1sjSRdIuDFWEhd8KxFqsRi1uM/nz9/zpxnwlESONdg6dKlbsaMGS4EHFHtjFIDHwKOo46l4TxSuxgDzi+rE2jg+BaFruOX4HXa0Nnf1lwAPufZeF8/r6zD97WK2qFnGjBxTw5qNGPxT+5T/r7/7RawFC3j4vTp09koCxkeHjqbHJqArmH5UrFKKksnxrK7FuRIs8STfBZv+luugXZ2pR/pP9Ois4z+TiMzUUkUjD0iEi1fzX8GmXyuxUBRcaUfykV0YZnlJGKQpOiGB76x5GeWkWWJc3mOrK6S7xdND+W5N6XyaRgtWJFe13GkaZnKOsYqGdOVVVbGupsyA/l7emTLHi7vwTdirNEt0qxnzAvBFcnQF16xh/TMpUuXHDowhlA9vQVraQhkudRdzOnK+04ZSP3DUhVSP61YsaLtd/ks7ZgtPcXqPqEafHkdqa84X6aCeL7YWlv6edGFHb+ZFICPlljHhg0bKuk0CSvVznWsotRu433alNdFrqG45ejoaPCaUkWERpLXjzFL2Rpllp7PJU2a/v7Ab8N05/9t27Z16KUqoFGsxnI9EosS2niSYg9SpU6B4JgTrvVW1flt1sT+0ADIJU2maXzcUTraGCRaL1Wp9rUMk16PMom8QhruxzvZIegJjFU7LLCePfS8uaQdPny4jTTL0dbee5mYokQsXTIWNY46kuMbnt8Kmec+LGWtOVIl9cT1rCB0V8WqkjAsRwta93TbwNYoGKsUSChN44lgBNCoHLHzquYKrU6qZ8lolCIN0Rh6cP0Q3U6I6IXILYOQI513hJaSKAorFpuHXJNfVlpRtmYBk1Su1obZr5dnKAO+L10Hrj3WZW+E3qh6IszE37F6EB+68mGpvKm4eb9bFrlzrok7fvr0Kfv727dvWRmdVTJHw0qiiCUSZ6wCK+7XL/AcsgNyL74DQQ730sv78Su7+t/A36MdY0sW5o40ahslXr58aZ5HtZB8GH64m9EmMZ7FpYw4T6QnrZfgenrhFxaSiSGXtPnz57e9TkNZLvTjeqhr734CNtrK41L40sUQckmj1lGKQ0rC37x544r8eNXRpnVE3ZZY7zXo8NomiO0ZUCj2uHz58rbXoZ6gc0uA+F6ZeKS/jhRDUq8MKrTho9fEkihMmhxtBI1DxKFY9XLpVcSkfoi8JGnToZO5sU5aiDQIW716ddt7ZLYtMQlhECdBGXZZMWldY5BHm5xgAroWj4C0hbYkSc/jBmggIrXJWlZM6pSETsEPGqZOndr2uuuR5rF169a2HoHPdurUKZM4CO1WTPqaDaAd+GFGKdIQkxAn9RuEWcTRyN2KSUgiSgF5aWzPTeA/lN5rZubMmR2bE4SIC4nJoltgAV/dVefZm72AtctUCJU2CMJ327hxY9t7EHbkyJFseq+EJSY16RPo3Dkq1kkr7+q0bNmyDuLQcZBEPYmHVdOBiJyIlrRDq41YPWfXOxUysi5fvtyaj+2BpcnsUV/oSoEMOk2CQGlr4ckhBwaetBhjCwH0ZHtJROPJkyc7UjcYLDjmrH7ADTEBXFfOYmB0k9oYBOjJ8b4aOYSe7QkKcYhFlq3QYLQhSidNmtS2RATwy8YOM3EQJsUjKiaWZ+vZToUQgzhkHXudb/PW5YMHD9yZM2faPsMwoc7RciYJXbGuBqJ1UIGKKLv915jsvgtJx
CZDubdXr165mzdvtr1Hz5LONA8jrUwKPqsmVesKa49S3Q4WxmRPUEYdTjgiUcfUwLx589ySJUva3oMkP6IYddq6HMS4o55xBJBUeRjzfa4Zdeg56QZ43LhxoyPo7Lf1kNt7oO8wWAbNwaYjIv5lhyS7kRf96dvm5Jah8vfvX3flyhX35cuX6HfzFHOToS1H4BenCaHvO8pr8iDuwoUL7tevX+b5ZdbBair0xkFIlFDlW4ZknEClsp/TzXyAKVOmmHWFVSbDNw1l1+4f90U6IY/q4V27dpnE9bJ+v87QEydjqx/UamVVPRG+mwkNTYN+9tjkwzEx+atCm/X9WvWtDtAb68Wy9LXa1UmvCDDIpPkyOQ5ZwSzJ4jMrvFcr0rSjOUh+GcT4LSg5ugkW1Io0/SCDQBojh0hPlaJdah+tkVYrnTZowP8iq1F1TgMBBauufyB33x1v+NWFYmT5KmppgHC+NkAgbmRkpD3yn9QIseXymoTQFGQmIOKTxiZIWpvAatenVqRVXf2nTrAWMsPnKrMZHz6bJq5jvce6QK8J1cQNgKxlJapMPdZSR64/UivS9NztpkVEdKcrs5alhhWP9NeqlfWopzhZScI6QxseegZRGeg5a8C3Re1Mfl1ScP36ddcUaMuv24iOJtz7sbUjTS4qBvKmstYJoUauiuD3k5qhyr7QdUHMeCgLa1Ear9NquemdXgmum4fvJ6w1lqsuDhNrg1qSpleJK7K3TF0Q2jSd94uSZ60kK1e3qyVpQK6PVWXp2/FC3mp6jBhKKOiY2h3gtUV64TWM6wDETRPLDfSakXmH3w8g9Jlug8ZtTt4kVF0kLUYYmCCtD/DrQ5YhMGbA9L3ucdjh0y8kOHW5gU/VEEmJTcL4Pz/f7mgoAbYkAAAAAElFTkSuQmCC",
-            },
-        ],
        },
-    ],
+      ],
-})
+    },
+  ],
+});
 const completion = await openai.completions.create({
-    model: "llama3.2",
+  model: "llama3.2",
-    prompt: "Say this is a test.",
+  prompt: "Say this is a test.",
-})
+});
-const listCompletion = await openai.models.list()
+const listCompletion = await openai.models.list();
-const model = await openai.models.retrieve("llama3.2")
+const model = await openai.models.retrieve("llama3.2");
 const embedding = await openai.embeddings.create({
  model: "all-minilm",
  input: ["why is the sky blue?", "why is the grass green?"],
-})
+});
 ```
 ### `curl`
@@ -306,8 +306,8 @@ curl http://localhost:11434/v1/embeddings \
  - [x] array of strings
  - [ ] array of tokens
  - [ ] array of token arrays
- [ ] `encoding format`
+- [x] `encoding format`
- [ ] `dimensions`
+- [x] `dimensions`
 - [ ] `user`
 ## Models
@@ -365,4 +365,4 @@ curl http://localhost:11434/v1/chat/completions \
            }
        ]
    }'
 ```
\ No newline at end of file
--- a/docs/api/streaming.mdx
+++ b/docs/api/streaming.mdx
+---
+title: Streaming
+---
+Certain API endpoints stream responses by default, such as `/api/generate`. These responses are provided in the newline-delimited JSON format (i.e. the `application/x-ndjson` content type). For example:
+```json
+{"model":"gemma3","created_at":"2025-10-26T17:15:24.097767Z","response":"That","done":false}
+{"model":"gemma3","created_at":"2025-10-26T17:15:24.109172Z","response":"'","done":false}
+{"model":"gemma3","created_at":"2025-10-26T17:15:24.121485Z","response":"s","done":false}
+{"model":"gemma3","created_at":"2025-10-26T17:15:24.132802Z","response":" a","done":false}
+{"model":"gemma3","created_at":"2025-10-26T17:15:24.143931Z","response":" fantastic","done":false}
+{"model":"gemma3","created_at":"2025-10-26T17:15:24.155176Z","response":" question","done":false}
+{"model":"gemma3","created_at":"2025-10-26T17:15:24.166576Z","response":"!","done":true, "done_reason": "stop"}
+```
+## Disabling streaming
+Streaming can be disabled by providing `{"stream": false}` in the request body for any endpoint that supports streaming. This will cause responses to be returned in the `application/json` format instead:
+```json
+{"model":"gemma3","created_at":"2025-10-26T17:15:24.166576Z","response":"That's a fantastic question!","done":true}
+```
+## When to use streaming vs non-streaming
+**Streaming (default)**:
+  - Real-time response generation
+  - Lower perceived latency
+  - Better for long generations
+**Non-streaming**:
+  - Simpler to process
+  - Better for short responses, or structured outputs
+  - Easier to handle in some applications
\ No newline at end of file
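To illustrate how the two modes relate, here is a small sketch that joins the streamed `response` fragments into the single string a non-streaming call would return. It assumes the NDJSON body is already available as text; `join_stream` is a hypothetical helper, and the sample lines are the ones from the streaming example with timestamps omitted.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate `response` fragments until a chunk reports done=true."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Chunks from the streaming example, with `created_at` omitted for brevity.
NDJSON = """\
{"model":"gemma3","response":"That","done":false}
{"model":"gemma3","response":"'","done":false}
{"model":"gemma3","response":"s","done":false}
{"model":"gemma3","response":" a","done":false}
{"model":"gemma3","response":" fantastic","done":false}
{"model":"gemma3","response":" question","done":false}
{"model":"gemma3","response":"!","done":true,"done_reason":"stop"}
"""

print(join_stream(NDJSON.splitlines()))  # prints: That's a fantastic question!
```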
--- a/docs/api/usage.mdx
+++ b/docs/api/usage.mdx
+---
+title: Usage
+---
+Ollama's API responses include metrics that can be used for measuring performance and model usage:
+* `total_duration`: How long the response took to generate
+* `load_duration`: How long the model took to load
+* `prompt_eval_count`: How many input tokens were processed
+* `prompt_eval_duration`: How long it took to evaluate the prompt
+* `eval_count`: How many output tokens were generated
+* `eval_duration`: How long it took to generate the output tokens
+All timing values are measured in nanoseconds.
+## Example response
+For endpoints that return usage metrics, the response body will include the usage fields. For example, a non-streaming call to `/api/generate` may return the following response:
+```json
+{
+  "model": "gemma3",
+  "created_at": "2025-10-17T23:14:07.414671Z",
+  "response": "Hello! How can I help you today?",
+  "done": true,
+  "done_reason": "stop",
+  "total_duration": 174560334,
+  "load_duration": 101397084,
+  "prompt_eval_count": 11,
+  "prompt_eval_duration": 13074791,
+  "eval_count": 18,
+  "eval_duration": 52479709
+}
+```
+For endpoints that return **streaming responses**, usage fields are included as part of the final chunk, where `done` is `true`.
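Since all durations are reported in nanoseconds, throughput has to be derived by the caller. The sketch below shows one way to do that, assuming a response dictionary with the fields listed above; `tokens_per_second` and `summarize_usage` are hypothetical helper names, and the sample values are copied from the example response.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000
NANOSECONDS_PER_MILLISECOND = 1_000_000


def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens per second."""
    if duration_ns <= 0:
        return 0.0  # avoid dividing by zero when a duration is missing or zero
    return count / (duration_ns / NANOSECONDS_PER_SECOND)


def summarize_usage(response: dict) -> dict:
    """Derive millisecond timings and throughput from the raw usage fields."""
    return {
        "total_ms": response["total_duration"] / NANOSECONDS_PER_MILLISECOND,
        "load_ms": response["load_duration"] / NANOSECONDS_PER_MILLISECOND,
        "prompt_tok_per_s": tokens_per_second(
            response["prompt_eval_count"], response["prompt_eval_duration"]
        ),
        "gen_tok_per_s": tokens_per_second(
            response["eval_count"], response["eval_duration"]
        ),
    }


# Usage fields copied from the example response above.
example = {
    "total_duration": 174560334,
    "load_duration": 101397084,
    "prompt_eval_count": 11,
    "prompt_eval_duration": 13074791,
    "eval_count": 18,
    "eval_duration": 52479709,
}
print(summarize_usage(example))
```

For streaming responses, the same function would be applied to the final chunk only, since that is where the usage fields appear.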
--- a/docs/benchmark.mdx
+++ b/docs/benchmark.mdx
+---
+title: Benchmark
+---
+These Go benchmark tests measure the end-to-end performance of a running Ollama server. Run them to evaluate model inference performance on your hardware and to measure the impact of code changes.
+## When to use
+Run these benchmarks when:
+- Making changes to the model inference engine
+- Modifying model loading/unloading logic
+- Changing prompt processing or token generation code
+- Implementing a new model architecture
+- Testing performance across different hardware setups
+## Prerequisites
+- Ollama server running locally with `ollama serve` on `127.0.0.1:11434`
+## Usage and Examples
+<Note>
+  All commands must be run from the root directory of the Ollama project.
+</Note>
+Basic syntax:
+```bash
+go test -bench=. ./benchmark/... -m $MODEL_NAME
+```
+Required flags:
+- `-bench=.`: Run all benchmarks
+- `-m`: Model name to benchmark
+Optional flags:
+- `-count N`: Number of times to run the benchmark (useful for statistical analysis)
+- `-timeout T`: Maximum time for the benchmark to run (e.g. "10m" for 10 minutes)
+Common usage patterns:
+Single benchmark run with a model specified:
+```bash
+go test -bench=. ./benchmark/... -m llama3.3
+```
+## Output metrics
+The benchmark reports several key metrics:
+- `gen_tok/s`: Generated tokens per second
+- `prompt_tok/s`: Prompt processing tokens per second
+- `ttft_ms`: Time to first token in milliseconds
+- `load_ms`: Model load time in milliseconds
+- `gen_tokens`: Total tokens generated
+- `prompt_tokens`: Total prompt tokens processed
+Each benchmark runs two scenarios:
+- Cold start: Model is loaded from disk for each test
+- Warm start: Model is pre-loaded in memory
+Three prompt lengths are tested for each scenario:
+- Short prompt (100 tokens)
+- Medium prompt (500 tokens)
+- Long prompt (1000 tokens)
--- a/docs/capabilities/embeddings.mdx
+++ b/docs/capabilities/embeddings.mdx
+---
+title: Embeddings
+description: Generate text embeddings for semantic search, retrieval, and RAG.
+---
+Embeddings turn text into numeric vectors you can store in a vector database, search with cosine similarity, or use in RAG pipelines. The vector length depends on the model (typically 384–1024 dimensions).
+## Recommended models
+- [embeddinggemma](https://ollama.com/library/embeddinggemma)
+- [qwen3-embedding](https://ollama.com/library/qwen3-embedding)
+- [all-minilm](https://ollama.com/library/all-minilm)
+## Generate embeddings
+Use `/api/embed` with a single string.
+<Tabs>
+  <Tab title="cURL">
+    ```shell
+    curl -X POST http://localhost:11434/api/embed \
+      -H "Content-Type: application/json" \
+      -d '{
+        "model": "embeddinggemma",
+        "input": "The quick brown fox jumps over the lazy dog."
+      }'
+    ```
+  </Tab>
+  <Tab title="Python">
+    ```python
+    import ollama
+    single = ollama.embed(
+      model='embeddinggemma',
+      input='The quick brown fox jumps over the lazy dog.'
+    )
+    print(len(single['embeddings'][0]))  # vector length
+    ```
+  </Tab>
+  <Tab title="JavaScript">
+    ```javascript
+    import ollama from 'ollama'
+    const single = await ollama.embed({
+      model: 'embeddinggemma',
+      input: 'The quick brown fox jumps over the lazy dog.',
+    })
+    console.log(single.embeddings[0].length) // vector length
+    ```
+  </Tab>
+</Tabs>
+<Note>
+  The `/api/embed` endpoint returns L2‑normalized (unit‑length) vectors.
+</Note>
+## Generate a batch of embeddings
+Pass an array of strings to `input`.
+<Tabs>
+  <Tab title="cURL">
+    ```shell
+    curl -X POST http://localhost:11434/api/embed \
+      -H "Content-Type: application/json" \
+      -d '{
+        "model": "embeddinggemma",
+        "input": [
+          "First sentence",
+          "Second sentence",
+          "Third sentence"
+        ]
+      }'
+    ```
+  </Tab>
+  <Tab title="Python">
+    ```python
+    import ollama
+    batch = ollama.embed(
+      model='embeddinggemma',
+      input=[
+        'The quick brown fox jumps over the lazy dog.',
+        'The five boxing wizards jump quickly.',
+        'Jackdaws love my big sphinx of quartz.',
+      ]
+    )
+    print(len(batch['embeddings']))  # number of vectors
+    ```
+  </Tab>
+  <Tab title="JavaScript">
+    ```javascript
+    import ollama from 'ollama'
+    const batch = await ollama.embed({
+      model: 'embeddinggemma',
+      input: [
+        'The quick brown fox jumps over the lazy dog.',
+        'The five boxing wizards jump quickly.',
+        'Jackdaws love my big sphinx of quartz.',
+      ],
+    })
+    console.log(batch.embeddings.length) // number of vectors
+    ```
+  </Tab>
+</Tabs>
+## Tips
+- Use cosine similarity for most semantic search use cases.
+- Use the same embedding model for both indexing and querying.
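The cosine similarity tip can be sketched in plain Python as follows. This is only an illustration with tiny made-up vectors rather than real model output; `dot`, `cosine_similarity`, and `rank_by_similarity` are hypothetical helper names. Because `/api/embed` is described above as returning unit-length vectors, the dot product alone should give the same ranking for its output.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float], documents: Sequence[Sequence[float]]) -> list[int]:
    """Return document indices ordered from most to least similar to the query."""
    return sorted(
        range(len(documents)),
        key=lambda i: cosine_similarity(query, documents[i]),
        reverse=True,
    )


# Toy 2-dimensional vectors standing in for real embeddings.
docs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
print(rank_by_similarity([1.0, 0.0], docs))
```

A vector database would normally perform this ranking, but the same comparison is what it computes under the hood for cosine-based indexes.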
--- a/docs/capabilities/streaming.mdx
+++ b/docs/capabilities/streaming.mdx
+---
+title: Streaming
+---
+Streaming allows you to render text as it is produced by the model. 
+Streaming is enabled by default through the REST API, but disabled by default in the SDKs.
+To enable streaming in the SDKs, set the `stream` parameter to `True` in Python or `true` in JavaScript.
+## Key streaming concepts
+1. Chatting: Stream partial assistant messages. Each chunk includes the `content` so you can render messages as they arrive.
+1. Thinking: Thinking-capable models emit a `thinking` field alongside regular content in each chunk. Detect this field in streaming chunks to show or hide reasoning traces before the final answer arrives.
+1. Tool calling: Watch for streamed `tool_calls` in each chunk, execute the requested tool, and append tool outputs back into the conversation.
+## Handling streamed chunks
+<Note> It is necessary to accumulate the partial fields in order to maintain the history of the conversation. This is particularly important for tool calling where the thinking, tool call from the model, and the executed tool result must be passed back to the model in the next request. </Note>
+<Tabs>
+  <Tab title="Python">
+    ```python
+    from ollama import chat
+    stream = chat(
+      model='qwen3',
+      messages=[{'role': 'user', 'content': 'What is 17 × 23?'}],
+      stream=True,
+    )
+    in_thinking = False
+    content = ''
+    thinking = ''
+    for chunk in stream:
+      if chunk.message.thinking:
+        if not in_thinking:
+          in_thinking = True
+          print('Thinking:\n', end='', flush=True)
+        print(chunk.message.thinking, end='', flush=True)
+        # accumulate the partial thinking 
+        thinking += chunk.message.thinking
+      elif chunk.message.content:
+        if in_thinking:
+          in_thinking = False
+          print('\n\nAnswer:\n', end='', flush=True)
+        print(chunk.message.content, end='', flush=True)
+        # accumulate the partial content
+        content += chunk.message.content
+    # append the accumulated fields to the messages for the next request
+    new_messages = [{'role': 'assistant', 'thinking': thinking, 'content': content}]
+    ```
+  </Tab>
+  <Tab title="JavaScript">
+    ```javascript
+    import ollama from 'ollama'
+    async function main() {
+      const stream = await ollama.chat({
+        model: 'qwen3',
+        messages: [{ role: 'user', content: 'What is 17 × 23?' }],
+        stream: true,
+      })
+      let inThinking = false
+      let content = ''
+      let thinking = ''
+      for await (const chunk of stream) {
+        if (chunk.message.thinking) {
+          if (!inThinking) {
+            inThinking = true
+            process.stdout.write('Thinking:\n')
+          }
+          process.stdout.write(chunk.message.thinking)
+          // accumulate the partial thinking
+          thinking += chunk.message.thinking
+        } else if (chunk.message.content) {
+          if (inThinking) {
+            inThinking = false
+            process.stdout.write('\n\nAnswer:\n')
+          }
+          process.stdout.write(chunk.message.content)
+          // accumulate the partial content
+          content += chunk.message.content
+        }
+      }
+      // append the accumulated fields to the messages for the next request
+      const new_messages = [{ role: 'assistant', thinking: thinking, content: content }]
+    }
+    main().catch(console.error)
+    ```
+  </Tab>
+</Tabs>
\ No newline at end of file
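The accumulation step described in the note can also be written against plain dictionaries, which is roughly what a client sees when reading the REST stream directly. This is a sketch under that assumption: `accumulate_message` is a hypothetical helper, and the sample chunks (including the tool call shape) are illustrative rather than captured from a real model.

```python
from typing import Iterable


def accumulate_message(chunks: Iterable[dict]) -> dict:
    """Merge streamed chat chunks into one assistant message for the history."""
    thinking_parts: list[str] = []
    content_parts: list[str] = []
    tool_calls: list[dict] = []
    for chunk in chunks:
        message = chunk.get("message", {})
        thinking_parts.append(message.get("thinking") or "")
        content_parts.append(message.get("content") or "")
        tool_calls.extend(message.get("tool_calls") or [])

    assistant = {"role": "assistant", "content": "".join(content_parts)}
    thinking = "".join(thinking_parts)
    if thinking:
        assistant["thinking"] = thinking
    if tool_calls:
        assistant["tool_calls"] = tool_calls
    return assistant


# Illustrative chunks only; real chunks carry more fields (model, created_at, done, ...).
sample_chunks = [
    {"message": {"role": "assistant", "thinking": "17 x 23 = "}},
    {"message": {"role": "assistant", "thinking": "391."}},
    {"message": {"role": "assistant", "content": "The answer "}},
    {"message": {"role": "assistant", "content": "is 391."}},
]
history = [{"role": "user", "content": "What is 17 × 23?"}]
history.append(accumulate_message(sample_chunks))
print(history[-1])
```

Keeping thinking, content, and tool calls on a single assistant message means the next request can send `history` unchanged, followed by any tool results.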
--- a/docs/capabilities/structured-outputs.mdx
+++ b/docs/capabilities/structured-outputs.mdx
+---
+title: Structured Outputs
+---
+Structured outputs let you enforce a JSON schema on model responses so you can reliably extract structured data, describe images, or keep every reply consistent.
+## Generating structured JSON
+<Tabs>
+  <Tab title="cURL">
+    ```shell
+    curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
+      "model": "gpt-oss",
+      "messages": [{"role": "user", "content": "Tell me about Canada in one line"}],
+      "stream": false,
+      "format": "json"
+    }'
+    ```
+  </Tab>
+  <Tab title="Python">
+    ```python
+    from ollama import chat
+    response = chat(
+      model='gpt-oss',
+      messages=[{'role': 'user', 'content': 'Tell me about Canada.'}],
+      format='json'
+    )
+    print(response.message.content)
+    ```
+  </Tab>
+  <Tab title="JavaScript">
+    ```javascript
+    import ollama from 'ollama'
+    const response = await ollama.chat({
+      model: 'gpt-oss',
+      messages: [{ role: 'user', content: 'Tell me about Canada.' }],
+      format: 'json'
+    })
+    console.log(response.message.content)
+    ```
+  </Tab>
+</Tabs>
+## Generating structured JSON with a schema
+Provide a JSON schema to the `format` field.
+<Note>
+  It is ideal to also pass the JSON schema as a string in the prompt to ground the model's response.
+</Note>
+<Tabs>
+  <Tab title="cURL">
+    ```shell
+    curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
+      "model": "gpt-oss",
+      "messages": [{"role": "user", "content": "Tell me about Canada."}],
+      "stream": false,
+      "format": {
+        "type": "object",
+        "properties": {
+          "name": {"type": "string"},
+          "capital": {"type": "string"},
+          "languages": {
+            "type": "array",
+            "items": {"type": "string"}
+          }
+        },
+        "required": ["name", "capital", "languages"]
+      }
+    }'
+    ```
+  </Tab>
+  <Tab title="Python">
+    Use Pydantic models and pass `model_json_schema()` to `format`, then validate the response:
+    ```python
+    from ollama import chat
+    from pydantic import BaseModel
+    class Country(BaseModel):
+      name: str
+      capital: str
+      languages: list[str]
+    response = chat(
+      model='gpt-oss',
+      messages=[{'role': 'user', 'content': 'Tell me about Canada.'}],
+      format=Country.model_json_schema(),
+    )
+    country = Country.model_validate_json(response.message.content)
+    print(country)
+    ```
+  </Tab>
+  <Tab title="JavaScript">
+    Serialize a Zod schema with `zodToJsonSchema()` and parse the structured response:
+    ```javascript
+    import ollama from 'ollama'
+    import { z } from 'zod'
+    import { zodToJsonSchema } from 'zod-to-json-schema'
+    const Country = z.object({
+      name: z.string(),
+      capital: z.string(),
+      languages: z.array(z.string()),
+    })
+    const response = await ollama.chat({
+      model: 'gpt-oss',
+      messages: [{ role: 'user', content: 'Tell me about Canada.' }],
+      format: zodToJsonSchema(Country),
+    })
+    const country = Country.parse(JSON.parse(response.message.content))
+    console.log(country)
+    ```
+  </Tab>
+</Tabs>
+## Example: Extract structured data
+Define the objects you want returned and let the model populate the fields:
+```python
+from ollama import chat
+from pydantic import BaseModel
+class Pet(BaseModel):
+  name: str
+  animal: str
+  age: int
+  color: str | None
+  favorite_toy: str | None
+class PetList(BaseModel):
+  pets: list[Pet]
+response = chat(
+  model='gpt-oss',
+  messages=[{'role': 'user', 'content': 'I have two cats named Luna and Loki...'}],
+  format=PetList.model_json_schema(),
+)
+pets = PetList.model_validate_json(response.message.content)
+print(pets)
+```
+## Example: Vision with structured outputs
+Vision models accept the same `format` parameter, so image descriptions come back in a consistent, parseable structure:
+```python
+from ollama import chat
+from pydantic import BaseModel
+from typing import Literal, Optional
+class Object(BaseModel):
+  name: str
+  confidence: float
+  attributes: str
+class ImageDescription(BaseModel):
+  summary: str
+  objects: list[Object]
+  scene: str
+  colors: list[str]
+  time_of_day: Literal['Morning', 'Afternoon', 'Evening', 'Night']
+  setting: Literal['Indoor', 'Outdoor', 'Unknown']
+  text_content: Optional[str] = None
+response = chat(
+  model='gemma3',
+  messages=[{
+    'role': 'user',
+    'content': 'Describe this photo and list the objects you detect.',
+    'images': ['path/to/image.jpg'],
+  }],
+  format=ImageDescription.model_json_schema(),
+  options={'temperature': 0},
+)
+image_description = ImageDescription.model_validate_json(response.message.content)
+print(image_description)
+```
+## Tips for reliable structured outputs
+- Define schemas with Pydantic (Python) or Zod (JavaScript) so they can be reused for validation.
+- Lower the temperature (e.g., set it to `0`) for more deterministic completions.
+- Structured outputs also work through the OpenAI-compatible API via the `response_format` parameter.
--- a/docs/capabilities/thinking.mdx
+++ b/docs/capabilities/thinking.mdx
+---
+title: Thinking
+---
+Thinking-capable models emit a `thinking` field that separates their reasoning trace from the final answer. 
+Use this capability to audit model steps, animate the model *thinking* in a UI, or hide the trace entirely when you only need the final response.
+## Supported models
+- [Qwen 3](https://ollama.com/library/qwen3)
+- [GPT-OSS](https://ollama.com/library/gpt-oss) *(use `think` levels: `low`, `medium`, `high` — the trace cannot be fully disabled)*
+- [DeepSeek-v3.1](https://ollama.com/library/deepseek-v3.1)
+- [DeepSeek R1](https://ollama.com/library/deepseek-r1)
+- Browse the latest additions under [thinking models](https://ollama.com/search?c=thinking)
+## Enable thinking in API calls
+Set the `think` field on chat or generate requests. Most models accept booleans (`true`/`false`).
+GPT-OSS instead expects one of `low`, `medium`, or `high` to tune the trace length. 
+The `message.thinking` (chat endpoint) or `thinking` (generate endpoint) field contains the reasoning trace while `message.content` / `response` holds the final answer.
+<Tabs>
+  <Tab title="cURL">
+    ```shell
+    curl http://localhost:11434/api/chat -d '{
+      "model": "qwen3",
+      "messages": [{
+        "role": "user",
+        "content": "How many letter r are in strawberry?"
+      }],
+      "think": true,
+      "stream": false
+    }'
+    ```
+  </Tab>
+  <Tab title="Python">
+    ```python
+    from ollama import chat
+    response = chat(
+      model='qwen3',
+      messages=[{'role': 'user', 'content': 'How many letter r are in strawberry?'}],
+      think=True,
+      stream=False,
+    )
+    print('Thinking:\n', response.message.thinking)
+    print('Answer:\n', response.message.content)
+    ```
+  </Tab>
+  <Tab title="JavaScript">
+    ```javascript
+    import ollama from 'ollama'
+    const response = await ollama.chat({
+      model: 'deepseek-r1',
+      messages: [{ role: 'user', content: 'How many letter r are in strawberry?' }],
+      think: true,
+      stream: false,
+    })
+    console.log('Thinking:\n', response.message.thinking)
+    console.log('Answer:\n', response.message.content)
+    ```
+  </Tab>
+</Tabs>
+<Note>
+  GPT-OSS requires `think` to be set to `"low"`, `"medium"`, or `"high"`. Boolean values (`true`/`false`) are ignored for that model.
+</Note>
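The model-specific rule above can be wrapped in a small helper. This is a minimal sketch, not part of the Ollama libraries: the `think_value` name, the `gpt-oss` prefix check, and the `medium` default are assumptions made for illustration.

```python
# Hypothetical helper (not part of the ollama package): choose a `think`
# value that follows the rules described above.
LEVELS = ('low', 'medium', 'high')


def think_value(model: str, enabled: bool = True, level: str = 'medium'):
  """Return a value suitable for the `think` request field."""
  # Assumption: GPT-OSS variants are identified by a `gpt-oss` name prefix.
  if model.startswith('gpt-oss'):
    if level not in LEVELS:
      raise ValueError(f'level must be one of {LEVELS}')
    return level  # GPT-OSS takes a level; booleans are ignored
  return enabled  # other thinking models take a boolean


print(think_value('qwen3'))                     # True
print(think_value('gpt-oss:20b', level='low'))  # low
```

The result could then be passed as `think=think_value(model)` in a `chat` call.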
+## Stream the reasoning trace
+Streamed responses emit reasoning tokens before answer tokens. Detect the first `thinking` chunk to render a "thinking" section, then switch to the final reply once `message.content` arrives.
+<Tabs>
+  <Tab title="Python">
+    ```python
+    from ollama import chat
+    stream = chat(
+      model='qwen3',
+      messages=[{'role': 'user', 'content': 'What is 17 × 23?'}],
+      think=True,
+      stream=True,
+    )
+    in_thinking = False
+    for chunk in stream:
+      if chunk.message.thinking and not in_thinking:
+        in_thinking = True
+        print('Thinking:\n', end='')
+      if chunk.message.thinking:
+        print(chunk.message.thinking, end='')
+      elif chunk.message.content:
+        if in_thinking:
+          print('\n\nAnswer:\n', end='')
+          in_thinking = False
+        print(chunk.message.content, end='')
+    ```
+  </Tab>
+  <Tab title="JavaScript">
+    ```javascript
+    import ollama from 'ollama'
+    async function main() {
+      const stream = await ollama.chat({
+        model: 'qwen3',
+        messages: [{ role: 'user', content: 'What is 17 × 23?' }],
+        think: true,
+        stream: true,
+      })
+      let inThinking = false
+      for await (const chunk of stream) {
+        if (chunk.message.thinking && !inThinking) {
+          inThinking = true
+          process.stdout.write('Thinking:\n')
+        }
+        if (chunk.message.thinking) {
+          process.stdout.write(chunk.message.thinking)
+        } else if (chunk.message.content) {
+          if (inThinking) {
+            process.stdout.write('\n\nAnswer:\n')
+            inThinking = false
+          }
+          process.stdout.write(chunk.message.content)
+        }
+      }
+    }
+    main()
+    ```
+  </Tab>
+</Tabs>
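When the trace and the answer need to be kept as separate strings rather than printed as they arrive, the same loop can accumulate them. The sketch below runs offline on simulated chunks; the plain-dict chunk shape is an assumption that mirrors the `message.thinking` and `message.content` fields used above, and `split_stream` is a hypothetical name.

```python
# Hypothetical helper: collect a thinking stream into two strings.
def split_stream(chunks):
  """Accumulate streamed chunks into (thinking, answer) strings."""
  thinking, answer = [], []
  for chunk in chunks:
    message = chunk.get('message', {})
    if message.get('thinking'):
      thinking.append(message['thinking'])
    if message.get('content'):
      answer.append(message['content'])
  return ''.join(thinking), ''.join(answer)


# Simulated chunks standing in for a real stream (assumed shape).
sample = [
  {'message': {'thinking': '17 * 23 = 17 * 20 '}},
  {'message': {'thinking': '+ 17 * 3'}},
  {'message': {'content': '391'}},
]
thinking, answer = split_stream(sample)
print('Thinking:', thinking)
print('Answer:', answer)
```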
+## CLI quick reference
+- Enable thinking for a single run: `ollama run deepseek-r1 --think "Where should I visit in Lisbon?"`
+- Disable thinking: `ollama run deepseek-r1 --think=false "Summarize this article"`
+- Hide the trace while still using a thinking model: `ollama run deepseek-r1 --hidethinking "Is 9.9 bigger or 9.11?"`
+- Inside interactive sessions, toggle with `/set think` or `/set nothink`.
+- GPT-OSS only accepts levels: `ollama run gpt-oss --think=low "Draft a headline"` (replace `low` with `medium` or `high` as needed).
+<Note>Thinking is enabled by default in the CLI and API for supported models.</Note>
--- a/docs/capabilities/tool-calling.mdx
+++ b/docs/capabilities/tool-calling.mdx
--- a/docs/capabilities/vision.mdx
+++ b/docs/capabilities/vision.mdx
+---
+title: Vision
+---
+Vision models accept images alongside text so the model can describe, classify, and answer questions about what it sees.
+## Quick start
+```shell
+ollama run gemma3 "What's in this image? ./image.png"
+```
+## Usage with Ollama's API
+Provide an `images` array. SDKs accept file paths, URLs or raw bytes while the REST API expects base64-encoded image data.
+<Tabs>
+  <Tab title="cURL">
+    ```shell
+    # 1. Download a sample image
+    curl -L -o test.jpg "https://upload.wikimedia.org/wikipedia/commons/3/3a/Cat03.jpg"
+    # 2. Encode the image
+    IMG=$(base64 < test.jpg | tr -d '\n')
+    # 3. Send it to Ollama
+    curl -X POST http://localhost:11434/api/chat \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "gemma3",
+        "messages": [{
+        "role": "user",
+        "content": "What is in this image?",
+        "images": ["'"$IMG"'"]
+        }],
+        "stream": false
+    }'
+    ```
+  </Tab>
+  <Tab title="Python">
+    ```python
+    from ollama import chat
+    # import base64
+    # from pathlib import Path
+    # Pass in the path to the image
+    path = input('Please enter the path to the image: ')
+    # You can also pass in base64 encoded image data
+    # img = base64.b64encode(Path(path).read_bytes()).decode()
+    # or the raw bytes
+    # img = Path(path).read_bytes()
+    response = chat(
+      model='gemma3',
+      messages=[
+        {
+          'role': 'user',
+          'content': 'What is in this image? Be concise.',
+          'images': [path],
+        }
+      ],
+    )
+    print(response.message.content)
+    ```
+  </Tab>
+  <Tab title="JavaScript">
+    ```javascript
+    import ollama from 'ollama'
+    const imagePath = '/absolute/path/to/image.jpg'
+    const response = await ollama.chat({
+      model: 'gemma3',
+      messages: [
+        { role: 'user', content: 'What is in this image?', images: [imagePath] }
+      ],
+      stream: false,
+    })
+    console.log(response.message.content)
+    ```
+  </Tab>
+</Tabs>
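Because the REST API expects base64-encoded image data, a client that does not use an SDK has to encode the bytes itself. The following is a minimal sketch using only the standard library; `build_vision_payload` is a hypothetical helper, and the placeholder bytes stand in for a real image file.

```python
import base64
import json


def build_vision_payload(image_bytes: bytes, prompt: str, model: str = 'gemma3') -> str:
  """Return a JSON body for POST /api/chat with one base64-encoded image."""
  encoded = base64.b64encode(image_bytes).decode('ascii')
  return json.dumps({
    'model': model,
    'messages': [{'role': 'user', 'content': prompt, 'images': [encoded]}],
    'stream': False,
  })


# Placeholder bytes; in practice read them with Path('image.jpg').read_bytes().
body = build_vision_payload(b'\x89PNG placeholder', 'What is in this image?')
print(body[:60])
```

The resulting string can be sent as the request body to `http://localhost:11434/api/chat`, as in the cURL example.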
--- a/docs/capabilities/web-search.mdx
+++ b/docs/capabilities/web-search.mdx
+---
+title: Web search
+---
+Ollama's web search API can be used to augment models with the latest information to reduce hallucinations and improve accuracy.
+Web search is provided as a REST API with deeper tool integrations in the Python and JavaScript libraries. This also enables models like OpenAI’s gpt-oss models to conduct long-running research tasks.
+## Authentication
+For access to Ollama's web search API, create an [API key](https://ollama.com/settings/keys). A free Ollama account is required.
+## Web search API
+Performs a web search for a single query and returns relevant results.
+### Request
+`POST https://ollama.com/api/web_search`
+- `query` (string, required): the search query string
+- `max_results` (integer, optional): maximum results to return (default 5, max 10)
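As a sketch of how these parameters fit together, the snippet below builds (but does not send) the request with Python's standard library. `build_web_search_request` is a hypothetical helper, and the lower bound of 1 and the placeholder key used when `OLLAMA_API_KEY` is unset are assumptions for illustration.

```python
import json
import os
import urllib.request


def build_web_search_request(query: str, max_results: int = 5) -> urllib.request.Request:
  """Build (but do not send) a POST request for the web search endpoint."""
  # The documented maximum is 10; the lower bound of 1 is an assumption.
  if not 1 <= max_results <= 10:
    raise ValueError('max_results must be between 1 and 10')
  # Assumption: fall back to a placeholder key so the sketch runs offline.
  api_key = os.environ.get('OLLAMA_API_KEY', 'your_api_key')
  return urllib.request.Request(
    'https://ollama.com/api/web_search',
    data=json.dumps({'query': query, 'max_results': max_results}).encode(),
    headers={'Authorization': f'Bearer {api_key}', 'Content-Type': 'application/json'},
    method='POST',
  )


request = build_web_search_request('what is ollama?', max_results=3)
print(request.get_method(), request.full_url)
print(request.data.decode())
```

Passing the request to `urllib.request.urlopen` would perform the actual call.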
+### Response
+Returns an object containing:
+- `results` (array): array of search result objects, each containing:
+  - `title` (string): the title of the web page
+  - `url` (string): the URL of the web page
+  - `content` (string): relevant content snippet from the web page
+### Examples
+<Note>
+  Ensure `OLLAMA_API_KEY` is set, or pass your API key directly in the Authorization header.
+</Note>
+#### cURL Request
+```bash
+curl https://ollama.com/api/web_search \
+  --header "Authorization: Bearer $OLLAMA_API_KEY" \
+  -d '{
+    "query": "what is ollama?"
+  }'
+```
+**Response**
+```json
+{
+  "results": [
+    {
+      "title": "Ollama",
+      "url": "https://ollama.com/",
+      "content": "Cloud models are now available..."
+    },
+    {
+      "title": "What is Ollama? Introduction to the AI model management tool",
+      "url": "https://www.hostinger.com/tutorials/what-is-ollama",
+      "content": "Ariffud M. 6min Read..."
+    },
+    {
+      "title": "Ollama Explained: Transforming AI Accessibility and Language ...",
+      "url": "https://www.geeksforgeeks.org/artificial-intelligence/ollama-explained-transforming-ai-accessibility-and-language-processing/",
+      "content": "Data Science Data Science Projects Data Analysis..."
+    }
+  ]
+}
+```
+#### Python library
+```python
+import ollama
+response = ollama.web_search("What is Ollama?")
+print(response)
+```
+**Example output**
+```python
+results = [
+    {
+        "title": "Ollama",
+        "url": "https://ollama.com/",
+        "content": "Cloud models are now available in Ollama..."
+    },
+    {
+        "title": "What is Ollama? Features, Pricing, and Use Cases - Walturn",
+        "url": "https://www.walturn.com/insights/what-is-ollama-features-pricing-and-use-cases",
+        "content": "Our services..."
+    },
+    {
+        "title": "Complete Ollama Guide: Installation, Usage & Code Examples",
+        "url": "https://collabnix.com/complete-ollama-guide-installation-usage-code-examples",
+        "content": "Join our Discord Server..."
+    }
+]
+```
+For more, see the [Ollama Python web search example](https://github.com/ollama/ollama-python/blob/main/examples/web-search.py).
+#### JavaScript library
+```tsx
+import { Ollama } from "ollama";
+const client = new Ollama();
+const results = await client.webSearch({ query: "what is ollama?" });
+console.log(JSON.stringify(results, null, 2));
+```
+**Example output**
+```json
+{
+  "results": [
+    {
+      "title": "Ollama",
+      "url": "https://ollama.com/",
+      "content": "Cloud models are now available..."
+    },
+    {
+      "title": "What is Ollama? Introduction to the AI model management tool",
+      "url": "https://www.hostinger.com/tutorials/what-is-ollama",
+      "content": "Ollama is an open-source tool..."
+    },
+    {
+      "title": "Ollama Explained: Transforming AI Accessibility and Language Processing",
+      "url": "https://www.geeksforgeeks.org/artificial-intelligence/ollama-explained-transforming-ai-accessibility-and-language-processing/",
+      "content": "Ollama is a groundbreaking..."
+    }
+  ]
+}
+```
+For more, see the [Ollama JavaScript web search example](https://github.com/ollama/ollama-js/blob/main/examples/websearch/websearch-tools.ts).
+## Web fetch API
+Fetches a single web page by URL and returns its content.
+### Request
+`POST https://ollama.com/api/web_fetch`
+- `url` (string, required): the URL to fetch
+### Response
+Returns an object containing:
+- `title` (string): the title of the web page
+- `content` (string): the main content of the web page
+- `links` (array): array of links found on the page
+### Examples
+#### cURL Request
+```bash
+curl --request POST \
+  --url https://ollama.com/api/web_fetch \
+  --header "Authorization: Bearer $OLLAMA_API_KEY" \
+  --header 'Content-Type: application/json' \
+  --data '{
+      "url": "ollama.com"
+  }'
+```
+**Response**
+```json
+{
+  "title": "Ollama",
+  "content": "[Cloud models](https://ollama.com/blog/cloud-models) are now available in Ollama...",
+  "links": [
+    "http://ollama.com/",
+    "http://ollama.com/models",
+    "https://github.com/ollama/ollama"
+  ]
+}
+```
+#### Python SDK
+```python
+from ollama import web_fetch
+result = web_fetch('https://ollama.com')
+print(result)
+```
+**Result**
+```python
+WebFetchResponse(
+    title='Ollama',
+    content='[Cloud models](https://ollama.com/blog/cloud-models) are now available in Ollama\n\n**Chat & build
+with open models**\n\n[Download](https://ollama.com/download) [Explore
+models](https://ollama.com/models)\n\nAvailable for macOS, Windows, and Linux',
+    links=['https://ollama.com/', 'https://ollama.com/models', 'https://github.com/ollama/ollama']
+)
+```
+#### JavaScript SDK
+```tsx
+import { Ollama } from "ollama";
+const client = new Ollama();
+const fetchResult = await client.webFetch({ url: "https://ollama.com" });
+console.log(JSON.stringify(fetchResult, null, 2));
+```
+**Result**
+```json
+{
+  "title": "Ollama",
+  "content": "[Cloud models](https://ollama.com/blog/cloud-models) are now available in Ollama...",
+  "links": [
+    "https://ollama.com/",
+    "https://ollama.com/models",
+    "https://github.com/ollama/ollama"
+  ]
+}
+```
+## Building a search agent
+Use Ollama’s web search API as a tool to build a mini search agent.
+This example uses Alibaba’s Qwen 3 model with 4B parameters.
+```bash
+ollama pull qwen3:4b
+```
+```python
+from ollama import chat, web_fetch, web_search
+available_tools = {'web_search': web_search, 'web_fetch': web_fetch}
+messages = [{'role': 'user', 'content': "what is ollama's new engine"}]
+while True:
+  response = chat(
+    model='qwen3:4b',
+    messages=messages,
+    tools=[web_search, web_fetch],
+    think=True,
+  )
+  if response.message.thinking:
+    print('Thinking: ', response.message.thinking)
+  if response.message.content:
+    print('Content: ', response.message.content)
+  messages.append(response.message)
+  if response.message.tool_calls:
+    print('Tool calls: ', response.message.tool_calls)
+    for tool_call in response.message.tool_calls:
+      function_to_call = available_tools.get(tool_call.function.name)
+      if function_to_call:
+        args = tool_call.function.arguments
+        result = function_to_call(**args)
+        print('Result: ', str(result)[:200]+'...')
+        # Result is truncated for limited context lengths
+        messages.append({'role': 'tool', 'content': str(result)[:2000 * 4], 'tool_name': tool_call.function.name})
+      else:
+        messages.append({'role': 'tool', 'content': f'Tool {tool_call.function.name} not found', 'tool_name': tool_call.function.name})
+  else:
+    break
+```
+**Result**
+```
+Thinking:  Okay, the user is asking about Ollama's new engine. I need to figure out what they're referring to. Ollama is a company that develops large language models, so maybe they've released a new model or an updated version of their existing engine....
+Tool calls:  [ToolCall(function=Function(name='web_search', arguments={'max_results': 3, 'query': 'Ollama new engine'}))]
+Result:  results=[WebSearchResult(content='# New model scheduling\n\n## September 23, 2025\n\nOllama now includes a significantly improved model scheduling system. Ahead of running a model, Ollama’s new engine
+Thinking:  Okay, the user asked about Ollama's new engine. Let me look at the search results.
+First result is from September 23, 2025, talking about new model scheduling. It mentions improved memory management, reduced crashes, better GPU utilization, and multi-GPU performance. Examples show speed improvements and accurate memory reporting. Supported models include gemma3, llama4, qwen3, etc...
+Content:  Ollama has introduced two key updates to its engine, both released in 2025:
+1. **Enhanced Model Scheduling (September 23, 2025)**
+   - **Precision Memory Management**: Exact memory allocation reduces out-of-memory crashes and optimizes GPU utilization.
+   - **Performance Gains**: Examples show significant speed improvements (e.g., 85.54 tokens/s vs 52.02 tokens/s) and full GPU layer utilization.
+   - **Multi-GPU Support**: Improved efficiency across multiple GPUs, with accurate memory reporting via tools like `nvidia-smi`.
+   - **Supported Models**: Includes `gemma3`, `llama4`, `qwen3`, `mistral-small3.2`, and more.
+2. **Multimodal Engine (May 15, 2025)**
+   - **Vision Support**: First-class support for vision models, including `llama4:scout` (109B parameters), `gemma3`, `qwen2.5vl`, and `mistral-small3.1`.
+   - **Multimodal Tasks**: Examples include identifying animals in multiple images, answering location-based questions from videos, and document scanning.
+These updates highlight Ollama's focus on efficiency, performance, and expanded capabilities for both text and vision tasks.
+```
+### Context length and agents
+Web search results can run to thousands of tokens, so increase the model's context length to at least 32000 tokens. Search agents work best with the full context length. [Ollama's cloud models](https://docs.ollama.com/cloud) run at the full context length.
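The agent example above trims each tool result with `str(result)[:2000 * 4]`. One reading, which is an assumption and not stated in the source, is a budget of about 2000 tokens at roughly 4 characters per token. A minimal sketch of that heuristic, with hypothetical helper names:

```python
CHARS_PER_TOKEN = 4  # rough heuristic (assumption), not a real tokenizer


def truncate_for_context(text: str, max_tokens: int = 2000) -> str:
  """Trim text to roughly max_tokens using a characters-per-token estimate."""
  return text[:max_tokens * CHARS_PER_TOKEN]


def estimate_tokens(text: str) -> int:
  """Estimate the token count of text with the same heuristic."""
  return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


page = 'x' * 50_000  # stand-in for a long web page
trimmed = truncate_for_context(page)
print(len(trimmed), estimate_tokens(trimmed))  # 8000 2000
```

Actual token counts depend on the model's tokenizer, so treat these numbers as a budget estimate only.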
+## MCP Server
+You can enable web search in any MCP client through the [Python MCP server](https://github.com/ollama/ollama-python/blob/main/examples/web-search-mcp.py).
+### Cline
+Ollama's web search can be integrated with Cline easily using the MCP server configuration.
+`Manage MCP Servers` > `Configure MCP Servers` > Add the following configuration:
+```json
+{
+  "mcpServers": {
+    "web_search_and_fetch": {
+      "type": "stdio",
+      "command": "uv",
+      "args": ["run", "path/to/web-search-mcp.py"],
+      "env": { "OLLAMA_API_KEY": "your_api_key_here" }
+    }
+  }
+}
+```
+![Cline MCP Configuration](/images/cline-mcp.png)
+### Codex
+Ollama works well with OpenAI's Codex tool.
+Add the following configuration to `~/.codex/config.toml`:
+```toml
+[mcp_servers.web_search]
+command = "uv"
+args = ["run", "path/to/web-search-mcp.py"]
+env = { "OLLAMA_API_KEY" = "your_api_key_here" }
+```
+![Codex MCP Configuration](/images/codex-mcp.png)
+### Goose
+Ollama can integrate with Goose via its MCP feature.
+![Goose MCP Configuration 1](/images/goose-mcp-1.png)
+![Goose MCP Configuration 2](/images/goose-mcp-2.png)
+### Other integrations
+Ollama can be integrated into most tools through direct use of Ollama's API, the Python and JavaScript libraries, the OpenAI-compatible API, or the MCP server.
--- a/docs/cli.mdx
+++ b/docs/cli.mdx
+---
+title: CLI Reference
+---
+### Run a model
+```
+ollama run gemma3
+```
+#### Multiline input
+For multiline input, you can wrap text with `"""`:
+```
+>>> """Hello,
+... world!
+... """
+I'm a basic program that prints the famous "Hello, world!" message to the console.
+```
+#### Multimodal models
+```
+ollama run gemma3 "What's in this image? /Users/jmorgan/Desktop/smile.png"
+```
+### Download a model
+```
+ollama pull gemma3
+```
+### Remove a model
+```
+ollama rm gemma3
+```
+### List models
+```
+ollama ls
+```
+### Sign in to Ollama
+```
+ollama signin
+```
+### Sign out of Ollama
+```
+ollama signout
+```
+### Create a customized model
+First, create a `Modelfile`:
+```
+FROM gemma3
+SYSTEM """You are a happy cat."""
+```
+Then run `ollama create` with a name for the new model (`happy-cat` here is a placeholder):
+```
+ollama create happy-cat -f Modelfile
+```
+### List running models
+```
+ollama ps
+```
+### Stop a running model
+```
+ollama stop gemma3
+```
+### Start Ollama
+```
+ollama serve
+```
+To view a list of environment variables that can be set, run `ollama serve --help`.
--- a/docs/cloud.mdx
+++ b/docs/cloud.mdx
-# Cloud
+---
+title: Cloud
+sidebarTitle: Cloud
+---
-| Ollama's cloud is currently in preview. For full documentation, see [Ollama's documentation](https://docs.ollama.com/cloud).
+<Info>Ollama's cloud is currently in preview.</Info>
 ## Cloud Models
-[Cloud models](https://ollama.com/cloud) are a new kind of model in Ollama that can run without a powerful GPU. Instead, cloud models are automatically offloaded to Ollama's cloud while offering the same capabilities as local models, making it possible to keep using your local tools while running larger models that wouldn’t fit on a personal computer.
+Ollama's cloud models are a new kind of model in Ollama that can run without a powerful GPU. Instead, cloud models are automatically offloaded to Ollama's cloud service while offering the same capabilities as local models, making it possible to keep using your local tools while running larger models that wouldn't fit on a personal computer.
 Ollama currently supports the following cloud models, with more coming soon:
+- `deepseek-v3.1:671b-cloud`
 - `gpt-oss:20b-cloud`
 - `gpt-oss:120b-cloud`
- `deepseek-v3.1:671b-cloud`
+- `kimi-k2:1t-cloud`
 - `qwen3-coder:480b-cloud`
+- `glm-4.6:cloud`
+### Running Cloud models
+Ollama's cloud models require an account on [ollama.com](https://ollama.com). To sign in or create an account, run:
+```
+ollama signin
+```
-### Get started
+<Tabs>
+  <Tab title="CLI">
 To run a cloud model, open the terminal and run:
@@ -21,20 +35,201 @@ To run a cloud model, open the terminal and run:
 ollama run gpt-oss:120b-cloud
 ```
-To run cloud models with integrations that work with Ollama, first download the cloud model:
+  </Tab>
+  <Tab title="Python">
+First, pull a cloud model so it can be accessed:
 ```
-ollama pull qwen3-coder:480b-cloud
+ollama pull gpt-oss:120b-cloud
 ```
-Then sign in to Ollama:
+Next, install [Ollama's Python library](https://github.com/ollama/ollama-python):
 ```
-ollama signin
+pip install ollama
+```
+Next, create and run a simple Python script:
+```python
+from ollama import Client
+client = Client()
+messages = [
+  {
+    'role': 'user',
+    'content': 'Why is the sky blue?',
+  },
+]
+for part in client.chat('gpt-oss:120b-cloud', messages=messages, stream=True):
+  print(part['message']['content'], end='', flush=True)
+```
+  </Tab>
+  <Tab title="JavaScript">
+First, pull a cloud model so it can be accessed:
+```
+ollama pull gpt-oss:120b-cloud
+```
+Next, install [Ollama's JavaScript library](https://github.com/ollama/ollama-js):
+```
+npm i ollama
+```
+Then use the library to run a cloud model:
+```typescript
+import { Ollama } from "ollama";
+const ollama = new Ollama();
+const response = await ollama.chat({
+  model: "gpt-oss:120b-cloud",
+  messages: [{ role: "user", content: "Explain quantum computing" }],
+  stream: true,
+});
+for await (const part of response) {
+  process.stdout.write(part.message.content);
+}
+```
+  </Tab>
+  <Tab title="cURL">
+First, pull a cloud model so it can be accessed:
+```
+ollama pull gpt-oss:120b-cloud
+```
+Use the following cURL command to run the model via Ollama's API:
+```
+curl http://localhost:11434/api/chat -d '{
+  "model": "gpt-oss:120b-cloud",
+  "messages": [{
+    "role": "user",
+    "content": "Why is the sky blue?"
+  }],
+  "stream": false
+}'
 ```
-Finally, access the model using the model name `qwen3-coder:480b-cloud` via Ollama's local API or tooling.
+  </Tab>
+</Tabs>
 ## Cloud API access
-Cloud models can also be accessed directly on ollama.com's API. For more information, see the [docs](https://docs.ollama.com/cloud).
+Cloud models can also be accessed directly on ollama.com's API. In this mode, ollama.com acts as a remote Ollama host.
+### Authentication
+For direct access to ollama.com's API, first create an [API key](https://ollama.com/settings/keys).
+Then, set the `OLLAMA_API_KEY` environment variable to your API key.
+```
+export OLLAMA_API_KEY=your_api_key
+```
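A client can read the same variable and fail early with a clear message when it is missing. This is a minimal sketch; `auth_headers` is a hypothetical helper, not part of the Ollama libraries.

```python
import os


def auth_headers(env=os.environ) -> dict:
  """Build the Authorization header from OLLAMA_API_KEY, or fail clearly."""
  key = env.get('OLLAMA_API_KEY')
  if not key:
    raise RuntimeError('OLLAMA_API_KEY is not set')
  return {'Authorization': f'Bearer {key}'}


# A plain dict stands in for the environment so the sketch runs anywhere.
print(auth_headers({'OLLAMA_API_KEY': 'your_api_key'}))
```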
+### Listing models
+Models available directly via ollama.com's API can be listed with:
+```
+curl https://ollama.com/api/tags
+```
+### Generating a response
+<Tabs>
+  <Tab title="Python">
+First, install [Ollama's Python library](https://github.com/ollama/ollama-python):
+```
+pip install ollama
+```
+Then make a request:
+```python
+import os
+from ollama import Client
+client = Client(
+    host="https://ollama.com",
+    headers={'Authorization': 'Bearer ' + os.environ['OLLAMA_API_KEY']}
+)
+messages = [
+  {
+    'role': 'user',
+    'content': 'Why is the sky blue?',
+  },
+]
+for part in client.chat('gpt-oss:120b', messages=messages, stream=True):
+  print(part['message']['content'], end='', flush=True)
+```
+  </Tab>
+  <Tab title="JavaScript">
+First, install [Ollama's JavaScript library](https://github.com/ollama/ollama-js):
+```
+npm i ollama
+```
+Next, make a request to the model:
+```typescript
+import { Ollama } from "ollama";
+const ollama = new Ollama({
+  host: "https://ollama.com",
+  headers: {
+    Authorization: "Bearer " + process.env.OLLAMA_API_KEY,
+  },
+});
+const response = await ollama.chat({
+  model: "gpt-oss:120b",
+  messages: [{ role: "user", content: "Explain quantum computing" }],
+  stream: true,
+});
+for await (const part of response) {
+  process.stdout.write(part.message.content);
+}
+```
+  </Tab>
+  <Tab title="cURL">
+Generate a response via Ollama's chat API:
+```
+curl https://ollama.com/api/chat \
+  -H "Authorization: Bearer $OLLAMA_API_KEY" \
+  -d '{
+    "model": "gpt-oss:120b",
+    "messages": [{
+      "role": "user",
+      "content": "Why is the sky blue?"
+    }],
+    "stream": false
+  }'
+```
+  </Tab>
+</Tabs>
--- a/docs/context-length.mdx
+++ b/docs/context-length.mdx
+---
+title: Context length
+---
+Context length is the maximum number of tokens that the model has access to in memory.  
+<Note>
+  The default context length in Ollama is 4096 tokens.
+</Note>
+Tasks that require a large context, such as web search, agents, and coding tools, should use a context length of at least 32000 tokens.
+## Setting context length
+Setting a larger context length will increase the amount of memory required to run a model. Ensure you have enough VRAM available to increase the context length.
+Cloud models are set to their maximum context length by default.
+### App
+Change the slider in the Ollama app under settings to your desired context length.
+![Context length in Ollama app](./images/ollama-settings.png)
+### CLI
+If the context length cannot be changed in the app, set it with the `OLLAMA_CONTEXT_LENGTH` environment variable when serving Ollama:
+```
+OLLAMA_CONTEXT_LENGTH=32000 ollama serve
+```
+### Check allocated context length and model offloading
+For best performance, use the maximum context length for a model, and avoid offloading the model to CPU. Verify the split under `PROCESSOR` using `ollama ps`.
+```
+ollama ps
+```
+```
+NAME             ID              SIZE      PROCESSOR    CONTEXT    UNTIL
+gemma3:latest    a2af6cc3eb7f    6.6 GB    100% GPU     65536      2 minutes from now
+```
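To check the `PROCESSOR` and `CONTEXT` columns from a script, the table can be parsed as text. The sketch below works on the sample output shown above; the `parse_ps` name and the assumption that columns are separated by two or more spaces are illustrative, not a documented output format.

```python
import re

SAMPLE = """\
NAME             ID              SIZE      PROCESSOR    CONTEXT    UNTIL
gemma3:latest    a2af6cc3eb7f    6.6 GB    100% GPU     65536      2 minutes from now
"""


def parse_ps(output: str) -> list[dict]:
  """Parse `ollama ps`-style output into one dict per model."""
  lines = [line for line in output.splitlines() if line.strip()]
  # Assumption: columns are separated by runs of two or more spaces.
  header = re.split(r'\s{2,}', lines[0].strip())
  return [dict(zip(header, re.split(r'\s{2,}', line.strip()))) for line in lines[1:]]


for row in parse_ps(SAMPLE):
  fully_on_gpu = row['PROCESSOR'] == '100% GPU'
  status = 'GPU only' if fully_on_gpu else 'offloaded: ' + row['PROCESSOR']
  print(row['NAME'], row['CONTEXT'], status)
```

In practice the text could come from `subprocess.run(['ollama', 'ps'], capture_output=True, text=True).stdout`.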
--- a/docs/docker.mdx
+++ b/docs/docker.mdx
-# Ollama Docker image
+## CPU only
-### CPU only
 ```shell
 docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
 ```
-### Nvidia GPU
+## Nvidia GPU
 Install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation).
-#### Install with Apt
+### Install with Apt
 1.  Configure the repository
    ```shell
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
        | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
-    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
+    curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
        | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
        | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    sudo apt-get update
@@ -27,37 +27,40 @@ Install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-
    sudo apt-get install -y nvidia-container-toolkit
    ```
-#### Install with Yum or Dnf
+### Install with Yum or Dnf
 1.  Configure the repository
    ```shell
-    curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
+    curl -fsSL https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
        | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    ```
-2. Install the NVIDIA Container Toolkit packages
+2.  Install the NVIDIA Container Toolkit packages
    ```shell
    sudo yum install -y nvidia-container-toolkit
    ```
-#### Configure Docker to use Nvidia driver
+### Configure Docker to use Nvidia driver
 ```shell
 sudo nvidia-ctk runtime configure --runtime=docker
 sudo systemctl restart docker
 ```
-#### Start the container
+### Start the container
 ```shell
 docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
 ```
-> [!NOTE]  
+<Note>
-> If you're running on an NVIDIA JetPack system, Ollama can't automatically discover the correct JetPack version. Pass the environment variable JETSON_JETPACK=5 or JETSON_JETPACK=6 to the container to select version 5 or 6.
+  If you're running on an NVIDIA JetPack system, Ollama can't automatically discover the correct JetPack version.
+  Pass the environment variable `JETSON_JETPACK=5` or `JETSON_JETPACK=6` to the container to select version 5 or 6.
+</Note>
-### AMD GPU
+## AMD GPU
 To run Ollama using Docker with AMD GPUs, use the `rocm` tag and the following command:
@@ -65,7 +68,7 @@ To run Ollama using Docker with AMD GPUs, use the `rocm` tag and the following c
 docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm
 ```
-### Run model locally
+## Run model locally
 Now you can run a model:
@@ -73,6 +76,6 @@ Now you can run a model:
 docker exec -it ollama ollama run llama3.2
 ```
-### Try different models
+## Try different models
 More models can be found on the [Ollama library](https://ollama.com/library).
--- a/docs/docs.json
+++ b/docs/docs.json
+{
+  "$schema": "https://mintlify.com/docs.json",
+  "name": "Ollama",
+  "colors": {
+    "primary": "#000",
+    "light": "#b5b5b5",
+    "dark": "#000"
+  },
+  "favicon": "/images/favicon.png",
+  "logo": {
+    "light": "/images/logo.png",
+    "dark": "/images/logo-dark.png",
+    "href": "https://ollama.com"
+  },
+  "theme": "maple",
+  "background": {
+    "color": {
+      "light": "#ffffff",
+      "dark": "#000000"
+    }
+  },
+  "fonts": {
+    "family": "system-ui",
+    "heading": {
+      "family": "system-ui"
+    },
+    "body": {
+      "family": "system-ui"
+    }
+  },
+  "styling": {
+    "codeblocks": "system"
+  },
+  "contextual": {
+    "options": ["copy"]
+  },
+  "navbar": {
+    "links": [
+      {
+        "label": "Sign in",
+        "href": "https://ollama.com/signin"
+      }
+    ],
+    "primary": {
+      "type": "button",
+      "label": "Download",
+      "href": "https://ollama.com/download"
+    }
+  },
+  "api": {
+    "playground": {
+      "display": "simple"
+    },
+    "examples": {
+      "languages": ["curl"]
+    }
+  },
+  "redirects": [
+    {
+      "source": "/openai",
+      "destination": "/api/openai"
+    }
+  ],
+  "navigation": {
+    "tabs": [
+      {
+        "tab": "Documentation",
+        "groups": [
+          {
+            "group": "Get started",
+            "pages": [
+              "index",
+              "quickstart",
+              "/cloud"
+            ]
+          },
+          {
+            "group": "Capabilities",
+            "pages": [
+              "/capabilities/streaming",
+              "/capabilities/thinking",
+              "/capabilities/structured-outputs",
+              "/capabilities/vision",
+              "/capabilities/embeddings",
+              "/capabilities/tool-calling",
+              "/capabilities/web-search"
+            ]
+          },
+          {
+            "group": "Integrations",
+            "pages": [
+              "/integrations/vscode",
+              "/integrations/jetbrains",
+              "/integrations/codex",
+              "/integrations/cline",
+              "/integrations/droid",
+              "/integrations/goose",
+              "/integrations/zed",
+              "/integrations/roo-code",
+              "/integrations/n8n",
+              "/integrations/xcode"
+            ]
+          },
+          {
+            "group": "More information",
+            "pages": [
+              "/cli",
+              "/modelfile",
+              "/context-length",
+              "/linux",
+              "/docker",
+              "/faq",
+              "/gpu",
+              "/troubleshooting"
+            ]
+          }
+        ]
+      },
+      {
+        "tab": "API Reference",
+        "openapi": "/openapi.yaml",
+        "groups": [
+          {
+            "group": "API Reference",
+            "pages": [
+              "/api/index",
+              "/api/authentication",
+              "/api/streaming",
+              "/api/usage",
+              "/api/errors",
+              "/api/openai-compatibility"
+            ]
+          },
+          {
+            "group": "Endpoints",
+            "pages": [
+              "POST /api/generate",
+              "POST /api/chat",
+              "POST /api/embed",
+              "GET /api/tags",
+              "GET /api/ps",
+              "POST /api/show",
+              "POST /api/create",
+              "POST /api/copy",
+              "POST /api/pull",
+              "POST /api/push",
+              "DELETE /api/delete",
+              "GET /api/version"
+            ]
+          }
+        ]
+      }
+    ]
+  }
+}
--- a/docs/faq.mdx
+++ b/docs/faq.mdx
-# FAQ
+---
+title: FAQ
+---
 ## How can I upgrade Ollama?
@@ -20,9 +22,9 @@ Please refer to the [GPU docs](./gpu.md).
 ## How can I specify the context window size?
-By default, Ollama uses a context window size of 4096 tokens for most models. The `gpt-oss` model has a default context window size of 8192 tokens.
+By default, Ollama uses a context window size of 2048 tokens.
-This can be overridden in Settings in the Windows and macOS App, or with the `OLLAMA_CONTEXT_LENGTH` environment variable. For example, to set the default context window to 8K, use:
+This can be overridden with the `OLLAMA_CONTEXT_LENGTH` environment variable. For example, to set the default context window to 8K, use:
 ```shell
 OLLAMA_CONTEXT_LENGTH=8192 ollama serve
@@ -46,8 +48,6 @@ curl http://localhost:11434/api/generate -d '{
 }'
 ```
-Setting the context length higher may cause the model to not be able to fit onto the GPU, which makes the model run more slowly.
 ## How can I tell if my model was loaded onto the GPU?
 Use the `ollama ps` command to see what models are currently loaded into memory.
@@ -56,17 +56,16 @@ Use the `ollama ps` command to see what models are currently loaded into memory.
 ollama ps
 ```
-> **Output**:
->
-> ```
-> NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL
-> gpt-oss:20b    05afbac4bad6    16 GB    100% GPU     8192       4 minutes from now
-> ```
+<Info>
+  **Output**:
+
+  ```
+  NAME          ID              SIZE     PROCESSOR    UNTIL
+  llama3:70b    bcfb190ca3a7    42 GB    100% GPU     4 minutes from now
+  ```
+</Info>
 The `Processor` column will show which memory the model was loaded into:
-* `100% GPU` means the model was loaded entirely into the GPU
-* `100% CPU` means the model was loaded entirely in system memory
+- `100% GPU` means the model was loaded entirely into the GPU
-* `48%/52% CPU/GPU` means the model was loaded partially onto both the GPU and into system memory
+- `100% CPU` means the model was loaded entirely in system memory
+- `48%/52% CPU/GPU` means the model was loaded partially onto both the GPU and into system memory
 ## How do I configure Ollama server?
@@ -78,9 +77,9 @@ If Ollama is run as a macOS application, environment variables should be set usi
 1. For each environment variable, call `launchctl setenv`.
-    ```bash
+   ```bash
-    launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
+   launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
-    ```
+   ```
 2. Restart Ollama application.
@@ -92,10 +91,10 @@ If Ollama is run as a systemd service, environment variables should be set using
 2. For each environment variable, add a line `Environment` under section `[Service]`:
-    ```ini
+   ```ini
-    [Service]
+   [Service]
-    Environment="OLLAMA_HOST=0.0.0.0:11434"
+   Environment="OLLAMA_HOST=0.0.0.0:11434"
-    ```
+   ```
 3. Save and exit.
@@ -126,8 +125,10 @@ On Windows, Ollama inherits your user and system environment variables.
 Ollama pulls models from the Internet and may require a proxy server to access the models. Use `HTTPS_PROXY` to redirect outbound requests through the proxy. Ensure the proxy certificate is installed as a system certificate. Refer to the section above for how to use environment variables on your platform.
-> [!NOTE]
+<Note>
-> Avoid setting `HTTP_PROXY`. Ollama does not use HTTP for model pulls, only HTTPS. Setting `HTTP_PROXY` may interrupt client connections to the server.
+  Avoid setting `HTTP_PROXY`. Ollama does not use HTTP for model pulls, only
+  HTTPS. Setting `HTTP_PROXY` may interrupt client connections to the server.
+</Note>
 ### How do I use Ollama behind a proxy in Docker?
@@ -150,11 +151,9 @@ docker build -t ollama-with-ca .
 docker run -d -e HTTPS_PROXY=https://my.proxy.example.com -p 11434:11434 ollama-with-ca
 ```
-## Does Ollama send my prompts and responses back to ollama.com?
+## Does Ollama send my prompts and answers back to ollama.com?
-If you're running a model locally, your prompts and responses will always stay on your machine. Ollama Turbo in the App allows you to run your queries on Ollama's servers if you don't have a powerful enough GPU. Web search lets a model query the web, giving you more accurate and up-to-date information. Both Turbo and web search require sending your prompts and responses to Ollama.com. This data is neither logged nor stored.
+No. Ollama runs locally, and conversation data does not leave your machine.
-If you don't want to see the Turbo and web search options in the app, you can disable them in Settings by turning on Airplane mode. In Airplane mode, all models will run locally, and your prompts and responses will stay on your machine.
 ## How can I expose Ollama on my network?
@@ -216,7 +215,9 @@ Refer to the section [above](#how-do-i-configure-ollama-server) for how to set e
 If a different directory needs to be used, set the environment variable `OLLAMA_MODELS` to the chosen directory.
-> Note: on Linux using the standard installer, the `ollama` user needs read and write access to the specified directory. To assign the directory to the `ollama` user run `sudo chown -R ollama:ollama <directory>`.
+<Note>
+  On Linux using the standard installer, the `ollama` user needs read and write access to the specified directory. To assign the directory to the `ollama` user run `sudo chown -R ollama:ollama <directory>`.
+</Note>
 Refer to the section [above](#how-do-i-configure-ollama-server) for how to set environment variables on your platform.
@@ -235,7 +236,7 @@ GPU acceleration is not available for Docker Desktop in macOS due to the lack of
 This can impact both installing Ollama, as well as downloading models.
 Open `Control Panel > Networking and Internet > View network status and tasks` and click on `Change adapter settings` on the left panel. Find the `vEthernet (WSL)` adapter, right click and select `Properties`.
-Click on `Configure` and open the `Advanced` tab. Search through each of the properties until you find `Large Send Offload Version 2 (IPv4)` and `Large Send Offload Version 2 (IPv6)`. *Disable* both of these
+Click on `Configure` and open the `Advanced` tab. Search through each of the properties until you find `Large Send Offload Version 2 (IPv4)` and `Large Send Offload Version 2 (IPv6)`. _Disable_ both of these
 properties.
 ## How can I preload a model into Ollama to get faster response times?
@@ -269,10 +270,11 @@ ollama stop llama3.2
 ```
 If you're using the API, use the `keep_alive` parameter with the `/api/generate` and `/api/chat` endpoints to set the amount of time that a model stays in memory. The `keep_alive` parameter can be set to:
-* a duration string (such as "10m" or "24h")
-* a number in seconds (such as 3600)
+- a duration string (such as "10m" or "24h")
-* any negative number which will keep the model loaded in memory (e.g. -1 or "-1m")
+- a number in seconds (such as 3600)
-* '0' which will unload the model immediately after generating a response
+- any negative number which will keep the model loaded in memory (e.g. -1 or "-1m")
+- '0' which will unload the model immediately after generating a response
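The accepted `keep_alive` forms listed above can be sketched as a small Python helper. This is an illustration only: `describe_keep_alive` is a hypothetical name, not part of Ollama, and the string handling is deliberately simplified (it only checks for a leading minus sign or a literal `"0"`).

```python
def describe_keep_alive(value):
    """Classify a keep_alive value according to the rules listed above.

    Hypothetical helper for illustration; it is not part of Ollama.
    """
    if isinstance(value, str):
        text = value.strip()
        # Negative duration strings such as "-1m" keep the model loaded.
        if text.startswith("-"):
            return "keep loaded"
        if text == "0":
            return "unload immediately"
        # Otherwise treat it as a duration string such as "10m" or "24h".
        return "keep for duration " + text
    # Numeric values are interpreted as seconds.
    if value < 0:
        return "keep loaded"
    if value == 0:
        return "unload immediately"
    return "keep for " + str(value) + " seconds"
```

For instance, `describe_keep_alive("-1m")` and `describe_keep_alive(-1)` both fall into the "keep loaded" case.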
 For example, to preload a model and leave it in memory use:
@@ -292,31 +294,31 @@ The `keep_alive` API parameter with the `/api/generate` and `/api/chat` API endp
 ## How do I manage the maximum number of requests the Ollama server can queue?
-If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded.  You can adjust how many requests may be queued by setting `OLLAMA_MAX_QUEUE`.
+If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded. You can adjust how many requests may be queued by setting `OLLAMA_MAX_QUEUE`.
 ## How does Ollama handle concurrent requests?
-Ollama supports two levels of concurrent processing.  If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time.  For a given model, if there is sufficient available memory when the model is loaded, it can be configured to allow parallel request processing.
+Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing.
-If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded.  As prior models become idle, one or more will be unloaded to make room for the new model.  Queued requests will be processed in order.  When using GPU inference new models must be able to completely fit in VRAM to allow concurrent model loads.
+If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference new models must be able to completely fit in VRAM to allow concurrent model loads.
-Parallel request processing for a given model results in increasing the context size by the number of parallel requests.  For example, a 2K context with 4 parallel requests will result in an 8K context and additional memory allocation.
+Parallel request processing for a given model results in increasing the context size by the number of parallel requests. For example, a 2K context with 4 parallel requests will result in an 8K context and additional memory allocation.
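The arithmetic in the example above can be written out as a short Python sketch. The function name `effective_context` is hypothetical and only illustrates the multiplication described in the text; it does not model how Ollama allocates memory internally.

```python
def effective_context(context_per_request: int, num_parallel: int) -> int:
    """Total context allocated when each parallel request gets its own window."""
    return context_per_request * num_parallel


# A 2K context with 4 parallel requests yields an 8K context.
print(effective_context(2048, 4))  # 8192
```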
 The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms:
- `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory.  The default is 3 * the number of GPUs or 3 for CPU inference.
+- `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 \* the number of GPUs or 3 for CPU inference.
- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time.  The default is 1, and will handle 1 request per model at a time.
+- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory.
 - `OLLAMA_MAX_QUEUE` - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512.
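As a rough illustration of the `OLLAMA_MAX_LOADED_MODELS` default described above, the rule "3 times the number of GPUs, or 3 for CPU inference" could be expressed as follows. The function name is hypothetical, and treating zero GPUs as CPU inference is an assumption made for this sketch.

```python
def default_max_loaded_models(num_gpus: int) -> int:
    """Default model limit: 3 per GPU, or 3 when no GPU is used (assumed CPU inference)."""
    if num_gpus > 0:
        return 3 * num_gpus
    return 3
```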
-Note: Windows with Radeon GPUs currently default to 1 model maximum due to limitations in ROCm v5.7 for available VRAM reporting.  Once ROCm v6.2 is available, Windows Radeon will follow the defaults above.  You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPU's VRAM.
+Note: Windows with Radeon GPUs currently default to 1 model maximum due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6.2 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPU's VRAM.
 ## How does Ollama load models on multiple GPUs?
-When loading a new model, Ollama evaluates the required VRAM for the model against what is currently available.  If the model will entirely fit on any single GPU, Ollama will load the model on that GPU.  This typically provides the best performance as it reduces the amount of data transferring across the PCI bus during inference.  If the model does not fit entirely on one GPU, then it will be spread across all the available GPUs.
+When loading a new model, Ollama evaluates the required VRAM for the model against what is currently available. If the model will entirely fit on any single GPU, Ollama will load the model on that GPU. This typically provides the best performance as it reduces the amount of data transferring across the PCI bus during inference. If the model does not fit entirely on one GPU, then it will be spread across all the available GPUs.
 ## How can I enable Flash Attention?
-Flash Attention is a feature of most modern models that can significantly reduce memory usage as the context size grows.  To enable Flash Attention, set the `OLLAMA_FLASH_ATTENTION` environment variable to `1` when starting the Ollama server.
+Flash Attention is a feature of most modern models that can significantly reduce memory usage as the context size grows. To enable Flash Attention, set the `OLLAMA_FLASH_ATTENTION` environment variable to `1` when starting the Ollama server.
 ## How can I set the quantization type for the K/V cache?
@@ -324,9 +326,12 @@ The K/V context cache can be quantized to significantly reduce memory usage when
 To use quantized K/V cache with Ollama you can set the following environment variable:
- `OLLAMA_KV_CACHE_TYPE` - The quantization type for the K/V cache.  Default is `f16`.
+- `OLLAMA_KV_CACHE_TYPE` - The quantization type for the K/V cache. Default is `f16`.
-> Note: Currently this is a global option - meaning all models will run with the specified quantization type.
+<Note>
+  Currently this is a global option - meaning all models will run with the
+  specified quantization type.
+</Note>
 The currently available K/V cache quantization types are:
@@ -334,19 +339,40 @@ The currently available K/V cache quantization types are:
 - `q8_0` - 8-bit quantization, uses approximately 1/2 the memory of `f16` with a very small loss in precision; this usually has no noticeable impact on the model's quality (recommended if not using f16).
 - `q4_0` - 4-bit quantization, uses approximately 1/4 the memory of `f16` with a small-medium loss in precision that may be more noticeable at higher context sizes.
-How much the cache quantization impacts the model's response quality will depend on the model and the task.  Models that have a high GQA count (e.g. Qwen2) may see a larger impact on precision from quantization than models with a low GQA count.
+How much the cache quantization impacts the model's response quality will depend on the model and the task. Models that have a high GQA count (e.g. Qwen2) may see a larger impact on precision from quantization than models with a low GQA count.
 You may need to experiment with different quantization types to find the best balance between memory usage and quality.
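To get a feel for the memory trade-off, the approximate ratios quoted above (1/2 for `q8_0`, 1/4 for `q4_0`, relative to `f16`) can be applied to a known `f16` cache size. This is a back-of-the-envelope sketch: the names below are hypothetical, and real usage will differ from these rounded fractions.

```python
# Approximate K/V cache size relative to f16, taken from the list above.
KV_CACHE_RELATIVE_SIZE = {"f16": 1.0, "q8_0": 0.5, "q4_0": 0.25}


def estimate_kv_cache_bytes(f16_bytes: int, cache_type: str) -> int:
    """Estimate the cache size for a quantization type from its f16 size."""
    return int(f16_bytes * KV_CACHE_RELATIVE_SIZE[cache_type])


# A cache that needs 4 GB at f16 needs roughly 2 GB at q8_0.
print(estimate_kv_cache_bytes(4_000_000_000, "q8_0"))  # 2000000000
```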
-## How can I stop Ollama from starting when I login to my computer
+## Where can I find my Ollama Public Key?
+Your **Ollama Public Key** is the public part of the key pair that lets your local Ollama instance talk to [ollama.com](https://ollama.com).
+You'll need it to:
+* Push models to Ollama
+* Pull private models from Ollama to your machine
+* Run models hosted in [Ollama Cloud](https://ollama.com/cloud)
+### How to Add the Key
+* **Sign-in via the Settings page** in the **Mac** and **Windows App**
+* **Sign-in via CLI**
+```shell
+ollama signin
+```
-Ollama for Windows and macOS register as a login item during installation.  You can disable this if you prefer not to have Ollama automatically start.  Ollama will respect this setting across upgrades, unless you uninstall the application.
+* **Manually copy & paste** the key on the **Ollama Keys** page:
+[https://ollama.com/settings/keys](https://ollama.com/settings/keys)
-**Windows**
+### Where the Ollama Public Key lives
- Remove `%APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup\Ollama.lnk`
-**MacOS Monterey (v12)**
+| OS | Path to `id_ed25519.pub` |
- Open `Settings` -> `Users & Groups` -> `Login Items` and find the `Ollama` entry, then click the `-` (minus) to remove
+| :- | :- |
+| macOS | `~/.ollama/id_ed25519.pub` |
+| Linux | `/usr/share/ollama/.ollama/id_ed25519.pub` |
+| Windows | `C:\Users\<username>\.ollama\id_ed25519.pub` |
-**MacOS Ventura (v13) and later**
+<Note>
- Open `Settings` and search for "Login Items", find the `Ollama` entry under "Allow in the Background", then click the slider to disable.
+  Replace &lt;username&gt; with your actual Windows user name.
+</Note>
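The table above can also be expressed as a small lookup, which may be handy in scripts that need to locate the key. This is only a sketch: `public_key_path` is a hypothetical helper, the OS labels are the ones used in the table, and the paths are copied from it without checking that the file exists.

```python
def public_key_path(os_name: str, username: str = "<username>") -> str:
    """Return the documented location of id_ed25519.pub for an OS label."""
    paths = {
        "macOS": "~/.ollama/id_ed25519.pub",
        "Linux": "/usr/share/ollama/.ollama/id_ed25519.pub",
        # The Windows path depends on the user name, as noted above.
        "Windows": "C:\\Users\\" + username + "\\.ollama\\id_ed25519.pub",
    }
    return paths[os_name]
```

Calling `public_key_path("Windows", "alice")` substitutes the user name into the Windows path; an unknown OS label raises `KeyError`.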