- 01 Oct, 2025 1 commit
Daniel Hiltgen authored
This revamps how we discover GPUs in the system by leveraging the Ollama runner. It should eliminate inconsistencies between our GPU discovery and the runner's capabilities at runtime, particularly in cases where we try to filter out unsupported GPUs: the runner now does that implicitly, based on the actual device list. In some cases free VRAM reporting can be unreliable, which can lead to scheduling mistakes, so this also includes a patch to leverage more reliable VRAM reporting libraries when available. Automatic workarounds have been removed, as only one GPU leveraged them; the workaround for that GPU is now documented, and it will soon fall off the support matrix with the next ROCm bump. Additional cleanup of the scheduler and discovery packages can be done in the future, once we have switched on the new memory management code and removed support for the llama runner.
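A minimal sketch of the idea, with hypothetical names rather than Ollama's actual types: the scheduler no longer filters GPUs against its own support table, it simply schedules onto whatever device list the runner reports.

```go
package main

import "fmt"

// Device is a stand-in for a runner-reported GPU; the real interfaces differ.
type Device struct {
	ID        string
	FreeVRAM  uint64 // bytes, ideally from a dedicated VRAM-reporting library
	TotalVRAM uint64 // bytes
}

// listRunnerDevices stands in for querying the runner process. Because the
// runner enumerates devices with the same backend it uses for inference,
// unsupported GPUs simply never appear in the list, so no separate
// filtering pass is needed.
func listRunnerDevices() []Device {
	return []Device{
		{ID: "GPU-0", FreeVRAM: 20 << 30, TotalVRAM: 24 << 30},
	}
}

func main() {
	for _, d := range listRunnerDevices() {
		fmt.Printf("scheduling on %s: %d MiB free of %d MiB\n",
			d.ID, d.FreeVRAM>>20, d.TotalVRAM>>20)
	}
}
```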
- 22 Sep, 2025 2 commits
- 15 Sep, 2025 1 commit
Daniel Hiltgen authored
- 11 Sep, 2025 1 commit
Michael Yang authored
* feat: add field to truncate embeddings
* add openai embeddings for dimensions (example below)
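For illustration, a request to the OpenAI-compatible embeddings endpoint could pass `dimensions` to truncate the returned vectors. This is a sketch; the model name and local server URL are assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Ask for embeddings truncated to 256 dimensions (assumes a local
	// Ollama server with an embedding model already pulled).
	body, _ := json.Marshal(map[string]any{
		"model":      "all-minilm",
		"input":      "why is the sky blue?",
		"dimensions": 256,
	})
	resp, err := http.Post("http://localhost:11434/v1/embeddings",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Data []struct {
			Embedding []float64 `json:"embedding"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	if len(out.Data) > 0 {
		// With truncation applied, this should print 256.
		fmt.Println("embedding length:", len(out.Data[0].Embedding))
	}
}
```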
- 10 Sep, 2025 1 commit
Daniel Hiltgen authored
* Add support for upcoming NVIDIA Jetsons

  The latest Jetsons with JetPack 7 are moving to an SBSA compatible model and will not require building a JetPack specific variant.

* cuda: bring back dual versions

  This adds back dual CUDA versions for our releases, with v11 and v13 to cover a broad set of GPUs and driver versions.

* win: break up native builds in build_windows.ps1
* v11 build working on windows and linux
* switch to cuda v12.8 not JIT
* Set CUDA compression to size
* enhance manual install linux docs
- 08 Sep, 2025 1 commit
Daniel Hiltgen authored
This debug setting can help troubleshoot obscure initialization failures.
- 15 Aug, 2025 1 commit
Thomas Pelster authored
- 14 Aug, 2025 1 commit
Daniel Hiltgen authored
Some users expect the ROCm bundles to be self-sufficient, but they are designed to be additive.
- 06 Aug, 2025 3 commits
Patrick Devine authored
Gao feng authored
Update api.md to make it consistent with the code. See https://github.com/ollama/ollama/blob/main/server/download.go#L447
Parth Sareen authored
- 05 Aug, 2025 1 commit
Jeffrey Morgan authored
- 28 Jul, 2025 1 commit
Yoshi authored
- 22 Jul, 2025 1 commit
ycomiti authored
- 17 Jul, 2025 1 commit
frob authored
- 16 Jul, 2025 1 commit
Marcelo Fornet authored
- 11 Jul, 2025 1 commit
- 08 Jul, 2025 2 commits
Daniel Hiltgen authored
Also removes stale model dir instructions for Windows.
Daniel Hiltgen authored
The current scheduler algorithm picks parallelism based on available VRAM, which complicates the upcoming dynamic layer memory allocation algorithm. This changes the default to 1, with the intent that, going forward, parallelism is explicit and no longer dynamically determined. Removal of the dynamic logic will come in a follow-up.
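As a sketch of the new behavior, assuming the documented OLLAMA_NUM_PARALLEL environment variable remains the explicit control (the function names here are hypothetical): the scheduler takes a fixed value rather than deriving one from free VRAM.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// numParallel returns the explicit parallelism setting, defaulting to 1
// rather than sizing it dynamically from available VRAM.
func numParallel() int {
	if v := os.Getenv("OLLAMA_NUM_PARALLEL"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return 1 // new default; previously inferred from free VRAM
}

func main() {
	fmt.Println("parallel sequences:", numParallel())
}
```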
- 07 Jul, 2025 2 commits
Parth Sareen authored
Parth Sareen authored
- 05 Jul, 2025 1 commit
Daniel Hiltgen authored
- 23 Jun, 2025 1 commit
Daniel Hiltgen authored
* Re-remove cuda v11

  Revert the revert - drop v11 support, requiring drivers newer than Feb 23. This reverts commit c6bcdc42.

* Simplify layout

  With only one version of the GPU libraries, we can simplify things down somewhat. (Jetsons still require special handling.)

* distinct sbsa variant for linux arm64

  This avoids accidentally trying to load the sbsa cuda libraries on a Jetson system, which results in crashes.

* temporarily prevent rocm+cuda mixed loading
- 18 Jun, 2025 1 commit
Jeffrey Morgan authored
Removes an unused test under benchmark/.
- 07 Jun, 2025 2 commits
Krzysztof Jeziorny authored
Jeffrey Morgan authored
This reverts commit 09430011.
- 06 Jun, 2025 1 commit
Hunter Wittenborn authored
- 04 Jun, 2025 1 commit
JasonHonKL authored
- 29 May, 2025 1 commit
Devon Rifkin authored
- Both `/api/generate` and `/api/chat` now accept a `"think"` option that allows specifying whether thinking mode should be on or not (an example request follows this list)
- Templates get passed this new option so, e.g., qwen3's template can put `/think` or `/no_think` in the system prompt depending on the value of the setting
- Models' thinking support is inferred by inspecting model templates. The prefix and suffix the parser uses to identify thinking support is also automatically inferred from templates
- Thinking control & parsing is opt-in via the API to prevent breaking existing API consumers. If the `"think"` option is not specified, the behavior is unchanged from previous versions of ollama
- Add parsing for thinking blocks in both streaming/non-streaming mode in both `/generate` and `/chat`
- Update the CLI to make use of these changes. Users can pass `--think` or `--think=false` to control thinking, or during an interactive session they can use the commands `/se...
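A minimal sketch of the new option against `/api/chat`; the local server URL and the qwen3 model are assumptions, and the response shape shown is abridged.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Enable thinking explicitly; omitting "think" preserves the old behavior.
	body, _ := json.Marshal(map[string]any{
		"model": "qwen3",
		"messages": []map[string]string{
			{"role": "user", "content": "How many r's are in strawberry?"},
		},
		"think":  true,
		"stream": false,
	})
	resp, err := http.Post("http://localhost:11434/api/chat",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Message struct {
			Thinking string `json:"thinking"`
			Content  string `json:"content"`
		} `json:"message"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println("thinking:", out.Message.Thinking)
	fmt.Println("answer:  ", out.Message.Content)
}
```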
- 24 May, 2025 1 commit
frob authored
- 13 May, 2025 1 commit
Daniel Hiltgen authored
Bring back v11 until we can better warn users that their driver is too old. This reverts commit fa393554.
- 12 May, 2025 1 commit
Daniel Hiltgen authored
The quantization PR didn't block all unsupported file types, which this PR fixes. It also updates the API docs to reflect the now reduced set of supported types.
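As an illustration of the kind of guard involved (the names and the allow-list here are hypothetical, not Ollama's actual implementation), quantization requests can be validated against an explicit set of supported source tensor types, rejecting everything else:

```go
package main

import "fmt"

// supportedQuantSources is a hypothetical allow-list: only these source
// tensor types may be re-quantized; all other file types are rejected.
var supportedQuantSources = map[string]bool{
	"F16":  true,
	"F32":  true,
	"BF16": true,
}

func checkQuantizable(fileType string) error {
	if !supportedQuantSources[fileType] {
		return fmt.Errorf("quantization from %s is not supported", fileType)
	}
	return nil
}

func main() {
	fmt.Println(checkQuantizable("F16"))    // <nil>
	fmt.Println(checkQuantizable("Q4_K_M")) // error: already quantized
}
```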
- 08 May, 2025 1 commit
Jeffrey Morgan authored
- 07 May, 2025 1 commit
Daniel Hiltgen authored
This reduces the size of our Windows installer payloads by ~256M by dropping support for nvidia drivers older than Feb 2023. Hardware support is unchanged. Linux default bundle sizes are reduced by ~600M to 1G.
- 05 May, 2025 1 commit
Jeffrey Morgan authored
Some options listed in api/types.go are not supported in newer models, or have been deprecated in the past. This is the first in a series of PRs to clean up the API options.
- 29 Apr, 2025 1 commit
Devon Rifkin authored
- 28 Apr, 2025 1 commit
Devon Rifkin authored
This reverts commit 424f6486.
- 22 Apr, 2025 1 commit
Devon Rifkin authored
* increase default context length to 4096

  We lower the default numParallel from 4 to 2 and use these "savings" to double the default context length from 2048 to 4096. We're memory neutral in cases where we previously would've used numParallel == 4, but we add the following mitigation to handle some cases where we would have previously fallen back to 1x2048 due to low VRAM: we decide between 2048 and 4096 using a runtime check, choosing 2048 if we're on a one-GPU system with total VRAM of <= 4 GB (a sketch of this check follows the list). We purposefully don't check the available VRAM because we don't want the context window size to change unexpectedly based on the available VRAM. We plan on making the default even larger, but this is a relatively low-risk change we can make to quickly double it.

* fix tests

  Add an explicit context length so they don't get truncated. The code that converts -1 from being a signal for doing a runtime check isn't running as part of these tests.

* tweak small gpu message

* clarify context length default

  Also make it actually show up in `ollama serve --help`.
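A minimal sketch of that runtime check, with hypothetical names (the actual Ollama code differs): the "decide at runtime" case is resolved from total, not available, VRAM so the default stays stable across runs.

```go
package main

import "fmt"

// GPU is a stand-in for a discovered device.
type GPU struct {
	TotalVRAM uint64 // bytes
}

// defaultContextLength resolves the "decide at runtime" case: a single-GPU
// system with <= 4 GiB of total VRAM keeps the old 2048 default; everything
// else gets 4096. Available VRAM is deliberately not consulted.
func defaultContextLength(gpus []GPU) int {
	if len(gpus) == 1 && gpus[0].TotalVRAM <= 4<<30 {
		return 2048
	}
	return 4096
}

func main() {
	fmt.Println(defaultContextLength([]GPU{{TotalVRAM: 4 << 30}}))  // 2048
	fmt.Println(defaultContextLength([]GPU{{TotalVRAM: 24 << 30}})) // 4096
}
```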
- 15 Apr, 2025 1 commit
Devon Rifkin authored
In #8215, syntax highlighting was added to most of the blocks, but a couple were still being rendered as plaintext.