- 06 Jan, 2026 1 commit
  - Parth Sareen authored

- 18 Dec, 2025 1 commit
  - Jeffrey Morgan authored

- 10 Dec, 2025 2 commits
  - Eloi Torrents authored
  - Julia Scheaffer authored

- 04 Dec, 2025 1 commit
  - Eloi Torrents authored
    cmd/bench: support writing benchmark output to file

    This changes Ollama to allow the bench command to write benchmark results to a user-specified output file instead of stdout when the `--output` flag is provided.

    Co-authored-by: Patrick Devine <patrick@infrahq.com>
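The `--output` flag comes from the commit above; the rest of this sketch is illustrative, not Ollama's actual bench code, and simply shows one way to switch the result writer between stdout and a file:

```go
// Illustrative only: switch benchmark output between stdout and a file
// based on an --output flag. Not Ollama's actual bench implementation.
package main

import (
	"flag"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	output := flag.String("output", "", "write benchmark results to this file instead of stdout")
	flag.Parse()

	var w io.Writer = os.Stdout
	if *output != "" {
		f, err := os.Create(*output)
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		w = f
	}

	// Benchmark results would be written through w (column names hypothetical).
	fmt.Fprintln(w, "model,prefill_ms,eval_ms,load_ms,total_ms")
}
```

Usage would then look roughly like `ollama bench MODEL --output results.csv`; the exact arguments may differ from the shipped command.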

- 02 Dec, 2025 1 commit
  - Patrick Devine authored
    This change:
    * fixes rope scaling in the mistral converter
    * updates ministral to include llama4 scaling
    * includes a new ministral parser for parsing reasoning and tool calling

    Co-authored-by: jmorganca <jmorganca@gmail.com>

- 16 Nov, 2025 1 commit
  - Patrick Devine authored
    This change adds a basic benchmarking test framework for Ollama which can be used to determine the prefill, eval, load duration, and total duration for running a given model or models.
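As a rough illustration of the kind of per-run metrics such a framework reports (prefill, eval, load, and total durations), here is a hypothetical result type; the real framework's types and field names may differ:

```go
// Hypothetical shape of a single benchmark result; the real framework's
// types and field names may differ.
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type BenchResult struct {
	Model           string        `json:"model"`
	PrefillDuration time.Duration `json:"prefill_duration"`
	EvalDuration    time.Duration `json:"eval_duration"`
	LoadDuration    time.Duration `json:"load_duration"`
	TotalDuration   time.Duration `json:"total_duration"`
}

func main() {
	r := BenchResult{
		Model:           "llama3", // example model name
		PrefillDuration: 350 * time.Millisecond,
		EvalDuration:    4200 * time.Millisecond,
		LoadDuration:    1200 * time.Millisecond,
		TotalDuration:   6 * time.Second,
	}
	b, _ := json.Marshal(r)
	fmt.Println(string(b))
}
```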

- 05 Nov, 2025 2 commits
  - nicole pardal authored
    This PR introduces a new `ollama embed` command that allows users to generate embeddings directly from the command line (sketched below):
    * Added `ollama embed MODEL [TEXT...]` command for generating text embeddings
    * Supports both direct text arguments and stdin piping for scripted workflows
    * Outputs embeddings as JSON arrays (one per line)

    Co-authored-by: A-Akhil <akhilrahul70@gmail.com>
  - Patrick Devine authored
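A self-contained sketch of the argument/stdin handling and one-JSON-array-per-line output described in the embed commit above; the embedding call itself is stubbed out, and none of this is the actual `ollama embed` implementation:

```go
// Sketch only: read texts from arguments or stdin and print one JSON array
// per line. The embed function is a stub standing in for the embeddings API
// call; this is not the actual `ollama embed` code.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

func embed(text string) []float32 {
	// Stub: a real implementation would call the embeddings API.
	return []float32{float32(len(text)), 0, 1}
}

func main() {
	var texts []string
	if len(os.Args) > 1 {
		texts = os.Args[1:] // direct text arguments
	} else {
		sc := bufio.NewScanner(os.Stdin) // stdin piping for scripted workflows
		for sc.Scan() {
			texts = append(texts, sc.Text())
		}
	}
	for _, t := range texts {
		line, _ := json.Marshal(embed(t))
		fmt.Println(string(line)) // one embedding per line
	}
}
```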

- 30 Oct, 2025 1 commit
  - Michael Yang authored
    This change fixes two bugs with `ollama rm`:
    1. Before a model is removed, it should first be stopped. Previously this only happened for the first argument and was skipped for all other models.
    2. Models were unloaded indiscriminately. This errors for cloud models, so the unload should be omitted for them.
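A hedged sketch of the corrected removal loop implied by the fix: every model argument gets stopped before removal, and the unload is skipped for cloud models. All helper functions are hypothetical stand-ins:

```go
// Hedged sketch of the corrected behavior described above: stop (unload)
// each model argument before removing it, and skip the unload step for
// cloud models, which cannot be unloaded locally. Helpers are hypothetical.
package main

import "fmt"

func isCloudModel(name string) bool { return false }                              // hypothetical
func stopModel(name string) error   { fmt.Println("stopping", name); return nil } // hypothetical
func removeModel(name string) error { fmt.Println("removing", name); return nil } // hypothetical

func removeAll(names []string) error {
	for _, name := range names {
		// Previously only the first argument was stopped; do it for every model.
		if !isCloudModel(name) {
			if err := stopModel(name); err != nil {
				return err
			}
		}
		if err := removeModel(name); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := removeAll([]string{"llama3", "mistral"}); err != nil {
		fmt.Println("error:", err)
	}
}
```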

- 26 Sep, 2025 1 commit
  - Patrick Devine authored
    There are two bugs when using `/load <model>` for a model that doesn't exist:
    1. it will not restore the current model settings if the current model is a thinking model; and
    2. it will crash if the current model is a non-thinking model.
    This fix saves the current runOptions and then restores them if the model load doesn't happen. It also fixes the crash for non-thinking models.
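The save-and-restore pattern the fix describes might look roughly like this; the types and helpers are hypothetical, not the CLI's actual code:

```go
// Hedged illustration of the save-and-restore pattern: snapshot the current
// run options and put them back if loading the requested model fails.
package main

import (
	"errors"
	"fmt"
)

type runOptions struct {
	Model    string
	Thinking bool
}

func loadModel(name string) (runOptions, error) {
	// Stub: pretend the requested model doesn't exist.
	return runOptions{}, errors.New("model not found")
}

func handleLoad(current *runOptions, name string) {
	saved := *current // snapshot before attempting the load
	opts, err := loadModel(name)
	if err != nil {
		*current = saved // restore previous settings instead of crashing
		fmt.Println("load failed, settings restored:", err)
		return
	}
	*current = opts
}

func main() {
	cur := runOptions{Model: "qwen3", Thinking: true}
	handleLoad(&cur, "does-not-exist")
	fmt.Printf("%+v\n", cur)
}
```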

- 25 Sep, 2025 1 commit
  - Patrick Devine authored

- 23 Sep, 2025 1 commit
  - Patrick Devine authored
    auth: fix problems with the ollama keypairs

    This change adds several fixes including:
    - reading in the pubkey files correctly
    - fixing the push unit test to create a keypair file in a temp directory
    - not returning 500 errors for normal status errors

- 17 Sep, 2025 1 commit
  - Patrick Devine authored

- 11 Sep, 2025 1 commit
  - fengyuchuanshen authored

- 15 Aug, 2025 1 commit
  - Patrick Devine authored

- 05 Aug, 2025 1 commit
  - Michael Yang authored
    * bf16
    * tests
    * gpt-oss
    * enable gptoss for engine
    * rough estimate
    * convert to mxfp4
    * handle safetensors U8
    * clamp glu/linear
    * update tokenizer
    * MXFP4 support
      This implements the Open Compute Microscaling (MX) FP4 format as a tensor type with backend implementations focusing on mulmat and mulmatid on CPU, CUDA, and Metal.
    * Unit tests for MXFP4 support
      This exercises various operations and shapes on both CPU and GPU (if detected on the system).
    * cuda graph
    * unit test adjustments
    * cuda: optimize memory access
      Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4.
    * mac: fix crash on old macos versions
      cblas_sgemm is only supported on v13.3 and up, however bf16 is only supported on v14+ so we were falling back to ggml-blas and crashing on bf16 tensors. Checking for the function being null seems to be the simplest way to conditionally avoid registering the backend.
    * server: Minimum context length for gptoss
      This model requires a minimum context length of 8192 to function effectively. Users can set higher values through all normal mechanisms but lower values will be silently reset.
    * ggml: Multiply by numParallel for gptoss sliding window
      When computing the graph size estimate, the context size is already multiplied by numParallel so estimates reflect that. However, since sliding window models use a smaller, fixed context size, they need to manually take numParallel into account.
    * gpt-oss integration includes harmony parser and thinking levels, etc.
    * fix sync
    * fix tests
    * fix lint

    Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
    Co-authored-by: Jesse Gross <jesse@ollama.com>
    Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
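The minimum-context-length behavior described for gptoss above (values below 8192 are silently reset) can be illustrated with a small sketch; the constant and function names are illustrative only:

```go
// Illustrative only: enforce a minimum context length by silently raising
// values below the floor. Constant and function names are hypothetical.
package main

import "fmt"

const gptossMinContext = 8192

func effectiveContext(requested int) int {
	if requested < gptossMinContext {
		return gptossMinContext // lower values are silently reset
	}
	return requested
}

func main() {
	fmt.Println(effectiveContext(2048))  // 8192
	fmt.Println(effectiveContext(16384)) // 16384
}
```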

- 24 Jul, 2025 1 commit
  - Patrick Devine authored

- 22 Jul, 2025 1 commit
  - Patrick Devine authored
    Co-authored-by: Richard Lyons <frob@cloudstaff.com>

- 17 Jul, 2025 1 commit
  - frob authored

- 16 Jul, 2025 1 commit
  - Parth Sareen authored

- 08 Jul, 2025 1 commit
  - Daniel Hiltgen authored
    * API: expose context size of loaded models
    * CLI: add context UX
      This adds a column in the ps output to show the model's context size.
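A rough sketch of a ps-style listing with the new context column; the struct and field names here are illustrative rather than Ollama's actual API types:

```go
// Sketch of a ps-style table with a context-size column; the struct and
// field names are illustrative, not Ollama's actual API response types.
package main

import (
	"fmt"
	"os"
	"text/tabwriter"
)

type runningModel struct {
	Name       string
	Size       string
	ContextLen int
}

func main() {
	models := []runningModel{
		{"llama3:8b", "5.4 GB", 4096}, // example values
		{"qwen3:4b", "3.1 GB", 8192},
	}
	w := tabwriter.NewWriter(os.Stdout, 0, 8, 2, ' ', 0)
	fmt.Fprintln(w, "NAME\tSIZE\tCONTEXT")
	for _, m := range models {
		fmt.Fprintf(w, "%s\t%s\t%d\n", m.Name, m.Size, m.ContextLen)
	}
	w.Flush()
}
```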

- 09 Jun, 2025 1 commit
  - Daniel Hiltgen authored
    When a user elects to keep the existing app, the new Ollama is named `Ollama 2.app`. This fixes the app startup flow to handle this naming pattern.

- 08 Jun, 2025 1 commit
  - Daniel Hiltgen authored
    Give the desktop app a hint to start fast.

- 06 Jun, 2025 2 commits
  - Daniel Hiltgen authored
    When starting the app in the background, start it hidden.
  - Daniel Hiltgen authored
    Fix an array out-of-bounds crash.

- 29 May, 2025 1 commit
  - Devon Rifkin authored
    - Both `/api/generate` and `/api/chat` now accept a `"think"` option that allows specifying whether thinking mode should be on or not
    - Templates get passed this new option so, e.g., qwen3's template can put `/think` or `/no_think` in the system prompt depending on the value of the setting
    - Models' thinking support is inferred by inspecting model templates. The prefix and suffix the parser uses to identify thinking support is also automatically inferred from templates
    - Thinking control & parsing is opt-in via the API to prevent breaking existing API consumers. If the `"think"` option is not specified, the behavior is unchanged from previous versions of ollama
    - Add parsing for thinking blocks in both streaming/non-streaming mode in both `/generate` and `/chat`
    - Update the CLI to make use of these changes. Users can pass `--think` or `--think=false` to control thinking, or during an interactive session they can use the commands `/set think` or `/set nothink`
    - A `--hidethinking` option has also been added to the CLI. This makes it easy to use thinking in scripting scenarios like `ollama run qwen3 --think --hidethinking "my question here"` where you just want to see the answer but still want the benefits of thinking models
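Based on the description above, a request that opts into thinking might look like the following; this is a sketch of the documented `"think"` option rather than a copy of Ollama's api package types:

```go
// Sketch of opting into thinking via the "think" option on /api/chat, per
// the description above; the request shape is illustrative.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model": "qwen3",
		"messages": []map[string]string{
			{"role": "user", "content": "Why is the sky blue?"},
		},
		"think":  true, // omit to keep the pre-existing behavior
		"stream": false,
	})
	resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```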

- 21 May, 2025 1 commit
  - Daniel Hiltgen authored
    Give the user a helpful error instead of showing connection refused errors.

- 15 May, 2025 2 commits
  - Daniel Hiltgen authored
  - Bruce MacDonald authored
    When a piece of information has been truncated in the `show` output, display an ellipsis to indicate that more data has not been shown.

- 13 May, 2025 1 commit
  - Jeffrey Morgan authored

- 10 May, 2025 1 commit
  - Bruce MacDonald authored

- 08 May, 2025 1 commit
  - Michael Yang authored

- 06 May, 2025 1 commit
  - Daniel Hiltgen authored
    * Move quantization logic to GGML via new backend
      This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
    * Remove "add model quantizations"
      This is no longer needed now that quantization is implemented in Go+GGML code directly.

- 05 May, 2025 1 commit
  - Michael Yang authored
    * default max term height
    * error on out-of-tree files

- 28 Apr, 2025 1 commit
  - Devon Rifkin authored
    This reverts commit 424f6486.

- 22 Apr, 2025 1 commit
  - Devon Rifkin authored
    * increase default context length to 4096
      We lower the default numParallel from 4 to 2 and use these "savings" to double the default context length from 2048 to 4096. We're memory neutral in cases when we previously would've used numParallel == 4, but we add the following mitigation to handle some cases where we would have previously fallen back to 1x2048 due to low VRAM: we decide between 2048 and 4096 using a runtime check, choosing 2048 if we're on a one-GPU system with total VRAM of <= 4 GB. We purposefully don't check the available VRAM because we don't want the context window size to change unexpectedly based on the available VRAM. We plan on making the default even larger, but this is a relatively low-risk change we can make to quickly double it.
    * fix tests
      Add an explicit context length so they don't get truncated. The code that converts -1 from being a signal for doing a runtime check isn't running as part of these tests.
    * tweak small gpu message
    * clarify context length default
      Also make it actually show up in `ollama serve --help`.
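The runtime check described in the first bullet (choose 2048 instead of 4096 on a single-GPU system with at most 4 GB of total VRAM) can be sketched as below; the sentinel value and names are illustrative, not the server's actual code:

```go
// Sketch of the runtime default described above: with the context length
// left unset (sentinel), choose 2048 on a single-GPU system with <= 4 GiB
// of total VRAM, otherwise 4096. Names and the sentinel are illustrative.
package main

import "fmt"

const (
	autoCtxSentinel = -1
	defaultCtx      = 4096
	lowVRAMCtx      = 2048
	lowVRAMBytes    = uint64(4) << 30 // 4 GiB
)

func resolveContextLength(requested, gpuCount int, totalVRAM uint64) int {
	if requested != autoCtxSentinel {
		return requested // explicit user setting always wins
	}
	if gpuCount == 1 && totalVRAM <= lowVRAMBytes {
		return lowVRAMCtx
	}
	return defaultCtx
}

func main() {
	fmt.Println(resolveContextLength(autoCtxSentinel, 1, uint64(2)<<30)) // 2048
	fmt.Println(resolveContextLength(autoCtxSentinel, 2, uint64(8)<<30)) // 4096
}
```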

- 20 Apr, 2025 1 commit
  - greengrass821 authored
    Co-authored-by: tooth paste <tooth_paste91@Poorneshwars-MacBook-Pro.local>

- 16 Apr, 2025 1 commit
  - Blake Mizerany authored
    This commit adds retry/backoff to the registry client for pull requests. Also, revert progress indication to match the original client's until we can "get it right." Also, make WithTrace wrap existing traces instead of clobbering them. This allows clients to compose traces.
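A generic retry-with-exponential-backoff loop of the kind the commit describes; the registry client's actual policy, error handling, and tracing hooks will differ:

```go
// Generic retry-with-exponential-backoff loop of the kind described above;
// the registry client's actual policy and error handling will differ.
package main

import (
	"errors"
	"fmt"
	"time"
)

func withRetry(attempts int, base time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		time.Sleep(base << i) // back off: base, 2*base, 4*base, ...
	}
	return fmt.Errorf("after %d attempts: %w", attempts, err)
}

func main() {
	err := withRetry(3, 100*time.Millisecond, func() error {
		return errors.New("transient registry error") // simulate a failing pull
	})
	fmt.Println(err)
}
```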

- 14 Apr, 2025 1 commit
  - CYJiang authored