- 05 Aug, 2025 1 commit
-
-
Michael Yang authored
* bf16 * tests * gpt-oss * enable gptoss for engine * rough estimate * convert to mxfp4 * handle safetensors U8 * clamp glu/linear * update tokenizer * MXFP4 support This implements the Open Compute Microscaling (MX) FP4 format as a tensor type with backend implementations focusing on mulmat and mulmatid on CPU, CUDA, and Metal. * Unit tests for MXFP4 support This exercises various operations and shapes on both CPU and GPU (if detected on the system) * cuda graph * unit test adjustments * cuda: optimize memory access Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4 * mac: fix crash on old macos versions cblas_sgemm is only supported on v13.3 and up, however bf16 is only supported on v14+ so we were falling back to ggml-blas and crashing on bf16 tensors. Checking for the function being null seems to be the simplest way to condittionally avoid registering the backend. * server: Minimum context length for gptoss This model requires a minimum context length of 8192 to function effectively. Users can set higher values through all normal mechanisms but lower values will be silently reset. * ggml: Multiply by numParallel for gptoss sliding window When computing the graph size estimate, the context size is already multiplied by numParallel so estimates reflect that. However, since sliding window models use a smaller, fixed context size, they need to manually take numParallel into account. * gpt-oss integration includes harmony parser and thinking levels, etc. * fix sync * fix tests * fix lint --------- Co-authored-by:
Daniel Hiltgen <daniel@ollama.com> Co-authored-by:
Jesse Gross <jesse@ollama.com> Co-authored-by:
Devon Rifkin <drifkin@drifkin.net>
-
- 23 Jul, 2025 1 commit
-
-
minxinyi authored
-
- 22 Jul, 2025 1 commit
-
-
Patrick Devine authored
--------- Co-authored-by:Richard Lyons <frob@cloudstaff.com>
-
- 08 Jul, 2025 2 commits
-
-
Daniel Hiltgen authored
The current scheduler algorithm of picking the paralellism based on available VRAM complicates the upcoming dynamic layer memory allocation algorithm. This changes the default to 1, with the intent going forward that parallelism is explicit and will no longer be dynamically determined. Removal of the dynamic logic will come in a follow up.
-
Daniel Hiltgen authored
* API: expose context size of loaded models * CLI: add context UX This adds a column in the ps output to show the models context size.
-
- 27 Jun, 2025 1 commit
-
-
Michael Yang authored
this tensor isn't compatible with cuda when quantized to q4_K so skip it
-
- 20 Jun, 2025 1 commit
-
-
Michael Yang authored
* Reapply "feat: incremental gguf parser (#10822)" (#11114) This reverts commit a6e64fbd. * fix older ggufs
-
- 18 Jun, 2025 2 commits
-
-
Jeffrey Morgan authored
This reverts commit 6b04cad7.
-
曹家巧 authored
-
- 12 Jun, 2025 2 commits
-
-
Jeffrey Morgan authored
-
Michael Yang authored
* incremental gguf parser * gguf: update test to not rely on gguf on disc * re-use existing create gguf * read capabilities from gguf kv * kv exists * update tests * s/doneFunc/successFunc/g * new buffered reader --------- Co-authored-by:Bruce MacDonald <brucewmacdonald@gmail.com>
-
- 07 Jun, 2025 1 commit
-
-
Jeffrey Morgan authored
This reverts commit 09430011.
-
- 06 Jun, 2025 1 commit
-
-
Devon Rifkin authored
move thinking logic into its own package
-
- 05 Jun, 2025 1 commit
-
-
Devon Rifkin authored
-
- 04 Jun, 2025 1 commit
-
-
JasonHonKL authored
-
- 29 May, 2025 1 commit
-
-
Devon Rifkin authored
- Both `/api/generate` and `/api/chat` now accept a `"think"` option that allows specifying whether thinking mode should be on or not - Templates get passed this new option so, e.g., qwen3's template can put `/think` or `/no_think` in the system prompt depending on the value of the setting - Models' thinking support is inferred by inspecting model templates. The prefix and suffix the parser uses to identify thinking support is also automatically inferred from templates - Thinking control & parsing is opt-in via the API to prevent breaking existing API consumers. If the `"think"` option is not specified, the behavior is unchanged from previous versions of ollama - Add parsing for thinking blocks in both streaming/non-streaming mode in both `/generate` and `/chat` - Update the CLI to make use of these changes. Users can pass `--think` or `--think=false` to control thinking, or during an interactive session they can use the commands `/se...
-
- 27 May, 2025 1 commit
-
-
Kyle Steere authored
Signed-off-by:Kyle Steere <kyle.steere@chainguard.dev>
-
- 24 May, 2025 1 commit
-
-
frob authored
-
- 23 May, 2025 1 commit
-
-
Parth Sareen authored
-
- 22 May, 2025 2 commits
-
-
Daniel Hiltgen authored
When the same model is being reloaded rapidly with client connections being canceled before the model finishes loading, the queued unload event could cause a leak of runners by deleting a different runner from the loaded list.
-
Bruce MacDonald authored
Fall back to alternative quantization types when a tensor's dimensions aren't divisible by the block size required for the original desired quantization type. If retried quantization types fail, the system ultimately falls back to F16 (half-precision floating point) which has a block size of 1 and can handle any tensor dimension.
-
- 21 May, 2025 1 commit
-
-
Michael Yang authored
* remove support for multiple ggufs in a single file this was an attempt to make it easier to import multimodal models into ollama. this was rarely used and error prone so remove it * fix: create fused model from blob
-
- 19 May, 2025 2 commits
-
-
Daniel Hiltgen authored
-
Jesse Gross authored
Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This separates them to be two steps: - Create backend, including enumerating tensors and memory allocation - Loading tensor data This allows more flexibility in managing model loading.
-
- 14 May, 2025 2 commits
-
-
Daniel Hiltgen authored
Older clients assumed the digest was at least 19 characters long so increase the size of the dummy digest to avoid array out of bounds crashes.
-
Michael Yang authored
-
- 13 May, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 12 May, 2025 3 commits
-
-
Daniel Hiltgen authored
The quantization PR didn't block all unsupported file types, which this PR fixes. It also updates the API docs to reflect the now reduced set of supported types.
-
Bruce MacDonald authored
When creating a quantized model from safetensors we need the array KV values to be loaded.Changing this value to -1 loads the KV values on the returned layer to be used and saved during quantization.
-
Michael Yang authored
reduce prompt log to trace level
-
- 08 May, 2025 2 commits
-
-
Michael Yang authored
the stream accumulator exits as soon as it sees `api.ProgressResponse(status="success")` which isn't strictly correctly since some requests may have multiple successes, e.g. `/api/create` when the source model needs to be pulled.
-
Michael Yang authored
-
- 07 May, 2025 2 commits
-
-
Daniel Hiltgen authored
If a model is loading, and the request context is canceled during the load by a client closing the connection, and another request is inbound for the same model with a different configuration (context size, etc.) thus requiring a reload, two unload events can be in flight. The first shuts down the original model load, but the second one caused the loss of the new reloading runner reference, thus triggering the leak. The primary fix is detecting the duplicate unload and ignoring the second instance. The load routine is also hardened to ensure we detect clobbering an already present runner and unload it with a warning.
-
Jeffrey Morgan authored
-
- 06 May, 2025 3 commits
-
-
Devon Rifkin authored
Fixes: #5483
-
Michael Yang authored
-
Daniel Hiltgen authored
* Move quantization logic to GGML via new backend This moves the model aware logic to Go code and calls GGMLs quantization code for model creation. * Remove "add model quantizations" This is no longer needed now that quantization is implemented in Go+GGML code directly.
-
- 05 May, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 03 May, 2025 1 commit
-
-
Daniel Hiltgen authored
This enhances our logging in the scheduler. The initial "waiting for server" log no longer claims an initial error state (now "not responding" which better reflects the actual state). Runners now have slog wiring to report more details about the runner, including PID.
-
- 01 May, 2025 1 commit
-
-
frob authored
Co-authored-by:Richard Lyons <frob@cloudstaff.com>
-