- 05 Aug, 2025 1 commit
-
-
Michael Yang authored
* bf16 * tests * gpt-oss * enable gptoss for engine * rough estimate * convert to mxfp4 * handle safetensors U8 * clamp glu/linear * update tokenizer * MXFP4 support This implements the Open Compute Microscaling (MX) FP4 format as a tensor type with backend implementations focusing on mulmat and mulmatid on CPU, CUDA, and Metal. * Unit tests for MXFP4 support This exercises various operations and shapes on both CPU and GPU (if detected on the system) * cuda graph * unit test adjustments * cuda: optimize memory access Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4 * mac: fix crash on old macos versions cblas_sgemm is only supported on v13.3 and up, however bf16 is only supported on v14+ so we were falling back to ggml-blas and crashing on bf16 tensors. Checking for the function being null seems to be the simplest way to condittionally avoid registering the backend. * server: Minimum context length for gptoss This model requires a minimum context length of 8192 to function effectively. Users can set higher values through all normal mechanisms but lower values will be silently reset. * ggml: Multiply by numParallel for gptoss sliding window When computing the graph size estimate, the context size is already multiplied by numParallel so estimates reflect that. However, since sliding window models use a smaller, fixed context size, they need to manually take numParallel into account. * gpt-oss integration includes harmony parser and thinking levels, etc. * fix sync * fix tests * fix lint --------- Co-authored-by:
Daniel Hiltgen <daniel@ollama.com> Co-authored-by:
Jesse Gross <jesse@ollama.com> Co-authored-by:
Devon Rifkin <drifkin@drifkin.net>
-
- 29 May, 2025 1 commit
-
-
Devon Rifkin authored
- Both `/api/generate` and `/api/chat` now accept a `"think"` option that allows specifying whether thinking mode should be on or not - Templates get passed this new option so, e.g., qwen3's template can put `/think` or `/no_think` in the system prompt depending on the value of the setting - Models' thinking support is inferred by inspecting model templates. The prefix and suffix the parser uses to identify thinking support is also automatically inferred from templates - Thinking control & parsing is opt-in via the API to prevent breaking existing API consumers. If the `"think"` option is not specified, the behavior is unchanged from previous versions of ollama - Add parsing for thinking blocks in both streaming/non-streaming mode in both `/generate` and `/chat` - Update the CLI to make use of these changes. Users can pass `--think` or `--think=false` to control thinking, or during an interactive session they can use the commands `/set think` or `/set nothink` - A `--hidethinking` option has also been added to the CLI. This makes it easy to use thinking in scripting scenarios like `ollama run qwen3 --think --hidethinking "my question here"` where you just want to see the answer but still want the benefits of thinking models
-
- 14 May, 2025 1 commit
-
-
Michael Yang authored
-
- 08 May, 2025 1 commit
-
-
Michael Yang authored
-
- 09 Dec, 2024 1 commit
-
-
Jesse Gross authored
New lines can be an important part of a user's prompt and trimming it can alter the results. We previously only trimmed prompts with images but refactoring brought this behavior to all prompts, where it became more noticable. The /generate endpoint adds less whitespace and therefore doesn't need to trim it out - this brings the same behavior to /chat. Thanks to @gabe-l-hart for spotting the issue! Fixes #7795
-
- 17 Nov, 2024 1 commit
-
-
Jeffrey Morgan authored
-
- 30 Oct, 2024 1 commit
-
-
Jesse Gross authored
-Update mllama to take the cross attention state as embeddings in a batch, more similar to how Llava handles it. This improves integration with the input cache. -Pass locations in a prompt for embeddings using tags similar to Llava. -Abstract interface to vision models so the main runner accesses Clip and Mllama similarly Co-authored-by:Michael Yang <mxyng@pm.me>
-
- 18 Oct, 2024 1 commit
-
-
Patrick Devine authored
Co-authored-by:
jmorganca <jmorganca@gmail.com> Co-authored-by:
Michael Yang <mxyng@pm.me> Co-authored-by:
Jesse Gross <jesse@ollama.com>
-
- 02 Aug, 2024 1 commit
-
-
Michael Yang authored
-
- 16 Jul, 2024 1 commit
-
-
Michael Yang authored
-
- 15 Jul, 2024 1 commit
-
-
Michael Yang authored
-
- 13 Jul, 2024 1 commit
-
-
Michael Yang authored
* fix system prompt * execute template when hitting previous roles * fix tests --------- Co-authored-by:jmorganca <jmorganca@gmail.com>
-
- 11 Jul, 2024 1 commit
-
-
Michael Yang authored
-
- 05 Jul, 2024 2 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
- 01 Jul, 2024 1 commit
-
-
Michael Yang authored
-
- 26 Mar, 2024 1 commit
-
-
Patrick Devine authored
-
- 29 Feb, 2024 1 commit
-
-
Michael Yang authored
instead of appending image tags, prepend them - this generally produces better results
-
- 16 Feb, 2024 1 commit
-
-
Bruce MacDonald authored
-
- 12 Feb, 2024 1 commit
-
-
Jeffrey Morgan authored
-