- 15 Dec, 2025 2 commits
-
-
Eva H authored
-
Daniel Hiltgen authored
This reverts commit 56f754f46b87749581f73ef3625314bb0e51bfed.
-
- 13 Dec, 2025 2 commits
-
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
- 12 Dec, 2025 10 commits
-
-
Daniel Hiltgen authored
* flash attn: add auto mode for llama engine. If the user does not specify fa in the environment, use auto mode.
* review comments
* ensure kv cache quantized types have FA explicitly enabled; additional review comments
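A minimal Go sketch of how such an auto-mode decision could look. The environment variable handling, the capability check, and the quantized-KV rule below are illustrative assumptions, not the actual implementation:

```go
package main

import (
	"fmt"
	"os"
)

// flashAttentionEnabled is an illustrative sketch of an auto-mode decision,
// not the real code: an explicit setting wins, quantized KV cache types
// demand an explicit "on", and auto mode otherwise follows a capability check.
func flashAttentionEnabled(kvCacheType string, gpuSupportsFA bool) (bool, error) {
	quantizedKV := kvCacheType != "" && kvCacheType != "f16"
	switch os.Getenv("OLLAMA_FLASH_ATTENTION") {
	case "1", "true":
		return true, nil // explicitly enabled by the user
	case "0", "false":
		if quantizedKV {
			return false, fmt.Errorf("kv cache type %q requires flash attention", kvCacheType)
		}
		return false, nil // explicitly disabled
	default:
		// Auto mode: quantized KV cache types still need FA turned on
		// explicitly; otherwise enable it when the hardware supports it.
		if quantizedKV {
			return false, fmt.Errorf("kv cache type %q requires flash attention to be explicitly enabled", kvCacheType)
		}
		return gpuSupportsFA, nil
	}
}

func main() {
	on, err := flashAttentionEnabled("f16", true)
	fmt.Println(on, err) // true <nil> when the variable is unset (auto mode)
}
```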
-
Jeffrey Morgan authored
-
Daniel Hiltgen authored
This changes the default behavior to use the Ollama engine for supported models, while retaining the ability to disable the Ollama engine and fall back to the Llama engine. Models in the OllamaEngineRequired list will always run on the Ollama engine.
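A minimal Go sketch of that selection logic under assumed names; the Engine constants, the contents of the required list, and the opt-out flag are illustrative, not Ollama's actual types:

```go
package main

import "fmt"

// Engine identifies which runner backend serves a model. The names and the
// selection rules below are an illustrative reading of the commit message.
type Engine int

const (
	LlamaEngine Engine = iota
	OllamaEngine
)

// ollamaEngineRequired stands in for the OllamaEngineRequired list mentioned
// above: these architectures always run on the Ollama engine.
var ollamaEngineRequired = map[string]bool{
	"gemma3": true, // example entry, not the real list
}

// selectEngine defaults to the Ollama engine for supported models, but lets
// the user disable it and fall back to the llama engine, unless the model is
// on the required list.
func selectEngine(arch string, supported, userDisabledOllamaEngine bool) Engine {
	if ollamaEngineRequired[arch] {
		return OllamaEngine
	}
	if !supported || userDisabledOllamaEngine {
		return LlamaEngine
	}
	return OllamaEngine
}

func main() {
	fmt.Println(selectEngine("llama", true, true) == LlamaEngine)   // true: user opted out
	fmt.Println(selectEngine("gemma3", true, true) == OllamaEngine) // true: always Ollama engine
}
```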
-
Eva H authored
-
Eva H authored
-
Devon Rifkin authored
* docs: add docs for v1/responses and rework openai compat section

I reworked the examples to be separated by topic and to be fully runnable (i.e., they now log output instead of just suggesting how a call might be made). We now use `<CodeGroup>`s so that each example has a dropdown on the docs site for users to choose from, which makes the examples a lot more digestible (since you only see approximately a third of the code you used to). I also added a new tool to extract code examples into files so that it's easier to actually run them and check that they work.

## Example

```shell
go run docs/tools/extract-examples/main.go docs/api/openai-compatibility.mdx
```

Output:

```
Extracting code examples to: /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368
  - 01_basic.py
  - 01_basic.js
  - 01_basic.sh
  - 02_responses.py
  - 02_responses.js
  - 02_responses.sh
  - 03_vision.py
  - 03_vision.js
  - 03_vision.sh
Extracted 9 file(s) to /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368

To run examples:
  cd /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368
  npm install  # for JS examples
then run individual files with `node file.js`, `python file.py`, `bash file.sh`
```

In the future we should consider actually running the examples in CI and having some sort of acceptance test so we can automatically detect when our examples break. This is just a start in that direction.

* Update docs/api/openai-compatibility.mdx
  Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
* Update docs/api/openai-compatibility.mdx
  Co-authored-by: Parth Sareen <parth.sareen@ollama.com>

---------

Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
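For reference, a rough Go sketch of what such an extraction pass might look like: walk the MDX file line by line and collect fenced code blocks. This is a simplified illustration, not the actual docs/tools/extract-examples implementation (which also numbers the examples, writes them to files, and understands `<CodeGroup>` blocks):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Simplified illustration of extracting fenced code blocks from an MDX file.
// This sketch only prints each block with its language tag.
func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: extract <file.mdx>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	const fence = "```"
	var (
		inBlock bool
		lang    string
		body    strings.Builder
	)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		switch {
		case !inBlock && strings.HasPrefix(line, fence):
			// Opening fence: remember the language tag and start a new block.
			inBlock = true
			lang = strings.TrimPrefix(line, fence)
			body.Reset()
		case inBlock && strings.HasPrefix(line, fence):
			// Closing fence: emit the collected block.
			inBlock = false
			fmt.Printf("--- %s block ---\n%s", lang, body.String())
		case inBlock:
			body.WriteString(line + "\n")
		}
	}
}
```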
-
Parth Sareen authored
* openai: add tool call appending to previous asst message
* add tests for thinking appending
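A minimal Go sketch of the appending idea: an assistant message that only carries tool calls is folded into the preceding assistant message rather than starting a new turn. The struct shapes and the merge condition are assumptions, not the actual types in the openai compat layer:

```go
package main

import "fmt"

// Illustrative message shape for an OpenAI-style chat transcript.
type ToolCall struct{ Name string }

type Message struct {
	Role      string
	Content   string
	ToolCalls []ToolCall
}

// appendToolCalls folds an assistant message that only carries tool calls
// into the preceding assistant message instead of emitting a new turn.
func appendToolCalls(msgs []Message, next Message) []Message {
	n := len(msgs)
	if next.Role == "assistant" && len(next.ToolCalls) > 0 && next.Content == "" &&
		n > 0 && msgs[n-1].Role == "assistant" {
		msgs[n-1].ToolCalls = append(msgs[n-1].ToolCalls, next.ToolCalls...)
		return msgs
	}
	return append(msgs, next)
}

func main() {
	msgs := []Message{{Role: "assistant", Content: "Let me check."}}
	msgs = appendToolCalls(msgs, Message{Role: "assistant", ToolCalls: []ToolCall{{Name: "get_weather"}}})
	fmt.Println(len(msgs), len(msgs[0].ToolCalls)) // 1 1: merged into the prior turn
}
```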
-
Alexander Gusak authored
-
JJ authored
Correct Markdown syntax for Swollama GitHub and DocC documentation links
-
Jeffrey Morgan authored
-
- 11 Dec, 2025 5 commits
-
-
Devon Rifkin authored
Only supporting the stateless part of the API. Doc updates to come once this is shipped. Closes: #9659
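A minimal example of exercising the new endpoint from Go, assuming a local server and an example model name; the request shape mirrors OpenAI's Responses API:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// Minimal sketch of a stateless call to the OpenAI-compatible /v1/responses
// endpoint on a local Ollama server. The model name is just an example.
func main() {
	body, _ := json.Marshal(map[string]any{
		"model": "llama3.2",
		"input": "Why is the sky blue?",
	})
	resp, err := http.Post("http://localhost:11434/v1/responses", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```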
-
nicole pardal authored
This PR detects embedding models and sets batch_size = context_size so the full input fits in a single batch. Previously, if the batch size was smaller than the input, tokens could be split across batches and cause a SIGTRAP crash. This change ensures all tokens stay in one batch and prevents crashes.

Fixes: #12938 #13054

Co-authored-by: Jesse Gross <jesse@ollama.com>
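A short sketch of the gist of the change, with illustrative names rather than the actual runner fields:

```go
package main

import "fmt"

// batchSizeFor illustrates the idea above: for embedding models, size the
// batch to the context window so a prompt is never split across batches.
func batchSizeFor(isEmbedding bool, contextSize, requestedBatch int) int {
	if isEmbedding {
		return contextSize
	}
	return requestedBatch
}

func main() {
	fmt.Println(batchSizeFor(true, 8192, 512))  // 8192: whole input in one batch
	fmt.Println(batchSizeFor(false, 8192, 512)) // 512: normal generation path
}
```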
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
EasonLin authored
-
- 10 Dec, 2025 6 commits
-
-
Eloi Torrents authored
-
Julia Scheaffer authored
-
Gabe Goodhart authored
* feat: Bump llama.cpp to the latest master (17f7f4b)

  This brings in significant improvements to prefill performance for all models using the SSM_CONV and SSM_SCAN ops (granite4, jamba, falcon-h, nemotron-h, Qwen3 Next) on Apple Metal. See https://github.com/ggml-org/llama.cpp/pull/17876

* feat: Update patches 1-4
* fix: Update patches 5-12
* feat: Update patches 13-18
* feat: Update patch 20
* feat: Update patches 21-31
* feat: Sync vendored code

  The two files I'm not sure about here are the swap from gemma3-iswa.cpp to gemma3.cpp (which I chose to include because I think it's required) and `ggml-zendnn.h` (which I chose to omit).

Branch: LlamaCPPMetalSSMImprovements

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
-
Eva H authored
app/ui: use requestAnimationFrame to prevent bottom line cutoff in streaming thinking display (#13137)
-
Eva H authored
-
Eva H authored
-
- 09 Dec, 2025 5 commits
-
-
nicole pardal authored
-
Parth Sareen authored
-
Parth Sareen authored
-
Michael Yang authored
-
Jeffrey Morgan authored
-
- 08 Dec, 2025 5 commits
-
-
Michael Yang authored
Change to a flatter directory structure and group the options with the function. Update models to call rope in one place.
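As a loose illustration of "grouping the options with the function", a functional-options style RoPE entry point might look like the sketch below; the names and defaults are hypothetical, not the actual ml package API:

```go
package main

import "fmt"

// ropeOptions groups the rotary-embedding parameters next to the function
// that consumes them, rather than spreading per-model settings around.
type ropeOptions struct {
	base  float32
	scale float32
}

type RoPEOption func(*ropeOptions)

func WithBase(b float32) RoPEOption  { return func(o *ropeOptions) { o.base = b } }
func WithScale(s float32) RoPEOption { return func(o *ropeOptions) { o.scale = s } }

// RoPE stands in for the single place models call to apply rotary embeddings.
func RoPE(dims int, opts ...RoPEOption) {
	o := ropeOptions{base: 10000, scale: 1}
	for _, opt := range opts {
		opt(&o)
	}
	fmt.Printf("rope: dims=%d base=%g scale=%g\n", dims, o.base, o.scale)
}

func main() {
	RoPE(128, WithBase(500000), WithScale(1.0))
}
```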
-
nicole pardal authored
This PR consolidates all embedding prompt-length checking, truncation, and prompt token counting into the runner to ensure a single source of truth.
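A rough sketch of what a single source of truth in the runner could look like; the function name, parameters, and error text are illustrative only:

```go
package main

import "fmt"

// prepareEmbeddingPrompt checks prompt length, truncates when allowed, and
// reports the token count in one place, so callers don't each reimplement it.
func prepareEmbeddingPrompt(tokens []int, contextSize int, truncate bool) ([]int, int, error) {
	if len(tokens) > contextSize {
		if !truncate {
			return nil, 0, fmt.Errorf("input length %d exceeds context size %d", len(tokens), contextSize)
		}
		tokens = tokens[:contextSize]
	}
	return tokens, len(tokens), nil
}

func main() {
	toks, n, err := prepareEmbeddingPrompt(make([]int, 10), 8, true)
	fmt.Println(len(toks), n, err) // 8 8 <nil>: truncated to fit the context
}
```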
-
Daniel Hiltgen authored
Prevent CGO from accidentally reusing old object files from the cache across vendor updates
-
JJ authored
-
Jeffrey Morgan authored
-
- 06 Dec, 2025 1 commit
-
-
Daniel Hiltgen authored
Follow-up from #12992: free all streams, and keep the alloc logic aligned across streams.
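The actual change is in the C/GGML backend; purely as an illustration of the pattern (free every stream, not just the first, so alloc and free stay symmetric), a Go-flavored sketch with stand-in types:

```go
package main

import "fmt"

// stream is a stand-in for a per-device stream with its own allocation.
type stream struct{ buf []byte }

// freeAll releases every stream's allocation, not only the first one.
func freeAll(streams []*stream) {
	for i, s := range streams {
		if s != nil {
			s.buf = nil // release the per-stream allocation
			streams[i] = nil
		}
	}
}

func main() {
	streams := []*stream{{buf: make([]byte, 1024)}, {buf: make([]byte, 1024)}}
	freeAll(streams)
	fmt.Println(streams[0] == nil && streams[1] == nil) // true
}
```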
-
- 05 Dec, 2025 1 commit
-
-
Sos Pogosyan authored
fix(api): correct Content-Type header for /api/chat and /api/generate when using cloud models (#13279)

---------

Co-authored-by: Pogosyan Sos <sos_pogosyan@MacBook-Pro-Sos.local>
Co-authored-by: Patrick Devine <patrick@infrahq.com>
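A hedged sketch of the kind of fix described: set the Content-Type on the response explicitly instead of relying on whatever the cloud-proxy path left in place. The exact header values (NDJSON for streaming, JSON otherwise) are an assumption here:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// writeResponse sets an explicit Content-Type depending on whether the
// response is a stream of JSON lines or a single JSON object.
func writeResponse(w http.ResponseWriter, streaming bool, body string) {
	if streaming {
		w.Header().Set("Content-Type", "application/x-ndjson")
	} else {
		w.Header().Set("Content-Type", "application/json; charset=utf-8")
	}
	fmt.Fprint(w, body)
}

func main() {
	rec := httptest.NewRecorder()
	writeResponse(rec, true, `{"message":{"role":"assistant","content":"hi"}}`)
	fmt.Println(rec.Header().Get("Content-Type")) // application/x-ndjson
}
```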
-
- 04 Dec, 2025 3 commits
-
-
Jesse Gross authored
-
Jesse Gross authored
Although the vision component of multimodal models typically already calls the optimized nn.Attention, it is converted into non-fused operations. That is because the backend-specific fused kernels may have requirements, such as padding, that are normally satisfied by the cache, which vision encoders don't use. This implements a fallback path in the backend, softening those requirements into optimizations. In turn, this allows flash attention to be used for vision encoders, saving a significant amount of VRAM and improving performance.
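An illustrative decision sketch of "requirement softened into optimization"; the names and the padding value are stand-ins, not the backend's real API:

```go
package main

import "fmt"

// useFlashAttention shows the fallback idea: padding is treated as an
// optimization rather than a hard requirement, so inputs that are not padded
// by a KV cache (e.g. vision encoders) can still take the fused path.
func useFlashAttention(seqLen, preferredPad int, kernelRequiresPad bool) bool {
	padded := seqLen%preferredPad == 0
	if kernelRequiresPad {
		return padded // old behavior: unpadded inputs fall back to unfused ops
	}
	return true // new behavior: run FA either way; padding only helps performance
}

func main() {
	fmt.Println(useFlashAttention(577, 256, true))  // false: vision encoder length isn't padded
	fmt.Println(useFlashAttention(577, 256, false)) // true: fallback path permits it
}
```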
-
Jesse Gross authored
We currently use cache padding of 32 when not using flash attention and 256 with flash attention, based on the historic alignment requirements of these kernels. The restrictions have since been loosened, but there are still performance benefits, such as better CUDA graph reuse. Since the requirement is no longer kernel-specific, set the padding uniformly to 256, as llama.cpp does.
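The padding itself is just a round-up to the next multiple of 256; a small sketch of that arithmetic (the helper name is illustrative):

```go
package main

import "fmt"

// padCacheLength rounds the KV cache length up to a multiple of 256,
// regardless of whether flash attention is enabled (previously 32 without
// FA and 256 with FA).
func padCacheLength(n int) int {
	const pad = 256
	return (n + pad - 1) / pad * pad
}

func main() {
	fmt.Println(padCacheLength(4000)) // 4096
	fmt.Println(padCacheLength(4096)) // 4096
}
```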
-