- 15 Dec, 2025 2 commits
-
-
Eva H authored
-
Daniel Hiltgen authored
This reverts commit 56f754f46b87749581f73ef3625314bb0e51bfed.
-
- 13 Dec, 2025 2 commits
-
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
- 12 Dec, 2025 10 commits
-
-
Daniel Hiltgen authored
* flash attn: add auto mode for llama engine. If the user does not specify fa in the environment, use auto mode.
* review comments
* ensure kv cache quantized types have FA explicitly enabled; additional review comments
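A minimal Go sketch of how such an auto-mode decision could look. The environment variable handling, the capability check, and the quantized-KV rule below are illustrative assumptions, not the actual implementation:

```go
package main

import (
	"fmt"
	"os"
)

// flashAttentionEnabled is an illustrative sketch of an auto-mode decision,
// not the real code: an explicit setting wins, quantized KV cache types
// demand an explicit "on", and auto mode otherwise follows a capability check.
func flashAttentionEnabled(kvCacheType string, gpuSupportsFA bool) (bool, error) {
	quantizedKV := kvCacheType != "" && kvCacheType != "f16"
	switch os.Getenv("OLLAMA_FLASH_ATTENTION") {
	case "1", "true":
		return true, nil // explicitly enabled by the user
	case "0", "false":
		if quantizedKV {
			return false, fmt.Errorf("kv cache type %q requires flash attention", kvCacheType)
		}
		return false, nil // explicitly disabled
	default:
		// Auto mode: quantized KV cache types still need FA turned on
		// explicitly; otherwise enable it when the hardware supports it.
		if quantizedKV {
			return false, fmt.Errorf("kv cache type %q requires flash attention to be explicitly enabled", kvCacheType)
		}
		return gpuSupportsFA, nil
	}
}

func main() {
	on, err := flashAttentionEnabled("f16", true)
	fmt.Println(on, err) // true <nil> when the variable is unset (auto mode)
}
```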
-
Jeffrey Morgan authored
-
Daniel Hiltgen authored
This changes the default behavior to use the Ollama engine for supported models, while retaining the ability to disable the Ollama engine and fall back to the Llama engine. Models in the OllamaEngineRequired list will always run on the Ollama engine.
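A minimal Go sketch of that selection logic under assumed names; the Engine constants, the contents of the required list, and the opt-out flag are illustrative, not Ollama's actual types:

```go
package main

import "fmt"

// Engine identifies which runner backend serves a model. The names and the
// selection rules below are an illustrative reading of the commit message.
type Engine int

const (
	LlamaEngine Engine = iota
	OllamaEngine
)

// ollamaEngineRequired stands in for the OllamaEngineRequired list mentioned
// above: these architectures always run on the Ollama engine.
var ollamaEngineRequired = map[string]bool{
	"gemma3": true, // example entry, not the real list
}

// selectEngine defaults to the Ollama engine for supported models, but lets
// the user disable it and fall back to the llama engine, unless the model is
// on the required list.
func selectEngine(arch string, supported, userDisabledOllamaEngine bool) Engine {
	if ollamaEngineRequired[arch] {
		return OllamaEngine
	}
	if !supported || userDisabledOllamaEngine {
		return LlamaEngine
	}
	return OllamaEngine
}

func main() {
	fmt.Println(selectEngine("llama", true, true) == LlamaEngine)   // true: user opted out
	fmt.Println(selectEngine("gemma3", true, true) == OllamaEngine) // true: always Ollama engine
}
```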
-
Eva H authored
-
Eva H authored
-
Devon Rifkin authored
* docs: add docs for v1/responses and rework openai compat section

I reworked the examples to be separated by topic and to be fully runnable (i.e., they now log output instead of just suggesting how a call might be made). We now use `<CodeGroup>`s so that each example has a dropdown on the docs site for users to choose from, which makes the examples a lot more digestible (since you only see approximately a third of the code you used to). I also added a new tool to extract code examples into files so that it's easier to actually run them and check that they work.

## Example

```shell
go run docs/tools/extract-examples/main.go docs/api/openai-compatibility.mdx
```

Output:

```
Extracting code examples to: /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368
  - 01_basic.py
  - 01_basic.js
  - 01_basic.sh
  - 02_responses.py
  - 02_responses.js
  - 02_responses.sh
  - 03_vision.py
  - 03_vision.js
  - 03_vision.sh
Extracted 9 file(s) to /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368

To run examples:
  cd /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368
  npm install  # for JS examples
then run individual files with `node file.js`, `python file.py`, `bash file.sh`
```

In the future we should consider actually running the examples in CI and having some sort of acceptance test so we can automatically detect when our examples break. This is just a start in that direction.

* Update docs/api/openai-compatibility.mdx
  Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
* Update docs/api/openai-compatibility.mdx
  Co-authored-by: Parth Sareen <parth.sareen@ollama.com>

---------

Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
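For reference, a rough Go sketch of what such an extraction pass might look like: walk the MDX file line by line and collect fenced code blocks. This is a simplified illustration, not the actual docs/tools/extract-examples implementation (which also numbers the examples, writes them to files, and understands `<CodeGroup>` blocks):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Simplified illustration of extracting fenced code blocks from an MDX file.
// This sketch only prints each block with its language tag.
func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: extract <file.mdx>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	const fence = "```"
	var (
		inBlock bool
		lang    string
		body    strings.Builder
	)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		switch {
		case !inBlock && strings.HasPrefix(line, fence):
			// Opening fence: remember the language tag and start a new block.
			inBlock = true
			lang = strings.TrimPrefix(line, fence)
			body.Reset()
		case inBlock && strings.HasPrefix(line, fence):
			// Closing fence: emit the collected block.
			inBlock = false
			fmt.Printf("--- %s block ---\n%s", lang, body.String())
		case inBlock:
			body.WriteString(line + "\n")
		}
	}
}
```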
-
Parth Sareen authored
* openai: add tool call appending to previous asst message
* add tests for thinking appending
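A minimal Go sketch of the appending idea: an assistant message that only carries tool calls is folded into the preceding assistant message rather than starting a new turn. The struct shapes and the merge condition are assumptions, not the actual types in the openai compat layer:

```go
package main

import "fmt"

// Illustrative message shape for an OpenAI-style chat transcript.
type ToolCall struct{ Name string }

type Message struct {
	Role      string
	Content   string
	ToolCalls []ToolCall
}

// appendToolCalls folds an assistant message that only carries tool calls
// into the preceding assistant message instead of emitting a new turn.
func appendToolCalls(msgs []Message, next Message) []Message {
	n := len(msgs)
	if next.Role == "assistant" && len(next.ToolCalls) > 0 && next.Content == "" &&
		n > 0 && msgs[n-1].Role == "assistant" {
		msgs[n-1].ToolCalls = append(msgs[n-1].ToolCalls, next.ToolCalls...)
		return msgs
	}
	return append(msgs, next)
}

func main() {
	msgs := []Message{{Role: "assistant", Content: "Let me check."}}
	msgs = appendToolCalls(msgs, Message{Role: "assistant", ToolCalls: []ToolCall{{Name: "get_weather"}}})
	fmt.Println(len(msgs), len(msgs[0].ToolCalls)) // 1 1: merged into the prior turn
}
```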
-
Alexander Gusak authored
-
JJ authored
Correct Markdown syntax for Swollama GitHub and DocC documentation links
-
Jeffrey Morgan authored
-
- 11 Dec, 2025 5 commits
-
-
Devon Rifkin authored
Only supporting the stateless part of the API. Doc updates to come once this is shipped. Closes: #9659
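A minimal example of exercising the new endpoint from Go, assuming a local server and an example model name; the request shape mirrors OpenAI's Responses API:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// Minimal sketch of a stateless call to the OpenAI-compatible /v1/responses
// endpoint on a local Ollama server. The model name is just an example.
func main() {
	body, _ := json.Marshal(map[string]any{
		"model": "llama3.2",
		"input": "Why is the sky blue?",
	})
	resp, err := http.Post("http://localhost:11434/v1/responses", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```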
-
nicole pardal authored
This PR detects embedding models and sets batch_size = context_size so the full input fits in a single batch. Previously, if the batch size was smaller than the input, tokens could be split across batches and cause a SIGTRAP crash. This change ensures all tokens stay in one batch and prevents crashes.

Fixes: #12938 #13054

Co-authored-by: Jesse Gross <jesse@ollama.com>
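A short sketch of the gist of the change, with illustrative names rather than the actual runner fields:

```go
package main

import "fmt"

// batchSizeFor illustrates the idea above: for embedding models, size the
// batch to the context window so a prompt is never split across batches.
func batchSizeFor(isEmbedding bool, contextSize, requestedBatch int) int {
	if isEmbedding {
		return contextSize
	}
	return requestedBatch
}

func main() {
	fmt.Println(batchSizeFor(true, 8192, 512))  // 8192: whole input in one batch
	fmt.Println(batchSizeFor(false, 8192, 512)) // 512: normal generation path
}
```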
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
EasonLin authored
-
- 10 Dec, 2025 6 commits
-
-
Eloi Torrents authored
-
Julia Scheaffer authored
-
Gabe Goodhart authored
* feat: Bump llama.cpp to the latest master (17f7f4b)

  This brings in significant improvements to prefill performance for all models using the SSM_CONV and SSM_SCAN ops (granite4, jamba, falcon-h, nemotron-h, Qwen3 Next) on Apple Metal. See https://github.com/ggml-org/llama.cpp/pull/17876

* feat: Update patches 1-4
* fix: Update patches 5-12
* feat: Update patches 13-18
* feat: Update patch 20
* feat: Update patches 21-31
* feat: Sync vendored code

  The two files I'm not sure about here are the swap from gemma3-iswa.cpp to gemma3.cpp (which I chose to include because I think it's required) and `ggml-zendnn.h` (which I chose to omit).

Branch: LlamaCPPMetalSSMImprovements

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
-
Eva H authored
app/ui: use requestAnimationFrame to prevent bottom line cutoff in streaming thinking display (#13137)
-
Eva H authored
-
Eva H authored
-
- 09 Dec, 2025 5 commits
-
-
nicole pardal authored
-
Parth Sareen authored
-
Parth Sareen authored
-
Michael Yang authored
-
Jeffrey Morgan authored
-
- 08 Dec, 2025 5 commits
-
-
Michael Yang authored
Change to a flatter directory structure and group the options with the function. Update models to call rope in one place.
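As a loose illustration of "grouping the options with the function", a functional-options style RoPE entry point might look like the sketch below; the names and defaults are hypothetical, not the actual ml package API:

```go
package main

import "fmt"

// ropeOptions groups the rotary-embedding parameters next to the function
// that consumes them, rather than spreading per-model settings around.
type ropeOptions struct {
	base  float32
	scale float32
}

type RoPEOption func(*ropeOptions)

func WithBase(b float32) RoPEOption  { return func(o *ropeOptions) { o.base = b } }
func WithScale(s float32) RoPEOption { return func(o *ropeOptions) { o.scale = s } }

// RoPE stands in for the single place models call to apply rotary embeddings.
func RoPE(dims int, opts ...RoPEOption) {
	o := ropeOptions{base: 10000, scale: 1}
	for _, opt := range opts {
		opt(&o)
	}
	fmt.Printf("rope: dims=%d base=%g scale=%g\n", dims, o.base, o.scale)
}

func main() {
	RoPE(128, WithBase(500000), WithScale(1.0))
}
```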
-
nicole pardal authored
This PR consolidates all embedding prompt-length checking, truncation, and prompt token counting into the runner to ensure a single source of truth.
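A rough sketch of what a single source of truth in the runner could look like; the function name, parameters, and error text are illustrative only:

```go
package main

import "fmt"

// prepareEmbeddingPrompt checks prompt length, truncates when allowed, and
// reports the token count in one place, so callers don't each reimplement it.
func prepareEmbeddingPrompt(tokens []int, contextSize int, truncate bool) ([]int, int, error) {
	if len(tokens) > contextSize {
		if !truncate {
			return nil, 0, fmt.Errorf("input length %d exceeds context size %d", len(tokens), contextSize)
		}
		tokens = tokens[:contextSize]
	}
	return tokens, len(tokens), nil
}

func main() {
	toks, n, err := prepareEmbeddingPrompt(make([]int, 10), 8, true)
	fmt.Println(len(toks), n, err) // 8 8 <nil>: truncated to fit the context
}
```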
-
Daniel Hiltgen authored
Prevent CGO from accidentally reusing old object files from the cache across vendor updates
-
JJ authored
-
Jeffrey Morgan authored
-
- 06 Dec, 2025 1 commit
-
-
Daniel Hiltgen authored
Follow-up from #12992: free all streams, and keep the alloc logic aligned across streams.
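The actual change is in the C/GGML backend; purely as an illustration of the pattern (free every stream, not just the first, so alloc and free stay symmetric), a Go-flavored sketch with stand-in types:

```go
package main

import "fmt"

// stream is a stand-in for a per-device stream with its own allocation.
type stream struct{ buf []byte }

// freeAll releases every stream's allocation, not only the first one.
func freeAll(streams []*stream) {
	for i, s := range streams {
		if s != nil {
			s.buf = nil // release the per-stream allocation
			streams[i] = nil
		}
	}
}

func main() {
	streams := []*stream{{buf: make([]byte, 1024)}, {buf: make([]byte, 1024)}}
	freeAll(streams)
	fmt.Println(streams[0] == nil && streams[1] == nil) // true
}
```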
-
- 05 Dec, 2025 1 commit
-
-
Sos Pogosyan authored
fix(api): correct Content-Type header for /api/chat and /api/generate when using cloud models (#13279)

---------

Co-authored-by: Pogosyan Sos <sos_pogosyan@MacBook-Pro-Sos.local>
Co-authored-by: Patrick Devine <patrick@infrahq.com>
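A hedged sketch of the kind of fix described: set the Content-Type on the response explicitly instead of relying on whatever the cloud-proxy path left in place. The exact header values (NDJSON for streaming, JSON otherwise) are an assumption here:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// writeResponse sets an explicit Content-Type depending on whether the
// response is a stream of JSON lines or a single JSON object.
func writeResponse(w http.ResponseWriter, streaming bool, body string) {
	if streaming {
		w.Header().Set("Content-Type", "application/x-ndjson")
	} else {
		w.Header().Set("Content-Type", "application/json; charset=utf-8")
	}
	fmt.Fprint(w, body)
}

func main() {
	rec := httptest.NewRecorder()
	writeResponse(rec, true, `{"message":{"role":"assistant","content":"hi"}}`)
	fmt.Println(rec.Header().Get("Content-Type")) // application/x-ndjson
}
```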
-
- 04 Dec, 2025 3 commits
-
-
Jesse Gross authored
-
Jesse Gross authored
Although the vision component of multimodal models typically already calls the optimized nn.Attention, it is converted into non-fused operations. That is because the backend-specific fused kernels may have requirements, such as padding, that are normally satisfied by the cache, which vision encoders don't use. This implements a fallback path in the backend, softening those requirements into optimizations. In turn, this allows flash attention to be used for vision encoders, saving a significant amount of VRAM and improving performance.
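An illustrative decision sketch of "requirement softened into optimization"; the names and the padding value are stand-ins, not the backend's real API:

```go
package main

import "fmt"

// useFlashAttention shows the fallback idea: padding is treated as an
// optimization rather than a hard requirement, so inputs that are not padded
// by a KV cache (e.g. vision encoders) can still take the fused path.
func useFlashAttention(seqLen, preferredPad int, kernelRequiresPad bool) bool {
	padded := seqLen%preferredPad == 0
	if kernelRequiresPad {
		return padded // old behavior: unpadded inputs fall back to unfused ops
	}
	return true // new behavior: run FA either way; padding only helps performance
}

func main() {
	fmt.Println(useFlashAttention(577, 256, true))  // false: vision encoder length isn't padded
	fmt.Println(useFlashAttention(577, 256, false)) // true: fallback path permits it
}
```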
-
Jesse Gross authored
We currently use cache padding of 32 when not using flash attention and 256 with flash attention, based on the historic alignment requirements of these kernels. The restrictions have since been loosened, but there are still performance benefits, such as better CUDA graph reuse. Since the requirement is no longer kernel-specific, set the padding uniformly to 256, as llama.cpp does.
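The padding itself is just a round-up to the next multiple of 256; a small sketch of that arithmetic (the helper name is illustrative):

```go
package main

import "fmt"

// padCacheLength rounds the KV cache length up to a multiple of 256,
// regardless of whether flash attention is enabled (previously 32 without
// FA and 256 with FA).
func padCacheLength(n int) int {
	const pad = 256
	return (n + pad - 1) / pad * pad
}

func main() {
	fmt.Println(padCacheLength(4000)) // 4096
	fmt.Println(padCacheLength(4096)) // 4096
}
```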
-