Commits · 20b53eaa726a4c08043c7af1fa6a322295edcde2 · OpenDAS / ollama

"vscode:/vscode.git/clone" did not exist on "c626177d6ba65a211b8f791d612cdcf8b9c0fe7e"

09 Sep, 2025 4 commits

tests: add tool calling integration test (#12232) · 20b53eaa
Parth Sareen authored Sep 09, 2025

20b53eaa

tests: reduce stress on CPU to 2 models (#12161) · 67451828

Daniel Hiltgen authored Sep 09, 2025

* tests: reduce stress on CPU to 2 models

This should avoid flakes due to systems getting overloaded with 3 (or more) models running concurrently

* tests: allow slow systems to pass on timeout

If a slow system is still streaming a response, and the response
will pass validation, don't fail just because the system is slow.

* test: unload embedding models more quickly

67451828

readme: add Clueless to community integrations (#12188) · f810ec74
Kashyap Tanuku authored Sep 09, 2025

f810ec74

llm: Clamp batch size to context size · e119783e

Jesse Gross authored Sep 08, 2025

The context must always be able to store the current batch, so
if the user requests a small context then we should also shrink
the batch to match. This also fixes the TestLongInputContext
test on the new engine. (The old engine already has this behavior.)

e119783e

08 Sep, 2025 4 commits

runner: move harmony to runner (#12052) · 1a558f98
Parth Sareen authored Sep 08, 2025

1a558f98

Hybrid and recurrent memory estimates (#12186) · 7b91c9ce

Gabe Goodhart authored Sep 08, 2025

This PR updates the memory size estimate logic to better handle recurrent and hybrid-recurrent models which are currently being badly overestimated because the default logic assumes full attention for all layers.

The logic for the sizing of the recurrent layers comes from the llama.cpp implementation

ggml_tensor * r = ggml_new_tensor_1d(ctx, type_r, hparams.n_embd_r()*mem_size);
ggml_tensor * s = ggml_new_tensor_1d(ctx, type_s, hparams.n_embd_s()*mem_size);
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

7b91c9ce

docs: show how to debug nvidia init failures (#12216) · 950d33aa
Daniel Hiltgen authored Sep 08, 2025
```
This debug setting can help troubleshoot obscure initialization failures.
```
950d33aa
fix: nil pointer dereference if cache is nil (#12215) · 9714e38d
Michael Yang authored Sep 08, 2025

9714e38d

05 Sep, 2025 1 commit

parser: don't check the file type of safetensors to prevent false negatives. (#12176) · 4378ae4f

frob authored Sep 06, 2025



* Don't check the file type of safetensor to prevent false negatives.

---------
Co-authored-by: Patrick Devine <patrick@infrahq.com>

4378ae4f

04 Sep, 2025 2 commits
- embedding gemma model (#12181) · 5994e8e8
  Michael Yang authored Sep 04, 2025
```
* ollama: add embeddings
```
  5994e8e8
- more logutil.Trace (#12177) · b3e61207
  Michael Yang authored Sep 03, 2025
  
  b3e61207
02 Sep, 2025 3 commits
- logutil: add Trace and TraceContext helpers (#12110) · fb92b617
  Michael Yang authored Sep 02, 2025
  
  fb92b617
- llm: Avoid underflow in free memory logging · 8149a3c8
  Jesse Gross authored Sep 02, 2025
```
If a GPU's free memory is less than the reserved amount, we might get
an underflow. Since it is an unsigned uint64, we print this as a large
number rather than the more correct 0. This only affects logging, the
actual layout code already handles this correctly.

Bug #12138
```
  8149a3c8
- harden uncaught exception registration (#12120) · 0cc90a81
  Daniel Hiltgen authored Sep 02, 2025
  
  0cc90a81
31 Aug, 2025 2 commits
- ml: fix struct field name in comment (#12123) · e42300f2
  pxwanglu authored Sep 01, 2025
  
  e42300f2
- readme: add NOMYO Router to community integrations (#12129) · 66e73809
  alpha-nerd-nomyo authored Aug 31, 2025
  
  66e73809
29 Aug, 2025 2 commits

perf: build graph for next batch async to keep GPU busy (#11863) · 517807cd

Daniel Hiltgen authored Aug 29, 2025

* perf: build graph for next batch in parallel to keep GPU busy

This refactors the main run loop of the ollama runner to perform the main GPU
intensive tasks (Compute+Floats) in a go routine so we can prepare the next
batch in parallel to reduce the amount of time the GPU stalls waiting for the
next batch of work.

* tests: tune integration tests for ollama engine

This tunes the integration tests to focus more on models supported
by the new engine.

517807cd

Always filter devices (#12108) · ead4a9a1

Daniel Hiltgen authored Aug 29, 2025

* Always filter devices

Avoid crashing on unsupported AMD iGPUs

* Remove cuda device filtering

This interferes with mixed setups

ead4a9a1

28 Aug, 2025 1 commit
- readme: add Neuro SAN to community integrations (#12109) · 4383a3ab
  ofrancon authored Aug 28, 2025
  
  4383a3ab
27 Aug, 2025 2 commits

ggml: Avoid allocating CUDA primary context on unused GPUs · 9d97e6a9

Jesse Gross authored Aug 26, 2025

The recent memory management changes caused all GPUs to be visible
to the runner, regardless of whether they are ultimately used. This
caused CUDA devices to allocate a primary context (~300 MB VRAM) on
each GPU, for each model. This is unnecessary, so we can both avoid
touching GPUs that we exclude in the early stage of allocation and
freeing the memory for any that we touch but don't use.

The issue will continue to exist for the old engine, since it touches
all devices during initialization.

9d97e6a9

fix keep alive (#12041) · 10815324
Michael Yang authored Aug 27, 2025

10815324

26 Aug, 2025 3 commits

convert(gptoss): mxfp4 to ggml layout to avoid jit conversion (#12018) · 59412fbb
Michael Yang authored Aug 26, 2025
```
* convert: return bytes written

* ggml flavor mxfp4

* simplify jit conversion

* comment
```
59412fbb

convert: fix tensor sorting (#12015) · 86834a27

Michael Yang authored Aug 26, 2025

there's two bugs here.

1. the check for a layer id is incorrect and should be >= 0 since layer
   0 is valid
2. if both tensors have an layer identifier, it will only compare the
   layer id which will return 0 if the tensors are in the same layer.
   instead it should fallback to comparing the full tensor name

86834a27

gptoss: enable flash attention by default (#11996) · 85ccf735
Michael Yang authored Aug 26, 2025

85ccf735

25 Aug, 2025 1 commit
- remove extra field attr (#11205) · 30fb7e19
  Michael Yang authored Aug 25, 2025
  
  30fb7e19
22 Aug, 2025 6 commits
- api: implement stringer for ToolFunctionParameters (#12038) · d3450dd5
  Jeffrey Morgan authored Aug 22, 2025
  
  d3450dd5
- tools: avoid matching braces that are part of tool content (#12039) · 4bcb04ad
  Jeffrey Morgan authored Aug 22, 2025
  
  4bcb04ad
- Merge pull request #12021 from ollama/drifkin/thinking-double-emit · e3d57087
  Devon Rifkin authored Aug 22, 2025
```
thinking: fix double emit when no opening tag
```
  e3d57087
- server: skip parsing initial <think> if provided in the prompt (#12024) · 4be4dc87
  Jeffrey Morgan authored Aug 22, 2025
  
  4be4dc87
- chore: remove redundant words in comment (#12028) · 109d4fc3
  zoupingshi authored Aug 23, 2025
```
Signed-off-by: zoupingshi <hangfachang@outlook.com>
```
  109d4fc3
- thinking: fix double emit when no opening tag · 2cb0a580
  Devon Rifkin authored Aug 21, 2025
```
The thinking parser will automatically transition to being a
pass-through if non-whitespace is seen before an opening tag. However,
we weren't clearing the buffer after the first non-whitespace input, so
in practice the first token would be emitted twice.

Added a test that demonstrated this, and then fixed the bug.
```
  2cb0a580
21 Aug, 2025 1 commit
- harmony: move harmony parsing into a package (#12016) · 7cce5aac
  Parth Sareen authored Aug 21, 2025
  
  7cce5aac
20 Aug, 2025 6 commits

gpt-oss: convert from hugging face format (#11907) · 4ae4f47b
Michael Yang authored Aug 20, 2025

4ae4f47b

llm: Don't always evict models in CPU-only mode · 073fa31d

Jesse Gross authored Aug 20, 2025

With old memory estimates, it's currently impossible to load more
than one model at a time when no GPUs are available. This is because
the check for whether we need to evict a model looks to see if all
layers of the new model can be loaded onto GPUs, which is never true
if there are no GPUs. Before the memory management changes, there
was a special code path for CPU-only systems.

This problem does not exist with new memory estimates.

Fixes #11974

073fa31d

openai: remove reasoning as an api.Options (#11993) · 91fc3c48
Michael Yang authored Aug 20, 2025

91fc3c48
Merge pull request #11973 from ollama/drifkin/bpe · 6de62664
Devon Rifkin authored Aug 19, 2025
```
model: fix boundary in bpe
```
6de62664
model: add bpe roundtripping tests · 463a6caa
Devon Rifkin authored Aug 19, 2025

463a6caa

model: fix boundary in bpe · fc5fb09f

Devon Rifkin authored Aug 19, 2025

0x007e is a tilde and was getting adjusted (+0x00a2) to 0x0120 in the
encode, but then in the decode it was getting adjusted down (-0x0100) to
0x0020. The boundary for the +0x00a2 case has been adjusted to fix this

Fixes: #11966

fc5fb09f

19 Aug, 2025 2 commits

kvcache: Use Cast instead of Copy for flash attention masks · 05ccb17c

Jesse Gross authored Aug 19, 2025

Flash attention kernels require the mask of the KV cache be a F16
rather than an F32. We can use the GGML operation ggml_cast to do
this rather than doing it ourselves, which allows reuse of a
preallocated buffer in the graph rather than allocating a new one
for each batch. This improves token generation performance with
flash attention by 10-30% (with gpt-oss). This also makes performance
with flash attention better than without it, as expected.

05ccb17c

disable output_all (#11959) · f804e8a4
Michael Yang authored Aug 18, 2025

f804e8a4