Commits · 950d33aa3045581906c2db11f33d2a4c2bca3033 · OpenDAS / ollama

08 Sep, 2025 2 commits
- docs: show how to debug nvidia init failures (#12216) · 950d33aa
  Daniel Hiltgen authored Sep 08, 2025
```
This debug setting can help troubleshoot obscure initialization failures.
```
  950d33aa
- fix: nil pointer dereference if cache is nil (#12215) · 9714e38d
  Michael Yang authored Sep 08, 2025
  
  9714e38d
05 Sep, 2025 1 commit

parser: don't check the file type of safetensors to prevent false negatives. (#12176) · 4378ae4f

frob authored Sep 06, 2025



* Don't check the file type of safetensors to prevent false negatives.

---------
Co-authored-by: Patrick Devine <patrick@infrahq.com>

4378ae4f

04 Sep, 2025 2 commits
- embedding gemma model (#12181) · 5994e8e8
  Michael Yang authored Sep 04, 2025
```
* ollama: add embeddings
```
  5994e8e8
- more logutil.Trace (#12177) · b3e61207
  Michael Yang authored Sep 03, 2025
  
  b3e61207
02 Sep, 2025 3 commits
- logutil: add Trace and TraceContext helpers (#12110) · fb92b617
  Michael Yang authored Sep 02, 2025
  
  fb92b617
- llm: Avoid underflow in free memory logging · 8149a3c8
  Jesse Gross authored Sep 02, 2025
```
If a GPU's free memory is less than the reserved amount, the subtraction
can underflow. Since the value is a uint64, we print it as a very large
number rather than the more correct 0. This only affects logging; the
actual layout code already handles this correctly.

Bug #12138
```
  8149a3c8
- harden uncaught exception registration (#12120) · 0cc90a81
  Daniel Hiltgen authored Sep 02, 2025
  
  0cc90a81
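
The underflow described in commit 8149a3c8 above can be reproduced in a few lines. The following is a minimal sketch, not the code from the commit: `availableMemory`, `free`, and `reserved` are hypothetical names chosen for illustration. It shows how plain `uint64` subtraction wraps around and how clamping at zero avoids the misleading value.

```go
package main

import "fmt"

// availableMemory returns free minus reserved, clamped at zero.
// The function and parameter names are hypothetical; they are not
// taken from the ollama source.
func availableMemory(free, reserved uint64) uint64 {
	if reserved > free {
		return 0
	}
	return free - reserved
}

func main() {
	var free, reserved uint64 = 256 << 20, 512 << 20

	// Plain unsigned subtraction wraps around to a very large value.
	fmt.Println("wrapped:", free-reserved)

	// The clamped version reports 0 instead.
	fmt.Println("clamped:", availableMemory(free, reserved))
}
```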
31 Aug, 2025 2 commits
- ml: fix struct field name in comment (#12123) · e42300f2
  pxwanglu authored Sep 01, 2025
  
  e42300f2
- readme: add NOMYO Router to community integrations (#12129) · 66e73809
  alpha-nerd-nomyo authored Aug 31, 2025
  
  66e73809
29 Aug, 2025 2 commits

perf: build graph for next batch async to keep GPU busy (#11863) · 517807cd

Daniel Hiltgen authored Aug 29, 2025

* perf: build graph for next batch in parallel to keep GPU busy

This refactors the main run loop of the ollama runner to perform the main
GPU-intensive tasks (Compute+Floats) in a goroutine so we can prepare the next
batch in parallel to reduce the amount of time the GPU stalls waiting for the
next batch of work.

* tests: tune integration tests for ollama engine

This tunes the integration tests to focus more on models supported
by the new engine.

517807cd
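
The overlap described in commit 517807cd can be sketched as follows. This is an illustration under stated assumptions, not the runner's actual loop: `runPipelined`, `prepare`, and `compute` are hypothetical names, with `compute` standing in for the GPU-intensive step. Only one compute runs at a time, while the main loop prepares the next batch.

```go
package main

import "fmt"

// runPipelined prepares batch i+1 on the calling goroutine while a
// background goroutine computes batch i. All names are hypothetical.
func runPipelined(inputs []int, prepare func(int) string, compute func(string) string) []string {
	results := make([]string, len(inputs))
	var pending chan struct{} // closed when the in-flight compute finishes

	for i, in := range inputs {
		batch := prepare(in) // overlaps with the previous compute, if any

		if pending != nil {
			<-pending // wait so that only one compute is in flight
		}

		done := make(chan struct{})
		pending = done
		go func(i int, batch string) {
			defer close(done)
			results[i] = compute(batch)
		}(i, batch)
	}

	if pending != nil {
		<-pending // wait for the last compute before returning
	}
	return results
}

func main() {
	prepare := func(n int) string { return fmt.Sprintf("batch-%d", n) }
	compute := func(b string) string { return "done:" + b }
	fmt.Println(runPipelined([]int{1, 2, 3}, prepare, compute))
}
```

Each write to `results[i]` is ordered before the corresponding channel close, so the final read after the last wait is free of data races.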

Always filter devices (#12108) · ead4a9a1

Daniel Hiltgen authored Aug 29, 2025

* Always filter devices

Avoid crashing on unsupported AMD iGPUs

* Remove cuda device filtering

This interferes with mixed setups

ead4a9a1

28 Aug, 2025 1 commit
- readme: add Neuro SAN to community integrations (#12109) · 4383a3ab
  ofrancon authored Aug 28, 2025
  
  4383a3ab
27 Aug, 2025 2 commits

ggml: Avoid allocating CUDA primary context on unused GPUs · 9d97e6a9

Jesse Gross authored Aug 26, 2025

The recent memory management changes caused all GPUs to be visible
to the runner, regardless of whether they are ultimately used. This
caused CUDA devices to allocate a primary context (~300 MB VRAM) on
each GPU, for each model. This is unnecessary, so we can both avoid
touching GPUs that we exclude in the early stage of allocation and
free the memory for any that we touch but don't use.

The issue will continue to exist for the old engine, since it touches
all devices during initialization.

9d97e6a9

fix keep alive (#12041) · 10815324
Michael Yang authored Aug 27, 2025

10815324

26 Aug, 2025 3 commits

convert(gptoss): mxfp4 to ggml layout to avoid jit conversion (#12018) · 59412fbb
Michael Yang authored Aug 26, 2025
```
* convert: return bytes written

* ggml flavor mxfp4

* simplify jit conversion

* comment
```
59412fbb

convert: fix tensor sorting (#12015) · 86834a27

Michael Yang authored Aug 26, 2025

There are two bugs here.

1. the check for a layer id is incorrect and should be >= 0 since layer
   0 is valid
2. if both tensors have a layer identifier, it will only compare the
   layer id, which will return 0 if the tensors are in the same layer.
   instead it should fall back to comparing the full tensor name

86834a27
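
The two points above can be illustrated with a small comparator. This is a sketch under assumptions, not the code from the commit: the `blk.N.` naming scheme, the helper names `layerID` and `lessTensor`, and the choice to sort names without a layer id first are all made up for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// layerID extracts N from a name of the form "blk.N.rest" and returns
// -1 when the name carries no layer identifier. The naming scheme is
// an assumption made for this sketch.
func layerID(name string) int {
	parts := strings.Split(name, ".")
	if len(parts) < 2 || parts[0] != "blk" {
		return -1
	}
	n, err := strconv.Atoi(parts[1])
	if err != nil || n < 0 {
		return -1 // only negative values are invalid; layer 0 is valid
	}
	return n
}

// lessTensor orders by layer id first and falls back to the full
// tensor name when both names are in the same layer. Names without a
// layer id (-1) sort first, which is a choice made for this sketch.
func lessTensor(a, b string) bool {
	x, y := layerID(a), layerID(b)
	if x != y {
		return x < y
	}
	return a < b
}

func main() {
	names := []string{
		"blk.10.ffn_up.weight",
		"blk.0.ffn_up.weight",
		"blk.2.attn_q.weight",
		"blk.0.attn_q.weight",
		"output.weight",
	}
	sort.Slice(names, func(i, j int) bool { return lessTensor(names[i], names[j]) })
	fmt.Println(names)
}
```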

gptoss: enable flash attention by default (#11996) · 85ccf735
Michael Yang authored Aug 26, 2025

85ccf735

25 Aug, 2025 1 commit
- remove extra field attr (#11205) · 30fb7e19
  Michael Yang authored Aug 25, 2025
  
  30fb7e19
22 Aug, 2025 6 commits
- api: implement stringer for ToolFunctionParameters (#12038) · d3450dd5
  Jeffrey Morgan authored Aug 22, 2025
  
  d3450dd5
- tools: avoid matching braces that are part of tool content (#12039) · 4bcb04ad
  Jeffrey Morgan authored Aug 22, 2025
  
  4bcb04ad
- Merge pull request #12021 from ollama/drifkin/thinking-double-emit · e3d57087
  Devon Rifkin authored Aug 22, 2025
```
thinking: fix double emit when no opening tag
```
  e3d57087
- server: skip parsing initial <think> if provided in the prompt (#12024) · 4be4dc87
  Jeffrey Morgan authored Aug 22, 2025
  
  4be4dc87
- chore: remove redundant words in comment (#12028) · 109d4fc3
  zoupingshi authored Aug 23, 2025
```
Signed-off-by: zoupingshi <hangfachang@outlook.com>
```
  109d4fc3
- thinking: fix double emit when no opening tag · 2cb0a580
  Devon Rifkin authored Aug 21, 2025
```
The thinking parser will automatically transition to being a
pass-through if non-whitespace is seen before an opening tag. However,
we weren't clearing the buffer after the first non-whitespace input, so
in practice the first token would be emitted twice.

Added a test that demonstrated this, and then fixed the bug.
```
  2cb0a580
21 Aug, 2025 1 commit
- harmony: move harmony parsing into a package (#12016) · 7cce5aac
  Parth Sareen authored Aug 21, 2025
  
  7cce5aac
20 Aug, 2025 6 commits

gpt-oss: convert from hugging face format (#11907) · 4ae4f47b
Michael Yang authored Aug 20, 2025

4ae4f47b

llm: Don't always evict models in CPU-only mode · 073fa31d

Jesse Gross authored Aug 20, 2025

With old memory estimates, it's currently impossible to load more
than one model at a time when no GPUs are available. This is because
the check for whether we need to evict a model looks to see if all
layers of the new model can be loaded onto GPUs, which is never true
if there are no GPUs. Before the memory management changes, there
was a special code path for CPU-only systems.

This problem does not exist with new memory estimates.

Fixes #11974

073fa31d

openai: remove reasoning as an api.Options (#11993) · 91fc3c48
Michael Yang authored Aug 20, 2025

91fc3c48
Merge pull request #11973 from ollama/drifkin/bpe · 6de62664
Devon Rifkin authored Aug 19, 2025
```
model: fix boundary in bpe
```
6de62664
model: add bpe roundtripping tests · 463a6caa
Devon Rifkin authored Aug 19, 2025

463a6caa

model: fix boundary in bpe · fc5fb09f

Devon Rifkin authored Aug 19, 2025

0x007e is a tilde and was getting adjusted (+0x00a2) to 0x0120 in the
encode, but then in the decode it was getting adjusted down (-0x0100) to
0x0020. The boundary for the +0x00a2 case has been adjusted to fix this.

Fixes: #11966

fc5fb09f
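
The collision described above can be checked with a round-trip sketch. The offsets (+0x0100, +0x00a2) come from the commit message; the exact range boundaries and the 0x00ad special case follow the byte-to-rune table commonly used by GPT-2 style byte-level BPE and are assumptions here, as are the names `encodeByte` and `decodeRune`.

```go
package main

import "fmt"

// encodeByte maps a raw byte to a printable rune. With a lower bound
// of 0x007e instead of 0x007f in the third case, '~' would map to
// 0x0120, the same rune that a space (0x0020 + 0x0100) maps to.
func encodeByte(b byte) rune {
	r := rune(b)
	switch {
	case r == 0x00ad:
		return 0x0143
	case r <= 0x0020:
		return r + 0x0100
	case r >= 0x007f && r <= 0x00a0:
		return r + 0x00a2
	default:
		return r
	}
}

// decodeRune inverts encodeByte for every rune it can produce.
func decodeRune(r rune) byte {
	switch {
	case r == 0x0143:
		return 0xad
	case r >= 0x0100 && r <= 0x0120:
		return byte(r - 0x0100)
	case r >= 0x0121 && r <= 0x0142:
		return byte(r - 0x00a2)
	default:
		return byte(r)
	}
}

func main() {
	// Every byte value should survive an encode/decode round trip.
	for i := 0; i < 256; i++ {
		b := byte(i)
		if got := decodeRune(encodeByte(b)); got != b {
			fmt.Printf("round trip failed for 0x%02x: got 0x%02x\n", b, got)
		}
	}
	fmt.Printf("'~' encodes to U+%04X\n", encodeByte('~'))
}
```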

19 Aug, 2025 2 commits

kvcache: Use Cast instead of Copy for flash attention masks · 05ccb17c

Jesse Gross authored Aug 19, 2025

Flash attention kernels require the mask of the KV cache to be an F16
rather than an F32. We can use the GGML operation ggml_cast to do
this rather than doing it ourselves, which allows reuse of a
preallocated buffer in the graph rather than allocating a new one
for each batch. This improves token generation performance with
flash attention by 10-30% (with gpt-oss). This also makes performance
with flash attention better than without it, as expected.

05ccb17c

disable output_all (#11959) · f804e8a4
Michael Yang authored Aug 18, 2025

f804e8a4

18 Aug, 2025 6 commits

readme: add any-agent to community integrations (#11950) · 9cfbffaf
Kostis authored Aug 19, 2025

9cfbffaf
readme: add Andes to community integrations (#11952) · 470d5802
Ruslan Suleymanov authored Aug 19, 2025

470d5802
Merge pull request #11910 from ollama/drifkin/harmony-fn-names · b517bb1c
Devon Rifkin authored Aug 18, 2025
```
harmony: convert fn names to be valid ts identifiers
```
b517bb1c

llm: Check for nil memory data before printing · e3ade453

Jesse Gross authored Aug 18, 2025

We dump out our best memory estimate after we complete processing
for any reason, including errors. This is helpful for finding
what stopped us in error conditions, but in some cases we might not
have gotten even the first result yet.

Fixes #11957

e3ade453

harmony: convert fn names to be valid ts identifiers · 048bd447

Devon Rifkin authored Aug 14, 2025

In <https://github.com/ollama/ollama/issues/11704#issuecomment-3177380197>
I noticed that hyphens in function names could possibly cause the model
to become confused. Later in that issue I found other explanations, but
at a minimum tool names with spaces in them are confusing to the model
because of the prompt format.

In this change I create a mapper that converts arbitrary tool names into
valid typescript identifiers. It's a little overly strict in that it
doesn't allow all unicode characters that might be valid in ts
identifiers, but it's still very permissive. Since mappings aren't
reversible, we must temporarily store this mapping in order to unmap it
if the model comes back with a call. We also handle the case where
multiple names collide into the same mapping and append a counter to
the end to make them unique.

048bd447
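
A mapper of the kind described above might look like the following. This is a minimal sketch, not the implementation from the commit: the type `NameMap`, its methods, the ASCII-only character set, and the `_2`, `_3` counter format are all assumptions made for illustration, and TypeScript reserved words are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// NameMap is a hypothetical type that remembers how tool names were
// rewritten so a model's call can be mapped back to the original.
type NameMap struct {
	toOriginal map[string]string // mapped identifier -> original name
	toMapped   map[string]string // original name -> mapped identifier
}

func NewNameMap() *NameMap {
	return &NameMap{
		toOriginal: make(map[string]string),
		toMapped:   make(map[string]string),
	}
}

// sanitize replaces every character outside [A-Za-z0-9_$] with '_' and
// prefixes '_' when the result is empty or starts with a digit. This
// is stricter than the TypeScript identifier grammar.
func sanitize(name string) string {
	var sb strings.Builder
	for _, r := range name {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z',
			r >= '0' && r <= '9', r == '_', r == '$':
			sb.WriteRune(r)
		default:
			sb.WriteByte('_')
		}
	}
	s := sb.String()
	if s == "" || (s[0] >= '0' && s[0] <= '9') {
		s = "_" + s
	}
	return s
}

// Map returns a unique identifier for the original name, appending a
// counter when two different names sanitize to the same identifier.
func (m *NameMap) Map(original string) string {
	if id, ok := m.toMapped[original]; ok {
		return id
	}
	base := sanitize(original)
	id := base
	for n := 2; ; n++ {
		if _, taken := m.toOriginal[id]; !taken {
			break
		}
		id = fmt.Sprintf("%s_%d", base, n)
	}
	m.toOriginal[id] = original
	m.toMapped[original] = id
	return id
}

// Unmap recovers the original tool name from a mapped identifier.
func (m *NameMap) Unmap(id string) (string, bool) {
	original, ok := m.toOriginal[id]
	return original, ok
}

func main() {
	m := NewNameMap()
	for _, name := range []string{"get weather", "get-weather", "1st_tool"} {
		fmt.Printf("%q -> %s\n", name, m.Map(name))
	}
	original, _ := m.Unmap("get_weather_2")
	fmt.Printf("get_weather_2 -> %q\n", original)
}
```

Storing both directions keeps repeated calls to `Map` stable for the same tool name and lets `Unmap` resolve a model's call even though the sanitizing step itself is not reversible.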

Merge pull request #11875 from ollama/drifkin/print-template · ec8bf5e6
Devon Rifkin authored Aug 18, 2025
```
server: add debug option for printing out prompt instead of calling model
```
ec8bf5e6