Commits · ead4a9a1d073bdc2f175e429daa85f0e3a66fda7 · OpenDAS / ollama

29 Aug, 2025 1 commit

Always filter devices (#12108) · ead4a9a1

Daniel Hiltgen authored Aug 29, 2025

* Always filter devices

Avoid crashing on unsupported AMD iGPUs

* Remove cuda device filtering

This interferes with mixed setups

ead4a9a1

28 Aug, 2025 1 commit
- readme: add Neuro SAN to community integrations (#12109) · 4383a3ab
  ofrancon authored Aug 28, 2025
  
  4383a3ab
27 Aug, 2025 2 commits

ggml: Avoid allocating CUDA primary context on unused GPUs · 9d97e6a9

Jesse Gross authored Aug 26, 2025

The recent memory management changes caused all GPUs to be visible
to the runner, regardless of whether they are ultimately used. This
caused CUDA devices to allocate a primary context (~300 MB VRAM) on
each GPU, for each model. This is unnecessary, so we can both avoid
touching GPUs that we exclude in the early stage of allocation and
free the memory for any that we touch but don't use.

The issue will continue to exist for the old engine, since it touches
all devices during initialization.

9d97e6a9

fix keep alive (#12041) · 10815324
Michael Yang authored Aug 27, 2025

10815324

26 Aug, 2025 3 commits

convert(gptoss): mxfp4 to ggml layout to avoid jit conversion (#12018) · 59412fbb
Michael Yang authored Aug 26, 2025
```
* convert: return bytes written

* ggml flavor mxfp4

* simplify jit conversion

* comment
```
59412fbb

convert: fix tensor sorting (#12015) · 86834a27

Michael Yang authored Aug 26, 2025

there are two bugs here.

1. the check for a layer id is incorrect and should be >= 0 since layer
   0 is valid
2. if both tensors have a layer identifier, it will only compare the
   layer id which will return 0 if the tensors are in the same layer.
   instead it should fall back to comparing the full tensor name

86834a27

gptoss: enable flash attention by default (#11996) · 85ccf735
Michael Yang authored Aug 26, 2025

85ccf735

25 Aug, 2025 1 commit
- remove extra field attr (#11205) · 30fb7e19
  Michael Yang authored Aug 25, 2025
  
  30fb7e19
22 Aug, 2025 6 commits
- api: implement stringer for ToolFunctionParameters (#12038) · d3450dd5
  Jeffrey Morgan authored Aug 22, 2025
  
  d3450dd5
- tools: avoid matching braces that are part of tool content (#12039) · 4bcb04ad
  Jeffrey Morgan authored Aug 22, 2025
  
  4bcb04ad
- Merge pull request #12021 from ollama/drifkin/thinking-double-emit · e3d57087
  Devon Rifkin authored Aug 22, 2025
```
thinking: fix double emit when no opening tag
```
  e3d57087
- server: skip parsing initial <think> if provided in the prompt (#12024) · 4be4dc87
  Jeffrey Morgan authored Aug 22, 2025
  
  4be4dc87
- chore: remove redundant words in comment (#12028) · 109d4fc3
  zoupingshi authored Aug 23, 2025
```
Signed-off-by: zoupingshi <hangfachang@outlook.com>
```
  109d4fc3
- thinking: fix double emit when no opening tag · 2cb0a580
  Devon Rifkin authored Aug 21, 2025
```
The thinking parser will automatically transition to being a
pass-through if non-whitespace is seen before an opening tag. However,
we weren't clearing the buffer after the first non-whitespace input, so
in practice the first token would be emitted twice.

Added a test that demonstrated this, and then fixed the bug.
```
  2cb0a580
21 Aug, 2025 1 commit
- harmony: move harmony parsing into a package (#12016) · 7cce5aac
  Parth Sareen authored Aug 21, 2025
  
  7cce5aac
20 Aug, 2025 6 commits

gpt-oss: convert from hugging face format (#11907) · 4ae4f47b
Michael Yang authored Aug 20, 2025

4ae4f47b

llm: Don't always evict models in CPU-only mode · 073fa31d

Jesse Gross authored Aug 20, 2025

With old memory estimates, it's currently impossible to load more
than one model at a time when no GPUs are available. This is because
the check for whether we need to evict a model looks to see if all
layers of the new model can be loaded onto GPUs, which is never true
if there are no GPUs. Before the memory management changes, there
was a special code path for CPU-only systems.

This problem does not exist with new memory estimates.

Fixes #11974

073fa31d

openai: remove reasoning as an api.Options (#11993) · 91fc3c48
Michael Yang authored Aug 20, 2025

91fc3c48
Merge pull request #11973 from ollama/drifkin/bpe · 6de62664
Devon Rifkin authored Aug 19, 2025
```
model: fix boundary in bpe
```
6de62664
model: add bpe roundtripping tests · 463a6caa
Devon Rifkin authored Aug 19, 2025

463a6caa

model: fix boundary in bpe · fc5fb09f

Devon Rifkin authored Aug 19, 2025

0x007e is a tilde and was getting adjusted (+0x00a2) to 0x0120 in the
encode, but then in the decode it was getting adjusted down (-0x0100) to
0x0020. The boundary for the +0x00a2 case has been adjusted to fix this

Fixes: #11966

fc5fb09f

19 Aug, 2025 2 commits

kvcache: Use Cast instead of Copy for flash attention masks · 05ccb17c

Jesse Gross authored Aug 19, 2025

Flash attention kernels require the KV cache mask to be F16
rather than F32. We can use the GGML operation ggml_cast to do
this rather than doing it ourselves, which allows reuse of a
preallocated buffer in the graph rather than allocating a new one
for each batch. This improves token generation performance with
flash attention by 10-30% (with gpt-oss). This also makes performance
with flash attention better than without it, as expected.

05ccb17c

disable output_all (#11959) · f804e8a4
Michael Yang authored Aug 18, 2025

f804e8a4

18 Aug, 2025 8 commits

readme: add any-agent to community integrations (#11950) · 9cfbffaf
Kostis authored Aug 19, 2025

9cfbffaf
readme: add Andes to community integrations (#11952) · 470d5802
Ruslan Suleymanov authored Aug 19, 2025

470d5802
Merge pull request #11910 from ollama/drifkin/harmony-fn-names · b517bb1c
Devon Rifkin authored Aug 18, 2025
```
harmony: convert fn names to be valid ts identifiers
```
b517bb1c

llm: Check for nil memory data before printing · e3ade453

Jesse Gross authored Aug 18, 2025

We dump out our best memory estimate after we complete processing
for any reason, including errors. This is helpful for finding what
stopped us in error conditions but in some cases we might not
have gotten even the first result yet.

Fixes #11957

e3ade453

harmony: convert fn names to be valid ts identifiers · 048bd447

Devon Rifkin authored Aug 14, 2025

In <https://github.com/ollama/ollama/issues/11704#issuecomment-3177380197>
I noticed that hyphens in function names could possibly cause the model
to become confused. Later in that issue I found other explanations, but
at a minimum tool names with spaces in them are confusing to the model
because of the prompt format.

In this change I create a mapper that converts arbitrary tool names into
valid typescript identifiers. It's a little overly strict in that it
doesn't allow all unicode characters that might be valid in ts
identifiers, but it's still very permissive. Since mappings aren't
reversible, we must temporarily store this mapping in order to unmap it
if the model comes back with a call. We also handle the case where
multiple mappings collide into the same mapping and append a counter to
the end to make them unique

048bd447
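
The mapper described in 048bd447 could look roughly like the following. This is a sketch under stated assumptions rather than the commit's implementation: the `nameMap` type, the ASCII-only character set, the `_` replacement character, and the `_2`, `_3` suffix format are all illustrative choices, and TypeScript reserved words are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// nameMap keeps both directions so a tool call can be mapped back.
type nameMap struct {
	toIdent    map[string]string // original tool name -> identifier
	toOriginal map[string]string // identifier -> original tool name
}

func newNameMap() *nameMap {
	return &nameMap{toIdent: map[string]string{}, toOriginal: map[string]string{}}
}

// sanitize replaces anything outside [A-Za-z0-9_$] with '_' and prefixes a
// leading digit with '_'. It is deliberately stricter than TypeScript itself.
func sanitize(name string) string {
	var sb strings.Builder
	for i, r := range name {
		isLetter := (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || r == '_' || r == '$'
		isDigit := r >= '0' && r <= '9'
		switch {
		case isLetter:
			sb.WriteRune(r)
		case isDigit:
			if i == 0 {
				sb.WriteByte('_')
			}
			sb.WriteRune(r)
		default:
			sb.WriteByte('_')
		}
	}
	if sb.Len() == 0 {
		return "_"
	}
	return sb.String()
}

// ident returns a unique identifier for a tool name, appending a counter
// when two different names sanitize to the same identifier.
func (m *nameMap) ident(original string) string {
	if id, ok := m.toIdent[original]; ok {
		return id
	}
	base := sanitize(original)
	id := base
	for n := 2; ; n++ {
		if _, taken := m.toOriginal[id]; !taken {
			break
		}
		id = fmt.Sprintf("%s_%d", base, n)
	}
	m.toIdent[original] = id
	m.toOriginal[id] = original
	return id
}

// original maps an identifier from a model's tool call back to the tool name.
func (m *nameMap) original(id string) (string, bool) {
	name, ok := m.toOriginal[id]
	return name, ok
}

func main() {
	m := newNameMap()
	for _, tool := range []string{"get weather", "get-weather", "2fa.check"} {
		fmt.Printf("%q -> %s\n", tool, m.ident(tool))
	}
	name, _ := m.original("get_weather_2")
	fmt.Printf("get_weather_2 -> %q\n", name)
}
```

Because sanitizing loses information, the reverse lookup table is the only way to recover the original tool name when the model calls the mapped identifier.
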

Merge pull request #11875 from ollama/drifkin/print-template · ec8bf5e6
Devon Rifkin authored Aug 18, 2025
```
server: add debug option for printing out prompt instead of calling model
```
ec8bf5e6
readme: add any-llm to community integrations (#11956) · 709bbb0b
Kostis authored Aug 18, 2025

709bbb0b
readme: add Serene Pub to community integrations (#11946) · abeec240
Jody Doolittle authored Aug 18, 2025

abeec240

15 Aug, 2025 7 commits

gpt-oss: disable quantized kv cache (#11929) · df335aac
Michael Yang authored Aug 15, 2025

df335aac
cli: show the default context length env setting in online help (#11928) · 026bc292
Patrick Devine authored Aug 15, 2025

026bc292
docs: added missing comma in 'Ollama's Javascript library'' (#11915) · 883d0312
Thomas Pelster authored Aug 15, 2025

883d0312

handle cgo flags in docker build (#11909) · 5271ff85

Daniel Hiltgen authored Aug 15, 2025

Docker build requires build-args to be defined.  This ensures the release.yaml settings will be used.

5271ff85

test: improve scheduler/concurrency stress tests (#11906) · d6f7233a

Daniel Hiltgen authored Aug 15, 2025

* test: improve scheduler/concurrency stress tests

The scheduler test used to use approximate memory figures and would often
over or under shoot a system's capacity leading to flaky test results.
This should improve the reliability of this scenario by leveraging
ps output to determine exactly how many models it takes to
trigger thrashing.

The concurrency test is also refined to target num_parallel + 1 and handle
timeouts better.

With these refinements, TestMultiModelConcurrency was redundant

* test: add parallel generate with history

TestGenerateWithHistory will help verify caching and context
are properly handled while making requests

* test: focus embed tests on embedding models

remove non-embedding models from the embedding tests

d6f7233a

server: add debug option for printing out prompt instead of calling model · 8de1da47
Devon Rifkin authored Aug 15, 2025

8de1da47
Revert "cuda: leverage JIT for smaller footprint (#11635)" (#11913) · d925b535
Daniel Hiltgen authored Aug 14, 2025
```
This reverts commit dc5a6454.
```
d925b535

14 Aug, 2025 2 commits

fix arm linux build when HWCAP2_SVE2 undefined (#11908) · 6eaf194b
Daniel Hiltgen authored Aug 14, 2025

6eaf194b

llm: New memory management · d5a0d8d9

Jesse Gross authored May 29, 2025

This changes the memory allocation strategy from upfront estimation to
tracking actual allocations done by the engine and reacting to that. The
goal is to avoid issues caused by both under-estimation (crashing) and
over-estimation (low performance due to under-utilized GPUs).

It is currently opt-in and can be enabled for models running on the
Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
cases is unchanged and will continue to use the existing estimates.

d5a0d8d9