Commits · fbe6ae285a23baddb14c5bbce26d4fcb837503e4 · OpenDAS / ollama

22 May, 2025 1 commit

server: improve tensor quantization fallback logic (#10806) · fbe6ae28

Bruce MacDonald authored May 22, 2025

Fall back to alternative quantization types when a tensor's dimensions aren't divisible by the block size required for the original desired quantization type. If retried quantization types fail, the system ultimately falls back to F16 (half-precision floating point) which has a block size of 1 and can handle any tensor dimension.
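
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `quantType`, `pickQuantType`, and the candidate list are hypothetical names, and the block sizes are only example values.

```go
package main

import "fmt"

// quantType is a hypothetical description of a quantization format.
type quantType struct {
	name      string
	blockSize uint64
}

// pickQuantType returns the first candidate whose block size evenly divides
// the tensor's row length. If none fits, it falls back to F16, whose block
// size of 1 divides any dimension.
func pickQuantType(rowLen uint64, candidates []quantType) quantType {
	for _, c := range candidates {
		if c.blockSize > 0 && rowLen%c.blockSize == 0 {
			return c
		}
	}
	return quantType{name: "F16", blockSize: 1}
}

func main() {
	// Desired type first, then the alternatives to retry (assumed values).
	candidates := []quantType{{"Q4_K", 256}, {"Q4_0", 32}}
	for _, rowLen := range []uint64{4096, 96, 100} {
		fmt.Println(rowLen, pickQuantType(rowLen, candidates).name)
	}
}
```

With these assumed candidates, a row length of 4096 keeps the desired type, 96 takes the first alternative, and 100 ends up as F16.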

fbe6ae28

21 May, 2025 1 commit

remove support for multiple ggufs in a single file (#10722) · 61aeaf7e

Michael Yang authored May 21, 2025

* remove support for multiple ggufs in a single file

this was an attempt to make it easier to import multimodal models into
ollama. this was rarely used and error prone so remove it

* fix: create fused model from blob

61aeaf7e

19 May, 2025 2 commits

avoid kv truncation during create (#10761) · 1a0cfd08
Daniel Hiltgen authored May 19, 2025

1a0cfd08

ggml: Separate tensor load from backend creation · 94ab428e

Jesse Gross authored Apr 17, 2025

Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them to be two
steps:
 - Create backend, including enumerating tensors and memory allocation
 - Loading tensor data

This allows more flexibility in managing model loading.
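
The two-step shape described above might look something like this. It is only a sketch under assumed names (`tensorInfo`, `backend`, `newBackend`, `load` are hypothetical and not the ggml package's API); it shows creation allocating buffers and a separate, caller-controlled load filling them.

```go
package main

import "fmt"

// tensorInfo is a hypothetical description of one tensor in a model file.
type tensorInfo struct {
	name string
	size int
}

// backend is a hypothetical stand-in for a model backend.
type backend struct {
	buffers map[string][]byte
	loaded  bool
}

// newBackend is the cheap step: enumerate tensors and allocate their memory.
func newBackend(infos []tensorInfo) *backend {
	b := &backend{buffers: make(map[string][]byte, len(infos))}
	for _, ti := range infos {
		b.buffers[ti.name] = make([]byte, ti.size)
	}
	return b
}

// load is the slow step: fill every allocated buffer with tensor data.
func (b *backend) load(read func(name string, dst []byte) error) error {
	for name, buf := range b.buffers {
		if err := read(name, buf); err != nil {
			return fmt.Errorf("load %s: %w", name, err)
		}
	}
	b.loaded = true
	return nil
}

func main() {
	b := newBackend([]tensorInfo{{"token_embd.weight", 8}, {"output.weight", 4}})
	fmt.Println("allocated:", len(b.buffers), "loaded:", b.loaded)

	// The caller decides when (and whether) to pay for the load.
	err := b.load(func(name string, dst []byte) error {
		for i := range dst {
			dst[i] = 1 // placeholder for reading from the model file
		}
		return nil
	})
	fmt.Println("loaded:", b.loaded, "err:", err)
}
```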

94ab428e

14 May, 2025 2 commits
- fix crash in old clients with quantization progress (#10710) · ff80718e
  Daniel Hiltgen authored May 14, 2025
```
Older clients assumed the digest was at least 19 characters long so increase the size
of the dummy digest to avoid array out of bounds crashes.
```
  ff80718e
- chore: update mllama to use ollama engine (#10637) · 23125648
  Michael Yang authored May 13, 2025
  
  23125648
13 May, 2025 1 commit
- server: add webp image input support (#10653) · c7f4ae7b
  Jeffrey Morgan authored May 12, 2025
  
  c7f4ae7b
12 May, 2025 3 commits

Follow up to #10363 (#10647) · 9d6df908

Daniel Hiltgen authored May 12, 2025

The quantization PR didn't block all unsupported file types,
which this PR fixes.  It also updates the API docs to reflect
the now reduced set of supported types.

9d6df908

convert: quantize from safetensors needs kv (#10675) · ad035ad5

Bruce MacDonald authored May 12, 2025

When creating a quantized model from safetensors we
need the array KV values to be loaded. Changing this
value to -1 loads the KV values on the returned
layer to be used and saved during quantization.

ad035ad5

feat: add trace log level (#10650) · f95a1f2b
Michael Yang authored May 12, 2025
```
reduce prompt log to trace level
```
f95a1f2b

08 May, 2025 2 commits

fix: stream accumulator exits early (#10593) · 0d6e35d3

Michael Yang authored May 08, 2025

the stream accumulator exits as soon as it sees `api.ProgressResponse(status="success")` which isn't strictly correct
since some requests may have multiple successes, e.g. `/api/create` when the source model needs to be pulled.
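
One way to picture the corrected behaviour is to drain the stream until it closes instead of returning at the first success. The sketch below assumes a hypothetical `progress` type and `accumulate` helper; it is not the accumulator from the commit.

```go
package main

import "fmt"

// progress is a hypothetical stand-in for a streamed progress response.
type progress struct{ Status string }

// accumulate reads until the stream is closed, so a request that reports
// "success" more than once (for example a pull followed by a create) is
// consumed in full.
func accumulate(stream <-chan progress) (successes int, last string) {
	for p := range stream {
		if p.Status == "success" {
			successes++
		}
		last = p.Status
	}
	return successes, last
}

func main() {
	ch := make(chan progress, 4)
	for _, s := range []string{"pulling manifest", "success", "creating model", "success"} {
		ch <- progress{Status: s}
	}
	close(ch)

	n, last := accumulate(ch)
	fmt.Println(n, last)
}
```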

0d6e35d3

lint: enable usetesting, disable tenv (#10594) · 6e9a7a25
Michael Yang authored May 08, 2025

6e9a7a25

07 May, 2025 2 commits

sched: fix race leading to orphaned runners (#10599) · 5e380c3b

Daniel Hiltgen authored May 07, 2025

If a model is loading, and the request context is canceled during the load
by a client closing the connection, and another request is inbound for the
same model with a different configuration (context size, etc.) thus requiring
a reload, two unload events can be in flight. The first shuts down the
original model load, but the second one caused the loss of the new
reloading runner reference, thus triggering the leak.

The primary fix is detecting the duplicate unload and ignoring the second
instance. The load routine is also hardened to ensure we detect
clobbering an already present runner and unload it with a warning.
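
A rough sketch of the two safeguards, under heavy assumptions: `scheduler`, `runner`, `load`, and `unload` here are simplified hypothetical types, and identifying a duplicate unload by comparing runner pointers is just one plausible way to do it, not necessarily what the commit does.

```go
package main

import (
	"fmt"
	"log"
	"sync"
)

// runner is a hypothetical stand-in for a model runner process.
type runner struct {
	id      int
	stopped bool
}

// scheduler tracks at most one runner per model name.
type scheduler struct {
	mu     sync.Mutex
	loaded map[string]*runner
}

// load registers r. If another runner is already present it is stopped with
// a warning instead of being silently overwritten and leaked.
func (s *scheduler) load(model string, r *runner) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if old, ok := s.loaded[model]; ok && old != r {
		log.Printf("warning: replacing runner %d for %s; unloading it", old.id, model)
		old.stopped = true
	}
	s.loaded[model] = r
}

// unload removes r only if it is still the registered runner. A second
// unload event for an already removed runner is ignored.
func (s *scheduler) unload(model string, r *runner) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.loaded[model]; !ok || cur != r {
		return false
	}
	delete(s.loaded, model)
	r.stopped = true
	return true
}

func main() {
	s := &scheduler{loaded: map[string]*runner{}}
	first, second := &runner{id: 1}, &runner{id: 2}

	s.load("m", first)
	fmt.Println("first unload handled:", s.unload("m", first))
	s.load("m", second) // reload with a different configuration
	fmt.Println("duplicate unload handled:", s.unload("m", first))
	fmt.Println("new runner kept:", s.loaded["m"] == second && !second.stopped)
}
```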

5e380c3b

api: remove unused RetrieveModelResponse type (#10603) · 392de840
Jeffrey Morgan authored May 06, 2025

392de840

06 May, 2025 3 commits
- server: send 405 instead of 404 for unallowed methods (#10275) · 4090aca9
  Devon Rifkin authored May 06, 2025
```
Fixes: #5483
```
  4090aca9
- server: remove internal cmd (#10595) · 92ce438d
  Michael Yang authored May 06, 2025
  
  92ce438d
- Move quantization to new backend (#10363) · 42481045
  Daniel Hiltgen authored May 06, 2025
```
* Move quantization logic to GGML via new backend

This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
```
  42481045
05 May, 2025 1 commit
- server: fix panic when runner.Options is nil (#10566) · 1703d147
  Jeffrey Morgan authored May 05, 2025
  
  1703d147
03 May, 2025 1 commit

sched: logging improvements (#10550) · 76ea735a

Daniel Hiltgen authored May 03, 2025

This enhances our logging in the scheduler. The initial "waiting for server" log
no longer claims an initial error state (now "not responding" which better reflects
the actual state). Runners now have slog wiring to report more details about the
runner, including PID.

76ea735a

01 May, 2025 1 commit
- image: add vision capability for projector-based models (#10509) · e6d2d041
  frob authored May 02, 2025
```
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
```
  e6d2d041
30 Apr, 2025 2 commits

strip out thinking tags in message history for qwen3 & r1 (#10490) · ad3c7c9b

Devon Rifkin authored Apr 30, 2025

* strip out thinking tags in message history for qwen3 & r1

This is in advance of "proper" support where we'll make reasoning
configurable and we'll parse out thinking/reasoning tags and provide
them to the caller. These models expect there to be no thinking tags in
the message history, so this should improve quality

* parse model names instead of hacky prefix check
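
Stripping could be done along these lines. The sketch assumes the models wrap their reasoning in `<think>...</think>` tags, which the commit message does not spell out; `thinkRe` and `stripThinking` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
)

// thinkRe matches one <think>...</think> span, including newlines inside it
// and any whitespace that follows. The tag name is an assumption.
var thinkRe = regexp.MustCompile(`(?s)<think>.*?</think>\s*`)

// stripThinking removes thinking spans from a message before it is placed
// back into the history sent to the model.
func stripThinking(content string) string {
	return thinkRe.ReplaceAllString(content, "")
}

func main() {
	msg := "<think>\nThe user greeted me.\n</think>\n\nHello! How can I help?"
	fmt.Printf("%q\n", stripThinking(msg))
}
```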

ad3c7c9b

Fix "Stopping..." scheduler hang (#10487) · 415c8fcc

Daniel Hiltgen authored Apr 30, 2025

* Adjust initial scheduler refCount

Ensure we only set the refCount on success

* sched: fix lock order inversion deadlock

Under certain race conditions, there was a scenario where the scheduler would
get into a deadlock while trying to update free space information while a model
was trying to unload.

415c8fcc

29 Apr, 2025 1 commit

lower default num parallel to 2 · fe5b9bb2

Devon Rifkin authored Apr 29, 2025

this is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral though, because even though the old 4x2k limit and the new 2x4k limit are memory equivalent, the 1x fallback is larger with 4k
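
The arithmetic behind that remark, as a small sketch. `kvTokens` is a hypothetical helper, and treating memory as proportional to the total number of context tokens is a simplifying assumption.

```go
package main

import "fmt"

// kvTokens returns the total number of context tokens reserved across all
// parallel slots.
func kvTokens(numParallel, contextLength int) int {
	return numParallel * contextLength
}

func main() {
	// Old and new defaults reserve the same total.
	fmt.Println(kvTokens(4, 2048), kvTokens(2, 4096)) // 8192 8192
	// The single-slot fallback is twice as large with a 4k context.
	fmt.Println(kvTokens(1, 2048), kvTokens(1, 4096)) // 2048 4096
}
```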

fe5b9bb2

28 Apr, 2025 1 commit
- Revert "increase default context length to 4096 (#10364)" · dd93e1af
  Devon Rifkin authored Apr 28, 2025
```
This reverts commit 424f6486.
```
  dd93e1af
25 Apr, 2025 2 commits

explicitly decode maxarraysize 1024 · 340448d2
Michael Yang authored Apr 25, 2025

340448d2

fix superfluous call to WriteHeader · 214a7678

Michael Yang authored Apr 24, 2025

the first call to http.ResponseWriter.Write implicitly calls WriteHeader
with http.StatusOK if it hasn't already been called. once WriteHeader
has been called, subsequent calls have no effect. Write is called when
JSON encoding progressUpdateJSON{}. calls to
http.ResponseWriter.WriteHeader after the first encode are useless and
produce a warning:

http: superfluous response.WriteHeader call from github.com/ollama/ollama/server/internal/registry.(*statusCodeRecorder).WriteHeader (server.go:77)
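
The implicit header behaviour can be reproduced with the standard library alone. In this sketch the handler and the `/api/pull` path are illustrative; `httptest.ResponseRecorder` ignores the late call silently, whereas a real `net/http` server logs the warning quoted above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// handler encodes a body first and only then tries to set an error status.
func handler(w http.ResponseWriter, r *http.Request) {
	// Encode calls w.Write, which implicitly sends a 200 status.
	json.NewEncoder(w).Encode(map[string]string{"status": "pulling"})
	// Too late: the status line has already been written.
	w.WriteHeader(http.StatusInternalServerError)
}

// recordedStatus runs the handler against a recorder and returns the status
// code the client would have seen.
func recordedStatus() int {
	rec := httptest.NewRecorder()
	handler(rec, httptest.NewRequest(http.MethodPost, "/api/pull", nil))
	return rec.Code
}

func main() {
	fmt.Println(recordedStatus())
}
```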

214a7678

22 Apr, 2025 1 commit

increase default context length to 4096 (#10364) · 424f6486

Devon Rifkin authored Apr 22, 2025

* increase default context length to 4096

We lower the default numParallel from 4 to 2 and use these "savings" to
double the default context length from 2048 to 4096.

We're memory neutral in cases when we previously would've used
numParallel == 4, but we add the following mitigation to handle some
cases where we would have previously fallen back to 1x2048 due to low
VRAM: we decide between 2048 and 4096 using a runtime check, choosing
2048 if we're on a one GPU system with total VRAM of <= 4 GB. We
purposefully don't check the available VRAM because we don't want the
context window size to change unexpectedly based on the available VRAM.

We plan on making the default even larger, but this is a relatively
low-risk change we can make to quickly double it.

* fix tests

add an explicit context length so they don't get truncated. The code
that converts -1 from being a signal for doing a runtime check isn't
running as part of these tests.

* tweak small gpu message

* clarify context length default

also make it actually show up in `ollama serve --help`
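
The runtime check from the first item can be sketched as below. `gpu` and `defaultContextLength` are hypothetical names, and reading "4 GB" as 4 GiB is an assumption.

```go
package main

import "fmt"

// gpu is a hypothetical description of one detected GPU.
type gpu struct {
	totalVRAM uint64 // bytes; total, deliberately not the currently free amount
}

// defaultContextLength picks 2048 on a single-GPU system with at most 4 GiB
// of total VRAM and 4096 otherwise.
func defaultContextLength(gpus []gpu) int {
	const lowVRAM = 4 << 30 // assumed threshold: 4 GiB
	if len(gpus) == 1 && gpus[0].totalVRAM <= lowVRAM {
		return 2048
	}
	return 4096
}

func main() {
	fmt.Println(defaultContextLength([]gpu{{totalVRAM: 4 << 30}}))
	fmt.Println(defaultContextLength([]gpu{{totalVRAM: 8 << 30}}))
	fmt.Println(defaultContextLength([]gpu{{totalVRAM: 4 << 30}, {totalVRAM: 4 << 30}}))
}
```

Because the check uses total rather than available VRAM, the result stays the same from one launch to the next on a given machine.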

424f6486

19 Apr, 2025 2 commits

create tempdir in models directory · 88738b35

Michael Yang authored Apr 18, 2025

the models directory should have plenty of storage and also ensure
there's no cross-device copy
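
A sketch of the idea using only the standard library. `stageAndCommit` and `demo` are hypothetical helpers; the point is that a temporary directory created inside the models directory normally shares its filesystem, so the final `os.Rename` does not cross devices.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// stageAndCommit writes data into a temp dir created inside modelsDir and
// then renames the file into place.
func stageAndCommit(modelsDir, name string, data []byte) (string, error) {
	tmp, err := os.MkdirTemp(modelsDir, "tmp-*")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(tmp)

	staged := filepath.Join(tmp, name)
	if err := os.WriteFile(staged, data, 0o644); err != nil {
		return "", err
	}
	final := filepath.Join(modelsDir, name)
	if err := os.Rename(staged, final); err != nil {
		return "", err
	}
	return final, nil
}

// demo uses a throwaway directory as a stand-in for the models directory.
func demo() (string, error) {
	modelsDir, err := os.MkdirTemp("", "models-*")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(modelsDir)

	final, err := stageAndCommit(modelsDir, "blob", []byte("weights"))
	if err != nil {
		return "", err
	}
	got, err := os.ReadFile(final)
	return string(got), err
}

func main() {
	fmt.Println(demo())
}
```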

88738b35

server/internal/registry: make pull send errors with Error field (#10326) · 4e535e61

Blake Mizerany authored Apr 18, 2025

Previously, the pull handler would send an error message in the Status
field, which prevented the client from using the message as a signal to
stop. In the case of the "run" command, it would follow the pull with a
"show" which would print a nearly identical "not found" message for
unresolved models.

Fixes #10307
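
From the client's side, a dedicated error field makes the stop condition easy to detect. The sketch below assumes a newline-delimited JSON stream and a hypothetical `update` type with `status` and `error` keys; it is not the registry package's actual wire format.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// update is a hypothetical streamed pull update.
type update struct {
	Status string `json:"status,omitempty"`
	Error  string `json:"error,omitempty"`
}

// consume reads updates until EOF and stops at the first one that carries
// an error, returning the statuses seen so far.
func consume(r io.Reader) ([]string, error) {
	var statuses []string
	dec := json.NewDecoder(r)
	for {
		var u update
		err := dec.Decode(&u)
		if errors.Is(err, io.EOF) {
			return statuses, nil
		}
		if err != nil {
			return statuses, err
		}
		if u.Error != "" {
			return statuses, errors.New(u.Error)
		}
		statuses = append(statuses, u.Status)
	}
}

// consumeString is a small convenience wrapper for examples and tests.
func consumeString(s string) ([]string, error) {
	return consume(strings.NewReader(s))
}

func main() {
	stream := `{"status":"pulling manifest"}` + "\n" + `{"error":"model not found"}` + "\n"
	statuses, err := consumeString(stream)
	fmt.Println(statuses, err)
}
```

A caller such as a "run" command can then return on the error instead of issuing a follow-up request that would print a similar message.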

4e535e61

17 Apr, 2025 1 commit
- server/internal/client/ollama: handle some network errors gracefully (#10317) · 1d99451a
  Blake Mizerany authored Apr 17, 2025
  
  1d99451a
16 Apr, 2025 4 commits

server/internal/registry: remove superfluous progress bar flush (#10303) · 369de832

Blake Mizerany authored Apr 16, 2025

This removes the extra flushProgress() at the end of handlePull. It is
unnecessary because final progress updates are flushed in all cases of
the main select loop.

369de832

server/internal/client/ollama: cleanup use of multiple counters (#10304) · 3457a315

Blake Mizerany authored Apr 16, 2025

The completed and received counters must work in tandem and the code
should better reflect that. Previously, the act of updating them was 2-3
lines of code duplicated in multiple places. This consolidates them into
a single update closure for easy reading and maintenance.

This also simplifies error handling in places where we can use a return
parameter and defer to handle the error case for updates.

Also, remove the old Layer field from the trackingReader struct.
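
The pattern described above, a single update closure plus a named return value and `defer` for the error path, might look like this. Everything here is illustrative: `copyChunks`, its parameters, and the meaning given to the two counters (bytes present in total versus bytes received in this session) are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// copyChunks writes chunks and reports progress through report. cached is
// the number of bytes assumed to be present before this session started.
func copyChunks(cached int64, chunks [][]byte, write func([]byte) error,
	report func(completed, received int64, err error)) (err error) {

	completed, received := cached, int64(0)

	// update is the only place that touches the counters, so they always
	// move together.
	update := func(n int64, err error) {
		completed += n
		received += n
		report(completed, received, err)
	}

	// The named return value lets one deferred call report any failure.
	defer func() {
		if err != nil {
			update(0, err)
		}
	}()

	for _, c := range chunks {
		if err = write(c); err != nil {
			return err
		}
		update(int64(len(c)), nil)
	}
	return nil
}

// demoCounters fails on the third chunk and returns the last reported state.
func demoCounters() (completed, received int64, sawErr bool) {
	calls := 0
	write := func([]byte) error {
		calls++
		if calls == 3 {
			return errors.New("connection reset")
		}
		return nil
	}
	report := func(c, r int64, err error) {
		completed, received, sawErr = c, r, err != nil
	}
	_ = copyChunks(10, [][]byte{[]byte("ab"), []byte("cde"), []byte("f")}, write, report)
	return completed, received, sawErr
}

func main() {
	fmt.Println(demoCounters())
}
```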

3457a315

Give tests more time to run (#10306) · 56dc316a
Daniel Hiltgen authored Apr 16, 2025
```
Fix flake failures on windows
```
56dc316a

cmd: add retry/backoff (#10069) · 1e7f62cb

Blake Mizerany authored Apr 15, 2025

This commit adds retry/backoff to the registry client for pull requests.

Also, revert progress indication to match original client's until we can
"get it right."

Also, make WithTrace wrap existing traces instead of clobbering them.
This allows clients to compose traces.
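
A generic retry loop with exponential backoff, shown only to illustrate the technique. `retry` and `demoRetry` are hypothetical; the attempt count, the base delay, and the doubling schedule are assumptions rather than the values used by the registry client.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry calls fn up to attempts times, doubling the wait after each failure
// and giving up early if ctx is cancelled.
func retry(ctx context.Context, attempts int, base time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point waiting after the final attempt
		}
		select {
		case <-time.After(base << i): // base, 2*base, 4*base, ...
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

// demoRetry simulates an operation that fails twice before succeeding.
func demoRetry() (calls int, err error) {
	err = retry(context.Background(), 5, time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("temporary failure")
		}
		return nil
	})
	return calls, err
}

func main() {
	fmt.Println(demoRetry())
}
```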

1e7f62cb

14 Apr, 2025 1 commit
- server: add `OpenAI-Beta` header to CORS safelist · 97fe45e3
  Devon Rifkin authored Apr 14, 2025
```
alphabetized the compat list and then added a single header

fixes: #9801
```
  97fe45e3
10 Apr, 2025 1 commit
- types: include the 'items' and '$defs' fields to properly handle "array" types (#10091) · ef65174d
  Tom Sheffler authored Apr 09, 2025
```
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
```
  ef65174d
09 Apr, 2025 1 commit
- fix(scheduler): make model unload order deterministic (#10185) · 42ecb9f1
  Ire Gaddr authored Apr 09, 2025
  
  42ecb9f1
08 Apr, 2025 1 commit
- types: add any type and validation for ToolFunction enum (#10166) · 6747099d
  Parth Sareen authored Apr 08, 2025
  
  6747099d
07 Apr, 2025 1 commit
- types: allow tool function parameters with a single type or an array of types (#9434) · 2f723ac2
  Alex Rozgo authored Apr 07, 2025
  
  2f723ac2
03 Apr, 2025 1 commit

llm: set done reason at server level (#9830) · e53b3cbd

Bruce MacDonald authored Apr 03, 2025

No functional change. Many different done reasons can be set at the runner
level, so rather than obscuring them we should return them to the server
process and let it choose what to do with the done reason. This separates
the API concerns from the runner.

e53b3cbd