- 21 Mar, 2025 10 commits
-
-
Michael Yang authored
-
Bruce MacDonald authored
-
Blake Mizerany authored
-
Parth Sareen authored
This reverts commit ffbfe833.
-
Parth Sareen authored
-
Jesse Gross authored
Currently sliding window attention allocates and uses the full context size and just masks out any tokens that are outside of the window. However, we really only need (roughly) the sliding window size.

At large context sizes this improves two things:
- Memory allocated - since the full context size is allocated up front, memory requirements drop substantially. On Gemma3:4b with a 32k context window, total memory usage (including weights and non-sliding layers) drops from ~20GB to ~8GB.
- Computation - ranges that are completely outside of the sliding window are now removed from the tensors that are returned from the cache rather than simply being masked out. This results in more efficient processing, scaling with the size of the context that has actually been used.

Notably, this does not update the scheduler for any model to be aware of the smaller memory requirements. This is difficult for Gemma3 because the layers are heterogeneous between sliding and non-sliding attention. As a result, while actual memory consumption will be reduced, the scheduler will over-estimate the requirements of the model. This means that splitting between GPUs or between GPUs and CPUs will still be suboptimal.

Bug #9730
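As a rough illustration of the sizing math above (a minimal sketch with hypothetical names, not the cache implementation itself), a sliding-window layer only needs about windowSize + batchSize cells, while a full-attention layer still needs the whole context:

```go
package main

import "fmt"

// cacheCells is a hypothetical sketch of the sizing idea above: a
// sliding-window layer only needs roughly windowSize+batchSize cells,
// while a full-attention layer still needs the whole context.
func cacheCells(contextLen, windowSize, batchSize int, sliding bool) int {
	if sliding && windowSize+batchSize < contextLen {
		return windowSize + batchSize
	}
	return contextLen
}

func main() {
	// Assumed, illustrative numbers only: 32k context, 1k sliding window.
	fmt.Println("full-attention layer cells:", cacheCells(32768, 1024, 512, false))
	fmt.Println("sliding-window layer cells:", cacheCells(32768, 1024, 512, true))
}
```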
-
Jesse Gross authored
Currently the runner computes the KV cache size needed and creates a cache of that size: the context size times the number of parallel sequences. Cache implementations can make better decisions about their memory usage, so instead pass in the required capacity, number of sequences, and maximum batch size. For now, the causal cache just uses this to compute the size in the same way as before.
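A minimal sketch of that shift in responsibility, assuming hypothetical names and signatures rather than the repo's actual API:

```go
package main

import "fmt"

// Cache is a hypothetical interface sketching the change described above:
// the runner hands over the raw requirements and each cache implementation
// decides how much memory it actually needs.
type Cache interface {
	Init(capacity, numSequences, maxBatchSize int)
}

// causalCache keeps the old behavior: total cells = capacity (context size)
// times the number of parallel sequences.
type causalCache struct {
	cells int
}

func (c *causalCache) Init(capacity, numSequences, maxBatchSize int) {
	c.cells = capacity * numSequences
}

func main() {
	var cache Cache = &causalCache{}
	cache.Init(8192, 4, 512) // assumed example values
	fmt.Println("cells:", cache.(*causalCache).cells)
}
```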
-
Patrick Devine authored
-
Jesse Gross authored
This enables the runner to report progress back to the Ollama server, both for showing status to the user and also to prevent the server from killing the runner if it thinks things have stalled. Most of the infrastructure was already there; this extends it to be available to the backends.
-
Jesse Gross authored
Defragging the KV cache can generate a lot of operations, so we need to be careful that we don't overflow the number that the graph can support. We currently account for all of the nodes that we add to the graph for each move, but we also need to include the original cache tensors. Fixes #9904
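A hedged sketch of that bookkeeping (the constants and names below are made up for illustration; the real limits live in the graph code): reserve room for the existing cache tensors before dividing the remaining node budget among moves.

```go
package main

import "fmt"

// Illustrative constants only; the real limits live in the GGML graph code.
const (
	maxGraphNodes = 8192
	nodesPerMove  = 6 // assumed number of ops added to the graph per cache move
)

// movesThatFit returns how many defrag moves can be scheduled without
// overflowing the graph, reserving room for the original cache tensors
// (e.g. the K and V cache tensors for each layer) that also count against
// the node limit.
func movesThatFit(layers int) int {
	reserved := 2 * layers // original K/V cache tensors per layer
	budget := maxGraphNodes - reserved
	if budget < 0 {
		return 0
	}
	return budget / nodesPerMove
}

func main() {
	fmt.Println("moves per defrag pass:", movesThatFit(32))
}
```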
-
- 20 Mar, 2025 6 commits
-
-
Jesse Gross authored
Rather than directly giving the input data to models, we can pass a tensor instead. In the short term, this saves some duplicated code.

Longer term, we will want to overlap setting up the next batch with processing of the current one. In this case, we will only have the shape of the tensor, but it will not be loaded with data at the time of graph generation. By passing only a tensor to models now, we set up this possibility and prevent them from relying on data that they won't have in the future.

Although the same could be done for Positions and Outputs, in some cases we either need the raw input data or don't use them at all. Therefore, for now we leave them as they are and allow models to convert them to tensors as needed.
-
Jesse Gross authored
Options is no longer very descriptive of this struct.
-
rylativity authored
Updates parser/parser.go to allow arbitrary roles in Modelfile MESSAGE blocks.
-
Parth Sareen authored
-
Patrick Devine authored
This change allows the gemma3 template to be autodetected during `ollama create`.
-
Jesse Gross authored
Looks like a merge conflict that broke the model.
-
- 19 Mar, 2025 2 commits
-
-
Blake Mizerany authored
If the chunksums response is missing a chunk, the client should fail the download. This changes the client to check that all bytes are accounted for in the chunksums response. It is possible there are overlaps or gaps in the chunksums response and so the size is not the only thing left to check, but this provides enough coverage for now. We may want to check that chunks are contiguous later.
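A rough sketch of the completeness check described above (the chunk type is invented for illustration): sum the sizes of the returned chunks and fail the download if they don't account for every byte of the blob.

```go
package main

import (
	"errors"
	"fmt"
)

// chunk is a hypothetical stand-in for one entry in a chunksums response.
type chunk struct {
	start, end int64 // inclusive byte range
}

// checkComplete verifies that the chunks account for every byte of the blob.
// As noted in the commit message, this does not detect overlaps or gaps that
// happen to sum to the right size; it is a coverage check on total bytes only.
func checkComplete(chunks []chunk, blobSize int64) error {
	var total int64
	for _, c := range chunks {
		total += c.end - c.start + 1
	}
	if total != blobSize {
		return errors.New("chunksums response does not cover the full blob")
	}
	return nil
}

func main() {
	chunks := []chunk{{0, 1023}, {1024, 2047}}
	fmt.Println(checkComplete(chunks, 2048)) // <nil>
	fmt.Println(checkComplete(chunks, 4096)) // error: does not cover the full blob
}
```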
-
Jeffrey Morgan authored
-
- 18 Mar, 2025 2 commits
-
-
Bruce MacDonald authored
When converting a ggml model, a failure to read tensor data was returning a nil error value. The actual error from the read should be returned instead.
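A minimal illustration of that pattern (a hypothetical helper, not the converter's actual code): wrap and return the read error rather than dropping it.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// readTensor is a hypothetical illustration of the fix: the read error is
// wrapped and returned instead of a nil error value being returned on failure.
func readTensor(r io.Reader, buf []byte) error {
	if _, err := io.ReadFull(r, buf); err != nil {
		return fmt.Errorf("reading tensor data: %w", err)
	}
	return nil
}

func main() {
	// Short reader to trigger the failure path.
	err := readTensor(bytes.NewReader([]byte{1, 2}), make([]byte, 8))
	fmt.Println(err) // reading tensor data: unexpected EOF
}
```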
-
Bruce MacDonald authored
When a model's architecture cannot be converted return the name of the unsupported arch in the error message.
-
- 17 Mar, 2025 9 commits
-
-
Michael Yang authored
conditionally enable parallel pipelines
-
Jesse Gross authored
Models can specify that a group of inputs needs to be handled as a single batch. However, context shifting didn't respect this and could trigger a break anyway. In this case, we should instead trigger a context shift earlier so that it occurs before the grouped batch.

Note that there are still some corner cases:
- A long prompt that exceeds the context window can get truncated in the middle of an image. With the current models, this will result in the model not recognizing the image at all, which is pretty much the expected result with truncation.
- The context window is set less than the minimum batch size. The only solution to this is to refuse to load the model with these settings. However, this can never occur with current models and default settings.

Since users are unlikely to run into these scenarios, fixing them is left as a follow-up.
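A simplified sketch of the decision described above (names and parameters are illustrative, not the runner's API): before accepting a grouped run of inputs, check whether the whole group still fits in the context and shift early if it does not.

```go
package main

import "fmt"

// needsShiftBefore reports whether a context shift should happen before a
// grouped run of inputs, instead of letting the cache overflow mid-group.
// All names and parameters here are illustrative.
func needsShiftBefore(usedTokens, groupLen, contextLen int) bool {
	return usedTokens+groupLen > contextLen
}

func main() {
	// Assumed numbers: 4096-token context, 3900 tokens used, a 300-token
	// image group arriving next; shift before the group, not inside it.
	fmt.Println(needsShiftBefore(3900, 300, 4096)) // true
}
```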
-
Bruce MacDonald authored
We do not need to bypass prompt caching in the ollama runner yet, since only embedding models needed to bypass it. When embedding models are implemented, they can skip initializing this cache completely.
-
Jeffrey Morgan authored
-
Parth Sareen authored
Updated minP to use early exit, making use of sorted tokens.
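A generic sketch of the early-exit idea (not the repo's sampler code; the token type here is invented): once tokens are sorted by probability in descending order, filtering can stop at the first token that falls below minP times the maximum probability.

```go
package main

import "fmt"

// token is a hypothetical (id, probability) pair.
type token struct {
	id   int
	prob float32
}

// minPFilter assumes toks is sorted by prob in descending order. Because of
// the sort, the first token below the threshold means every later token is
// also below it, so we can exit early instead of scanning the whole slice.
func minPFilter(toks []token, minP float32) []token {
	if len(toks) == 0 {
		return toks
	}
	threshold := minP * toks[0].prob // max probability is the first element
	for i, t := range toks {
		if t.prob < threshold {
			return toks[:i]
		}
	}
	return toks
}

func main() {
	toks := []token{{1, 0.5}, {2, 0.3}, {3, 0.02}, {4, 0.01}}
	fmt.Println(minPFilter(toks, 0.1)) // keeps tokens with prob >= 0.05 (minP * 0.5)
}
```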
-
Michael Yang authored
-
Daniel Hiltgen authored
-
Louis Beaumont authored
-
zeo authored
-
- 15 Mar, 2025 3 commits
-
-
Patrick Devine authored
This fixes the case where a FROM line in a previous Modelfile points to a file which may or may not be present in a different ollama instance. We shouldn't rely on the filename; instead, check whether the FROM line is a valid model name and point to that.
-
Blake Mizerany authored
This sets the agent header in DefaultRegistry to include the version of the client, OS, and architecture in the previous format, with a minor twist.

Note: The version is obtained from the build info instead of the version in version.Version, which should no longer be necessary but can be removed in a future commit. Using the build info is more accurate and also provides extra build information if the build is not tagged and if it is "dirty". Previously, the version was just "0.0.0" with no other helpful information. The ollama.com registry and others handle this swimmingly.
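A hedged sketch of pulling the version from the Go build info for a user agent string (the header format below is illustrative, and the untagged/"dirty" details the commit mentions come from VCS build settings that this sketch does not inspect):

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// userAgent assembles a client identifier from the Go build info plus OS and
// architecture. The exact header format used by the registry client is not
// reproduced here; this only shows where the version can come from.
func userAgent() string {
	version := "0.0.0"
	if info, ok := debug.ReadBuildInfo(); ok && info.Main.Version != "" {
		version = info.Main.Version // "(devel)" or a pseudo-version when untagged
	}
	return fmt.Sprintf("ollama/%s (%s %s) Go/%s",
		version, runtime.GOARCH, runtime.GOOS, runtime.Version())
}

func main() {
	fmt.Println(userAgent())
}
```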
-
Patrick Devine authored
-
- 14 Mar, 2025 7 commits
-
-
Daniel Hiltgen authored
Darwin was using a different pattern for the version string than Linux or Windows.
-
Jesse Gross authored
Previously, processing multiple images in a batch would trigger segfaults, so sending images together was disabled as a way to mitigate this. The trigger was processing one image on the CPU and one on the GPU. This can no longer happen:
- The vision encoder is now on the GPU, so both images would be processed on the GPU.
- We require images to be fully contained in a batch, and each image including its special tokens is over half the batch size. As a result, we will never get two images in the same batch.

Fixes #9731
-
Jesse Gross authored
Currently there is a single context per sequence, shared by all multimodal inputs. Since we build a vision encoder graph per image, with a large number of inputs we can eventually hit the maximum number of graph nodes per context. This changes to using a separate context for each image, ensuring that available resource limits are consistent.
-
Jesse Gross authored
Models may require that a set of inputs all be processed as part of the same batch. For example, if an image has multiple patches with fully connected attention between them, we should not split the batch in the middle of an image. Fixes #9697
-
Bruce MacDonald authored
This commit refactors the LLM subsystem by removing internal subprocess request and response types. It consolidates duplicate type definitions across the codebase, moving them to centralized locations. The change also standardizes interfaces between components, simplifies the ServerStatusResp struct, and moves the ParseDurationMs function to a common package. This cleanup reduces code duplication between different runner implementations (llamarunner and ollamarunner).
-
Blake Mizerany authored
-
Blake Mizerany authored
Replace large-chunk blob downloads with parallel small-chunk verification to solve timeout and performance issues. Registry users experienced progressively slowing download speeds as large-chunk transfers aged, often timing out completely.

The previous approach downloaded blobs in a few large chunks but required a separate, single-threaded pass to read the entire blob back from disk for verification after download completion. This change uses the new chunksums API to fetch many smaller chunk+digest pairs, allowing concurrent downloads and immediate verification as each chunk arrives. Chunks are written directly to their final positions, eliminating the entire separate verification pass.

The result is more reliable downloads that maintain speed throughout the transfer process and significantly faster overall completion, especially over unstable connections or with large blobs.
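A condensed sketch of that download pattern (the chunk type, digests, and file handling here are invented for illustration; real chunks would be fetched over HTTP from the chunksums endpoints): verify each chunk as it arrives and write it directly at its final offset, so no separate verification pass over the whole blob is needed.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"sync"
)

// chunk is a hypothetical chunk+digest pair as returned by a chunksums-style API.
// In a real client the data would be fetched over HTTP rather than held in memory.
type chunk struct {
	offset int64
	data   []byte
	digest string // hex-encoded sha256 of the chunk
}

// downloadBlob verifies each chunk as it arrives and writes it directly to its
// final position in the destination file, so there is no separate whole-blob
// verification pass afterwards.
func downloadBlob(path string, chunks []chunk) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	var wg sync.WaitGroup
	errs := make(chan error, len(chunks))
	for _, c := range chunks {
		wg.Add(1)
		go func(c chunk) {
			defer wg.Done()
			sum := sha256.Sum256(c.data)
			if hex.EncodeToString(sum[:]) != c.digest {
				errs <- fmt.Errorf("chunk at offset %d: digest mismatch", c.offset)
				return
			}
			// WriteAt on *os.File is safe for concurrent use at distinct offsets.
			if _, err := f.WriteAt(c.data, c.offset); err != nil {
				errs <- err
			}
		}(c)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		return err // report the first failure, if any
	}
	return nil
}

func main() {
	data := []byte("hello")
	sum := sha256.Sum256(data)
	err := downloadBlob("blob.bin", []chunk{{0, data, hex.EncodeToString(sum[:])}})
	fmt.Println("download error:", err)
}
```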
-
- 13 Mar, 2025 1 commit
-
-
Michael Yang authored
count gemma3 vision tensors
-