1. 02 Apr, 2025 (1 commit)
  2. 01 Apr, 2025 (1 commit)
  3. 26 Mar, 2025 (5 commits)
    • e5d84fb9
      molbal authored
    • docs: add ollamb to community projects · dd66712e
      Hengky Steen authored
    • ggml: Support heterogeneous KV cache layer sizes in memory estimation · f66216e3
      Jesse Gross authored
      Gemma3 uses sliding windows for its context on 5/6 layers, significantly
      reducing memory usage but leading to uneven usage across layers,
      which makes allocation to the correct GPU difficult. We currently
      estimate very conservatively by assuming all layers are consistent
      at the max size.
      
      Llama3.2-vision is also inconsistent between self attention and cross
      attention layers - at the moment, we calculate the correct total size
      and then average it across layers. In some cases, this may lead
      to crashes if a large layer is placed on a GPU sized by the average.

      This change allows memory estimation to calculate per-layer KV cache
      sizes and take them into account when placing layers onto GPUs. We
      already do this for weights that vary per-tensor, so this is a
      logical extension.
      
      Fixes #9730
      Fixes #9890
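      A rough sketch of the idea in Go (all names here are illustrative, not
      Ollama's actual types or functions): estimate the KV cache size of each
      layer individually, then pack layers onto GPUs using those per-layer
      figures instead of a single worst-case or averaged size.

      ```go
      package main

      import "fmt"

      // layerKVBytes estimates the KV cache cost of each layer. Sliding-window
      // layers only cache windowSize tokens; full-attention layers cache the
      // whole context. (Hypothetical helper, shown for illustration.)
      func layerKVBytes(numLayers int, ctxLen, windowSize, bytesPerToken uint64, isSlidingWindow func(i int) bool) []uint64 {
      	sizes := make([]uint64, numLayers)
      	for i := range sizes {
      		tokens := ctxLen
      		if isSlidingWindow(i) && windowSize < ctxLen {
      			tokens = windowSize
      		}
      		sizes[i] = tokens * bytesPerToken
      	}
      	return sizes
      }

      // placeLayers greedily assigns layers to GPUs using per-layer sizes, so a
      // large full-attention layer is never sized by an average across layers.
      func placeLayers(sizes []uint64, gpuFree []uint64) map[int][]int {
      	placement := map[int][]int{}
      	gpu := 0
      	for layer, need := range sizes {
      		for gpu < len(gpuFree) && gpuFree[gpu] < need {
      			gpu++ // this GPU is full, try the next one
      		}
      		if gpu == len(gpuFree) {
      			placement[-1] = append(placement[-1], layer) // -1: fall back to CPU
      			continue
      		}
      		gpuFree[gpu] -= need
      		placement[gpu] = append(placement[gpu], layer)
      	}
      	return placement
      }

      func main() {
      	// Gemma3-like pattern: 5 of every 6 layers use a sliding window.
      	sizes := layerKVBytes(12, 8192, 1024, 64*1024, func(i int) bool { return i%6 != 5 })
      	fmt.Println(placeLayers(sizes, []uint64{1 << 30, 1 << 30}))
      }
      ```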
    • llm: Fix debug logging for memory estimates · f4f0992b
      Jesse Gross authored
    • kvcache: Sliding window cache only needs a single batch total · 1feff619
      Jesse Gross authored
      When computing the size of the cache for sliding window attention,
      we don't need to multiply the batch size by the number of parallel
      sequences - the batch size is constant.

      This also simplifies the check for whether to allocate the cache
      size based on capacity or window size, as the batch size is already
      incorporated into the capacity when handled by the runner.
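      The sizing rule this describes can be sketched as follows (hypothetical
      names; a minimal illustration, not the runner's actual code): the
      sliding-window cache needs the window plus a single batch of new tokens,
      and that total does not grow with the number of parallel sequences,
      while the full-capacity path already has the batch folded in.

      ```go
      package main

      import "fmt"

      // slidingWindowCacheSlots returns how many token slots a sliding-window
      // KV cache needs. Hypothetical helper, shown only to illustrate the rule.
      func slidingWindowCacheSlots(capacity, windowSize, batchSize, numParallel int) int {
      	_ = numParallel // the batch is shared, so one batch total is enough
      	need := windowSize + batchSize
      	if need > capacity {
      		// capacity already has the batch size folded in by the runner,
      		// so it is always a sufficient upper bound.
      		return capacity
      	}
      	return need
      }

      func main() {
      	// 1024-token window, 512-token batch, 4 parallel sequences:
      	// the result does not change with the parallel count.
      	fmt.Println(slidingWindowCacheSlots(8192, 1024, 512, 4)) // 1536
      }
      ```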
  4. 25 Mar, 2025 (1 commit)
  5. 24 Mar, 2025 (1 commit)
  6. 21 Mar, 2025 (12 commits)
  7. 20 Mar, 2025 (6 commits)
  8. 19 Mar, 2025 (2 commits)
  9. 18 Mar, 2025 (2 commits)
  10. 17 Mar, 2025 (9 commits)