Commits · 26a26998fb24f1aaa1f0a95980050086d6cf64f0 · OpenDAS / ollama

10 Mar, 2025 10 commits
- Merge pull request #9590 from ollama/mxyng/dump-pad · 26a26998
  Michael Yang authored Mar 10, 2025
```
fix: pad tensor item if ge zero
```
  26a26998
- fix: pad tensor item if ge zero · 9926eae0
  Michael Yang authored Mar 07, 2025
```
this produces nicer output since both positive and negative values
produce the same width
```
  9926eae0
- docs: add opik to observability integrations (#9626) · 8585b7b1
  Vincent Koc authored Mar 11, 2025
  
  8585b7b1
- sample: add numerical stability to temperature/softmax transform (#9631) · 7e34f4fb
  Parth Sareen authored Mar 10, 2025
  
  7e34f4fb
- Merge pull request #9569 from dwt/patch-1 · fe776293
  Michael Yang authored Mar 10, 2025
```
Better WantedBy declaration
```
  fe776293
- docs: Add OLLAMA_CONTEXT_LENGTH to FAQ. (#9545) · d8a5d96b
  frob authored Mar 10, 2025
  
  d8a5d96b
- docs: add SwiftChat (#9540) · 757668c4
  Xiaowei Zhu authored Mar 10, 2025
  
  757668c4
- docs(tool): add mcp-llm (#9537) · 96ec8afd
  Sam authored Mar 11, 2025
  
  96ec8afd
- sample: temporarily use grammars for constrained generation in new engine (#9586) · e093db92
  Jeffrey Morgan authored Mar 10, 2025
  
  e093db92
- model: Update encoder cache to use multimodal input processing handler · a1cda80b
  Jesse Gross authored Mar 08, 2025
```
The encoder cache needs to know the position of images in the input
stream so that it knows when to delete them. Previously images didn't
have a position, so we implied one by breaking batches before an
image and then assuming the image was in the first position. However,
multimodal objects are now given explicit positions in the input
stream, so we can use that instead.

Breaking batches was also a way to simulate a cross attention mask
for mllama. However, given that it only supports a single sequence
and a single image, this mask doesn't serve any real purpose.
Removing the batch break does not appear to affect the quality of
the output.

Most of this is simply moving the input data structures to a new
package to avoid import cycles.
```
  a1cda80b
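
The title of 7e34f4fb above mentions numerical stability for the temperature/softmax transform but does not say how it is achieved. A common technique, shown here only as a hedged sketch and not as the code from that commit, is to subtract the largest scaled logit before exponentiating. The function name and the use of `float64` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// softmaxWithTemperature divides each logit by temperature and returns
// probabilities. Subtracting the maximum scaled logit before math.Exp keeps
// every exponent <= 0, so the exponential cannot overflow.
// Assumes temperature > 0 and len(logits) > 0.
func softmaxWithTemperature(logits []float64, temperature float64) []float64 {
	maxLogit := math.Inf(-1)
	for _, l := range logits {
		maxLogit = math.Max(maxLogit, l/temperature)
	}

	probs := make([]float64, len(logits))
	var sum float64
	for i, l := range logits {
		probs[i] = math.Exp(l/temperature - maxLogit)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

func main() {
	// A naive math.Exp(1000) overflows to +Inf; the shifted form stays finite.
	fmt.Println(softmaxWithTemperature([]float64{1000, 999, 998}, 1.0))
}
```
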
09 Mar, 2025 1 commit

ollamarunner: Don't panic for unimplemented features at runtime. · 4614fafa

Jesse Gross authored Mar 08, 2025

It's ok to fail on startup but we shouldn't panic during runtime
based on user input. Downgrade the panic to a warning.

4614fafa
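
The pattern described in 4614fafa can be sketched roughly as follows. This is a minimal illustration, not the runner's code: `applyOption` and the option names are hypothetical, and only the standard `log/slog` package is used.

```go
package main

import (
	"fmt"
	"log/slog"
)

// applyOption is a hypothetical request-time hook. Instead of panicking on an
// option the engine does not implement, it logs a warning and reports that
// the option was skipped, so the request can continue.
func applyOption(name string, implemented map[string]bool) bool {
	if !implemented[name] {
		slog.Warn("ignoring unimplemented feature", "feature", name)
		return false
	}
	return true
}

func main() {
	implemented := map[string]bool{"temperature": true}
	fmt.Println(applyOption("temperature", implemented))
	fmt.Println(applyOption("hypothetical_option", implemented))
}
```
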

08 Mar, 2025 5 commits
- ml: Add support for quantized KV cache · 4100ed7b
  Jesse Gross authored Feb 21, 2025
```
Similar to the llama engine, quantizing the KV cache requires
flash attention to be enabled through the Ollama server.
```
  4100ed7b
- kvcache: Set context for shift offsets · f52b2615
  Jesse Gross authored Mar 07, 2025
  
  f52b2615
- ggml-backend: Ensure allocations meet backend requirements · 25f9b152
  Jesse Gross authored Mar 07, 2025
```
Backends can impose additional alignment requirements on buffer sizes.
We should ensure that we meet these or allocations can fail.
```
  25f9b152
- kvcache: Support non-causal attention · 6da8b6a8
  Jesse Gross authored Mar 07, 2025
```
Models can disable causality for all or part of their processing
while continuing to store data in the KV cache.
```
  6da8b6a8
- ollamarunner: Quiet debug logging and panic on unimplemented features · 0daaaef8
  Jesse Gross authored Mar 07, 2025
```
Debug logging of every token has previously caused test timeouts
on slower machines.
```
  0daaaef8
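
The alignment requirement mentioned in 25f9b152 above is typically met by rounding a requested size up to the next multiple of the backend's alignment. The sketch below shows that arithmetic only; `alignUp` is a hypothetical helper, and how the commit obtains the alignment value from the backend is not shown in the log.

```go
package main

import "fmt"

// alignUp rounds size up to the next multiple of alignment.
// Assumes alignment > 0; an alignment of 1 leaves size unchanged.
func alignUp(size, alignment uint64) uint64 {
	return (size + alignment - 1) / alignment * alignment
}

func main() {
	fmt.Println(alignUp(100, 32)) // 128
	fmt.Println(alignUp(128, 32)) // 128 (already aligned)
}
```
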
07 Mar, 2025 19 commits

additional review comments · 98272fbd
Jesse Gross authored Mar 07, 2025

98272fbd
ml/backend/ggml: use backend buffer type · b27e8f3f
Michael Yang authored Mar 05, 2025
```
this ensures the tensor is created on the right buffer type for backends
such as cpu
```
b27e8f3f
comments · 45df786f
Michael Yang authored Mar 04, 2025

45df786f
ml/backend/ggml: clean up · daaf42e4
Michael Yang authored Feb 28, 2025

daaf42e4
ml/backend/ggml: offload vision to cpu · 2dc60d46
Michael Yang authored Feb 27, 2025
```
temporary until tensor loading can accurately account for vision models
```
2dc60d46
ml/backend/ggml: handle tensor split · b5312f30
Michael Yang authored Feb 26, 2025

b5312f30
ml/backend/ggml: handle user specified cpu offloading · 26c2e0bd
Michael Yang authored Feb 26, 2025

26c2e0bd
ml/backend/ggml: set cpu n_threads · bf920883
Michael Yang authored Feb 26, 2025

bf920883
kvcache: update tests · 58b9ec1f
Michael Yang authored Feb 26, 2025

58b9ec1f

ml/backend/ggml: create tensor on specific backend · 7bae7fa5

Michael Yang authored Feb 25, 2025

some tensors should be created on specific backends to reduce number of
copies and improve performance

7bae7fa5

kvcache: create cache ctx per layer · 764e199d

Michael Yang authored Feb 25, 2025

each cache layer creates and maintains its own context instead of using
a large context for all layers

764e199d

model: load non-repeated tensors into multiple backends · bfce55db

Michael Yang authored Feb 24, 2025

some tensors are expected to be used in repeating layers but are not
themselves repeated. this change copies these tensors into the same
backends as their repeating counterparts to minimize copying tensors
between backends

bfce55db

ml/backend/ggml: update model loading for hybrid/multi backends · bab6f34d

Michael Yang authored Feb 19, 2025

use a similar strategy as llama.cpp for deciding where tensors should be
allocated. this will be improved later to be aware of usable memory
before assigning the tensor

bab6f34d

sample: improve ollama engine sampler performance (#9374) · 0682dae0

Parth Sareen authored Mar 07, 2025

This change brings in various interface cleanups and greatly improves the performance of the sampler.

Tested with llama3.2 on a local machine.
Improves performance from ~70 tokens/s to 135 tokens/s with topK(40) enabled.
Without topK, performance is ~110 tokens/s.

0682dae0
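
The commit message does not explain why the topK(40) path is faster. One plausible reason, stated here as an assumption, is that later sampling steps then only touch 40 candidates instead of the whole vocabulary. A minimal sketch of selecting the k largest logits with a min-heap from the standard `container/heap` package follows; the `token` type and `topK` function are hypothetical and are not taken from the sampler itself.

```go
package main

import (
	"container/heap"
	"fmt"
	"sort"
)

// token pairs a vocabulary index with its logit (hypothetical type).
type token struct {
	id    int
	logit float64
}

// minHeap orders tokens by ascending logit, so the smallest kept logit
// is always at the root.
type minHeap []token

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].logit < h[j].logit }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(token)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// topK returns the k highest-logit tokens in descending order.
// It runs in O(n log k) instead of sorting all n logits.
func topK(logits []float64, k int) []token {
	if k <= 0 {
		return nil
	}
	h := &minHeap{}
	for id, l := range logits {
		if h.Len() < k {
			heap.Push(h, token{id, l})
		} else if l > (*h)[0].logit {
			// Replace the smallest kept logit and restore heap order.
			(*h)[0] = token{id, l}
			heap.Fix(h, 0)
		}
	}
	out := []token(*h)
	sort.Slice(out, func(i, j int) bool { return out[i].logit > out[j].logit })
	return out
}

func main() {
	for _, t := range topK([]float64{0.1, 2.5, -1.0, 3.2, 0.7}, 2) {
		fmt.Println(t.id, t.logit)
	}
}
```
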

readme: add QwQ to the supported models list (#9565) · 1f6986e9
Breaker authored Mar 08, 2025

1f6986e9
llama: fix kv loading on snowflake-arctic-embed models (#9536) · 4289c743
Jeffrey Morgan authored Mar 07, 2025

4289c743

Better WantedBy declaration · 25248f4b

Martin Häcker authored Mar 07, 2025

The problem with default.target is that it always points to the target that is currently started. So if you boot into single-user mode or rescue mode, Ollama still tries to start.

I noticed this because it tried (and failed) to start all the time during a system update, where Ollama definitely is not wanted.

25248f4b

ollamarunner: Improve multimodal input handling · a7e63b82

Jesse Gross authored Mar 05, 2025

Various vision models have different requirements for how they
receive their inputs. For example:
 - Mllama wants images together with text and the image embeddings
   don't themselves have positions or get stored in the main KV cache
 - Llava-style models feed in embeddings similar to tokens and
   images correspond to a varying number of tokens in the cache.

In addition, the strategy for providing inputs must support batching
and multiple sequences, which are managed by the runner. At the same
time, we want to keep data handling fully in the model so that new
architectures are not bottlenecked by runner code which does not
understand their particular requirements.

This provides a method for models to edit the input stream so that
it meets their needs while still being in a format that the runner
understands. This allows the runner to avoid special processing
for different models.

In addition, this fixes a regression where non-vision models may
try to incorrectly interpret images.

a7e63b82

model: Don't unconditionally add special tokens · b70fc4d5

Jesse Gross authored Mar 05, 2025

We sometimes tokenize partial strings. For example, with
multimodal inputs, we split the input string around the images
and then tokenize each piece. In these cases, we should only add
the special tokens on the first piece.

b70fc4d5
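
The rule in b70fc4d5 can be illustrated with a small sketch. The tokenizer here is a stand-in invented for the example (one ID per byte, with a made-up beginning-of-sequence ID of 256); only the "special tokens on the first piece only" logic reflects the commit message.

```go
package main

import "fmt"

// bosID is a made-up beginning-of-sequence ID, chosen outside the byte range.
const bosID = 256

// encode is a stand-in tokenizer: one token ID per byte, optionally
// prefixed with bosID when addSpecial is true.
func encode(s string, addSpecial bool) []int {
	var ids []int
	if addSpecial {
		ids = append(ids, bosID)
	}
	for _, b := range []byte(s) {
		ids = append(ids, int(b))
	}
	return ids
}

// encodePieces tokenizes text pieces that were split around images,
// requesting special tokens only for the first piece.
func encodePieces(pieces []string) [][]int {
	out := make([][]int, 0, len(pieces))
	for i, p := range pieces {
		out = append(out, encode(p, i == 0))
	}
	return out
}

func main() {
	fmt.Println(encodePieces([]string{"hi", "yo"})) // [[256 104 105] [121 111]]
}
```
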

05 Mar, 2025 2 commits

server/internal/registry: take over pulls from server package (#9485) · e2252d0f

Blake Mizerany authored Mar 05, 2025

This commit replaces the old pull implementation in the server package
with the new, faster, more robust pull implementation in the registry
package.

The new endpoint, and now the remove endpoint too, are behind the
feature gate "client2", which is enabled only by setting the
OLLAMA_EXPERIMENT environment variable to include "client2".

Currently, the progress indication is wired to perform the same as the
previous implementation to avoid making changes to the CLI, and because
the status reports happen at the start of the download, and the end of
the write to disk, the progress indication is not as smooth as it could
be. This is a known issue and will be addressed in a future change.

This implementation may be ~0.5-1.0% slower in rare cases, depending on
network and disk speed, but is generally MUCH faster and more robust
than its predecessor in all other cases.

e2252d0f
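
A feature gate of the kind described in e2252d0f might be checked as in the sketch below. The commit message only says the variable must include "client2"; treating the value as a comma-separated list is an assumption of this example, and `experimentEnabled` is a hypothetical helper rather than the server's function.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// experimentEnabled reports whether name appears as a whole entry in value.
// The comma separator is an assumption of this sketch.
func experimentEnabled(value, name string) bool {
	for _, e := range strings.Split(value, ",") {
		if strings.TrimSpace(e) == name {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(experimentEnabled(os.Getenv("OLLAMA_EXPERIMENT"), "client2"))
}
```

Matching whole entries rather than substrings avoids enabling the gate for a longer name that merely contains "client2".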

Win: doc new rocm zip file (#9367) · cae5d4d4

Daniel Hiltgen authored Mar 05, 2025

To stay under the 2G github artifact limit, we're splitting ROCm
out like we do on linux.

cae5d4d4

04 Mar, 2025 3 commits

ml/backend/ggml: consolidate system info logging · 05a01fde

Michael Yang authored Feb 28, 2025

- output backend system info when initializing the backend. this ensures
  this information is always present without needing to be called
  explicitly
- convert to structured logging
- enumerate devices rather than backends since devices are ordered
- track device indices grouped by device name

05a01fde
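
One reading of "track device indices grouped by device name" in 05a01fde is a counter kept per device name while walking the ordered device list. The sketch below illustrates that reading only; `indexByName` and the device names are hypothetical.

```go
package main

import "fmt"

// indexByName assigns each device an index that counts up separately
// for each device name, following the order of the input slice.
func indexByName(names []string) []int {
	next := make(map[string]int)
	indices := make([]int, len(names))
	for i, n := range names {
		indices[i] = next[n]
		next[n]++
	}
	return indices
}

func main() {
	fmt.Println(indexByName([]string{"CUDA", "CUDA", "CPU"})) // [0 1 0]
}
```
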

docs: add granite-3.2 to the readme · 8fe6f69f
aritra saha authored Mar 05, 2025

8fe6f69f

New engine: vision models and auto-fallback (#9113) · 1fdb351c

Daniel Hiltgen authored Mar 04, 2025

* Include unified vision layers in memory prediction

For newer vision models with a single gguf, include
the projection estimates.

* Adjust CLI to handle both styles of vision model metadata

* Wire up new tokenizers for new engine

If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp.  This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.

* Lay foundation for auto selection of new engine

1fdb351c