1. 10 Mar, 2025 1 commit
    • model: Update encoder cache to use multimodal input processing handler · a1cda80b
      Jesse Gross authored
      The encoder cache needs to know the position of images in the input
      stream so that it knows when to delete them. Previously images didn't
      have a position, so we implied one by breaking batches before an
      image and then assuming the image was in the first position. However,
      multimodal objects are now given explicit positions in the input
      stream, so we can use that instead.
      
      Breaking batches was also a way to simulate a cross attention mask
      for mllama. However, given that it only supports a single sequence
      and a single image, this mask doesn't serve any real purpose.
      Removing the batch break does not appear to affect the quality of
      the output.
      
      Most of this is simply moving the input data structures to a new
      package to avoid import cycles.
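      The idea above can be sketched in Go. This is a minimal, illustrative model of an encoder cache keyed by explicit input positions, not ollama's actual types: `mmEntry`, `encoderCache`, and `evictBefore` are hypothetical names, assuming the cache evicts an image once the sequence has moved past its position rather than inferring position from batch breaks.

```go
package main

import "fmt"

// mmEntry is a cached multimodal (e.g. image) embedding together with the
// explicit position it occupies in the input stream. Illustrative only.
type mmEntry struct {
	pos       int
	embedding []float32
}

type encoderCache struct {
	entries []mmEntry
}

// add caches an embedding at an explicit input position.
func (c *encoderCache) add(pos int, emb []float32) {
	c.entries = append(c.entries, mmEntry{pos: pos, embedding: emb})
}

// evictBefore drops entries whose position precedes the cutoff, returning
// how many were evicted. With explicit positions, no batch break is needed
// to know which image a truncation invalidates.
func (c *encoderCache) evictBefore(cutoff int) int {
	kept := c.entries[:0]
	evicted := 0
	for _, e := range c.entries {
		if e.pos >= cutoff {
			kept = append(kept, e)
		} else {
			evicted++
		}
	}
	c.entries = kept
	return evicted
}

func main() {
	c := &encoderCache{}
	c.add(0, []float32{0.1})
	c.add(120, []float32{0.2})
	fmt.Println(c.evictBefore(100), len(c.entries))
}
```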
  2. 09 Mar, 2025 1 commit
  3. 08 Mar, 2025 5 commits
  4. 07 Mar, 2025 18 commits
  5. 05 Mar, 2025 2 commits
    • server/internal/registry: take over pulls from server package (#9485) · e2252d0f
      Blake Mizerany authored
      This commit replaces the old pull implementation in the server package
      with the new, faster, more robust pull implementation in the registry
      package.
      
      The new endpoint, and now the remove endpoint too, are behind the
      feature gate "client2", enabled only by setting the OLLAMA_EXPERIMENT
      environment variable to include "client2".
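      A sketch of the gate check described above. The exact parsing ollama uses may differ; this assumes OLLAMA_EXPERIMENT is a comma-separated list of experiment names, and `useClient2` is an illustrative name, not the real function.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// useClient2 reports whether the experimental "client2" pull/remove path is
// enabled, assuming OLLAMA_EXPERIMENT holds a comma-separated list.
func useClient2(env string) bool {
	for _, f := range strings.Split(env, ",") {
		if strings.TrimSpace(f) == "client2" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(useClient2(os.Getenv("OLLAMA_EXPERIMENT")))
}
```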
      
      Currently, the progress indication is wired to behave the same as the
      previous implementation to avoid making changes to the CLI. Because the
      status reports happen only at the start of the download and at the end
      of the write to disk, the progress indication is not as smooth as it
      could be. This is a known issue and will be addressed in a future change.
      
      This implementation may be ~0.5-1.0% slower in rare cases, depending on
      network and disk speed, but is generally much faster and more robust
      than its predecessor in all other cases.
    • Win: doc new rocm zip file (#9367) · cae5d4d4
      Daniel Hiltgen authored
      To stay under the 2 GB GitHub artifact limit, we're splitting ROCm
      out like we do on Linux.
  6. 04 Mar, 2025 6 commits
    • ml/backend/ggml: consolidate system info logging · 05a01fde
      Michael Yang authored
      - output backend system info when initializing the backend. this ensures
        this information is always present without needing to be called
        explicitly
      - convert to structured logging
      - enumerate devices rather than backends since devices are ordered
      - track device indices grouped by device name
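      The last two points can be sketched as follows. This is an illustrative Go example using the standard library's structured logger (`log/slog`), not ollama's actual logging code; the device names and the `indexByName` helper are made up for the sketch.

```go
package main

import (
	"fmt"
	"log/slog"
	"os"
)

// indexByName returns, for each device, its index among devices that share
// the same name, mirroring "track device indices grouped by device name".
func indexByName(devices []string) []int {
	seen := map[string]int{}
	out := make([]int, len(devices))
	for i, d := range devices {
		out[i] = seen[d]
		seen[d]++
	}
	return out
}

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
	// Enumerate devices rather than backends, since devices are ordered.
	devices := []string{"NVIDIA RTX 4090", "NVIDIA RTX 4090", "CPU"}
	idx := indexByName(devices)
	for i, name := range devices {
		// One structured record per device, emitted at backend init so the
		// information is always present without an explicit call.
		logger.Info("system info", "device", i, "name", name, "index", idx[i])
	}
	fmt.Println(idx)
}
```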
    • docs: add granite-3.2 to the readme · 8fe6f69f
      aritra saha authored
    • New engine: vision models and auto-fallback (#9113) · 1fdb351c
      Daniel Hiltgen authored
      * Include unified vision layers in memory prediction
      
      For newer vision models with a single gguf, include
      the projection estimates.
      
      * Adjust CLI to handle both styles of vision model metadata
      
      * Wire up new tokenizers for new engine
      
      If we're loading the new engine, utilize the new model
      text processor instead of calling into cgo wrappers for
      llama.cpp.  This also cleans up some tech debt from the
      older tokenization flow for the C++ server which was
      no longer used.
      
      This also adjusts the grammar handling logic to pass
      through to the new engine instead of utilizing the cgo
      schema to grammar call.
      
      * Lay foundation for auto selection of new engine
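      The tokenizer wiring described above amounts to a dispatch on which engine loaded the model. This is a hedged sketch, not ollama's real API: the interface, types, and `pickTokenizer` are illustrative stand-ins for the model text processor (new engine) and the cgo llama.cpp wrappers (old path).

```go
package main

import "fmt"

// tokenizer abstracts over the two tokenization paths.
type tokenizer interface {
	Encode(s string) []int32
}

type newEngineTokenizer struct{} // model text processor (new engine)
type cgoTokenizer struct{}       // llama.cpp via cgo wrappers (old path)

func (newEngineTokenizer) Encode(s string) []int32 { return []int32{int32(len(s))} }
func (cgoTokenizer) Encode(s string) []int32       { return []int32{int32(len(s))} }

// pickTokenizer selects the tokenization path based on which engine is
// loading the model, analogous to "utilize the new model text processor
// instead of calling into cgo wrappers for llama.cpp".
func pickTokenizer(newEngine bool) tokenizer {
	if newEngine {
		return newEngineTokenizer{}
	}
	return cgoTokenizer{}
}

func main() {
	fmt.Printf("%T %T\n", pickTokenizer(true), pickTokenizer(false))
}
```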
    • server/internal/registry: reintroduce pruning on model deletion (#9489) · 7a01ad76
      Blake Mizerany authored
      This reintroduces aggressive pruning on model deletion as a temporary
      measure until a more controlled garbage collection (GC) mechanism is
      implemented.
      
      Issues with the current approach:
      
      1. Users may accidentally delete a model (`ollama rm llama3.3` instead
         of `ollama rm llama3.2`), requiring a full re-download unless another
         model references the same blobs.
      
      2. Users may assume a deleted model is still referenced elsewhere, but
         due to prior updates or deletions, the references no longer exist,
         leading to unnecessary re-downloads.
      
      Soon, we should implement a structured GC mechanism to retain
      unreferenced blobs for a configurable period before removal, which will
      run on "ollama rm" and other commands we deem appropriate.
      
      Users who want to immediately remove unreferenced blobs can use a new
      prune command that will allow them to specify the age and class of blobs
      to remove.
      
      Example usage:
      
          # Run basic blob GC
          $ ollama prune
      
          # Remove unreferenced blobs older than 7 days
          $ ollama prune --age 7d
      
          # Remove all blobs, referenced or not, older than 7 days (and their manifests?)
          $ ollama prune --age 7d --all
      
          # Remove all unreferenced blobs immediately
          $ ollama prune --age 0 --all
      
          # Remove all blobs
          $ ollama prune --age 0 --all
      
      This should provide a safer and more predictable cleanup process.
    • server/.../backoff,syncs: don't break builds without synctest (#9484) · 55ab9f37
      Blake Mizerany authored
      Previously, developers without the synctest experiment enabled would see
      build failures when running tests in some server/internal/internal
      packages that use the synctest package. This change makes the transition
      to the package less painful by guarding uses of synctest with build
      tags.
      
      synctest is enabled in CI. If a new change will break a synctest
      package, it will break in CI, even if it does not break locally.
      
      The developer docs have been updated to help with any confusion about
      why package tests pass locally but fail in CI.
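      The guard described above looks roughly like the following: a test file carrying a build constraint so it is only compiled when the synctest experiment is on (GOEXPERIMENT=synctest, as in CI). This is a sketch; the package and test names are illustrative, not the actual files in the commit.

```go
//go:build goexperiment.synctest

// This file is skipped entirely for developers building without the
// experiment, so their builds and test runs do not break.
package backoff_test

import (
	"testing"
	"testing/synctest"
)

func TestBackoffVirtualTime(t *testing.T) {
	synctest.Run(func() {
		// time-dependent assertions run against synctest's fake clock
	})
}
```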
    • KindBrave · fefbf8f7
  7. 03 Mar, 2025 7 commits