Commits · 0fbfcf3c9c7bfdbf4616238595eafd7eca2a916c · OpenDAS / ollama

20 Mar, 2025 3 commits

model: Pass input tensor instead of raw data to models · 0fbfcf3c

Jesse Gross authored Mar 19, 2025

Rather than directly giving the input data to models, we can
pass a tensor instead. In the short term, this saves some duplicated
code.

Longer term, we will want to overlap setting up the next batch with
processing of the current one. In this case, we will only have the
shape of the tensor, but it will not be loaded with data at the time of
graph generation. By passing only a tensor to models now, we set up
this possibility and prevent them from relying on data that they won't
have in the future.

Although the same could be done for Positions and Outputs, in some
cases we either need the raw input data or don't use them at all.
Therefore, for now we leave them as they are and allow models to
convert them to tensors as needed.
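
As a rough illustration of the idea (not the actual ollama types), the sketch below assumes a hypothetical `Tensor` whose shape is known before its data is loaded, and a hypothetical `Batch` that carries the inputs as a tensor while positions and outputs stay as raw slices. The planning function reads only the shape, so it still works when the data is absent.

```go
package main

import "fmt"

// Tensor is a hypothetical stand-in for a backend tensor: the shape is known
// when the graph is built, while the data may be filled in later.
type Tensor struct {
	shape []int
	data  []int32 // may still be nil during graph generation
}

// Shape returns the tensor dimensions without touching the data.
func (t *Tensor) Shape() []int { return t.shape }

// Batch is a hypothetical batch description. Inputs is a tensor, while
// Positions and Outputs remain raw slices, as the commit message describes.
type Batch struct {
	Inputs    *Tensor
	Positions []int32
	Outputs   []int32
}

// planForward describes the work to do using only the shape of the inputs,
// so it does not depend on the input data being loaded.
func planForward(b Batch) string {
	return fmt.Sprintf("embed %d tokens, emit %d outputs", b.Inputs.Shape()[0], len(b.Outputs))
}

func main() {
	b := Batch{
		Inputs:    &Tensor{shape: []int{3}}, // shape only, no data yet
		Positions: []int32{0, 1, 2},
		Outputs:   []int32{2},
	}
	fmt.Println(planForward(b))
}
```

A model written against an interface like this cannot accidentally inspect token values, which is the property the commit message is aiming for.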

0fbfcf3c

input: Rename Options to Batch · 0c220935

Jesse Gross authored Mar 19, 2025

Options is no longer very descriptive of this struct.

0c220935

gemma2: Remove second call to Rows · b078dd15

Jesse Gross authored Mar 19, 2025

Looks like a merge conflict that broke the model.

b078dd15

19 Mar, 2025 1 commit
- ml: use input context for extracting outputs (#9875) · da0e3452
  Jeffrey Morgan authored Mar 18, 2025
  
  da0e3452
14 Mar, 2025 2 commits

ollamarunner: Use a separate context per multimodal input · 282bfaaa

Jesse Gross authored Mar 13, 2025

Currently there is a single context per sequence, shared by
all multimodal inputs. Since we build a vision encoder graph per
image, with a large number of inputs we can eventually hit the
maximum number of graph nodes per context.

This change uses a separate context for each image, ensuring
that available resource limits are consistent.
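
The effect can be sketched with a toy model. The `Context` type, its node budget, and the per-image node count below are all hypothetical and chosen only for illustration; they are not the real backend API or real limits.

```go
package main

import (
	"errors"
	"fmt"
)

// Context is a hypothetical compute context with a fixed budget of graph nodes.
type Context struct {
	used, max int
}

// AddNodes reserves n graph nodes and fails once the budget is exhausted.
func (c *Context) AddNodes(n int) error {
	if c.used+n > c.max {
		return errors.New("context out of graph nodes")
	}
	c.used += n
	return nil
}

// encodeShared builds every image's encoder graph in one shared context,
// so node usage accumulates across images.
func encodeShared(images, nodesPerImage, maxNodes int) error {
	ctx := &Context{max: maxNodes}
	for i := 0; i < images; i++ {
		if err := ctx.AddNodes(nodesPerImage); err != nil {
			return fmt.Errorf("image %d: %w", i, err)
		}
	}
	return nil
}

// encodePerImage gives each image a fresh context, so the budget resets
// for every image.
func encodePerImage(images, nodesPerImage, maxNodes int) error {
	for i := 0; i < images; i++ {
		ctx := &Context{max: maxNodes}
		if err := ctx.AddNodes(nodesPerImage); err != nil {
			return fmt.Errorf("image %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	// Illustrative numbers only: 10 images at 300 nodes each, budget of 2048.
	fmt.Println("shared:", encodeShared(10, 300, 2048))
	fmt.Println("per image:", encodePerImage(10, 300, 2048))
}
```

With a shared context the failure depends on how many images the request contains; with one context per image the limit applies to each image independently.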

282bfaaa

ml: Allow models to constrain inputs to a single batch · 9679f401

Jesse Gross authored Mar 12, 2025

Models may require that a set of inputs all be processed as part
of the same batch. For example, if an image has multiple patches
with fully connected attention between them, we should not split
the batch in the middle of an image.
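
One way such a constraint could be honoured when splitting inputs is sketched below. The `Input` type and its `SameBatch` field (the number of following inputs that must share a batch with this one) are assumptions made for this example, and nested constraints inside a group are ignored for simplicity.

```go
package main

import (
	"errors"
	"fmt"
)

// Input is a hypothetical input element. SameBatch is the number of inputs
// after this one that must land in the same batch as it.
type Input struct {
	Token     int32
	SameBatch int
}

// splitBatches packs inputs into batches of at most batchSize elements
// without splitting a group declared through SameBatch.
func splitBatches(inputs []Input, batchSize int) ([][]Input, error) {
	var batches [][]Input
	var cur []Input
	for i := 0; i < len(inputs); {
		group := 1 + inputs[i].SameBatch
		if i+group > len(inputs) {
			return nil, errors.New("group runs past the end of the inputs")
		}
		if group > batchSize {
			return nil, fmt.Errorf("group of %d inputs exceeds batch size %d", group, batchSize)
		}
		// Start a new batch if the whole group does not fit in the current one.
		if len(cur)+group > batchSize {
			batches = append(batches, cur)
			cur = nil
		}
		cur = append(cur, inputs[i:i+group]...)
		i += group
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches, nil
}

func main() {
	inputs := []Input{
		{Token: 1}, {Token: 2}, // text
		{Token: 100, SameBatch: 2}, {Token: 101}, {Token: 102}, // three image patches
		{Token: 3}, // text
	}
	batches, err := splitBatches(inputs, 4)
	if err != nil {
		panic(err)
	}
	for i, b := range batches {
		fmt.Printf("batch %d: %d inputs\n", i, len(b))
	}
}
```

In this run the two text tokens form a short first batch so that the three patches can stay together in the second.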

Fixes #9697

9679f401

13 Mar, 2025 2 commits
- Update model/model.go · 3e102b7d
  Michael Yang authored Mar 13, 2025
```
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
```
  3e102b7d
- fix: error if image requested without vision model · 5e2e0b46
  Michael Yang authored Mar 13, 2025
  
  5e2e0b46
12 Mar, 2025 1 commit

models/gemma3: remove final logit softcap (#9692) · a70820da

Bruce MacDonald authored Mar 12, 2025

Softcap isn't in the whitepaper/implementation for the language model so we should remove it. There is no discernible difference in output with it removed.
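
For readers unfamiliar with the term, the sketch below assumes the common tanh form of logit softcapping, `limit * tanh(x / limit)`. The limit of 30 is only an illustrative value and is not taken from this commit. For logits well below the limit the function is close to the identity.

```go
package main

import (
	"fmt"
	"math"
)

// softcap squashes a logit into the range [-limit, limit] using tanh.
// This is the commonly used form of softcapping, assumed here for illustration.
func softcap(logit, limit float64) float64 {
	return limit * math.Tanh(logit/limit)
}

func main() {
	const limit = 30.0 // illustrative value, not taken from the commit
	for _, x := range []float64{0.5, 5, 50, 500} {
		fmt.Printf("%6.1f -> %.3f\n", x, softcap(x, limit))
	}
}
```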

a70820da

11 Mar, 2025 23 commits
- all: address linter errors · 83f0ec82
  jmorganca authored Mar 11, 2025
  
  83f0ec82
- model: add more spm tokenizer tests · fb4664fc
  jmorganca authored Mar 11, 2025
  
  fb4664fc
- model: validate left and right pairs before merging them · 20e35938
  jmorganca authored Mar 11, 2025
  
  20e35938
- use 2d pooling · 63a39406
  Michael Yang authored Mar 11, 2025
  
  63a39406
- add trailing \n\n after <end_of_image> to match reference implementation · 11bfa627
  jmorganca authored Mar 11, 2025
  
  11bfa627
- reduce kernel size, add TODO for loading from config · f63e62e5
  jmorganca authored Mar 11, 2025
  
  f63e62e5
- Revert "Allow models to force a new batch" · 65b0f329
  jmorganca authored Mar 11, 2025
```
This reverts commit c7eae586b899083acebcd9b3847b89ea78c2850c.
```
  65b0f329
- Allow models to force a new batch · 06007c0a
  Jesse Gross authored Mar 10, 2025
```
This is useful for a few things:
 - Work around bugs, such as having 2 images in one batch
 - Keep the image in a single batch for fully connected attention
 - Improve performance by not evaluating embeddings multiple times
```
  06007c0a
- Disable causal attention based on batch index · a8e83a76
  Jesse Gross authored Mar 10, 2025
```
Currently we are using positions, which are relative to a
sequence and may not be unique.
```
  a8e83a76
- Fix follow up images and images split across batches · 2c40c4d3
  Jesse Gross authored Mar 09, 2025
  
  2c40c4d3
- use non-causal mask only for image positions · e9527893
  Michael Yang authored Mar 10, 2025
  
  e9527893
- use non-causal mask for inputs with images · 9d2a20a7
  Michael Yang authored Mar 10, 2025
  
  9d2a20a7
- compat with upstream gguf · 6b32a2d5
  Michael Yang authored Mar 10, 2025
  
  6b32a2d5
- fix vision encoder · f8889128
  Michael Yang authored Mar 09, 2025
  
  f8889128
- fix configs · 9b54267e
  Patrick Devine authored Mar 08, 2025
  
  9b54267e
- update model · 46bb0169
  Michael Yang authored Mar 08, 2025
  
  46bb0169
- use fast attention · 8934324b
  Michael Yang authored Mar 07, 2025
  
  8934324b
- Fix tests and drift from main · 0e886595
  Jesse Gross authored Mar 07, 2025
  
  0e886595
- fix conversion · c62861f4
  Patrick Devine authored Mar 07, 2025
  
  c62861f4
- set non-causal attention · 0df18004
  Michael Yang authored Mar 07, 2025
  
  0df18004
- fix drift from main · 4346c240
  Jesse Gross authored Mar 07, 2025
  
  4346c240
- add gemma vision encoder · 4b037a97
  Michael Yang authored Mar 06, 2025
  
  4b037a97
- gemma2 impl · 5f74d1fd
  Patrick Devine authored Feb 07, 2025
  
  5f74d1fd
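
The commit "Disable causal attention based on batch index" (a8e83a76) in the list above notes that positions are relative to a sequence and may not be unique. A minimal sketch of that ambiguity follows; the `Entry` type and both helper functions are hypothetical and exist only for this example.

```go
package main

import "fmt"

// Entry is a hypothetical batch element: the sequence it belongs to and
// its position within that sequence.
type Entry struct {
	Seq, Pos int32
}

// nonCausalByPosition marks every entry whose position is listed. Entries
// from different sequences that share a position are all marked.
func nonCausalByPosition(batch []Entry, positions []int32) []bool {
	want := make(map[int32]bool, len(positions))
	for _, p := range positions {
		want[p] = true
	}
	out := make([]bool, len(batch))
	for i, e := range batch {
		out[i] = want[e.Pos]
	}
	return out
}

// nonCausalByIndex marks entries by their index in the batch, which is
// unique. Indices are assumed to be within range.
func nonCausalByIndex(batch []Entry, indices []int) []bool {
	out := make([]bool, len(batch))
	for _, i := range indices {
		out[i] = true
	}
	return out
}

func main() {
	// Two sequences share a batch, and both have an entry at position 5.
	batch := []Entry{{Seq: 0, Pos: 5}, {Seq: 1, Pos: 5}}
	fmt.Println(nonCausalByPosition(batch, []int32{5})) // [true true]
	fmt.Println(nonCausalByIndex(batch, []int{0}))      // [true false]
}
```

Selecting by position also marks the unrelated entry from the second sequence, while selecting by batch index marks only the intended one.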
10 Mar, 2025 1 commit

model: Update encoder cache to use multimodal input processing handler · a1cda80b

Jesse Gross authored Mar 08, 2025

The encoder cache needs to know the position of images in the input
stream so that it knows when to delete them. Previously images didn't
have a position, so we implied one by breaking batches before an
image and then assuming the image was in the first position. However,
multimodal objects are now given explicit positions in the input
stream, so we can use that instead.

Breaking batches was also a way to simulate a cross attention mask
for mllama. However, given that it only supports a single sequence
and a single image, this mask doesn't serve any real purpose.
Removing the batch break does not appear to affect the quality of
the output.

Most of this is simply moving the input data structures to a new
package to avoid import cycles.

a1cda80b

08 Mar, 2025 1 commit
- ollamarunner: Quiet debug logging and panic on unimplemented features · 0daaaef8
  Jesse Gross authored Mar 07, 2025
```
Debug logging of every token has previously caused test timeouts
on slower machines.
```
  0daaaef8
07 Mar, 2025 5 commits

additional review comments · 98272fbd
Jesse Gross authored Mar 07, 2025

98272fbd

ml/backend/ggml: create tensor on specific backend · 7bae7fa5

Michael Yang authored Feb 25, 2025

some tensors should be created on specific backends to reduce number of
copies and improve performance

7bae7fa5

ml/backend/ggml: update model loading for hybrid/multi backends · bab6f34d

Michael Yang authored Feb 19, 2025

use a strategy similar to llama.cpp's for deciding where tensors should be
allocated. this will be improved later to be aware of usable memory
before assigning the tensor

bab6f34d

ollamarunner: Improve multimodal input handling · a7e63b82

Jesse Gross authored Mar 05, 2025

Various vision models have different requirements for how they
receive their inputs. For example:
 - Mllama wants images together with text and the image embeddings
   don't themselves have positions or get stored in the main KV cache
 - Llava-style models feed in embeddings similar to tokens and
   images correspond to a varying number of tokens in the cache.

In addition, the strategy for providing inputs must support batching
and multiple sequences, which are managed by the runner. At the same
time, we want to keep data handling fully in the model so that new
architectures are not bottlenecked by runner code which does not
understand their particular requirements.

This provides a method for models to edit the input stream so that
it meets their needs while still being in a format that the runner
understands. This allows the runner to avoid special processing
for different models.
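
The shape of such a hook might look like the sketch below. The `Input` type, the `InputEditor` interface, and both model types are hypothetical names chosen for illustration; the real interface may differ. The Llava-style model expands each image into a fixed number of cache slots, while a text-only model leaves the stream untouched, and the runner treats both the same way.

```go
package main

import "fmt"

// Input is a hypothetical element of the input stream: either a text token
// or a multimodal payload such as image embeddings.
type Input struct {
	Token      int32
	Multimodal []float32 // non-nil for an image
}

// InputEditor is a hypothetical hook that lets a model rewrite the input
// stream into the layout it needs.
type InputEditor interface {
	EditInputs(in []Input) []Input
}

// llavaStyle expands each image so that it occupies tokensPerImage slots.
type llavaStyle struct{ tokensPerImage int }

func (m llavaStyle) EditInputs(in []Input) []Input {
	var out []Input
	for _, inp := range in {
		out = append(out, inp)
		if inp.Multimodal == nil {
			continue
		}
		// The first slot carries the image; the rest are empty placeholders.
		for i := 1; i < m.tokensPerImage; i++ {
			out = append(out, Input{})
		}
	}
	return out
}

// textOnly passes the stream through unchanged.
type textOnly struct{}

func (textOnly) EditInputs(in []Input) []Input { return in }

func main() {
	stream := []Input{{Token: 1}, {Multimodal: []float32{0.1, 0.2}}, {Token: 2}}
	for _, m := range []InputEditor{llavaStyle{tokensPerImage: 4}, textOnly{}} {
		fmt.Printf("%T: %d inputs\n", m, len(m.EditInputs(stream)))
	}
}
```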

In addition, this fixes a regression where non-vision models may
try to incorrectly interpret images.

a7e63b82

model: Don't unconditionally add special tokens · b70fc4d5

Jesse Gross authored Mar 05, 2025

We sometimes tokenize partial strings. For example, with
multimodal inputs, we split the input string around the images
and then tokenize each piece. In these cases, we should only add
the special tokens on the first piece.
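
A small sketch of the rule, using a toy whitespace tokenizer and a made-up `<bos>` marker rather than the real tokenizer:

```go
package main

import (
	"fmt"
	"strings"
)

const bos = "<bos>" // hypothetical beginning-of-sequence marker

// encode is a toy tokenizer: it splits on whitespace and optionally
// prepends the special token.
func encode(piece string, addSpecial bool) []string {
	var toks []string
	if addSpecial {
		toks = append(toks, bos)
	}
	return append(toks, strings.Fields(piece)...)
}

// tokenizePieces encodes each piece of a split prompt, adding the special
// token only to the first piece.
func tokenizePieces(pieces []string) []string {
	var out []string
	for i, p := range pieces {
		out = append(out, encode(p, i == 0)...)
	}
	return out
}

func main() {
	// The prompt was split around an image that sat between the two pieces.
	fmt.Println(tokenizePieces([]string{"describe this", "in one word"}))
}
```

If every piece were encoded with `addSpecial` set, the marker would reappear in the middle of the prompt after each image.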

b70fc4d5

04 Mar, 2025 1 commit

New engine: vision models and auto-fallback (#9113) · 1fdb351c

Daniel Hiltgen authored Mar 04, 2025

* Include unified vision layers in memory prediction

For newer vision models with a single gguf, include
the projection estimates.

* Adjust CLI to handle both styles of vision model metadata

* Wire up new tokenizers for new engine

If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp.  This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.

* Lay foundation for auto selection of new engine

1fdb351c