- 16 May, 2025 1 commit
-
-
Michael Yang authored
* get eos_token_id from generation_config.json
* refactor
* include both ids and strings in trace
* comments
* remove special case for gemma3 special vocab (#10743)
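For context, generation_config.json stores eos_token_id either as a single integer or as a list of integers. A minimal sketch of reading both forms, using hypothetical helper names rather than the actual converter code:

```go
// Minimal sketch (hypothetical names): read eos_token_id from
// generation_config.json, where the field may be a single id or a list.
package config

import (
	"encoding/json"
	"os"
)

type generationConfig struct {
	// eos_token_id is a number in some checkpoints and an array in others,
	// so defer decoding until we know which.
	EOSTokenID json.RawMessage `json:"eos_token_id"`
}

func eosTokenIDs(path string) ([]int32, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}

	var cfg generationConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}

	// Try a single id first, then fall back to a list.
	var single int32
	if err := json.Unmarshal(cfg.EOSTokenID, &single); err == nil {
		return []int32{single}, nil
	}

	var ids []int32
	if err := json.Unmarshal(cfg.EOSTokenID, &ids); err != nil {
		return nil, err
	}
	return ids, nil
}
```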
-
- 15 May, 2025 2 commits
-
-
Jesse Gross authored
For some multimodal models (such as gemma3), we create a single graph that generates the image embedding and then use this in the text model. The embedding tensor is completely opaque to the runner.

However, this doesn't work if we need to use the embedding in multiple batches. This can arise if the embedding is larger than the batch size. In these cases (as with llama4), we would like to create views that are more appropriately sized. However, if we do this then the original source tensor is used in multiple graphs, which isn't allowed. To avoid that problem, models with this pattern compute the embedding tensor on first use and recreate the individual views. There is no longer a single vision and text graph.

This codifies the pattern of separating vision and text graphs. The logic of computing tensors on demand is moved to the runner, so models no longer have to worry about this. It also gives the runner visibility into the multimodal tensors, which is important for memory management.
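A hedged sketch of the pattern described above, with illustrative names rather than the actual runner API: the vision encoder runs in its own graph, and the runner memoizes the embedding on first use so later text batches can take views of the same tensor without re-running the encoder.

```go
package runner

// Tensor stands in for the backend tensor type in this sketch.
type Tensor struct{ Shape []int }

// VisionEncoder is the model-side hook: build the vision graph for one image
// and return its embedding.
type VisionEncoder interface {
	EncodeMultimodal(image []byte) (*Tensor, error)
}

// multimodalEntry is owned by the runner, which now has visibility into the
// embedding tensor for memory management.
type multimodalEntry struct {
	image  []byte
	tensor *Tensor // nil until first use
}

// tensorFor computes the embedding on demand and memoizes it, so a model that
// consumes the embedding across several batches can create appropriately
// sized views without the source tensor living in more than one graph build.
func tensorFor(e *multimodalEntry, enc VisionEncoder) (*Tensor, error) {
	if e.tensor == nil {
		t, err := enc.EncodeMultimodal(e.image)
		if err != nil {
			return nil, err
		}
		e.tensor = t
	}
	return e.tensor, nil
}
```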
-
Michael Yang authored
* panic if trying to pad 4d
* fix pixel values padding
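A tiny illustrative guard in the spirit of the first item, with hypothetical names rather than the actual tensor API: padding is only handled for up to three dimensions here, so a 4D request fails loudly instead of returning a silently wrong shape.

```go
package tensorutil

// pad grows each dimension of shape by the corresponding padding amount.
// Only up to 3 dimensions are supported here; a 4D request panics rather
// than producing an incorrect result.
func pad(shape, padding []int) []int {
	if len(shape) > 3 {
		panic("pad: 4d tensors are not supported")
	}
	out := make([]int, len(shape))
	for i := range shape {
		out[i] = shape[i] + padding[i]
	}
	return out
}
```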
-
- 14 May, 2025 2 commits
-
-
Bruce MacDonald authored
-
Michael Yang authored
-
- 13 May, 2025 1 commit
-
-
Michael Yang authored
-
- 12 May, 2025 2 commits
-
-
Bruce MacDonald authored
-
Michael Yang authored
reduce prompt log to trace level
-
- 26 Apr, 2025 1 commit
-
-
Michael Yang authored
-
- 25 Apr, 2025 6 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
Co-authored-by: Patrick Devine <patrick@infrahq.com>
-
Michael Yang authored
-
Michael Yang authored
-
- 24 Apr, 2025 1 commit
-
-
Parth Sareen authored
-
- 18 Apr, 2025 1 commit
-
-
Michael Yang authored
-
- 08 Apr, 2025 1 commit
-
-
Jesse Gross authored
Currently, the KV cache and graph are lazily allocated as needed. The cache is fully allocated on first use of the corresponding layer whereas the graph grows with the size of the context. This can be an issue if another application allocates more VRAM after we do our calculations - Ollama will crash in the middle of inference.

If we instead allocate the maximum needed memory at startup of the runner, we will either succeed or fail at that point rather than at some surprising time in the future.

Currently, this only generates a worst-case batch for text, which means that vision models may get a partial allocation and continue to lazily allocate the rest.
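A hedged sketch of the reservation step, with illustrative types rather than the runner's real ones: run a dummy forward pass at startup with the largest batch placed at the end of the context window, which is the worst case for both graph size and KV cache usage.

```go
package runner

// Model stands in for the text model; Forward builds the compute graph for a
// batch, which is what forces the KV cache and graph allocations.
type Model interface {
	Forward(inputs, positions []int32) (logits []float32, err error)
}

// reserveWorstCase allocates up front by evaluating a full batch at the
// highest cache positions. If there is not enough memory, we fail here
// rather than during a later, real inference.
func reserveWorstCase(m Model, numCtx, batchSize int) error {
	inputs := make([]int32, batchSize) // placeholder tokens
	positions := make([]int32, batchSize)
	for i := range positions {
		positions[i] = int32(numCtx - batchSize + i)
	}
	_, err := m.Forward(inputs, positions)
	return err
}
```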
-
- 03 Apr, 2025 2 commits
-
-
Bruce MacDonald authored
Mistral is a popular research lab making open-source models. This updates the forward pass of llama-architecture models to support both llama and mistral models by accounting for the additional metadata present in mistral models and finding the correct dimensions for the output projection.
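A hedged sketch of the dimension handling, with hypothetical helpers rather than the real llama model code: some mistral checkpoints carry an explicit head_dim that does not equal hidden_size / num_heads, so the attention output projection has to be sized from that metadata when it is present.

```go
package llamamodel

// headDim prefers the explicit head_dim metadata carried by some mistral
// checkpoints and falls back to the usual llama derivation.
func headDim(hiddenSize, numHeads, explicitHeadDim int) int {
	if explicitHeadDim > 0 {
		return explicitHeadDim
	}
	return hiddenSize / numHeads
}

// outputProjectionShape returns {numHeads * headDim, hiddenSize}, the shape
// the attention output projection must have under these assumptions.
func outputProjectionShape(hiddenSize, numHeads, explicitHeadDim int) [2]int {
	d := headDim(hiddenSize, numHeads, explicitHeadDim)
	return [2]int{numHeads * d, hiddenSize}
}
```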
-
Michael Yang authored
-
- 02 Apr, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 21 Mar, 2025 1 commit
-
-
Michael Yang authored
-
- 20 Mar, 2025 3 commits
-
-
Jesse Gross authored
Rather than directly giving the input data to models, we can pass a tensor instead. In the short term, this saves some duplicated code.

Longer term, we will want to overlap setting up the next batch with processing of the current one. In this case, we will only have the shape of the tensor but it will not be loaded with data at the time of graph generation. By passing only a tensor to models now, we set up this possibility and prevent them from relying on data that they won't have in the future.

Although the same could be done for Positions and Outputs, in some cases we either need the raw input data or don't use them at all. Therefore, for now we leave them as they are and allow models to convert them to tensors as needed.
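A hedged sketch of the resulting batch shape, using illustrative types rather than the exact Ollama structs: token inputs arrive as a tensor whose shape is known at graph-build time, while Positions and Outputs remain raw slices that models convert themselves when needed.

```go
package model

// Tensor stands in for the backend tensor type; at graph-generation time a
// model may only be able to rely on its shape, not its data.
type Tensor interface {
	Shape() []int
}

// Batch is what a model's forward pass receives in this sketch.
type Batch struct {
	Inputs    Tensor  // token ids as a tensor; data may be filled in later
	Positions []int32 // raw; models convert to a tensor if they need one
	Outputs   []int32 // indices of the positions whose logits are required
}
```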
-
Jesse Gross authored
Options is no longer very descriptive of this struct.
-
Jesse Gross authored
Looks like a merge conflict that broke the model.
-
- 19 Mar, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 14 Mar, 2025 2 commits
-
-
Jesse Gross authored
Currently there is a single context per sequence, shared by all multimodal inputs. Since we build a vision encoder graph per image, with a large number of inputs we can eventually hit the maximum number of graph nodes per context. This change uses a separate context for each image, ensuring that the available resource limits are consistent.
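A hedged sketch of the per-image context pattern, with illustrative interfaces rather than the real backend API: each image's vision encoder graph is built in its own context, so the per-context graph-node limit is never exceeded no matter how many images a sequence contains.

```go
package runner

// Context and Backend stand in for the backend's graph-building types.
type Context interface{ Close() }

type Backend interface{ NewContext() Context }

type visionEncoder interface {
	Encode(ctx Context, image []byte) (embedding any, err error)
}

// encodeImages builds one vision graph per image, each in a fresh context,
// instead of sharing a single context across all of a sequence's images.
func encodeImages(b Backend, enc visionEncoder, images [][]byte) ([]any, error) {
	embeddings := make([]any, 0, len(images))
	for _, img := range images {
		ctx := b.NewContext()
		e, err := enc.Encode(ctx, img)
		if err != nil {
			ctx.Close()
			return nil, err
		}
		// The context stays alive while its embedding is in use and is
		// closed once the embedding has been consumed.
		embeddings = append(embeddings, e)
	}
	return embeddings, nil
}
```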
-
Jesse Gross authored
Models may require that a set of inputs all be processed as part of the same batch. For example, if an image has multiple patches with fully connected attention between them, we should not split the batch in the middle of an image.

Fixes #9697
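A hedged sketch of how such a constraint can be enforced when filling a batch, with hypothetical field names rather than the real scheduler's: an input declares how many of the following inputs must share its batch, and the whole group is deferred if it would not fit.

```go
package runner

// Input is one token or image placeholder in a sequence.
type Input struct {
	Token     int32
	SameBatch int // number of following inputs that must share this batch
}

// nextBatch fills a batch of at most batchSize inputs without splitting any
// group in the middle. batchSize is assumed to be at least as large as the
// largest group (e.g. all the patches of one image).
func nextBatch(inputs []Input, batchSize int) (batch, rest []Input) {
	n := 0
	for n < len(inputs) {
		group := 1 + inputs[n].SameBatch
		if n+group > len(inputs) {
			group = len(inputs) - n // clamp a malformed trailing group
		}
		if n+group > batchSize {
			break // defer the whole group to the next batch
		}
		n += group
	}
	return inputs[:n], inputs[n:]
}
```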
-
- 13 Mar, 2025 2 commits
-
-
Michael Yang authored
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
-
Michael Yang authored
-
- 12 Mar, 2025 1 commit
-
-
Bruce MacDonald authored
Softcap isn't in the whitepaper/implementation for the language model, so we should remove it. There is no discernible difference in output with it removed.
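For reference, the softcapping being removed is the tanh-based logit cap used in gemma2-style models; a minimal sketch of the operation itself, not the model code:

```go
package gemma

import "math"

// softcap squeezes a logit into (-limit, limit) via limit * tanh(logit/limit).
// This is the operation removed from the gemma3 language model.
func softcap(logit, limit float32) float32 {
	return limit * float32(math.Tanh(float64(logit/limit)))
}
```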
-
- 11 Mar, 2025 9 commits
-
-
jmorganca authored
-
jmorganca authored
-
jmorganca authored
-
Michael Yang authored
-
jmorganca authored
-
jmorganca authored
-
jmorganca authored
This reverts commit c7eae586b899083acebcd9b3847b89ea78c2850c.
-
Jesse Gross authored
This is useful for a few things:
- Work around bugs, such as having 2 images in one batch
- Keep the image in a single batch for fully connected attention
- Improve performance by not evaluating embeddings multiple times
-
Jesse Gross authored
Currently we are using positions, which are relative to a sequence and may not be unique.
-