- 22 May, 2025 2 commits
-
-
Jesse Gross authored
FromFloatSlice and FromIntSlice return an error if the shape doesn't match the passed data or if memory can't be allocated. Since these are inputs, the memory being allocated is system memory rather than VRAM. In many cases, the caller can't really handle the error and panics anyway. Empty and Zeros already panic directly if they can't allocate memory. This makes things consistent by panicking in the first two cases as well, removing a fair amount of error handling code. It is also consistent with how Go typically handles these situations.
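A minimal sketch of the new behavior, using hypothetical types and names rather than the actual ollama ml package:

```go
package main

import "fmt"

// Tensor is a stand-in for the real backend tensor type.
type Tensor struct {
	data  []float32
	shape []int
}

// FromFloatSlice now panics on a shape mismatch (or allocation failure),
// matching Empty and Zeros, instead of returning an error the caller
// usually couldn't handle anyway.
func FromFloatSlice(data []float32, shape ...int) *Tensor {
	n := 1
	for _, d := range shape {
		n *= d
	}
	if n != len(data) {
		panic(fmt.Errorf("FromFloatSlice: shape %v does not match %d elements", shape, len(data)))
	}
	return &Tensor{data: data, shape: shape}
}

func main() {
	t := FromFloatSlice([]float32{1, 2, 3, 4, 5, 6}, 2, 3)
	fmt.Println(t.shape) // [2 3]
	// FromFloatSlice([]float32{1, 2, 3}, 2, 3) would panic
}
```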
-
Michael Yang authored
* fix mllama convert
  - transform attn_gate and ffn_gate
  - swap attention heads for vision models
* fix mllama: the mlp gate was applied in the wrong place
-
- 21 May, 2025 3 commits
-
-
Michael Yang authored
-
Michael Yang authored
* feat: qwen3 dense
* feat: qwen3moe
* fix llama4 moe
-
Michael Yang authored
Setting SameBatch on the vision start token is problematic because that token will be shared with other inputs that also use images. This causes the input to be cached, so the runner will not see SameBatch, and SameBatch may also be incorrect since it could refer to a different image. Assigning SameBatch to the image's input tokens resolves this by ensuring it is attached to the inputs that actually correspond to the image. Not setting SameBatch correctly may cause panics during inference, since images are no longer guaranteed to be in the same batch.
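A rough sketch of the fix, assuming an Input struct with Token, Multimodal, and SameBatch fields; the real runner types differ:

```go
package model

// Input is a simplified stand-in for the runner's input type.
type Input struct {
	Token      int32
	Multimodal any
	SameBatch  int // number of following inputs that must share this batch
}

// imageInputs attaches SameBatch to the image's own placeholder tokens
// rather than the shared vision start token. The start token may be cached
// and reused for other images, so a SameBatch value set on it could be
// lost or point at the wrong image.
func imageInputs(visionStart, placeholder int32, embedding any, nTokens int) []Input {
	inputs := []Input{{Token: visionStart}} // shared token: no SameBatch here
	for i := 0; i < nTokens; i++ {
		in := Input{Token: placeholder}
		if i == 0 {
			in.Multimodal = embedding
			in.SameBatch = nTokens // keep all of this image's tokens together
		}
		inputs = append(inputs, in)
	}
	return inputs
}
```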
-
- 20 May, 2025 1 commit
-
-
Michael Yang authored
-
- 19 May, 2025 2 commits
-
-
Michael Yang authored
* fix llama model
* fix mistral3.1 model: do not set default vision layers
-
Jesse Gross authored
Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This separates them into two steps:
- Create the backend, including enumerating tensors and allocating memory
- Load the tensor data
This allows more flexibility in managing model loading.
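A minimal sketch of the split, with hypothetical names; the real ml.Backend API is richer:

```go
package ml

import "os"

// tensor holds an allocated buffer whose data has not yet been read.
type tensor struct {
	offset int64
	data   []byte
}

// Backend is created quickly: tensors are enumerated and memory is
// allocated, but no weights are read yet.
type Backend struct {
	f       *os.File
	tensors map[string]*tensor
}

func New(path string) (*Backend, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	b := &Backend{f: f, tensors: map[string]*tensor{}}
	// ... parse metadata, record offsets, allocate buffers ...
	return b, nil
}

// Load is the slow step, performed separately: read each tensor's data
// into the buffer allocated by New.
func (b *Backend) Load() error {
	for _, t := range b.tensors {
		if _, err := b.f.ReadAt(t.data, t.offset); err != nil {
			return err
		}
	}
	return nil
}
```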
-
- 16 May, 2025 1 commit
-
-
Michael Yang authored
* get eos_token_id from generation_config.json
* refactor
* include both ids and strings in trace
* comments
* remove special case for gemma3 special vocab (#10743)
-
- 15 May, 2025 2 commits
-
-
Jesse Gross authored
For some multimodal models (such as gemma3), we create a single graph that generates the image embedding and then use this in the text model. The embedding tensor is completely opaque to the runner.

However, this doesn't work if we need to use the embedding in multiple batches, which can arise if the embedding is larger than the batch size. In these cases (as with llama4), we would like to create views that are more appropriately sized. However, if we do this then the original source tensor is used in multiple graphs, which isn't allowed. To avoid that problem, models with this pattern compute the embedding tensor on first use and recreate the individual views. There is no longer a single vision and text graph.

This codifies the pattern of separating vision and text graphs. The logic of computing tensors on demand is moved to the runner, so models no longer have to worry about it. It also gives the runner visibility into the multimodal tensors, which is important for memory management.
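A rough sketch of the runner-side pattern, using hypothetical types; the actual interfaces differ in detail:

```go
package runner

// Tensor stands in for an opaque backend tensor.
type Tensor struct{}

// multimodal tracks an image embedding produced by a separate vision graph.
type multimodal struct {
	embedding *Tensor
	computed  bool
}

// forBatch is called when a text batch needs part of the image embedding.
// The vision graph runs once, on first use, and a view sized to the batch
// is created each time, so the source tensor is never shared between
// multiple text graphs.
func (m *multimodal) forBatch(computeVision func() *Tensor, view func(t *Tensor, offset, n int) *Tensor, offset, n int) *Tensor {
	if !m.computed {
		m.embedding = computeVision()
		m.computed = true
	}
	return view(m.embedding, offset, n)
}
```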
-
Michael Yang authored
* panic if trying to pad 4d
* fix pixel values padding
-
- 14 May, 2025 2 commits
-
-
Bruce MacDonald authored
-
Michael Yang authored
-
- 13 May, 2025 1 commit
-
-
Michael Yang authored
-
- 12 May, 2025 2 commits
-
-
Bruce MacDonald authored
-
Michael Yang authored
reduce prompt log to trace level
-
- 26 Apr, 2025 1 commit
-
-
Michael Yang authored
-
- 25 Apr, 2025 6 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
Co-authored-by: Patrick Devine <patrick@infrahq.com>
-
Michael Yang authored
-
Michael Yang authored
-
- 24 Apr, 2025 1 commit
-
-
Parth Sareen authored
-
- 18 Apr, 2025 1 commit
-
-
Michael Yang authored
-
- 08 Apr, 2025 1 commit
-
-
Jesse Gross authored
Currently, the KV cache and graph are lazily allocated as needed. The cache is fully allocated on first use of the corresponding layer, whereas the graph grows with the size of the context. This can be an issue if another application allocates more VRAM after we do our calculations: Ollama will crash in the middle of inference.

If we instead allocate the maximum needed memory at startup of the runner, we will either succeed or fail at that point rather than at some surprising time in the future. Currently, this only generates a worst-case batch for text, which means that vision models may get a partial allocation and continue to lazily allocate the rest.
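An illustrative sketch of reserving a worst-case batch at startup, with hypothetical names; the real runner's reservation logic differs:

```go
package runner

// reserveWorstCase runs a maximum-size text batch through the forward pass
// at startup so KV cache and graph memory are claimed up front. If another
// process has taken VRAM, this fails immediately instead of mid-inference.
func reserveWorstCase(batchSize, contextLen int, forward func(tokens, positions []int32) error) error {
	tokens := make([]int32, batchSize)
	positions := make([]int32, batchSize)
	for i := range positions {
		// Place the batch at the end of the context: the largest graph.
		positions[i] = int32(contextLen - batchSize + i)
	}
	return forward(tokens, positions)
}
```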
-
- 03 Apr, 2025 2 commits
-
-
Bruce MacDonald authored
Mistral is a popular research lab making open-source models. This updates the forward pass of llama-architecture models to support both llama and mistral models by accounting for additional metadata present in mistral models and finding the correct dimensions for the output projection.
-
Michael Yang authored
-
- 02 Apr, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 21 Mar, 2025 1 commit
-
-
Michael Yang authored
-
- 20 Mar, 2025 3 commits
-
-
Jesse Gross authored
Rather than directly giving the input data to models, we can pass a tensor instead. In the short term, this saves some duplicated code.

Longer term, we will want to overlap setting up the next batch with processing of the current one. In that case, we will only have the shape of the tensor; it will not be loaded with data at the time of graph generation. By passing only a tensor to models now, we set up this possibility and prevent them from relying on data that they won't have in the future.

Although the same could be done for Positions and Outputs, in some cases we either need the raw input data or don't use them at all. Therefore, for now we leave them as they are and allow models to convert them to tensors as needed.
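A before/after sketch with hypothetical types; the actual model interfaces differ:

```go
package model

// Tensor stands in for a backend tensor; at graph-construction time only
// its shape may be known, not its data.
type Tensor interface{ Shape() []int }

type Batch struct {
	// Before: models received the raw token values.
	//   Inputs []int32
	// After: models receive only a tensor, so they cannot rely on data that
	// may not be loaded yet when the next batch is prepared in parallel.
	Inputs Tensor

	// Positions and Outputs stay as raw values for now; models convert them
	// to tensors themselves when needed.
	Positions []int32
	Outputs   []int32
}
```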
-
Jesse Gross authored
Options is no longer very descriptive of this struct.
-
Jesse Gross authored
Looks like a merge conflict that broke the model.
-
- 19 Mar, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 14 Mar, 2025 2 commits
-
-
Jesse Gross authored
Currently there is a single context per sequence, shared by all multimodal inputs. Since we build a vision encoder graph per image, with a large number of inputs we can eventually hit the maximum number of graph nodes per context. This changes to using a separate context for each image, ensuring that the available resource limits are consistent.
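A sketch of the per-image context idea, with hypothetical names; the real runner's context handling differs:

```go
package runner

// Context and Tensor are stand-ins for the backend's graph context and
// tensor types.
type Context interface{}
type Tensor interface{}

// encodeImages builds each vision encoder graph in its own context, so the
// per-context limit on graph nodes applies per image rather than to every
// image in the sequence combined.
func encodeImages(newContext func() Context, encode func(Context, []byte) Tensor, images [][]byte) []Tensor {
	embeddings := make([]Tensor, 0, len(images))
	for _, img := range images {
		ctx := newContext() // fresh context per image
		embeddings = append(embeddings, encode(ctx, img))
	}
	return embeddings
}
```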
-
Jesse Gross authored
Models may require that a set of inputs all be processed as part of the same batch. For example, if an image has multiple patches with fully connected attention between them, we should not split the batch in the middle of an image.

Fixes #9697
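A sketch of the constraint, using hypothetical types; the actual runner's batching code differs:

```go
package runner

// Input is a simplified stand-in for the runner's input type.
type Input struct {
	Token     int32
	SameBatch int // number of following inputs that must share this batch
}

// groupFits reports whether the input at index i, together with any inputs
// it pins via SameBatch (for example, all patches of one image), fits in
// the remaining room of the current batch. If not, the batch is flushed
// rather than split in the middle of the image.
func groupFits(inputs []Input, i, room int) bool {
	need := 1 + inputs[i].SameBatch
	return need <= room
}
```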
-
- 13 Mar, 2025 2 commits
-
-
Michael Yang authored
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
-
Michael Yang authored
-
- 12 Mar, 2025 1 commit
-
-
Bruce MacDonald authored
Softcap isn't in the whitepaper/implementation for the language model, so we should remove it. There is no discernible difference in output with it removed.
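For reference, logit softcapping (the operation removed here, as commonly defined) squashes attention scores as cap * tanh(score / cap); a minimal sketch:

```go
package attention

import "math"

// softcap squashes attention scores in place as capVal * tanh(score / capVal);
// this is the operation that was removed from the language model.
func softcap(scores []float32, capVal float32) {
	for i, s := range scores {
		scores[i] = capVal * float32(math.Tanh(float64(s/capVal)))
	}
}
```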
-
- 11 Mar, 2025 1 commit
-
-
jmorganca authored
-