Commits · 603ceefaa67feee627e01cae1df1e0642e1c868f · OpenDAS / ollama

08 Dec, 2025 1 commit

Michael Yang authored Nov 18, 2025

change to a flatter directory structure and group the options with the
function

update models to call rope in one place

603ceefa

02 Dec, 2025 1 commit

model: ministral w/ llama4 scaling (#13292) · d3e0a0de

Patrick Devine authored Dec 01, 2025



This change:

* fixes rope scaling in the mistral converter
* updates ministral to include llama4 scaling
* includes a new ministral parser for parsing reasoning and tool calling

---------
Co-authored-by: jmorganca <jmorganca@gmail.com>

d3e0a0de

13 Nov, 2025 1 commit

chore: update models to use slice/chunk/chunksections (#12934) · 333203d8

Michael Yang authored Nov 13, 2025

* use slice/chunks

* bert

* llama4

* gemma3n

* gptoss

* mistral3

* qwen3vl

* qwen25vl

* deepseek2

* remove unused ops

333203d8

28 Oct, 2025 1 commit
- s/From*Slice/From*s/ (#12255) · 1188f408
  Michael Yang authored Oct 28, 2025
  
  1188f408
23 Sep, 2025 1 commit
- multi-regexp pretokenizer (#12325) · a40d427b
  Michael Yang authored Sep 23, 2025
  
  a40d427b
17 Sep, 2025 1 commit
- fix(llama): other llama flavours (#12308) · 564b558c
  Michael Yang authored Sep 17, 2025
```
* fix(llama): rope scale

* spm llama

* skip moe models

* cleanup
```
  564b558c
16 Sep, 2025 1 commit
- use split activations when possible (#12293) · ad95d5b3
  Michael Yang authored Sep 16, 2025
```
* use ggml_*_split activations when possible

* forward qkv
```
  ad95d5b3
15 Sep, 2025 1 commit
- batch: use tensors for outputs (#12185) · 6f711714
  Michael Yang authored Sep 15, 2025
```
this cleans up the model interface slightly without too much impact in
other areas
```
  6f711714
29 Aug, 2025 1 commit

perf: build graph for next batch async to keep GPU busy (#11863) · 517807cd

Daniel Hiltgen authored Aug 29, 2025

* perf: build graph for next batch in parallel to keep GPU busy

This refactors the main run loop of the ollama runner to perform the main
GPU-intensive tasks (Compute+Floats) in a goroutine so we can prepare the next
batch in parallel to reduce the amount of time the GPU stalls waiting for the
next batch of work.

* tests: tune integration tests for ollama engine

This tunes the integration tests to focus more on models supported
by the new engine.
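
The run-loop change described in this commit can be illustrated with a minimal sketch. It is not the runner's actual code: `batch`, `prepare`, `compute`, and `runPipelined` are hypothetical names, and the "GPU" step is simulated with a plain function. The sketch only shows the overlap pattern, where the compute step for batch i runs in a goroutine while the loop prepares batch i+1.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for the runner's prepared unit of work.
type batch struct{ id int }

// prepare stands in for CPU-side work such as building the next graph.
func prepare(id int) batch { return batch{id: id} }

// compute stands in for the GPU-intensive step (assumed, simulated here).
func compute(b batch) string { return fmt.Sprintf("result-%d", b.id) }

// runPipelined prepares batch i+1 while batch i is still being computed.
func runPipelined(n int) []string {
	results := make([]string, n)
	var pending chan struct{} // closed when the in-flight compute finishes

	for i := 0; i < n; i++ {
		b := prepare(i) // overlaps with the previous batch's compute

		if pending != nil {
			<-pending // only one compute is in flight at a time
		}

		done := make(chan struct{})
		pending = done
		go func(i int, b batch) {
			defer close(done)
			results[i] = compute(b)
		}(i, b)
	}

	if pending != nil {
		<-pending // wait for the final batch
	}
	return results
}

func main() {
	fmt.Println(runPipelined(3)) // prints [result-0 result-1 result-2]
}
```

Waiting on `pending` before launching the next goroutine keeps results ordered and limits the pipeline to one batch of lookahead, which is one simple way to bound memory use.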

517807cd

25 Aug, 2025 1 commit
- remove extra field attr (#11205) · 30fb7e19
  Michael Yang authored Aug 25, 2025
  
  30fb7e19
22 May, 2025 1 commit

ml: Panic rather than return error on tensor allocation failure · 1f371ea9

Jesse Gross authored May 19, 2025

FromFloatSlice and FromIntSlice return an error if the shape doesn't
match the passed data or if memory can't be allocated. Since these
are inputs, the memory being allocated is system memory rather than VRAM.

In many cases, the caller can't really handle the error and panics.

Empty and Zeros directly panic if they can't allocate memory.

This makes things consistent by panicking for the first two cases,
removing a fair amount of error handling code. This is also consistent
with how Go typically handles these situations.
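
As a rough illustration of the pattern, the sketch below shows a constructor that panics on a shape mismatch instead of returning an error. The `tensor` type and `fromFloats` function are simplified, hypothetical stand-ins and do not reproduce the real `ml` package API.

```go
package main

import "fmt"

// tensor is a hypothetical, simplified stand-in for the real tensor type.
type tensor struct {
	data  []float32
	shape []int
}

// fromFloats builds a tensor and panics if the shape does not match the
// number of elements, so callers have no error value to handle.
func fromFloats(data []float32, shape ...int) tensor {
	n := 1
	for _, d := range shape {
		n *= d
	}
	if n != len(data) {
		panic(fmt.Errorf("invalid shape %v for %d elements", shape, len(data)))
	}
	return tensor{data: data, shape: shape}
}

func main() {
	// The call site needs no `if err != nil` branch.
	t := fromFloats([]float32{1, 2, 3, 4, 5, 6}, 2, 3)
	fmt.Println(t.shape) // prints [2 3]
}
```

A caller that really can recover from a bad input can still wrap the call in a function with `defer` and `recover`, but the common path stays free of error plumbing.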

1f371ea9

20 May, 2025 1 commit
- ml: add more rope options (#10775) · 9ed8bf14
  Michael Yang authored May 20, 2025
  
  9ed8bf14
19 May, 2025 1 commit
- fix llama and mistral3 models (#10774) · ff180c34
  Michael Yang authored May 19, 2025
```
* fix llama model

* fix mistral3.1 model

do not set default vision layers
```
  ff180c34
16 May, 2025 1 commit

model: handle multiple eos tokens (#10577) · 333e3604

Michael Yang authored May 16, 2025

* get eos_token_id from generation_config.json

* refactor

* include both ids and strings in trace

* comments

* remove special case for gemma3 special vocab (#10743)
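
To make the first item concrete: in Hugging Face style `generation_config.json` files, `eos_token_id` may be a single integer or a list of integers. The sketch below shows one way to accept both forms in Go. The type and function names (`eosIDs`, `generationConfig`, `parseEOS`, `isEOS`) and the token ids are illustrative assumptions, not the code from this commit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// eosIDs accepts either a single integer or a list of integers.
type eosIDs []int32

// UnmarshalJSON tries the list form first, then falls back to a single id.
func (e *eosIDs) UnmarshalJSON(b []byte) error {
	var many []int32
	if err := json.Unmarshal(b, &many); err == nil {
		*e = many
		return nil
	}
	var one int32
	if err := json.Unmarshal(b, &one); err != nil {
		return err
	}
	*e = eosIDs{one}
	return nil
}

// generationConfig models only the field this sketch needs.
type generationConfig struct {
	EOSTokenID eosIDs `json:"eos_token_id"`
}

// parseEOS extracts all end-of-sequence ids from the raw config bytes.
func parseEOS(raw []byte) ([]int32, error) {
	var cfg generationConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	return cfg.EOSTokenID, nil
}

// isEOS reports whether id matches any of the configured ids.
func isEOS(ids []int32, id int32) bool {
	for _, v := range ids {
		if v == id {
			return true
		}
	}
	return false
}

func main() {
	// The ids below are made up for illustration.
	ids, _ := parseEOS([]byte(`{"eos_token_id": [1, 106]}`))
	fmt.Println(ids, isEOS(ids, 106)) // prints [1 106] true

	ids, _ = parseEOS([]byte(`{"eos_token_id": 2}`))
	fmt.Println(ids) // prints [2]
}
```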

333e3604

15 May, 2025 1 commit

ollamarunner: Separate text and multimodal graphs · 3c14461d

Jesse Gross authored May 05, 2025

For some multimodal models (such as gemma3), we create a single
graph that generates the image embedding and then use this in the
text model. The embedding tensor is completely opaque to the runner.

However, this doesn't work if we need to use the embedding in multiple
batches. This can arise if the embedding is larger than the batch size.
In these cases (as with llama4), we would like to create views that
are more appropriately sized. However, if we do this then the original
source tensor is used in multiple graphs, which isn't allowed. To
avoid that problem, models with this pattern compute the embedding
tensor on first use and recreate the individual views. There is no
longer a single vision and text graph.

This codifies the pattern of separating vision and text graphs. The
logic of computing tensors on demand is moved to the runner, so models
no longer have to worry about this. It also gives the runner visibility
into the multimodal tensors, which is important for memory management.
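
The compute-on-first-use idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the runner's implementation: `multimodal`, `get`, and `chunks` are hypothetical names, and Go slices stand in for tensors and tensor views.

```go
package main

import "fmt"

// multimodal is a hypothetical holder for an embedding that is computed lazily.
type multimodal struct {
	compute func() []float32 // stands in for running the vision graph
	data    []float32
	done    bool
}

// get runs the vision computation on first use and reuses the result afterwards.
func (m *multimodal) get() []float32 {
	if !m.done {
		m.data = m.compute()
		m.done = true
	}
	return m.data
}

// chunks returns batch-sized views over the cached embedding, so an embedding
// larger than the batch size can be fed to the text model across batches.
func (m *multimodal) chunks(batchSize int) [][]float32 {
	if batchSize <= 0 {
		panic("batchSize must be positive")
	}
	data := m.get()
	var out [][]float32
	for start := 0; start < len(data); start += batchSize {
		end := start + batchSize
		if end > len(data) {
			end = len(data)
		}
		out = append(out, data[start:end]) // a view, not a copy
	}
	return out
}

func main() {
	calls := 0
	m := &multimodal{compute: func() []float32 {
		calls++
		return []float32{1, 2, 3, 4, 5}
	}}

	fmt.Println(m.chunks(2)) // prints [[1 2] [3 4] [5]]
	fmt.Println(m.chunks(2)) // same views, no recomputation
	fmt.Println(calls)       // prints 1
}
```

Keeping the cache in one place, rather than in each model, mirrors the motivation stated above: whoever owns the cache can also account for the memory it holds.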

3c14461d

13 May, 2025 1 commit
- fix vocabulary (#10679) · 526b2ed1
  Michael Yang authored May 12, 2025
  
  526b2ed1
25 Apr, 2025 1 commit
- fix token type · d26c18e2
  Michael Yang authored Apr 23, 2025
  
  d26c18e2
24 Apr, 2025 1 commit
- llama: remove model loading for grammar (#10096) · a53d744b
  Parth Sareen authored Apr 24, 2025
  
  a53d744b
03 Apr, 2025 1 commit

model: support for mistral-small in the ollama runner · 6bd0a983

Bruce MacDonald authored Mar 14, 2025

Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.

6bd0a983