Commits · 603ceefaa67feee627e01cae1df1e0642e1c868f · OpenDAS / ollama

08 Dec, 2025 1 commit

Michael Yang authored Nov 18, 2025

change to a flatter directory structure and group the options with the
function

update models to call rope in one place

603ceefa

02 Dec, 2025 1 commit

model: ministral w/ llama4 scaling (#13292) · d3e0a0de

Patrick Devine authored Dec 01, 2025



This change:

* fixes rope scaling in the mistral converter
* updates ministral to include llama4 scaling
* includes a new ministral parser for parsing reasoning and tool calling

---------
Co-authored-by: jmorganca <jmorganca@gmail.com>

d3e0a0de

20 Nov, 2025 1 commit

deepseek2: upgrade to run v3+ models (#13166) · 5c1063df

Michael Yang authored Nov 19, 2025

the check for mla omits v3 and r1 which should not return unsupported.
instead check the tokenizer for compatibility

5c1063df

19 Nov, 2025 3 commits
- models: enable deepseek2 (deepseek v3.1 w/ MLA) on the new engine (#13151) · 604e43b2
  Patrick Devine authored Nov 18, 2025
  
  604e43b2
- nomic-embed-text model implementation (#13071) · 8de30b56
  nicole pardal authored Nov 18, 2025
  
  8de30b56
- deepseekocr · 92981ae3
  Michael Yang authored Oct 31, 2025
  
  92981ae3
18 Nov, 2025 1 commit
- Add deepseek v3.1 (#13063) · 584e2d64
  Grace authored Nov 17, 2025
```
* Add mla for flash attention
* Revert to using chunks
```
  584e2d64
13 Nov, 2025 1 commit

chore: update models to use slice/chunk/chunksections (#12934) · 333203d8

Michael Yang authored Nov 13, 2025

* use slice/chunks

* bert

* llama4

* gemma3n

* gptoss

* mistral3

* qwen3vl

* qwen25vl

* deepseek2

* remove unused ops

333203d8

06 Nov, 2025 1 commit
- ggml update to b6840 (#12791) · 544b6739
  Daniel Hiltgen authored Nov 06, 2025
  
  544b6739
03 Nov, 2025 1 commit
- chore(gptoss): cleanup dead code (#12932) · ce3eb0a3
  Michael Yang authored Nov 03, 2025
  
  ce3eb0a3
30 Oct, 2025 2 commits
- interleaved mrope (#12807) · f67a6df1
  Michael Yang authored Oct 30, 2025
```
* ml(ggml): mrope
* interleave mrope
```
  f67a6df1
- fix: qwen2.5vl, qwen3vl composite image (#12841) · d432ade7
  Michael Yang authored Oct 30, 2025
```
this change fixes images with an alpha channel by overlaying the image
onto a white background
```
  d432ade7
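
The alpha-channel fix described above can be illustrated with Go's standard `image/draw` package. This is a minimal sketch, not the code from the commit: `compositeOnWhite` and `sampleImage` are hypothetical names introduced here for illustration.

```go
package main

import (
	"fmt"
	"image"
	"image/color"
	"image/draw"
)

// compositeOnWhite (hypothetical helper) draws src over an opaque white
// canvas of the same size, so transparent pixels come out white.
func compositeOnWhite(src image.Image) *image.RGBA {
	b := src.Bounds()
	dst := image.NewRGBA(b)
	// Fill the canvas with opaque white.
	draw.Draw(dst, b, image.NewUniform(color.White), image.Point{}, draw.Src)
	// Alpha-blend the source on top of the white canvas.
	draw.Draw(dst, b, src, b.Min, draw.Over)
	return dst
}

// sampleImage builds a 2x1 test image: one fully transparent red pixel
// and one opaque red pixel.
func sampleImage() *image.NRGBA {
	img := image.NewNRGBA(image.Rect(0, 0, 2, 1))
	img.SetNRGBA(0, 0, color.NRGBA{R: 255, A: 0})
	img.SetNRGBA(1, 0, color.NRGBA{R: 255, A: 255})
	return img
}

func main() {
	out := compositeOnWhite(sampleImage())
	fmt.Println(out.RGBAAt(0, 0), out.RGBAAt(1, 0))
}
```

The transparent pixel ends up white and the opaque pixel is left unchanged, which is the behaviour the commit message describes.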
29 Oct, 2025 1 commit
- feat(model): add qwen3vl (#12665) · 7d25b9e1
  Michael Yang authored Oct 28, 2025
  
  7d25b9e1
28 Oct, 2025 2 commits
- s/From*Slice/From*s/ (#12255) · 1188f408
  Michael Yang authored Oct 28, 2025
  
  1188f408
- gemma3: make embedding non-causal (#12297) · ec9eb28f
  Michael Yang authored Oct 27, 2025
  
  ec9eb28f
18 Oct, 2025 1 commit
- contiguous input per layer (#12686) · bc1a818f
  Daniel Hiltgen authored Oct 17, 2025
```
Co-authored-by: Michael Yang <git@mxy.ng>
```
  bc1a818f
13 Oct, 2025 1 commit
- fix(qwen3): deepseek distill · 6c833d5f
  Michael Yang authored Oct 13, 2025
```
deepseek's qwen3 distill uses a different rope scheme so support both
```
  6c833d5f
09 Oct, 2025 2 commits
- refactor: use builtin max and min · 47298fce
  shengxinjing authored Sep 28, 2025
  
  47298fce
- refactor: use builtin max and min · 4a48937e
  shengxinjing authored Sep 25, 2025
  
  4a48937e
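
For context, `min` and `max` have been language builtins since Go 1.21, so hand-written helpers can be dropped. The sketch below only illustrates that kind of refactor; `clamp` is a hypothetical function, not one taken from these commits.

```go
package main

import "fmt"

// clamp limits v to the range [lo, hi] using the min and max builtins
// (Go 1.21+) instead of hand-written comparison helpers.
func clamp(v, lo, hi int) int {
	return min(max(v, lo), hi)
}

func main() {
	fmt.Println(clamp(-5, 0, 10), clamp(5, 0, 10), clamp(50, 0, 10)) // 0 5 10
}
```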
03 Oct, 2025 1 commit
- Fixed Deepseek2 adding nil tensor error · 33801c15
  Grace authored Oct 03, 2025
  
  33801c15
24 Sep, 2025 1 commit

Grace/deepseek v3 migration (#12385) · fbd82ba5

Grace authored Sep 24, 2025



* init deepseek model file

* temp removal of flash attention implementation

* shapes and proper, can make a pass

* query, key, value have good cosine similarity, but the max diff is a bit high

* Attention block is working! ** with eager for now, have not added the mask line

* Attention block is working! ** with eager for now, have not added the mask line

* working MoE at around 0.95 cosine sim

* added cosine similarity function

* Starting end to end structure

* Trying (and failing) to get rope to work, going to test full thing on tater

* running on tater36... just not the right outputs

* we have the right values for rope... but it's still not working?

* change Extrapolation Factor to 1

* removed adding residuals twice, removed normalization from shared expert, refactored Norms (Attention, MLP) to be outside the (Attention, MLP) blocks and in the Transformer block instead, add cache setLayer

* Temporary modelfiles for cpu

* change kpass intermediate step to kv, two layer outputs [0,1] look fine

* this calls for 16 chicken nuggets

* whoops

* cleaning up code

* delete stuff we dont need

* getting rid of debug statements for llama cpp

* working with long contexts

* fix long context view error

* reverting some changes I made for files that are not part of the PR

* Added proper tokenizer for deepseek3

* clean up model and go test

* remove Modelfile

* not passing the tests

* whoops

* how to pass the ci tests

* resolving some of the comments

* rename

* linted and renamed deepseek3 -> deepseek2

* remove name go

* addressed changes - main change was adopting qwen3 naming scheme

* I cannot with linters

* clean up logs

* clean up logs

---------
Co-authored-by: Grace Guo <graceguo@Graces-MBP.localdomain>
Co-authored-by: Grace Guo <graceguo@Graces-MacBook-Pro.local>
Co-authored-by: graceguo <graceguo@tater36.localdomain>

fbd82ba5

23 Sep, 2025 2 commits
- add pre:, suf: to tags (#12274) · bf78ed6e
  Michael Yang authored Sep 23, 2025
  
  bf78ed6e
- multi-regexp pretokenizer (#12325) · a40d427b
  Michael Yang authored Sep 23, 2025
  
  a40d427b
19 Sep, 2025 1 commit
- gemma: fix rope scaling for qat models (#12348) · dba39b2e
  Patrick Devine authored Sep 19, 2025
```
* gemma: fix rope scaling for qat models

* gofumpt yourself
```
  dba39b2e
18 Sep, 2025 1 commit
- feat: qwen3 embed (#12301) · 7460259e
  Michael Yang authored Sep 18, 2025
```
* cleanup

* use pooling.TypeNone

* pooling test

* qwen3 embed
```
  7460259e
17 Sep, 2025 1 commit
- fix(llama): other llama flavours (#12308) · 564b558c
  Michael Yang authored Sep 17, 2025
```
* fix(llama): rope scale

* spm llama

* skip moe models

* cleanup
```
  564b558c
16 Sep, 2025 2 commits
- use split activations when possible (#12293) · ad95d5b3
  Michael Yang authored Sep 16, 2025
```
* use ggml_*_split activations when possible

* forward qkv
```
  ad95d5b3
- embed: cleanup (#12299) · c253433d
  Michael Yang authored Sep 16, 2025
```
* cleanup

* use pooling.TypeNone

* pooling test
```
  c253433d
15 Sep, 2025 2 commits

model: implement bert in ollama engine (#9080) · 3f6642f6

Michael Yang authored Sep 15, 2025

* fix truncate

* s/SentencePieceModel/SentencePiece/

* bert

* wordpiece

* refactor pooling

* more tokenizers

* normalize embeddings

3f6642f6
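
The "normalize embeddings" step listed above is not spelled out in the commit message. A common choice is L2 normalization, sketched below under that assumption; `l2Normalize` is a hypothetical helper, not the function from the commit.

```go
package main

import (
	"fmt"
	"math"
)

// l2Normalize (hypothetical helper) returns a copy of v scaled to unit
// Euclidean length. ASSUMPTION: the commit's normalization is L2.
// A zero vector is returned as all zeros to avoid dividing by zero.
func l2Normalize(v []float32) []float32 {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	norm := math.Sqrt(sum)
	out := make([]float32, len(v))
	if norm == 0 {
		return out
	}
	for i, x := range v {
		out[i] = float32(float64(x) / norm)
	}
	return out
}

func main() {
	fmt.Println(l2Normalize([]float32{3, 4})) // [0.6 0.8]
}
```

With unit-length vectors, cosine similarity between two embeddings reduces to a plain dot product.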

batch: use tensors for outputs (#12185) · 6f711714

Michael Yang authored Sep 15, 2025

this cleans up the model interface slightly without too much impact in
other areas

6f711714

04 Sep, 2025 1 commit
- embedding gemma model (#12181) · 5994e8e8
  Michael Yang authored Sep 04, 2025
```
* ollama: add embeddings
```
  5994e8e8
29 Aug, 2025 1 commit

perf: build graph for next batch async to keep GPU busy (#11863) · 517807cd

Daniel Hiltgen authored Aug 29, 2025

* perf: build graph for next batch in parallel to keep GPU busy

This refactors the main run loop of the ollama runner to perform the main GPU
intensive tasks (Compute+Floats) in a go routine so we can prepare the next
batch in parallel to reduce the amount of time the GPU stalls waiting for the
next batch of work.

* tests: tune integration tests for ollama engine

This tunes the integration tests to focus more on models supported
by the new engine.

517807cd
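
The overlap described in this commit, where the compute step runs in a goroutine while the next batch is prepared, can be sketched as a small pipeline. This is an illustration only: `batch`, `prepare`, `compute`, and `runPipelined` are hypothetical stand-ins, not the runner's actual types or functions.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for a prepared unit of work.
type batch struct{ id int }

// prepare simulates the CPU-side work of building the graph for batch i.
func prepare(i int) batch { return batch{id: i} }

// compute simulates the GPU-intensive step and returns its result.
func compute(b batch) int { return b.id * b.id }

// runPipelined overlaps preparing batch i+1 with computing batch i.
func runPipelined(n int) []int {
	results := make([]int, 0, n)
	if n == 0 {
		return results
	}
	next := prepare(0)
	for i := 0; i < n; i++ {
		cur := next
		done := make(chan int, 1)
		// Run the heavy step in the background.
		go func() { done <- compute(cur) }()
		if i+1 < n {
			// Meanwhile, build the next batch on this goroutine.
			next = prepare(i + 1)
		}
		// Wait for the current batch before moving on.
		results = append(results, <-done)
	}
	return results
}

func main() {
	fmt.Println(runPipelined(4)) // [0 1 4 9]
}
```

Results still come back in order because each iteration waits for its own batch; only the preparation of the following batch runs concurrently.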

25 Aug, 2025 1 commit
- remove extra field attr (#11205) · 30fb7e19
  Michael Yang authored Aug 25, 2025
  
  30fb7e19
14 Aug, 2025 1 commit

update vendored llama.cpp and ggml (#11823) · 1a19df1f

Michael Yang authored Aug 14, 2025

* TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch

This will be redone once my branch is merged upstream in llama.cpp

* feat: Update all patches

There are a number that are no longer needed at all:

- 0003-embeddings: Embeddings entirely overhauled on master
- 0008-ensure-KV-cache-is-fully-defragmented: KV caching entirely
    overhauled on master
- 0019-metal-add-mean-kernel-14267: Merged upstream
- 0020-CUDA-add-mean-operation-14313: Merged upstream

* feat: Sync llama.cpp and ggml

* fix: Update rsync-filter for all moved/new/removed files

* fix: Add files missing from sync

* fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs

* fix: Add ggml files missing from sync

* fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files

* fix: Remove mtmd main cpp files

* fix: Add missing include in sampling_ext.cpp

* fix: Update llama.go to use mtmd instead of clip/llava

* fix: Add patch for mtmd_input_text

* chore: Ignore *.patched in the patch directory

* fix: Fix support for arch-specific ggml-cpu source files with new arrangement

In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific
implementations were split out into a nested tree structure under
ggml-cpu/arch. This conflicts with standard CGO layout where all
arch-specific source files are expected to live in the same directory as
the parent go module and use suffixes based on GOOS and GOARCH. As such,
there were really two options for getting this to work:

1. Add a patch on top of the GGML sync to rearrange the files to match the
GO layout convention
2. Use CGO directives to conditionally include the nested source files in
the compilation units

This commit does (2) in order to minimize the set of changes needed on top
of the upstream file layout. To get this to work, there are two key things
needed:

1. In cpu.go, #cgo directives are added to explicitly set __${GOARCH}__ in
the preprocessor directives
2. In arch-impls.c|cpp, use an #ifdef | #elif defined | #endif chain to
explicitly include the .c|.cpp files for the given architecture from the
nested directory

* fix: Use mtmd_helper to correctly load the bitmap for the image

* fix: Apply patch for mtmd_text_input

* fix: Add missing stb to llama.cpp rsync-filter

* fix: Add sync'ed stb vendored header

* fix: Use c++17 and include vendor for go wrapper modules

* fix: Update patch 0015 for upstream implementation of uuid

* feat: Bump to the latest tip of the branch

* fix: Update patches for bump

* feat: Bump back to the central repo and point at the latest master

This includes granite 4 and a number of other model architectures!

* fix: Revert changes to ggml export GPU UUID patch

* fix: Add patch for GGML_VERSION and GGML_COMMIT constants

* feat: Sync all patched code

* build: Include cmake/common.cmake in ggml sync

* build: Add top-level include for GNUInstallDirs in CMakeLists.txt

This is used to populate CMAKE_INSTALL_BINDIR

* fix: Add a patch to avoid power throttling API on non-msvc windows builds

* fix: Sync patch changes for ggml-cpu.c

* feat: Bump llama.cpp to 4a4f42

This picks up support for Kimi K2 and PLaMO-2

* feat: Sync llama.cpp

* fix: Handle multi-chunk image encodings from mtmd

* fix: Re-number patches after merge with `main`

* feat: Bump to 41e78c in the makefile

* fix: Fix Solar and argsort/copy patches after bump

* fix: Remove Gemma3n CUDA Graphs patch

It was implemented upstream:
https://github.com/ggml-org/llama.cpp/pull/14741

* feat: Sync llama.cpp / ggml after latest bump

* build: Remove unnecessary CFLAGS definitions in cpu.go

* fix: Remove unnecessary additions in the rsync-filter

* fix: Remove unused vendored code for chat template parsing

* Revert "fix: Remove Gemma3n CUDA Graphs patch"

This reverts commit d724caced3ce21f08924d4b7801f94ce6638f6ea.

* fix: Update 0020 CUDA Graphs for gemma3n to keep both llama.cpp and ollama fixes

https://github.com/ollama/ollama/pull/11195#issuecomment-3137312394



* fix: Sync ggml-cuda.cu after keeping both style cuda graph fixes for gemma3n

* unwind mxfp4 patch

Prepare to bump ggml with their impl for mxfp4

* bump

* fix windows build error

* Convert tensors at load time

Repack the mxfp4 tensors as ggml's kernels expect them to be.

* convert mlp bf16 to f32

* buffer the conversion better

* reshape earlier

* openai swiglu

* add ids

* split qkv, gate_up

* fix nested alt tags

* fast attention

* remove debug messages

* fix lint

* remove redundant test

* remap values only if source/target are different

* add back i32->i32 copy

* refactor cpu quants

* clean up vendor

* update patch instructions

* clean up patches

* remove webgpu

* update mem

* also handle gpt-oss

* revert convert changes

---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>

1a19df1f

05 Aug, 2025 1 commit

gpt-oss (#11672) · fa7776fd

Michael Yang authored Aug 05, 2025



* bf16

* tests

* gpt-oss

* enable gptoss for engine

* rough estimate

* convert to mxfp4

* handle safetensors U8

* clamp glu/linear

* update tokenizer

* MXFP4 support

This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.

* Unit tests for MXFP4 support

This exercises various operations and shapes on both CPU and GPU (if detected
on the system)

* cuda graph

* unit test adjustments

* cuda: optimize memory access

Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4

* mac: fix crash on old macos versions

cblas_sgemm is only supported on v13.3 and up; however, bf16 is
only supported on v14+, so we were falling back to ggml-blas and
crashing on bf16 tensors. Checking for the function being null
seems to be the simplest way to conditionally avoid registering the
backend.

* server: Minimum context length for gptoss

This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.

* ggml: Multiply by numParallel for gptoss sliding window

When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.

* gpt-oss integration

includes harmony parser and thinking levels, etc.

* fix sync

* fix tests

* fix lint

---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>

fa7776fd
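
As background for the MXFP4 work in this commit: in the Open Compute Microscaling format, an MXFP4 block holds 32 four-bit E2M1 elements that share one 8-bit power-of-two scale (E8M0, bias 127). The decoder below is a minimal sketch of that format, not ollama's or ggml's implementation. The nibble order is an assumption made for illustration, and `decodeFP4` and `decodeBlock` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// e2m1 lists the magnitudes encoded by the low three bits of an E2M1
// element; the fourth bit is the sign.
var e2m1 = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// decodeFP4 converts one 4-bit E2M1 code to a float32.
func decodeFP4(code byte) float32 {
	v := e2m1[code&0x7]
	if code&0x8 != 0 {
		return -v
	}
	return v
}

// decodeBlock expands one block: a shared E8M0 scale applied to 32 FP4
// elements packed two per byte.
// ASSUMPTION: element 2i is the low nibble of byte i and element 2i+1 is
// the high nibble; real tensor layouts may order nibbles differently.
func decodeBlock(scale byte, packed [16]byte) [32]float32 {
	var out [32]float32
	if scale == 0xFF { // E8M0 reserves 0xFF for NaN
		for i := range out {
			out[i] = float32(math.NaN())
		}
		return out
	}
	s := float32(math.Ldexp(1, int(scale)-127)) // 2^(scale-127)
	for i, b := range packed {
		out[2*i] = decodeFP4(b&0x0F) * s
		out[2*i+1] = decodeFP4(b>>4) * s
	}
	return out
}

func main() {
	var packed [16]byte
	packed[0] = 0x2F // low nibble 0xF = -6, high nibble 0x2 = 1
	out := decodeBlock(128, packed) // scale 2^(128-127) = 2
	fmt.Println(out[0], out[1]) // -12 2
}
```

Because two elements fit in each byte, a 4-byte read covers 8 elements, which matches the "4 bytes at a time (8 elements)" note in the CUDA item above.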

29 Jul, 2025 1 commit

Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (#11525) · ea85e27b

Oliver Simons authored Jul 29, 2025

* Enable CUDA Graphs for gemma3n.

Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though ollama has a slightly different model graph
than llama.cpp which requires different workaround
checks.

* Remove residual check by reshaping differently in gemma3n model

This should make the heuristics more robust

ea85e27b

11 Jul, 2025 1 commit

Only load supported models on new engine (#11362) · f8a6e888

Daniel Hiltgen authored Jul 11, 2025

* Only load supported models on new engine

Verify the model is supported before trying to load

* int: testcase for all library models

f8a6e888

27 Jun, 2025 1 commit
- chore: cleanup comments + unused vars (#11225) · 4129af92
  Michael Yang authored Jun 27, 2025
  
  4129af92
26 Jun, 2025 1 commit

add new gemma model (#11204) · 73b642e6

Michael Yang authored Jun 25, 2025

* update patches

* cherry pick metal mean kernel

* cherry pick cuda mean kernel

* gemma3n

73b642e6

11 Jun, 2025 1 commit

use nn.Linear in place of ml.Tensor (#11049) · 2e77aa1a

Michael Yang authored Jun 11, 2025

while nn.Linear.Forward isn't applicable for sparse MLP, it's still
a nice container for the tensors

2e77aa1a