  1. 14 May, 2025 (2 commits)
  2. 13 May, 2025 (2 commits)
  3. 12 May, 2025 (1 commit)
  4. 06 May, 2025 (1 commit)
    • Move quantization to new backend (#10363) · 42481045
      Daniel Hiltgen authored
      * Move quantization logic to GGML via new backend
      
      This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
      
      * Remove "add model quantizations"
      
      This is no longer needed now that quantization is implemented in Go+GGML code directly.
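
      A hedged sketch of the split described above, in Go: the model-aware
      policy (which tensors to quantize, and to what target type) stays on the
      Go side, while the per-tensor work is delegated to GGML through the new
      backend. The function names and the skip rule below are illustrative
      assumptions, not the actual Ollama code.

        package quantize

        import "strings"

        // ggmlQuantize stands in for the cgo call into GGML's quantization
        // routines exposed by the new backend; its real signature is assumed.
        func ggmlQuantize(f32 []float32, targetType string) []byte {
            // ... call into GGML would happen here ...
            return nil
        }

        // quantizeModel applies the Go-side, model-aware policy and hands each
        // remaining tensor to GGML for the actual quantization.
        func quantizeModel(tensors map[string][]float32, targetType string) map[string][]byte {
            out := make(map[string][]byte, len(tensors))
            for name, data := range tensors {
                // Illustrative rule: keep normalization weights in full precision.
                if strings.HasSuffix(name, "_norm.weight") {
                    continue
                }
                out[name] = ggmlQuantize(data, targetType)
            }
            return out
        }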
  5. 02 May, 2025 (2 commits)
  6. 25 Apr, 2025 (1 commit)
  7. 24 Apr, 2025 (1 commit)
  8. 17 Apr, 2025 (1 commit)
  9. 16 Apr, 2025 (1 commit)
  10. 15 Apr, 2025 (1 commit)
  11. 03 Apr, 2025 (1 commit)
    • model: support for mistral-small in the ollama runner · 6bd0a983
      Bruce MacDonald authored
      Mistral is a popular research lab making open-source models. This updates
      the forward pass of llama-architecture models to support both llama and
      mistral models by accounting for additional metadata present in mistral
      models and finding the correct dimensions for the output projection.
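
      A minimal sketch of the idea, assuming a hypothetical metadata key (the
      real key names live in the model's GGUF metadata and are not given in
      this message): one forward pass can serve both families if the output
      projection is sized from the model's own metadata when present.

        package model

        // KV is the model's key/value metadata as read from its file.
        type KV map[string]any

        // outputProjectionDim returns the input width of the output projection:
        // numHeads*headDim when a per-head dimension is present in the metadata,
        // otherwise the embedding size, as in plain llama models.
        func outputProjectionDim(kv KV, embeddingDim, numHeads int) int {
            if headDim, ok := kv["attention.head_dim"].(int); ok { // key name is illustrative
                return numHeads * headDim
            }
            return embeddingDim
        }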
  12. 27 Mar, 2025 (1 commit)
  13. 15 Mar, 2025 (1 commit)
  14. 11 Mar, 2025 (1 commit)
  15. 07 Mar, 2025 (1 commit)
  16. 03 Mar, 2025 (1 commit)
  17. 28 Feb, 2025 (1 commit)
  18. 27 Feb, 2025 (1 commit)
  19. 24 Feb, 2025 (1 commit)
  20. 20 Feb, 2025 (1 commit)
  21. 19 Feb, 2025 (1 commit)
  22. 18 Feb, 2025 (1 commit)
    • build: remove backend build for sapphirerapids · 5f8c0318
      Michael Yang authored
      Sapphire Rapids has AMX support, but it ends up having a negative
      performance impact.
      
      Emerald Rapids also has AMX support with a positive performance impact;
      however, there's no reasonable way in ggml to differentiate between the
      two. The impact is small (~6%), so disable AMX entirely for simplicity.
  23. 14 Feb, 2025 (1 commit)
  24. 11 Feb, 2025 (1 commit)
  25. 10 Feb, 2025 (1 commit)
  26. 05 Feb, 2025 (1 commit)
  27. 29 Jan, 2025 (1 commit)
    • next build (#8539) · dcfb7a10
      Michael Yang authored
      
      
      * add build to .dockerignore
      
      * test: only build one arch
      
      * add build to .gitignore
      
      * fix ccache path
      
      * filter amdgpu targets
      
      * only filter if autodetecting
      
      * Don't clobber gpu list for default runner
      
      This ensures the GPU specific environment variables are set properly
      
      * explicitly set CXX compiler for HIP
      
      * Update build_windows.ps1
      
      This isn't complete, but it is close. Dependencies are missing, and it only builds the "default" preset.
      
      * build: add ollama subdir
      
      * add .git to .dockerignore
      
      * docs: update development.md
      
      * update build_darwin.sh
      
      * remove unused scripts
      
      * llm: add cwd and build/lib/ollama to library paths
      
      * default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS
      
      * add additional cmake output vars for msvc
      
      * interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12
      
      * remove unnecessary filepath.Dir, cleanup
      
      * add hardware-specific directory to path
      
      * use absolute server path
      
      * build: linux arm
      
      * cmake install targets
      
      * remove unused files
      
      * ml: visit each library path once (see the path-lookup sketch after this message)
      
      * build: skip cpu variants on arm
      
      * build: install cpu targets
      
      * build: fix workflow
      
      * shorter names
      
      * fix rocblas install
      
      * docs: clean up development.md
      
      * consistent build dir removal in development.md
      
      * silence -Wimplicit-function-declaration build warnings in ggml-cpu
      
      * update readme
      
      * update development readme
      
      * llm: update library lookup logic now that there is one runner (#8587)
      
      * tweak development.md
      
      * update docs
      
      * add windows cuda/rocm tests
      
      ---------
      Co-authored-by: jmorganca <jmorganca@gmail.com>
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
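
      The library-path items above ("add cwd and build/lib/ollama to library
      paths", "visit each library path once") suggest a lookup of roughly the
      following shape; this is a hedged sketch with assumed directory
      candidates, not the code from this change.

        package ml

        import "path/filepath"

        // libraryPaths builds the candidate library directories and visits each
        // resolved path only once, even when the candidates overlap.
        func libraryPaths(exeDir, cwd string) []string {
            candidates := []string{
                filepath.Join(exeDir, "lib", "ollama"), // installed layout (assumed)
                filepath.Join(cwd, "build", "lib", "ollama"),
                cwd,
            }
            seen := make(map[string]bool)
            var out []string
            for _, p := range candidates {
                abs, err := filepath.Abs(p)
                if err != nil {
                    continue
                }
                if !seen[abs] {
                    seen[abs] = true
                    out = append(out, abs)
                }
            }
            return out
        }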
  28. 08 Jan, 2025 (1 commit)
  29. 17 Dec, 2024 (1 commit)
    • llama: Ensure KV cache is fully defragmented. · 08a832b4
      Jesse Gross authored
      Sometimes the KV cache requires defragmentation even without
      triggering the threshold heuristic. In this case, decoding
      will not be able to find a KV cache slot. This is particularly
      difficult for the caller to handle if it happens in between
      ubatches. To avoid this, we should immediately trigger a defrag.
      
      In addition, a heavily fragmented cache can require more than
      max_moves to defragment. Currently, we stop when we hit the limit
      but this can leave a cache that still does not have adequate space
      even after defragmentation is triggered. Instead, we should do
      multiple batches of processing until everything is complete.
      
      Fixes #7949
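
      A hedged sketch of the control flow described above, with hypothetical
      helper names standing in for the real llama.cpp-backed operations:
      instead of one pass capped at max_moves, bounded passes repeat until the
      cache reports nothing left to move.

        package kvcache

        // defragCache is the minimal surface the sketch needs; both methods are
        // hypothetical stand-ins for the real operations.
        type defragCache interface {
            planDefragMoves(maxMoves int) []int // plan at most maxMoves moves
            applyMoves(moves []int)             // perform one batch of moves
        }

        // defragment keeps running bounded defrag passes until the cache has no
        // further moves to make, rather than stopping after a single pass.
        func defragment(cache defragCache, maxMoves int) {
            for {
                moves := cache.planDefragMoves(maxMoves)
                if len(moves) == 0 {
                    return // fully defragmented
                }
                cache.applyMoves(moves)
            }
        }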
  30. 14 Dec, 2024 (1 commit)
  31. 12 Dec, 2024 (1 commit)
  32. 11 Dec, 2024 (1 commit)
  33. 30 Oct, 2024 (1 commit)
    • runner.go: Better abstract vision model integration · c826e574
      Jesse Gross authored
      
      
      - Update mllama to take the cross attention state as embeddings in
        a batch, more similar to how Llava handles it. This improves
        integration with the input cache.
      - Pass locations in a prompt for embeddings using tags similar to Llava.
      - Abstract interface to vision models so the main runner accesses Clip
        and Mllama similarly.
      Co-authored-by: Michael Yang <mxyng@pm.me>
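
      A hedged Go sketch of the abstraction described above, with illustrative
      names (the real interface in runner.go may differ): the runner addresses
      any vision encoder, Clip or Mllama, through one interface and splices the
      resulting embeddings into the batch at the tagged prompt positions.

        package runner

        // VisionModel is an illustrative common interface for image encoders
        // such as Clip and Mllama; the method name and signature are assumed.
        type VisionModel interface {
            // EncodeImage converts raw image bytes into embedding vectors that
            // are inserted into the batch where the prompt's image tag appears.
            EncodeImage(image []byte) ([][]float32, error)
        }

        // embedImages encodes each image once so the results can be cached and
        // placed into the batch at their tagged positions.
        func embedImages(m VisionModel, images [][]byte) ([][][]float32, error) {
            out := make([][][]float32, 0, len(images))
            for _, img := range images {
                emb, err := m.EncodeImage(img)
                if err != nil {
                    return nil, err
                }
                out = append(out, emb)
            }
            return out, nil
        }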
  34. 26 Oct, 2024 (1 commit)
  35. 18 Oct, 2024 (1 commit)
  36. 17 Oct, 2024 (2 commits)
    • llama: Decouple patching script from submodule (#7139) · bf4018b9
      Daniel Hiltgen authored
      * Refine llama.cpp vendoring workflow tools
      
      Switch from the sync.sh script over to make-based tooling
      
      * Run new make sync and patch flow
    • IBM granite/granitemoe architecture support (#6760) · f2890a44
      Gabe Goodhart authored
      * fix(ext_server): Port llama.cpp sampling refactors to ext_server
      
      This was a fairly large changeset. I closely followed the changes here:
      https://github.com/ggerganov/llama.cpp/commit/df270ef74596da8f1178f08991f4c51f18c9ee82
      
      
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat: Bump llama.cpp to the latest master with `granite` support
      
      This does not yet have granite MoE support, but that can come in a
      follow up PR
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(solar): Update solar patch for llama.cpp bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump llama.cpp for granitemoe support
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump llama.cpp for granitemoe support
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(solar): Update the solar-pro patch for latest llama.cpp bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump to the latest master of llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(patches): Update all patches for latest bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama): Always run sync.sh from the right directory
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/patches): Update llama patches
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama)!: Rough sync with llama.cpp submodule
      
      There are a number of changes that will need to be propagated to llama.go
      before any of this works!
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/patches): Add a patch and update for missing ggml-impl.h include
      
      This include is where the ggml_cgraph struct is defined. It is included in
      many of the .c files to define the forward declaration in ggml.h. It seems
      that with the subset of code included here, the import was somehow lost (or
      out-of-order) when building, so adding this include to llama.cpp fixes the
      missing definition.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Add missing log.cpp
      
      This was added as part of the logging overhaul done in llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Overhaul use of sampling module for llama.cpp changes
      
      The changes here reflect the changes made in the big llama.cpp sampling PR
      https://github.com/ggerganov/llama.cpp/pull/9294
      
      
      
      The sampling functionality is now broken into the base interface
      (llama_sampler) and the generation implementation (gpt_sampler). The
      changes here reflect that. Since the sampling.h/sampling.cpp code uses c++
      STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow Go to
      access a pure-C interface (a toy sketch of this boundary follows the message).
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Fix the impl of SampleTokenGreedy for new sampling
      
      I don't think this method is currently used, so it could probably just be
      removed so that all sampling goes through the GPT interface, but in the
      interest of doing no harm, this should keep the method working as expected.
      
      Branch: IBMGraniteArchitectureSupport
      
      * fix(llama): Remove unused SampleTokenGreedy
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(sync): Remove bash-specific change to sync.sh
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * chore(gofumpt): Format on llama.go to pass linting
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llm): Fix missing <thread> include in ext_server
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Remove TODO about grammar_first
      
      This feature was not used/needed previously, so it should be fine without
      plumbing it through now.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Better naming for sampling wrapper and args
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Fix patch 05 to use new wrapper api and re-sync
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * runner: Flush pending responses before returning
      
      If there are any pending responses (such as from potential stop
      tokens) then we should send them back before ending the sequence.
      Otherwise, we can be missing tokens at the end of a response.
      
      Fixes #6707
      
      * fix(llama/sampling): Use gpt_sampler with a forward declaration
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Remove unnecessary patch for gguf impl header
      
      This was caused by an earlier mistake in the embeddings patch that was
      dereferencing the pointer instead of using the wrapper API.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llm): Remove use of deprecated --log-disable flag
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      ---------
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
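
      The sampling item above notes that sampling.h/sampling.cpp pull in C++
      STL headers, so Go reaches the sampler only through the pure-C
      sampling_ext wrapper. Below is a self-contained toy sketch of that
      pattern (Go calling a plain-C shim over cgo); the shim function is
      invented for illustration and is not the real sampling_ext API.

        package main

        /*
        // A plain-C function; in the real code an extern "C" shim like this
        // would forward into the C++ gpt_sampler rather than do the work itself.
        static int shim_sample_greedy(const float *logits, int n) {
            int best = 0;
            for (int i = 1; i < n; i++) {
                if (logits[i] > logits[best]) best = i;
            }
            return best;
        }
        */
        import "C"

        import "fmt"

        func main() {
            logits := []C.float{0.1, 2.5, 0.3}
            // Only the C ABI crosses the cgo boundary; no C++ STL types reach Go.
            tok := C.shim_sample_greedy(&logits[0], C.int(len(logits)))
            fmt.Println("greedy token:", tok)
        }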