Commits · 20c5fd39c8b275c0c7d7e7be8ce03d48aa32c64e · OpenDAS / ollama

06 May, 2025 1 commit

Move quantization to new backend (#10363) · 42481045

Daniel Hiltgen authored May 06, 2025

* Move quantization logic to GGML via new backend

This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.

42481045

05 May, 2025 1 commit
- ggml: Reduce log level of "key not found" · 70736007
  Jesse Gross authored May 05, 2025
```
Most of the time this is not an error.
```
  70736007
27 Apr, 2025 1 commit

ggml: fix crash for array head counts · 6ed88985

Devon Rifkin authored Apr 25, 2025

If the head count is an array, use the max value in the array.

If array values for head counts become more popular, we can consider a
more invasive change like #10225 to calculate more accurate estimates.
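
As a rough illustration of the fallback described above, the sketch below accepts a head count that is either a scalar or a per-layer array and returns the largest value. The helper name `headCount` and the use of `any` for the metadata value are assumptions made for this example, not the actual ollama code.

```go
package main

import "fmt"

// headCount is a hypothetical helper. It assumes the metadata value is either
// a single uint32 or a per-layer []uint32 and returns the largest head count.
func headCount(v any) (uint32, error) {
	switch hc := v.(type) {
	case uint32:
		return hc, nil
	case []uint32:
		var largest uint32
		for _, n := range hc {
			if n > largest {
				largest = n
			}
		}
		return largest, nil
	default:
		return 0, fmt.Errorf("unsupported head count type %T", v)
	}
}

func main() {
	scalar, _ := headCount(uint32(32))
	perLayer, _ := headCount([]uint32{8, 32, 16})
	fmt.Println(scalar, perLayer)
}
```

Taking the maximum keeps the estimate conservative: memory is sized for the widest layer, which may overestimate for models whose layers differ a lot.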

Fixes: #9984

6ed88985

25 Apr, 2025 8 commits
- memory · f0ad49ea
  Michael Yang authored Apr 23, 2025
  
  f0ad49ea
- llama4 · f0c66e6d
  Michael Yang authored Apr 03, 2025
  
  f0c66e6d
- fix parameter count · ced7d0e5
  Michael Yang authored Apr 23, 2025
  
  ced7d0e5
- default slice values · a0dba0f8
  Michael Yang authored Apr 23, 2025
  
  a0dba0f8
- update comment · 5e20b170
  Michael Yang authored Apr 23, 2025
  
  5e20b170
- fix token type · d26c18e2
  Michael Yang authored Apr 23, 2025
  
  d26c18e2
- zero means zero · 8d376acc
  Michael Yang authored Apr 23, 2025
```
using a default of 1024 when asking for zero is confusing since most calls
seem to assume 0 means do not read any data
```
  8d376acc
- generic ggml.array · 5d027916
  Michael Yang authored Apr 23, 2025
  
  5d027916
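
The "zero means zero" commit above can be pictured with a small sketch. The function `collect` and the convention that a negative limit means "no limit" are assumptions for illustration only; the point is that a limit of 0 returns nothing instead of silently becoming a default such as 1024.

```go
package main

import "fmt"

// collect returns up to limit values from src (hypothetical helper).
//   limit == 0: collect nothing (zero means zero)
//   limit  < 0: collect everything (assumed convention for this sketch)
func collect(src []uint32, limit int) []uint32 {
	switch {
	case limit == 0:
		return nil
	case limit < 0 || limit >= len(src):
		return src
	default:
		return src[:limit]
	}
}

func main() {
	data := []uint32{4, 5, 6}
	fmt.Println(len(collect(data, 0)), len(collect(data, 2)), len(collect(data, -1)))
}
```
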
03 Apr, 2025 1 commit

model: support for mistral-small in the ollama runner · 6bd0a983

Bruce MacDonald authored Mar 14, 2025

Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.

6bd0a983

26 Mar, 2025 1 commit

ggml: Support heterogeneous KV cache layer sizes in memory estimation · f66216e3

Jesse Gross authored Mar 24, 2025

Gemma3 uses sliding windows for its context on 5/6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.

Llama3.2-vision is also inconsistent between self attention and cross
attention layers - at the moment, we calculate the correct total size
and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.

This allows memory estimation to calculate per-layer KV cache size
and take this into account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.
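
To see why averaging can go wrong, here is a minimal sketch under simplified assumptions: sizes are in arbitrary units, layers are placed in order on the first GPU with room, and only the KV cache is counted. The function names are hypothetical and the real estimator considers much more than this.

```go
package main

import "fmt"

// assignLayers walks the layers in order and places each on the current GPU,
// advancing to the next GPU when the current one is full. It returns the GPU
// index chosen for each layer, or -1 when no remaining GPU can hold the layer.
func assignLayers(layerSize, gpuFree []uint64) []int {
	free := append([]uint64(nil), gpuFree...) // copy so the caller's slice is untouched
	placement := make([]int, len(layerSize))
	gpu := 0
	for i, need := range layerSize {
		for gpu < len(free) && free[gpu] < need {
			gpu++
		}
		if gpu == len(free) {
			placement[i] = -1
			continue
		}
		free[gpu] -= need
		placement[i] = gpu
	}
	return placement
}

// averaged returns a slice in which every layer is given the mean size.
func averaged(layerSize []uint64) []uint64 {
	var total uint64
	for _, s := range layerSize {
		total += s
	}
	out := make([]uint64, len(layerSize))
	for i := range out {
		out[i] = total / uint64(len(layerSize))
	}
	return out
}

// actualUse reports how much each GPU really holds for a given placement.
func actualUse(placement []int, layerSize []uint64, gpus int) []uint64 {
	used := make([]uint64, gpus)
	for i, g := range placement {
		if g >= 0 {
			used[g] += layerSize[i]
		}
	}
	return used
}

func main() {
	layers := []uint64{1, 1, 1, 1, 1, 7} // five sliding-window layers, one full-context layer
	gpus := []uint64{6, 6}

	byAverage := assignLayers(averaged(layers), gpus)
	fmt.Println(byAverage, actualUse(byAverage, layers, len(gpus)))

	perLayer := assignLayers(layers, gpus)
	fmt.Println(perLayer, actualUse(perLayer, layers, len(gpus)))
}
```

With these numbers, planning by the average puts three layers on each GPU, but the second GPU would really need 9 units against 6 free. Planning with per-layer sizes instead reports that the large layer does not fit on any GPU, so it can be left off the GPUs rather than overcommitting one.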

Fixes #9730
Fixes #9890

f66216e3

13 Mar, 2025 6 commits
- count non-repeating vision layers · 8d76fa23
  Michael Yang authored Mar 13, 2025
  
  8d76fa23
- fix divide by zero · 65b88c54
  Michael Yang authored Mar 13, 2025
  
  65b88c54
- roughly count gemma3 graph · a422ba39
  Michael Yang authored Mar 13, 2025
```
the largest operation is by far (q @ k) so just count that for
simplicity
```
  a422ba39
- count all vision tensors · d2ec2237
  Michael Yang authored Mar 12, 2025
  
  d2ec2237
- count gemma3 vision tensors · 033cec23
  Michael Yang authored Mar 12, 2025
  
  033cec23
- add verbose mode to the show command (#9640) · 4bed7392
  Patrick Devine authored Mar 13, 2025
```
Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
```
  4bed7392
11 Mar, 2025 2 commits
- llm: auto detect models that require Ollama Engine (#1 ) · ab39e08e
  Daniel Hiltgen authored Mar 11, 2025
  
  ab39e08e
- gemma2 impl · 5f74d1fd
  Patrick Devine authored Feb 07, 2025
  
  5f74d1fd
04 Mar, 2025 1 commit

New engine: vision models and auto-fallback (#9113) · 1fdb351c

Daniel Hiltgen authored Mar 04, 2025

* Include unified vision layers in memory prediction

For newer vision models with a single gguf, include
the projection estimates.

* Adjust CLI to handle both styles of vision model metadata

* Wire up new tokenizers for new engine

If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp.  This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.

* Lay foundation for auto selection of new engine

1fdb351c

27 Feb, 2025 1 commit
- model: add bos token if configured · 53d2990d
  Michael Yang authored Feb 26, 2025
  
  53d2990d
25 Feb, 2025 1 commit
- fix: add back bf16 support · b16367b4
  Michael Yang authored Feb 25, 2025
```
this was accidentally removed when moving fs/ggml from its previous
location
```
  b16367b4
14 Feb, 2025 1 commit

next ollama runner (#7913) · 58245413

Michael Yang authored Feb 14, 2025

feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces three high-level interfaces and several smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations
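
The relationship between the three interfaces can be sketched as follows. This is a deliberately simplified stand-in: the method sets, the `vec`, `mapBackend`, and `biasModel` types, and the `"bias"` tensor name are all invented for illustration and do not match the real signatures in `model/model.go` or `ml/backend.go`.

```go
package main

import "fmt"

// Tensor is a simplified stand-in for ml.Tensor (hypothetical method set).
type Tensor interface {
	Shape() []int
	Floats() []float32
	Add(other Tensor) Tensor
}

// Backend is a simplified stand-in for ml.Backend: it owns the loaded
// weights and hands them out by name.
type Backend interface {
	Get(name string) Tensor
}

// Model is a simplified stand-in for model.Model: Forward runs one pass.
type Model interface {
	Forward(b Backend, input Tensor) (Tensor, error)
}

// vec is a toy one-dimensional tensor kept in host memory.
type vec []float32

func (v vec) Shape() []int      { return []int{len(v)} }
func (v vec) Floats() []float32 { return v }

// Add assumes both tensors have the same length.
func (v vec) Add(other Tensor) Tensor {
	o := other.Floats()
	out := make(vec, len(v))
	for i := range v {
		out[i] = v[i] + o[i]
	}
	return out
}

// mapBackend is a toy backend backed by a map of named tensors.
type mapBackend map[string]Tensor

func (m mapBackend) Get(name string) Tensor { return m[name] }

// biasModel is a toy architecture whose forward pass adds one weight tensor.
type biasModel struct{}

func (biasModel) Forward(b Backend, input Tensor) (Tensor, error) {
	bias := b.Get("bias")
	if bias == nil {
		return nil, fmt.Errorf("tensor %q not found", "bias")
	}
	return input.Add(bias), nil
}

func main() {
	var m Model = biasModel{}
	out, err := m.Forward(mapBackend{"bias": vec{1, 2}}, vec{10, 20})
	fmt.Println(out.Floats(), err)
}
```

The split means a model only describes its computation in terms of tensor operations, while the backend decides where the weights live and how the operations run.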

This is the first implementation of the new engine. Follow up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

58245413

03 Dec, 2024 1 commit
- llm: introduce k/v context quantization (vRAM improvements) (#6279) · 1bdab9fd
  Sam authored Dec 04, 2024
  
  1bdab9fd
01 Nov, 2024 2 commits
- refactor kv estimation · d07cf41a
  Michael Yang authored Oct 31, 2024
  
  d07cf41a
- mllama cross attention · 8c238e70
  Michael Yang authored Oct 31, 2024
  
  8c238e70
18 Oct, 2024 1 commit

image processing for llama3.2 (#6963) · c7cb0f06

Patrick Devine authored Oct 18, 2024

Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Jesse Gross <jesse@ollama.com>

c7cb0f06

15 Oct, 2024 1 commit
- Add missing BF16 tensor type. (#7193) · 09035b71
  frob authored Oct 15, 2024
```
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
```
  09035b71
23 Aug, 2024 1 commit
- convert safetensor adapters into GGUF (#6327) · 0c819e16
  Patrick Devine authored Aug 23, 2024
  
  0c819e16
12 Aug, 2024 1 commit
- add conversion for microsoft phi 3 mini/medium 4k, 128 · 6ffb5cb0
  Michael Yang authored Jun 03, 2024
  
  6ffb5cb0
08 Aug, 2024 1 commit
- llama3.1 memory · 2003d601
  Michael Yang authored Aug 08, 2024
  
  2003d601
31 Jul, 2024 1 commit
- update convert test to check result data · 6b252918
  Michael Yang authored Jun 03, 2024
  
  6b252918
10 Jul, 2024 1 commit
- chatglm graph · 5a739ff4
  Michael Yang authored Jul 10, 2024
  
  5a739ff4
27 Jun, 2024 1 commit
- gemma2 graph · de2163da
  Michael Yang authored Jun 27, 2024
  
  de2163da
25 Jun, 2024 1 commit

llm: speed up gguf decoding by a lot (#5246) · cb42e607

Blake Mizerany authored Jun 24, 2024

Previously, some costly things were causing the loading of GGUF files
and their metadata and tensor information to be VERY slow:

  * Too many allocations when decoding strings
  * Hitting disk for each read of each key and value, resulting in an
    excessive number of syscalls and disk I/O.

The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro
M3.

This commit also prevents collecting large arrays of values when
decoding GGUFs (if desired). When such keys are encountered, their
values are null, and are encoded as such in JSON.
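
The two ideas above, batching small reads and skipping large arrays, can be sketched like this. The layout here (a little-endian `uint64` length or count followed by the payload) is a simplification rather than the full GGUF format, and the function names and the `maxCollect` parameter are assumptions for this example, not the code from the commit.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readString reads a uint64 little-endian length followed by that many bytes.
// A real decoder should bound n before allocating.
func readString(r *bufio.Reader) (string, error) {
	var n uint64
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return "", err
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

// readUint32Array reads a uint64 count followed by count uint32 values.
// Arrays longer than maxCollect are skipped and returned as nil.
func readUint32Array(r *bufio.Reader, maxCollect uint64) ([]uint32, error) {
	var n uint64
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return nil, err
	}
	if n > maxCollect {
		// Skip the payload without allocating; assumes n*4 fits in an int.
		_, err := r.Discard(int(n) * 4)
		return nil, err
	}
	vals := make([]uint32, n)
	if err := binary.Read(r, binary.LittleEndian, vals); err != nil {
		return nil, err
	}
	return vals, nil
}

// sampleStream builds an in-memory stream holding one key and one array,
// wrapped in a bufio.Reader so small reads are served from a buffer.
func sampleStream() *bufio.Reader {
	var file bytes.Buffer
	key := "general.name"
	binary.Write(&file, binary.LittleEndian, uint64(len(key)))
	file.WriteString(key)
	binary.Write(&file, binary.LittleEndian, uint64(3))
	binary.Write(&file, binary.LittleEndian, []uint32{7, 8, 9})
	return bufio.NewReaderSize(&file, 32<<10)
}

func main() {
	r := sampleStream()
	name, _ := readString(r)
	vals, _ := readUint32Array(r, 2) // the array has 3 values, above the limit of 2
	fmt.Println(name, vals == nil)
}
```

With a real file, the `bufio.Reader` would wrap an `*os.File`, so many small key and value reads are served from one larger read instead of each one hitting the disk.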

Also, this fixes a broken test that was not encoding valid GGUF.

cb42e607

20 Jun, 2024 1 commit
- handle asymmetric embedding KVs · 8e0641a9
  Michael Yang authored Jun 20, 2024
  
  8e0641a9
18 Jun, 2024 1 commit
- deepseek v2 graph · e873841c
  Michael Yang authored Jun 18, 2024
  
  e873841c
14 Jun, 2024 1 commit

Improve multi-gpu handling at the limit · 6fd04ca9

Daniel Hiltgen authored May 18, 2024

Still not complete; needs some refinement to our prediction to understand each
discrete GPU's available space so we can see how many layers fit in each one.
Since we can't split one layer across multiple GPUs, we can't treat free space
as one logical block.
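
A small sketch of the point about free space, assuming equally sized layers and arbitrary units. Both function names are hypothetical. Pooling the free space of all GPUs can overestimate how many layers fit, because each layer has to fit entirely on one device.

```go
package main

import "fmt"

// pooledEstimate treats all free space as one block (the optimistic view).
func pooledEstimate(gpuFree []uint64, layerSize uint64) int {
	if layerSize == 0 {
		return 0
	}
	var sum uint64
	for _, free := range gpuFree {
		sum += free
	}
	return int(sum / layerSize)
}

// perGPUEstimate counts whole layers per GPU, since a layer cannot be split.
func perGPUEstimate(gpuFree []uint64, layerSize uint64) int {
	if layerSize == 0 {
		return 0
	}
	total := 0
	for _, free := range gpuFree {
		total += int(free / layerSize)
	}
	return total
}

func main() {
	gpus := []uint64{5, 5}
	fmt.Println(pooledEstimate(gpus, 3), perGPUEstimate(gpus, 3))
}
```

Here two GPUs with 5 units free each look like room for three 3-unit layers when pooled, but only one layer fits on each device, so two is the usable count.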

6fd04ca9