Commits · f4f0992b6ea5d651eff609461c24ece936bd5708 · OpenDAS / ollama

13 Mar, 2025 6 commits
- count non-repeating vision layers · 8d76fa23
  Michael Yang authored Mar 13, 2025
  
  8d76fa23
- fix divide by zero · 65b88c54
  Michael Yang authored Mar 13, 2025
  
  65b88c54
- roughly count gemma3 graph · a422ba39
  Michael Yang authored Mar 13, 2025
```
the largest operation is by far (q @ k) so just count that for
simplicity
```
  a422ba39
- count all vision tensors · d2ec2237
  Michael Yang authored Mar 12, 2025
  
  d2ec2237
- count gemma3 vision tensors · 033cec23
  Michael Yang authored Mar 12, 2025
  
  033cec23
- add verbose mode to the show command (#9640) · 4bed7392
  Patrick Devine authored Mar 13, 2025
```
Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
```
  4bed7392
11 Mar, 2025 2 commits
- llm: auto detect models that require Ollama Engine (#1 ) · ab39e08e
  Daniel Hiltgen authored Mar 11, 2025
  
  ab39e08e
- gemma2 impl · 5f74d1fd
  Patrick Devine authored Feb 07, 2025
  
  5f74d1fd
04 Mar, 2025 1 commit

New engine: vision models and auto-fallback (#9113) · 1fdb351c

Daniel Hiltgen authored Mar 04, 2025

* Include unified vision layers in memory prediction

For newer vision models with a single gguf, include
the projection estimates.

* Adjust CLI to handle both styles of vision model metadata

* Wire up new tokenizers for new engine

If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp.  This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.

* Lay foundation for auto selection of new engine

1fdb351c

27 Feb, 2025 1 commit
- model: add bos token if configured · 53d2990d
  Michael Yang authored Feb 26, 2025
  
  53d2990d
25 Feb, 2025 1 commit
- fix: add back bf16 support · b16367b4
  Michael Yang authored Feb 25, 2025
```
this was accidentally removed when moving fs/ggml from its previous
location
```
  b16367b4
14 Feb, 2025 1 commit

next ollama runner (#7913) · 58245413

Michael Yang authored Feb 14, 2025



feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations

This is the first implementation of the new engine. Follow up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

58245413

03 Dec, 2024 1 commit
- llm: introduce k/v context quantization (vRAM improvements) (#6279) · 1bdab9fd
  Sam authored Dec 04, 2024
  
  1bdab9fd
01 Nov, 2024 2 commits
- refactor kv estimation · d07cf41a
  Michael Yang authored Oct 31, 2024
  
  d07cf41a
- mllama cross attention · 8c238e70
  Michael Yang authored Oct 31, 2024
  
  8c238e70
18 Oct, 2024 1 commit

image processing for llama3.2 (#6963) · c7cb0f06

Patrick Devine authored Oct 18, 2024


Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Jesse Gross <jesse@ollama.com>

c7cb0f06

15 Oct, 2024 1 commit
- Add missing BF16 tensor type. (#7193) · 09035b71
  frob authored Oct 15, 2024
```
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
```
  09035b71
23 Aug, 2024 1 commit
- convert safetensor adapters into GGUF (#6327) · 0c819e16
  Patrick Devine authored Aug 23, 2024
  
  0c819e16
12 Aug, 2024 1 commit
- add conversion for microsoft phi 3 mini/medium 4k, 128 · 6ffb5cb0
  Michael Yang authored Jun 03, 2024
  
  6ffb5cb0
08 Aug, 2024 1 commit
- llama3.1 memory · 2003d601
  Michael Yang authored Aug 08, 2024
  
  2003d601
31 Jul, 2024 1 commit
- update convert test to check result data · 6b252918
  Michael Yang authored Jun 03, 2024
  
  6b252918
10 Jul, 2024 1 commit
- chatglm graph · 5a739ff4
  Michael Yang authored Jul 10, 2024
  
  5a739ff4
27 Jun, 2024 1 commit
- gemma2 graph · de2163da
  Michael Yang authored Jun 27, 2024
  
  de2163da
25 Jun, 2024 1 commit

llm: speed up gguf decoding by a lot (#5246) · cb42e607

Blake Mizerany authored Jun 24, 2024

Previously, some costly things were causing the loading of GGUF files
and their metadata and tensor information to be VERY slow:

  * Too many allocations when decoding strings
  * Hitting disk for each read of each key and value, resulting in a
    not-okay amount of syscalls/disk I/O.

The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro
m3.

This commit also prevents collecting large arrays of values when
decoding GGUFs (if desired). When such keys are encountered, their
values are null, and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.

cb42e607

20 Jun, 2024 1 commit
- handle asymmetric embedding KVs · 8e0641a9
  Michael Yang authored Jun 20, 2024
  
  8e0641a9
18 Jun, 2024 1 commit
- deepseek v2 graph · e873841c
  Michael Yang authored Jun 18, 2024
  
  e873841c
14 Jun, 2024 1 commit

Improve multi-gpu handling at the limit · 6fd04ca9

Daniel Hiltgen authored May 18, 2024

Still not complete, needs some refinement to our prediction to understand the
discrete GPUs available space so we can see how many layers fit in each one
since we can't split one layer across multiple GPUs we can't treat free space
as one logical block

6fd04ca9

11 Jun, 2024 1 commit

Revert "Merge pull request #4938 from ollama/mxyng/fix-byte-order" · 7bdcd1da

Michael Yang authored Jun 11, 2024

This reverts commit f5f245cc, reversing
changes made to 94d37fdc.

this change broke gguf v2 which is incorrectly detected as big endian

7bdcd1da

08 Jun, 2024 1 commit
- fix parsing big endian gguf · 620d5c56
  Michael Yang authored Jun 08, 2024
  
  620d5c56
06 Jun, 2024 1 commit
- detect chat template from KV · 9b6c2e6e
  Michael Yang authored Jun 03, 2024
  
  9b6c2e6e
24 May, 2024 2 commits
- Update llm/ggml.go · d51f1525
  Michael Yang authored May 24, 2024
```
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
```
  d51f1525
- fix q5_0, q5_1 · 8f440d57
  Michael Yang authored May 24, 2024
  
  8f440d57
23 May, 2024 1 commit
- Add support for IQ1_S, IQ3_S, IQ2_S, IQ4_XS. IQ4_NL (#4322) · d6f692ad
  Bruce MacDonald authored May 23, 2024
```
Co-authored-by: ManniX-ITA <20623405+mann1x@users.noreply.github.com>
```
  d6f692ad
21 May, 2024 1 commit
- simplify safetensors reading · 171eb040
  Michael Yang authored May 20, 2024
  
  171eb040
10 May, 2024 1 commit
- add phi2 mem · 1eb382da
  Michael Yang authored May 10, 2024
  
  1eb382da
08 May, 2024 1 commit
- skip if same quantization · eeb69526
  Michael Yang authored May 07, 2024
  
  eeb69526
06 May, 2024 2 commits
- comments · 01811c17
  Michael Yang authored Apr 23, 2024
  
  01811c17
- quantize any fp16/fp32 model · 9685c345
  Michael Yang authored Apr 12, 2024
```
- FROM /path/to/{safetensors,pytorch}
- FROM /path/to/fp{16,32}.bin
- FROM model:fp{16,32}
```
  9685c345
23 Apr, 2024 1 commit
- fix: mixtral graph · 435cc866
  Michael Yang authored Apr 22, 2024
  
  435cc866
17 Apr, 2024 1 commit
- add stablelm graph calculation · 3cf483fe
  Michael Yang authored Apr 17, 2024
  
  3cf483fe