- 25 Apr, 2025 1 commit
Michael Yang authored
- 26 Mar, 2025 2 commits
Jesse Gross authored
Gemma3 uses sliding-window attention for its context on 5 of every 6 layers, significantly reducing memory usage but leading to uneven usage across layers, which makes allocation to the correct GPU difficult. We currently estimate very conservatively by assuming all layers are consistent at the max size. Llama3.2-vision is also inconsistent between self-attention and cross-attention layers - at the moment, we calculate the correct total size and then average it across layers. In some cases, this may lead to crashes if a large layer is placed on a GPU sized for the average. This change allows memory estimation to calculate per-layer KV cache sizes and take them into account when placing layers onto GPUs. We already do this for weights that vary per-tensor, so this is a logical extension. Fixes #9730 Fixes #9890
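To make the per-layer idea concrete, here is a minimal Go sketch of estimating KV cache size layer by layer; the type names, window size, and byte counts are illustrative assumptions, not Ollama's actual estimator.

```go
package main

import "fmt"

// layerKVSpec describes how one transformer layer caches keys/values.
// The name and fields are hypothetical, for illustration only.
type layerKVSpec struct {
	slidingWindow bool // true for Gemma3-style local-attention layers
}

// kvBytesPerLayer estimates the KV cache size of a single layer.
// bytesPerToken covers K+V across all heads at the chosen cache dtype.
func kvBytesPerLayer(l layerKVSpec, numCtx, windowSize, bytesPerToken uint64) uint64 {
	tokens := numCtx
	if l.slidingWindow && windowSize < numCtx {
		tokens = windowSize // local layers only retain the window
	}
	return tokens * bytesPerToken
}

func main() {
	// Toy numbers: 32k context, 1k sliding window, 64 KiB per cached token.
	layers := []layerKVSpec{{true}, {true}, {true}, {true}, {true}, {false}}
	var total uint64
	for i, l := range layers {
		sz := kvBytesPerLayer(l, 32768, 1024, 64*1024)
		total += sz
		fmt.Printf("layer %d: %d MiB\n", i, sz/(1024*1024))
	}
	fmt.Printf("total: %d MiB\n", total/(1024*1024))
	// A placer can now use the individual sizes above instead of
	// total/len(layers) when deciding which layers fit on each GPU.
}
```

With this kind of interleaving, the five small local layers and the one full-context layer have very different footprints, which is exactly why sizing a GPU by the average can overflow the device that ends up holding the large layer.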
Jesse Gross authored
- 13 Mar, 2025 1 commit
Michael Yang authored
- 04 Mar, 2025 1 commit
Daniel Hiltgen authored
* Include unified vision layers in memory prediction
  For newer vision models with a single gguf, include the projection estimates.
* Adjust CLI to handle both styles of vision model metadata
* Wire up new tokenizers for new engine
  If we're loading the new engine, use the new model text processor instead of calling into cgo wrappers for llama.cpp. This also cleans up some tech debt from the older tokenization flow for the C++ server, which was no longer used, and adjusts the grammar handling logic to pass through to the new engine instead of using the cgo schema-to-grammar call.
* Lay foundation for auto selection of new engine
- 14 Feb, 2025 1 commit
Michael Yang authored
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces 3 high-level interfaces and a bunch of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`.
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc.) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`.
- `ml.Tensor` defines the interface for a tensor and tensor operations.

This is the first implementation of the new engine. Follow-up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080), with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
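As a rough picture of how the three interfaces could fit together, here is a simplified Go sketch; the method sets are assumptions for illustration and do not match the real definitions in `model/model.go` and `ml/backend.go`.

```go
package main

import "fmt"

// Tensor is a minimal stand-in for ml.Tensor.
type Tensor interface {
	Shape() []int
}

// Backend loads a pretrained model onto hardware and exposes its tensors,
// in the spirit of ml.Backend (backed by ggml through cgo).
type Backend interface {
	Get(name string) Tensor       // look up a loaded weight by name
	NewTensor(shape ...int) Tensor
}

// Model implements a model architecture's forward pass, in the spirit of
// model.Model; Forward is called to generate completions.
type Model interface {
	Forward(b Backend, inputs []int32) (Tensor, error)
}

// Dummy implementations so the sketch runs.
type tensor struct{ shape []int }

func (t tensor) Shape() []int { return t.shape }

type backend struct{}

func (backend) Get(name string) Tensor        { return tensor{shape: []int{4096, 4096}} }
func (backend) NewTensor(shape ...int) Tensor { return tensor{shape: shape} }

type llama struct{}

func (llama) Forward(b Backend, inputs []int32) (Tensor, error) {
	// A real implementation would apply embeddings, attention blocks, etc.,
	// using tensors fetched from the backend.
	_ = b.Get("token_embd.weight")
	return b.NewTensor(len(inputs), 4096), nil
}

func main() {
	var m Model = llama{}
	out, _ := m.Forward(backend{}, []int32{1, 2, 3})
	fmt.Println(out.Shape())
}
```

The separation matters: architectures only describe the forward pass, while the backend owns hardware placement and tensor storage, which is what lets the same model code run on different devices.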
- 10 Dec, 2024 1 commit
frob authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
- 04 Dec, 2024 1 commit
Sam authored
- 03 Dec, 2024 1 commit
Sam authored
- 01 Nov, 2024 1 commit
Michael Yang authored
- 18 Oct, 2024 1 commit
Patrick Devine authored
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Jesse Gross <jesse@ollama.com>
- 17 Oct, 2024 1 commit
Daniel Hiltgen authored
Cleaning up Go package naming
- 06 Sep, 2024 1 commit
Daniel Hiltgen authored
When we determine a GPU is too small for any layers, it's not always clear why. This will help troubleshoot those scenarios.
- 05 Sep, 2024 1 commit
Daniel Hiltgen authored
Provide a mechanism for users to set aside an amount of VRAM on each GPU, to make room for other applications they want to start after Ollama or to work around memory prediction bugs.
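A minimal sketch of how such a reservation might be applied when sizing each GPU; treating the setting as a byte count in an OLLAMA_GPU_OVERHEAD variable is an assumption here rather than a confirmed detail.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// gpuOverheadBytes reads the number of bytes to set aside on every GPU.
// The OLLAMA_GPU_OVERHEAD name is assumed for illustration.
func gpuOverheadBytes() uint64 {
	v := os.Getenv("OLLAMA_GPU_OVERHEAD")
	if v == "" {
		return 0
	}
	n, err := strconv.ParseUint(v, 10, 64)
	if err != nil {
		return 0 // ignore unparsable values rather than failing startup
	}
	return n
}

// usableVRAM subtracts the reservation from a GPU's reported free memory.
func usableVRAM(freeBytes uint64) uint64 {
	overhead := gpuOverheadBytes()
	if overhead >= freeBytes {
		return 0 // nothing left for layers on this GPU
	}
	return freeBytes - overhead
}

func main() {
	os.Setenv("OLLAMA_GPU_OVERHEAD", strconv.FormatUint(1<<30, 10)) // reserve 1 GiB
	fmt.Printf("usable: %d MiB\n", usableVRAM(8<<30)/(1<<20))
}
```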
- 20 Jun, 2024 1 commit
Michael Yang authored
- 18 Jun, 2024 2 commits
Daniel Hiltgen authored
The recent refactoring of the memory prediction assumed all layers are the same size, but for some models (like deepseek-coder-v2) this is not the case, so our predictions were significantly off.
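In other words, the estimate should sum each layer's actual size instead of multiplying one representative layer by the layer count. A tiny illustrative Go sketch (the sizes are made up):

```go
package main

import "fmt"

// Per-layer sizes in bytes; models like deepseek-coder-v2 mix small and
// large layers, so these numbers are illustrative only.
var layerBytes = []uint64{512 << 20, 512 << 20, 2048 << 20, 512 << 20}

func main() {
	// Uniform assumption: every layer costs the same as the first one.
	uniform := layerBytes[0] * uint64(len(layerBytes))

	// Per-layer approach: add up what each layer actually needs.
	var exact uint64
	for _, b := range layerBytes {
		exact += b
	}

	fmt.Printf("uniform estimate: %d MiB, per-layer estimate: %d MiB\n",
		uniform>>20, exact>>20)
}
```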
Daniel Hiltgen authored
Prior to this change, we logged the memory prediction multiple times as the scheduler iterated to find a suitable configuration, which could be confusing since only the last log before the server starts is actually valid. We now log once, just before starting the server, using the final configuration. The log also reports which library is in use instead of always saying "offloading to gpu", even when running on CPU.
- 14 Jun, 2024 3 commits
Daniel Hiltgen authored
Daniel Hiltgen authored
Daniel Hiltgen authored
Still not complete; our prediction needs some refinement to understand each discrete GPU's available space so we can see how many layers fit in each one. Since we can't split a single layer across multiple GPUs, we can't treat the free space as one logical block.
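Because a single layer can't straddle devices, fitting has to be checked per GPU rather than against the pooled free space. A hypothetical Go sketch of that kind of placement, not the scheduler's actual algorithm:

```go
package main

import "fmt"

// placeLayers assigns whole layers to GPUs in order, never splitting a layer
// across devices. It returns, for each GPU, the indices of the layers placed
// on it; leftover layers stay on the CPU. Purely illustrative.
func placeLayers(layerBytes []uint64, gpuFree []uint64) (perGPU [][]int, cpu []int) {
	perGPU = make([][]int, len(gpuFree))
	g := 0
	for i, sz := range layerBytes {
		for g < len(gpuFree) && gpuFree[g] < sz {
			g++ // this GPU can't hold the next layer; move to the next device
		}
		if g == len(gpuFree) {
			cpu = append(cpu, i) // out of GPUs; remaining layers run on CPU
			continue
		}
		gpuFree[g] -= sz
		perGPU[g] = append(perGPU[g], i)
	}
	return perGPU, cpu
}

func main() {
	layers := []uint64{3 << 30, 3 << 30, 3 << 30, 3 << 30} // four 3 GiB layers
	gpus := []uint64{7 << 30, 5 << 30}                     // free bytes per discrete GPU
	// Total free space (12 GiB) equals the total layer size, but because
	// layers can't straddle devices, only three of the four layers fit.
	perGPU, cpu := placeLayers(layers, gpus)
	fmt.Println("per GPU:", perGPU, "cpu:", cpu)
}
```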
- 04 Jun, 2024 2 commits
Michael Yang authored
Michael Yang authored
- 24 May, 2024 1 commit
Patrick Devine authored
- 13 May, 2024 2 commits
Michael Yang authored
Michael Yang authored
- 10 May, 2024 1 commit
Jeffrey Morgan authored
* don't clamp ctx size in `PredictServerFit`
* minimum 4 context
* remove context warning
- 08 May, 2024 1 commit
Daniel Hiltgen authored
This records more GPU usage information for eventual UX inclusion.
- 07 May, 2024 1 commit
Michael Yang authored
- 05 May, 2024 1 commit
Daniel Hiltgen authored
This moves all the env var reading into one central module and logs the loaded config once at startup, which should help when troubleshooting user server logs.
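The pattern is roughly the following sketch: read the environment once into a struct and log it at startup. The field set shown is a small illustrative subset, not the real module.

```go
package main

import (
	"log/slog"
	"os"
	"strconv"
)

// Config gathers environment-derived settings in one place instead of
// scattering os.Getenv calls around the codebase. Fields are illustrative.
type Config struct {
	Debug       bool
	NumParallel int
	MaxLoaded   int
}

// Load reads the environment once at startup.
func Load() Config {
	return Config{
		Debug:       os.Getenv("OLLAMA_DEBUG") != "",
		NumParallel: intEnv("OLLAMA_NUM_PARALLEL", 1),
		MaxLoaded:   intEnv("OLLAMA_MAX_LOADED_MODELS", 1),
	}
}

func intEnv(name string, def int) int {
	if v := os.Getenv(name); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

func main() {
	cfg := Load()
	// Logging the loaded config once makes user server logs self-describing.
	slog.Info("server config", "debug", cfg.Debug,
		"num_parallel", cfg.NumParallel, "max_loaded_models", cfg.MaxLoaded)
}
```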
- 01 May, 2024 1 commit
Jeffrey Morgan authored
- 26 Apr, 2024 1 commit
Michael Yang authored
- 25 Apr, 2024 1 commit
Michael Yang authored
- 24 Apr, 2024 1 commit
Daniel Hiltgen authored
If we get our predictions wrong, this can be used to set a lower memory limit as a workaround. Recent multi-GPU refactoring accidentally removed it, so this adds it back.
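Such an escape valve can be as simple as capping the free VRAM the predictor is allowed to see; the OLLAMA_MAX_VRAM name below is an assumption for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// clampFreeVRAM limits the free memory reported to the layer-placement
// prediction. The OLLAMA_MAX_VRAM name is assumed for illustration.
func clampFreeVRAM(detected uint64) uint64 {
	if v := os.Getenv("OLLAMA_MAX_VRAM"); v != "" {
		if limit, err := strconv.ParseUint(v, 10, 64); err == nil && limit < detected {
			return limit
		}
	}
	return detected
}

func main() {
	os.Setenv("OLLAMA_MAX_VRAM", strconv.FormatUint(6<<30, 10)) // pretend only 6 GiB
	fmt.Printf("predictor sees %d MiB\n", clampFreeVRAM(24<<30)>>20)
}
```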
- 23 Apr, 2024 1 commit
Daniel Hiltgen authored
This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
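A rough sketch of how those two knobs could interact; only the variable names come from the commit message, and the scheduler shape shown here is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
)

// envInt reads an integer setting with a default, e.g. OLLAMA_NUM_PARALLEL=1.
func envInt(name string, def int) int {
	if v := os.Getenv(name); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return def
}

// runner represents one loaded model; its buffered channel acts as a
// semaphore bounding concurrent requests against that model. Illustrative only.
type runner struct{ slots chan struct{} }

func newRunner(parallel int) *runner { return &runner{slots: make(chan struct{}, parallel)} }

func (r *runner) handle(req string, wg *sync.WaitGroup) {
	defer wg.Done()
	r.slots <- struct{}{}        // wait for a free slot on this model
	defer func() { <-r.slots }() // release it when the request finishes
	fmt.Println("completed", req)
}

func main() {
	parallel := envInt("OLLAMA_NUM_PARALLEL", 1)        // requests per loaded model
	maxLoaded := envInt("OLLAMA_MAX_LOADED_MODELS", 1)  // models resident at once

	runners := make(map[string]*runner) // model name -> runner, capped at maxLoaded
	_ = maxLoaded                       // a real scheduler would evict beyond this cap

	r := newRunner(parallel)
	runners["llama3"] = r

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go r.handle(fmt.Sprintf("request-%d", i), &wg)
	}
	wg.Wait()
}
```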