1. 30 Apr, 2025 1 commit
    • Fix "Stopping..." scheduler hang (#10487) · 415c8fcc
      Daniel Hiltgen authored
      * Adjust initial scheduler refCount
      
      Ensure we only set the refCount on success
      
      * sched: fix lock order inversion deadlock
      
      Under certain race conditions, the scheduler could deadlock while trying
      to update free space information at the same time a model was trying to
      unload.
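      A lock order inversion like this usually means two code paths take the same pair of locks in opposite order. A minimal, hypothetical Go sketch of the standard fix, snapshotting state under one lock and releasing it before taking the other (the scheduler and runner types here are illustrative, not Ollama's actual code):

      ```go
      package main

      import (
          "fmt"
          "sync"
      )

      type runner struct {
          mu       sync.Mutex
          refCount int
          vramSize uint64
      }

      type scheduler struct {
          mu     sync.Mutex
          loaded map[string]*runner
      }

      // updateFreeSpace snapshots the loaded runners under the scheduler lock,
      // then releases it BEFORE touching each runner's own lock. Holding both
      // locks at once, in the opposite order of the unload path, is what can
      // deadlock.
      func (s *scheduler) updateFreeSpace() uint64 {
          s.mu.Lock()
          snapshot := make([]*runner, 0, len(s.loaded))
          for _, r := range s.loaded {
              snapshot = append(snapshot, r)
          }
          s.mu.Unlock() // drop the scheduler lock before taking runner locks

          var used uint64
          for _, r := range snapshot {
              r.mu.Lock()
              used += r.vramSize
              r.mu.Unlock()
          }
          return used
      }

      func main() {
          s := &scheduler{loaded: map[string]*runner{
              "llama3": {refCount: 1, vramSize: 6 << 30},
          }}
          fmt.Printf("VRAM in use: %d bytes\n", s.updateFreeSpace())
      }
      ```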
  2. 29 Apr, 2025 1 commit
    • lower default num parallel to 2 · fe5b9bb2
      Devon Rifkin authored
      This is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral, though: even though the old 4x2k limit and the new 2x4k limit are memory equivalent, the 1x fallback is larger with 4k.
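      The arithmetic behind that tradeoff is simply numParallel x num_ctx; a small sketch (the per-token byte figure is illustrative, not Ollama's real KV cache estimate):

      ```go
      package main

      import "fmt"

      func main() {
          const bytesPerToken int64 = 512 << 10 // illustrative per-token KV cost, not a real measurement

          configs := []struct{ parallel, ctx int }{
              {4, 2048}, // old default: 4 x 2k
              {2, 4096}, // new default: 2 x 4k -- same total token budget
              {1, 2048}, // old low-VRAM fallback
              {1, 4096}, // new low-VRAM fallback is twice as large
          }
          for _, c := range configs {
              total := c.parallel * c.ctx
              fmt.Printf("%d x %d = %5d tokens (~%4d MiB KV cache)\n",
                  c.parallel, c.ctx, total, int64(total)*bytesPerToken>>20)
          }
      }
      ```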
  3. 28 Apr, 2025 1 commit
  4. 22 Apr, 2025 1 commit
    • increase default context length to 4096 (#10364) · 424f6486
      Devon Rifkin authored
      * increase default context length to 4096
      
      We lower the default numParallel from 4 to 2 and use these "savings" to
      double the default context length from 2048 to 4096.
      
      We're memory neutral in cases when we previously would've used
      numParallel == 4, but we add the following mitigation to handle some
      cases where we would have previously fallen back to 1x2048 due to low
      VRAM: we decide between 2048 and 4096 using a runtime check, choosing
      2048 if we're on a one GPU system with total VRAM of <= 4 GB. We
      purposefully don't check the available VRAM because we don't want the
      context window size to change unexpectedly based on the available VRAM.
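      A minimal sketch of a check along those lines, assuming a hypothetical gpuInfo type; this is not the actual implementation:

      ```go
      package main

      import "fmt"

      type gpuInfo struct {
          Name      string
          TotalVRAM uint64 // bytes
      }

      // defaultContextLength mirrors the rule described above: fall back to
      // 2048 only on a single-GPU system whose *total* VRAM is <= 4 GiB.
      // Available VRAM is deliberately ignored so the default does not change
      // unexpectedly from run to run.
      func defaultContextLength(gpus []gpuInfo) int {
          if len(gpus) == 1 && gpus[0].TotalVRAM <= 4<<30 {
              return 2048
          }
          return 4096
      }

      func main() {
          small := []gpuInfo{{Name: "small-gpu", TotalVRAM: 4 << 30}}
          big := []gpuInfo{{Name: "big-gpu", TotalVRAM: 24 << 30}}
          fmt.Println(defaultContextLength(small)) // 2048
          fmt.Println(defaultContextLength(big))   // 4096
      }
      ```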
      
      We plan on making the default even larger, but this is a relatively
      low-risk change we can make to quickly double it.
      
      * fix tests
      
      add an explicit context length so they don't get truncated. The code
      that converts -1 from being a signal for doing a runtime check isn't
      running as part of these tests.
      
      * tweak small gpu message
      
      * clarify context length default
      
      also make it actually show up in `ollama serve --help`
  5. 09 Apr, 2025 1 commit
  6. 02 Apr, 2025 1 commit
  7. 01 Apr, 2025 1 commit
  8. 26 Mar, 2025 1 commit
    • ggml: Support heterogeneous KV cache layer sizes in memory estimation · f66216e3
      Jesse Gross authored
      Gemma3 uses sliding windows for its context on 5 of every 6 layers, significantly
      reducing memory usage but leading to uneven usage across layers,
      which makes allocating layers to the correct GPU difficult. We currently
      estimate very conservatively by assuming all layers are consistent
      at the max size.
      
      Llama3.2-vision is also inconsistent between self-attention and cross-attention
      layers - at the moment, we calculate the correct total size
      and then average it across layers. In some cases, this may lead
      to crashes if a large layer is placed on a GPU sized by the average.
      
      This change allows memory estimation to calculate per-layer KV cache sizes
      and take them into account when placing layers onto GPUs. We already do
      this for weights that vary per tensor, so this is a logical extension.
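      A simplified sketch of the difference between the two estimates described above; the layer sizes and helper below are illustrative, not Ollama's actual memory estimator:

      ```go
      package main

      import "fmt"

      // kvEstimate returns both the old-style estimate (every layer assumed to
      // be as large as the biggest one) and the per-layer sum this change
      // enables. With sliding-window layers much smaller than full-attention
      // layers, the conservative estimate can be far too high.
      func kvEstimate(layerKV []uint64) (conservative, perLayer uint64) {
          var largest uint64
          for _, sz := range layerKV {
              perLayer += sz
              if sz > largest {
                  largest = sz
              }
          }
          conservative = largest * uint64(len(layerKV))
          return conservative, perLayer
      }

      func main() {
          // e.g. 5 of every 6 layers use a small sliding window, 1 uses full context
          layers := []uint64{64 << 20, 64 << 20, 64 << 20, 64 << 20, 64 << 20, 512 << 20}
          conservative, perLayer := kvEstimate(layers)
          fmt.Printf("assuming max size: %d MiB, summing per layer: %d MiB\n",
              conservative>>20, perLayer>>20)
      }
      ```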
      
      Fixes #9730
      Fixes #9890
  9. 20 Feb, 2025 1 commit
  10. 14 Feb, 2025 1 commit
    • next ollama runner (#7913) · 58245413
      Michael Yang authored
      
      
      feat: add new Ollama engine using ggml through cgo
      
      This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.
      
      - `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
      - `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
      - `ml.Tensor` defines the interface for a tensor and tensor operations (a trimmed sketch of all three interfaces follows below)
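      A trimmed, hypothetical sketch of those three interfaces as described above; the method sets and signatures are approximations, not the real definitions in `model/model.go` and `ml/backend.go`:

      ```go
      package main

      // Tensor is a stand-in for ml.Tensor: an opaque handle plus the tensor
      // operations a model needs (only a couple are shown here).
      type Tensor interface {
          Add(other Tensor) Tensor
          Mul(other Tensor) Tensor
      }

      // Backend is a stand-in for ml.Backend: it loads a pretrained model onto
      // hardware and hands out the loaded tensors by name.
      type Backend interface {
          Get(name string) Tensor
      }

      // Model is a stand-in for model.Model: an architecture such as llama or
      // mllama implements its forward pass here to generate completions.
      type Model interface {
          Forward(backend Backend, inputIDs []int32) (logits Tensor, err error)
      }

      func main() {}
      ```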
      
      This is the first implementation of the new engine. Follow up PRs will implement more features:
      
      - non-greedy sampling (#8410)
      - integration with Ollama and KV caching (#8301)
      - more model support (#9080) with more coming soon
      Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
  11. 10 Dec, 2024 1 commit
  12. 06 Nov, 2024 1 commit
    • sched: Lift parallel restriction for multimodal models except mllama · 6cd56687
      Jesse Gross authored
      The Go runner does not have a problem with supporting parallel
      requests for most multimodal models. Now that we won't be potentially
      falling back to server.cpp, this restriction can be lifted.
      
      However, the new mllama model can't support parallel requests, so we
      will need to keep a restriction for that.
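      A hypothetical sketch of the kind of check this implies; the helper and architecture string below are illustrative, not the scheduler's real code:

      ```go
      package main

      import "fmt"

      // capParallel caps parallel requests at 1 only for models that cannot
      // handle them; other multimodal models keep the requested parallelism.
      func capParallel(requested int, arch string) int {
          if arch == "mllama" && requested > 1 {
              return 1
          }
          return requested
      }

      func main() {
          fmt.Println(capParallel(4, "mllama")) // 1
          fmt.Println(capParallel(4, "llava"))  // 4
      }
      ```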
  13. 17 Oct, 2024 1 commit
  14. 11 Sep, 2024 1 commit
  15. 22 Aug, 2024 1 commit
    • Fix embeddings memory corruption (#6467) · 90ca8417
      Daniel Hiltgen authored
      * Fix embeddings memory corruption
      
      The patch was leading to a buffer overrun corruption.  Once removed though, parallelism
      in server.cpp led to hitting an assert due to slot/seq IDs being >= token count.  To
      work around this, only use slot 0 for embeddings.
      
      * Fix embed integration test assumption
      
      The token eval count has changed with recent llama.cpp bumps (0.3.5+)
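      The workaround itself lives in the C++ server code, but the idea is easy to sketch; a hypothetical Go version of "route embedding requests to slot 0 only" (all names are illustrative):

      ```go
      package main

      import "fmt"

      type request struct {
          Prompt    string
          Embedding bool // true for embedding requests
      }

      // assignSlot spreads completion requests across all slots but pins
      // embedding requests to slot 0, mirroring the workaround described above.
      func assignSlot(req request, next func() int) int {
          if req.Embedding {
              return 0
          }
          return next()
      }

      func main() {
          slot := 0
          roundRobin := func() int { slot = (slot + 1) % 4; return slot }
          fmt.Println(assignSlot(request{Embedding: true}, roundRobin)) // 0
          fmt.Println(assignSlot(request{Prompt: "hi"}, roundRobin))    // 1
      }
      ```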
  16. 18 Aug, 2024 2 commits
  17. 17 Aug, 2024 1 commit
  18. 13 Aug, 2024 1 commit
    • lint · 2697d7f5
      Michael Yang authored
      - fixes printf: non-constant format string in call to fmt.Printf
      - fixes SA1032: arguments have the wrong order
      - disables testifylint
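      For reference, the printf warning mentioned above fires when a variable is passed where a constant format string is expected; a small example of the pattern and its fix:

      ```go
      package main

      import "fmt"

      func main() {
          msg := "hello %!"

          // go vet flags this: "non-constant format string in call to fmt.Printf".
          // If msg happens to contain formatting verbs, the output is garbled.
          // fmt.Printf(msg)

          // The fix is to pass a constant format string instead:
          fmt.Printf("%s\n", msg)
      }
      ```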
  19. 02 Aug, 2024 1 commit
  20. 30 Jul, 2024 1 commit
    • Prevent partial loading on mixed GPU brands · 34542099
      Daniel Hiltgen authored
      In multi-brand GPU setups, if we couldn't fully load the model we
      would fall through the scheduler and mistakenly try to load across
      a mix of brands.  This makes sure we find the set of GPU(s) that
      best fits the partial load.
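      A hedged sketch of that selection: group candidate GPUs by library/brand and only place a partial load within a single group (the types and scoring below are illustrative, not the scheduler's real logic):

      ```go
      package main

      import "fmt"

      type gpu struct {
          Library  string // e.g. "cuda", "rocm"
          FreeVRAM uint64
      }

      // bestSingleBrand picks the brand whose GPUs together offer the most free
      // VRAM, so a partial load never spans a mix of brands.
      func bestSingleBrand(gpus []gpu) []gpu {
          byBrand := map[string][]gpu{}
          for _, g := range gpus {
              byBrand[g.Library] = append(byBrand[g.Library], g)
          }
          var best []gpu
          var bestFree uint64
          for _, group := range byBrand {
              var free uint64
              for _, g := range group {
                  free += g.FreeVRAM
              }
              if free > bestFree {
                  bestFree, best = free, group
              }
          }
          return best
      }

      func main() {
          mixed := []gpu{
              {Library: "cuda", FreeVRAM: 8 << 30},
              {Library: "rocm", FreeVRAM: 12 << 30},
          }
          fmt.Println(bestSingleBrand(mixed)) // only the ROCm group is considered
      }
      ```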
  21. 22 Jul, 2024 4 commits
  22. 11 Jul, 2024 1 commit
  23. 09 Jul, 2024 1 commit
  24. 07 Jul, 2024 1 commit
  25. 03 Jul, 2024 2 commits
    • Only set default keep_alive on initial model load · 955f2a4e
      Daniel Hiltgen authored
      This change fixes the handling of keep_alive so that if a client
      request omits the setting, we only apply the default on the initial load.  Once
      the model is loaded, if new requests leave this unset, we'll keep
      whatever keep_alive was already in effect.
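      A minimal sketch of the described behavior; the helper, its parameters, and the five-minute default are illustrative, not the server's actual fields:

      ```go
      package main

      import (
          "fmt"
          "time"
      )

      const defaultKeepAlive = 5 * time.Minute // illustrative default

      // effectiveKeepAlive applies the default only when the model is not yet
      // loaded. For an already-loaded model, an unset keep_alive in the request
      // leaves the current value untouched.
      func effectiveKeepAlive(reqKeepAlive *time.Duration, loaded bool, current time.Duration) time.Duration {
          if reqKeepAlive != nil {
              return *reqKeepAlive // explicit request always wins
          }
          if !loaded {
              return defaultKeepAlive // initial load: apply the default
          }
          return current // loaded and unset: keep whatever was there
      }

      func main() {
          fmt.Println(effectiveKeepAlive(nil, false, 0))             // 5m0s
          fmt.Println(effectiveKeepAlive(nil, true, 30*time.Minute)) // 30m0s
          hour := time.Hour
          fmt.Println(effectiveKeepAlive(&hour, true, 30*time.Minute)) // 1h0m0s
      }
      ```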
    • Prevent loading models larger than total memory · 3c75113e
      Daniel Hiltgen authored
      Users may not realize that the shiny new model they're trying to load
      fits on their disk, but can't be loaded into system+GPU memory.  Today
      we crash, but with this fix, we'll give them a better error message
      before even trying to load it.
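      A rough sketch of such a guard, comparing the model's size against combined system and GPU memory before attempting the load (the names and error text are illustrative):

      ```go
      package main

      import "fmt"

      // checkFits rejects a model that cannot possibly fit in system RAM plus
      // total VRAM, instead of letting the load crash later.
      func checkFits(modelSize, systemRAM, totalVRAM uint64) error {
          if modelSize > systemRAM+totalVRAM {
              return fmt.Errorf("model requires %d GiB but only %d GiB of system+GPU memory is available",
                  modelSize>>30, (systemRAM+totalVRAM)>>30)
          }
          return nil
      }

      func main() {
          if err := checkFits(120<<30, 32<<30, 24<<30); err != nil {
              fmt.Println("refusing to load:", err)
              return
          }
          fmt.Println("fits")
      }
      ```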
  26. 01 Jul, 2024 1 commit
  27. 25 Jun, 2024 1 commit
    • llm: speed up gguf decoding by a lot (#5246) · cb42e607
      Blake Mizerany authored
      Previously, some costly things were causing the loading of GGUF files
      and their metadata and tensor information to be VERY slow:
      
        * Too many allocations when decoding strings
        * Hitting disk for each read of each key and value, resulting in a
          not-okay amount of syscalls/disk I/O.
      
      The show API is now down to 33ms from 800ms+ for llama3 on an M3
      MacBook Pro.
      
      This commit also prevents collecting large arrays of values when
      decoding GGUFs (if desired). When such keys are encountered, their
      values are null, and are encoded as such in JSON.
      
      Also, this fixes a broken test that was not encoding valid GGUF.
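      The commit lists the causes rather than the cure, but the usual remedy for "hitting disk for each read" plus per-string allocations is a buffered reader and a single allocation per string; a generic sketch along those lines, not the actual decoder:

      ```go
      package main

      import (
          "bufio"
          "encoding/binary"
          "fmt"
          "io"
          "os"
      )

      // readString reads a length-prefixed string the way many GGUF-style
      // decoders do, but from a buffered reader so each key/value does not
      // turn into its own syscall, and with a single allocation per string.
      func readString(r *bufio.Reader) (string, error) {
          var n uint64
          if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
              return "", err
          }
          buf := make([]byte, n)
          if _, err := io.ReadFull(r, buf); err != nil {
              return "", err
          }
          return string(buf), nil
      }

      func main() {
          f, err := os.Open("model.gguf") // placeholder path
          if err != nil {
              fmt.Println("open:", err)
              return
          }
          defer f.Close()

          // A large buffer means metadata keys are read from memory, not disk.
          r := bufio.NewReaderSize(f, 1<<20)
          s, err := readString(r)
          fmt.Println(s, err)
      }
      ```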
  28. 21 Jun, 2024 2 commits
    • Disable concurrency for AMD + Windows · 9929751c
      Daniel Hiltgen authored
      Until ROCm v6.2 ships, we won't be able to get accurate free memory
      reporting on Windows, which makes automatic concurrency too risky.
      Users can still opt in, but will need to pay attention to model sizes; otherwise they may thrash/page VRAM or cause OOM crashes.
      All other platforms and GPUs have accurate VRAM reporting wired
      up now, so we can turn on concurrency by default.
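      A hypothetical sketch of the platform gate described above; the parameter and the fallback numbers are illustrative, not Ollama's actual defaults:

      ```go
      package main

      import (
          "fmt"
          "runtime"
      )

      // defaultConcurrency turns concurrency off by default when free-VRAM
      // reporting can't be trusted, i.e. ROCm on Windows until ROCm v6.2.
      // Users can still override this via environment variables.
      func defaultConcurrency(gpuLibrary string) (maxRunners, numParallel int) {
          if runtime.GOOS == "windows" && gpuLibrary == "rocm" {
              return 1, 1
          }
          return 3, 4 // illustrative defaults, not the real numbers
      }

      func main() {
          fmt.Println(defaultConcurrency("rocm")) // 1 1 on Windows, 3 4 elsewhere
          fmt.Println(defaultConcurrency("cuda")) // 3 4
      }
      ```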
    • Enable concurrency by default · 17b7186c
      Daniel Hiltgen authored
      This adjusts our default settings to enable multiple models and parallel
      requests to a single model.  Users can still override these via the same
      env var settings as before.  Parallel has a direct impact on
      num_ctx, which in turn can have a significant impact on small-VRAM GPUs,
      so this change also refines the algorithm so that when parallel is not
      explicitly set by the user, we try to find a reasonable default that fits
      the model on their GPU(s).  As before, multiple models will only load
      concurrently if they fully fit in VRAM.
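      A simplified sketch of the "find a parallel value that fits" idea: start at the preferred parallelism and step down whenever the implied KV cache no longer fits in free VRAM (the sizing math is illustrative):

      ```go
      package main

      import "fmt"

      // pickNumParallel steps down from the preferred parallelism until the
      // model weights plus the KV cache implied by parallel*numCtx fit in the
      // available VRAM, falling back to 1 if nothing fits.
      func pickNumParallel(preferred, numCtx int, perTokenKV, weights, freeVRAM uint64) int {
          for p := preferred; p > 1; p-- {
              need := weights + uint64(p*numCtx)*perTokenKV
              if need <= freeVRAM {
                  return p
              }
          }
          return 1
      }

      func main() {
          const (
              weights    = 5 << 30   // 5 GiB of weights (illustrative)
              perTokenKV = 256 << 10 // 256 KiB per KV token (illustrative)
          )
          fmt.Println(pickNumParallel(4, 2048, perTokenKV, weights, 8<<30)) // fits at 4
          fmt.Println(pickNumParallel(4, 2048, perTokenKV, weights, 6<<30)) // steps down to 2
      }
      ```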
  29. 14 Jun, 2024 6 commits