Commits · 82a02e18d96ce2fff1791e6d1a080d3afa11370e · OpenDAS / ollama

06 Nov, 2024 1 commit

sched: Lift parallel restriction for multimodal models except mllama · 6cd56687

Jesse Gross authored Oct 30, 2024

The Go runner does not have a problem with supporting parallel
requests for most multimodal models. Now that we won't be potentially
falling back to server.cpp, this restriction can be lifted.

However, the new mllama model can't support parallel requests, so we
will need to keep a restriction for that.

6cd56687

17 Oct, 2024 1 commit
- Rename gpu package discover (#7143) · 05cd82ef
  Daniel Hiltgen authored Oct 16, 2024
```
Cleaning up go package naming
```
  05cd82ef
11 Sep, 2024 1 commit
- add "stop" command (#6739) · abed273d
  Patrick Devine authored Sep 11, 2024
  
  abed273d
22 Aug, 2024 1 commit

Fix embeddings memory corruption (#6467) · 90ca8417

Daniel Hiltgen authored Aug 22, 2024

* Fix embeddings memory corruption

The patch was leading to a buffer overrun corruption.  Once removed though, parallism
in server.cpp lead to hitting an assert due to slot/seq IDs being >= token count.  To
work around this, only use slot 0 for embeddings.

* Fix embed integration test assumption

The token eval count has changed with recent llama.cpp bumps (0.3.5+)

90ca8417

18 Aug, 2024 2 commits
- Fix white space. · 885cf450
  Richard Lyons authored Aug 18, 2024
  
  885cf450
- Reset NumCtx. · 9352eeb7
  Richard Lyons authored Aug 18, 2024
  
  9352eeb7
17 Aug, 2024 1 commit
- Override numParallel only if unset. · 0ad0e738
  Richard Lyons authored Aug 18, 2024
  
  0ad0e738
13 Aug, 2024 1 commit

lint · 2697d7f5

Michael Yang authored Aug 13, 2024

- fixes printf: non-constant format string in call to fmt.Printf
- fixes SA1032: arguments have the wrong order
- disables testifylint

2697d7f5

02 Aug, 2024 1 commit
- lint · b732beba
  Michael Yang authored Aug 01, 2024
  
  b732beba
30 Jul, 2024 1 commit

Prevent partial loading on mixed GPU brands · 34542099

Daniel Hiltgen authored Jul 22, 2024

In mult-brand GPU setups, if we couldn't fully load the model we
would fall through the scheduler and mistakenly try to load across
a mix of brands.  This makes sure we find the set of GPU(s) that
best fit for the partial load.

34542099

22 Jul, 2024 4 commits
- comments · 85d9d73a
  Michael Yang authored Jul 08, 2024
  
  85d9d73a
- int · 0f191012
  Michael Yang authored Jul 03, 2024
  
  0f191012
- keepalive · 8570c1c0
  Michael Yang authored Jul 03, 2024
  
  8570c1c0
- bool · 55cd3ddc
  Michael Yang authored Jul 03, 2024
  
  55cd3ddc
11 Jul, 2024 1 commit
- sched: only error when over-allocating system memory (#5626) · 791650dd
  Jeffrey Morgan authored Jul 11, 2024
  
  791650dd
09 Jul, 2024 1 commit
- server: fix model reloads when setting `OLLAMA_NUM_PARALLEL` (#5560) · e4ff7329
  Jeffrey Morgan authored Jul 08, 2024
```
* server: fix unneeded model reloads when setting `OLLAMA_NUM_PARALLEL`

* remove whitespace change

* undo some changes
```
  e4ff7329
07 Jul, 2024 1 commit
- sched: don't error if paging to disk on Windows and macOS (#5523) · 0ee87615
  Jeffrey Morgan authored Jul 06, 2024
  
  0ee87615
03 Jul, 2024 2 commits

Only set default keep_alive on initial model load · 955f2a4e

Daniel Hiltgen authored Jul 02, 2024

This change fixes the handling of keep_alive so that if client
request omits the setting, we only set this on initial load.  Once
the model is loaded, if new requests leave this unset, we'll keep
whatever keep_alive was there.

955f2a4e

Prevent loading models larger than total memory · 3c75113e

Daniel Hiltgen authored Jul 03, 2024

Users may not realize the siny new model they're trying to load
fits on their disk, but can't load into system+GPU memory.  Today
we crash, but with this fix, we'll give them a better error message
before even trying to load it.

3c75113e

01 Jul, 2024 1 commit
- Fix case for NumCtx · cff3f44f
  Daniel Hiltgen authored Jul 01, 2024
  
  cff3f44f
25 Jun, 2024 1 commit

llm: speed up gguf decoding by a lot (#5246) · cb42e607

Blake Mizerany authored Jun 24, 2024

Previously, some costly things were causing the loading of GGUF files
and their metadata and tensor information to be VERY slow:

  * Too many allocations when decoding strings
  * Hitting disk for each read of each key and value, resulting in a
    not-okay amount of syscalls/disk I/O.

The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro
m3.

This commit also prevents collecting large arrays of values when
decoding GGUFs (if desired). When such keys are encountered, their
values are null, and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.

cb42e607

21 Jun, 2024 2 commits

Disable concurrency for AMD + Windows · 9929751c

Daniel Hiltgen authored Jun 19, 2024

Until ROCm v6.2 ships, we wont be able to get accurate free memory
reporting on windows, which makes automatic concurrency too risky.
Users can still opt-in but will need to pay attention to model sizes otherwise they may thrash/page VRAM or cause OOM crashes.
All other platforms and GPUs have accurate VRAM reporting wired
up now, so we can turn on concurrency by default.

9929751c

Enable concurrency by default · 17b7186c

Daniel Hiltgen authored May 06, 2024

This adjusts our default settings to enable multiple models and parallel
requests to a single model. Users can still override these by the same
env var settings as before. Parallel has a direct impact on
num_ctx, which in turn can have a significant impact on small VRAM GPUs
so this change also refines the algorithm so that when parallel is not
explicitly set by the user, we try to find a reasonable default that fits
the model on their GPU(s). As before, multiple models will only load
concurrently if they fully fit in VRAM.

17b7186c

14 Jun, 2024 6 commits
- review comments and coverage · 6f351bf5
  Daniel Hiltgen authored Jun 05, 2024
  
  6f351bf5
- Prevent multiple concurrent loads on the same gpus · ff4f0cbd
  Daniel Hiltgen authored Jun 04, 2024
```
While models are loading, the VRAM metrics are dynamic, so try
to load on a GPU that doesn't have a model actively loading, or wait
to avoid races that lead to OOMs
```
  ff4f0cbd
- Refine CPU load behavior with system memory visibility · fc37c192
  Daniel Hiltgen authored Jun 03, 2024
  
  fc37c192
- Reintroduce nvidia nvml library for windows · 434dfe30
  Daniel Hiltgen authored Jun 03, 2024
```
This library will give us the most reliable free VRAM reporting on windows
to enable concurrent model scheduling.
```
  434dfe30
- Harden unload for empty runners · 48702dd1
  Daniel Hiltgen authored May 30, 2024
  
  48702dd1
- Support forced spreading for multi GPU · 5e8ff556
  Daniel Hiltgen authored May 08, 2024
```
Our default behavior today is to try to fit into a single GPU if possible.
Some users would prefer the old behavior of always spreading across
multiple GPUs even if the model can fit into one.  This exposes that
tunable behavior.
```
  5e8ff556
04 Jun, 2024 3 commits
- lint · e40145a3
  Michael Yang authored May 21, 2024
  
  e40145a3
- some gocritic · c895a7d1
  Michael Yang authored May 21, 2024
  
  c895a7d1
- replace x/exp/slices with slices · 04f3c12b
  Michael Yang authored May 21, 2024
  
  04f3c12b
24 May, 2024 1 commit
- Move envconfig and consolidate env vars (#4608) · 4cc3be30
  Patrick Devine authored May 24, 2024
  
  4cc3be30
21 May, 2024 1 commit

Correct typo in error message (#4535) · 4434d7f4

Sang Park authored May 22, 2024

The spelling of the term "request" has been corrected, which was previously mistakenly written as "requeset" in the error log message.

4434d7f4

14 May, 2024 2 commits
- Remove VRAM convergence check for windows · ec231a79
  Daniel Hiltgen authored May 14, 2024
```
The APIs we query are optimistic on free space, and windows pages
VRAM, so we don't have to wait to see reported usage recover on unload
```
  ec231a79
- Ollama `ps` command for showing currently loaded models (#4327) · 68459888
  Patrick Devine authored May 13, 2024
  
  68459888
10 May, 2024 2 commits
- Always use the sorted list of GPUs · 4142c3ef
  Daniel Hiltgen authored May 10, 2024
```
Make sure the first GPU has the most free space
```
  4142c3ef
- Don't clamp ctx size in `PredictServerFit` (#4317) · bb6fd022
  Jeffrey Morgan authored May 10, 2024
```
* dont clamp ctx size in `PredictServerFit`

* minimum 4 context

* remove context warning
```
  bb6fd022
09 May, 2024 1 commit

Wait for GPU free memory reporting to converge · 354ad925

Daniel Hiltgen authored May 09, 2024

The GPU drivers take a while to update their free memory reporting, so we need
to wait until the values converge with what we're expecting before proceeding
to start another runner in order to get an accurate picture.

354ad925

06 May, 2024 1 commit
- Skip scheduling cancelled requests, always reload unloaded runners (#4189) · c9f98622
  Jeffrey Morgan authored May 06, 2024
  
  c9f98622