- 02 Aug, 2024 1 commit
-
-
Michael Yang authored
-
- 31 Jul, 2024 2 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
- 30 Jul, 2024 2 commits
-
-
royjhan authored
* add prompt tokens to embed response
* rm slog
* metrics
* types
* prompt n
* clean up
* reset submodule
* update tests
* test name
* list metrics
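As a rough illustration of the first item above, an embed response can carry a prompt token count alongside the embeddings. The field and type names below are assumptions for the sketch, not necessarily the project's exact API types.

```go
// Hypothetical shape of an embed response that reports prompt token metrics.
package api

type EmbedResponse struct {
	Model           string      `json:"model"`
	Embeddings      [][]float32 `json:"embeddings"`
	TotalDuration   int64       `json:"total_duration,omitempty"`
	LoadDuration    int64       `json:"load_duration,omitempty"`
	PromptEvalCount int         `json:"prompt_eval_count,omitempty"` // total prompt tokens across inputs
}
```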
-
Daniel Hiltgen authored
In multi-brand GPU setups, if we couldn't fully load the model we would fall through the scheduler and mistakenly try to load across a mix of brands. This makes sure we find the set of GPU(s) that best fits the partial load.
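A minimal sketch of the brand-grouping idea described above, with made-up type and function names; the real scheduler's selection logic is more involved.

```go
package main

import "fmt"

// GPU is an illustrative stand-in for the scheduler's GPU info.
type GPU struct {
	Library string // e.g. "cuda" or "rocm"
	FreeMiB uint64
}

// bestSingleBrand groups GPUs by library/brand and returns the group with the
// most free memory, so a partial load never spans a mix of brands.
func bestSingleBrand(gpus []GPU) []GPU {
	groups := map[string][]GPU{}
	for _, g := range gpus {
		groups[g.Library] = append(groups[g.Library], g)
	}
	var best []GPU
	var bestFree uint64
	for _, group := range groups {
		var free uint64
		for _, g := range group {
			free += g.FreeMiB
		}
		if free > bestFree {
			bestFree, best = free, group
		}
	}
	return best
}

func main() {
	gpus := []GPU{{"cuda", 8192}, {"rocm", 12288}, {"cuda", 8192}}
	fmt.Println(bestSingleBrand(gpus)) // the two CUDA GPUs (16384 MiB total)
}
```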
-
- 22 Jul, 2024 1 commit
-
-
Michael Yang authored
-
- 21 Jul, 2024 1 commit
-
-
Jeffrey Morgan authored
-
- 15 Jul, 2024 1 commit
-
-
royjhan authored
* Initial Batch Embedding
* Revert "Initial Batch Embedding"
  This reverts commit c22d54895a280b54c727279d85a5fc94defb5a29.
* Initial Draft
* mock up notes
* api/embed draft
* add server function
* check normalization
* clean up
* normalization
* playing around with truncate stuff
* Truncation
* Truncation
* move normalization to go
* Integration Test Template
* Truncation Integration Tests
* Clean up
* use float32
* move normalize
* move normalize test
* refactoring
* integration float32
* input handling and handler testing
* Refactoring of legacy and new
* clear comments
* merge conflicts
* touches
* embedding type 64
* merge conflicts
* fix hanging on single string
* refactoring
* test values
* set context length
* clean up
* testing clean up
* testing clean up
* remove function closure
* Revert "remove function closure"
  This reverts commit 55d48c6ed17abe42e7a122e69d603ef0c1506787.
* remove function closure
* remove redundant error check
* clean up
* more clean up
* clean up
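A small sketch of the kind of L2 normalization the items above refer to ("move normalization to go", "use float32"); this is illustrative, not the exact server implementation.

```go
package main

import (
	"fmt"
	"math"
)

// normalize returns v scaled to unit length (L2 norm), accumulating in float64
// for a bit of extra precision before converting back to float32.
func normalize(v []float32) []float32 {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	norm := math.Sqrt(sum)
	if norm == 0 {
		return v // avoid dividing by zero for an all-zero vector
	}
	out := make([]float32, len(v))
	for i, x := range v {
		out[i] = float32(float64(x) / norm)
	}
	return out
}

func main() {
	fmt.Println(normalize([]float32{3, 4})) // [0.6 0.8]
}
```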
-
- 09 Jul, 2024 1 commit
-
-
Daniel Hiltgen authored
This breaks up some of the test scenarios to create a more reliable set of tests, as well as adding a little more coverage.
-
- 03 Jul, 2024 2 commits
-
-
Daniel Hiltgen authored
This change fixes the handling of keep_alive so that if a client request omits the setting, we only apply it on the initial load. Once the model is loaded, if new requests leave keep_alive unset, we keep whatever keep_alive was already in effect.
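A simplified sketch of that rule with invented names, assuming a 5-minute default for illustration: an explicit value from the request always wins; otherwise the default applies only on the initial load, and later requests inherit the current value.

```go
package main

import (
	"fmt"
	"time"
)

const defaultKeepAlive = 5 * time.Minute // illustrative default

// resolveKeepAlive returns the keep-alive to apply for a request. requested is
// nil when the client omitted the setting; current is the value already applied
// to a loaded model (nil when the model is not loaded yet).
func resolveKeepAlive(requested, current *time.Duration) time.Duration {
	if requested != nil {
		return *requested // an explicit request always wins
	}
	if current != nil {
		return *current // model already loaded: keep whatever was there
	}
	return defaultKeepAlive // initial load with nothing specified
}

func main() {
	fmt.Println(resolveKeepAlive(nil, nil)) // 5m0s: default applied only on initial load
}
```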
-
Daniel Hiltgen authored
Users may not realize that the shiny new model they're trying to load fits on their disk but can't fit into system+GPU memory. Today we crash; with this fix, we'll give them a better error message before even trying to load it.
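An illustrative pre-check matching that idea (names and numbers are made up, not the project's code): compare the model's estimated memory need against system RAM plus total GPU VRAM and return a clear error instead of attempting a load that will fail.

```go
package main

import "fmt"

// checkFits returns an error when the model cannot fit into the combined
// system and GPU memory, so the caller can report it before loading.
func checkFits(modelBytes, systemBytes, vramBytes uint64) error {
	if need, have := modelBytes, systemBytes+vramBytes; need > have {
		return fmt.Errorf("model requires %d MiB but only %d MiB of system+GPU memory is available",
			need>>20, have>>20)
	}
	return nil
}

func main() {
	if err := checkFits(64<<30, 16<<30, 24<<30); err != nil {
		fmt.Println(err) // reported before any load is attempted
	}
}
```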
-
- 25 Jun, 2024 1 commit
-
-
Blake Mizerany authored
Previously, some costly things were making the loading of GGUF files and their metadata and tensor information VERY slow:

* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in an excessive number of syscalls and disk I/O

The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro M3.

This commit also prevents collecting large arrays of values when decoding GGUFs (if desired). When such keys are encountered, their values are null and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.
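A rough outline of the two ideas above, using invented helpers: wrap the file in a buffered reader so each key/value read doesn't hit the disk, and skip the contents of very large arrays, recording nil so they encode as JSON null. This is a sketch of the approach, not the real decoder.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

const maxArrayItems = 1024 // beyond this, don't materialize the values

// decodeMetadata is an illustrative outline of a faster metadata pass.
func decodeMetadata(path string) (map[string]any, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	r := bufio.NewReaderSize(f, 1<<20) // buffered: not one syscall per key/value

	kv := map[string]any{}
	// ... read keys and values from r; when an array of n items is found:
	//
	//   if n > maxArrayItems {
	//       skipArray(r, n)   // advance past the data without allocating
	//       kv[key] = nil     // marshals as JSON null
	//   } else {
	//       kv[key] = readArray(r, n)
	//   }
	_ = r
	return kv, nil
}

func main() {
	kv, err := decodeMetadata("model.gguf") // hypothetical path
	fmt.Println(kv, err)
}
```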
-
- 21 Jun, 2024 1 commit
-
-
Daniel Hiltgen authored
This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these with the same env var settings as before. Parallel has a direct impact on num_ctx, which in turn can have a significant impact on small-VRAM GPUs, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
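A toy illustration of that selection, with made-up numbers and a stand-in estimator; the real prediction accounts for far more (KV cache layout, graph size, per-GPU overhead).

```go
package main

import "fmt"

// estimate is a stand-in for the real memory estimator: each parallel slot
// multiplies the effective context, and the KV cache grows with it.
func estimate(modelBytes uint64, numCtx, parallel int) uint64 {
	const bytesPerCtxToken = 512 * 1024 // made-up KV-cache cost per token
	return modelBytes + uint64(numCtx*parallel)*bytesPerCtxToken
}

// pickParallel walks down from a preferred value until the estimated
// footprint fits in available VRAM, unless the user set parallel explicitly.
func pickParallel(modelBytes, freeVRAM uint64, numCtx, userParallel int) int {
	if userParallel > 0 {
		return userParallel // explicit setting always wins
	}
	for p := 4; p > 1; p-- {
		if estimate(modelBytes, numCtx, p) <= freeVRAM {
			return p
		}
	}
	return 1
}

func main() {
	fmt.Println(pickParallel(5<<30, 8<<30, 2048, 0)) // 3: the largest value that fits here
}
```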
-
- 14 Jun, 2024 4 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
Still not complete; this needs some refinement to our prediction so it understands each discrete GPU's available space and how many layers fit in each one. Since we can't split one layer across multiple GPUs, we can't treat free space as one logical block.
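A minimal sketch of the per-GPU accounting this is pointing toward (illustrative names and numbers, not the real estimator): whole layers have to be counted per device rather than against a pooled free-space total.

```go
package main

import "fmt"

// layersPerGPU counts whole layers per GPU, because a single layer
// can't be split across devices.
func layersPerGPU(freeBytes []uint64, layerBytes uint64) []int {
	fits := make([]int, len(freeBytes))
	for i, free := range freeBytes {
		fits[i] = int(free / layerBytes) // whole layers only
	}
	return fits
}

func main() {
	// Two GPUs with 3.5 GiB free and 1 GiB layers: 3 + 3 layers fit,
	// not the 7 you'd get by treating the 7 GiB total as one logical block.
	free := []uint64{3<<30 + 512<<20, 3<<30 + 512<<20}
	fmt.Println(layersPerGPU(free, 1<<30)) // [3 3]
}
```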
-
Jeffrey Morgan authored
-
- 04 Jun, 2024 1 commit
-
-
Michael Yang authored
-
- 24 May, 2024 1 commit
-
-
Patrick Devine authored
-
- 23 May, 2024 1 commit
-
-
Jeffrey Morgan authored
* put flash attention behind flag for now
* add test
* remove print
* up timeout for scheduler tests
-
- 14 May, 2024 1 commit
-
-
Patrick Devine authored
-
- 06 May, 2024 2 commits
-
-
Daniel Hiltgen authored
The model processing was recently changed to be deferred, but this test scenario hadn't been adjusted for that change in behavior.
-
Jeffrey Morgan authored
-
- 05 May, 2024 1 commit
-
-
Daniel Hiltgen authored
This moves all the env var reading into one central module and logs the loaded config once at startup, which should help when troubleshooting user server logs.
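A minimal sketch of that pattern: the Config fields, default values, and helper below are assumptions for illustration, not the module's actual contents.

```go
package main

import (
	"log/slog"
	"os"
)

// Config holds a couple of illustrative settings; the real module covers many more.
type Config struct {
	Host        string
	NumParallel string
}

// loadConfig centralizes env var reads with fallbacks to defaults.
func loadConfig() Config {
	get := func(key, def string) string {
		if v := os.Getenv(key); v != "" {
			return v
		}
		return def
	}
	return Config{
		Host:        get("OLLAMA_HOST", "127.0.0.1:11434"),
		NumParallel: get("OLLAMA_NUM_PARALLEL", "1"),
	}
}

func main() {
	cfg := loadConfig()
	// Log the resolved configuration once at startup so it shows up in server logs.
	slog.Info("server config", "OLLAMA_HOST", cfg.Host, "OLLAMA_NUM_PARALLEL", cfg.NumParallel)
}
```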
-
- 03 May, 2024 1 commit
-
-
Daniel Hiltgen authored
This gives us more headroom on the scheduler tests to tamp down some flakes.
-
- 28 Apr, 2024 1 commit
-
-
Daniel Hiltgen authored
Prior refactoring passes accidentally removed the logic to bypass VRAM checks for CPU loads. This adds that back, along with test coverage. This also fixes the loaded-map access in the unit test to be behind the mutex, which was likely the cause of various flakes in the tests.
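An illustrative pattern (not the project's exact test) for the flake fix noted above: reads of a map that another goroutine mutates must happen under the same mutex, otherwise the race detector and intermittent failures show up.

```go
package main

import (
	"fmt"
	"sync"
)

// scheduler is a stand-in type for this sketch.
type scheduler struct {
	mu     sync.Mutex
	loaded map[string]bool
}

func (s *scheduler) loadedCount() int {
	s.mu.Lock()
	defer s.mu.Unlock() // read under the same lock the writer uses
	return len(s.loaded)
}

func main() {
	s := &scheduler{loaded: map[string]bool{}}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		s.mu.Lock()
		s.loaded["llama3"] = true
		s.mu.Unlock()
	}()
	wg.Wait()
	fmt.Println(s.loadedCount()) // 1
}
```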
-
- 25 Apr, 2024 1 commit
-
-
Jeffrey Morgan authored
* reload model if `num_gpu` changes
* don't reload on -1
* fix tests
-
- 24 Apr, 2024 3 commits
-
-
Bryce Reitano authored
-
Bryce Reitano authored
-
Bryce Reitano authored
-
- 23 Apr, 2024 2 commits
-
-
Daniel Hiltgen authored
Give the goroutine a moment to deliver the expired event.
-
Daniel Hiltgen authored
This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
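For illustration, a sketch of reading those overrides with the stated defaults; the parsing helper and clamping here are assumptions, not the project's exact code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envInt reads an integer env var, falling back to def when unset or invalid.
func envInt(key string, def int) int {
	v, err := strconv.Atoi(os.Getenv(key))
	if err != nil || v < 1 {
		return def
	}
	return v
}

func main() {
	numParallel := envInt("OLLAMA_NUM_PARALLEL", 1)    // concurrent requests per model
	maxLoaded := envInt("OLLAMA_MAX_LOADED_MODELS", 1) // models kept loaded at once
	fmt.Println(numParallel, maxLoaded)
}
```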
-