Commits · 10d49bce7052b83e5e20f1c87c6f7cd2eb135bef · OpenDAS / ollama

05 Aug, 2024 1 commit

fix concurrency test · 7ed36741

Michael Yang authored Aug 05, 2024

22 Jul, 2024 2 commits

uint64 · 1954ec59
Michael Yang authored Jul 03, 2024

Remove no longer supported max vram var · cc269ba0

Daniel Hiltgen authored Jul 22, 2024

The OLLAMA_MAX_VRAM env var was a temporary workaround for OOM
scenarios. Once concurrency support landed it was no longer wired up, and
its single value doesn't map to multi-GPU setups. Users can still set `num_gpu`
to limit memory usage and avoid OOM if our memory predictions are wrong.
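
As a rough illustration of the `num_gpu` route mentioned above, the sketch below builds a request body that passes `num_gpu` (the number of layers to offload to the GPU) in the `options` object of an Ollama generate request. It is a minimal sketch, not code from this commit: the `generateRequest` type and `buildRequest` helper are hypothetical names, and the model name and layer count are placeholder values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// generateRequest is a hypothetical, trimmed-down request shape used only
// for this illustration; it is not a type from the Ollama codebase.
type generateRequest struct {
	Model   string         `json:"model"`
	Prompt  string         `json:"prompt"`
	Options map[string]any `json:"options,omitempty"`
}

// buildRequest encodes a request that caps GPU offload at numGPU layers.
func buildRequest(model, prompt string, numGPU int) ([]byte, error) {
	return json.Marshal(generateRequest{
		Model:   model,
		Prompt:  prompt,
		Options: map[string]any{"num_gpu": numGPU},
	})
}

func main() {
	// "llama3" and 20 are placeholder values chosen for the example.
	body, err := buildRequest("llama3", "Why is the sky blue?", 20)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Lowering the value keeps more layers on the CPU, which trades speed for a smaller GPU memory footprint.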

14 Jun, 2024 3 commits

refined test timing · 68dfc623

Daniel Hiltgen authored May 31, 2024

Adjust timing on some tests so they don't time out on small/slow GPUs.

Improve multi-gpu handling at the limit · 6fd04ca9

Daniel Hiltgen authored May 18, 2024

Still not complete: our prediction needs some refinement to understand each
discrete GPU's available space so we can see how many layers fit on each one.
Since we can't split one layer across multiple GPUs, we can't treat free space
as one logical block.
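
The constraint described above can be shown with a small calculation. This is a minimal sketch under simplifying assumptions (every layer has the same size, and only free memory matters); `layersPerGPU` and `total` are hypothetical helpers, not functions from this commit.

```go
package main

import "fmt"

// layersPerGPU returns, for each GPU, how many whole layers of layerSize
// bytes fit in that GPU's free memory. A layer is never split across GPUs.
func layersPerGPU(freeBytes []uint64, layerSize uint64) []uint64 {
	counts := make([]uint64, len(freeBytes))
	if layerSize == 0 {
		return counts
	}
	for i, free := range freeBytes {
		counts[i] = free / layerSize
	}
	return counts
}

// total sums a slice of counts or byte sizes.
func total(xs []uint64) uint64 {
	var s uint64
	for _, x := range xs {
		s += x
	}
	return s
}

func main() {
	free := []uint64{5 << 30, 5 << 30} // two GPUs, 5 GiB free each (made-up numbers)
	layer := uint64(3 << 30)           // 3 GiB per layer (made-up number)

	perGPU := layersPerGPU(free, layer)
	pooled := total(free) / layer // naive estimate treating free space as one block

	fmt.Println(perGPU, total(perGPU), pooled) // prints: [1 1] 2 3
}
```

Treating the 10 GiB as one block suggests three layers fit, but only one fits on each GPU, so the usable total is two.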

Fix concurrency integration test to work locally · 206797bd

Daniel Hiltgen authored May 23, 2024

This worked remotely but wound up trying to spawn multiple servers
locally, which doesn't work.

10 May, 2024 1 commit

Integration fixes · 074dc3b9

Daniel Hiltgen authored May 10, 2024

23 Apr, 2024 1 commit

Request and model concurrency · 34b9db5a

Daniel Hiltgen authored Mar 30, 2024

This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The defaults are
currently 1 concurrent request per model and only 1 loaded model at a
time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and
OLLAMA_MAX_LOADED_MODELS.
