- 14 Jun, 2024 3 commits
Daniel Hiltgen authored
adjust timing on some tests so they don't time out on small/slow GPUs
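A minimal sketch of the pattern this commit describes, not the actual change: give the test context a generous deadline so slower GPUs have headroom. The package name, `TestGenerateCompletes`, `doWork`, and the 120-second figure are all hypothetical.

```go
package llm_test // hypothetical package name, for illustration only

import (
	"context"
	"testing"
	"time"
)

// TestGenerateCompletes illustrates extending a test deadline so that
// small or slow GPUs do not trip a timeout tuned for fast hardware.
func TestGenerateCompletes(t *testing.T) {
	// 120s instead of a tighter default, leaving headroom for slow devices.
	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
	defer cancel()

	select {
	case <-doWork(ctx): // doWork stands in for the operation under test
	case <-ctx.Done():
		t.Fatal("test timed out; consider raising the deadline for slow GPUs")
	}
}

func doWork(ctx context.Context) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		// placeholder for the real work being timed
		close(done)
	}()
	return done
}
```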
Daniel Hiltgen authored
Still not complete; our prediction needs refinement to understand each discrete GPU's available space so we can see how many layers fit on each one. Since a single layer can't be split across multiple GPUs, we can't treat the combined free space as one logical block.
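A rough sketch of the constraint described above, not the project's actual scheduler: whole layers are placed greedily per GPU, so a layer only counts as fitting if some single device has room for it. The layer sizes and free-memory figures below are hypothetical.

```go
package main

import "fmt"

// fitLayers greedily assigns whole layers to GPUs. Because a layer cannot
// be split across devices, each layer must fit entirely within one GPU's
// remaining free memory; summing free space across all GPUs would
// overestimate how many layers actually fit.
func fitLayers(layerBytes []uint64, gpuFree []uint64) (perGPU []int, fitted int) {
	perGPU = make([]int, len(gpuFree))
	free := append([]uint64(nil), gpuFree...) // copy so the input is untouched
	gpu := 0
	for _, size := range layerBytes {
		// advance to the next GPU with enough remaining space
		for gpu < len(free) && free[gpu] < size {
			gpu++
		}
		if gpu == len(free) {
			break // remaining layers must stay on the CPU
		}
		free[gpu] -= size
		perGPU[gpu]++
		fitted++
	}
	return perGPU, fitted
}

func main() {
	// two 8 GiB GPUs, 1 GiB per layer: 16 layers fit, but only 8 per device
	layers := make([]uint64, 20)
	for i := range layers {
		layers[i] = 1 << 30
	}
	free := []uint64{8 << 30, 8 << 30}
	perGPU, fitted := fitLayers(layers, free)
	fmt.Println(perGPU, fitted) // [8 8] 16
}
```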
Daniel Hiltgen authored
This worked remotely but wound up trying to spawn multiple servers locally, which doesn't work.
- 10 May, 2024 1 commit
Daniel Hiltgen authored
- 23 Apr, 2024 1 commit
Daniel Hiltgen authored
This change adds support for multiple concurrent requests, as well as loading multiple models, by spawning multiple runners. The defaults are currently 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
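A minimal sketch, not Ollama's actual implementation, of how a server might read these two knobs, with defaults matching those described above; the `envInt` helper is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envInt reads a positive integer from an environment variable, falling
// back to def when the variable is unset or unparsable.
func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return def
}

func main() {
	// Defaults per the commit message: 1 parallel request per model,
	// 1 loaded model at a time.
	numParallel := envInt("OLLAMA_NUM_PARALLEL", 1)
	maxLoaded := envInt("OLLAMA_MAX_LOADED_MODELS", 1)
	fmt.Printf("parallel=%d maxLoadedModels=%d\n", numParallel, maxLoaded)
}
```

In practice one would set the variables in the server's environment, e.g. `OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 ollama serve` to raise both limits.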