1. 14 Jun, 2024 1 commit
    • Improve multi-gpu handling at the limit · 6fd04ca9
      Daniel Hiltgen authored
      Still not complete; our memory prediction needs some refinement to
      understand each discrete GPU's available space so we can see how
      many layers fit in each one. Since we can't split a single layer
      across multiple GPUs, we can't treat their combined free space as
      one logical block. (A sketch of the per-GPU fitting idea follows
      this entry.)
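      A minimal sketch of that idea in Go. The gpuInfo type and fitLayers
      helper are illustrative names, not Ollama's actual code; the point
      is that each GPU's free memory must be divided up independently,
      because a layer has to live wholly on one device:

        package main

        import "fmt"

        // gpuInfo is a hypothetical stand-in for a discrete GPU's free memory.
        type gpuInfo struct {
                ID        int
                FreeBytes uint64
        }

        // fitLayers greedily assigns whole layers to each GPU in turn, given a
        // uniform per-layer size estimate. Leftover space on a GPU smaller than
        // one layer is simply wasted, which is why the GPUs' free space cannot
        // be treated as one logical block.
        func fitLayers(gpus []gpuInfo, layerBytes uint64, totalLayers int) map[int]int {
                placed := map[int]int{}
                remaining := totalLayers
                for _, g := range gpus {
                        if remaining == 0 {
                                break
                        }
                        n := int(g.FreeBytes / layerBytes) // whole layers only
                        if n > remaining {
                                n = remaining
                        }
                        placed[g.ID] = n
                        remaining -= n
                }
                return placed
        }

        func main() {
                gpus := []gpuInfo{{ID: 0, FreeBytes: 10 << 30}, {ID: 1, FreeBytes: 6 << 30}}
                // 20 layers of ~1 GiB each: 10 fit on GPU 0, 6 on GPU 1, and the
                // remaining 4 must be offloaded elsewhere (e.g. to the CPU).
                fmt.Println(fitLayers(gpus, 1<<30, 20))
        }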
  2. 04 Jun, 2024 2 commits
  3. 24 May, 2024 1 commit
  4. 13 May, 2024 2 commits
  5. 10 May, 2024 1 commit
  6. 08 May, 2024 1 commit
  7. 07 May, 2024 1 commit
  8. 05 May, 2024 1 commit
    • Centralize server config handling · f56aa200
      Daniel Hiltgen authored
      This moves all of the env var reading into one central module and
      logs the loaded config once at startup, which should help when
      troubleshooting user server logs. (A sketch of the pattern follows
      this entry.)
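      A minimal sketch of the pattern, with hypothetical names rather
      than the repo's actual module: every env var is read in one place,
      defaults are applied, and the effective config is logged exactly
      once at startup:

        package envconfig

        import (
                "log/slog"
                "os"
                "sync"
        )

        // Config collects every server setting read from the environment.
        type Config struct {
                Host        string
                NumParallel string
        }

        var (
                once sync.Once
                cfg  Config
        )

        // Load reads the env vars exactly once and logs the loaded values, so a
        // user's server log always shows the effective configuration up front.
        func Load() Config {
                once.Do(func() {
                        cfg = Config{
                                Host:        getenv("OLLAMA_HOST", "127.0.0.1:11434"),
                                NumParallel: getenv("OLLAMA_NUM_PARALLEL", "1"),
                        }
                        slog.Info("server config",
                                "OLLAMA_HOST", cfg.Host,
                                "OLLAMA_NUM_PARALLEL", cfg.NumParallel)
                })
                return cfg
        }

        func getenv(key, def string) string {
                if v := os.Getenv(key); v != "" {
                        return v
                }
                return def
        }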
  9. 01 May, 2024 1 commit
  10. 26 Apr, 2024 1 commit
  11. 25 Apr, 2024 1 commit
  12. 24 Apr, 2024 1 commit
    • Add back memory escape valve · 5445aaa9
      Daniel Hiltgen authored
      If we get our predictions wrong, this can be used to set a lower
      memory limit as a workaround. Recent multi-GPU refactoring
      accidentally removed it, so this adds it back. (A sketch of the
      override follows this entry.)
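      A sketch of how such an escape valve can work, in Go. The
      OLLAMA_MAX_VRAM name and the clamping semantics are assumptions
      for illustration, not a verified reading of the commit:

        package main

        import (
                "log/slog"
                "os"
                "strconv"
        )

        // effectiveVRAM clamps the predicted free VRAM by a user-supplied
        // override, so a wrong prediction can be worked around manually.
        // OLLAMA_MAX_VRAM is an assumed variable name for illustration.
        func effectiveVRAM(predicted uint64) uint64 {
                v := os.Getenv("OLLAMA_MAX_VRAM")
                if v == "" {
                        return predicted
                }
                max, err := strconv.ParseUint(v, 10, 64)
                if err != nil || max >= predicted {
                        return predicted
                }
                slog.Info("applying user memory limit", "max_vram", max)
                return max
        }

        func main() {
                os.Setenv("OLLAMA_MAX_VRAM", "4294967296") // cap at 4 GiB
                slog.Info("effective vram", "bytes", effectiveVRAM(8<<30))
        }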
  13. 23 Apr, 2024 1 commit
    • Request and model concurrency · 34b9db5a
      Daniel Hiltgen authored
      This change adds support for multiple concurrent requests, as well
      as loading multiple models by spawning multiple runners. The
      defaults are currently 1 concurrent request per model and only 1
      loaded model at a time, but these can be adjusted by setting
      OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS. (A sketch of the
      per-model limit follows this entry.)
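      A minimal sketch of the per-model request limit, using a buffered
      channel as a semaphore. The env var names come from the commit; the
      runner structure here is illustrative, not Ollama's actual
      scheduler:

        package main

        import (
                "fmt"
                "sync"
        )

        // runner owns one loaded model; slots caps in-flight requests, which
        // corresponds to the OLLAMA_NUM_PARALLEL setting in the commit.
        type runner struct {
                slots chan struct{}
        }

        func newRunner(numParallel int) *runner {
                return &runner{slots: make(chan struct{}, numParallel)}
        }

        // handle blocks until one of the runner's parallel slots is free, so at
        // most numParallel requests run against this model at once.
        func (r *runner) handle(id int) {
                r.slots <- struct{}{}        // acquire a slot
                defer func() { <-r.slots }() // release it when done
                fmt.Println("serving request", id)
        }

        func main() {
                r := newRunner(1) // default: 1 concurrent request per model
                var wg sync.WaitGroup
                for i := 0; i < 3; i++ {
                        wg.Add(1)
                        go func(i int) {
                                defer wg.Done()
                                r.handle(i)
                        }(i)
                }
                wg.Wait()
        }

      With the default of 1 slot, the three goroutines above are served
      one at a time; raising OLLAMA_NUM_PARALLEL would widen each
      runner's slot count, while OLLAMA_MAX_LOADED_MODELS bounds how many
      such runners exist at once.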