- 10 May, 2024 2 commits
Daniel Hiltgen authored
Jeffrey Morgan authored
* don't clamp ctx size in `PredictServerFit`
* minimum 4 context
* remove context warning
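A minimal sketch of the context floor described above; `effectiveNumCtx` and the example values are illustrative, not the actual `PredictServerFit` code:

```go
package main

import "fmt"

// effectiveNumCtx applies only a lower bound: requests below 4 are raised to
// 4, while larger values pass through rather than being clamped down.
func effectiveNumCtx(requested int) int {
	if requested < 4 {
		return 4
	}
	return requested
}

func main() {
	for _, n := range []int{0, 2, 2048, 131072} {
		fmt.Printf("requested=%d effective=%d\n", n, effectiveNumCtx(n))
	}
}
```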
- 09 May, 2024 5 commits
Michael Yang authored
Michael Yang authored
Michael Yang authored
Bruce MacDonald authored
Daniel Hiltgen authored
- 08 May, 2024 1 commit
Daniel Hiltgen authored
This records more GPU usage information for eventual UX inclusion.
- 07 May, 2024 1 commit
Daniel Hiltgen authored
This will bubble up a much more informative error message if noexec is preventing us from running the subprocess.
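A hedged sketch of how such an error might be surfaced; `startRunner`, the path, and the wording of the hint are assumptions for illustration, not the repository's actual code:

```go
package main

import (
	"fmt"
	"os/exec"
)

// startRunner starts the runner binary and, if that fails, wraps the error
// with a hint that a noexec mount on the payload directory is a likely cause.
func startRunner(path string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(path, args...)
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("unable to start runner %q: %w (is the directory mounted noexec? try a different temp dir)", path, err)
	}
	return cmd, nil
}

func main() {
	if _, err := startRunner("/tmp/ollama/runner"); err != nil {
		fmt.Println(err)
	}
}
```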
- 06 May, 2024 3 commits
Daniel Hiltgen authored
Trying to live off the land for CUDA libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly.
Jeffrey Morgan authored
Jeffrey Morgan authored
* fix llava models not working after first request
* individual requests only for llava models
- 05 May, 2024 1 commit
Daniel Hiltgen authored
This moves all the env var reading into one central module and logs the loaded config once at startup, which should help when troubleshooting user server logs.
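A rough sketch of the pattern that commit describes, assuming a `sync.Once`-guarded loader; the `Config` fields, defaults, and the `Load` name are illustrative (`OLLAMA_DEBUG` and `OLLAMA_HOST` are existing project variables):

```go
package main

import (
	"log/slog"
	"os"
	"sync"
)

// Config collects settings that used to be read from env vars in scattered
// places; the exact field set here is illustrative.
type Config struct {
	Debug bool
	Host  string
}

var (
	loadOnce sync.Once
	cfg      Config
)

// Load reads the environment exactly once and logs the result, so a user's
// server log always records the configuration the process started with.
func Load() Config {
	loadOnce.Do(func() {
		cfg = Config{
			Debug: os.Getenv("OLLAMA_DEBUG") != "",
			Host:  os.Getenv("OLLAMA_HOST"),
		}
		if cfg.Host == "" {
			cfg.Host = "127.0.0.1:11434"
		}
		slog.Info("loaded config", "OLLAMA_DEBUG", cfg.Debug, "OLLAMA_HOST", cfg.Host)
	})
	return cfg
}

func main() {
	_ = Load()
}
```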
- 01 May, 2024 4 commits
Mark Ward authored
Mark Ward authored
Mark Ward authored
Log while waiting for the process to stop, to help debug cases where other tasks execute during this wait. When the expire timer fires, clear the timer reference because it will not be reused. Close will clean up expireTimer if the calling code has not already done so.
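A sketch of the timer lifecycle described in that message; the `runnerRef` shape and method names are assumptions for illustration:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runnerRef models the pattern above: an expiration timer that clears its own
// reference once it fires, and a Close that stops the timer if the expiration
// path has not already run.
type runnerRef struct {
	mu          sync.Mutex
	expireTimer *time.Timer
}

func (r *runnerRef) scheduleExpiration(d time.Duration, unload func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.expireTimer = time.AfterFunc(d, func() {
		r.mu.Lock()
		r.expireTimer = nil // the timer will not be reused once it has fired
		r.mu.Unlock()
		unload()
	})
}

func (r *runnerRef) Close() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.expireTimer != nil { // only needed if expiration has not already run
		r.expireTimer.Stop()
		r.expireTimer = nil
	}
}

func main() {
	r := &runnerRef{}
	r.scheduleExpiration(10*time.Millisecond, func() { fmt.Println("unloading") })
	time.Sleep(50 * time.Millisecond)
	r.Close()
}
```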
Mark Ward authored
- 29 Apr, 2024 1 commit
Jeffrey Morgan authored
- 26 Apr, 2024 1 commit
Jeffrey Morgan authored
- 25 Apr, 2024 1 commit
Jeffrey Morgan authored
* llm: limit generation to 10x context size to avoid run-on generations
* add comment
* simplify condition statement
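As a worked example of the guard described above (the 10x factor comes from the commit message; the helper itself is hypothetical):

```go
package main

import "fmt"

// keepGenerating is a sketch of the run-on guard: stop once the number of
// generated tokens exceeds ten times the context window.
func keepGenerating(generated, numCtx int) bool {
	return generated < 10*numCtx
}

func main() {
	numCtx := 2048
	for _, n := range []int{100, 20479, 20480} {
		fmt.Printf("generated=%d continue=%v\n", n, keepGenerating(n, numCtx))
	}
}
```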
- 23 Apr, 2024 4 commits
Daniel Hiltgen authored
Tmp cleaners can nuke the file out from underneath us. This detects the missing runner and re-initializes the payloads.
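A hedged sketch of that recovery path; `ensureRunner`, the runner path, and `reextractPayloads` are stand-ins for the real payload handling:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// ensureRunner re-extracts the payloads if the runner binary has been removed
// (for example by a tmp cleaner) instead of failing the request.
func ensureRunner(path string, reextractPayloads func() error) error {
	if _, err := os.Stat(path); errors.Is(err, os.ErrNotExist) {
		fmt.Printf("runner %s missing, re-initializing payloads\n", path)
		return reextractPayloads()
	} else if err != nil {
		return err
	}
	return nil
}

func main() {
	err := ensureRunner("/tmp/ollama/runners/cpu/ollama_llama_server", func() error {
		fmt.Println("re-extracting payloads")
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```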
Daniel Hiltgen authored
This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
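A rough sketch of the two limits described above, using buffered channels as stand-in gates; the env var names and the defaults of 1 come from the commit message, everything else is illustrative:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envInt reads an integer environment variable, falling back to a default.
func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return def
}

func main() {
	numParallel := envInt("OLLAMA_NUM_PARALLEL", 1)    // concurrent requests per model
	maxLoaded := envInt("OLLAMA_MAX_LOADED_MODELS", 1) // models loaded at once

	loadedSlots := make(chan struct{}, maxLoaded)
	requestSlots := make(chan struct{}, numParallel)

	loadedSlots <- struct{}{}  // "load" one model
	requestSlots <- struct{}{} // admit one request against it
	fmt.Printf("parallel=%d max_loaded=%d in_flight=%d loaded=%d\n",
		numParallel, maxLoaded, len(requestSlots), len(loadedSlots))
}
```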
Daniel Hiltgen authored
Daniel Hiltgen authored
- 17 Apr, 2024 3 commits
Michael Yang authored
Michael Yang authored
ManniX-ITA authored
- 16 Apr, 2024 2 commits
Michael Yang authored
Michael Yang authored
- 15 Apr, 2024 1 commit
Jeffrey Morgan authored
* terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading
* use `unload` in signal handler
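A minimal sketch of that signal handling, assuming an `unload` step that terminates the runner; the structure is illustrative rather than the server's actual handler:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// Catch SIGINT/SIGTERM and run an unload step (which would terminate the
// runner subprocess) before exiting, so a model that is still loading does
// not leave an orphan process behind.
func main() {
	unload := func() { fmt.Println("unloading model / terminating subprocess") }

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

	go func() {
		<-sigs
		unload()
		os.Exit(0)
	}()

	fmt.Println("loading model... press Ctrl+C to interrupt")
	select {} // block forever; real code would continue loading here
}
```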
- 10 Apr, 2024 2 commits
Michael Yang authored
Michael Yang authored
- 09 Apr, 2024 1 commit
Daniel Hiltgen authored
During testing, we're seeing some models take over 3 minutes.
- 06 Apr, 2024 1 commit
Michael Yang authored
- 03 Apr, 2024 1 commit
Michael Yang authored
- 02 Apr, 2024 2 commits
Daniel Hiltgen authored
Michael Yang authored
- 01 Apr, 2024 1 commit
Daniel Hiltgen authored
This should resolve a number of memory leak and stability defects by allowing us to isolate llama.cpp in a separate process, shut it down when idle, and gracefully restart it if it has problems. This also serves as a first step toward running multiple copies to support multiple models concurrently.
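A hedged sketch of that supervision model; `superviseRunner`, the restart delay, and the binary path are assumptions, not the actual implementation:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// superviseRunner runs the llama.cpp server as a child process and, if it
// exits unexpectedly, restarts it after a short delay.
func superviseRunner(path string, args ...string) {
	for {
		cmd := exec.Command(path, args...)
		if err := cmd.Start(); err != nil {
			log.Printf("failed to start runner: %v", err)
			return
		}
		if err := cmd.Wait(); err != nil {
			log.Printf("runner exited with error: %v, restarting", err)
		} else {
			log.Print("runner exited cleanly")
			return
		}
		time.Sleep(time.Second)
	}
}

func main() {
	superviseRunner("/tmp/ollama/runners/cpu/ollama_llama_server", "--port", "0")
}
```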