- 24 May, 2024 1 commit
-
-
Patrick Devine authored
-
- 21 May, 2024 1 commit
-
-
Sang Park authored
Corrected the misspelling "requeset" to "request" in the error log message.
-
- 14 May, 2024 2 commits
-
-
Daniel Hiltgen authored
The APIs we query are optimistic about free space, and Windows pages VRAM, so we don't have to wait to see reported usage recover on unload.
-
Patrick Devine authored
-
- 10 May, 2024 2 commits
-
-
Daniel Hiltgen authored
Make sure the first GPU has the most free space
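The ordering described here can be illustrated with a minimal sketch. The `gpuInfo` type and `sortByFreeMemory` function are hypothetical names for illustration and are not taken from the commit itself.

```go
package main

import (
	"fmt"
	"sort"
)

// gpuInfo is a hypothetical stand-in for the scheduler's per-GPU record.
type gpuInfo struct {
	ID         string
	FreeMemory uint64 // bytes reported as free
}

// sortByFreeMemory reorders gpus in place so that the GPU with the most
// free memory comes first. The sort is stable, so ties keep their order.
func sortByFreeMemory(gpus []gpuInfo) {
	sort.SliceStable(gpus, func(i, j int) bool {
		return gpus[i].FreeMemory > gpus[j].FreeMemory
	})
}

func main() {
	gpus := []gpuInfo{
		{ID: "gpu0", FreeMemory: 2 << 30},
		{ID: "gpu1", FreeMemory: 8 << 30},
		{ID: "gpu2", FreeMemory: 4 << 30},
	}
	sortByFreeMemory(gpus)
	fmt.Println(gpus[0].ID) // prints "gpu1"
}
```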
-
Jeffrey Morgan authored
* don't clamp ctx size in `PredictServerFit`
* minimum 4 context
* remove context warning
-
- 09 May, 2024 1 commit
-
-
Daniel Hiltgen authored
The GPU drivers take a while to update their free memory reporting, so before starting another runner we need to wait until the reported values converge with what we expect. Otherwise we do not get an accurate picture of free memory.
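A wait of this kind might look like the polling loop below. This is a sketch under assumptions: `waitForRecovery`, its parameters, and the simulated readings are hypothetical, and the actual commit may use different thresholds and timing.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForRecovery polls readFree until it reports at least want bytes free,
// or returns an error once timeout has elapsed. All names are hypothetical.
func waitForRecovery(readFree func() uint64, want uint64, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if readFree() >= want {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for free memory to recover")
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulated driver that reports stale values for the first two polls.
	readings := []uint64{1000, 1000, 4000}
	i := 0
	readFree := func() uint64 {
		v := readings[i]
		if i < len(readings)-1 {
			i++
		}
		return v
	}
	err := waitForRecovery(readFree, 3500, 10*time.Millisecond, time.Second)
	fmt.Println("error:", err)
}
```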
-
- 06 May, 2024 2 commits
-
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
- 05 May, 2024 3 commits
-
-
Daniel Hiltgen authored
This moves all the env var reading into one central module and logs the loaded config once at startup, which should help in troubleshooting user server logs.
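As an illustration of the pattern, a central config loader could be sketched as follows. The variable names `OLLAMA_NUM_PARALLEL` and `OLLAMA_MAX_LOADED_MODELS` and their defaults of 1 come from the 23 Apr entry in this log; the `config` type, the helper functions, and the fallback on invalid values are assumptions.

```go
package main

import (
	"log"
	"os"
	"strconv"
)

// config holds settings read from the environment (hypothetical type).
type config struct {
	NumParallel     int
	MaxLoadedModels int
}

// intFromEnv returns the positive integer found under key, or def when the
// variable is unset or invalid. Falling back on invalid input is an assumption.
func intFromEnv(lookup func(string) (string, bool), key string, def int) int {
	raw, ok := lookup(key)
	if !ok {
		return def
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		log.Printf("invalid %s=%q, using default %d", key, raw, def)
		return def
	}
	return n
}

// loadConfig reads every setting in one place so it can be logged once.
func loadConfig(lookup func(string) (string, bool)) config {
	return config{
		NumParallel:     intFromEnv(lookup, "OLLAMA_NUM_PARALLEL", 1),
		MaxLoadedModels: intFromEnv(lookup, "OLLAMA_MAX_LOADED_MODELS", 1),
	}
}

func main() {
	cfg := loadConfig(os.LookupEnv)
	log.Printf("loaded config: %+v", cfg) // logged once at startup
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the loader testable without mutating the process environment.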
-
Jeffrey Morgan authored
-
Daniel Hiltgen authored
This also raises the default from 10 to 50 queued requests.
-
- 01 May, 2024 2 commits
-
-
Mark Ward authored
Log when waiting for the process to stop, to help debug cases where other tasks execute during this wait. When the expire timer fires, clear the timer reference because it will not be reused. Close will clean up expireTimer if the calling code has not already done so.
-
Mark Ward authored
Fix runner expiring during active use. Clear the expire timer when the runner is used, and allow finish to assign an expire timer so that the runner expires after a period of no use.
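The timer handling described in this fix can be sketched roughly as below. The `runner` type, its fields, and the `gen` counter used to ignore stale timer callbacks are all hypothetical. Only the idea of clearing the timer on use and assigning it on finish comes from the commit message.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runner is a hypothetical stand-in for a loaded model runner.
type runner struct {
	mu          sync.Mutex
	refCount    int
	gen         uint64 // bumped on every acquire to invalidate old timers
	expireTimer *time.Timer
	expired     bool
	done        chan struct{} // closed once the runner has expired
}

func newRunner() *runner { return &runner{done: make(chan struct{})} }

// acquire marks the runner as in use and clears any pending expire timer.
func (r *runner) acquire() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refCount++
	r.gen++
	if r.expireTimer != nil {
		r.expireTimer.Stop()
		r.expireTimer = nil
	}
}

// finish releases one use. Once idle, it assigns a timer so that the
// runner expires after keepAlive without further use.
func (r *runner) finish(keepAlive time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refCount--
	if r.refCount > 0 {
		return
	}
	gen := r.gen
	r.expireTimer = time.AfterFunc(keepAlive, func() {
		r.mu.Lock()
		defer r.mu.Unlock()
		if r.refCount > 0 || r.gen != gen || r.expired {
			return // reused since this timer was set, or already expired
		}
		r.expired = true
		r.expireTimer = nil // the timer will not be reused
		close(r.done)
	})
}

func (r *runner) isExpired() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.expired
}

func main() {
	r := newRunner()
	r.acquire()
	r.finish(20 * time.Millisecond)
	r.acquire() // reuse before expiry clears the timer
	time.Sleep(50 * time.Millisecond)
	fmt.Println("expired while in use:", r.isExpired())
	r.finish(20 * time.Millisecond)
	<-r.done
	fmt.Println("expired after idle:", r.isExpired())
}
```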
-
- 28 Apr, 2024 1 commit
-
-
Daniel Hiltgen authored
Prior refactoring passes accidentally removed the logic to bypass VRAM checks for CPU loads. This adds that back, along with test coverage. This also fixes loaded map access in the unit test to be behind the mutex, which was likely the cause of various flakes in the tests.
-
- 25 Apr, 2024 2 commits
-
-
Jeffrey Morgan authored
* reload model if `num_gpu` changes
* don't reload on -1
* fix tests
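The reload rule in this commit can be expressed as a small predicate. This is a sketch: `needsReload` is a hypothetical name, and treating -1 as "no explicit preference" is an assumption drawn from the commit note.

```go
package main

import "fmt"

// needsReload reports whether a loaded model should be reloaded because the
// requested num_gpu differs from the value it was loaded with.
// A request of -1 is assumed to mean "no explicit preference" and never
// triggers a reload.
func needsReload(loadedNumGPU, requestedNumGPU int) bool {
	if requestedNumGPU == -1 {
		return false
	}
	return loadedNumGPU != requestedNumGPU
}

func main() {
	fmt.Println(needsReload(32, 32), needsReload(32, 16), needsReload(32, -1))
	// prints "false true false"
}
```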
-
Daniel Hiltgen authored
-
- 24 Apr, 2024 2 commits
-
-
Bryce Reitano authored
-
Bryce Reitano authored
-
- 23 Apr, 2024 1 commit
-
-
Daniel Hiltgen authored
This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
-