- 28 Apr, 2024 (1 commit)
Daniel Hiltgen authored
Prior refactoring passes accidentally removed the logic that bypasses VRAM checks for CPU loads. This adds it back, along with test coverage. It also moves the loaded map access in the unit test behind the mutex; the unguarded access was likely the cause of various flakes in the tests.
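
A minimal Go sketch of the two fixes described above, using hypothetical names rather than the actual scheduler types: CPU loads skip the VRAM fit check entirely, and the test reads the loaded map only while holding the same mutex the scheduler uses.

```go
package scheduler

import "sync"

// llmRunner stands in for a loaded model runner (hypothetical type).
type llmRunner struct{ model string }

type Scheduler struct {
	loadedMu sync.Mutex
	loaded   map[string]*llmRunner
}

// fitsInVRAM reports whether a load should be gated on available VRAM.
// CPU loads bypass the check entirely, which is the behavior restored here.
func fitsInVRAM(library string, required, available uint64) bool {
	if library == "cpu" {
		return true // no GPU memory involved, nothing to check
	}
	return required <= available
}

// loadedCount is how a test should read the map: behind the mutex,
// so it does not race with the scheduler goroutine mutating it.
func (s *Scheduler) loadedCount() int {
	s.loadedMu.Lock()
	defer s.loadedMu.Unlock()
	return len(s.loaded)
}
```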
- 25 Apr, 2024 (1 commit)
Jeffrey Morgan authored
* reload model if `num_gpu` changes
* don't reload on -1
* fix tests
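
A hedged sketch of the reload decision in Go, with illustrative names (`needsReload` and its parameters are not the real option-handling code): a request only forces a restart when it sets an explicit `num_gpu` that differs from the loaded value, and -1 (auto) never does.

```go
package server

// needsReload reports whether a request's num_gpu option requires
// restarting the runner. A value of -1 means "auto", so it never
// forces a reload on its own.
func needsReload(loadedNumGPU, requestedNumGPU int) bool {
	if requestedNumGPU == -1 {
		return false // auto: keep whatever is already loaded
	}
	return requestedNumGPU != loadedNumGPU
}
```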
- 24 Apr, 2024 (3 commits)
Bryce Reitano authored
Bryce Reitano authored
Bryce Reitano authored
- 23 Apr, 2024 (2 commits)
Daniel Hiltgen authored
Give the goroutine a moment to deliver the expired event
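
The kind of test change this suggests, sketched with hypothetical names: rather than asserting immediately, the test waits briefly for the background goroutine to deliver the expired event on its channel before failing.

```go
package scheduler

import (
	"testing"
	"time"
)

// waitForExpired is an illustrative test helper, not the real test code:
// it gives the background goroutine a short window to deliver the
// expired-runner event before the test gives up.
func waitForExpired(t *testing.T, expired <-chan string) string {
	t.Helper()
	select {
	case name := <-expired:
		return name
	case <-time.After(250 * time.Millisecond):
		t.Fatal("timed out waiting for expired event")
		return ""
	}
}
```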
Daniel Hiltgen authored
This change adds support for multiple concurrent requests, as well as for loading multiple models at once by spawning multiple runners. The defaults are currently 1 concurrent request per model and only 1 loaded model at a time, but both can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
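
A rough sketch of how such settings could be read, assuming the defaults stated above; the helper and package names here are illustrative, not ollama's actual envconfig code.

```go
package envconfig

import (
	"os"
	"strconv"
)

// intFromEnv reads a positive integer from the environment and falls
// back to a default when the variable is unset or invalid.
func intFromEnv(key string, fallback int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return fallback
}

// Defaults follow the commit message: 1 parallel request per model and
// 1 loaded model at a time, overridable via the environment.
var (
	NumParallel     = intFromEnv("OLLAMA_NUM_PARALLEL", 1)
	MaxLoadedModels = intFromEnv("OLLAMA_MAX_LOADED_MODELS", 1)
)
```

With that in place, starting the server with something like `OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 ollama serve` would raise both limits.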