- 22 Jul, 2024 (8 commits)
-
Michael Yang authored 7 commits
-
Daniel Hiltgen authored
The OLLAMA_MAX_VRAM env var was a temporary workaround for OOM scenarios. With concurrency it was no longer wired up, and a single simplistic value doesn't map well to multi-GPU setups. Users can still set `num_gpu` to limit memory usage and avoid OOM if our predictions are wrong.
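As a minimal sketch (not part of the commit), a request can cap GPU offload through the `num_gpu` option on the generate API; the model name and layer count below are placeholders:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Illustrative only: cap GPU offload at 20 layers via the num_gpu
	// option instead of the removed OLLAMA_MAX_VRAM variable.
	body, _ := json.Marshal(map[string]any{
		"model":  "llama3", // placeholder model name
		"prompt": "Why is the sky blue?",
		"stream": false,
		"options": map[string]any{
			"num_gpu": 20, // layers to offload; tune for your VRAM
		},
	})
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```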
-
- 03 Jul, 2024 (2 commits)
-
Anatoli Babenia authored
Co-authored-by: Anatoli Babenia <anatoli@rainforce.org>
Co-authored-by: Maas Lalani <maas@lalani.dev>
-
Daniel Hiltgen authored
This change fixes the handling of keep_alive so that if a client request omits the setting, it is only applied on the initial load. Once the model is loaded, requests that leave keep_alive unset keep whatever value is already in effect.
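A minimal sketch of the behavior (the model name and durations are placeholders, and the helper below is not from the commit): the first request pins keep_alive, later requests omit it and inherit the loaded value.

```go
package main

import (
	"bytes"
	"net/http"
)

// post sends a raw JSON payload to the local generate endpoint; it is a
// throwaway helper for this sketch and ignores the response body.
func post(payload string) {
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewBufferString(payload))
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
}

func main() {
	// First request loads the model and sets keep_alive to 30 minutes.
	post(`{"model": "llama3", "prompt": "hello", "stream": false, "keep_alive": "30m"}`)

	// Later requests omit keep_alive; with this fix the already-loaded model
	// keeps the 30m value instead of falling back to the server default.
	post(`{"model": "llama3", "prompt": "hello again", "stream": false}`)
}
```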
-
- 01 Jul, 2024 (1 commit)
-
Daniel Hiltgen authored
This may confuse users into thinking "auto" is an acceptable string; the value must be numeric.
-
- 21 Jun, 2024 (2 commits)
-
Daniel Hiltgen authored
Until ROCm v6.2 ships, we won't be able to get accurate free-memory reporting on Windows, which makes automatic concurrency too risky. Users can still opt in, but they will need to pay attention to model sizes; otherwise they may thrash/page VRAM or cause OOM crashes. All other platforms and GPUs now have accurate VRAM reporting wired up, so we can turn on concurrency by default.
-
Daniel Hiltgen authored
This adjusts our default settings to enable multiple loaded models and parallel requests to a single model. Users can still override these with the same env var settings as before. Parallelism has a direct impact on num_ctx, which in turn can have a significant impact on GPUs with small VRAM, so this change also refines the algorithm: when parallelism is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
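A rough sketch of why parallelism touches context size: the KV cache has to cover num_ctx tokens for every parallel slot, so an unset parallel value is best derived from available VRAM. The numbers and helper below are illustrative, not the scheduler's actual sizing code.

```go
package main

import "fmt"

// effectiveCtx illustrates how parallel slots multiply the context the
// KV cache must hold; it is not the actual scheduler computation.
func effectiveCtx(numCtx, numParallel int) int {
	return numCtx * numParallel
}

func main() {
	// With num_ctx=2048 and 4 parallel slots the runner has to allocate a
	// KV cache covering 8192 tokens, which may not fit a small-VRAM GPU.
	fmt.Println(effectiveCtx(2048, 4))
}
```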
-
- 19 Jun, 2024 (2 commits)
-
Daniel Hiltgen authored
This reverts commit 755b4e4f.
-
- 17 Jun, 2024 (1 commit)
-
Jeffrey Morgan authored
* gpu: add env var for detecting Intel oneAPI GPUs
* fix build error
-
- 14 Jun, 2024 (2 commits)
-
Daniel Hiltgen authored
This should aid troubleshooting by capturing the GPU settings at startup and reporting them in the logs along with all the other server settings.
-
Daniel Hiltgen authored
Our default behavior today is to try to fit the model into a single GPU if possible. Some users would prefer the old behavior of always spreading across multiple GPUs even when the model could fit into one. This exposes that behavior as a tunable.
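The commit message does not name the tunable; assuming it is the OLLAMA_SCHED_SPREAD environment variable, a tiny sketch of opting back into the spread behavior before launching the server might look like this:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Assumption: OLLAMA_SCHED_SPREAD is the tunable exposed here; set it
	// before launching the server to restore the old behavior of spreading
	// a model across all available GPUs even when one GPU would suffice.
	os.Setenv("OLLAMA_SCHED_SPREAD", "1")
	fmt.Println("OLLAMA_SCHED_SPREAD =", os.Getenv("OLLAMA_SCHED_SPREAD"))
}
```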
-
- 13 Jun, 2024 (1 commit)
-
Patrick Devine authored
-
- 12 Jun, 2024 (1 commit)
-
Patrick Devine authored
-
- 06 Jun, 2024 (1 commit)
-
royjhan authored
* API app/browser access
* Add tauri (resolves #2291, #4791, #3799, #4388)
-
- 04 Jun, 2024 (2 commits)
-
Michael Yang authored 2 commits
-
- 30 May, 2024 (1 commit)
-
Lei Jitang authored
* envconfig/config.go: Fix wrong description of OLLAMA_LLM_LIBRARY
* serve: Add more environment variables to the help message of `ollama serve --help` so users know what can be configured.
Signed-off-by: Lei Jitang <leijitang@outlook.com>
-
- 24 May, 2024 (1 commit)
-
Patrick Devine authored
-
- 23 May, 2024 (1 commit)
-
Jeffrey Morgan authored
* put flash attention behind a flag for now
* add test
* remove print
* up timeout for scheduler tests
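The flag in question is the OLLAMA_FLASH_ATTENTION environment variable; the snippet below is a minimal sketch of reading such an opt-in flag, not the server's actual parsing code:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	// Flash attention stays off unless OLLAMA_FLASH_ATTENTION parses as a
	// boolean true ("1", "true", ...); unset or invalid values disable it.
	on, err := strconv.ParseBool(os.Getenv("OLLAMA_FLASH_ATTENTION"))
	if err != nil {
		on = false
	}
	fmt.Println("flash attention enabled:", on)
}
```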
-
- 05 May, 2024 (1 commit)
-
Daniel Hiltgen authored
This moves all the env var reading into one central module and logs the loaded config once at startup, which should help when troubleshooting user server logs.
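A condensed sketch of the pattern described, reading a few OLLAMA_* variables in one place and logging the result once at startup; the struct fields and defaults are illustrative, not the actual envconfig module:

```go
package main

import (
	"log/slog"
	"os"
	"strconv"
)

// Config collects the environment-driven server settings in one place.
// The fields and defaults are illustrative only.
type Config struct {
	Debug       bool
	NumParallel int
	LLMLibrary  string
}

// Load reads the relevant OLLAMA_* variables exactly once.
func Load() Config {
	c := Config{NumParallel: 1}
	if v, err := strconv.ParseBool(os.Getenv("OLLAMA_DEBUG")); err == nil {
		c.Debug = v
	}
	if n, err := strconv.Atoi(os.Getenv("OLLAMA_NUM_PARALLEL")); err == nil {
		c.NumParallel = n
	}
	c.LLMLibrary = os.Getenv("OLLAMA_LLM_LIBRARY")
	return c
}

func main() {
	cfg := Load()
	// Log the effective configuration once so user-supplied server logs show
	// the settings that were actually applied.
	slog.Info("server config", "debug", cfg.Debug, "num_parallel", cfg.NumParallel, "llm_library", cfg.LLMLibrary)
}
```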
-