- 29 May, 2024 (1 commit)

Michael Yang authored
- 28 May, 2024 (2 commits)

Daniel Hiltgen authored:
On some systems, 1 minute isn't sufficient to finish the load after it hits 100%. This creates 2 distinct timers; both are set to the same value for now so the timeouts can be refined further.
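A minimal sketch of the two-timer idea, not the actual change: one timer bounds the overall load, and a second, separate timer starts once progress reports 100%. The names, channels, and durations here are illustrative assumptions.

```go
package load

import (
	"errors"
	"time"
)

var errLoadTimeout = errors.New("timed out waiting for runner to finish loading")

// timerC returns a timer's channel, or nil (which blocks forever in a
// select) when the timer hasn't been armed yet.
func timerC(t *time.Timer) <-chan time.Time {
	if t == nil {
		return nil
	}
	return t.C
}

func waitForLoad(progress <-chan float32, ready <-chan struct{}) error {
	loadTimer := time.NewTimer(5 * time.Minute) // overall load budget
	defer loadTimer.Stop()
	var finalTimer *time.Timer // armed only after progress reaches 100%

	for {
		select {
		case p := <-progress:
			if p >= 1.0 && finalTimer == nil {
				// Distinct timeout for the stretch after 100%.
				finalTimer = time.NewTimer(5 * time.Minute)
				defer finalTimer.Stop()
			}
		case <-ready:
			return nil
		case <-loadTimer.C:
			return errLoadTimeout
		case <-timerC(finalTimer):
			return errLoadTimeout
		}
	}
}
```

Keeping the two timers separate, even with identical values, means each timeout can later be tuned independently without restructuring the wait loop.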
Lei Jitang authored:
Signed-off-by: Lei Jitang <leijitang@outlook.com>
- 25 May, 2024 (1 commit)

Daniel Hiltgen authored:
If the client closes the connection before we finish loading the model, we abort, so let's make the log message clearer to help users understand this failure mode.
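A sketch of that failure mode, with illustrative names: the request context is canceled when the client disconnects, and the wait loop logs an explicit reason instead of a generic error.

```go
package load

import (
	"context"
	"log/slog"
	"time"
)

// waitUntilLoaded polls until the model is loaded, but gives up with a
// clear message when the request context is canceled, i.e. the client
// closed the connection mid-load.
func waitUntilLoaded(ctx context.Context, loaded func() bool) error {
	for !loaded() {
		select {
		case <-ctx.Done():
			slog.Info("aborting model load: client closed the connection before loading completed")
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
			// keep waiting and re-check load state
		}
	}
	return nil
}
```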
- 24 May, 2024 (1 commit)

Patrick Devine authored
- 23 May, 2024 (2 commits)

Daniel Hiltgen authored:
This doesn't expose a UX yet, but it wires up the initial server portion of progress reporting during load.
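One way the server portion might look, as a hedged sketch rather than the repo's actual wiring: the loader accepts a callback and reports a fraction after each unit of work, leaving the client-facing display for later.

```go
package load

import "fmt"

// loadWithProgress runs the load in steps and reports a fraction in
// [0, 1] after each one; a server can forward this to clients later.
func loadWithProgress(report func(fraction float32)) {
	const steps = 10
	for i := 1; i <= steps; i++ {
		// ... perform one unit of load work ...
		report(float32(i) / steps)
	}
}

func example() {
	loadWithProgress(func(f float32) {
		fmt.Printf("load progress: %.0f%%\n", f*100)
	})
}
```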
Jeffrey Morgan authored:
* put flash attention behind a flag for now
* add test
* remove print
* increase timeout for scheduler tests
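Gating a feature like this usually comes down to one opt-in check. A minimal sketch, assuming an `OLLAMA_FLASH_ATTENTION` environment variable; treat the variable name and parsing as illustrative rather than the commit's exact mechanism.

```go
package envconfig

import (
	"os"
	"strconv"
)

// FlashAttention reports whether the user explicitly opted in.
func FlashAttention() bool {
	v, err := strconv.ParseBool(os.Getenv("OLLAMA_FLASH_ATTENTION"))
	return err == nil && v
}
```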
- 20 May, 2024 (1 commit)

Sam authored:
* feat: enable flash attention if supported
* feat: add flash_attn support
- 15 May, 2024 (2 commits)

Patrick Devine authored

Daniel Hiltgen authored:
Only dump the env vars we care about in the logs.
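The usual pattern here is an allowlist: log only known-relevant variables instead of the whole process environment. A sketch with an illustrative allowlist (the exact set of variables is an assumption):

```go
package envconfig

import (
	"log/slog"
	"os"
)

// logRelevantEnv logs an allowlist of variables rather than dumping
// the entire environment, which can leak unrelated secrets.
func logRelevantEnv() {
	for _, k := range []string{
		"OLLAMA_HOST",
		"OLLAMA_NUM_PARALLEL",
		"OLLAMA_MAX_LOADED_MODELS",
	} {
		if v, ok := os.LookupEnv(k); ok {
			slog.Info("env", k, v)
		}
	}
}
```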
- 14 May, 2024 (1 commit)

Patrick Devine authored
- 11 May, 2024 (1 commit)
- 10 May, 2024 (2 commits)

Daniel Hiltgen authored

Jeffrey Morgan authored:
* don't clamp ctx size in `PredictServerFit`
* enforce a minimum context of 4
* remove context warning
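The minimum-context rule is a one-sided clamp: leave the requested size alone unless it falls below the floor. A sketch with illustrative names:

```go
package llm

// effectiveNumCtx leaves the requested context size unchanged except
// to enforce the minimum of 4.
func effectiveNumCtx(requested int) int {
	if requested < 4 {
		return 4
	}
	return requested
}
```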
- 09 May, 2024 (5 commits)

Michael Yang authored

Michael Yang authored

Michael Yang authored

Bruce MacDonald authored

Daniel Hiltgen authored
- 08 May, 2024 (1 commit)

Daniel Hiltgen authored:
This records more GPU usage information for eventual inclusion in the UX.
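A hedged sketch of the kind of per-GPU record this implies; the struct and fields are illustrative assumptions, not Ollama's actual types.

```go
package gpu

import "log/slog"

// usage is the sort of per-GPU record one might keep for later display.
type usage struct {
	ID          string
	TotalMemory uint64 // bytes
	FreeMemory  uint64 // bytes
}

func logUsage(u usage) {
	slog.Info("gpu usage", "id", u.ID,
		"total_bytes", u.TotalMemory, "free_bytes", u.FreeMemory)
}
```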
- 07 May, 2024 (1 commit)

Daniel Hiltgen authored:
This will bubble up a much more informative error message if noexec is preventing us from running the subprocess.
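Executing a binary from a noexec-mounted directory fails with EACCES, which Go surfaces as a permission error. A sketch of detecting that and attaching a hint (function and wording are illustrative):

```go
package runner

import (
	"errors"
	"fmt"
	"io/fs"
	"os/exec"
)

// start wraps a permission failure with a hint, since exec'ing from a
// noexec-mounted directory surfaces as "permission denied".
func start(path string) (*exec.Cmd, error) {
	cmd := exec.Command(path)
	if err := cmd.Start(); err != nil {
		if errors.Is(err, fs.ErrPermission) {
			return nil, fmt.Errorf("unable to start %q: %w (is the temp directory mounted noexec?)", path, err)
		}
		return nil, err
	}
	return cmd, nil
}
```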
- 06 May, 2024 (3 commits)

Daniel Hiltgen authored:
Trying to live off the land for CUDA libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly.
Jeffrey Morgan authored

Jeffrey Morgan authored:
* fix llava models not working after the first request
* individual requests only for llava models
- 05 May, 2024 (1 commit)

Daniel Hiltgen authored:
This moves all of the env var reading into one central module and logs the loaded config once at startup, which should help when troubleshooting user server logs.
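A minimal sketch of the centralize-and-log-once pattern, assuming illustrative package, field, and variable names: everything is read in one place, and `sync.Once` guarantees the config line appears exactly once near the top of the logs.

```go
package envconfig

import (
	"log/slog"
	"os"
	"sync"
)

// Config collects the env-derived settings in one place.
type Config struct {
	Host        string
	NumParallel string
}

var (
	once sync.Once
	cfg  Config
)

// Get reads the environment on first call and logs the result exactly
// once, so the effective config is easy to find when reading logs.
func Get() Config {
	once.Do(func() {
		cfg = Config{
			Host:        os.Getenv("OLLAMA_HOST"),
			NumParallel: os.Getenv("OLLAMA_NUM_PARALLEL"),
		}
		slog.Info("server config",
			"OLLAMA_HOST", cfg.Host,
			"OLLAMA_NUM_PARALLEL", cfg.NumParallel)
	})
	return cfg
}
```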
- 01 May, 2024 (4 commits)

Mark Ward authored

Mark Ward authored

Mark Ward authored:
Log while waiting for the process to stop, to help debug cases where other tasks execute during this wait. On expiry, clear the timer reference since it will not be reused. Close cleans up expireTimer if the calling code has not already done so.
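A sketch of the expire-timer lifecycle described above; the type and field names are illustrative, not the repo's actual ones. The fired callback drops its own reference, and `Close` stops the timer defensively if it is still pending.

```go
package server

import (
	"sync"
	"time"
)

type runnerRef struct {
	mu          sync.Mutex
	expireTimer *time.Timer
}

func (r *runnerRef) setExpiration(d time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.expireTimer = time.AfterFunc(d, func() {
		r.mu.Lock()
		// The timer has fired and won't be reused; drop the reference.
		r.expireTimer = nil
		r.mu.Unlock()
		// ... unload the model ...
	})
}

// Close cleans up expireTimer if the calling code hasn't already.
func (r *runnerRef) Close() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.expireTimer != nil {
		r.expireTimer.Stop()
		r.expireTimer = nil
	}
}
```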
Mark Ward authored
- 29 Apr, 2024 (1 commit)

Jeffrey Morgan authored
- 26 Apr, 2024 (1 commit)

Jeffrey Morgan authored
- 25 Apr, 2024 (1 commit)

Jeffrey Morgan authored:
* llm: limit generation to 10x the context size to avoid run-on generations
* add comment
* simplify condition statement
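The guard itself is a single comparison in the generation loop. A sketch with illustrative names:

```go
package llm

// shouldStop caps generation at ten times the context size so a model
// that never emits a stop token can't run on forever.
func shouldStop(generatedTokens, numCtx int) bool {
	return generatedTokens >= 10*numCtx
}
```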
- 23 Apr, 2024 (4 commits)

Daniel Hiltgen authored:
Tmp cleaners can nuke the runner file out from underneath us. This detects the missing runner and re-initializes the payloads.
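A sketch of the recovery check, with illustrative names (`ensure` and the injected `extractPayloads` are hypothetical): stat the runner path before use, and re-extract when a tmp cleaner has removed it.

```go
package runner

import (
	"errors"
	"io/fs"
	"log/slog"
	"os"
)

// ensure re-extracts the payloads when a tmp cleaner has removed the
// runner binary out from under us.
func ensure(path string, extractPayloads func() error) error {
	if _, err := os.Stat(path); errors.Is(err, fs.ErrNotExist) {
		slog.Warn("runner missing, re-initializing payloads", "path", path)
		return extractPayloads()
	} else if err != nil {
		return err
	}
	return nil
}
```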
Daniel Hiltgen authored:
This change adds support for multiple concurrent requests, as well as loading multiple models, by spawning multiple runners. The defaults are currently 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
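A sketch of how the two knobs named above might be read, with their defaults of 1; the helper and parsing rules are illustrative, not the repo's actual code.

```go
package envconfig

import (
	"os"
	"strconv"
)

// envInt parses a positive integer from the environment, falling back
// to the given default when unset or invalid.
func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return def
}

var (
	NumParallel     = envInt("OLLAMA_NUM_PARALLEL", 1)
	MaxLoadedModels = envInt("OLLAMA_MAX_LOADED_MODELS", 1)
)
```

For example, starting the server with `OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2` would allow four concurrent requests per model and two models resident at once.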
Daniel Hiltgen authored

Daniel Hiltgen authored
- 17 Apr, 2024 (3 commits)

Michael Yang authored

Michael Yang authored

ManniX-ITA authored
- 16 Apr, 2024 (1 commit)

Michael Yang authored