- 10 May, 2024 2 commits
Daniel Hiltgen authored
Jeffrey Morgan authored
* don't clamp ctx size in `PredictServerFit`
* minimum 4 context
* remove context warning
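A minimal sketch of the context floor described above; `effectiveNumCtx` and the example values are illustrative, not the actual `PredictServerFit` code:

```go
package main

import "fmt"

// effectiveNumCtx applies only a lower bound: requests below 4 are raised to
// 4, while larger values pass through rather than being clamped down.
func effectiveNumCtx(requested int) int {
	if requested < 4 {
		return 4
	}
	return requested
}

func main() {
	for _, n := range []int{0, 2, 2048, 131072} {
		fmt.Printf("requested=%d effective=%d\n", n, effectiveNumCtx(n))
	}
}
```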
- 09 May, 2024 5 commits
Michael Yang authored
Michael Yang authored
Michael Yang authored
Bruce MacDonald authored
Daniel Hiltgen authored
- 08 May, 2024 1 commit
Daniel Hiltgen authored
This records more GPU usage information for eventual UX inclusion.
- 07 May, 2024 1 commit
Daniel Hiltgen authored
This will bubble up a much more informative error message if noexec is preventing us from running the subprocess.
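A hedged sketch of how such an error might be surfaced; `startRunner`, the path, and the wording of the hint are assumptions for illustration, not the repository's actual code:

```go
package main

import (
	"fmt"
	"os/exec"
)

// startRunner starts the runner binary and, if that fails, wraps the error
// with a hint that a noexec mount on the payload directory is a likely cause.
func startRunner(path string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(path, args...)
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("unable to start runner %q: %w (is the directory mounted noexec? try a different temp dir)", path, err)
	}
	return cmd, nil
}

func main() {
	if _, err := startRunner("/tmp/ollama/runner"); err != nil {
		fmt.Println(err)
	}
}
```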
- 06 May, 2024 3 commits
Daniel Hiltgen authored
Trying to live off the land for CUDA libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly.
Jeffrey Morgan authored
Jeffrey Morgan authored
* fix llava models not working after first request
* individual requests only for llava models
- 05 May, 2024 1 commit
Daniel Hiltgen authored
This moves all the env var reading into one central module and logs the loaded config once at startup, which should help when troubleshooting user server logs.
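A rough sketch of the pattern that commit describes, assuming a `sync.Once`-guarded loader; the `Config` fields, defaults, and the `Load` name are illustrative (`OLLAMA_DEBUG` and `OLLAMA_HOST` are existing project variables):

```go
package main

import (
	"log/slog"
	"os"
	"sync"
)

// Config collects settings that used to be read from env vars in scattered
// places; the exact field set here is illustrative.
type Config struct {
	Debug bool
	Host  string
}

var (
	loadOnce sync.Once
	cfg      Config
)

// Load reads the environment exactly once and logs the result, so a user's
// server log always records the configuration the process started with.
func Load() Config {
	loadOnce.Do(func() {
		cfg = Config{
			Debug: os.Getenv("OLLAMA_DEBUG") != "",
			Host:  os.Getenv("OLLAMA_HOST"),
		}
		if cfg.Host == "" {
			cfg.Host = "127.0.0.1:11434"
		}
		slog.Info("loaded config", "OLLAMA_DEBUG", cfg.Debug, "OLLAMA_HOST", cfg.Host)
	})
	return cfg
}

func main() {
	_ = Load()
}
```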
- 01 May, 2024 4 commits
Mark Ward authored
Mark Ward authored
Mark Ward authored
Log while waiting for the process to stop, to help debug cases where other tasks execute during this wait. When the expire timer fires, clear the timer reference because it will not be reused. Close will clean up expireTimer if the calling code has not already done so.
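A sketch of the timer lifecycle described in that message; the `runnerRef` shape and method names are assumptions for illustration:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runnerRef models the pattern above: an expiration timer that clears its own
// reference once it fires, and a Close that stops the timer if the expiration
// path has not already run.
type runnerRef struct {
	mu          sync.Mutex
	expireTimer *time.Timer
}

func (r *runnerRef) scheduleExpiration(d time.Duration, unload func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.expireTimer = time.AfterFunc(d, func() {
		r.mu.Lock()
		r.expireTimer = nil // the timer will not be reused once it has fired
		r.mu.Unlock()
		unload()
	})
}

func (r *runnerRef) Close() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.expireTimer != nil { // only needed if expiration has not already run
		r.expireTimer.Stop()
		r.expireTimer = nil
	}
}

func main() {
	r := &runnerRef{}
	r.scheduleExpiration(10*time.Millisecond, func() { fmt.Println("unloading") })
	time.Sleep(50 * time.Millisecond)
	r.Close()
}
```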
Mark Ward authored
- 29 Apr, 2024 1 commit
Jeffrey Morgan authored
- 26 Apr, 2024 1 commit
Jeffrey Morgan authored
- 25 Apr, 2024 1 commit
Jeffrey Morgan authored
* llm: limit generation to 10x context size to avoid run-on generations
* add comment
* simplify condition statement
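As a worked example of the guard described above (the 10x factor comes from the commit message; the helper itself is hypothetical):

```go
package main

import "fmt"

// keepGenerating is a sketch of the run-on guard: stop once the number of
// generated tokens exceeds ten times the context window.
func keepGenerating(generated, numCtx int) bool {
	return generated < 10*numCtx
}

func main() {
	numCtx := 2048
	for _, n := range []int{100, 20479, 20480} {
		fmt.Printf("generated=%d continue=%v\n", n, keepGenerating(n, numCtx))
	}
}
```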
- 23 Apr, 2024 4 commits
Daniel Hiltgen authored
Tmp cleaners can nuke the file out from underneath us. This detects the missing runner and re-initializes the payloads.
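A hedged sketch of that recovery path; `ensureRunner`, the runner path, and `reextractPayloads` are stand-ins for the real payload handling:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// ensureRunner re-extracts the payloads if the runner binary has been removed
// (for example by a tmp cleaner) instead of failing the request.
func ensureRunner(path string, reextractPayloads func() error) error {
	if _, err := os.Stat(path); errors.Is(err, os.ErrNotExist) {
		fmt.Printf("runner %s missing, re-initializing payloads\n", path)
		return reextractPayloads()
	} else if err != nil {
		return err
	}
	return nil
}

func main() {
	err := ensureRunner("/tmp/ollama/runners/cpu/ollama_llama_server", func() error {
		fmt.Println("re-extracting payloads")
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```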
Daniel Hiltgen authored
This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
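A rough sketch of the two limits described above, using buffered channels as stand-in gates; the env var names and the defaults of 1 come from the commit message, everything else is illustrative:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envInt reads an integer environment variable, falling back to a default.
func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return def
}

func main() {
	numParallel := envInt("OLLAMA_NUM_PARALLEL", 1)    // concurrent requests per model
	maxLoaded := envInt("OLLAMA_MAX_LOADED_MODELS", 1) // models loaded at once

	loadedSlots := make(chan struct{}, maxLoaded)
	requestSlots := make(chan struct{}, numParallel)

	loadedSlots <- struct{}{}  // "load" one model
	requestSlots <- struct{}{} // admit one request against it
	fmt.Printf("parallel=%d max_loaded=%d in_flight=%d loaded=%d\n",
		numParallel, maxLoaded, len(requestSlots), len(loadedSlots))
}
```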
Daniel Hiltgen authored
Daniel Hiltgen authored
- 17 Apr, 2024 3 commits
Michael Yang authored
Michael Yang authored
ManniX-ITA authored
- 16 Apr, 2024 2 commits
Michael Yang authored
Michael Yang authored
- 15 Apr, 2024 1 commit
Jeffrey Morgan authored
* terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading
* use `unload` in signal handler
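A minimal sketch of that signal handling, assuming an `unload` step that terminates the runner; the structure is illustrative rather than the server's actual handler:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// Catch SIGINT/SIGTERM and run an unload step (which would terminate the
// runner subprocess) before exiting, so a model that is still loading does
// not leave an orphan process behind.
func main() {
	unload := func() { fmt.Println("unloading model / terminating subprocess") }

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

	go func() {
		<-sigs
		unload()
		os.Exit(0)
	}()

	fmt.Println("loading model... press Ctrl+C to interrupt")
	select {} // block forever; real code would continue loading here
}
```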
- 10 Apr, 2024 2 commits
Michael Yang authored
Michael Yang authored
- 09 Apr, 2024 1 commit
Daniel Hiltgen authored
During testing, we're seeing some models take over 3 minutes.
- 06 Apr, 2024 1 commit
Michael Yang authored
- 03 Apr, 2024 1 commit
Michael Yang authored
- 02 Apr, 2024 2 commits
Daniel Hiltgen authored
Michael Yang authored
- 01 Apr, 2024 1 commit
Daniel Hiltgen authored
This should resolve a number of memory leak and stability defects by allowing us to isolate llama.cpp in a separate process, shut it down when idle, and gracefully restart it if it has problems. This also serves as a first step toward running multiple copies to support multiple models concurrently.
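A hedged sketch of that supervision model; `superviseRunner`, the restart delay, and the binary path are assumptions, not the actual implementation:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// superviseRunner runs the llama.cpp server as a child process and, if it
// exits unexpectedly, restarts it after a short delay.
func superviseRunner(path string, args ...string) {
	for {
		cmd := exec.Command(path, args...)
		if err := cmd.Start(); err != nil {
			log.Printf("failed to start runner: %v", err)
			return
		}
		if err := cmd.Wait(); err != nil {
			log.Printf("runner exited with error: %v, restarting", err)
		} else {
			log.Print("runner exited cleanly")
			return
		}
		time.Sleep(time.Second)
	}
}

func main() {
	superviseRunner("/tmp/ollama/runners/cpu/ollama_llama_server", "--port", "0")
}
```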