- 29 Aug, 2025 1 commit
-
-
Daniel Hiltgen authored
* perf: build graph for next batch in parallel to keep GPU busy

  This refactors the main run loop of the ollama runner to perform the main GPU-intensive tasks (Compute+Floats) in a goroutine, so we can prepare the next batch in parallel and reduce the time the GPU stalls waiting for the next batch of work.

* tests: tune integration tests for ollama engine

  This tunes the integration tests to focus more on models supported by the new engine.
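The overlap described in the perf commit can be illustrated with a minimal sketch. This is not the runner's actual code: `batch`, `prepare`, `compute`, and `runPipelined` are hypothetical names, and the "GPU" step is simulated with a string operation. It only shows the general pattern of running the heavy step in a goroutine while the next batch is prepared.

```go
package main

import (
	"fmt"
	"strings"
)

// batch is a hypothetical stand-in for a prepared unit of work.
type batch struct {
	id     int
	inputs []string
}

// prepare simulates the CPU-side work of building the next batch.
func prepare(id int) batch {
	return batch{id: id, inputs: []string{fmt.Sprintf("token-%d", id)}}
}

// compute simulates the GPU-heavy step (assumed, not the real Compute+Floats).
func compute(b batch) string {
	return strings.ToUpper(b.inputs[0])
}

// runPipelined overlaps compute for batch i with prepare for batch i+1.
func runPipelined(n int) []string {
	results := make([]string, 0, n)
	if n <= 0 {
		return results
	}
	next := prepare(0)
	for i := 0; i < n; i++ {
		cur := next
		done := make(chan string, 1)
		// Start the heavy step in the background.
		go func() { done <- compute(cur) }()
		// While it runs, build the following batch on this goroutine.
		if i+1 < n {
			next = prepare(i + 1)
		}
		// Wait for the heavy step before moving on, preserving order.
		results = append(results, <-done)
	}
	return results
}

func main() {
	fmt.Println(runPipelined(3))
}
```

The buffered channel lets the background goroutine finish without blocking even if the main loop is still preparing the next batch.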
-
- 16 Apr, 2025 1 commit
-
-
Daniel Hiltgen authored
Add some new test coverage for various model architectures, and switch from orca-mini to the small llama model.
-
- 02 Apr, 2025 1 commit
-
-
Bruce MacDonald authored
Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.
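A short sketch of what the alias means in practice. The function names `describeOld` and `describeNew` are made up for illustration; the point is that the two spellings denote the same type, so values and function types are interchangeable without conversion.

```go
package main

import "fmt"

// describeOld uses the pre-1.18 spelling of the empty interface.
func describeOld(v interface{}) string { return fmt.Sprintf("%T", v) }

// describeNew uses the alias; its signature is the identical type.
func describeNew(v any) string { return fmt.Sprintf("%T", v) }

func main() {
	var a interface{} = 42
	var b any = a // no conversion needed: same type

	fmt.Println(describeOld(b), describeNew(a))

	// Function types using either spelling are identical as well.
	var f func(interface{}) string = describeNew
	fmt.Println(f("x"))
}
```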
-
- 31 Oct, 2024 1 commit
-
-
Daniel Hiltgen authored
* Give unicode test more time to run

  Some slower GPUs (or partial CPU/GPU loads) can take more than the default 30s to complete this test.

* Give more time for concurrency test

  CPU inference can be very slow under stress.
-
- 29 Oct, 2024 1 commit
-
-
Jesse Gross authored
-
- 22 Oct, 2024 1 commit
-
-
Jesse Gross authored
We check for partial unicode characters and accumulate them before sending. However, when we did send, we still sent each individual piece separately, leading to broken output. This combines everything into a single group, which is also more efficient. This also switches to the built-in check for valid unicode characters, which is stricter. After this, we should never send back an invalid sequence. Fixes #7290
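The accumulate-then-send behaviour can be sketched as follows. This is an illustration under assumptions, not the code from the commit: the `flusher` type and its `add` method are hypothetical, and the sketch assumes the "built-in check" is `utf8.ValidString` from the standard library.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// flusher is a hypothetical accumulator for streamed output pieces.
type flusher struct {
	pending strings.Builder
}

// add appends a piece and, once everything accumulated so far is valid
// UTF-8, returns it as a single combined string. Otherwise it holds the
// bytes back and reports false.
func (f *flusher) add(piece string) (string, bool) {
	f.pending.WriteString(piece)
	s := f.pending.String()
	if !utf8.ValidString(s) {
		return "", false
	}
	f.pending.Reset()
	return s, true
}

func main() {
	var f flusher
	// "é" is the two bytes 0xC3 0xA9, split here across two pieces.
	for _, p := range []string{"h", "\xc3", "\xa9", "!"} {
		if out, ok := f.add(p); ok {
			fmt.Printf("send %q\n", out)
		}
	}
}
```

A real implementation would also need a policy for bytes that never become valid, such as a length limit; this sketch simply keeps holding them.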
-
- 22 Jul, 2024 1 commit
-
-
Michael Yang authored
-
- 23 Apr, 2024 2 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
-
- 01 Apr, 2024 1 commit
-
-
Daniel Hiltgen authored
Cleaner shutdown logic, a bit of response hardening
-
- 26 Mar, 2024 1 commit
-
-
Patrick Devine authored
-
- 25 Mar, 2024 1 commit
-
-
Daniel Hiltgen authored
If images aren't present, pull them. Also fixes the expected responses.
-
- 23 Mar, 2024 1 commit
-
-
Daniel Hiltgen authored
This uplevels the integration tests to run the server, which allows testing an existing server or a remote server.
-