- 29 Aug, 2025 1 commit
-
-
Daniel Hiltgen authored
* perf: build graph for next batch in parallel to keep GPU busy This refactors the main run loop of the ollama runner to perform the main GPU intensive tasks (Compute+Floats) in a go routine so we can prepare the next batch in parallel to reduce the amount of time the GPU stalls waiting for the next batch of work. * tests: tune integration tests for ollama engine This tunes the integration tests to focus more on models supported by the new engine.
-
- 15 Aug, 2025 1 commit
-
-
Daniel Hiltgen authored
* test: improve scheduler/concurrency stress tests The scheduler test used to use approximate memory figures and would often over or under shoot a systems capcity leading to flaky test results. This should improve the reliability of this scenario by leveraging ps output to determinie exactly how many models it takes to trigger thrashing. The concurrency test is also refined to target num_parallel + 1 and handle timeouts better. With these refinements, TestMultiModelConcurrency was redundant * test: add parallel generate with history TestGenerateWithHistory will help verify caching and context are properly handled while making requests * test: focus embed tests on embedding models remove non-embedding models from the embedding tests
-
- 02 Apr, 2025 1 commit
-
-
Bruce MacDonald authored
Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.
-
- 20 Nov, 2024 1 commit
-
-
Jesse Gross authored
Fragmentation of the KV cache can occur due to cache shifting or different sequences getting processed. Decode uses a heuristic to decide if it should defrag. However, this heuristic isn't 100% accurate, so decoding can sometimes fail by surprise. For these cases, if decode indicates that there is no KV cache space, we should defrag and then try again.
-
- 09 Jul, 2024 1 commit
-
-
Daniel Hiltgen authored
On the smaller GPUs, the initial model load of llama2 took over 30s (the default timeout for the DoGenerate helper)
-
- 14 Jun, 2024 3 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
adjust timing on some tests so they don't timeout on small/slow GPUs
-
Daniel Hiltgen authored
Still not complete, needs some refinement to our prediction to understand the discrete GPUs available space so we can see how many layers fit in each one since we can't split one layer across multiple GPUs we can't treat free space as one logical block
-
- 23 Apr, 2024 1 commit
-
-
Daniel Hiltgen authored
This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
-
- 04 Apr, 2024 1 commit
-
-
Daniel Hiltgen authored
Confirmed this fails on 0.1.30 with known regression but passes on main
-
- 01 Apr, 2024 1 commit
-
-
Daniel Hiltgen authored
Cleaner shutdown logic, a bit of response hardening
-
- 26 Mar, 2024 1 commit
-
-
Patrick Devine authored
-
- 25 Mar, 2024 1 commit
-
-
Daniel Hiltgen authored
If images aren't present, pull them. Also fixes the expected responses
-
- 23 Mar, 2024 1 commit
-
-
Daniel Hiltgen authored
This uplevels the integration tests to run the server which can allow testing an existing server, or a remote server.
-