- 17 Oct, 2025 1 commit

Daniel Hiltgen authored
* test: harden scheduler tests

  This removes reschedDelay, which was stale code, and adds a new configurable timeout for waitForVRAMRecovery so tests can now set the timeout to be very short to avoid the scheduler getting stuck and hitting a test timeout.

* test: tune tests for partial loads

  Give stress tests more time when the model is split between CPU/GPU.
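
The pattern behind the first fix, sketched minimally: the recovery wait reads its timeout from a variable instead of a constant, so tests can shrink it. waitForVRAMRecovery is the name from the commit, but the signature and the vramRecoveryTimeout variable below are illustrative assumptions, not ollama's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// vramRecoveryTimeout is a package-level variable (hypothetical name) rather
// than a constant, so tests can override it with a very short value and avoid
// stalling the whole suite when VRAM never recovers.
var vramRecoveryTimeout = 5 * time.Second

// waitForVRAMRecovery polls freeVRAM until enough memory is available or the
// configurable timeout expires.
func waitForVRAMRecovery(needed uint64, freeVRAM func() uint64) error {
	deadline := time.Now().Add(vramRecoveryTimeout)
	for time.Now().Before(deadline) {
		if freeVRAM() >= needed {
			return nil
		}
		time.Sleep(50 * time.Millisecond)
	}
	return fmt.Errorf("timed out after %v waiting for VRAM to recover", vramRecoveryTimeout)
}

func main() {
	// In a test, shrink the timeout so a stuck scheduler fails fast.
	vramRecoveryTimeout = 100 * time.Millisecond
	err := waitForVRAMRecovery(1<<30, func() uint64 { return 0 }) // VRAM never frees
	fmt.Println(err)
}
```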

- 08 Oct, 2025 1 commit

Daniel Hiltgen authored
Remove some flaky scenarios, and switch to chat for better reliability

- 02 Oct, 2025 1 commit

Daniel Hiltgen authored
Notable EOLs with this change:

- macOS v12 and v13 are no longer supported (v14+ required)
- AMD gfx900 and gfx906 are no longer supported

- 22 Sep, 2025 1 commit

Daniel Hiltgen authored
* tests: add single threaded history test

  Also tidies up some existing tests to handle more model output variation.

* test: add support for testing specific architectures

- 12 Sep, 2025 1 commit

Daniel Hiltgen authored
Sometimes the context test results are pure emojis. Thanksgiving has too much variability, so swap for a more straightforward prompt.

- 09 Sep, 2025 1 commit

Jesse Gross authored
The context must always be able to store the current batch, so if the user requests a small context then we should also shrink the batch to match. This also fixes the TestLongInputContext test on the new engine. (The old engine already has this behavior.)
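
A minimal sketch of the invariant this commit enforces: the batch can never exceed the context, so a small user-requested context shrinks the batch to match. Names here are illustrative, not the engine's actual fields.

```go
package main

import "fmt"

// clampBatch enforces the invariant from the commit message: the context must
// always be able to hold the current batch, so a user-requested context that
// is smaller than the batch size shrinks the batch to match.
// (Function and parameter names are illustrative, not ollama's actual code.)
func clampBatch(numCtx, numBatch int) int {
	if numBatch > numCtx {
		return numCtx
	}
	return numBatch
}

func main() {
	fmt.Println(clampBatch(2048, 512)) // typical case: batch unchanged -> 512
	fmt.Println(clampBatch(256, 512))  // small context: batch shrinks -> 256
}
```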

- 29 Aug, 2025 1 commit

Daniel Hiltgen authored
* perf: build graph for next batch in parallel to keep GPU busy

  This refactors the main run loop of the ollama runner to perform the main GPU-intensive tasks (Compute+Floats) in a goroutine so we can prepare the next batch in parallel, reducing the amount of time the GPU stalls waiting for the next batch of work.

* tests: tune integration tests for ollama engine

  This tunes the integration tests to focus more on models supported by the new engine.
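
A hedged sketch of the pipelining idea in plain Go: one goroutine prepares the next batch's graph while the current batch computes, so the GPU-side step never waits on graph construction. All names and timings below are stand-ins, not the runner's real code.

```go
package main

import (
	"fmt"
	"time"
)

type batch struct{ id int }

// buildGraph stands in for the CPU-side work of preparing a batch's compute
// graph (the expensive setup that used to block the GPU between batches).
func buildGraph(id int) batch {
	time.Sleep(10 * time.Millisecond)
	return batch{id: id}
}

// compute stands in for the GPU-intensive Compute+Floats step.
func compute(b batch) {
	time.Sleep(30 * time.Millisecond)
	fmt.Println("computed batch", b.id)
}

func main() {
	const n = 5
	batches := make(chan batch, 1)

	// Producer: prepare batch i+1 while the consumer is busy with batch i,
	// so the GPU doesn't stall waiting for graph construction.
	go func() {
		for i := 0; i < n; i++ {
			batches <- buildGraph(i)
		}
		close(batches)
	}()

	for b := range batches {
		compute(b)
	}
}
```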

- 15 Aug, 2025 1 commit

Daniel Hiltgen authored
* test: improve scheduler/concurrency stress tests

  The scheduler test used to use approximate memory figures and would often overshoot or undershoot a system's capacity, leading to flaky test results. This should improve the reliability of this scenario by leveraging ps output to determine exactly how many models it takes to trigger thrashing. The concurrency test is also refined to target num_parallel + 1 and handle timeouts better. With these refinements, TestMultiModelConcurrency was redundant.

* test: add parallel generate with history

  TestGenerateWithHistory will help verify caching and context are properly handled while making requests.

* test: focus embed tests on embedding models

  Remove non-embedding models from the embedding tests.
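
One way a test could lean on ps output, sketched under assumptions: ollama exposes a /api/ps endpoint listing resident models, so comparing the number of models requested against the number actually loaded reveals where eviction (thrashing) begins. The struct below declares only the fields this sketch uses; the real test's logic is more involved.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// psResponse mirrors the shape of ollama's /api/ps output just well enough
// to count loaded models.
type psResponse struct {
	Models []struct {
		Name string `json:"name"`
	} `json:"models"`
}

// loadedModels asks the server how many models are currently resident.
func loadedModels(host string) (int, error) {
	resp, err := http.Get(host + "/api/ps")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var ps psResponse
	if err := json.NewDecoder(resp.Body).Decode(&ps); err != nil {
		return 0, err
	}
	return len(ps.Models), nil
}

func main() {
	// After requesting N models, comparing N against the count actually
	// resident shows the point where the scheduler starts evicting.
	n, err := loadedModels("http://127.0.0.1:11434")
	if err != nil {
		fmt.Println("server not reachable:", err)
		return
	}
	fmt.Println("models currently loaded:", n)
}
```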

- 02 Apr, 2025 1 commit

Bruce MacDonald authored
Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.
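
A small demonstration of the equivalence:

```go
package main

import "fmt"

// Since Go 1.18, `any` is declared in the universe block as:
//
//	type any = interface{}
//
// It is a type alias, so the two spellings are interchangeable everywhere.
func describe(v any) string { // identical to describe(v interface{})
	return fmt.Sprintf("%T: %v", v, v)
}

func main() {
	var a any = 42
	var b interface{} = 42
	fmt.Println(describe(a), describe(b)) // int: 42 int: 42

	// This assignment compiles because any and interface{} are the same type.
	a = b
	_ = a
}
```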

- 20 Nov, 2024 1 commit

Jesse Gross authored
Fragmentation of the KV cache can occur due to cache shifting or different sequences getting processed. Decode uses a heuristic to decide if it should defrag. However, this heuristic isn't 100% accurate, so decoding can sometimes fail by surprise. For these cases, if decode indicates that there is no KV cache space, we should defrag and then try again.
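
A minimal sketch of the fallback, with illustrative names standing in for the runner's actual decode API: when decode reports no KV cache space, defrag once and retry before surfacing the error.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoKVSpace stands in for the decode status that signals a full or
// fragmented KV cache (names are illustrative, not the runner's actual API).
var errNoKVSpace = errors.New("no KV cache space")

type cache struct{ fragmented bool }

func (c *cache) decode(n int) error {
	if c.fragmented {
		return errNoKVSpace
	}
	fmt.Println("decoded", n, "tokens")
	return nil
}

func (c *cache) defrag() {
	fmt.Println("defragmenting KV cache")
	c.fragmented = false
}

// decodeWithRetry implements the fallback from the commit message: the
// heuristic that triggers defrag up front isn't 100% accurate, so when decode
// reports no KV cache space, defrag and try again before giving up.
func decodeWithRetry(c *cache, n int) error {
	err := c.decode(n)
	if errors.Is(err, errNoKVSpace) {
		c.defrag()
		return c.decode(n)
	}
	return err
}

func main() {
	c := &cache{fragmented: true}
	if err := decodeWithRetry(c, 32); err != nil {
		fmt.Println("decode failed:", err)
	}
}
```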

- 09 Jul, 2024 1 commit

Daniel Hiltgen authored
On the smaller GPUs, the initial model load of llama2 took over 30s (the default timeout for the DoGenerate helper)

- 14 Jun, 2024 3 commits

Daniel Hiltgen authored

Daniel Hiltgen authored
adjust timing on some tests so they don't time out on small/slow GPUs

Daniel Hiltgen authored
Still not complete; our prediction needs some refinement to understand each discrete GPU's available space so we can see how many layers fit in each one. Since we can't split one layer across multiple GPUs, we can't treat free space as one logical block.
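
A toy illustration of why free space isn't one logical block, assuming a fixed layer size and ignoring KV cache and buffer overheads:

```go
package main

import "fmt"

// layersPerGPU shows why free VRAM can't be treated as one logical block: a
// layer must fit entirely on a single GPU, so each device's free space is
// divided by the layer size independently and the leftovers are wasted.
// (A simplification; real estimates also account for KV cache, buffers, etc.)
func layersPerGPU(freeVRAM []uint64, layerSize uint64) []int {
	fit := make([]int, len(freeVRAM))
	for i, free := range freeVRAM {
		fit[i] = int(free / layerSize)
	}
	return fit
}

func main() {
	free := []uint64{5 << 30, 3 << 30} // two GPUs: 5 GiB and 3 GiB free
	const layer = 2 << 30              // 2 GiB per layer

	perGPU := layersPerGPU(free, layer)
	total := 0
	for _, n := range perGPU {
		total += n
	}
	// Summed free space (8 GiB) naively suggests 4 layers, but per GPU the
	// answer is 2 + 1 = 3: the leftover 1 GiB on each card can't hold a layer.
	fmt.Println("per GPU:", perGPU, "total:", total)
}
```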

- 23 Apr, 2024 1 commit

Daniel Hiltgen authored
This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
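
A minimal sketch of reading those knobs, assuming simple integer parsing with the defaults the commit describes (ollama's real environment handling is more involved):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envInt reads an integer environment variable, falling back to def when the
// variable is unset or malformed. The defaults below match the ones in the
// commit message: 1 concurrent request per model, 1 loaded model at a time.
func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return def
}

func main() {
	numParallel := envInt("OLLAMA_NUM_PARALLEL", 1)
	maxLoaded := envInt("OLLAMA_MAX_LOADED_MODELS", 1)
	fmt.Printf("parallel requests per model: %d, max loaded models: %d\n",
		numParallel, maxLoaded)
}
```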

- 04 Apr, 2024 1 commit

Daniel Hiltgen authored
Confirmed this fails on 0.1.30 with the known regression but passes on main

- 01 Apr, 2024 1 commit

Daniel Hiltgen authored
Cleaner shutdown logic, a bit of response hardening

- 26 Mar, 2024 1 commit

Patrick Devine authored

- 25 Mar, 2024 1 commit

Daniel Hiltgen authored
If images aren't present, pull them. Also fixes the expected responses

- 23 Mar, 2024 1 commit

Daniel Hiltgen authored
This uplevels the integration tests to run the server themselves, which also allows testing against an existing or remote server.
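
A sketch of the harness pattern under assumptions: an environment switch selects between spawning a local server and pointing at one that is already running. OLLAMA_TEST_EXISTING and the URL handling of OLLAMA_HOST below are assumptions about the test setup, not confirmed details.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"os/exec"
	"time"
)

// serverURL decides where tests point: an already-running (possibly remote)
// server if the environment says so, otherwise a locally spawned one. It
// returns the base URL and a cleanup function for anything it started.
func serverURL() (string, func(), error) {
	if os.Getenv("OLLAMA_TEST_EXISTING") != "" {
		host := os.Getenv("OLLAMA_HOST")
		if host == "" {
			host = "http://127.0.0.1:11434"
		}
		return host, func() {}, nil // nothing to clean up; we didn't start it
	}

	// Spawn our own server and hand back a cleanup function that stops it.
	cmd := exec.Command("ollama", "serve")
	if err := cmd.Start(); err != nil {
		return "", nil, err
	}
	cleanup := func() { cmd.Process.Kill() }

	// Wait briefly for the server to come up before handing it to tests.
	url := "http://127.0.0.1:11434"
	for i := 0; i < 50; i++ {
		if resp, err := http.Get(url); err == nil {
			resp.Body.Close()
			return url, cleanup, nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	cleanup()
	return "", nil, fmt.Errorf("server did not start")
}

func main() {
	url, cleanup, err := serverURL()
	if err != nil {
		fmt.Println(err)
		return
	}
	defer cleanup()
	fmt.Println("testing against", url)
}
```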