- 30 Oct, 2025 1 commit
Patrick Devine authored
-
- 29 Oct, 2025 3 commits
Daniel Hiltgen authored
This should reduce zombies during integration runs.
-
Patrick Devine authored
-
Michael Yang authored
-
- 28 Oct, 2025 2 commits
Patrick Devine authored
-
Patrick Devine authored
This reverts commit 5d347f6d.
-
- 27 Oct, 2025 1 commit
nicole pardal authored
Currently, checking the length of embedding prompts to ensure they fit in the context window (and possibly truncating them) occurs in two places: the Ollama server and the runner. This can lead to inconsistencies in both the checks and the reported number of tokens processed. Since we have to do this processing in the runner anyway, this consolidates all of the logic there.
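A minimal sketch of what a consolidated check in the runner might look like; ErrTooLong, fitToContext, numCtx, and the truncate flag are illustrative names, not Ollama's actual API:

```go
// Sketch: validate (and optionally truncate) an embedding prompt in one place,
// so the check and the reported token count can never disagree.
package main

import (
	"errors"
	"fmt"
)

// ErrTooLong is an illustrative error, not Ollama's real one.
var ErrTooLong = errors.New("prompt exceeds context window")

// fitToContext returns the tokens that will actually be processed,
// truncating only when the caller allows it.
func fitToContext(tokens []int, numCtx int, truncate bool) ([]int, error) {
	if len(tokens) <= numCtx {
		return tokens, nil
	}
	if !truncate {
		return nil, ErrTooLong
	}
	return tokens[:numCtx], nil // report len(result) as tokens processed
}

func main() {
	tokens := make([]int, 10)
	kept, err := fitToContext(tokens, 8, true)
	fmt.Println(len(kept), err) // 8 <nil>
}
```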
-
- 20 Oct, 2025 1 commit
Jeffrey Morgan authored
-
- 17 Oct, 2025 1 commit
Daniel Hiltgen authored
* test: harden scheduler tests
  This removes reschedDelay, which was stale code, and adds a configurable timeout for waitForVRAMRecovery so tests can set a very short timeout and avoid the scheduler getting stuck and hitting a test timeout (see the sketch below).
* test: tune tests for partial loads
  Give stress tests more time when the model is split between CPU and GPU.
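A minimal sketch of the configurable-timeout pattern from the first bullet, assuming an illustrative Scheduler type; the real waitForVRAMRecovery in Ollama may have a different signature:

```go
// Sketch of the pattern: make a previously hard-coded recovery timeout a
// struct field so tests can shrink it to milliseconds.
package main

import (
	"fmt"
	"time"
)

type Scheduler struct {
	// vramRecoveryTimeout defaults to a production value; tests override it.
	vramRecoveryTimeout time.Duration
}

func (s *Scheduler) waitForVRAMRecovery(recovered <-chan struct{}) bool {
	select {
	case <-recovered:
		return true
	case <-time.After(s.vramRecoveryTimeout):
		return false // give up instead of stalling the scheduler (and the test)
	}
}

func main() {
	s := &Scheduler{vramRecoveryTimeout: 10 * time.Millisecond} // test-sized timeout
	fmt.Println(s.waitForVRAMRecovery(make(chan struct{})))     // false, quickly
}
```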
-
- 16 Oct, 2025 1 commit
Daniel Hiltgen authored
-
- 08 Oct, 2025 1 commit
Daniel Hiltgen authored
Remove some flaky scenarios, and switch to chat for better reliability.
-
- 02 Oct, 2025 1 commit
Daniel Hiltgen authored
Notable EOLs with this change:
- macOS v12 and v13 are no longer supported (v14+ required)
- AMD gfx900 and gfx906 are no longer supported
-
- 01 Oct, 2025 1 commit
Daniel Hiltgen authored
This revamps how we discover GPUs in the system by leveraging the Ollama runner. This should eliminate inconsistency between our GPU discovery and the runner's capabilities at runtime, particularly for cases where we try to filter out unsupported GPUs; now the runner does that implicitly based on the actual device list. In some cases free VRAM reporting can be unreliable, which can lead to scheduling mistakes, so this also includes a patch to leverage more reliable VRAM reporting libraries where available.

Automatic workarounds have been removed, as only one GPU leveraged this, which is now documented. That GPU will soon fall off the support matrix with the next ROCm bump. Additional cleanup of the scheduler and discovery packages can be done in the future once we have switched on the new memory management code and removed support for the llama runner.
-
- 22 Sep, 2025 1 commit
Daniel Hiltgen authored
* tests: add single threaded history test
  Also tidies up some existing tests to handle more model output variation.
* test: add support for testing specific architectures
-
- 18 Sep, 2025 1 commit
Michael Yang authored
-
- 12 Sep, 2025 1 commit
Daniel Hiltgen authored
Sometimes the context test results are pure emojis. The Thanksgiving prompt has too much variability, so swap it for a more straightforward prompt.
-
- 09 Sep, 2025 3 commits
Parth Sareen authored
-
Daniel Hiltgen authored
* tests: reduce stress on CPU to 2 models
  This should avoid flakes due to systems getting overloaded with 3 (or more) models running concurrently.
* tests: allow slow systems to pass on timeout
  If a slow system is still streaming a response, and the response will pass validation, don't fail just because the system is slow (see the sketch below).
* test: unload embedding models more quickly
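A minimal sketch of the "slow but correct" rule from the second bullet; judge and isValid are hypothetical helpers, not the test suite's real code:

```go
// Sketch of the rule: a context deadline alone is not a failure if the
// partial streamed response already validates.
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

func judge(err error, partial string, isValid func(string) bool) error {
	if errors.Is(err, context.DeadlineExceeded) && isValid(partial) {
		return nil // slow system, correct output: pass instead of flaking
	}
	return err
}

func main() {
	isValid := func(s string) bool { return strings.Contains(s, "4") }
	err := judge(context.DeadlineExceeded, "2+2 = 4 because...", isValid)
	fmt.Println(err) // <nil>: slow but correct
}
```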
-
Jesse Gross authored
The context must always be able to store the current batch, so if the user requests a small context then we should also shrink the batch to match. This also fixes the TestLongInputContext test on the new engine. (The old engine already has this behavior.)
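A minimal sketch of the clamp this describes; effectiveBatchSize and its parameter names are illustrative, not the engine's actual configuration:

```go
// Sketch: the batch can never exceed the context, so clamp it at setup time.
package main

import "fmt"

func effectiveBatchSize(numCtx, numBatch int) int {
	if numBatch > numCtx {
		return numCtx // a batch larger than the context could never be stored
	}
	return numBatch
}

func main() {
	fmt.Println(effectiveBatchSize(128, 512))  // 128: batch shrunk to match a small context
	fmt.Println(effectiveBatchSize(4096, 512)) // 512: the default batch already fits
}
```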
-
- 29 Aug, 2025 1 commit
Daniel Hiltgen authored
* perf: build graph for next batch in parallel to keep GPU busy
  This refactors the main run loop of the ollama runner to perform the main GPU-intensive tasks (Compute+Floats) in a goroutine so we can prepare the next batch in parallel, reducing the amount of time the GPU stalls waiting for the next batch of work (see the sketch below).
* tests: tune integration tests for ollama engine
  This tunes the integration tests to focus more on models supported by the new engine.
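A minimal sketch of the overlap described in the first bullet: the loop prepares batch N+1 on the CPU while a goroutine runs the GPU work for batch N. prepare and compute are stand-ins for the runner's real work, not its actual functions:

```go
// Sketch: while the GPU computes batch N, the main loop builds batch N+1.
package main

import "fmt"

type batch struct{ id int }

func prepare(id int) batch { return batch{id} }               // CPU-side: build the next graph
func compute(b batch)      { fmt.Println("computed", b.id) }  // GPU-side: Compute+Floats

func main() {
	const n = 4
	done := make(chan struct{}, 1)
	done <- struct{}{} // no GPU work in flight yet
	for i := 0; i < n; i++ {
		b := prepare(i) // overlaps with the previous compute goroutine
		<-done          // wait for the GPU to finish the prior batch
		go func(b batch) {
			compute(b)
			done <- struct{}{}
		}(b)
	}
	<-done // drain the final batch
}
```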
-
- 15 Aug, 2025 1 commit
Daniel Hiltgen authored
* test: improve scheduler/concurrency stress tests
  The scheduler test used to use approximate memory figures and would often over- or under-shoot a system's capacity, leading to flaky test results. This should improve the reliability of this scenario by leveraging ps output to determine exactly how many models it takes to trigger thrashing. The concurrency test is also refined to target num_parallel + 1 and handle timeouts better. With these refinements, TestMultiModelConcurrency was redundant.
* test: add parallel generate with history
  TestGenerateWithHistory will help verify caching and context are properly handled while making requests.
* test: focus embed tests on embedding models
  Remove non-embedding models from the embedding tests.
-
- 14 Aug, 2025 1 commit
Daniel Hiltgen authored
Some of the new models need a few more valid responses to pass.
-
- 13 Aug, 2025 1 commit
Daniel Hiltgen authored
-
- 07 Aug, 2025 1 commit
Daniel Hiltgen authored
Also wires up support to override the default "smol" model.
-
- 11 Jul, 2025 1 commit
Daniel Hiltgen authored
* Only load supported models on new engine
  Verify the model is supported before trying to load.
* int: testcase for all library models
-
- 05 Jul, 2025 1 commit
Daniel Hiltgen authored
Usage example:
  go test --tags=integration,perf -count 1 ./integration -v -timeout 1h -run TestModelsPerf 2>&1 | tee int.log
  cat int.log | grep MODEL_PERF_HEADER | cut -f2- -d: > perf.csv
  cat int.log | grep MODEL_PERF_DATA | cut -f2- -d: >> perf.csv
-
- 19 Jun, 2025 1 commit
Daniel Hiltgen authored
Verified these fail on 0.9.1 and pass on HEAD.
-
- 24 May, 2025 1 commit
Daniel Hiltgen authored
-
- 22 May, 2025 1 commit
Daniel Hiltgen authored
Replace the older llava model with qwen2.5 for vision tests. Skip the split-batch test on small VRAM systems to avoid excessive test time.
-
- 06 May, 2025 1 commit
Daniel Hiltgen authored
* Move quantization logic to GGML via new backend
  This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
* Remove "add model quantizations"
  This is no longer needed now that quantization is implemented directly in Go+GGML code.
-
- 04 May, 2025 1 commit
湛露先生 authored
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
-
- 29 Apr, 2025 1 commit
Daniel Hiltgen authored
The cleanup routine from InitServerconnection should run in the test case's defer to properly detect failures and report the server logs.
-
- 16 Apr, 2025 1 commit
Daniel Hiltgen authored
Add some new test coverage for various model architectures, and switch from orca-mini to the small llama model.
-
- 08 Apr, 2025 1 commit
CYJiang authored
Signed-off-by: googs1025 <googs1025@gmail.com>
-
- 02 Apr, 2025 1 commit
Bruce MacDonald authored
Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.
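For illustration, a small program showing the two spellings are the same type (requires Go 1.18+):

```go
// interface{} and any are interchangeable: the standard library declares
// `type any = interface{}`, an alias, so values flow both ways freely.
package main

import "fmt"

func describeOld(v interface{}) string { return fmt.Sprintf("%T", v) }
func describeNew(v any) string         { return fmt.Sprintf("%T", v) }

func main() {
	var x any = 42
	var y interface{} = x // assignable in both directions: same type
	fmt.Println(describeOld(y), describeNew(y)) // int int
}
```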
-
- 14 Mar, 2025 1 commit
Jesse Gross authored
Models may require that a set of inputs all be processed as part of the same batch. For example, if an image has multiple patches with fully connected attention between them, we should not split the batch in the middle of an image.

Fixes #9697
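A minimal sketch of the constraint: a group of inputs tagged as belonging together (e.g. the patches of one image) is either added whole or deferred to the next batch. Input, sameBatch, and splitBatches are illustrative names, not necessarily the engine's real types:

```go
// Sketch: flush the current batch rather than split a bound group across two.
package main

import "fmt"

type Input struct {
	token     int
	sameBatch int // number of following inputs that must share this input's batch
}

func splitBatches(inputs []Input, batchSize int) [][]Input {
	var batches [][]Input
	var cur []Input
	for i := 0; i < len(inputs); {
		group := 1 + inputs[i].sameBatch // the input plus its bound followers
		if len(cur)+group > batchSize && len(cur) > 0 {
			batches = append(batches, cur) // flush rather than split the group
			cur = nil
		}
		cur = append(cur, inputs[i:i+group]...)
		i += group
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	// Input 2 starts a 3-element "image" group that must stay together.
	inputs := []Input{{token: 1}, {token: 2, sameBatch: 2}, {token: 3}, {token: 4}, {token: 5}}
	for _, b := range splitBatches(inputs, 3) {
		fmt.Println(len(b)) // 1, then 3 (the image group stays whole), then 1
	}
}
```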
-
- 10 Dec, 2024 1 commit
Stefan Weil authored
-
- 22 Nov, 2024 1 commit
Daniel Hiltgen authored
This had fallen out of sync with the envconfig behavior, where the max queue default was not zero.
-
- 20 Nov, 2024 1 commit
Jesse Gross authored
Fragmentation of the KV cache can occur due to cache shifting or different sequences getting processed. Decode uses a heuristic to decide if it should defrag. However, this heuristic isn't 100% accurate, so decoding can sometimes fail unexpectedly. For these cases, if decode indicates that there is no KV cache space, we should defrag and then try again.
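A minimal sketch of the recovery path, assuming hypothetical decode/defrag hooks rather than the runner's actual API:

```go
// Sketch: when decode reports no KV cache space, defragment once and retry
// instead of surfacing the failure.
package main

import (
	"errors"
	"fmt"
)

// errNoKvSpace is an illustrative sentinel for the "no slot for batch" case.
var errNoKvSpace = errors.New("could not find a KV slot for the batch")

func decodeWithRetry(decode func() error, defrag func()) error {
	err := decode()
	if errors.Is(err, errNoKvSpace) {
		defrag()       // compact fragmented cache cells
		err = decode() // one retry; a second failure is a real error
	}
	return err
}

func main() {
	calls := 0
	decode := func() error {
		calls++
		if calls == 1 {
			return errNoKvSpace // heuristic missed; cache was too fragmented
		}
		return nil
	}
	fmt.Println(decodeWithRetry(decode, func() { fmt.Println("defrag") }))
}
```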
-
- 01 Nov, 2024 1 commit
Daniel Hiltgen authored
-