- 10 Oct, 2025 4 commits
-
-
Michael Yang authored
-
yajianggroup authored
Signed-off-by: yajianggroup <yajianggroup@outlook.com>
-
Jeffrey Morgan authored
-
Patrick Devine authored
-
- 09 Oct, 2025 9 commits
-
-
shengxinjing authored
-
shengxinjing authored
-
Michael Yang authored
-
Michael Yang authored
This change updates how metrics are collected. Until now, performance metrics, specifically the initial input processing and subsequent generation durations, were collected by taking timestamps at sequence creation, first token generation, and completion of generation. The processing duration was computed as first token generation minus sequence creation, while the generation duration was computed as completion of generation minus first token generation. While this approach is an accurate end-to-end measure of processing and generation, it's not comparable to other tools, which only measure the active, i.e. decode, duration. This change updates the metrics to capture only decode duration so it can be more directly compared to other tools.
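A minimal Go sketch of the decode-only timing idea described above; the type and field names are illustrative stand-ins, not the actual runner code:

```go
package main

import (
	"fmt"
	"time"
)

type seqMetrics struct {
	decodeTokens   int
	decodeDuration time.Duration // accumulates only active decode time
}

// decodeStep wraps a single decode pass and accumulates only the time
// spent inside it, rather than wall-clock time since sequence creation.
func decodeStep(m *seqMetrics, step func()) {
	start := time.Now()
	step()
	m.decodeDuration += time.Since(start)
	m.decodeTokens++
}

func main() {
	var m seqMetrics
	for i := 0; i < 3; i++ {
		// stand-in for one decode pass of the model
		decodeStep(&m, func() { time.Sleep(10 * time.Millisecond) })
	}
	fmt.Printf("decode: %d tokens in %v (%.1f tok/s)\n",
		m.decodeTokens, m.decodeDuration,
		float64(m.decodeTokens)/m.decodeDuration.Seconds())
}
```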
-
Daniel Hiltgen authored
* logs: quiet down context canceled on completion

  If the client closes the connection before Completion finishes, we were logging at error level, implying the runner crashed, which was misleading (a sketch of the quieter handling follows below):

  time=2025-10-08T22:59:20.566-07:00 level=ERROR source=server.go:1490 msg="post predict" error="Post \"http://127.0.0.1:57736/completion\": context canceled"

* quiet down scheduler log error on expected case

  Since we don't hold the lock while performing memory load calculations, other runners can unload in parallel, so finding no runner to unload is a valid scenario which we shouldn't log at error level.
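A hedged sketch, assuming Go's standard context, errors, and log/slog packages, of demoting expected client disconnects below error level; the function name and log messages are illustrative, not the actual server code:

```go
package main

import (
	"context"
	"errors"
	"log/slog"
)

// logPredictError demotes the expected "client went away" case so it no
// longer looks like a runner crash, while real failures stay at error.
func logPredictError(err error) {
	if err == nil {
		return
	}
	if errors.Is(err, context.Canceled) {
		slog.Debug("post predict canceled by client", "error", err)
		return
	}
	slog.Error("post predict", "error", err)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	logPredictError(ctx.Err())                        // expected: logged at debug
	logPredictError(errors.New("connection refused")) // unexpected: logged at error
}
```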
-
Parth Sareen authored
-
Patrick Devine authored
-
Jeffrey Morgan authored
This reverts commit 6a62b894.
-
Jeffrey Morgan authored
-
- 08 Oct, 2025 4 commits
-
-
Patrick Devine authored
-
Jesse Gross authored
Sliding window models (e.g. gpt-oss, gemma3) remove tokens that are outside the cache's window each time we start a new forward pass. The cache storage needs to handle the window size for each sequence plus the batch size, since the batch needs to attend to the full window. This means we have more than a window's worth of tokens stored while processing the batch.

When the next batch comes in, we currently only look at the sequences in the incoming batch to slide the window forward. However, we also need to clean up the other sequences that might be occupying space in the batch processing buffer, to ensure each sequence is only using its window size of storage. Failing to do this can result in "no kv cache slot found" errors.

Fixes: #10127
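A simplified Go sketch of the cleanup described above, trimming every tracked sequence to its window rather than only the sequences in the incoming batch; the types and window size are hypothetical:

```go
package main

import "fmt"

type sequence struct {
	id     int
	tokens []int // token positions currently held in cache storage
}

const windowSize = 4

// slideAll trims every sequence down to windowSize, including sequences
// that are not in the current batch but still occupy cache slots.
func slideAll(seqs []*sequence) {
	for _, s := range seqs {
		if extra := len(s.tokens) - windowSize; extra > 0 {
			s.tokens = s.tokens[extra:]
		}
	}
}

func main() {
	seqs := []*sequence{
		{id: 0, tokens: []int{0, 1, 2, 3, 4, 5}},    // in the incoming batch
		{id: 1, tokens: []int{0, 1, 2, 3, 4, 5, 6}}, // idle, but still holding cache slots
	}
	slideAll(seqs)
	for _, s := range seqs {
		fmt.Printf("seq %d now holds %d tokens\n", s.id, len(s.tokens))
	}
}
```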
-
Jesse Gross authored
GGML picks the wrong kernel and these systems fail with:

Sep 28 22:25:39 xavier ollama[48999]: //ml/backend/ggml/ggml/src/ggml-cuda/fattn-wmma-f16.cu:437: ERROR: CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 720. ggml-cuda.cu was compiled for: __CUDA_ARCH_LIST__

Fixes #12442
-
Daniel Hiltgen authored
Remove some flaky scenarios, and switch to chat for better reliability
-
- 07 Oct, 2025 2 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
* Bring back escape valve for llm libraries

  If the new discovery logic picks the wrong library, this gives users the ability to force a specific one using the same pattern as before (a hedged sketch of the pattern follows below). This can also potentially speed up bootstrap discovery if one of the libraries takes a long time to load and ultimately binds to no devices. For example, unsupported AMD iGPUs can sometimes take a while to discover and rule out.

* Bypass extra discovery on Jetpack systems

  On at least JetPack 6, cuda_v12 appears to expose the iGPU but crashes later on in cublasInit, so if we detect a Jetpack, short-circuit and use that variant.
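A rough Go sketch of both behaviors; the environment variable name, the Jetpack detection, and the library variant names are assumptions for illustration, not the actual discovery code:

```go
package main

import (
	"fmt"
	"os"
)

// isJetpack is a stand-in check; a real detection might inspect
// /etc/nv_tegra_release or similar (assumption, not the actual logic).
func isJetpack() bool {
	_, err := os.Stat("/etc/nv_tegra_release")
	return err == nil
}

// pickLibrary applies the escape valve and Jetpack short-circuit before
// falling back to whatever bootstrap discovery found.
func pickLibrary(discovered []string) string {
	// 1. A user-forced library wins and bypasses discovery entirely.
	if forced := os.Getenv("OLLAMA_LLM_LIBRARY"); forced != "" { // variable name assumed
		return forced
	}
	// 2. On Jetpack systems, avoid the generic cuda_v12 path that can
	//    crash later in cublasInit and use the Jetpack variant instead.
	if isJetpack() {
		return "cuda_jetpack" // variant name assumed
	}
	// 3. Otherwise use the first discovered library, or fall back to CPU.
	if len(discovered) > 0 {
		return discovered[0]
	}
	return "cpu"
}

func main() {
	fmt.Println("using library:", pickLibrary([]string{"cuda_v12", "cpu"}))
}
```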
-
- 06 Oct, 2025 3 commits
-
-
Devon Rifkin authored
openai: refactor to split compat layer and middleware
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
This variable isn't currently documented or intended as something the user can override, but if the user happens to set OLLAMA_LIBRARY_PATH, we were doubling it in the subprocess environment, which will cause problems with the new bootstrap discovery logic.
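A small Go sketch of the general idea of not duplicating a path-list variable when building a subprocess environment; how the actual fix handles this is not shown here, so treat the helper below as an assumption:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// mergeLibraryPath prepends ourPath to any inherited value only if it
// is not already present, so the subprocess never sees the entry twice.
func mergeLibraryPath(ourPath string) string {
	existing := os.Getenv("OLLAMA_LIBRARY_PATH")
	if existing == "" {
		return ourPath
	}
	for _, p := range strings.Split(existing, string(os.PathListSeparator)) {
		if p == ourPath {
			return existing // already present; don't double it
		}
	}
	return ourPath + string(os.PathListSeparator) + existing
}

func main() {
	os.Setenv("OLLAMA_LIBRARY_PATH", "/opt/ollama/lib")
	fmt.Println(mergeLibraryPath("/opt/ollama/lib")) // prints the path once, not twice
}
```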
-
- 05 Oct, 2025 1 commit
-
-
Devon Rifkin authored
This makes the core openai compat layer independent of the middleware that adapts it to our particular gin routes
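A rough sketch of that separation: a pure conversion function with no web-framework types, plus a thin HTTP adapter standing in for the gin middleware (net/http is used here only to keep the sketch self-contained; all names are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// ---- compat layer: pure data conversion, no framework types ----

type openaiChatRequest struct {
	Model    string `json:"model"`
	Messages []struct {
		Role    string `json:"role"`
		Content string `json:"content"`
	} `json:"messages"`
}

type nativeChatRequest struct {
	Model  string
	Prompt string
}

// fromOpenAI converts an OpenAI-style request into a native request
// shape. It knows nothing about HTTP routing or middleware.
func fromOpenAI(r openaiChatRequest) nativeChatRequest {
	var prompt string
	for _, m := range r.Messages {
		prompt += m.Role + ": " + m.Content + "\n"
	}
	return nativeChatRequest{Model: r.Model, Prompt: prompt}
}

// ---- adapter: binds the compat layer to a concrete route ----

func chatCompatHandler(w http.ResponseWriter, r *http.Request) {
	var req openaiChatRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	native := fromOpenAI(req)
	fmt.Fprintf(w, "would dispatch model=%s prompt=%q\n", native.Model, native.Prompt)
}

func main() {
	http.HandleFunc("/v1/chat/completions", chatCompatHandler)
	http.ListenAndServe("127.0.0.1:8080", nil)
}
```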
-
- 04 Oct, 2025 2 commits
-
-
Daniel Hiltgen authored
Resolve subtle ErrorAction stickiness difference between x86 and arm builder setup
-
Daniel Hiltgen authored
-
- 03 Oct, 2025 10 commits
-
-
Jesse Gross authored
With the new version of GGML in #12245, KV cache quantization no longer causes a fallback to CPU.
-
Grace authored
-
Daniel Hiltgen authored
The CUDA APIs for reporting free VRAM are useless on NVIDIA iGPU systems, as they only return the kernel's actual free memory and ignore buff/cache allocations, which on a typical system will quickly fill up most of the free system memory. As a result, we incorrectly think there's very little memory available for GPU allocations.
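A hedged, Linux-only Go sketch of one way to get a more realistic figure on a unified-memory iGPU: read MemAvailable from /proc/meminfo, which accounts for reclaimable buff/cache. Whether the actual change uses this exact source is an assumption:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// memAvailableBytes parses MemAvailable from /proc/meminfo. Unlike a
// naive free-memory query, this counts memory the kernel can reclaim
// from buff/cache, which matters on unified-memory iGPU systems.
func memAvailableBytes() (uint64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "MemAvailable:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	return 0, fmt.Errorf("MemAvailable not found in /proc/meminfo")
}

func main() {
	b, err := memAvailableBytes()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("approx. memory available for GPU allocations: %d MiB\n", b/(1024*1024))
}
```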
-
Patrick Devine authored
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
This will likely yield builds that have problems with Unicode characters, but at least we can start testing the release while we try to find an alternate clang compiler for Windows, or until MinGW ships a fixed version.
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
-
Patrick Devine authored
-
Jesse Gross authored
-
- 02 Oct, 2025 4 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
Notable EOLs with this change:
- macOS v12 and v13 are no longer supported (v14+ required)
- AMD gfx900 and gfx906 are no longer supported
-
Jesse Gross authored
As we automatically enable flash attention for more models, there are likely some cases where we get it wrong. This allows setting OLLAMA_FLASH_ATTENTION=0 to disable it, even for models that usually have flash attention.
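A minimal Go sketch of the override behavior described above; the exact set of accepted values and the surrounding function name are assumptions:

```go
package main

import (
	"fmt"
	"os"
)

// flashAttentionEnabled returns the model's default unless the user set
// OLLAMA_FLASH_ATTENTION, in which case the user's choice wins, so
// setting it to "0" disables flash attention even for models that
// normally enable it.
func flashAttentionEnabled(modelDefault bool) bool {
	switch os.Getenv("OLLAMA_FLASH_ATTENTION") {
	case "":
		return modelDefault
	case "0", "false":
		return false
	default:
		return true
	}
}

func main() {
	os.Setenv("OLLAMA_FLASH_ATTENTION", "0")
	fmt.Println(flashAttentionEnabled(true)) // false: the override wins over the model default
}
```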
-
Daniel Hiltgen authored
Wrong index variable was used.
-
- 01 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
This revamps how we discover GPUs in the system by leveraging the Ollama runner. This should eliminate inconsistency between our GPU discovery and the runner's capabilities at runtime, particularly for cases where we try to filter out unsupported GPUs: now the runner does that implicitly based on the actual device list.

In some cases free VRAM reporting can be unreliable, which can lead to scheduling mistakes, so this also includes a patch to leverage more reliable VRAM reporting libraries if available.

Automatic workarounds have been removed, as only one GPU leveraged this, which is now documented. This GPU will soon fall off the support matrix with the next ROCm bump. Additional cleanup of the scheduler and discovery packages can be done in the future once we have switched on the new memory management code and removed support for the llama runner.
-