- 11 Oct, 2025 2 commits
-
-
frob authored
-
Daniel Hiltgen authored
-
- 10 Oct, 2025 10 commits
-
-
Michael Yang authored
hardErrCh will deadlock since forwardBatch is blocked on computeStartedCh, which never gets sent. Since the only response to hardErrCh is to panic, just panic at the error site instead.
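A minimal sketch of the deadlock shape and the fix, assuming unbuffered channels named as in the message; the surrounding runner code here is hypothetical:

```go
package main

import "errors"

var (
	hardErrCh        = make(chan error)    // unbuffered: a send blocks until received
	computeStartedCh = make(chan struct{}) // forwardBatch blocks here first
)

func startCompute() {
	if err := errors.New("hard failure before compute starts"); err != nil {
		// Before: `hardErrCh <- err` would block forever, because the
		// receiver (forwardBatch) is itself stuck waiting on
		// computeStartedCh, which this error path never sends. Since
		// the receiver's only response was to panic, panic directly.
		panic(err)
	}
	computeStartedCh <- struct{}{}
}

func main() {
	// Recover so the example exits cleanly when run.
	defer func() { recover() }()
	startCompute()
}
```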
-
Daniel Hiltgen authored
* implement nvml for linux
* Improve scheduler logging when VRAM doesn't recover
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
yajianggroup authored
Signed-off-by: yajianggroup <yajianggroup@outlook.com>
-
Jeffrey Morgan authored
-
Patrick Devine authored
-
- 09 Oct, 2025 9 commits
-
-
shengxinjing authored
-
shengxinjing authored
-
Michael Yang authored
-
Michael Yang authored
This change updates how metrics are collected. Until now, performance metrics, specifically the initial input processing and subsequent generation durations, were collected by taking timestamps at sequence creation, at first token generation, and at completion. Processing duration was computed as first token generation minus sequence creation, and generation duration as completion minus first token generation.

While this approach is an accurate end-to-end measurement of processing and generation, it is not comparable to other tools, which only measure the active (i.e. decode) duration. This change updates the metrics to capture only decode duration so they can be compared directly to other tools.
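A hedged sketch of the measurement change; seqMetrics and generate are hypothetical stand-ins for the server's real types:

```go
package main

import (
	"fmt"
	"time"
)

type seqMetrics struct {
	promptDur time.Duration // time attributed to prompt processing
	genDur    time.Duration // time attributed to token generation
}

// Before: wall-clock spans anchored on sequence creation, e.g.
//   promptDur = firstToken - seqCreated  (includes queueing/scheduling)
//   genDur    = done - firstToken        (includes any stalls)
// After: accumulate only time spent inside decode, so the numbers line
// up with tools that report pure decode throughput.
func generate(steps int, decodeOnce func()) seqMetrics {
	var m seqMetrics
	for i := 0; i < steps; i++ {
		start := time.Now()
		decodeOnce()
		d := time.Since(start)
		if i == 0 {
			m.promptDur += d // first forward pass covers the prompt
		} else {
			m.genDur += d
		}
	}
	return m
}

func main() {
	m := generate(3, func() { time.Sleep(10 * time.Millisecond) })
	fmt.Println(m.promptDur, m.genDur)
}
```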
-
Daniel Hiltgen authored
* logs: quiet down context canceled on completion

If the client closes the connection before Completion finishes, we were logging at error level, implying the runner crashed, which was misleading:

time=2025-10-08T22:59:20.566-07:00 level=ERROR source=server.go:1490 msg="post predict" error="Post \"http://127.0.0.1:57736/completion\": context canceled"

* quiet down scheduler log error on expected case

Since we don't hold the lock while performing memory load calculations, other runners can unload in parallel, so finding no runner to unload is a valid scenario which we shouldn't log at error level.
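A sketch of the logging guard the first bullet describes, using Go's errors.Is and log/slog; the actual call sites in the runner differ:

```go
package main

import (
	"context"
	"errors"
	"log/slog"
)

func logPostPredictErr(err error) {
	if err == nil {
		return
	}
	// A client hanging up mid-completion surfaces as context.Canceled;
	// that's expected behavior, not a runner crash, so demote it.
	if errors.Is(err, context.Canceled) {
		slog.Debug("post predict aborted by client", "error", err)
		return
	}
	slog.Error("post predict", "error", err)
}

func main() {
	logPostPredictErr(context.Canceled) // logs at debug, not error
}
```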
-
Parth Sareen authored
-
Patrick Devine authored
-
Jeffrey Morgan authored
This reverts commit 6a62b894.
-
Jeffrey Morgan authored
-
- 08 Oct, 2025 4 commits
-
-
Patrick Devine authored
-
Jesse Gross authored
Sliding window models (e.g. gpt-oss, gemma3) remove tokens that fall outside the cache's window each time we start a new forward pass. The cache storage needs to hold each sequence's window size plus the batch size, since the batch needs to attend to the full window; this means more than a window's worth of tokens is stored while a batch is being processed. When the next batch comes in, we currently only look at the sequences in the incoming batch to slide the window forward. However, we also need to clean up the other sequences that might be occupying space in the batch processing buffer, to ensure each sequence uses only its window size of storage. Failing to do this can result in "no kv cache slot found" errors. Fixes: #10127
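A hedged sketch of the fix's shape: slide the window for every sequence holding cache storage, not just those in the incoming batch. The cache type here is illustrative, not ollama's:

```go
package main

import "fmt"

type cache struct {
	window int
	tokens map[int][]int // seqID -> cached token positions
}

func (c *cache) startForward(batchSeqs []int) {
	// Before: ranging only over batchSeqs here left stale, over-window
	// entries behind for sequences not in this batch, eventually
	// exhausting slots ("no kv cache slot found"). Walk every sequence.
	for seq, toks := range c.tokens {
		if over := len(toks) - c.window; over > 0 {
			c.tokens[seq] = toks[over:] // drop tokens outside the window
		}
	}
	_ = batchSeqs
}

func main() {
	c := &cache{window: 4, tokens: map[int][]int{
		0: {1, 2, 3, 4, 5, 6}, // idle sequence, over its window budget
		1: {7, 8},
	}}
	c.startForward([]int{1}) // batch contains only seq 1
	fmt.Println(c.tokens)    // seq 0 trimmed too: map[0:[3 4 5 6] 1:[7 8]]
}
```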
-
Jesse Gross authored
GGML picks the wrong kernel and these systems fail with:

Sep 28 22:25:39 xavier ollama[48999]: //ml/backend/ggml/ggml/src/ggml-cuda/fattn-wmma-f16.cu:437: ERROR: CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 720. ggml-cuda.cu was compiled for: __CUDA_ARCH_LIST__

Fixes #12442
-
Daniel Hiltgen authored
Remove some flaky scenarios, and switch to chat for better reliability
-
- 07 Oct, 2025 2 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
* Bring back escape valve for llm libraries

If the new discovery logic picks the wrong library, this gives users the ability to force a specific one using the same pattern as before. This can also potentially speed up bootstrap discovery if one of the libraries takes a long time to load and ultimately binds to no devices. For example, unsupported AMD iGPUs can sometimes take a while to discover and rule out.

* Bypass extra discovery on jetpack systems

On at least Jetpack 6, cuda_v12 appears to expose the iGPU but crashes later in cublasInit, so if we detect a Jetpack, short-circuit and use that variant.
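A sketch of an escape-valve check, assuming the override variable is named OLLAMA_LLM_LIBRARY (the pre-existing pattern the message alludes to); the discovery flow and library names are illustrative:

```go
package main

import (
	"fmt"
	"os"
)

func pickLibrary(discovered []string) string {
	// Escape valve: a user-forced library short-circuits discovery,
	// which also skips slow probes (e.g. unsupported AMD iGPUs that
	// take a long time to load and bind to no devices).
	if forced := os.Getenv("OLLAMA_LLM_LIBRARY"); forced != "" {
		return forced
	}
	if len(discovered) > 0 {
		return discovered[0]
	}
	return "cpu"
}

func main() {
	fmt.Println(pickLibrary([]string{"cuda_v13", "cuda_v12"}))
}
```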
-
- 06 Oct, 2025 3 commits
-
-
Devon Rifkin authored
openai: refactor to split compat layer and middleware
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
This variable isn't currently documented or intended as something the user can override, but if the user happens to set OLLAMA_LIBRARY_PATH, we were duplicating it in the subprocess environment, which causes problems with the new bootstrap discovery logic.
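A sketch of one way to avoid the doubling: rebuild the subprocess environment with any inherited OLLAMA_LIBRARY_PATH dropped before appending our own. The helper name and paths are illustrative:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func subprocessEnv(libraryPath string) []string {
	var env []string
	for _, kv := range os.Environ() {
		// Drop any inherited value; before the fix, appending below
		// produced two OLLAMA_LIBRARY_PATH entries when the user had
		// set one, confusing bootstrap discovery.
		if strings.HasPrefix(kv, "OLLAMA_LIBRARY_PATH=") {
			continue
		}
		env = append(env, kv)
	}
	return append(env, "OLLAMA_LIBRARY_PATH="+libraryPath)
}

func main() {
	os.Setenv("OLLAMA_LIBRARY_PATH", "/user/override")
	for _, kv := range subprocessEnv("/opt/ollama/lib") {
		if strings.HasPrefix(kv, "OLLAMA_LIBRARY_PATH=") {
			fmt.Println(kv) // exactly one entry
		}
	}
}
```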
-
- 05 Oct, 2025 1 commit
-
-
Devon Rifkin authored
This makes the core openai compat layer independent of the middleware that adapts it to our particular gin routes
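A sketch of the split under these assumptions: a pure conversion function with no gin dependency, plus a thin gin middleware that adapts it. The types and names are stand-ins for the real ones:

```go
package main

import "github.com/gin-gonic/gin"

type openAIChatRequest struct {
	Model string `json:"model"`
}

type ollamaChatRequest struct {
	Model string
}

// fromOpenAI is the core compat layer: pure data mapping with no gin
// types, so it can be tested and reused without HTTP machinery.
func fromOpenAI(r openAIChatRequest) ollamaChatRequest {
	return ollamaChatRequest{Model: r.Model}
}

// chatMiddleware is the adapter: it owns the gin-specific plumbing and
// delegates all conversion to the compat layer.
func chatMiddleware(c *gin.Context) {
	var req openAIChatRequest
	if err := c.ShouldBindJSON(&req); err != nil {
		c.AbortWithStatusJSON(400, gin.H{"error": err.Error()})
		return
	}
	c.Set("request", fromOpenAI(req))
	c.Next()
}

func main() {
	r := gin.Default()
	r.POST("/v1/chat/completions", chatMiddleware)
	// The real route handler would read the converted request back out
	// of the context; omitted here.
}
```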
-
- 04 Oct, 2025 2 commits
-
-
Daniel Hiltgen authored
Resolve a subtle ErrorAction stickiness difference between the x86 and arm builder setups
-
Daniel Hiltgen authored
-
- 03 Oct, 2025 7 commits
-
-
Jesse Gross authored
With the new version of GGML in #12245, KV cache quantization no longer causes a fallback to CPU.
-
Grace authored
-
Daniel Hiltgen authored
The CUDA APIs for reporting free VRAM are useless on NVIDIA iGPU systems: they return only the kernel's actual free memory and ignore buff/cache allocations, which on a typical system quickly fill most of the free system memory. As a result, we incorrectly conclude there is very little memory available for GPU allocations.
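A sketch of the workaround's shape on Linux iGPUs, where GPU memory is shared system RAM: read MemAvailable from /proc/meminfo, which counts reclaimable buff/cache pages, instead of trusting the CUDA free-memory figure. This is illustrative, not ollama's code:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// memAvailableBytes parses MemAvailable from /proc/meminfo (Linux only).
func memAvailableBytes() (uint64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "MemAvailable:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil // value is reported in kB
		}
	}
	return 0, fmt.Errorf("MemAvailable not found")
}

func main() {
	if b, err := memAvailableBytes(); err == nil {
		// On an iGPU, buff/cache pages counted here can be reclaimed
		// for GPU allocations, unlike the kernel's raw "free" figure.
		fmt.Printf("available: %d MiB\n", b>>20)
	}
}
```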
-
Patrick Devine authored
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
This will likely yield builds that have problems with unicode characters, but at least we can start testing the release while we try to find an alternate clang compiler for Windows, or until mingw ships a fixed version.
-
Daniel Hiltgen authored
-