- 13 Oct, 2025 5 commits
-
-
Grace authored
* Working (other than tool calls being in the incorrect order) for tool calls and tools
* Tests work, other than image tags (tests do not go through the server) and tools (not in the correct order, but contents are the same)
* Testing for the qwen3vl parser - the tool parser is working
* Changed the JSON tool parser to wrap the ToolCallFunction with a ToolCall object (see the sketch below)
* Working parser for thinking models - assumes a thinking state, emits unambiguous content while thinking, and does not emit tool calls while thinking
* Changed the parser to start by collecting content
* Thinking prefill
* Add hasThinkingSupport parameter to the parser
* qwen3-vl -> qwen3-vl-instruct for the renderer/parser
* Add hasThinkingSupport=false to QwenVLParser

---------

Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
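As an illustration of the wrapping described above, a minimal Go sketch with hypothetical ToolCall / ToolCallFunction types modeled on the API shape (type and field names are assumptions, not the actual parser code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical shapes; the real types live in Ollama's API/parser packages.
type ToolCallFunction struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

type ToolCall struct {
	Function ToolCallFunction `json:"function"`
}

// wrapToolCall wraps a parsed function payload in a ToolCall object,
// mirroring the change described in the commit above.
func wrapToolCall(fn ToolCallFunction) ToolCall {
	return ToolCall{Function: fn}
}

func main() {
	fn := ToolCallFunction{Name: "get_weather", Arguments: map[string]any{"city": "Berlin"}}
	out, _ := json.Marshal(wrapToolCall(fn))
	fmt.Println(string(out)) // {"function":{"name":"get_weather","arguments":{"city":"Berlin"}}}
}
```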
-
Gabe Goodhart authored
Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552)

* feat: Bump llama.cpp to df1b612
* fix(mtmd): Correctly encode text chunks during mtmd tokenization
  There can be text chunks that appear interspersed with the image embeddings that contain template delimiter tokens for some models. These need to be correctly translated to text tokens.
* tests: Use MtmdChunk in image_test
* style: Fix unnecessary conversion linting
* fix(ggml): Revert changes to ggml_hip.cpp
  These changes were done largely by our code assistant and are likely wrong
* fix: Revert changes in mem_nvml.cpp
* feat: Update sync point to 1deee0
  This brings in several more optimization commits and model support for EmbeddingGemma
* feat: Update patches for 1deee0
* feat: sync for bump to 1deee0
* fix: Bad patch updates with errant `+`
* feat: Bump llama.cpp/ggml to 7049736
* fix: format-patches after latest bump

All steps on Branch: LlamaCPPBump-GraniteDocling

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
-
Jeffrey Morgan authored
-
Michael Yang authored
This reverts commit 3d32249c.
-
Michael Yang authored
DeepSeek's Qwen3 distill uses a different RoPE scheme, so support both.
-
- 11 Oct, 2025 5 commits
-
-
Jeffrey Morgan authored
-
Devon Rifkin authored
routes: fix built-in renderers for `api/generate`
-
Devon Rifkin authored
When `api/generate` builds up a message array and generates the prompt, it now goes through the same function as `api/chat` for consistency. That function is where the optional built-in renderers hook in to bypass templates, which was missing for `api/generate` before this change. Closes: #12578
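As an illustration, a minimal Go sketch of routing both endpoints through one prompt-building function (function and type names here are assumptions, not Ollama's actual route code):

```go
package routes

// Message is a hypothetical stand-in for the API message shape.
type Message struct {
	Role    string
	Content string
}

// buildPrompt stands in for the shared function both endpoints now call:
// it applies either a built-in renderer or the model template to the messages.
func buildPrompt(msgs []Message, renderer, applyTemplate func([]Message) string) string {
	if renderer != nil {
		return renderer(msgs) // built-in renderer bypasses the template
	}
	return applyTemplate(msgs)
}

// generateHandler sketches api/generate: wrap the raw prompt as a single user
// message, then go through the same buildPrompt path that api/chat uses.
func generateHandler(prompt string, renderer, applyTemplate func([]Message) string) string {
	msgs := []Message{{Role: "user", Content: prompt}}
	return buildPrompt(msgs, renderer, applyTemplate)
}
```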
-
frob authored
-
Daniel Hiltgen authored
-
- 10 Oct, 2025 10 commits
-
-
Michael Yang authored
Sending to hardErrCh will deadlock since forwardBatch is blocked on computeStartedCh, which never gets sent to. Since the response to hardErrCh is to panic anyway, just panic instead.
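A minimal Go sketch of this kind of deadlock; the channel names follow the commit message, but the surrounding logic is simplified and assumed:

```go
package main

import "fmt"

func main() {
	computeStartedCh := make(chan struct{})
	hardErrCh := make(chan error)

	// The receiver side is stuck waiting for compute to start, so it never
	// gets around to draining hardErrCh.
	go func() {
		<-computeStartedCh // never signaled in this failure case
		panic(<-hardErrCh)
	}()

	err := fmt.Errorf("hard error")

	// hardErrCh <- err // would block forever: the only receiver is stuck above
	panic(err) // the fix described in the commit: panic directly instead
}
```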
-
Daniel Hiltgen authored
* Implement NVML for Linux
* Improve scheduler logging when VRAM doesn't recover
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
yajianggroup authored
Signed-off-by: yajianggroup <yajianggroup@outlook.com>
-
Jeffrey Morgan authored
-
Patrick Devine authored
-
- 09 Oct, 2025 9 commits
-
-
shengxinjing authored
-
shengxinjing authored
-
Michael Yang authored
-
Michael Yang authored
This change updates how metrics are collected. Until now, performance metrics, specifically initial input processing and subsequent generation durations, were collected by taking timestamps at sequence creation, at first token generation, and at generation completion. The processing duration was computed as first token generation minus sequence creation, while the generation duration was computed as generation completion minus first token generation. While this approach is an accurate end-to-end measure of processing and generation, it is not comparable to other tools, which only measure the active (i.e. decode) duration. This change updates the metrics to only capture the decode duration so they can be compared more directly to other tools.
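A simplified Go sketch of the before/after arithmetic, assuming hypothetical timestamp fields (not the actual runner code):

```go
package metrics

import "time"

// seqTimings holds timestamps a runner might record for one sequence; field
// names are hypothetical and only illustrate the arithmetic described above.
type seqTimings struct {
	created     time.Time     // sequence created (prompt received)
	firstToken  time.Time     // first generated token emitted
	done        time.Time     // generation finished
	decodeTotal time.Duration // accumulated time spent inside decode calls only
}

// endToEnd reflects the old behavior: durations include queueing/setup time.
func endToEnd(t seqTimings) (processing, generation time.Duration) {
	return t.firstToken.Sub(t.created), t.done.Sub(t.firstToken)
}

// decodeOnly reflects the new behavior: count only active decode time,
// which is what other tools typically report.
func decodeOnly(t seqTimings) time.Duration {
	return t.decodeTotal
}
```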
-
Daniel Hiltgen authored
* logs: quiet down context canceled on completion
  If the client closes the connection before Completion finishes, we were logging at error level, implying the runner crashed, which was misleading (see the sketch after this entry):
  time=2025-10-08T22:59:20.566-07:00 level=ERROR source=server.go:1490 msg="post predict" error="Post \"http://127.0.0.1:57736/completion\": context canceled"
* quiet down scheduler log error on expected case
  Since we don't hold the lock while performing memory load calculations, other runners can unload in parallel, so finding no runner to unload is a valid scenario that we shouldn't log at error level.
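A small Go sketch of the usual way to demote this kind of log line, assuming the error is wrapped so errors.Is can see context.Canceled (illustrative, not the actual server code):

```go
package main

import (
	"context"
	"errors"
	"log/slog"
)

// logPredictErr demotes client-initiated cancellations to debug level and
// keeps error level for everything else, matching the behavior described
// in the commit above.
func logPredictErr(err error) {
	if errors.Is(err, context.Canceled) {
		slog.Debug("post predict canceled by client", "error", err)
		return
	}
	slog.Error("post predict", "error", err)
}

func main() {
	logPredictErr(context.Canceled)   // logged at debug
	logPredictErr(errors.New("boom")) // logged at error
}
```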
-
Parth Sareen authored
-
Patrick Devine authored
-
Jeffrey Morgan authored
This reverts commit 6a62b894.
-
Jeffrey Morgan authored
-
- 08 Oct, 2025 4 commits
-
-
Patrick Devine authored
-
Jesse Gross authored
Sliding window models (e.g. gpt-oss, gemma3) remove tokens that are outside the cache's window each time we start a new forward pass. The cache storage needs to handle the window size for each sequence plus the batch size, since the batch needs to attend to the full window; this means we have more than a window's worth stored while processing the batch.

When the next batch comes, we currently only look at the sequences in the incoming batch to slide the window forward. However, we also need to clean up the other sequences that might be occupying space in the batch processing buffer, to ensure each sequence is only using its window size of storage. Failure to do this can result in "no kv cache slot found" errors.

Fixes: #10127
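A simplified Go sketch of the cleanup described above, assuming a hypothetical per-sequence token store (not the actual KV cache implementation):

```go
package cache

// slideAll trims every tracked sequence down to its attention window, not
// just the sequences present in the incoming batch, freeing the extra slots
// that were needed while the previous batch attended to the full window.
func slideAll(seqTokens map[int][]int32, windowSize int) {
	for seq, toks := range seqTokens {
		if extra := len(toks) - windowSize; extra > 0 {
			seqTokens[seq] = toks[extra:] // drop tokens that fell out of the window
		}
	}
}
```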
-
Jesse Gross authored
GGML picks the wrong kernel and these systems fail with:
Sep 28 22:25:39 xavier ollama[48999]: //ml/backend/ggml/ggml/src/ggml-cuda/fattn-wmma-f16.cu:437: ERROR: CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 720. ggml-cuda.cu was compiled for: __CUDA_ARCH_LIST__

Fixes #12442
-
Daniel Hiltgen authored
Remove some flaky scenarios, and switch to chat for better reliability
-
- 07 Oct, 2025 2 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
* Bring back escape valve for llm libraries
  If the new discovery logic picks the wrong library, this gives users the ability to force a specific one using the same pattern as before. This can also potentially speed up bootstrap discovery if one of the libraries takes a long time to load and ultimately binds to no devices; for example, unsupported AMD iGPUs can sometimes take a while to discover and rule out. (See the sketch after this entry.)
* Bypass extra discovery on Jetpack systems
  On at least JetPack 6, cuda_v12 appears to expose the iGPU but crashes later in cublasInit, so if we detect a Jetpack, short-circuit and use that variant.
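A hedged Go sketch of such an escape valve, assuming an environment-variable override of library selection (the variable name OLLAMA_LLM_LIBRARY follows the pre-existing pattern referenced above but is an assumption here; the discovery details are simplified):

```go
package discover

import "os"

// pickLibrary returns the forced library variant if the user set the escape
// valve, otherwise falls back to the normal bootstrap discovery result.
func pickLibrary(discovered []string) string {
	if forced := os.Getenv("OLLAMA_LLM_LIBRARY"); forced != "" {
		return forced // user override: skip extra probing entirely
	}
	if len(discovered) > 0 {
		return discovered[0]
	}
	return "cpu"
}
```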
-
- 06 Oct, 2025 3 commits
-
-
Devon Rifkin authored
openai: refactor to split compat layer and middleware
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
This variable isn't currently documented or intended as something the user can override, but if the user happens to set OLLAMA_LIBRARY_PATH, we were doubling it in the subprocess environment, which causes problems with the new bootstrap discovery logic.
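For illustration, a minimal Go sketch of building the subprocess environment with exactly one copy of the variable (a sketch under assumed names, not the actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// withLibraryPath returns env with exactly one OLLAMA_LIBRARY_PATH entry,
// replacing any value already present instead of appending a second one.
func withLibraryPath(env []string, libraryPath string) []string {
	out := make([]string, 0, len(env)+1)
	for _, kv := range env {
		if strings.HasPrefix(kv, "OLLAMA_LIBRARY_PATH=") {
			continue // drop the inherited value so it isn't doubled
		}
		out = append(out, kv)
	}
	return append(out, "OLLAMA_LIBRARY_PATH="+libraryPath)
}

func main() {
	env := []string{"PATH=/usr/bin", "OLLAMA_LIBRARY_PATH=/stale/path"}
	fmt.Println(withLibraryPath(env, "/opt/ollama/lib"))
}
```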
-
- 05 Oct, 2025 1 commit
-
-
Devon Rifkin authored
This makes the core OpenAI compat layer independent of the middleware that adapts it to our particular Gin routes.
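A small Go sketch of that kind of split: a pure conversion function with no Gin dependency, plus a thin Gin middleware that adapts it (names and request shapes are assumptions for illustration):

```go
package openai

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// Pure compat layer: translate an OpenAI-style request into the native
// request shape without touching HTTP or Gin.
type ChatCompletionRequest struct {
	Model    string   `json:"model"`
	Messages []string `json:"messages"`
}

type NativeChatRequest struct {
	Model    string
	Messages []string
}

func FromChatCompletion(req ChatCompletionRequest) NativeChatRequest {
	return NativeChatRequest{Model: req.Model, Messages: req.Messages}
}

// Middleware/adapter layer: all Gin-specific concerns live here.
func ChatCompletionMiddleware() gin.HandlerFunc {
	return func(c *gin.Context) {
		var req ChatCompletionRequest
		if err := c.ShouldBindJSON(&req); err != nil {
			c.AbortWithStatusJSON(http.StatusBadRequest, gin.H{"error": err.Error()})
			return
		}
		c.Set("nativeRequest", FromChatCompletion(req))
		c.Next()
	}
}
```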
-
- 04 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
Resolve subtle ErrorAction stickiness difference between the x86 and ARM builder setup.
-