- 09 Dec, 2025 3 commits
-
-
Parth Sareen authored
-
Michael Yang authored
-
Jeffrey Morgan authored
-
- 08 Dec, 2025 5 commits
-
-
Michael Yang authored
Change to a flatter directory structure, group the options with the function, and update models to call rope in one place.
-
nicole pardal authored
This PR consolidates all embedding prompt-length checking, truncation, and prompt token counting into the runner to ensure a single source of truth.
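A rough illustration of the "single source of truth" idea (the helper below is hypothetical, not the actual runner code): length checking, optional truncation, and token counting handled in one place.

```go
// Sketch only: the runner checks prompt length, truncates when allowed, and
// reports the final token count, so callers no longer duplicate that logic.
package main

import (
	"errors"
	"fmt"
)

func prepareEmbeddingPrompt(tokens []int, ctxLen int, truncate bool) ([]int, int, error) {
	if len(tokens) <= ctxLen {
		return tokens, len(tokens), nil
	}
	if !truncate {
		return nil, 0, errors.New("input exceeds maximum context length")
	}
	tokens = tokens[:ctxLen] // keep the leading tokens that fit in the context
	return tokens, len(tokens), nil
}

func main() {
	tokens := make([]int, 600)
	kept, count, _ := prepareEmbeddingPrompt(tokens, 512, true)
	fmt.Println(len(kept), count) // 512 512
}
```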
-
Daniel Hiltgen authored
Prevent CGO from accidentally reusing old object files from the cache across vendor updates
-
JJ authored
-
Jeffrey Morgan authored
-
- 06 Dec, 2025 1 commit
-
-
Daniel Hiltgen authored
Follow-up from #12992: free all streams, and keep the alloc logic aligned across streams.
-
- 05 Dec, 2025 1 commit
-
-
Sos Pogosyan authored
fix(api): correct Content-Type header for /api/chat and /api/generate when using cloud models (#13279)
Co-authored-by: Pogosyan Sos <sos_pogosyan@MacBook-Pro-Sos.local>
Co-authored-by: Patrick Devine <patrick@infrahq.com>
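A minimal sketch of the kind of fix described (the handler and helper names are illustrative, not the actual server code): the Content-Type should match the body actually being written, application/json for single responses and application/x-ndjson for streamed ones.

```go
// Sketch only: pick the Content-Type based on whether the response streams.
package main

import (
	"fmt"
	"net/http"
)

func writeResponse(w http.ResponseWriter, streaming bool, body []byte) {
	if streaming {
		w.Header().Set("Content-Type", "application/x-ndjson")
	} else {
		w.Header().Set("Content-Type", "application/json")
	}
	w.Write(body)
}

func main() {
	http.HandleFunc("/api/generate", func(w http.ResponseWriter, r *http.Request) {
		writeResponse(w, false, []byte(`{"response":"hello"}`))
	})
	fmt.Println("listening on :8080")
	http.ListenAndServe(":8080", nil)
}
```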
-
- 04 Dec, 2025 7 commits
-
-
Jesse Gross authored
-
Jesse Gross authored
Although the vision component of multimodal models typically already calls the optimized nn.Attention, it gets converted into non-fused operations. That is because the backend-specific fused kernels may have requirements, such as padding, that are normally satisfied by the cache, which vision encoders don't use. This implements a fallback path in the backend, softening those requirements into optimizations. In turn, this allows flash attention to be used for vision encoders, saving a significant amount of VRAM and improving performance.
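A conceptual sketch of the fallback path (the constant and dispatch function are illustrative, not the real backend API): if the caller cannot meet the fused kernel's padding requirement, the backend pads the input itself instead of refusing, so encoders without a KV cache can still take the fused route.

```go
// Sketch only: dispatch to the fused kernel when possible, otherwise fall
// back to the unfused matmul+softmax path.
package main

import "fmt"

const fusedPadding = 256 // hypothetical kernel alignment requirement

func attention(seqLen int, canPad bool) string {
	if seqLen%fusedPadding == 0 {
		return "fused flash attention"
	}
	if canPad {
		// Pad inside the backend instead of requiring the cache to do it.
		return fmt.Sprintf("fused flash attention (padded %d -> %d)",
			seqLen, ((seqLen/fusedPadding)+1)*fusedPadding)
	}
	return "unfused attention"
}

func main() {
	fmt.Println(attention(1024, false))
	fmt.Println(attention(729, true)) // e.g. a vision encoder's patch count
}
```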
-
Jesse Gross authored
We currently use cache padding of 32 when not using flash attention and 256 with flash attention, based on the historic alignment requirements of these kernels. The restrictions have since been loosened, but there are still performance benefits, such as better CUDA graph reuse. Since the requirement is no longer kernel-specific, set the padding uniformly to 256, as llama.cpp does.
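A minimal sketch of the padding rule, assuming a simple round-up helper (the name is hypothetical):

```go
// Sketch only: round the requested cache length up to a multiple of 256,
// regardless of whether flash attention is enabled.
package main

import "fmt"

const cachePadding = 256

func padCacheLen(n int) int {
	return ((n + cachePadding - 1) / cachePadding) * cachePadding
}

func main() {
	fmt.Println(padCacheLen(1))    // 256
	fmt.Println(padCacheLen(4097)) // 4352
}
```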
-
Patrick Devine authored
This change adds the ability for `ollama create` to convert models that use the DeepSeek2 architecture (specifically DeepSeekV3 and DeepSeek-R1).
-
Eloi Torrents authored
cmd/bench: support writing benchmark output to a file. This changes Ollama to allow the bench command to write benchmark results to a user-specified output file instead of stdout when the --output flag is provided.
Co-authored-by: Patrick Devine <patrick@infrahq.com>
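A minimal sketch of the --output behavior, assuming standard flag handling (the wiring is illustrative, not the actual cmd/bench code):

```go
// Sketch only: default to stdout, switch to a file when --output is set.
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

func main() {
	output := flag.String("output", "", "write benchmark results to this file instead of stdout")
	flag.Parse()

	var w io.Writer = os.Stdout
	if *output != "" {
		f, err := os.Create(*output)
		if err != nil {
			fmt.Fprintln(os.Stderr, "error:", err)
			os.Exit(1)
		}
		defer f.Close()
		w = f
	}

	fmt.Fprintln(w, "model,tokens_per_second") // placeholder benchmark output
	fmt.Fprintln(w, "llama3.2,142.7")
}
```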
-
Daniel Hiltgen authored
* Revert "vulkan: temporary cary of vulkan fixes (#12971)"; this reverts commit 3a9e8e9f
* ggml update to b7087
* fix argsort on metal
* update to b7108
* fix bakllava regression (this model lacks the metadata for the projector type)
* update to b7209
* fix TopK perf
* only build arm code on arm
-
Jeffrey Morgan authored
-
- 03 Dec, 2025 2 commits
-
-
Bruce MacDonald authored
This fixes a bug where disabling thinking on deepseek-v3.1 did not stop the model from thinking. When thinking is not defined, it should not be sent to the server, since that causes error responses in some cases where the model does not support thinking. However, if it is explicitly defined as false, it should still be sent.
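A minimal sketch of the nil-versus-false distinction, assuming the usual Go pattern of a pointer field with omitempty (field names are illustrative):

```go
// Sketch only: a *bool with omitempty is omitted when unset, but is still
// serialized when explicitly set to false.
package main

import (
	"encoding/json"
	"fmt"
)

type chatRequest struct {
	Model string `json:"model"`
	Think *bool  `json:"think,omitempty"`
}

func main() {
	off := false

	unset, _ := json.Marshal(chatRequest{Model: "deepseek-v3.1"})
	disabled, _ := json.Marshal(chatRequest{Model: "deepseek-v3.1", Think: &off})

	fmt.Println(string(unset))    // {"model":"deepseek-v3.1"} (field not sent)
	fmt.Println(string(disabled)) // {"model":"deepseek-v3.1","think":false}
}
```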
-
Daniel Hiltgen authored
We now do a deeper probe of CUDA devices to verify that the library version has the correct compute capability coverage for the device. Because ROCm also interprets the CUDA env var to filter AMD devices, we try to avoid setting it, which otherwise leads to problems in mixed-vendor systems. However, without setting it for this deeper probe, each CUDA library subprocess discovers all CUDA GPUs, and on systems with many GPUs this can hit timeouts. The fix is to turn on the CUDA visibility env var only for this deeper probe use case.
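A hedged sketch of scoping the visibility variable to the probe subprocess only (the probe binary and wiring here are hypothetical):

```go
// Sketch only: set CUDA_VISIBLE_DEVICES in the environment of the short-lived
// probe subprocess, so the parent process and the ROCm path never see it.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func probeDevice(probeBinary, deviceID string) error {
	cmd := exec.Command(probeBinary)
	// Inherit the parent environment, but pin visibility for this probe only.
	cmd.Env = append(os.Environ(), "CUDA_VISIBLE_DEVICES="+deviceID)
	out, err := cmd.CombinedOutput()
	fmt.Printf("probe %s: %s\n", deviceID, out)
	return err
}

func main() {
	// Each probe subprocess sees exactly one GPU, avoiding timeouts on
	// machines with many devices.
	for _, id := range []string{"0", "1"} {
		probeDevice("./cuda-probe", id) // hypothetical probe binary
	}
}
```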
-
- 02 Dec, 2025 7 commits
-
-
Nathan Hook authored
-
hello_world authored
Added Vulkan SDK installation instructions and environment variable setup for building with Vulkan support.
-
Daniel Hiltgen authored
Avoid hitting test timeouts
-
Jesse Gross authored
Model eviction happens when we have at least one other model loaded and are unable to load all layers into VRAM. However, on CPU-only systems we can never load layers into VRAM, so this constantly triggered eviction. Fixes #13227
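A rough sketch of the guard, with illustrative names rather than the actual scheduler code:

```go
// Sketch only: treat "not all layers fit in VRAM" as an eviction signal only
// when there is a GPU to offload layers onto in the first place.
package main

import "fmt"

func shouldEvict(gpuCount, layersOffloaded, totalLayers, loadedModels int) bool {
	if gpuCount == 0 {
		// CPU-only: layers can never be offloaded, so this is not a reason
		// to evict other models.
		return false
	}
	return loadedModels > 0 && layersOffloaded < totalLayers
}

func main() {
	fmt.Println(shouldEvict(0, 0, 32, 1))  // false: CPU-only system
	fmt.Println(shouldEvict(1, 20, 32, 1)) // true: partial offload with another model loaded
}
```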
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
-
Patrick Devine authored
This change:
* fixes rope scaling in the mistral converter
* updates ministral to include llama4 scaling
* includes a new ministral parser for parsing reasoning and tool calling
Co-authored-by: jmorganca <jmorganca@gmail.com>
-
- 01 Dec, 2025 3 commits
-
-
Daniel Hiltgen authored
If the user has somehow installed another GGML-based app which places a ggml-base lib somewhere in their PATH, we can experience runtime problems due to incompatibilities. This change adds a warning message if we detect a ggml-base outside of our install location, to aid in troubleshooting.
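A hedged sketch of such a check (paths, names, and the search strategy are illustrative):

```go
// Sketch only: walk PATH looking for a foreign ggml-base library and warn if
// one is found outside the install directory.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func findForeignGGML(installDir string) []string {
	var found []string
	for _, dir := range filepath.SplitList(os.Getenv("PATH")) {
		entries, err := os.ReadDir(dir)
		if err != nil {
			continue
		}
		for _, e := range entries {
			if strings.HasPrefix(e.Name(), "ggml-base") && !strings.HasPrefix(dir, installDir) {
				found = append(found, filepath.Join(dir, e.Name()))
			}
		}
	}
	return found
}

func main() {
	for _, p := range findForeignGGML("/usr/local/lib/ollama") { // hypothetical install dir
		fmt.Printf("warning: found incompatible ggml-base at %s, this may cause runtime failures\n", p)
	}
}
```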
-
Bruce MacDonald authored
While processing the response stream during a chat or generation, if an error occurs it is parsed and returned to the user. The issue with the existing code is that it assumed the response would be valid JSON, which is not a safe assumption and caused cryptic error messages to be displayed due to parsing failures: `invalid character 'i' looking for beginning of value`. This change updates the stream function to return the raw error string if it can't be parsed as JSON. This should help with debugging issues by making sure the actual error reaches the user.
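A minimal sketch of the fallback, assuming a simple error envelope (names are illustrative, not the actual client code):

```go
// Sketch only: try to decode the error body as JSON, and fall back to the raw
// text when decoding fails.
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type apiError struct {
	Error string `json:"error"`
}

func errorFromBody(body []byte) error {
	var ae apiError
	if err := json.Unmarshal(body, &ae); err == nil && ae.Error != "" {
		return fmt.Errorf("%s", ae.Error)
	}
	// Not valid JSON (e.g. a plain-text proxy error): surface the raw body
	// instead of a cryptic "invalid character ..." parse failure.
	return fmt.Errorf("%s", strings.TrimSpace(string(body)))
}

func main() {
	fmt.Println(errorFromBody([]byte(`{"error":"model not found"}`)))
	fmt.Println(errorFromBody([]byte("internal server error")))
}
```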
-
Daniel Hiltgen authored
The cuda_jetpack libs will enumerate discrete GPUs on SBSA systems, which leads to runtime failures from missing kernels. This fix requires an exact match to enable jetpacks instead of relying on enumeration to filter supported libraries.
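A rough sketch of the exact-match rule (the variant strings are assumptions, not necessarily the bundled directory names):

```go
// Sketch only: a jetpack library is enabled only when the detected platform
// variant matches it exactly, not merely because it can enumerate the device.
package main

import "fmt"

func enableLibrary(libVariant, platformVariant string) bool {
	switch libVariant {
	case "cuda_jetpack5", "cuda_jetpack6":
		// Exact match required; discrete GPUs on SBSA systems must not fall
		// through to a jetpack build just because it enumerates them.
		return libVariant == platformVariant
	default:
		return true
	}
}

func main() {
	fmt.Println(enableLibrary("cuda_jetpack6", "cuda_jetpack6")) // true
	fmt.Println(enableLibrary("cuda_jetpack6", "sbsa"))          // false
}
```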
-
- 30 Nov, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 29 Nov, 2025 1 commit
-
-
Ondrej Kokes authored
There were a few Markdown typos in one FAQ answer. It now renders as a proper ASCII table.
-
- 26 Nov, 2025 1 commit
-
-
EntropyYue authored
-
- 20 Nov, 2025 6 commits
-
-
Eva H authored
-
Jeffrey Morgan authored
-
Daniel Hiltgen authored
Recent refactoring introduced a regression in filtering overlapping CUDA libraries to favor the newest supported version.
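A small sketch of the selection rule, with illustrative types and compute-capability thresholds (not the actual discovery code):

```go
// Sketch only: when several bundled CUDA libraries can drive a device, prefer
// the newest one that still covers the device's compute capability.
package main

import (
	"fmt"
	"sort"
)

type cudaLib struct {
	version    int     // e.g. 12, 13
	minCompute float64 // minimum compute capability the library supports
}

func pickLibrary(libs []cudaLib, deviceCompute float64) (cudaLib, bool) {
	sort.Slice(libs, func(i, j int) bool { return libs[i].version > libs[j].version })
	for _, l := range libs {
		if deviceCompute >= l.minCompute {
			return l, true
		}
	}
	return cudaLib{}, false
}

func main() {
	libs := []cudaLib{{version: 12, minCompute: 5.0}, {version: 13, minCompute: 7.5}}
	if l, ok := pickLibrary(libs, 7.0); ok {
		fmt.Println("using cuda_v", l.version) // newest library that still covers 7.0
	}
}
```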
-
Grace authored
-
Michael Yang authored
The check for MLA omits v3 and r1, which should not return unsupported. Instead, check the tokenizer for compatibility.
-
Jesse Gross authored
The causal cache can store data differently depending on what is best for the backend. We should run tests both ways.
-
- 19 Nov, 2025 2 commits
-
-
nicole pardal authored
-
Michael Yang authored
-