- 02 Dec, 2025 5 commits
-
-
Daniel Hiltgen authored
Avoid hitting test timeouts
-
Jesse Gross authored
Model eviction happens when we have at least one other model loaded and are unable to load all layers into VRAM. However, on CPU-only systems we can never load layers into VRAM, so this constantly triggered eviction. Fixes #13227
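A minimal sketch of the resulting guard, assuming hypothetical names (`gpuCount`, `loadedCount`, `layersLeftOver`) rather than Ollama's actual scheduler API:

```go
// shouldEvict reports whether another loaded model should be evicted to
// make room. On a CPU-only system there is no VRAM, so "couldn't fit all
// layers in VRAM" is always true and must not trigger eviction.
func shouldEvict(gpuCount, loadedCount, layersLeftOver int) bool {
	if gpuCount == 0 {
		return false // CPU-only: evicting another model can never help
	}
	return loadedCount >= 1 && layersLeftOver > 0
}
```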
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
-
Patrick Devine authored
This change:
* fixes rope scaling in the mistral converter
* updates ministral to include llama4 scaling
* includes a new ministral parser for parsing reasoning and tool calling

Co-authored-by: jmorganca <jmorganca@gmail.com>
-
- 01 Dec, 2025 3 commits
-
-
Daniel Hiltgen authored
If the user has somehow installed another GGML-based app that places a ggml-base lib somewhere in their PATH, we can experience runtime problems due to incompatibilities. This change adds a warning message if we detect a ggml-base outside of our install location, to aid in troubleshooting.
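A rough Go sketch of such a check; the library name pattern and the install-location logic here are assumptions, not the actual implementation:

```go
import (
	"log/slog"
	"os"
	"path/filepath"
	"strings"
)

// warnOnForeignGGML scans each PATH directory for ggml-base libraries
// outside the install directory and logs a warning for anything found.
func warnOnForeignGGML(installDir string) {
	for _, dir := range filepath.SplitList(os.Getenv("PATH")) {
		entries, err := os.ReadDir(dir)
		if err != nil {
			continue
		}
		for _, e := range entries {
			if strings.Contains(e.Name(), "ggml-base") && !strings.HasPrefix(dir, installDir) {
				slog.Warn("ggml-base library found outside install location; incompatible versions can cause runtime failures",
					"path", filepath.Join(dir, e.Name()))
			}
		}
	}
}
```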
-
Bruce MacDonald authored
While processing the response stream during a chat or generation, if an error occurs it is parsed and returned to the user. The issue with the existing code is that it assumed the response would be valid JSON, which is not a safe assumption and caused cryptic error messages due to parsing failures: `invalid character 'i' looking for beginning of value`. This change updates the stream function to return the raw error string if it can't be parsed as JSON, which should help with debugging by making sure the actual error reaches the user.
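The shape of the fix, sketched (illustrative, not the exact client code):

```go
import (
	"encoding/json"
	"errors"
)

// parseStreamError decodes a structured API error when possible, and
// otherwise surfaces the raw body so the real message reaches the user
// instead of a JSON parse failure.
func parseStreamError(body []byte) error {
	var apiErr struct {
		Error string `json:"error"`
	}
	if err := json.Unmarshal(body, &apiErr); err == nil && apiErr.Error != "" {
		return errors.New(apiErr.Error)
	}
	return errors.New(string(body)) // e.g. a plain-text proxy error page
}
```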
-
Daniel Hiltgen authored
The cuda_jetpack libs will enumerate discrete GPUs on SBSA systems, which leads to runtime failures from missing kernels. This fix requires an exact match to enable the jetpack libraries, instead of relying on enumeration to filter the set of supported libraries.
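Illustratively (the variant names are hypothetical), the filter changes from "keep any library whose enumeration succeeds" to an exact-match gate for jetpack builds:

```go
import "strings"

// usable decides whether a CUDA library variant may be loaded. Jetpack
// builds require an exact match with the detected system variant, since
// enumeration alone would wrongly accept discrete GPUs on SBSA systems.
func usable(systemVariant, libVariant string) bool {
	if strings.HasPrefix(libVariant, "cuda_jetpack") {
		return systemVariant == libVariant
	}
	return true // non-jetpack variants keep the existing filtering path
}
```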
-
- 30 Nov, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 29 Nov, 2025 1 commit
-
-
Ondrej Kokes authored
There were a few Markdown typos in one FAQ answer. It now renders as a proper ASCII table.
-
- 26 Nov, 2025 1 commit
-
-
EntropyYue authored
-
- 20 Nov, 2025 6 commits
-
-
Eva H authored
-
Jeffrey Morgan authored
-
Daniel Hiltgen authored
Recent refactoring introduced a regression in the filtering of overlapping CUDA libraries, which should favor the newest supported version.
-
Grace authored
-
Michael Yang authored
The check for MLA omits v3 and r1, which should not return unsupported. Instead, check the tokenizer for compatibility.
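A hedged sketch of the revised shape of the check; the tokenizer names below are placeholders, not the real compatibility list:

```go
// mlaSupported gates on tokenizer compatibility rather than on an
// architecture/version allowlist that would wrongly exclude v3 and r1.
func mlaSupported(tokenizer string) bool {
	switch tokenizer {
	case "compatible-tokenizer-a", "compatible-tokenizer-b": // placeholders
		return true
	default:
		return false
	}
}
```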
-
Jesse Gross authored
The causal cache can store data differently depending on what is best for the backend. We should run tests both ways.
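In Go test terms this is typically a table-driven loop over the layout flag; a sketch assuming a hypothetical `NewCausalCache(permuted bool)` constructor:

```go
import (
	"fmt"
	"testing"
)

func TestCausalCacheBothLayouts(t *testing.T) {
	for _, permuted := range []bool{false, true} {
		t.Run(fmt.Sprintf("permuted=%v", permuted), func(t *testing.T) {
			cache := NewCausalCache(permuted) // hypothetical constructor
			_ = cache
			// ... exercise the put/get paths against this layout ...
		})
	}
}
```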
-
- 19 Nov, 2025 10 commits
-
-
nicole pardal authored
-
Michael Yang authored
-
Patrick Devine authored
-
Jesse Gross authored
We currently copy data into the KV cache in contiguous buffers using ggml_cpy(). ggml_set_rows() was introduced to allow scatter operations, so contiguous buffers are no longer required. The primary direct benefit is that we no longer need to perform defragmentation. However, GGML recently removed an optimization for ggml_cpy(), which we picked up in 544b6739 "ggml update to b6840 (#12791)". This caused a roughly 40% drop in token generation performance on CUDA because CUDA graphs were no longer being used. By switching to ggml_set_rows(), the original optimization is no longer necessary and CUDA performance is restored. Fixes #13112
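Conceptually, the copy-based path requires staging rows into contiguous runs (hence the defragmentation), while the scatter path writes each row straight to its destination slot. A plain-Go illustration of the scatter pattern, not the GGML API itself:

```go
// scatterRows writes each new row directly into its destination slot, so
// free slots never need to be defragmented into a contiguous run first.
func scatterRows(cache [][]float32, rows [][]float32, slots []int) {
	for i, slot := range slots {
		cache[slot] = rows[i] // analogous to a ggml_set_rows() scatter
	}
}
```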
-
Jesse Gross authored
GGML requires tensors to be contiguous for reshape; if this is not the case, it will fail an assertion. Making a tensor contiguous is an expensive operation, so it's best to do it lazily, when it is actually required, rather than ahead of time when it may not be needed.
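A sketch of the lazy pattern with a stand-in `Tensor` interface (illustrative, not Ollama's ml package):

```go
// Tensor is a stand-in interface for illustration only.
type Tensor interface {
	IsContiguous() bool
	Contiguous() Tensor // expensive: materializes a packed copy
	Reshape(shape ...int) Tensor
}

// reshape pays for the contiguous copy only at the point reshape
// actually demands it, instead of eagerly at every producing call site.
func reshape(t Tensor, shape ...int) Tensor {
	if !t.IsContiguous() {
		t = t.Contiguous()
	}
	return t.Reshape(shape...)
}
```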
-
Grace authored
-
nicole pardal authored
-
Daniel Hiltgen authored
Calling abort() on Windows triggers the C++ runtime to attempt a debugger attach, which causes crashed runners to hang instead of exiting, leading to a timeout rather than a fast failure during discovery.
-
Michael Yang authored
CUDA panics on batches larger than 1024, so skip those and fall back to the CPU.
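The guard, sketched (the backend names are illustrative):

```go
// pickBackend routes oversized batches to the CPU implementation, since
// the CUDA path panics on batches larger than 1024.
func pickBackend(batchSize int) string {
	if batchSize > 1024 {
		return "cpu"
	}
	return "cuda"
}
```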
-
Michael Yang authored
-
- 18 Nov, 2025 7 commits
-
-
Lhiam Andrei Lingco authored
-
Michael Yang authored
-
Michael Yang authored
* migrate to golangci-lint v2
* copyloopvar
-
SamareshSingh authored
Void is an open source AI code editor and Cursor alternative that supports Ollama. It's built on VS Code and allows users to connect directly to Ollama for private LLM usage without going through a middleman backend. Key features:
- Open source Cursor alternative
- Direct Ollama integration
- VS Code fork with full compatibility
- Agent mode and MCP support
- Works with any open source model

Fixes #12919
Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
-
Grace authored
* Add MLA for flash attention
* Revert to using chunks
-
Eva H authored
-
Cerussite authored
* Add support for cgroup CPU core and memory limits (see the sketch below)
* Fix compile error and add logs
* Remove CPU info log
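For reference, a minimal Go sketch of reading the cgroup v2 CPU quota on Linux; the actual change may read different files (e.g. memory.max for the memory limit) or handle cgroup v1 paths as well:

```go
import (
	"os"
	"strconv"
	"strings"
)

// cgroupCPULimit returns the CPU count allowed by the cgroup v2 cpu.max
// file, whose contents are "<quota> <period>" or "max <period>" when
// unlimited; 0 means no limit could be determined.
func cgroupCPULimit() float64 {
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		return 0
	}
	fields := strings.Fields(string(data))
	if len(fields) != 2 || fields[0] == "max" {
		return 0
	}
	quota, _ := strconv.ParseFloat(fields[0], 64)
	period, _ := strconv.ParseFloat(fields[1], 64)
	if period == 0 {
		return 0
	}
	return quota / period
}
```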
-
- 17 Nov, 2025 4 commits
-
-
Daniel Hiltgen authored
* build: optimize dockerfile context for iterating

  This moves the copy of the source into the layer AFTER doing software installs so we don't have to go through the RPM install for cuda, etc. every time you touch a source file.

* amd: implement linux sysfs based VRAM lookup

  This adds a C++ implementation of sysfs DRM VRAM discovery for more accurate free VRAM data on linux for AMD GPUs (sketched below).
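The commit implements the VRAM lookup in C++; for illustration, the same sysfs read in Go, using the per-card counters the amdgpu driver exposes:

```go
import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// amdVRAMFree reads the amdgpu sysfs VRAM counters (in bytes) for one
// DRM card, e.g. card = "card0", and returns the free amount.
func amdVRAMFree(card string) (uint64, error) {
	read := func(name string) (uint64, error) {
		b, err := os.ReadFile(filepath.Join("/sys/class/drm", card, "device", name))
		if err != nil {
			return 0, err
		}
		return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
	}
	total, err := read("mem_info_vram_total")
	if err != nil {
		return 0, err
	}
	used, err := read("mem_info_vram_used")
	if err != nil {
		return 0, err
	}
	return total - used, nil
}
```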
-
Daniel Hiltgen authored
-
Eva H authored
-
Jeffrey Morgan authored
-
- 16 Nov, 2025 2 commits
-
-
omahs authored
-
Joel Bryan Juliano authored
Kdeps is an AI framework for declaratively building Dockerized full-stack AI applications, using Ollama LLM models on the backend.
-