- 20 Nov, 2025 5 commits
-
Jeffrey Morgan authored
-
Daniel Hiltgen authored
Recent refactoring introduced a regression in filtering overlapping CUDA versions to favor the newest supported version.
-
Grace authored
-
Michael Yang authored
The check for MLA omits V3 and R1, which should not be reported as unsupported. Instead, check the tokenizer for compatibility.
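A minimal sketch of that approach, with hypothetical names and a purely illustrative supported set (not the real model code): decide support from the tokenizer rather than from a version allow-list.

```go
// Hypothetical sketch: gate support on the tokenizer family instead of an
// allow-list of model versions that misses V3 and R1.
package model

// tokenizerSupported is illustrative; the tokenizer names checked here are
// examples, not the definitive compatibility list.
func tokenizerSupported(tokenizerModel string) bool {
	switch tokenizerModel {
	case "gpt2", "llama":
		return true
	default:
		return false
	}
}
```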
-
Jesse Gross authored
The causal cache can store data differently depending on what is best for the backend. We should run tests both ways.
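One way to express "run tests both ways" is a table-driven subtest over both layouts; a minimal sketch with stand-ins for the real cache constructor and shared assertions:

```go
// Hypothetical sketch: run the same causal-cache assertions under both
// storage layouts the backend may choose.
package kvcache

import "testing"

func newTestCache(permuted bool) any         { return nil } // stand-in for the real cache constructor
func runCausalSuite(t *testing.T, cache any) {}             // stand-in for the shared assertions

func TestCausalBothLayouts(t *testing.T) {
	for _, permuted := range []bool{false, true} {
		name := "contiguous"
		if permuted {
			name = "permuted"
		}
		t.Run(name, func(t *testing.T) {
			runCausalSuite(t, newTestCache(permuted))
		})
	}
}
```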
-
- 19 Nov, 2025 10 commits
-
nicole pardal authored
-
Michael Yang authored
-
Patrick Devine authored
-
Jesse Gross authored
We currently copy data into the KV cache in contiguous buffers using ggml_cpy(). ggml_set_rows() was introduced to allow scatter operations, so contiguous buffers are no longer required. The primary direct benefit is that we no longer need to perform defragmentation. However, GGML recently removed an optimization for ggml_cpy(), and we picked that up in 544b6739 "ggml update to b6840 (#12791)". This caused a roughly 40% drop in token generation performance on CUDA because CUDA graphs were no longer being used. By switching to ggml_set_rows(), the removed optimization is no longer necessary and CUDA performance is restored. Fixes #13112
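A rough sketch of the difference in Go, using hypothetical tensor types rather than GGML's C API or Ollama's real ml package:

```go
// Hypothetical sketch: with a scatter write (ggml_set_rows-style), a batch's
// K/V rows can land in arbitrary cache slots, so the destination no longer
// has to be a contiguous view and the cache never needs defragmentation.
package kvcache

// Tensor is a stand-in for a backend tensor handle; the methods mirror the
// two GGML paths described above, not a real interface.
type Tensor interface {
	// SetRows scatters the rows of src into this tensor at dstRows,
	// in the spirit of ggml_set_rows.
	SetRows(src Tensor, dstRows []int32) Tensor
	// CopyInto copies this tensor into a contiguous view, in the spirit of
	// the old ggml_cpy path that required contiguous (defragmented) slots.
	CopyInto(view Tensor) Tensor
}

// storeBatch writes one batch of keys into whichever cache slots are free;
// because the write is a scatter, the slots need not be contiguous.
func storeBatch(cacheKeys, batchKeys Tensor, freeSlots []int32) Tensor {
	return cacheKeys.SetRows(batchKeys, freeSlots)
}
```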
-
Jesse Gross authored
GGML requires tensors to be contiguous for reshape and will fail an assertion if they are not. Making a tensor contiguous is an expensive operation, so it's best to do it lazily, when actually required, rather than ahead of time when it may not be needed.
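A minimal sketch of the lazy approach, using hypothetical tensor methods:

```go
package nn

// Tensor is a hypothetical stand-in exposing just the operations needed here.
type Tensor interface {
	IsContiguous() bool
	Contiguous() Tensor
	Reshape(dims ...int) Tensor
}

// reshapeLazily makes the tensor contiguous only when the reshape requires it,
// since GGML asserts on non-contiguous reshape inputs and Contiguous is costly.
func reshapeLazily(t Tensor, dims ...int) Tensor {
	if !t.IsContiguous() {
		t = t.Contiguous() // pay the cost only when actually required
	}
	return t.Reshape(dims...)
}
```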
-
Grace authored
-
nicole pardal authored
-
Daniel Hiltgen authored
Calling abort on Windows triggers the C++ runtime to attempt a debugger attach, which causes crashed runners to hang instead of exit, leading to a timeout instead of a fast failure during discovery.
-
Michael Yang authored
CUDA panics on batches larger than 1024, so skip those and fall back to the CPU.
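A sketch of the fallback, with illustrative types; the 1024 threshold comes from the commit message:

```go
package ml

// Backend is a hypothetical stand-in for a compute backend.
type Backend interface {
	Forward(batch []int32)
}

// maxCUDABatch reflects the limit described in the commit message.
const maxCUDABatch = 1024

// pickBackend falls back to the CPU for batches the CUDA path cannot handle.
func pickBackend(batchSize int, cuda, cpu Backend) Backend {
	if batchSize > maxCUDABatch {
		return cpu
	}
	return cuda
}
```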
-
Michael Yang authored
-
- 18 Nov, 2025 7 commits
-
Lhiam Andrei Lingco authored
-
Michael Yang authored
-
Michael Yang authored
* migrate to golangci-lint v2
* copyloopvar
-
SamareshSingh authored
Void is an open source AI code editor and Cursor alternative that supports Ollama. It's built on VS Code and allows users to connect directly to Ollama for private LLM usage without going through a middleman backend. Key features:
- Open source Cursor alternative
- Direct Ollama integration
- VS Code fork with full compatibility
- Agent mode and MCP support
- Works with any open source model
Fixes #12919
Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
-
Grace authored
* Add mla for flash attention
* Revert to using chunks
-
Eva H authored
-
Cerussite authored
* Add support for cgroup core and memory limits (see the sketch below)
* Fix compile error and add logs
* Remove CPU info log
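A minimal sketch of cgroup v2 limit discovery, assuming the unified hierarchy is mounted at /sys/fs/cgroup; the parsing follows the documented cpu.max and memory.max formats and is illustrative, not necessarily this commit's exact implementation.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cgroupCPULimit returns the effective core count from cpu.max ("quota period").
func cgroupCPULimit() (float64, error) {
	b, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		return 0, err
	}
	fields := strings.Fields(string(b))
	if len(fields) != 2 || fields[0] == "max" {
		return 0, fmt.Errorf("no cpu quota set")
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, err
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return 0, err
	}
	return quota / period, nil
}

// cgroupMemoryLimit returns the memory limit in bytes from memory.max.
func cgroupMemoryLimit() (uint64, error) {
	b, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return 0, err
	}
	s := strings.TrimSpace(string(b))
	if s == "max" {
		return 0, fmt.Errorf("no memory limit set")
	}
	return strconv.ParseUint(s, 10, 64)
}

func main() {
	if cpus, err := cgroupCPULimit(); err == nil {
		fmt.Printf("cgroup cpu limit: %.2f cores\n", cpus)
	}
	if mem, err := cgroupMemoryLimit(); err == nil {
		fmt.Printf("cgroup memory limit: %d bytes\n", mem)
	}
}
```

A container-aware runtime could then clamp its thread count and memory estimates to these values rather than to the host totals.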
-
- 17 Nov, 2025 4 commits
-
Daniel Hiltgen authored
* build: optimize dockerfile context for iterating
  This moves the copy of the source into the layer AFTER doing software installs so we don't have to go through the RPM install for cuda, etc. every time you touch a source file.
* amd: implement linux sysfs based VRAM lookup
  This adds a C++ implementation of sysfs DRM VRAM discovery for more accurate free VRAM data on linux for AMD GPUs (see the sketch below).
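The commit's implementation is C++; as an illustration only, here is a Go sketch reading the same amdgpu sysfs counters (mem_info_vram_total and mem_info_vram_used) to compute free VRAM without any ROCm libraries.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readUint reads a single unsigned integer from a sysfs file.
func readUint(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	cards, _ := filepath.Glob("/sys/class/drm/card[0-9]*/device")
	for _, dev := range cards {
		total, err1 := readUint(filepath.Join(dev, "mem_info_vram_total"))
		used, err2 := readUint(filepath.Join(dev, "mem_info_vram_used"))
		if err1 != nil || err2 != nil {
			continue // not an amdgpu device, or counters unavailable
		}
		fmt.Printf("%s: %d MiB free of %d MiB\n", dev, (total-used)>>20, total>>20)
	}
}
```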
-
Daniel Hiltgen authored
-
Eva H authored
-
Jeffrey Morgan authored
-
- 16 Nov, 2025 6 commits
-
omahs authored
-
Joel Bryan Juliano authored
Kdeps is an AI framework for declaratively building Dockerized full-stack AI applications; it uses Ollama LLM models on the backend.
-
pierwill authored
Co-authored-by: pierwill <pierwill@users.noreply.github.com>
-
Vignesh Skanda authored
-
Laurențiu Nicola authored
-
Patrick Devine authored
This change adds a basic benchmarking test framework for Ollama which can be used to determine the prefill, eval, load, and total durations for running a given model or models.
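For reference, the same metrics are exposed in the /api/generate response (durations in nanoseconds); the benchmark framework may gather them differently, but a small Go sketch that times one generation this way, with the model name as a placeholder, looks like:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// genMetrics mirrors the timing fields of the /api/generate final response.
type genMetrics struct {
	TotalDuration      time.Duration `json:"total_duration"`
	LoadDuration       time.Duration `json:"load_duration"`
	PromptEvalCount    int           `json:"prompt_eval_count"`
	PromptEvalDuration time.Duration `json:"prompt_eval_duration"`
	EvalCount          int           `json:"eval_count"`
	EvalDuration       time.Duration `json:"eval_duration"`
}

func main() {
	body, _ := json.Marshal(map[string]any{
		"model":  "llama3.2", // placeholder model name
		"prompt": "Why is the sky blue?",
		"stream": false,
	})
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var m genMetrics
	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
		panic(err)
	}
	fmt.Printf("load: %v  prefill: %d tok in %v  eval: %d tok in %v  total: %v\n",
		m.LoadDuration, m.PromptEvalCount, m.PromptEvalDuration,
		m.EvalCount, m.EvalDuration, m.TotalDuration)
}
```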
-
- 14 Nov, 2025 2 commits
-
Daniel Hiltgen authored
Many recent failed GPU discovery issues can be traced to incorrect override settings. This extra logging should help spot these quickly and guide users to try unsetting them first.
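A sketch of what such logging might look like; the override variable names listed are common examples, not necessarily the commit's exact set:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Commonly misused GPU override variables (illustrative list).
	overrides := []string{
		"CUDA_VISIBLE_DEVICES",
		"HIP_VISIBLE_DEVICES",
		"ROCR_VISIBLE_DEVICES",
		"HSA_OVERRIDE_GFX_VERSION",
	}
	for _, name := range overrides {
		if v, ok := os.LookupEnv(name); ok {
			slog.Warn("GPU override is set; if discovery fails, try unsetting it", "name", name, "value", v)
		}
	}
}
```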
-
Parth Sareen authored
-
- 13 Nov, 2025 6 commits
-
Michael Yang authored
-
Michael Yang authored
* use slice/chunks
* bert
* llama4
* gemma3n
* gptoss
* mistral3
* qwen3vl
* qwen25vl
* deepseek2
* remove unused ops
-
Parth Sareen authored
-
Michael Yang authored
* slice
* chunk, chunksections
-
nicole pardal authored
-
Kowyo authored
-