- 20 Oct, 2025 1 commit
-
-
Michael Yang authored
-
- 16 Oct, 2025 1 commit
-
-
zhetaicheleba authored
-
- 15 Oct, 2025 1 commit
-
-
Jesse Gross authored
-
- 13 Oct, 2025 1 commit
-
-
Michael Yang authored
This reverts commit 3d32249c.
-
- 10 Oct, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 03 Oct, 2025 2 commits
-
-
Jesse Gross authored
With the new version of GGML in #12245, KV cache quantization no longer causes a fallback to CPU.
-
Jesse Gross authored
-
- 24 Sep, 2025 1 commit
-
-
Michael Yang authored
-
- 17 Sep, 2025 1 commit
-
-
Michael Yang authored
-
- 10 Sep, 2025 2 commits
-
-
Jesse Gross authored
Our new engine implementation of gemma2 doesn't support flash attention, which means that it also doesn't support KV cache quantization. Currently, it is possible to turn these two on, which will result in a crash.
-
Jesse Gross authored
If flash attention is enabled without KV cache quantization, we will currently always get this warning: level=WARN source=server.go:226 msg="kv cache type not supported by model" type=""
-
- 08 Sep, 2025 1 commit
-
-
Gabe Goodhart authored
This PR updates the memory size estimate logic to better handle recurrent and hybrid-recurrent models, which are currently badly overestimated because the default logic assumes full attention for all layers.

The logic for sizing the recurrent layers comes from the llama.cpp implementation:

    ggml_tensor * r = ggml_new_tensor_1d(ctx, type_r, hparams.n_embd_r()*mem_size);
    ggml_tensor * s = ggml_new_tensor_1d(ctx, type_s, hparams.n_embd_s()*mem_size);

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
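To make the sizing difference concrete, here is a minimal sketch in Go of a per-layer estimate that follows the two llama.cpp tensor shapes quoted in the commit. The `sizing` struct, its field names, and the `estimate` helper are hypothetical and are not taken from the Ollama source. The full-attention formula (context length times K and V widths) is the usual KV cache size, assumed here for comparison.

```go
package main

import "fmt"

// layerKind distinguishes full-attention layers from recurrent ones.
type layerKind int

const (
	attention layerKind = iota
	recurrent
)

// sizing holds hypothetical hyperparameters; all field names are illustrative.
type sizing struct {
	ctx            uint64 // context length in tokens
	nEmbdK, nEmbdV uint64 // per-token K and V widths of an attention layer
	kvBytes        uint64 // bytes per K/V element
	nEmbdR, nEmbdS uint64 // state widths, mirroring n_embd_r() and n_embd_s()
	rBytes, sBytes uint64 // bytes per element of type_r and type_s
	memSize        uint64 // mirrors mem_size in the llama.cpp snippet
}

// layerBytes estimates the cache bytes needed by a single layer.
func (s sizing) layerBytes(k layerKind) uint64 {
	if k == recurrent {
		// Two 1-D tensors whose size does not depend on the context length.
		return s.nEmbdR*s.memSize*s.rBytes + s.nEmbdS*s.memSize*s.sBytes
	}
	// Full attention: K and V grow linearly with the context length.
	return s.ctx * (s.nEmbdK + s.nEmbdV) * s.kvBytes
}

// estimate sums the per-layer estimates for a model's layer layout.
func estimate(s sizing, layers []layerKind) uint64 {
	var total uint64
	for _, k := range layers {
		total += s.layerBytes(k)
	}
	return total
}

func main() {
	s := sizing{ctx: 4096, nEmbdK: 8, nEmbdV: 8, kvBytes: 2,
		nEmbdR: 100, nEmbdS: 200, rBytes: 4, sBytes: 4, memSize: 1}
	hybrid := []layerKind{recurrent, recurrent, recurrent, attention}
	full := []layerKind{attention, attention, attention, attention}
	fmt.Println("hybrid:", estimate(s, hybrid), "assuming full attention:", estimate(s, full))
}
```

With these made-up numbers, treating every layer as full attention gives an estimate several times larger than the hybrid layout, which is the kind of overestimate the commit describes.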
-
- 26 Aug, 2025 3 commits
-
-
Michael Yang authored
* convert: return bytes written
* ggml flavor mxfp4
* simplify jit conversion
* comment
-
Michael Yang authored
There are two bugs here:

1. The check for a layer ID is incorrect and should be >= 0, since layer 0 is valid.
2. If both tensors have a layer identifier, it will only compare the layer ID, which returns 0 if the tensors are in the same layer. Instead, it should fall back to comparing the full tensor name.
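The two fixes can be sketched as a comparator in Go. This is an illustration only: the `blk.N.` naming pattern, `layerID`, `compareTensors`, and `sortTensors` are assumptions made for the example, not the actual Ollama code. Tensors without a layer ID are placed first here purely to keep the ordering total.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
	"strings"
)

// layerRE is a hypothetical pattern for names such as "blk.12.attn_q.weight".
var layerRE = regexp.MustCompile(`^blk\.(\d+)\.`)

// layerID returns the layer index embedded in name, or -1 if there is none.
func layerID(name string) int {
	m := layerRE.FindStringSubmatch(name)
	if m == nil {
		return -1
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return -1
	}
	return n
}

// compareTensors returns a negative, zero, or positive value, ordering by
// layer first and by full tensor name second.
func compareTensors(a, b string) int {
	la, lb := layerID(a), layerID(b)
	switch {
	case la >= 0 && lb >= 0 && la != lb:
		// Fix 1: the test is >= 0, not > 0, because layer 0 is valid.
		return la - lb
	case la >= 0 && lb < 0:
		return 1 // assumption: tensors without a layer sort first
	case la < 0 && lb >= 0:
		return -1
	default:
		// Fix 2: same layer (or no layer at all), so fall back to the name.
		return strings.Compare(a, b)
	}
}

// sortTensors returns a sorted copy of names.
func sortTensors(names []string) []string {
	out := append([]string(nil), names...)
	sort.Slice(out, func(i, j int) bool { return compareTensors(out[i], out[j]) < 0 })
	return out
}

func main() {
	fmt.Println(sortTensors([]string{
		"blk.1.attn_q.weight", "blk.0.ffn_up.weight", "output.weight",
		"blk.0.attn_q.weight", "blk.10.attn_q.weight", "blk.2.attn_q.weight",
	}))
}
```

Without the name fallback, two tensors in the same layer would compare as equal and their relative order would depend on the sort implementation.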
-
Michael Yang authored
-
- 15 Aug, 2025 1 commit
-
-
Michael Yang authored
-
- 14 Aug, 2025 2 commits
-
-
Jesse Gross authored
This changes the memory allocation strategy from upfront estimation to tracking actual allocations done by the engine and reacting to that. The goal is to avoid issues caused by both under-estimation (crashing) and over-estimation (low performance due to under-utilized GPUs).

It is currently opt-in and can be enabled for models running on the Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other cases is unchanged and will continue to use the existing estimates.
-
Michael Yang authored
* TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch

  This will be redone once my branch is merged upstream in llama.cpp

* feat: Update all patches

  There are a number that are no longer needed at all:

  - 0003-embeddings: Embeddings entirely overhauled on master
  - 0008-ensure-KV-cache-is-fully-defragmented: KV caching entirely overhauled on master
  - 0019-metal-add-mean-kernel-14267: Merged upstream
  - 0020-CUDA-add-mean-operation-14313: Merged upstream

* feat: Sync llama.cpp and ggml
* fix: Update rsync-filter for all moved/new/removed files
* fix: Add files missing from sync
* fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs
* fix: Add ggml files missing from sync
* fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files
* fix: Remove mtmd main cpp files
* fix: Add missing include in sampling_ext.cpp
* fix: Update llama.go to use mtmd instead of clip/llava
* fix: Add patch for mtmd_input_text
* chore: Ignore *.patched in the patch directory
* fix: Fix support for arch-specific ggml-cpu source files with new arrangement

  In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific implementations were split out into a nested tree structure under ggml-cpu/arch. This conflicts with standard CGO layout where all arch-specific source files are expected to live in the same directory as the parent go module and use suffixes based on GOOS and GOARCH. As such, there were really two options for getting this to work:

  1. Add a patch on top of the GGML sync to rearrange the files to match the GO layout convention
  2. Use CGO directives to conditionally include the nested source files in the compilation units

  This commit does (2) in order to minimize the set of changes needed on top of the upstream file layout. To get this to work, there are two key things needed:

  1. In cpu.go, #cgo directives are added to explicitly set __${GOARCH}__ in the preprocessor directives
  2. In arch-impls.c|cpp, use an #ifdef | #elif defined | #endif chain to explicitly include the .c|.cpp files for the given architecture from the nested directory

* fix: Use mtmd_helper to correctly load the bitmap for the image
* fix: Apply patch for mtmd_text_input
* fix: Add missing stb to llama.cpp rsync-filter
* fix: Add sync'ed stb vendored header
* fix: Use c++17 and include vendor for go wrapper modules
* fix: Update patch 0015 for upstream implementation of uuid
* feat: Bump to the latest tip of the branch
* fix: Update patches for bump
* feat: Bump back to the central repo and point at the latest master

  This includes granite 4 and a number of other model architectures!

* fix: Revert changes to ggml export GPU UUID patch
* fix: Add patch for GGML_VERSION and GGML_COMMIT constants
* feat: Sync all patched code
* build: Include cmake/common.cmake in ggml sync
* build: Add top-level include for GNUInstallDirs in CMakeLists.txt

  This is used to populate CMAKE_INSTALL_BINDIR

* fix: Add a patch to avoid power throttling API on non-msvc windows builds
* fix: Sync patch changes for ggml-cpu.c
* feat: Bump llama.cpp to 4a4f42

  This picks up support for Kimi K2 and PLaMO-2

* feat: Sync llama.cpp
* fix: Handle multi-chunk image encodings from mtmd
* fix: Re-number patches after merge with `main`
* feat: Bump to 41e78c in the makefile
* fix: Fix Solar and argsort/copy patches after bump
* fix: Remove Gemma3n CUDA Graphs patch

  It was implemented upstream: https://github.com/ggml-org/llama.cpp/pull/14741

* feat: Sync llama.cpp / ggml after latest bump
* build: Remove unnecessary CFLAGS definitions in cpu.go
* fix: Remove unnecessary additions in the rsync-filter
* fix: Remove unused vendored code for chat template parsing
* Revert "fix: Remove Gemma3n CUDA Graphs patch"

  This reverts commit d724caced3ce21f08924d4b7801f94ce6638f6ea.

* fix: Update 0020 CUDA Graphs for gemma3n to keep both llama.cpp and ollama fixes

  https://github.com/ollama/ollama/pull/11195#issuecomment-3137312394

* fix: Sync ggml-cuda.cu after keeping both style cuda graph fixes for gemma3n
* unwind mxfp4 patch

  Prepare to bump ggml with their impl for mxfp4

* bump
* fix windows build error
* Convert tensors at load time

  Repack the mxfp4 tensors as ggml's kernels expect them to be.

* convert mlp bf16 to f32
* buffer the conversion better
* reshape earlier
* openai swiglu
* add ids
* split qkv, gate_up
* fix nested alt tags
* fast attention
* remove debug messages
* fix lint
* remove redundant test
* remap values only if source/target are different
* add back i32->i32 copy
* refactor cpu quants
* clean up vendor
* update patch instructions
* clean up patches
* remove webgpu
* update mem
* also handle gpt-oss
* revert convert changes

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
-
- 05 Aug, 2025 3 commits
-
-
Michael Yang authored
-
Jesse Gross authored
KV cache quantization has a dependency on the flash attention kernel. We currently cannot use flash attention with gpt-oss as it requires additional operations. The model definition does not call flash attention, so it works regardless of the setting but the cache will pick up the quantization type. This updates the flash attention setting earlier in the loading flow so that all downstream settings are also set correctly. Fixes: #11671
-
Michael Yang authored
* bf16
* tests
* gpt-oss
* enable gptoss for engine
* rough estimate
* convert to mxfp4
* handle safetensors U8
* clamp glu/linear
* update tokenizer
* MXFP4 support

  This implements the Open Compute Microscaling (MX) FP4 format as a tensor type with backend implementations focusing on mulmat and mulmatid on CPU, CUDA, and Metal.

* Unit tests for MXFP4 support

  This exercises various operations and shapes on both CPU and GPU (if detected on the system)

* cuda graph
* unit test adjustments
* cuda: optimize memory access

  Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4

* mac: fix crash on old macos versions

  cblas_sgemm is only supported on v13.3 and up, however bf16 is only supported on v14+ so we were falling back to ggml-blas and crashing on bf16 tensors. Checking for the function being null seems to be the simplest way to conditionally avoid registering the backend.

* server: Minimum context length for gptoss

  This model requires a minimum context length of 8192 to function effectively. Users can set higher values through all normal mechanisms but lower values will be silently reset.

* ggml: Multiply by numParallel for gptoss sliding window

  When computing the graph size estimate, the context size is already multiplied by numParallel so estimates reflect that. However, since sliding window models use a smaller, fixed context size, they need to manually take numParallel into account.

* gpt-oss integration

  includes harmony parser and thinking levels, etc.

* fix sync
* fix tests
* fix lint

---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
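For readers unfamiliar with the MX FP4 format mentioned in this commit, the following Go sketch decodes one block as the OCP Microscaling specification describes it: each element is a 4-bit E2M1 code (sign bit plus eight magnitudes), and a block of elements shares one E8M0 power-of-two scale. It is not the ggml kernel. The `dequantBlock` name is hypothetical, and the codes are assumed to be already unpacked to one per byte, because the nibble packing order is implementation-specific.

```go
package main

import (
	"fmt"
	"math"
)

// e2m1 lists the magnitudes selected by the low three bits of an FP4 (E2M1) code.
var e2m1 = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// dequantBlock decodes one MX block. codes holds one 4-bit element per byte
// (bit 3 is the sign), and scale is the shared E8M0 exponent with bias 127.
func dequantBlock(codes []uint8, scale uint8) []float32 {
	factor := float32(math.NaN()) // the E8M0 value 255 encodes NaN
	if scale != 255 {
		factor = float32(math.Ldexp(1, int(scale)-127))
	}
	out := make([]float32, len(codes))
	for i, c := range codes {
		v := e2m1[c&0x7] * factor
		if c&0x8 != 0 {
			v = -v
		}
		out[i] = v
	}
	return out
}

func main() {
	// Scale 128 means a factor of 2^(128-127) = 2.
	fmt.Println(dequantBlock([]uint8{0x7, 0x9, 0x0, 0x2}, 128))
}
```

Because the scale is a pure power of two, applying it only shifts the exponent of each element.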
-
- 26 Jun, 2025 3 commits
-
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
Michael Yang authored
* update patches
* cherry pick metal mean kernel
* cherry pick cuda mean kernel
* gemma3n
-
- 20 Jun, 2025 1 commit
-
-
Michael Yang authored
* Reapply "feat: incremental gguf parser (#10822)" (#11114)

  This reverts commit a6e64fbd.

* fix older ggufs
-
- 18 Jun, 2025 1 commit
-
-
Jeffrey Morgan authored
This reverts commit 6b04cad7.
-
- 16 Jun, 2025 1 commit
-
-
Michael Yang authored
* ggml: test write gguf order
* ggml: fix write tensor order
-
- 12 Jun, 2025 1 commit
-
-
Michael Yang authored
* incremental gguf parser
* gguf: update test to not rely on gguf on disc
* re-use existing create gguf
* read capabilities from gguf kv
* kv exists
* update tests
* s/doneFunc/successFunc/g
* new buffered reader

---------

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
-
- 19 May, 2025 1 commit
-
-
Jesse Gross authored
Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This separates them into two steps:

- Create backend, including enumerating tensors and memory allocation
- Loading tensor data

This allows more flexibility in managing model loading.
-
- 14 May, 2025 3 commits
-
-
Bruce MacDonald authored
-
Bruce MacDonald authored
-
Michael Yang authored
-
- 12 May, 2025 1 commit
-
-
Daniel Hiltgen authored
The quantization PR didn't block all unsupported file types, which this PR fixes. It also updates the API docs to reflect the now reduced set of supported types.
-
- 07 May, 2025 1 commit
-
-
Daniel Hiltgen authored
err in the goroutine should not be shared with the outer scope
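A minimal Go sketch of the pattern this message refers to, under the assumption that the bug was a goroutine assigning to an `err` variable declared in the enclosing function. The `run` helper and `errDemo` are hypothetical names for illustration. Each goroutine declares its own `err` with `:=` and reports it over a channel, so nothing is written to a variable that the outer scope also reads.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDemo is a sentinel error used only by this example.
var errDemo = errors.New("demo failure")

// run executes every task concurrently and returns one of the errors, or nil.
func run(tasks []func() error) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(tasks)) // buffered so senders never block
	for _, task := range tasks {
		wg.Add(1)
		go func(task func() error) {
			defer wg.Done()
			// err is local to this goroutine; writing "err = task()" against
			// an outer variable would be a data race.
			if err := task(); err != nil {
				errs <- err
			}
		}(task)
	}
	wg.Wait()
	close(errs)
	return <-errs // receiving from a closed, empty channel yields nil
}

func main() {
	ok := func() error { return nil }
	bad := func() error { return errDemo }
	fmt.Println(run([]func() error{ok, ok}), run([]func() error{ok, bad}))
}
```

For real code, `golang.org/x/sync/errgroup` packages the same idea, though whether the original fix used it is not stated in the commit.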
-
- 06 May, 2025 1 commit
-
-
Daniel Hiltgen authored
* Move quantization logic to GGML via new backend

  This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.

* Remove "add model quantizations"

  This is no longer needed now that quantization is implemented in Go+GGML code directly.
-
- 05 May, 2025 1 commit
-
-
Jesse Gross authored
Most of the time this is not an error.
-
- 01 May, 2025 1 commit
-
-
Michael Yang authored
* add gguf_test
* fix padding

  padding was being added to offset but not to the running count
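The padding bug can be illustrated with a small Go sketch. The `padding` and `layout` helpers are hypothetical, and the alignment of 32 bytes is used only as an example value. The point is that the recorded tensor offset and the running count of bytes written must advance by the same padded amount, or the two drift apart after the first unaligned tensor.

```go
package main

import "fmt"

// padding returns the number of bytes needed to round n up to a multiple of align.
func padding(n, align uint64) uint64 {
	return (align - n%align) % align
}

// layout returns the start offset of each tensor and the total bytes written
// when every tensor is padded out to the alignment.
func layout(sizes []uint64, align uint64) ([]uint64, uint64) {
	offsets := make([]uint64, 0, len(sizes))
	var offset, written uint64
	for _, size := range sizes {
		offsets = append(offsets, offset)
		pad := padding(size, align)
		offset += size + pad
		written += size + pad // the running count must include the padding too
	}
	return offsets, written
}

func main() {
	offsets, written := layout([]uint64{5, 32, 33}, 32)
	fmt.Println(offsets, written)
}
```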
-
- 27 Apr, 2025 1 commit
-
-
Devon Rifkin authored
If it's an array, it uses the max value in the array.

If array values for head counts become more popular, we can consider a more invasive change like #10225 to calculate more accurate estimates.

Fixes: #9984
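A minimal Go sketch of the "use the max value" approach, assuming the head count metadata arrives either as a single `uint32` or as a per-layer `[]uint32`. The `headCountMax` name and the choice of types are illustrative and are not taken from the actual change.

```go
package main

import "fmt"

// headCountMax reduces a head-count value to one number: a scalar is returned
// as-is, an array yields its largest element, and anything else yields 0.
func headCountMax(v any) uint64 {
	switch x := v.(type) {
	case uint32:
		return uint64(x)
	case []uint32:
		var best uint64
		for _, n := range x {
			if uint64(n) > best {
				best = uint64(n)
			}
		}
		return best
	default:
		return 0
	}
}

func main() {
	fmt.Println(headCountMax(uint32(8)), headCountMax([]uint32{8, 0, 32, 16}))
}
```

Taking the maximum means the estimate is never too small for any single layer. The cost is overestimating models whose layers differ widely, which is the trade-off the commit message mentions.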
-
- 25 Apr, 2025 2 commits
-
-
Michael Yang authored
-
Michael Yang authored
-