1. 09 Jan, 2026 1 commit
    • Add experimental MLX backend and engine with imagegen support (#13648) · 33ee7168
      Daniel Hiltgen authored
      
      
      * WIP - MLX backend with gemma3
      
      * MLX: add cmake and go tag build toggles
      
      To build the new MLX backend code:
        cmake --preset MLX
        cmake --build --preset MLX --parallel
        cmake --install build --component MLX
        go build -tags mlx .
      
      Note: the main.go entrypoint for the MLX engine will change in a follow up commit.
      
      * add experimental image generation runtime
      
      * add experimental image generation runtime
      
      * MLX: wire up cuda build for linux
      
      * MLX: get dependencies correct and dedup
      
      This is still too large for a unified github artifact, but is now "correct" for the mlx_cuda_v13
      directory.
      
      * fix relative link bug in dedup
      
      * Add darwin build and readme
      
      * add go build tag for mlx dependent code and wire up build_darwin.sh
      
      * lint cleanup
      
      * macos: build mlx for x86
      
      This will be CPU only.
      
      * cuda build instructions and fix drift from mlx bump
      
      * stale comment
      
      * Delete agent helper doc
      
      * Clean up readme.md
      
      * Revise README for tokenizer clarity and details
      
      Updated README to clarify tokenizer functionality and removed correctness section.
      
      ---------
      Co-authored-by: jmorganca <jmorganca@gmail.com>
  2. 16 Dec, 2025 1 commit
  3. 15 Dec, 2025 1 commit
  4. 12 Dec, 2025 1 commit
    • flash attn: add auto mode for llama engine (#13052) · bd6c1d6b
      Daniel Hiltgen authored
      * flash attn: add auto mode for llama engine
      
      If the user does not specify fa in the environment, use auto-mode.
      
      * review comments
      
      * ensure kv cache quantized types have FA explicitly enabled
      
      additional review comments
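The decision order described above (quantized KV cache forces flash attention on, an explicit setting is otherwise honored, auto mode falls back to model support) can be sketched roughly as follows. The function name and rule details are illustrative assumptions, not Ollama's actual implementation:

```go
package main

import "fmt"

// resolveFlashAttn is a hypothetical sketch of the auto-mode decision:
// a quantized KV cache type requires flash attention, an explicit user
// setting is otherwise honored, and auto mode enables flash attention
// whenever the model supports it.
func resolveFlashAttn(userSet, userValue bool, kvCacheType string, modelSupportsFA bool) bool {
	if kvCacheType == "q8_0" || kvCacheType == "q4_0" {
		return true // quantized KV cache requires the flash attention kernel
	}
	if userSet {
		return userValue // explicit environment setting wins
	}
	return modelSupportsFA // auto mode
}

func main() {
	fmt.Println(resolveFlashAttn(false, false, "q8_0", false)) // true: quantized cache forces FA
}
```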
  5. 08 Dec, 2025 1 commit
  6. 04 Dec, 2025 1 commit
  7. 19 Nov, 2025 3 commits
  8. 18 Nov, 2025 1 commit
  9. 11 Nov, 2025 1 commit
    • llm: Use Ollama engine memory layouts for both old and new engines · f560bd07
      Jesse Gross authored
      Currently for both the old and new engines, there is code to
      calculate how much memory is required for a model and lay out
      the layers onto GPUs. This reuses the new engine's lay out code
      for the old engine as well, bringing them closer together. The
      old engine continues to use its current method of estimating
      required memory.
      
      This reduces maintenance effort and improves consistency, as new
      features only need to be implemented in one place. The newer code
      is also more accurate, especially with multiple GPUs.
  10. 30 Oct, 2025 1 commit
  11. 29 Oct, 2025 1 commit
  12. 20 Oct, 2025 1 commit
  13. 16 Oct, 2025 1 commit
  14. 15 Oct, 2025 1 commit
  15. 13 Oct, 2025 1 commit
  16. 10 Oct, 2025 1 commit
  17. 03 Oct, 2025 2 commits
  18. 24 Sep, 2025 1 commit
  19. 17 Sep, 2025 1 commit
  20. 10 Sep, 2025 2 commits
    • ggml: Disable flash attention for gemma2 · 29ddfc2c
      Jesse Gross authored
      Our new engine implementation of gemma2 doesn't support flash
      attention, which means that it also doesn't support KV cache
      quantization. Currently, it is possible to turn these two on,
      which will result in a crash.
    • llm: Remove unneeded warning with flash attention enabled · 71cb86af
      Jesse Gross authored
      If flash attention is enabled without KV cache quantization, we will
      currently always get this warning:
      level=WARN source=server.go:226 msg="kv cache type not supported by model" type=""
  21. 08 Sep, 2025 1 commit
    • Hybrid and recurrent memory estimates (#12186) · 7b91c9ce
      Gabe Goodhart authored
      
      
      This PR updates the memory size estimate logic to better handle recurrent and hybrid-recurrent models which are currently being badly overestimated because the default logic assumes full attention for all layers.
      
      The logic for the sizing of the recurrent layers comes from the llama.cpp implementation
      
              ggml_tensor * r = ggml_new_tensor_1d(ctx, type_r, hparams.n_embd_r()*mem_size);
              ggml_tensor * s = ggml_new_tensor_1d(ctx, type_s, hparams.n_embd_s()*mem_size);
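Following the quoted llama.cpp lines, a recurrent layer's cache scales with fixed per-sequence state sizes rather than the context length, which is why the full-attention default overestimates. A rough comparison, with hypothetical helper names and illustrative numbers:

```go
package main

import "fmt"

// recurrentLayerBytes mirrors the two llama.cpp tensors quoted above:
// r and s states sized by n_embd_r / n_embd_s times mem_size (number of
// cached sequence states). elemSize stands in for the byte width of
// type_r / type_s. Names here are illustrative, not Ollama's code.
func recurrentLayerBytes(nEmbdR, nEmbdS, memSize, elemSize int) int {
	r := nEmbdR * memSize * elemSize
	s := nEmbdS * memSize * elemSize
	return r + s
}

// attentionLayerBytes is the full-attention equivalent for comparison:
// K and V, each nCtx x nEmbdKV elements.
func attentionLayerBytes(nCtx, nEmbdKV, elemSize int) int {
	return 2 * nCtx * nEmbdKV * elemSize
}

func main() {
	// A recurrent layer's cache is independent of context length...
	fmt.Println(recurrentLayerBytes(1024, 4096, 1, 2))
	// ...while a full-attention layer grows with it, hence the overestimate.
	fmt.Println(attentionLayerBytes(8192, 1024, 2))
}
```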
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
  22. 26 Aug, 2025 3 commits
  23. 15 Aug, 2025 1 commit
  24. 14 Aug, 2025 2 commits
    • llm: New memory management · d5a0d8d9
      Jesse Gross authored
      This changes the memory allocation strategy from upfront estimation to
      tracking actual allocations done by the engine and reacting to that. The
      goal is to avoid issues caused by both under-estimation (crashing) and
      over-estimation (low performance due to under-utilized GPUs).
      
      It is currently opt-in and can be enabled for models running on the
      Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
      cases is unchanged and will continue to use the existing estimates.
    • update vendored llama.cpp and ggml (#11823) · 1a19df1f
      Michael Yang authored
      * TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch
      
      This will be redone once my branch is merged upstream in llama.cpp
      
      * feat: Update all patches
      
      There are a number that are no longer needed at all:
      
      - 0003-embeddings: Embeddings entirely overhauled on master
      - 0008-ensure-KV-cache-is-fully-defragmented: KV caching entirely
          overhauled on master
      - 0019-metal-add-mean-kernel-14267: Merged upstream
      - 0020-CUDA-add-mean-operation-14313: Merged upstream
      
      * feat: Sync llama.cpp and ggml
      
      * fix: Update rsync-filter for all moved/new/removed files
      
      * fix: Add files missing from sync
      
      * fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs
      
      * fix: Add ggml files missing from sync
      
      * fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files
      
      * fix: Remove mtmd main cpp files
      
      * fix: Add missing include in sampling_ext.cpp
      
      * fix: Update llama.go to use mtmd instead of clip/llava
      
      * fix: Add patch for mtmd_input_text
      
      * chore: Ignore *.patched in the patch directory
      
      * fix: Fix support for arch-specific ggml-cpu source files with new arrangement
      
      In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific
      implementations were split out into a nested tree structure under
      ggml-cpu/arch. This conflicts with standard CGO layout where all
      arch-specific source files are expected to live in the same directory as
      the parent go module and use suffixes based on GOOS and GOARCH. As such,
      there were really two options for getting this to work:
      
      1. Add a patch on top of the GGML sync to rearrange the files to match the
      Go layout convention
      2. Use CGO directives to conditionally include the nested source files in
      the compilation units
      
      This commit does (2) in order to minimize the set of changes needed on top
      of the upstream file layout. To get this to work, there are two key things
      needed:
      
      1. In cpu.go, #cgo directives are added to explicitly set __${GOARCH}__ in
      the preprocessor directives
      2. In arch-impls.c|cpp, use an #ifdef | #elif defined | #endif chain to
      explicitly include the .c|.cpp files for the given architecture from the
      nested directory
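A minimal sketch of what option (2) could look like; the file names, defines, and include paths below are illustrative, not the actual patch:

```go
// cpu.go (sketch): cgo directives explicitly define __${GOARCH}__ so a
// single C compilation unit can select the nested arch-specific sources.
package cpu

// #cgo arm64 CFLAGS: -D__arm64__
// #cgo amd64 CFLAGS: -D__amd64__
import "C"

// The companion arch-impls.c would then chain includes (C code, shown
// here as a comment):
//
//	#if defined(__arm64__)
//	#include "arch/arm/impl.c"
//	#elif defined(__amd64__)
//	#include "arch/x86/impl.c"
//	#endif
```

This keeps the upstream ggml-cpu/arch tree untouched while still compiling only the sources for the active GOARCH.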
      
      * fix: Use mtmd_helper to correctly load the bitmap for the image
      
      * fix: Apply patch for mtmd_text_input
      
      * fix: Add missing stb to llama.cpp rsync-filter
      
      * fix: Add sync'ed stb vendored header
      
      * fix: Use c++17 and include vendor for go wrapper modules
      
      * fix: Update patch 0015 for upstream implementation of uuid
      
      * feat: Bump to the latest tip of the branch
      
      * fix: Update patches for bump
      
      * feat: Bump back to the central repo and point at the latest master
      
      This includes granite 4 and a number of other model architectures!
      
      * fix: Revert changes to ggml export GPU UUID patch
      
      * fix: Add patch for GGML_VERSION and GGML_COMMIT constants
      
      * feat: Sync all patched code
      
      * build: Include cmake/common.cmake in ggml sync
      
      * build: Add top-level include for GNUInstallDirs in CMakeLists.txt
      
      This is used to populate CMAKE_INSTALL_BINDIR
      
      * fix: Add a patch to avoid power throttling API on non-msvc windows builds
      
      * fix: Sync patch changes for ggml-cpu.c
      
      * feat: Bump llama.cpp to 4a4f42
      
      This picks up support for Kimi K2 and PLaMO-2
      
      * feat: Sync llama.cpp
      
      * fix: Handle multi-chunk image encodings from mtmd
      
      * fix: Re-number patches after merge with `main`
      
      * feat: Bump to 41e78c in the makefile
      
      * fix: Fix Solar and argsort/copy patches after bump
      
      * fix: Remove Gemma3n CUDA Graphs patch
      
      It was implemented upstream:
      https://github.com/ggml-org/llama.cpp/pull/14741
      
      * feat: Sync llama.cpp / ggml after latest bump
      
      * build: Remove unnecessary CFLAGS definitions in cpu.go
      
      * fix: Remove unnecessary additions in the rsync-filter
      
      * fix: Remove unused vendored code for chat template parsing
      
      * Revert "fix: Remove Gemma3n CUDA Graphs patch"
      
      This reverts commit d724caced3ce21f08924d4b7801f94ce6638f6ea.
      
      * fix: Update 0020 CUDA Graphs for gemma3n to keep both llama.cpp and ollama fixes
      
      https://github.com/ollama/ollama/pull/11195#issuecomment-3137312394
      
      
      
      * fix: Sync ggml-cuda.cu after keeping both style cuda graph fixes for gemma3n
      
      * unwind mxfp4 patch
      
      Prepare to bump ggml with their impl for mxfp4
      
      * bump
      
      * fix windows build error
      
      * Convert tensors at load time
      
      Repack the mxfp4 tensors as ggml's kernels expect them to be.
      
      * convert mlp bf16 to f32
      
      * buffer the conversion better
      
      * reshape earlier
      
      * openai swiglu
      
      * add ids
      
      * split qkv, gate_up
      
      * fix nested alt tags
      
      * fast attention
      
      * remove debug messages
      
      * fix lint
      
      * remove redundant test
      
      * remap values only if source/target are different
      
      * add back i32->i32 copy
      
      * refactor cpu quants
      
      * clean up vendor
      
      * update patch instructions
      
      * clean up patches
      
      * remove webgpu
      
      * update mem
      
      * also handle gpt-oss
      
      * revert convert changes
      
      ---------
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
  25. 05 Aug, 2025 3 commits
    • gptoss: fix memory calc (#11700) · fcec04bf
      Michael Yang authored
    • ggml: Prevent kv cache quantization on gpt-oss · 8253ad4d
      Jesse Gross authored
      KV cache quantization has a dependency on the flash attention kernel.
      We currently cannot use flash attention with gpt-oss as it requires
      additional operations.
      
      The model definition does not call flash attention, so it works
      regardless of the setting but the cache will pick up the
      quantization type. This updates the flash attention setting earlier
      in the loading flow so that all downstream settings are also set correctly.
      
      Fixes: #11671
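The ordering fix described above amounts to deciding flash attention first and only then deriving the KV cache type, so a quantized cache is never carried downstream without it. A hedged sketch with illustrative names:

```go
package main

import "fmt"

// effectiveKVCacheType is a hypothetical sketch: if flash attention is
// unavailable for the model, any requested quantized cache type falls
// back to f16 before downstream settings are derived from it.
func effectiveKVCacheType(flashAttnEnabled bool, requested string) string {
	if !flashAttnEnabled && requested != "f16" {
		return "f16" // quantized KV cache depends on the flash attention kernel
	}
	return requested
}

func main() {
	fmt.Println(effectiveKVCacheType(false, "q8_0")) // f16
}
```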
    • gpt-oss (#11672) · fa7776fd
      Michael Yang authored
      
      
      * bf16
      
      * tests
      
      * gpt-oss
      
      * enable gptoss for engine
      
      * rough estimate
      
      * convert to mxfp4
      
      * handle safetensors U8
      
      * clamp glu/linear
      
      * update tokenizer
      
      * MXFP4 support
      
      This implements the Open Compute Microscaling (MX) FP4 format
      as a tensor type with backend implementations focusing
      on mulmat and mulmatid on CPU, CUDA, and Metal.
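As a rough illustration of the MX FP4 layout (per the OCP Microscaling spec: blocks of 32 four-bit E2M1 elements sharing one E8M0 power-of-two scale byte), here is a hedged decoder sketch. This is not the code from this commit, just the format's arithmetic:

```go
package main

import (
	"fmt"
	"math"
)

// e2m1 magnitudes for the 3 value bits of an FP4 E2M1 element.
var e2m1 = [8]float64{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// decodeMXFP4 decodes one nibble per element against a shared E8M0
// scale (a biased power-of-two exponent with no mantissa). The top bit
// of each nibble is the sign. Illustrative, not the commit's kernels.
func decodeMXFP4(scale byte, nibbles []byte) []float64 {
	s := math.Pow(2, float64(int(scale))-127)
	out := make([]float64, len(nibbles))
	for i, n := range nibbles {
		v := e2m1[n&0x7]
		if n&0x8 != 0 {
			v = -v
		}
		out[i] = v * s
	}
	return out
}

func main() {
	fmt.Println(decodeMXFP4(127, []byte{0x1, 0x9, 0x7})) // [0.5 -0.5 6]
}
```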
      
      * Unit tests for MXFP4 support
      
      This exercises various operations and shapes on both CPU and GPU (if detected
      on the system)
      
      * cuda graph
      
      * unit test adjustments
      
      * cuda: optimize memory access
      
      Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
      
      * mac: fix crash on old macos versions
      
      cblas_sgemm is only supported on v13.3 and up; however, bf16 is
      only supported on v14+, so we were falling back to ggml-blas and
      crashing on bf16 tensors. Checking for the function being null
      seems to be the simplest way to conditionally avoid registering the
      backend.
      
      * server: Minimum context length for gptoss
      
      This model requires a minimum context length of 8192 to function
      effectively. Users can set higher values through all normal mechanisms
      but lower values will be silently reset.
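The silent reset described above amounts to a simple clamp; only the 8192 minimum comes from the commit message, the helper name is hypothetical:

```go
package main

import "fmt"

// clampGptossContext sketches the silent reset: values below the
// model's minimum context length are raised to it, higher values pass.
func clampGptossContext(requested int) int {
	const minCtx = 8192 // minimum for gptoss per the commit message
	if requested < minCtx {
		return minCtx
	}
	return requested
}

func main() {
	fmt.Println(clampGptossContext(2048)) // 8192
}
```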
      
      * ggml: Multiply by numParallel for gptoss sliding window
      
      When computing the graph size estimate, the context size is already
      multiplied by numParallel so estimates reflect that. However, since
      sliding window models use a smaller, fixed context size, they need
      to manually take numParallel into account.
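In other words, a sliding-window layer must scale its fixed window by numParallel itself, since it does not inherit the multiplier already baked into the context size. Schematically, with a hypothetical helper:

```go
package main

import "fmt"

// effectiveWindow sketches the fix: the graph estimate for a
// sliding-window layer uses window * numParallel, mirroring how the
// full context size is already multiplied by numParallel upstream.
func effectiveWindow(slidingWindow, numParallel int) int {
	return slidingWindow * numParallel
}

func main() {
	fmt.Println(effectiveWindow(128, 4)) // 512
}
```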
      
      * gpt-oss integration
      
      includes harmony parser and thinking levels, etc.
      
      * fix sync
      
      * fix tests
      
      * fix lint
      
      ---------
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
      Co-authored-by: Jesse Gross <jesse@ollama.com>
      Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
  26. 26 Jun, 2025 3 commits
  27. 20 Jun, 2025 1 commit
  28. 18 Jun, 2025 1 commit
  29. 16 Jun, 2025 1 commit