Unverified Commit 1a19df1f authored by Michael Yang, committed by GitHub

update vendored llama.cpp and ggml (#11823)

* TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch

This will be redone once my branch is merged upstream in llama.cpp

* feat: Update all patches

There are a number that are no longer needed at all:

- 0003-embeddings: Embeddings entirely overhauled on master
- 0008-ensure-KV-cache-is-fully-defragmented: KV caching entirely
    overhauled on master
- 0019-metal-add-mean-kernel-14267: Merged upstream
- 0020-CUDA-add-mean-operation-14313: Merged upstream

* feat: Sync llama.cpp and ggml

* fix: Update rsync-filter for all moved/new/removed files

* fix: Add files missing from sync

* fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs

* fix: Add ggml files missing from sync

* fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files

* fix: Remove mtmd main cpp files

* fix: Add missing include in sampling_ext.cpp

* fix: Update llama.go to use mtmd instead of clip/llava

* fix: Add patch for mtmd_input_text

* chore: Ignore *.patched in the patch directory

* fix: Fix support for arch-specific ggml-cpu source files with new arrangement

In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific
implementations were split out into a nested tree structure under
ggml-cpu/arch. This conflicts with standard CGO layout where all
arch-specific source files are expected to live in the same directory as
the parent go module and use suffixes based on GOOS and GOARCH. As such,
there were really two options for getting this to work:

1. Add a patch on top of the GGML sync to rearrange the files to match the
GO layout convention
2. Use CGO directives to conditionally include the nested source files in
the compilation units

This commit does (2) in order to minimize the set of changes needed on top
of the upstream file layout. To get this to work, there are two key things
needed:

1. In cpu.go, #cgo directives are added to explicitly set __${GOARCH}__ in
the preprocessor directives
2. In arch-impls.c|cpp, use an #ifdef | #elif defined | #endif chain to
explicitly include the .c|.cpp files for the given architecture from the
nested directory (see the sketch below)
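
For illustration, a minimal sketch of the pattern, assuming hypothetical file
names under ggml-cpu/arch (the real tree has more files per architecture):

    // cpu.go (cgo preamble, sketched):
    //   #cgo amd64 CFLAGS: -D__amd64__
    //   #cgo arm64 CFLAGS: -D__arm64__

    // arch-impls.c: include the nested arch-specific sources for the
    // architecture selected by the cgo directives above.
    #if defined(__amd64__)
    #include "ggml-cpu/arch/x86/quants.c"
    #elif defined(__arm64__)
    #include "ggml-cpu/arch/arm/quants.c"
    #endif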

* fix: Use mtmd_helper to correctly load the bitmap for the image

* fix: Apply patch for mtmd_text_input

* fix: Add missing stb to llama.cpp rsync-filter

* fix: Add sync'ed stb vendored header

* fix: Use c++17 and include vendor for go wrapper modules

* fix: Update patch 0015 for upstream implementation of uuid

* feat: Bump to the latest tip of the branch

* fix: Update patches for bump

* feat: Bump back to the central repo and point at the latest master

This includes granite 4 and a number of other model architectures!

* fix: Revert changes to ggml export GPU UUID patch

* fix: Add patch for GGML_VERSION and GGML_COMMIT constants

* feat: Sync all patched code

* build: Include cmake/common.cmake in ggml sync

* build: Add top-level include for GNUInstallDirs in CMakeLists.txt

This is used to populate CMAKE_INSTALL_BINDIR
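
A minimal sketch of the intent (the target name here is hypothetical):

    include(GNUInstallDirs)  # defines CMAKE_INSTALL_BINDIR (typically "bin")
    install(TARGETS my-tool RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})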

* fix: Add a patch to avoid power throttling API on non-msvc windows builds

* fix: Sync patch changes for ggml-cpu.c

* feat: Bump llama.cpp to 4a4f42

This picks up support for Kimi K2 and PLaMO-2

* feat: Sync llama.cpp

* fix: Handle multi-chunk image encodings from mtmd

* fix: Re-number patches after merge with `main`

* feat: Bump to 41e78c in the makefile

* fix: Fix Solar and argsort/copy patches after bump

* fix: Remove Gemma3n CUDA Graphs patch

It was implemented upstream:
https://github.com/ggml-org/llama.cpp/pull/14741

* feat: Sync llama.cpp / ggml after latest bump

* build: Remove unnecessary CFLAGS definitions in cpu.go

* fix: Remove unnecessary additions in the rsync-filter

* fix: Remove unused vendored code for chat template parsing

* Revert "fix: Remove Gemma3n CUDA Graphs patch"

This reverts commit d724caced3ce21f08924d4b7801f94ce6638f6ea.

* fix: Update 0020 CUDA Graphs for gemma3n to keep both llama.cpp and ollama fixes

https://github.com/ollama/ollama/pull/11195#issuecomment-3137312394



* fix: Sync ggml-cuda.cu after keeping both styles of cuda graph fixes for gemma3n

* unwind mxfp4 patch

Prepare to bump ggml with their impl for mxfp4

* bump

* fix windows build error

* Convert tensors at load time

Repack the mxfp4 tensors as ggml's kernels expect them to be.

* convert mlp bf16 to f32

* buffer the conversion better

* reshape earlier

* openai swiglu

* add ids

* split qkv, gate_up

* fix nested alt tags

* fast attention

* remove debug messages

* fix lint

* remove redundant test

* remap values only if source/target are different

* add back i32->i32 copy

* refactor cpu quants

* clean up vendor

* update patch instructions

* clean up patches

* remove webgpu

* update mem

* also handle gpt-oss

* revert convert changes

---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
parent 7ccfd97a
......@@ -8,22 +8,22 @@ Subject: [PATCH] add phony target ggml-cpu for all cpu variants
1 file changed, 2 insertions(+)
diff --git a/ggml/src/CMakeLists.txt b/ggml/src/CMakeLists.txt
index ddea5ad3..45918bf6 100644
index 177fb282..f5a5079a 100644
--- a/ggml/src/CMakeLists.txt
+++ b/ggml/src/CMakeLists.txt
@@ -279,6 +279,7 @@ function(ggml_add_cpu_backend_variant tag_name)
endforeach()
@@ -304,6 +304,7 @@ function(ggml_add_cpu_backend_variant tag_name)
endif()
ggml_add_cpu_backend_variant_impl(${tag_name})
+ add_dependencies(ggml-cpu ggml-cpu-${tag_name})
endfunction()
ggml_add_backend(CPU)
@@ -287,6 +288,7 @@ if (GGML_CPU_ALL_VARIANTS)
if (NOT GGML_BACKEND_DL)
message(FATAL_ERROR "GGML_CPU_ALL_VARIANTS requires GGML_BACKEND_DL")
@@ -314,6 +315,7 @@ if (GGML_CPU_ALL_VARIANTS)
elseif (GGML_CPU_ARM_ARCH)
message(FATAL_ERROR "Cannot use both GGML_CPU_ARM_ARCH and GGML_CPU_ALL_VARIANTS")
endif()
+ add_custom_target(ggml-cpu)
ggml_add_cpu_backend_variant(x64)
ggml_add_cpu_backend_variant(sse42 SSE42)
ggml_add_cpu_backend_variant(sandybridge SSE42 AVX)
if (GGML_SYSTEM_ARCH STREQUAL "x86")
ggml_add_cpu_backend_variant(x64)
ggml_add_cpu_backend_variant(sse42 SSE42)
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: jmorganca <jmorganca@gmail.com>
Date: Tue, 15 Apr 2025 14:27:40 -0400
Subject: [PATCH] ensure KV cache is fully defragmented
Sometimes the KV cache requires defragmentation even without
triggering the threshold heuristic. In this case, decoding
will not be able to find a KV cache slot. This is particularly
difficult for the caller to handle if it happens in between
ubatches. To avoid this, we should immediately trigger a defrag.
In addition, a heavily fragmented cache can require more than
max_moves to defragment. Currently, we stop when we hit the limit
but this can leave a cache that still does not have adequate space
even after defragmentation is triggered. Instead, we should do
multiple batches of processing until everything is complete.
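
The shape of that change, as a standalone sketch (names mirror the diff
below; illustrative, not the applied code):

    // process the prepared moves in chunks so a single defrag graph never
    // exceeds max_moves; loop until every move has been applied
    for (std::size_t i = 0; i < defrag_info.moves.size(); i += max_moves) {
        const auto end = std::min(i + max_moves, defrag_info.moves.size());
        std::vector<llama_kv_defrag_move> chunk(defrag_info.moves.begin() + i,
                                                defrag_info.moves.begin() + end);
        // build and compute one defrag graph for this chunk ...
    }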
---
src/llama-context.cpp | 18 ++++---
src/llama-context.h | 1 +
src/llama-kv-cache.cpp | 107 ++++++++++++++---------------------------
src/llama-kv-cache.h | 12 ++++-
4 files changed, 59 insertions(+), 79 deletions(-)
diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index dca22d8b..1f3a3956 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -947,9 +947,12 @@ int llama_context::decode(llama_batch & inp_batch) {
// find KV slot
if (!kv_self->find_slot(ubatch)) {
- LLAMA_LOG_WARN("%s: failed to find KV cache slot for ubatch of size %d\n", __func__, ubatch.n_tokens);
-
- return 1;
+ kv_self->defrag_sched(-1.0f);
+ kv_self->update(*this);
+ if (!kv_self->find_slot(ubatch)) {
+ LLAMA_LOG_WARN("%s: failed to find KV cache slot for ubatch of size %d\n", __func__, ubatch.n_tokens);
+ return 1;
+ }
}
ggml_backend_sched_reset(sched.get());
@@ -1965,9 +1968,12 @@ void llama_context::opt_epoch_iter(
// TODO: not sure if this is needed
if (!kv_self->find_slot(ubatch)) {
- LLAMA_LOG_WARN("%s: failed to find KV cache slot for ubatch of size %d\n", __func__, ubatch.n_tokens);
-
- GGML_ABORT("TODO: handle this error");
+ kv_self->defrag_sched(-1.0f);
+ kv_self->update(*this);
+ if (!kv_self->find_slot(ubatch)) {
+ LLAMA_LOG_WARN("%s: failed to find KV cache slot for ubatch of size %d\n", __func__, ubatch.n_tokens);
+ GGML_ABORT("TODO: handle this error");
+ }
}
auto * gf = graph_init();
diff --git a/src/llama-context.h b/src/llama-context.h
index c0ceacb1..0264e937 100644
--- a/src/llama-context.h
+++ b/src/llama-context.h
@@ -5,6 +5,7 @@
#include "llama-cparams.h"
#include "llama-graph.h"
#include "llama-adapter.h"
+#include "llama-kv-cache.h"
#include "ggml-cpp.h"
#include "ggml-opt.h"
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 3dcad65b..60e67b03 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -364,8 +364,6 @@ void llama_kv_cache_unified::commit() {
}
bool llama_kv_cache_unified::update(llama_context & lctx) {
- bool need_reserve = false;
-
auto * sched = lctx.get_sched();
if (has_shift) {
@@ -388,8 +386,6 @@ bool llama_kv_cache_unified::update(llama_context & lctx) {
res->set_inputs(nullptr);
lctx.graph_compute(gf, false);
-
- need_reserve = true;
}
{
@@ -403,27 +399,36 @@ bool llama_kv_cache_unified::update(llama_context & lctx) {
if (do_defrag) {
LLAMA_LOG_DEBUG("%s: defragmenting KV cache\n", __func__);
+ const uint32_t n_max_nodes = lctx.graph_max_nodes();
+ const uint32_t max_moves = (n_max_nodes - 2*model.hparams.n_layer)/(6*model.hparams.n_layer);
+ if (!defrag_prepare(n_max_nodes)) {
+ LLAMA_LOG_ERROR("%s: failed to prepare defragmentation\n", __func__);
+ return false;
+ }
+
+ for (std::size_t i = 0; i < defrag_info.moves.size(); i += max_moves) {
+ std::vector<struct llama_kv_defrag_move> chunk;
+ auto end = std::min(i + max_moves, defrag_info.moves.size());
+ chunk.assign(defrag_info.moves.begin() + i, defrag_info.moves.begin() + end);
- if (defrag_prepare(lctx.graph_max_nodes())) {
ggml_backend_sched_reset(sched);
auto * gf = lctx.graph_init();
- auto res = build_graph_defrag(lctx.get_cparams(), lctx.get_ctx_compute(), gf);
+ auto res = build_graph_defrag(lctx.get_cparams(), lctx.get_ctx_compute(), gf, chunk);
ggml_backend_sched_alloc_graph(sched, gf);
res->set_inputs(nullptr);
lctx.graph_compute(gf, false);
-
- need_reserve = true;
}
do_defrag = false;
}
- return need_reserve;
+ // we never need to reserve a worst case graph
+ return false;
}
void llama_kv_cache_unified::defrag_sched(float thold) {
@@ -707,11 +712,10 @@ llm_graph_result_ptr llama_kv_cache_unified::build_graph_shift(
llm_graph_result_ptr llama_kv_cache_unified::build_graph_defrag(
const llama_cparams & cparams,
ggml_context * ctx,
- ggml_cgraph * gf) const {
+ ggml_cgraph * gf,
+ const std::vector<struct llama_kv_defrag_move> & moves) const {
auto res = std::make_unique<llm_graph_result>();
- const auto & ids = defrag_info.ids;
-
#if 0
// CPU defrag
//
@@ -783,32 +787,20 @@ llm_graph_result_ptr llama_kv_cache_unified::build_graph_defrag(
ggml_backend_tensor_set(v_l[il], buf_v.data(), 0, buf_v.size());
}
#else
- for (uint32_t i = 0; i < ids.size(); ++i) {
- const uint32_t id = ids[i];
-
- if (i == id || id == ids.size()) {
- continue;
- }
-
- uint32_t nm = 1;
-
- while (i + nm < ids.size() && ids[i + nm] == id + nm) {
- nm++;
- }
-
+ for (const auto & move : moves) {
for (uint32_t il = 0; il < hparams.n_layer; ++il) { // NOLINT
const int64_t n_embd_k_gqa = hparams.n_embd_k_gqa(il);
const int64_t n_embd_v_gqa = hparams.n_embd_v_gqa(il);
ggml_tensor * view_k_src = ggml_view_2d(ctx, k_l[il],
- n_embd_k_gqa, nm,
+ n_embd_k_gqa, move.len,
ggml_row_size(k_l[il]->type, n_embd_k_gqa),
- ggml_row_size(k_l[il]->type, n_embd_k_gqa*i));
+ ggml_row_size(k_l[il]->type, n_embd_k_gqa*move.src));
ggml_tensor * view_k_dst = ggml_view_2d(ctx, k_l[il],
- n_embd_k_gqa, nm,
+ n_embd_k_gqa, move.len,
ggml_row_size(k_l[il]->type, n_embd_k_gqa),
- ggml_row_size(k_l[il]->type, n_embd_k_gqa*id));
+ ggml_row_size(k_l[il]->type, n_embd_k_gqa*move.dst));
ggml_tensor * view_v_src;
ggml_tensor * view_v_dst;
@@ -816,31 +808,29 @@ llm_graph_result_ptr llama_kv_cache_unified::build_graph_defrag(
if (cparams.flash_attn) {
// NOTE: the V cache is not transposed when using flash attention
view_v_src = ggml_view_2d(ctx, v_l[il],
- n_embd_v_gqa, nm,
+ n_embd_v_gqa, move.len,
ggml_row_size(v_l[il]->type, n_embd_v_gqa),
- ggml_row_size(v_l[il]->type, n_embd_v_gqa*i));
+ ggml_row_size(v_l[il]->type, n_embd_v_gqa*move.dst));
view_v_dst = ggml_view_2d(ctx, v_l[il],
- n_embd_v_gqa, nm,
+ move.len, n_embd_v_gqa,
ggml_row_size(v_l[il]->type, n_embd_v_gqa),
- ggml_row_size(v_l[il]->type, n_embd_v_gqa*id));
+ ggml_row_size(v_l[il]->type, move.src));
} else {
view_v_src = ggml_view_2d(ctx, v_l[il],
- nm, n_embd_v_gqa,
+ move.len, n_embd_v_gqa,
ggml_row_size(v_l[il]->type, size),
- ggml_row_size(v_l[il]->type, i));
+ ggml_row_size(v_l[il]->type, move.src));
view_v_dst = ggml_view_2d(ctx, v_l[il],
- nm, n_embd_v_gqa,
+ move.len, n_embd_v_gqa,
ggml_row_size(v_l[il]->type, size),
- ggml_row_size(v_l[il]->type, id));
+ ggml_row_size(v_l[il]->type, move.dst));
}
ggml_build_forward_expand(gf, ggml_cpy(ctx, view_k_src, view_k_dst));
ggml_build_forward_expand(gf, ggml_cpy(ctx, view_v_src, view_v_dst));
}
-
- i += nm - 1;
}
//LLAMA_LOG_INFO("gf->n_nodes = %d\n", gf->n_nodes);
@@ -857,17 +847,7 @@ bool llama_kv_cache_unified::defrag_prepare(int32_t n_max_nodes) {
assert(n_used <= n_kv);
- //const int64_t t_start = ggml_time_us();
-
- // number of cells moved
- uint32_t n_moves = 0;
-
- // each move requires 6*n_layer tensors (see graph_build_kv_self_defrag)
- // - source view, destination view, copy operation
- // - x2 for keys and values
- //const uint32_t max_moves = max_nodes()/(6*n_layer);
- // TODO: tmp fix https://github.com/ggerganov/llama.cpp/issues/6685#issuecomment-2057579516
- const uint32_t max_moves = (n_max_nodes - 2*n_layer)/(6*n_layer);
+ defrag_info.moves.clear();
// determine which KV cells to move where
//
@@ -875,10 +855,7 @@ bool llama_kv_cache_unified::defrag_prepare(int32_t n_max_nodes) {
//
// if ids[i] == i || ids[i] == n_kv, then cell i is not moved
//
- auto & ids = defrag_info.ids;
-
- ids.clear();
- ids.resize(n_kv, n_kv);
+ std::vector<uint32_t> ids(n_kv, n_kv);
for (uint32_t i0 = 0; i0 < n_used; ++i0) {
const auto & cell0 = cells[i0];
@@ -927,19 +904,11 @@ bool llama_kv_cache_unified::defrag_prepare(int32_t n_max_nodes) {
// are we moving a continuous block of memory?
bool cont = false;
- // should we stop searching for the next move?
- bool stop = false;
-
// go back and move the nf cells to the hole
for (; i1 < n_kv; ++i1) {
auto & cell1 = cells[i1];
if (cell1.is_empty() || ids[i1] != n_kv) {
- if (n_moves == max_moves) {
- stop = true;
- break;
- }
-
cont = false;
continue;
}
@@ -955,8 +924,10 @@ bool llama_kv_cache_unified::defrag_prepare(int32_t n_max_nodes) {
head = n_used;
if (!cont) {
- n_moves++;
+ defrag_info.moves.push_back({i1, i0 + nf, 1});
cont = true;
+ } else {
+ defrag_info.moves.back().len++;
}
nf++;
@@ -966,22 +937,16 @@ bool llama_kv_cache_unified::defrag_prepare(int32_t n_max_nodes) {
}
}
- if (stop || n_moves == max_moves) {
- break;
- }
-
//LLAMA_LOG_INFO("(tmp log) KV defrag: move [%u, %u) to [%u, %u)\n", is, i1 + 1, i0, i0 + nh);
i0 += nh - 1;
}
- if (n_moves == 0) {
+ if (defrag_info.moves.size() == 0) {
return false;
}
- LLAMA_LOG_DEBUG("%s: (tmp log) KV defrag cell moves: %u\n", __func__, n_moves);
-
- LLAMA_LOG_DEBUG("%s: expected gf nodes: %u\n", __func__, 6*n_moves*n_layer);
+ // LLAMA_LOG_DEBUG("(tmp log) KV defrag cell moves: %u\n", n_moves);
return true;
}
diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
index bf3b4b6a..928b9712 100644
--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
@@ -82,6 +82,13 @@ struct llama_kv_cache_guard {
private:
llama_kv_cache * kv;
};
+
+// block of KV slots to move when defragging
+struct llama_kv_defrag_move {
+ uint32_t src;
+ uint32_t dst;
+ uint32_t len;
+};
//
// llama_kv_cache_unified
@@ -207,7 +214,7 @@ private:
// defrag
struct {
- std::vector<uint32_t> ids;
+ std::vector<llama_kv_defrag_move> moves;
} defrag_info;
// return true if cells have been moved
@@ -249,7 +256,8 @@ private:
llm_graph_result_ptr build_graph_defrag(
const llama_cparams & cparams,
ggml_context * ctx,
- ggml_cgraph * gf) const;
+ ggml_cgraph * gf,
+ const std::vector<llama_kv_defrag_move> & moves) const;
void state_write_meta(llama_io_write_i & io, const std::vector<std::pair<uint32_t, uint32_t>> & cell_ranges, llama_seq_id seq_id = -1) const;
void state_write_data(llama_io_write_i & io, const std::vector<std::pair<uint32_t, uint32_t>> & cell_ranges) const;
......@@ -9,17 +9,17 @@ disable amx as it reduces performance on some systems
1 file changed, 4 deletions(-)
diff --git a/ggml/src/CMakeLists.txt b/ggml/src/CMakeLists.txt
index 45918bf6..0beaed86 100644
index f5a5079a..5158acd6 100644
--- a/ggml/src/CMakeLists.txt
+++ b/ggml/src/CMakeLists.txt
@@ -296,10 +296,6 @@ if (GGML_CPU_ALL_VARIANTS)
ggml_add_cpu_backend_variant(skylakex SSE42 AVX F16C AVX2 BMI2 FMA AVX512)
ggml_add_cpu_backend_variant(icelake SSE42 AVX F16C AVX2 BMI2 FMA AVX512 AVX512_VBMI AVX512_VNNI)
ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C AVX2 BMI2 FMA AVX_VNNI)
- if (NOT MSVC)
- # MSVC doesn't support AMX
- ggml_add_cpu_backend_variant(sapphirerapids SSE42 AVX F16C AVX2 BMI2 FMA AVX512 AVX512_VBMI AVX512_VNNI AVX512_BF16 AMX_TILE AMX_INT8)
- endif()
elseif (GGML_CPU)
ggml_add_cpu_backend_variant_impl("")
endif()
@@ -324,10 +324,6 @@ if (GGML_CPU_ALL_VARIANTS)
ggml_add_cpu_backend_variant(skylakex SSE42 AVX F16C AVX2 BMI2 FMA AVX512)
ggml_add_cpu_backend_variant(icelake SSE42 AVX F16C AVX2 BMI2 FMA AVX512 AVX512_VBMI AVX512_VNNI)
ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C AVX2 BMI2 FMA AVX_VNNI)
- if (NOT MSVC)
- # MSVC doesn't support AMX
- ggml_add_cpu_backend_variant(sapphirerapids SSE42 AVX F16C AVX2 BMI2 FMA AVX512 AVX512_VBMI AVX512_VNNI AVX512_BF16 AMX_TILE AMX_INT8)
- endif()
elseif(GGML_SYSTEM_ARCH STREQUAL "ARM")
if (CMAKE_SYSTEM_NAME MATCHES "Linux")
# Many of these features are optional so we build versions with popular
......@@ -25,10 +25,10 @@ index 79ee2020..3efb22f0 100644
// get ith C string from array with given key_id
GGML_API const char * gguf_get_arr_str (const struct gguf_context * ctx, int64_t key_id, size_t i);
diff --git a/ggml/src/gguf.cpp b/ggml/src/gguf.cpp
index 381a9c7d..e45b453d 100644
index 53504399..0f71d5f3 100644
--- a/ggml/src/gguf.cpp
+++ b/ggml/src/gguf.cpp
@@ -777,10 +777,14 @@ enum gguf_type gguf_get_arr_type(const struct gguf_context * ctx, int64_t key_id
@@ -805,10 +805,14 @@ enum gguf_type gguf_get_arr_type(const struct gguf_context * ctx, int64_t key_id
const void * gguf_get_arr_data(const struct gguf_context * ctx, int64_t key_id) {
GGML_ASSERT(key_id >= 0 && key_id < gguf_get_n_kv(ctx));
......@@ -44,7 +44,7 @@ index 381a9c7d..e45b453d 100644
const char * gguf_get_arr_str(const struct gguf_context * ctx, int64_t key_id, size_t i) {
GGML_ASSERT(key_id >= 0 && key_id < gguf_get_n_kv(ctx));
GGML_ASSERT(ctx->kv[key_id].get_type() == GGUF_TYPE_STRING);
@@ -874,7 +878,6 @@ const char * gguf_get_val_str(const struct gguf_context * ctx, int64_t key_id) {
@@ -902,7 +906,6 @@ const char * gguf_get_val_str(const struct gguf_context * ctx, int64_t key_id) {
const void * gguf_get_val_data(const struct gguf_context * ctx, int64_t key_id) {
GGML_ASSERT(key_id >= 0 && key_id < gguf_get_n_kv(ctx));
GGML_ASSERT(ctx->kv[key_id].get_ne() == 1);
......@@ -53,10 +53,10 @@ index 381a9c7d..e45b453d 100644
}
diff --git a/src/llama-vocab.cpp b/src/llama-vocab.cpp
index 10f34d33..9f5fd57b 100644
index c011008f..fa388b03 100644
--- a/src/llama-vocab.cpp
+++ b/src/llama-vocab.cpp
@@ -1469,9 +1469,7 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
@@ -1760,9 +1760,7 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
const int precompiled_charsmap_keyidx = gguf_find_key(ctx, kv(LLM_KV_TOKENIZER_PRECOMPILED_CHARSMAP).c_str());
if (precompiled_charsmap_keyidx != -1) {
const gguf_type pc_type = gguf_get_arr_type(ctx, precompiled_charsmap_keyidx);
......
......@@ -8,7 +8,7 @@ Subject: [PATCH] ollama debug tensor
1 file changed, 6 insertions(+)
diff --git a/ggml/src/ggml-cpu/ggml-cpu.c b/ggml/src/ggml-cpu/ggml-cpu.c
index a30e67f2..2462d2b8 100644
index d89cd8f4..a5689c18 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
@@ -15,6 +15,8 @@
......@@ -20,7 +20,7 @@ index a30e67f2..2462d2b8 100644
#if defined(_MSC_VER) || defined(__MINGW32__)
#include <malloc.h> // using malloc.h with MSC/MINGW
#elif !defined(__FreeBSD__) && !defined(__NetBSD__) && !defined(__OpenBSD__)
@@ -2841,6 +2843,10 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
@@ -2858,6 +2860,10 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
ggml_compute_forward(&params, node);
......
......@@ -10,7 +10,7 @@ Subject: [PATCH] add ollama vocab for grammar support
3 files changed, 58 insertions(+), 9 deletions(-)
diff --git a/src/llama-grammar.cpp b/src/llama-grammar.cpp
index 973b47ae..60d58236 100644
index bed706bb..b51cee09 100644
--- a/src/llama-grammar.cpp
+++ b/src/llama-grammar.cpp
@@ -907,6 +907,7 @@ llama_grammar_candidates llama_grammar_reject_candidates_for_stack(
......@@ -90,7 +90,7 @@ index 973b47ae..60d58236 100644
if (grammar.awaiting_trigger) {
if (std::find(grammar.trigger_tokens.begin(), grammar.trigger_tokens.end(), token) != grammar.trigger_tokens.end()) {
@@ -1191,13 +1200,14 @@ void llama_grammar_accept_impl(struct llama_grammar & grammar, llama_token token
@@ -1201,13 +1210,14 @@ void llama_grammar_accept_impl(struct llama_grammar & grammar, llama_token token
}
}
......@@ -107,7 +107,7 @@ index 973b47ae..60d58236 100644
}
llama_grammar_accept_str(grammar, piece);
@@ -1217,3 +1227,28 @@ void llama_grammar_accept_str(struct llama_grammar & grammar, const std::string
@@ -1227,3 +1237,28 @@ void llama_grammar_accept_str(struct llama_grammar & grammar, const std::string
throw std::runtime_error("Unexpected empty grammar stack after accepting piece: " + piece);
}
}
......@@ -184,7 +184,7 @@ index f8c291de..2a3a62db 100644
const char * grammar_root,
bool lazy,
diff --git a/src/llama-sampling.cpp b/src/llama-sampling.cpp
index 804b11e0..15a10ca8 100644
index bfbf5fa2..11f93f42 100644
--- a/src/llama-sampling.cpp
+++ b/src/llama-sampling.cpp
@@ -1466,7 +1466,7 @@ static void llama_sampler_grammar_reset(struct llama_sampler * smpl) {
......
......@@ -4,16 +4,17 @@ Date: Thu, 1 May 2025 13:45:12 -0700
Subject: [PATCH] add argsort and cuda copy for i32
---
ggml/src/ggml-cpu/ops.cpp | 43 ++++++++++++++
ggml/src/ggml-cuda/argsort.cu | 102 +++++++++++++++++++++++++++++++++-
ggml/src/ggml-cuda/cpy.cu | 49 ++++++++++++++++
3 files changed, 192 insertions(+), 2 deletions(-)
ggml/src/ggml-cpu/ops.cpp | 43 +++++++++++++
ggml/src/ggml-cuda/argsort.cu | 102 ++++++++++++++++++++++++++++++-
ggml/src/ggml-cuda/cpy-utils.cuh | 6 ++
ggml/src/ggml-cuda/cpy.cu | 43 +++++++++++++
4 files changed, 192 insertions(+), 2 deletions(-)
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 955fec59..654e2f28 100644
index 854f1c2b..a2924757 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -6822,6 +6822,45 @@ static void ggml_compute_forward_argsort_f32(
@@ -8146,6 +8146,45 @@ static void ggml_compute_forward_argsort_f32(
}
}
......@@ -59,7 +60,7 @@ index 955fec59..654e2f28 100644
void ggml_compute_forward_argsort(
const ggml_compute_params * params,
ggml_tensor * dst) {
@@ -6833,6 +6872,10 @@ void ggml_compute_forward_argsort(
@@ -8157,6 +8196,10 @@ void ggml_compute_forward_argsort(
{
ggml_compute_forward_argsort_f32(params, dst);
} break;
......@@ -194,84 +195,78 @@ index 607ded85..53b02634 100644
+ argsort_f32_i32_cuda(src0_d, (int *)dst_d, ncols, nrows, order, stream);
+ }
}
diff --git a/ggml/src/ggml-cuda/cpy.cu b/ggml/src/ggml-cuda/cpy.cu
index d027271f..4abd01d7 100644
--- a/ggml/src/ggml-cuda/cpy.cu
+++ b/ggml/src/ggml-cuda/cpy.cu
@@ -38,6 +38,13 @@ static __device__ void cpy_1_f16_f32(const char * cxi, char * cdsti) {
*dsti = *xi;
diff --git a/ggml/src/ggml-cuda/cpy-utils.cuh b/ggml/src/ggml-cuda/cpy-utils.cuh
index 410c12b7..b8e9e107 100644
--- a/ggml/src/ggml-cuda/cpy-utils.cuh
+++ b/ggml/src/ggml-cuda/cpy-utils.cuh
@@ -223,3 +223,9 @@ template<typename src_t, typename dst_t>
static __device__ void cpy_1_flt(const char * cxi, char * cdsti) {
convert_flt((const src_t *)cxi, (dst_t *)cdsti);
}
+static __device__ void cpy_1_i32_i32(const char * cxi, char * cdsti) {
+ const int32_t * xi = (const int32_t *) cxi;
+ int32_t * dsti = (int32_t *) cdsti;
+
+ *dsti = *xi;
+static __device__ void cpy_1_i32_i32(const char * cxi, char * cdsti) {
+ const int32_t * src = (const int32_t *)cxi;
+ int32_t * dst = (int32_t *)cdsti;
+ *dst = *src;
+}
+
template <cpy_kernel_t cpy_1>
static __global__ void cpy_f32_f16(const char * cx, char * cdst_direct, const int ne,
const int ne00, const int ne01, const int ne02, const int nb00, const int nb01, const int nb02,
@@ -68,6 +75,44 @@ static __global__ void cpy_f32_f16(const char * cx, char * cdst_direct, const in
cpy_1(cx + x_offset, cdst + dst_offset);
diff --git a/ggml/src/ggml-cuda/cpy.cu b/ggml/src/ggml-cuda/cpy.cu
index f9bb0256..9c3774e5 100644
--- a/ggml/src/ggml-cuda/cpy.cu
+++ b/ggml/src/ggml-cuda/cpy.cu
@@ -278,6 +278,47 @@ static void ggml_cpy_f32_iq4_nl_cuda(
(cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, cdst_indirect, graph_cpynode_index++);
}
+// First, add this template function after the other template functions
+template <cpy_kernel_t cpy_1>
+static __global__ void cpy_i32_i32(const char * cx, char * cdst, const int ne,
+ const int ne00, const int ne01, const int ne02, const int nb00, const int nb01, const int nb02,
+ const int nb03, const int ne10, const int ne11, const int ne12, const int nb10, const int nb11,
+ const int nb12, const int nb13) {
+ const int64_t i = blockDim.x*blockIdx.x + threadIdx.x;
+static __global__ void cpy_i32_i32(
+ const char *cx, char *cdst, const int ne,
+ const int ne00, const int ne01, const int ne02, const int nb00, const int nb01, const int nb02, const int nb03,
+ const int ne10, const int ne11, const int ne12, const int nb10, const int nb11, const int nb12, const int nb13,
+ cudaStream_t stream, char ** cdst_indirect, int & graph_cpynode_index) {
+
+ const int64_t i = blockDim.x * blockIdx.x + threadIdx.x;
+
+ if (i >= ne) {
+ return;
+ }
+
+ const int64_t i03 = i/(ne00 * ne01 * ne02);
+ const int64_t i02 = (i - i03*ne00*ne01*ne02 )/ (ne00*ne01);
+ const int64_t i01 = (i - i03*ne00*ne01*ne02 - i02*ne01*ne00) / ne00;
+ const int64_t i00 = i - i03*ne00*ne01*ne02 - i02*ne01*ne00 - i01*ne00;
+ const int64_t x_offset = i00*nb00 + i01*nb01 + i02*nb02 + i03 * nb03;
+ const int64_t i03 = i / (ne00 * ne01 * ne02);
+ const int64_t i02 = (i - i03 * ne00 * ne01 * ne02) / (ne00 * ne01);
+ const int64_t i01 = (i - i03 * ne00 * ne01 * ne02 - i02 * ne01 * ne00) / ne00;
+ const int64_t i00 = i - i03 * ne00 * ne01 * ne02 - i02 * ne01 * ne00 - i01 * ne00;
+ const int64_t x_offset = i00 * nb00 + i01 * nb01 + i02 * nb02 + i03 * nb03;
+
+ const int64_t i13 = i/(ne10 * ne11 * ne12);
+ const int64_t i12 = (i - i13*ne10*ne11*ne12) / (ne10*ne11);
+ const int64_t i11 = (i - i13*ne10*ne11*ne12 - i12*ne10*ne11) / ne10;
+ const int64_t i10 = i - i13*ne10*ne11*ne12 - i12*ne10*ne11 - i11*ne10;
+ const int64_t dst_offset = i10*nb10 + i11*nb11 + i12*nb12 + i13 * nb13;
+ const int64_t i13 = i / (ne10 * ne11 * ne12);
+ const int64_t i12 = (i - i13 * ne10 * ne11 * ne12) / (ne10 * ne11);
+ const int64_t i11 = (i - i13 * ne10 * ne11 * ne12 - i12 * ne10 * ne11) / ne10;
+ const int64_t i10 = i - i13 * ne10 * ne11 * ne12 - i12 * ne10 * ne11 - i11 * ne10;
+ const int64_t dst_offset = i10 * nb10 + i11 * nb11 + i12 * nb12 + i13 * nb13;
+
+ cpy_1(cx + x_offset, cdst + dst_offset);
+ char * cdst_ptr = (cdst_indirect != nullptr) ? cdst_indirect[graph_cpynode_index] : cdst;
+ cpy_1(cx + x_offset, cdst_ptr + dst_offset);
+}
+
+// Then modify the ggml_cpy_i32_i32_cuda function to use the new template
+
+static void ggml_cpy_i32_i32_cuda(
+ const char * cx, char * cdst, const int ne,
+ const int ne00, const int ne01, const int ne02, const int nb00, const int nb01, const int nb02,
+ const int nb03, const int ne10, const int ne11, const int ne12, const int nb10, const int nb11, const int nb12, const int nb13, cudaStream_t stream, char ** cdst_indirect, int graph_cpynode_index) {
+ const int ne00, const int ne01, const int ne02, const int nb00, const int nb01, const int nb02, const int nb03,
+ const int ne10, const int ne11, const int ne12, const int nb10, const int nb11, const int nb12, const int nb13,
+ cudaStream_t stream, char ** cdst_indirect, int graph_cpynode_index) {
+
+ const int num_blocks = (ne + CUDA_CPY_BLOCK_SIZE - 1) / CUDA_CPY_BLOCK_SIZE;
+ cpy_i32_i32<cpy_1_i32_i32><<<num_blocks, CUDA_CPY_BLOCK_SIZE, 0, stream>>>
+ (cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13);
+ (cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, stream, cdst_indirect, graph_cpynode_index);
+}
+
static __device__ void cpy_blck_f32_q8_0(const char * cxi, char * cdsti) {
const float * xi = (const float *) cxi;
block_q8_0 * dsti = (block_q8_0 *) cdsti;
@@ -633,6 +678,8 @@ void ggml_cuda_cpy(ggml_backend_cuda_context & ctx, const ggml_tensor * src0, gg
ggml_cpy_f16_f16_cuda (src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream, dest_ptrs_d, graph_cpynode_index);
void ggml_cuda_cpy(ggml_backend_cuda_context & ctx, const ggml_tensor * src0, ggml_tensor * src1, bool disable_indirection_for_this_node) {
const int64_t ne = ggml_nelements(src0);
GGML_ASSERT(ne == ggml_nelements(src1));
@@ -369,6 +410,8 @@ void ggml_cuda_cpy(ggml_backend_cuda_context & ctx, const ggml_tensor * src0, gg
ggml_cpy_flt_cuda<half, nv_bfloat16> (src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream, dest_ptrs_d, graph_cpynode_index);
} else if (src0->type == GGML_TYPE_F16 && src1->type == GGML_TYPE_F32) {
ggml_cpy_f16_f32_cuda (src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream, dest_ptrs_d, graph_cpynode_index);
ggml_cpy_flt_cuda<half, float> (src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream, dest_ptrs_d, graph_cpynode_index);
+ } else if (src0->type == GGML_TYPE_I32 && src1->type == GGML_TYPE_I32) {
+ ggml_cpy_i32_i32_cuda(src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream, dest_ptrs_d, graph_cpynode_index);
} else {
GGML_ABORT("%s: unsupported type combination (%s to %s)\n", __func__,
ggml_type_name(src0->type), ggml_type_name(src1->type));
@@ -688,6 +735,8 @@ void* ggml_cuda_cpy_fn(const ggml_tensor * src0, ggml_tensor * src1) {
return (void*) cpy_f32_f16<cpy_1_f32_f16>;
} else if (src0->type == GGML_TYPE_F16 && src1->type == GGML_TYPE_F32) {
return (void*) cpy_f32_f16<cpy_1_f16_f32>;
+ } else if (src0->type == GGML_TYPE_I32 && src1->type == GGML_TYPE_I32) {
+ return (void*) cpy_i32_i32<cpy_1_i32_i32>;
} else {
GGML_ABORT("%s: unsupported type combination (%s to %s)\n", __func__,
ggml_type_name(src0->type), ggml_type_name(src1->type));
} else if (src0->type == GGML_TYPE_BF16 && src1->type == GGML_TYPE_BF16) {
ggml_cpy_flt_cuda<nv_bfloat16, nv_bfloat16> (src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream, dest_ptrs_d, graph_cpynode_index);
} else if (src0->type == GGML_TYPE_BF16 && src1->type == GGML_TYPE_F16) {
......@@ -28,7 +28,7 @@ index 2cb150fd..781b1e10 100644
// Create a buffer and allocate all the tensors in a ggml_context
GGML_API struct ggml_backend_buffer * ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_context * ctx, ggml_backend_buffer_type_t buft);
diff --git a/ggml/include/ggml-backend.h b/ggml/include/ggml-backend.h
index 778927f6..74e46716 100644
index a2977ea2..8a91b381 100644
--- a/ggml/include/ggml-backend.h
+++ b/ggml/include/ggml-backend.h
@@ -304,6 +304,12 @@ extern "C" {
......@@ -45,10 +45,10 @@ index 778927f6..74e46716 100644
GGML_API ggml_backend_t ggml_backend_sched_get_tensor_backend(ggml_backend_sched_t sched, struct ggml_tensor * node);
diff --git a/ggml/src/ggml-alloc.c b/ggml/src/ggml-alloc.c
index 5fd379f6..04812990 100644
index 8b6e6028..41c8c4a2 100644
--- a/ggml/src/ggml-alloc.c
+++ b/ggml/src/ggml-alloc.c
@@ -364,6 +364,7 @@ struct node_alloc {
@@ -350,6 +350,7 @@ struct node_alloc {
struct ggml_gallocr {
ggml_backend_buffer_type_t * bufts; // [n_buffers]
ggml_backend_buffer_t * buffers; // [n_buffers]
......@@ -56,7 +56,7 @@ index 5fd379f6..04812990 100644
struct ggml_dyn_tallocr ** buf_tallocs; // [n_buffers]
int n_buffers;
@@ -387,6 +388,9 @@ ggml_gallocr_t ggml_gallocr_new_n(ggml_backend_buffer_type_t * bufts, int n_bufs
@@ -373,6 +374,9 @@ ggml_gallocr_t ggml_gallocr_new_n(ggml_backend_buffer_type_t * bufts, int n_bufs
galloc->buffers = calloc(n_bufs, sizeof(ggml_backend_buffer_t));
GGML_ASSERT(galloc->buffers != NULL);
......@@ -66,7 +66,7 @@ index 5fd379f6..04812990 100644
galloc->buf_tallocs = calloc(n_bufs, sizeof(struct ggml_dyn_tallocr *));
GGML_ASSERT(galloc->buf_tallocs != NULL);
@@ -453,6 +457,7 @@ void ggml_gallocr_free(ggml_gallocr_t galloc) {
@@ -439,6 +443,7 @@ void ggml_gallocr_free(ggml_gallocr_t galloc) {
ggml_hash_set_free(&galloc->hash_set);
free(galloc->hash_values);
free(galloc->bufts);
......@@ -74,7 +74,7 @@ index 5fd379f6..04812990 100644
free(galloc->buffers);
free(galloc->buf_tallocs);
free(galloc->node_allocs);
@@ -748,6 +753,8 @@ bool ggml_gallocr_reserve_n(ggml_gallocr_t galloc, struct ggml_cgraph * graph, c
@@ -734,6 +739,8 @@ bool ggml_gallocr_reserve_n(ggml_gallocr_t galloc, struct ggml_cgraph * graph, c
}
}
......@@ -83,7 +83,7 @@ index 5fd379f6..04812990 100644
// reallocate buffers if needed
for (int i = 0; i < galloc->n_buffers; i++) {
// if the buffer type is used multiple times, we reuse the same buffer
@@ -769,15 +776,20 @@ bool ggml_gallocr_reserve_n(ggml_gallocr_t galloc, struct ggml_cgraph * graph, c
@@ -755,15 +762,20 @@ bool ggml_gallocr_reserve_n(ggml_gallocr_t galloc, struct ggml_cgraph * graph, c
ggml_backend_buffer_free(galloc->buffers[i]);
galloc->buffers[i] = ggml_backend_buft_alloc_buffer(galloc->bufts[i], new_size);
......@@ -108,7 +108,7 @@ index 5fd379f6..04812990 100644
}
bool ggml_gallocr_reserve(ggml_gallocr_t galloc, struct ggml_cgraph *graph) {
@@ -934,6 +946,24 @@ size_t ggml_gallocr_get_buffer_size(ggml_gallocr_t galloc, int buffer_id) {
@@ -920,6 +932,24 @@ size_t ggml_gallocr_get_buffer_size(ggml_gallocr_t galloc, int buffer_id) {
return ggml_backend_buffer_get_size(galloc->buffers[buffer_id]);
}
......@@ -134,10 +134,10 @@ index 5fd379f6..04812990 100644
static void free_buffers(ggml_backend_buffer_t ** buffers, const size_t * n_buffers) {
diff --git a/ggml/src/ggml-backend.cpp b/ggml/src/ggml-backend.cpp
index 0ce73a99..be335e8c 100644
index 97f47abd..eded0291 100644
--- a/ggml/src/ggml-backend.cpp
+++ b/ggml/src/ggml-backend.cpp
@@ -1629,6 +1629,16 @@ size_t ggml_backend_sched_get_buffer_size(ggml_backend_sched_t sched, ggml_backe
@@ -1631,6 +1631,16 @@ size_t ggml_backend_sched_get_buffer_size(ggml_backend_sched_t sched, ggml_backe
return ggml_gallocr_get_buffer_size(sched->galloc, backend_index);
}
......
......@@ -12,7 +12,7 @@ with tools (e.g. nvidia-smi) and system management libraries (e.g. nvml).
3 files changed, 63 insertions(+), 6 deletions(-)
diff --git a/ggml/include/ggml-backend.h b/ggml/include/ggml-backend.h
index 74e467163..48839339d 100644
index 8a91b381..9424394e 100644
--- a/ggml/include/ggml-backend.h
+++ b/ggml/include/ggml-backend.h
@@ -152,6 +152,7 @@ extern "C" {
......@@ -24,17 +24,17 @@ index 74e467163..48839339d 100644
size_t memory_total;
enum ggml_backend_dev_type type;
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index cb0d8528d..1492368de 100644
index 37ee2a6d..57eae461 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -173,6 +173,51 @@ static int ggml_cuda_parse_id(char devName[]) {
@@ -179,6 +179,51 @@ static int ggml_cuda_parse_id(char devName[]) {
}
#endif // defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__)
#endif // defined(GGML_USE_HIP)
+static std::string ggml_cuda_parse_uuid(cudaDeviceProp prop, int device_num) {
+ char id[64];
+
+ #if !defined(GGML_USE_HIP)
+#if !defined(GGML_USE_HIP)
+ snprintf(id, sizeof(id),
+ "GPU-%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x",
+ (unsigned char)prop.uuid.bytes[0],
......@@ -54,10 +54,10 @@ index cb0d8528d..1492368de 100644
+ (unsigned char)prop.uuid.bytes[14],
+ (unsigned char)prop.uuid.bytes[15]
+ );
+ #else
+ #ifdef _WIN32
+#else
+#ifdef _WIN32
+ snprintf(id, sizeof(id), "%d", device_num);
+ #else
+#else
+ try {
+ std::string uuid = std::string(prop.uuid.bytes, 16);
+
......@@ -70,16 +70,16 @@ index cb0d8528d..1492368de 100644
+ } catch (const std::exception &e) {
+ snprintf(id, sizeof(id), "%d", device_num);
+ }
+ #endif
+ #endif
+#endif
+#endif
+
+ return id;
+}
+
static ggml_cuda_device_info ggml_cuda_init() {
#ifdef __HIP_PLATFORM_AMD__
#if defined(GGML_USE_HIP)
// Workaround for a rocBLAS bug when using multiple graphics cards:
@@ -261,22 +306,24 @@ static ggml_cuda_device_info ggml_cuda_init() {
@@ -267,22 +312,24 @@ static ggml_cuda_device_info ggml_cuda_init() {
info.devices[id].cc += prop.minor * 0x10;
}
}
......@@ -107,10 +107,10 @@ index cb0d8528d..1492368de 100644
+ GGML_LOG_INFO(" Device %d: %s, compute capability %d.%d, VMM: %s, ID: %s\n",
+ id, prop.name, prop.major, prop.minor, device_vmm ? "yes" : "no",
+ ggml_cuda_parse_uuid(prop, id).c_str());
#endif // defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__)
#endif // defined(GGML_USE_HIP)
}
@@ -2884,6 +2931,7 @@ struct ggml_backend_cuda_device_context {
@@ -3144,6 +3191,7 @@ struct ggml_backend_cuda_device_context {
int device;
std::string name;
std::string description;
......@@ -118,7 +118,7 @@ index cb0d8528d..1492368de 100644
};
static const char * ggml_backend_cuda_device_get_name(ggml_backend_dev_t dev) {
@@ -2896,6 +2944,11 @@ static const char * ggml_backend_cuda_device_get_description(ggml_backend_dev_t
@@ -3156,6 +3204,11 @@ static const char * ggml_backend_cuda_device_get_description(ggml_backend_dev_t
return ctx->description.c_str();
}
......@@ -130,7 +130,7 @@ index cb0d8528d..1492368de 100644
static void ggml_backend_cuda_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
ggml_backend_cuda_device_context * ctx = (ggml_backend_cuda_device_context *)dev->context;
ggml_cuda_set_device(ctx->device);
@@ -2910,6 +2963,7 @@ static enum ggml_backend_dev_type ggml_backend_cuda_device_get_type(ggml_backend
@@ -3170,6 +3223,7 @@ static enum ggml_backend_dev_type ggml_backend_cuda_device_get_type(ggml_backend
static void ggml_backend_cuda_device_get_props(ggml_backend_dev_t dev, ggml_backend_dev_props * props) {
props->name = ggml_backend_cuda_device_get_name(dev);
props->description = ggml_backend_cuda_device_get_description(dev);
......@@ -138,7 +138,7 @@ index cb0d8528d..1492368de 100644
props->type = ggml_backend_cuda_device_get_type(dev);
ggml_backend_cuda_device_get_memory(dev, &props->memory_free, &props->memory_total);
@@ -3457,6 +3511,7 @@ ggml_backend_reg_t ggml_backend_cuda_reg() {
@@ -3767,6 +3821,7 @@ ggml_backend_reg_t ggml_backend_cuda_reg() {
cudaDeviceProp prop;
CUDA_CHECK(cudaGetDeviceProperties(&prop, i));
dev_ctx->description = prop.name;
......@@ -147,10 +147,10 @@ index cb0d8528d..1492368de 100644
ggml_backend_dev_t dev = new ggml_backend_device {
/* .iface = */ ggml_backend_cuda_device_interface,
diff --git a/ggml/src/ggml-metal/ggml-metal.m b/ggml/src/ggml-metal/ggml-metal.m
index 1b56f858c..a9eeebc6a 100644
index 7bccc7bf..fe7b2f0a 100644
--- a/ggml/src/ggml-metal/ggml-metal.m
+++ b/ggml/src/ggml-metal/ggml-metal.m
@@ -5703,6 +5703,7 @@ static enum ggml_backend_dev_type ggml_backend_metal_device_get_type(ggml_backen
@@ -6522,6 +6522,7 @@ static enum ggml_backend_dev_type ggml_backend_metal_device_get_type(ggml_backen
static void ggml_backend_metal_device_get_props(ggml_backend_dev_t dev, struct ggml_backend_dev_props * props) {
props->name = ggml_backend_metal_device_get_name(dev);
props->description = ggml_backend_metal_device_get_description(dev);
......
......@@ -8,10 +8,10 @@ Subject: [PATCH] temporary prevent rocm+cuda mixed loading
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/ggml/src/ggml-backend-reg.cpp b/ggml/src/ggml-backend-reg.cpp
index 4e67d243..8f49f084 100644
index 3040b2aa..f1e9c180 100644
--- a/ggml/src/ggml-backend-reg.cpp
+++ b/ggml/src/ggml-backend-reg.cpp
@@ -573,8 +573,16 @@ void ggml_backend_load_all_from_path(const char * dir_path) {
@@ -581,8 +581,16 @@ void ggml_backend_load_all_from_path(const char * dir_path) {
ggml_backend_load_best("blas", silent, dir_path);
ggml_backend_load_best("cann", silent, dir_path);
......@@ -20,13 +20,13 @@ index 4e67d243..8f49f084 100644
+
+ // Avoid mixed hip+cuda configurations
+ const char * hip_devices = std::getenv("HIP_VISIBLE_DEVICES");
+ const char * rocr_devices = std::getenv("ROCR_VISIBLE_DEVICES");
+ const char * rocr_devices = std::getenv("ROCR_VISIBLE_DEVICES");
+ if (!hip_devices && !rocr_devices) {
+ ggml_backend_load_best("cuda", silent, dir_path);
+ } else {
+ ggml_backend_load_best("hip", silent, dir_path);
+ }
+
ggml_backend_load_best("kompute", silent, dir_path);
+
ggml_backend_load_best("metal", silent, dir_path);
ggml_backend_load_best("rpc", silent, dir_path);
ggml_backend_load_best("sycl", silent, dir_path);
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Gabe Goodhart <ghart@us.ibm.com>
Date: Tue, 24 Jun 2025 16:55:31 -0600
Subject: [PATCH] add C API for mtmd_input_text
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---
tools/mtmd/mtmd.cpp | 10 ++++++++++
tools/mtmd/mtmd.h | 3 +++
2 files changed, 13 insertions(+)
diff --git a/tools/mtmd/mtmd.cpp b/tools/mtmd/mtmd.cpp
index a05373d5..6f70f7f4 100644
--- a/tools/mtmd/mtmd.cpp
+++ b/tools/mtmd/mtmd.cpp
@@ -79,6 +79,16 @@ enum mtmd_slice_tmpl {
// TODO @ngxson : add support for idefics (SmolVLM)
};
+mtmd_input_text* mtmd_input_text_init(const char * text, bool add_special, bool parse_special) {
+ return new mtmd_input_text{text, add_special, parse_special};
+}
+
+void mtmd_input_text_free(mtmd_input_text* input_text) {
+ if (input_text) {
+ delete input_text;
+ }
+}
+
const char * mtmd_default_marker() {
return "<__media__>";
}
diff --git a/tools/mtmd/mtmd.h b/tools/mtmd/mtmd.h
index f4ea07d3..cf287224 100644
--- a/tools/mtmd/mtmd.h
+++ b/tools/mtmd/mtmd.h
@@ -75,6 +75,9 @@ typedef struct mtmd_input_chunk mtmd_input_chunk;
typedef struct mtmd_input_chunks mtmd_input_chunks;
typedef struct mtmd_input_text mtmd_input_text;
+MTMD_API mtmd_input_text* mtmd_input_text_init(const char * text, bool add_special, bool parse_special);
+MTMD_API void mtmd_input_text_free(mtmd_input_text* input_text);
+
struct mtmd_context_params {
bool use_gpu;
bool print_timings;
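
For context, a sketch of how a wrapper might use the new C API (usage
inferred from the signatures above; the tokenization step is omitted):

    mtmd_input_text * txt = mtmd_input_text_init("<__media__> describe this image",
                                                 /*add_special=*/true,
                                                 /*parse_special=*/true);
    // ... pass txt to mtmd tokenization from the Go wrapper ...
    mtmd_input_text_free(txt);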
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Gabe Goodhart <ghart@us.ibm.com>
Date: Fri, 11 Jul 2025 15:59:19 -0600
Subject: [PATCH] no power throttling win32 with gnuc
---
ggml/src/ggml-cpu/ggml-cpu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ggml/src/ggml-cpu/ggml-cpu.c b/ggml/src/ggml-cpu/ggml-cpu.c
index a5689c18..85af19a3 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
@@ -2412,7 +2412,7 @@ static bool ggml_thread_apply_priority(int32_t prio) {
// Newer Windows 11 versions aggressively park (offline) CPU cores and often place
// all our threads onto the first 4 cores which results in terrible performance with
// n_threads > 4
- #if _WIN32_WINNT >= 0x0602
+ #if (_WIN32_WINNT >= 0x0602) && !defined(__GNUC__)
THREAD_POWER_THROTTLING_STATE t;
ZeroMemory(&t, sizeof(t));
t.Version = THREAD_POWER_THROTTLING_CURRENT_VERSION;
......@@ -9,10 +9,10 @@ Only enable BF16 on supported MacOS versions (v14+)
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/ggml/src/ggml-metal/ggml-metal.m b/ggml/src/ggml-metal/ggml-metal.m
index 110c9ece..ab46f6e3 100644
index fe7b2f0a..e4c31268 100644
--- a/ggml/src/ggml-metal/ggml-metal.m
+++ b/ggml/src/ggml-metal/ggml-metal.m
@@ -89,7 +89,11 @@ static id<MTLDevice> ggml_backend_metal_device_acq(struct ggml_backend_metal_dev
@@ -106,7 +106,11 @@ static id<MTLDevice> ggml_backend_metal_device_acq(struct ggml_backend_metal_dev
ctx->has_bfloat |= [ctx->mtl_device supportsFamily:MTLGPUFamilyApple6];
#if defined(GGML_METAL_USE_BF16)
......
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Georgi Gerganov <ggerganov@gmail.com>
Date: Thu, 19 Jun 2025 08:05:21 +0300
Subject: [PATCH] metal : add mean kernel (#14267)
* metal : add mean kernel
ggml-ci
* cont : dedup implementation
ggml-ci
---
ggml/src/ggml-metal/ggml-metal.m | 33 ++++++++++++++++---
ggml/src/ggml-metal/ggml-metal.metal | 48 ++++++++++++++++++++++------
2 files changed, 67 insertions(+), 14 deletions(-)
diff --git a/ggml/src/ggml-metal/ggml-metal.m b/ggml/src/ggml-metal/ggml-metal.m
index a9eeebc6..110c9ece 100644
--- a/ggml/src/ggml-metal/ggml-metal.m
+++ b/ggml/src/ggml-metal/ggml-metal.m
@@ -489,6 +489,7 @@ enum ggml_metal_kernel_type {
GGML_METAL_KERNEL_TYPE_COS,
GGML_METAL_KERNEL_TYPE_NEG,
GGML_METAL_KERNEL_TYPE_SUM_ROWS,
+ GGML_METAL_KERNEL_TYPE_MEAN,
GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32,
GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32,
GGML_METAL_KERNEL_TYPE_ARGMAX,
@@ -1436,6 +1437,7 @@ static struct ggml_backend_metal_context * ggml_metal_init(ggml_backend_dev_t de
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_COS, cos, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_NEG, neg, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_SUM_ROWS, sum_rows, true);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MEAN, mean, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_ARGMAX, argmax, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32, pool_2d_avg_f32, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32, pool_2d_max_f32, true);
@@ -1634,6 +1636,7 @@ static bool ggml_metal_supports_op(const struct ggml_backend_metal_device_contex
case GGML_OP_LOG:
return false; // TODO: implement
case GGML_OP_SUM_ROWS:
+ case GGML_OP_MEAN:
case GGML_OP_SOFT_MAX:
case GGML_OP_GROUP_NORM:
return has_simdgroup_reduction && ggml_is_contiguous(op->src[0]);
@@ -2362,11 +2365,30 @@ static bool ggml_metal_encode_node(
[encoder dispatchThreadgroups:MTLSizeMake(n, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
} break;
case GGML_OP_SUM_ROWS:
+ case GGML_OP_MEAN:
{
GGML_ASSERT(src0->nb[0] == ggml_type_size(src0->type));
- id<MTLComputePipelineState> pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_SUM_ROWS].pipeline;
+ id<MTLComputePipelineState> pipeline = nil;
+
+ switch (dst->op) {
+ case GGML_OP_SUM_ROWS:
+ pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_SUM_ROWS].pipeline;
+ break;
+ case GGML_OP_MEAN:
+ pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MEAN].pipeline;
+ break;
+ default:
+ GGML_ABORT("fatal error");
+ }
+
+ int nth = 32; // SIMD width
+
+ while (nth < ne00 && nth < (int) pipeline.maxTotalThreadsPerThreadgroup) {
+ nth *= 2;
+ }
+ nth = MIN(nth, ne00);
ggml_metal_kargs_sum_rows args = {
/*.ne00 =*/ ne00,
@@ -2396,11 +2418,12 @@ static bool ggml_metal_encode_node(
};
[encoder setComputePipelineState:pipeline];
- [encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
- [encoder setBuffer:id_dst offset:offs_dst atIndex:1];
- [encoder setBytes:&args length:sizeof(args) atIndex:2];
+ [encoder setBytes:&args length:sizeof(args) atIndex:0];
+ [encoder setBuffer:id_src0 offset:offs_src0 atIndex:1];
+ [encoder setBuffer:id_dst offset:offs_dst atIndex:2];
+ [encoder setThreadgroupMemoryLength:32*sizeof(float) atIndex:0];
- [encoder dispatchThreadgroups:MTLSizeMake(ne01, ne02, ne03) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
+ [encoder dispatchThreadgroups:MTLSizeMake(ne01, ne02, ne03) threadsPerThreadgroup:MTLSizeMake(nth, 1, 1)];
} break;
case GGML_OP_SOFT_MAX:
{
diff --git a/ggml/src/ggml-metal/ggml-metal.metal b/ggml/src/ggml-metal/ggml-metal.metal
index 9cfddf45..08e8d807 100644
--- a/ggml/src/ggml-metal/ggml-metal.metal
+++ b/ggml/src/ggml-metal/ggml-metal.metal
@@ -956,31 +956,61 @@ kernel void kernel_neg(
dst[tpig] = -src0[tpig];
}
+template <bool norm>
kernel void kernel_sum_rows(
+ constant ggml_metal_kargs_sum_rows & args,
device const float * src0,
device float * dst,
- constant ggml_metal_kargs_sum_rows & args,
- uint3 tpig[[thread_position_in_grid]]) {
- int64_t i3 = tpig.z;
- int64_t i2 = tpig.y;
- int64_t i1 = tpig.x;
+ threadgroup float * shmem_f32 [[threadgroup(0)]],
+ uint3 tgpig[[threadgroup_position_in_grid]],
+ ushort3 tpitg[[thread_position_in_threadgroup]],
+ ushort sgitg[[simdgroup_index_in_threadgroup]],
+ ushort tiisg[[thread_index_in_simdgroup]],
+ ushort3 ntg[[threads_per_threadgroup]]) {
+ int64_t i3 = tgpig.z;
+ int64_t i2 = tgpig.y;
+ int64_t i1 = tgpig.x;
if (i3 >= args.ne03 || i2 >= args.ne02 || i1 >= args.ne01) {
return;
}
+ if (sgitg == 0) {
+ shmem_f32[tiisg] = 0.0f;
+ }
+
device const float * src_row = (device const float *) ((device const char *) src0 + i1*args.nb01 + i2*args.nb02 + i3*args.nb03);
device float * dst_row = (device float *) ((device char *) dst + i1*args.nb1 + i2*args.nb2 + i3*args.nb3);
- float row_sum = 0;
+ float sumf = 0;
- for (int64_t i0 = 0; i0 < args.ne00; i0++) {
- row_sum += src_row[i0];
+ for (int64_t i0 = tpitg.x; i0 < args.ne00; i0 += ntg.x) {
+ sumf += src_row[i0];
}
- dst_row[0] = row_sum;
+ sumf = simd_sum(sumf);
+
+ threadgroup_barrier(mem_flags::mem_threadgroup);
+
+ if (tiisg == 0) {
+ shmem_f32[sgitg] = sumf;
+ }
+
+ threadgroup_barrier(mem_flags::mem_threadgroup);
+
+ sumf = shmem_f32[tiisg];
+ sumf = simd_sum(sumf);
+
+ if (tpitg.x == 0) {
+ dst_row[0] = norm ? sumf / args.ne00 : sumf;
+ }
}
+typedef decltype(kernel_sum_rows<false>) kernel_sum_rows_t;
+
+template [[host_name("kernel_sum_rows")]] kernel kernel_sum_rows_t kernel_sum_rows<false>;
+template [[host_name("kernel_mean")]] kernel kernel_sum_rows_t kernel_sum_rows<true>;
+
template<typename T>
kernel void kernel_soft_max(
device const char * src0,
(large source diff omitted)
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Oliver Simons <osimons@nvidia.com>
Date: Tue, 22 Jul 2025 11:02:28 +0200
Subject: [PATCH] Enable CUDA Graphs for gemma3n.
Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though ollama has a slightly different model graph
than llama.cpp which requires different workaround
checks.
---
ggml/src/ggml-cuda/ggml-cuda.cu | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 57eae461..9db0c8b5 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -2671,12 +2671,24 @@ static bool check_node_graph_compatibility_and_refresh_copy_ops(ggml_backend_cud
// Loop over nodes in GGML graph to obtain info needed for CUDA graph
cuda_ctx->cuda_graph->cpy_dest_ptrs.clear();
+ // This fix was added in llama.cpp and Ollama in parallel, but with
+ // different tensor names.
+ // llama.cpp: https://github.com/ggml-org/llama.cpp/pull/14741
+ // ollama: https://github.com/ollama/ollama/pull/11525
+
+ const std::string gemma3n_per_layer_proj_src1_name_ollama = " (reshaped)";
+ const std::string gemma3n_node_name_ollama = "node_";
+
const std::string gemma3n_per_layer_proj_src0_name = "inp_per_layer_selected";
const std::string gemma3n_per_layer_proj_src1_name = "per_layer_proj";
+
+ const std::string ffn_moe_bias_suffix = "_exps.bias";
+
const std::string ffn_moe_gate_bias_prefix = "ffn_moe_gate_biased";
const std::string ffn_moe_up_bias_prefix = "ffn_moe_up_biased";
const std::string ffn_moe_down_bias_prefix = "ffn_moe_down_biased";
+
for (int i = 0; i < cgraph->n_nodes; i++) {
ggml_tensor * node = cgraph->nodes[i];
@@ -2700,6 +2712,12 @@ static bool check_node_graph_compatibility_and_refresh_copy_ops(ggml_backend_cud
if (node->op == GGML_OP_ADD &&
node->src[1] && node->src[1]->ne[1] > 1 &&
+ // ollama
+ // workarounds to exclude Gemma3n's `project_per_layer_input` operation from the batch-size heuristic, specific to ollama's implementation of gemma3n
+ // number of layers is different for per_layer_proj between gemma3n:2b and gemma3n:4b, which is why we don't check that value here
+ !(node->ne[0] == 256 && node->ne[2] == 1 && node->ne[3] == 1 && node->src[0] ? std::string(node->src[0]->name).find(gemma3n_node_name_ollama) != std::string::npos : false && node->src[1] ? node->src[1]->name == gemma3n_per_layer_proj_src1_name_ollama : false) &&
+ node->src[1] ? std::string(node->src[1]->name).find(ffn_moe_bias_suffix) == std::string::npos : false &&
+ // upstream
(node->src[0] ? node->src[0]->name != gemma3n_per_layer_proj_src0_name : true) &&
(node->src[1] ? node->src[1]->name != gemma3n_per_layer_proj_src1_name : true) &&
strncmp(node->name, ffn_moe_gate_bias_prefix.c_str(), ffn_moe_gate_bias_prefix.size()) != 0 &&
......@@ -8,7 +8,7 @@ Subject: [PATCH] Disable ggml-blas on macos v13 and older
1 file changed, 5 insertions(+)
diff --git a/ggml/src/ggml-blas/ggml-blas.cpp b/ggml/src/ggml-blas/ggml-blas.cpp
index ec158dfa..22926d75 100644
index aeac2e57..40738d5b 100644
--- a/ggml/src/ggml-blas/ggml-blas.cpp
+++ b/ggml/src/ggml-blas/ggml-blas.cpp
@@ -505,6 +505,11 @@ static const struct ggml_backend_reg_i ggml_backend_blas_reg_i = {
......
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Oliver Simons <osimons@nvidia.com>
Date: Tue, 22 Jul 2025 11:02:28 +0200
Subject: [PATCH] Enable CUDA Graphs for gemma3n.
Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though ollama has a slightly different model graph
than llama.cpp which requires different workaround
checks.
---
ggml/src/ggml-cuda/ggml-cuda.cu | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 2b9fabf4..28ccf4be 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -2474,6 +2474,9 @@ static bool check_node_graph_compatibility_and_refresh_copy_ops(ggml_backend_cud
// Loop over nodes in GGML graph to obtain info needed for CUDA graph
cuda_ctx->cuda_graph->cpy_dest_ptrs.clear();
+ const std::string gemma3n_per_layer_proj_src1_name = " (reshaped)";
+ const std::string gemma3n_node_name = "node_";
+
for (int i = 0; i < cgraph->n_nodes; i++) {
ggml_tensor * node = cgraph->nodes[i];
@@ -2495,12 +2498,17 @@ static bool check_node_graph_compatibility_and_refresh_copy_ops(ggml_backend_cud
#endif
}
- if (node->op == GGML_OP_ADD && node->src[1] && node->src[1]->ne[1] > 1) {
- // disable CUDA graphs for batch size > 1 for now.
- // Changes in batch size or context size can cause changes to the grid size of some kernels.
+ // workarounds to exclude Gemma3n's `project_per_layer_input` operation from the batch-size heuristic, specific to ollama's implementation of gemma3n
+ // number of layers is different for per_layer_proj between gemma3n:2b and gemma3n:4b, which is why we don't check that value here
+ if (node->op == GGML_OP_ADD && node->src[1] && node->src[1]->ne[1] > 1 && !(node->ne[0] == 256
+ && node->ne[2] == 1
+ && node->ne[3] == 1
+ && node->src[0] ? std::string(node->src[0]->name).find(gemma3n_node_name) != std::string::npos : false
+ && node->src[1] ? node->src[1]->name == gemma3n_per_layer_proj_src1_name : false)) {
+ // Generally, changes in batch size or context size can cause changes to the grid size of some kernels.
use_cuda_graph = false;
#ifndef NDEBUG
- GGML_LOG_DEBUG("%s: disabling CUDA graphs due to batch size > 1 [%s] [%ld %ld %ld %ld]\n", __func__, node->name, node->ne[0], node->ne[1], node->ne[2], node->ne[3]);
+ GGML_LOG_INFO("%s: disabling CUDA graphs due to batch size > 1 [%s] [%ld %ld %ld %ld]\n", __func__, node->name, node->ne[0], node->ne[1], node->ne[2], node->ne[3]);
#endif
}
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Daniel Hiltgen <daniel@ollama.com>
Date: Wed, 6 Aug 2025 12:35:29 -0700
Subject: [PATCH] fix mtmd-audio.cpp build on windows
---
tools/mtmd/mtmd-audio.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/mtmd/mtmd-audio.cpp b/tools/mtmd/mtmd-audio.cpp
index 4d053895..84bdc277 100644
--- a/tools/mtmd/mtmd-audio.cpp
+++ b/tools/mtmd/mtmd-audio.cpp
@@ -1,6 +1,6 @@
+#define _USE_MATH_DEFINES // for M_PI
#include "mtmd-audio.h"
-#define _USE_MATH_DEFINES // for M_PI
#include <cmath>
#include <cstdint>
#include <cstring>
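The reordering matters because MSVC's math headers only define M_PI if _USE_MATH_DEFINES is visible the first time they are read; if mtmd-audio.h (or anything it includes) pulls in the math header first, a later define has no effect. A minimal sketch of the required ordering:

// The macro must precede the first (possibly transitive) inclusion of
// <math.h>/<cmath>, otherwise M_PI stays undefined on MSVC.
#define _USE_MATH_DEFINES
#include <math.h>

double circle_area(double r) {
    return M_PI * r * r;
}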
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Daniel Hiltgen <daniel@ollama.com>
Date: Mon, 21 Jul 2025 12:06:13 -0700
Subject: [PATCH] MXFP4
Partial implementation of MXFP4 tensor type
---
ggml/include/ggml.h | 2 +-
ggml/src/ggml-common.h | 7 +
ggml/src/ggml-cpu/ggml-cpu-quants.h | 2 +
ggml/src/ggml-cpu/ggml-cpu.c | 5 +
ggml/src/ggml-cpu/ops.cpp | 1 +
ggml/src/ggml-cpu/vec.cpp | 90 ++++++++
ggml/src/ggml-cpu/vec.h | 2 +
ggml/src/ggml-cuda/convert.cu | 80 +++++++
ggml/src/ggml-cuda/ggml-cuda.cu | 16 +-
ggml/src/ggml-cuda/mmvmxfp4.cu | 307 ++++++++++++++++++++++++++
ggml/src/ggml-cuda/mmvmxfp4.cuh | 9 +
ggml/src/ggml-metal/ggml-metal-impl.h | 3 +
ggml/src/ggml-metal/ggml-metal.m | 25 ++-
ggml/src/ggml-metal/ggml-metal.metal | 173 ++++++++++++++-
ggml/src/ggml-quants.c | 142 +++++++++++-
ggml/src/ggml-quants.h | 6 +
ggml/src/ggml.c | 13 +-
17 files changed, 868 insertions(+), 15 deletions(-)
create mode 100644 ggml/src/ggml-cuda/mmvmxfp4.cu
create mode 100644 ggml/src/ggml-cuda/mmvmxfp4.cuh
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index e91dedf1..873baa24 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -353,7 +353,7 @@ extern "C" {
GGML_TYPE_F16 = 1,
GGML_TYPE_Q4_0 = 2,
GGML_TYPE_Q4_1 = 3,
- // GGML_TYPE_Q4_2 = 4, support has been removed
+ GGML_TYPE_MXFP4 = 4, // Formerly removed type GGML_TYPE_Q4_2
// GGML_TYPE_Q4_3 = 5, support has been removed
GGML_TYPE_Q5_0 = 6,
GGML_TYPE_Q5_1 = 7,
diff --git a/ggml/src/ggml-common.h b/ggml/src/ggml-common.h
index 086c822d..e0d71451 100644
--- a/ggml/src/ggml-common.h
+++ b/ggml/src/ggml-common.h
@@ -417,6 +417,13 @@ typedef struct {
} block_iq4_xs;
static_assert(sizeof(block_iq4_xs) == sizeof(ggml_half) + sizeof(uint16_t) + QK_K/64 + QK_K/2, "wrong iq4_xs block size/padding");
+#define MXFP4 32
+typedef struct {
+ uint8_t d; // scale E8M0 float
+ uint8_t qs[MXFP4 / 2]; // (32) 4 bit elements E2M1 float
+} block_mxfp4;
+static_assert(sizeof(block_mxfp4) == sizeof(uint8_t) + MXFP4/2, "wrong mxfp4 block size/padding");
+
#endif // GGML_COMMON_DECL
#endif // GGML_COMMON_DECL
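Each 17-byte block therefore carries one shared E8M0 scale plus 32 packed E2M1 nibbles. A minimal sketch of decoding a whole block to floats under those assumptions (the value table matches MXFP4_VALS used in vec.cpp below; this standalone decoder is illustrative, not part of the patch):

#include <stdint.h>

#define MXFP4 32

typedef struct {
    uint8_t d;             // shared scale, E8M0 (a bare float exponent byte)
    uint8_t qs[MXFP4 / 2]; // 32 E2M1 elements, two per byte
} block_mxfp4;

// All 16 E2M1 codes: the low 3 bits index {0,.5,1,1.5,2,3,4,6},
// the top bit of the nibble is the sign.
static const float kMxfp4Vals[16] = {
    0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

static void decode_block_mxfp4(const block_mxfp4 * b, float out[MXFP4]) {
    union { uint32_t bits; float value; } scale;
    scale.bits = ((uint32_t) b->d) << 23; // E8M0 byte is the IEEE 754 exponent field
    for (int j = 0; j < MXFP4 / 2; ++j) {
        out[2*j]     = scale.value * kMxfp4Vals[b->qs[j] & 0x0f];
        out[2*j + 1] = scale.value * kMxfp4Vals[b->qs[j] >> 4];
    }
}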
diff --git a/ggml/src/ggml-cpu/ggml-cpu-quants.h b/ggml/src/ggml-cpu/ggml-cpu-quants.h
index e33d9d47..6a25d062 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-quants.h
+++ b/ggml/src/ggml-cpu/ggml-cpu-quants.h
@@ -58,6 +58,8 @@ void ggml_vec_dot_iq4_nl_q8_0 (int n, float * GGML_RESTRICT s, size_t bs, const
void ggml_vec_dot_iq4_xs_q8_K (int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc);
void ggml_vec_dot_iq3_s_q8_K (int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc);
+void ggml_vec_dot_mxfp4(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const float * GGML_RESTRICT y, size_t by, int nrc);
+
#ifdef __cplusplus
}
#endif
diff --git a/ggml/src/ggml-cpu/ggml-cpu.c b/ggml/src/ggml-cpu/ggml-cpu.c
index 2462d2b8..bff9c426 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
@@ -362,6 +362,11 @@ static const struct ggml_type_traits_cpu type_traits_cpu[GGML_TYPE_COUNT] = {
.vec_dot_type = GGML_TYPE_Q8_K,
.nrows = 1,
},
+ [GGML_TYPE_MXFP4] = {
+ .vec_dot = (ggml_vec_dot_t) ggml_vec_dot_mxfp4,
+ .vec_dot_type = GGML_TYPE_F32,
+ .nrows = 1,
+ },
};
const struct ggml_type_traits_cpu * ggml_get_type_traits_cpu(enum ggml_type type) {
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 654e2f28..be0aa683 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -4965,6 +4965,7 @@ void ggml_compute_forward_clamp(
case GGML_TYPE_I32:
case GGML_TYPE_I64:
case GGML_TYPE_F64:
+ case GGML_TYPE_MXFP4:
case GGML_TYPE_COUNT:
{
GGML_ABORT("fatal error");
diff --git a/ggml/src/ggml-cpu/vec.cpp b/ggml/src/ggml-cpu/vec.cpp
index 02d40618..ec3ec9b1 100644
--- a/ggml/src/ggml-cpu/vec.cpp
+++ b/ggml/src/ggml-cpu/vec.cpp
@@ -250,3 +250,93 @@ ggml_float ggml_vec_log_soft_max_f32(const int n, float * y, const float * x, fl
}
return sum = (ggml_float)logf(sum);
}
+
+#define MXFP4 32
+typedef struct {
+ uint8_t d; // scale E8M0 float
+ uint8_t qs[MXFP4 / 2]; // (32) 4 bit elements E2M1 float
+} block_mxfp4;
+static_assert(sizeof(block_mxfp4) == sizeof(uint8_t) + MXFP4/2, "wrong mxfp4 block size/padding");
+#define MXFP4_VALS {0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0}
+
+void ggml_vec_dot_mxfp4(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const float * GGML_RESTRICT y, size_t by, int nrc) {
+ assert(nrc == 1);
+ GGML_UNUSED(nrc);
+ GGML_UNUSED(bx);
+ GGML_UNUSED(by);
+ GGML_UNUSED(bs);
+ ggml_float mxfp4_table[] = MXFP4_VALS;
+
+#if defined(GGML_SIMD)
+ float sumf = 0.0f;
+ const int np = (n & ~(GGML_F32_STEP - 1));
+ const block_mxfp4 * GGML_RESTRICT xx = (const block_mxfp4 *) vx;
+ GGML_F32_VEC sum[GGML_F32_ARR] = { GGML_F32_VEC_ZERO };
+
+ GGML_F32_VEC scalev;
+ GGML_F32_VEC ax[GGML_F32_ARR];
+ GGML_F32_VEC ay[GGML_F32_ARR];
+ for (int i = 0; i < np; i += GGML_F32_STEP) { // ARM: +16 AVX512: +64
+ for (int j = 0; j < GGML_F32_ARR; j++) { // ARM: 0 .. 4 AVX512: 0 .. 4
+ // convert GGML_F32_ARR X elements
+ const int ib = (i + j*GGML_F32_EPR) / MXFP4;
+ const block_mxfp4 * GGML_RESTRICT x = &xx[ib];
+ union {
+ uint32_t as_bits;
+ float as_value;
+ } scale;
+ scale.as_bits = (((uint32_t)x->d) << 23);
+ scalev = GGML_F32_VEC_SET1(scale.as_value);
+ float xf[GGML_F32_EPR]= {0.f};
+ assert(((i+j*GGML_F32_EPR) % MXFP4)+GGML_F32_ARR < MXFP4 && "block overrun");
+ for (int qi = 0; qi < GGML_F32_EPR/2 ; ++qi) {
+ xf[qi*2] = mxfp4_table[(x->qs[((i+j*GGML_F32_EPR)%MXFP4)/2+qi] & 0xf)];
+ xf[qi*2+1] = mxfp4_table[(x->qs[((i+j*GGML_F32_EPR)%MXFP4)/2+qi] & 0xf0) >> 4];
+ }
+
+ ax[j] = GGML_F32_VEC_MUL(GGML_F32_VEC_LOAD(xf), scalev);
+ ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);
+ sum[j] = GGML_F32_VEC_FMA(sum[j], ax[j], ay[j]);
+ }
+ }
+ GGML_F32_VEC_REDUCE(sumf, sum);
+
+ // leftovers
+ for (int i = np; i < n; i+=2) {
+ const int ib = i / MXFP4;
+ const block_mxfp4 * GGML_RESTRICT x = &xx[ib];
+ union {
+ uint32_t as_bits;
+ float as_value;
+ } scale;
+ scale.as_bits = (((uint32_t)x->d) << 23);
+ sumf += y[i] * scale.as_value * mxfp4_table[(x->qs[(i%MXFP4)/2] & 0xf)];
+ sumf += y[i+1] * scale.as_value * mxfp4_table[(x->qs[(i%MXFP4)/2] & 0xf0) >> 4];
+ }
+
+
+#else // defined(GGML_SIMD)
+ const int nb = n / MXFP4;
+ assert(n % MXFP4 == 0);
+
+ int yi = 0;
+
+ const block_mxfp4 * GGML_RESTRICT xx = (const block_mxfp4 *) vx;
+
+ ggml_float sumf = 0.0;
+ for (int ib = 0; ib < nb; ++ib) {
+ const block_mxfp4 * GGML_RESTRICT x = &xx[ib + 0];
+ union {
+ uint32_t as_bits;
+ float as_value;
+ } scale;
+ scale.as_bits = (((uint32_t)x->d) << 23);
+ for (int i = 0; i < MXFP4/2; ++i) {
+ sumf += mxfp4_table[(x->qs[i] & 0xf)] * (ggml_float)(scale.as_value) * (ggml_float)(y[ib*MXFP4 + i*2]);
+ sumf += mxfp4_table[(x->qs[i] & 0xf0) >> 4] * (ggml_float)(scale.as_value) * (ggml_float)(y[ib*MXFP4 + i*2+1]);
+ }
+ }
+#endif
+
+ *s = sumf;
+}
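The MXFP4_VALS table above is just the closed form of E2M1: for a nibble with sign bit s, 2-bit exponent e and 1-bit mantissa m, the value is (-1)^s * 2^(e-1) * (1 + m/2) for e > 0, and (-1)^s * m/2 for the subnormal case e == 0. A small self-check enumerating all 16 codes (a standalone sketch, not part of the patch):

#include <assert.h>
#include <math.h>

// Closed-form E2M1 decode; reproduces MXFP4_VALS exactly.
static float e2m1_value(unsigned nibble) {
    const float sign = (nibble & 0x8) ? -1.0f : 1.0f;
    const unsigned e = (nibble >> 1) & 0x3;
    const unsigned m = nibble & 0x1;
    if (e == 0) {
        return sign * 0.5f * (float) m;               // subnormal: 0 or +-0.5
    }
    return sign * ldexpf(1.0f + 0.5f * (float) m, (int) e - 1);
}

int main(void) {
    const float vals[16] = {0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
                            0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f};
    for (unsigned q = 0; q < 16; ++q) {
        assert(e2m1_value(q) == vals[q]);
    }
    return 0;
}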
diff --git a/ggml/src/ggml-cpu/vec.h b/ggml/src/ggml-cpu/vec.h
index 23cbb305..7480ca08 100644
--- a/ggml/src/ggml-cpu/vec.h
+++ b/ggml/src/ggml-cpu/vec.h
@@ -42,6 +42,8 @@ void ggml_vec_dot_f32(int n, float * GGML_RESTRICT s, size_t bs, const float * G
void ggml_vec_dot_bf16(int n, float * GGML_RESTRICT s, size_t bs, ggml_bf16_t * GGML_RESTRICT x, size_t bx, ggml_bf16_t * GGML_RESTRICT y, size_t by, int nrc);
void ggml_vec_dot_f16(int n, float * GGML_RESTRICT s, size_t bs, ggml_fp16_t * GGML_RESTRICT x, size_t bx, ggml_fp16_t * GGML_RESTRICT y, size_t by, int nrc);
+void ggml_vec_dot_mxfp4(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const float * GGML_RESTRICT y, size_t by, int nrc);
+
void ggml_vec_silu_f32(const int n, float * y, const float * x);
ggml_float ggml_vec_soft_max_f32(const int n, float * y, const float * x, float max);
ggml_float ggml_vec_log_soft_max_f32(const int n, float * y, const float * x, float max);
diff --git a/ggml/src/ggml-cuda/convert.cu b/ggml/src/ggml-cuda/convert.cu
index c6dec427..0e016ccc 100644
--- a/ggml/src/ggml-cuda/convert.cu
+++ b/ggml/src/ggml-cuda/convert.cu
@@ -571,6 +571,82 @@ static void dequantize_row_iq4_xs_cuda(const void * vx, dst_t * y, const int64_t
dequantize_block_iq4_xs<<<nb, 32, 0, stream>>>(vx, y);
}
+// MXFP4 dequantize derived from dequantize_block_q4_0
+template<typename dst_t>
+static __global__ void dequantize_block_mxfp4(const void * __restrict__ vx, dst_t * __restrict__ yy, int nb32) {
+ const uint16_t dst_bias = 15;
+ const uint16_t dst_0p5 = 0x3800;
+ const uint16_t dst_m_bits = 10;
+ const int64_t i = blockIdx.x;
+
+ // assume 32 threads
+ const int64_t tid = threadIdx.x;
+ const int64_t il = tid/8;
+ const int64_t ir = tid%8;
+ const int64_t ib = 8*i + ir;
+ if (ib >= nb32) {
+ return;
+ }
+
+ const uint64_t offset = 256*i + MXFP4*ir + 8*il;
+ dst_t * y = yy + offset;
+
+ const block_mxfp4 * x = (const block_mxfp4 *)vx + ib;
+ union {
+ uint32_t as_bits;
+ float as_value;
+ } scale;
+ scale.as_bits = (((uint32_t)x->d) << 23);
+
+ // offset within the block 1/4 chunks (8 items)
+ const uint8_t * q = x->qs + 4*il;
+
+ for (int l = 0; l < 4; ++l) {
+ uint16_t em0 = q[l] & 0x07;
+ uint16_t em1 = q[l] & 0x70;
+ // float16 values
+ iq1m_scale_t x0;
+ iq1m_scale_t x1;
+
+ x0.u16 = (em0 << (dst_m_bits - 1)) | ((q[l] & 0x08) << 12);
+ x1.u16 = (em1 << (dst_m_bits - 5)) | ((q[l] & 0x80) << 8);
+
+ // Three cases:
+ // x is normal and non-zero: Correct bias
+ if ((em0 & 0x06) != 0) {
+ x0.u16 = x0.u16 + ((dst_bias - 1) << dst_m_bits);
+ }
+ if ((em1 & 0x60) != 0) {
+ x1.u16 = x1.u16 + ((dst_bias - 1) << dst_m_bits);
+ }
+ // x is subnormal (x == 0bs001 where s is the sign): Map to +-0.5 in the dst type
+ if (em0 == 0x01) {
+ x0.u16 = dst_0p5 | (x0.u16 & 0x8000);
+ }
+ if (em1 == 0x10) {
+ x1.u16 = dst_0p5 | (x1.u16 & 0x8000);
+ }
+ // x is zero, do nothing
+
+ // XXX it looks correct here - but mulmat still gives bad results...
+ // printf("i:%lld ir:%lld il:%lld l:%d y_offset:[%3lld +%d] = %f \n",
+ // i, ir, il, l, 256*i + 32*ir + 4*il, l*2+ 0, scale * float(x0.f16));
+ // printf("i:%lld ir:%lld il:%lld l:%d y_offset:[%3lld +%d] = %f \n",
+ // i, ir, il, l, 256*i + 32*ir + 4*il, l*2+ 1, scale * float(x1.f16));
+
+ y[l*2] = scale.as_value * float(x0.f16);
+ y[l*2+1] = scale.as_value * float(x1.f16);
+ }
+}
+
+// derived from dequantize_row_q4_0_cuda
+template<typename dst_t>
+static void dequantize_row_mxfp4_cuda(const void * vx, dst_t * y, const int64_t k, cudaStream_t stream) {
+ const int nb32 = k / 32;
+ const int nb = (k + 255) / 256;
+ dequantize_block_mxfp4<<<nb, 32, 0, stream>>>(vx, y, nb32);
+}
+
template <typename src_t, typename dst_t>
static __global__ void convert_unary(
const void * __restrict__ vx, dst_t * __restrict__ y, const int64_t ne00, const int64_t ne01, const int64_t ne02,
@@ -664,6 +740,8 @@ to_fp16_cuda_t ggml_get_to_fp16_cuda(ggml_type type) {
return convert_unary_cont_cuda<float>;
case GGML_TYPE_BF16:
return convert_unary_cont_cuda<nv_bfloat16>;
+ case GGML_TYPE_MXFP4:
+ return dequantize_row_mxfp4_cuda;
default:
return nullptr;
}
@@ -713,6 +791,8 @@ to_fp32_cuda_t ggml_get_to_fp32_cuda(ggml_type type) {
return convert_unary_cont_cuda<half>;
case GGML_TYPE_BF16:
return convert_unary_cont_cuda<nv_bfloat16>;
+ case GGML_TYPE_MXFP4:
+ return dequantize_row_mxfp4_cuda;
default:
return nullptr;
}
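Unlike the CPU path, the CUDA dequantizer avoids a lookup table and rebuilds an IEEE half directly from the nibble's bit fields: the 3 magnitude bits are shifted into the half's exponent/mantissa, the exponent is rebiased from E2M1's bias 1 to FP16's bias 15 (dst_bias), and the lone subnormal code maps to the half pattern for 0.5 (dst_0p5 = 0x3800). A scalar C sketch of the same trick for the low nibble, returning the half bit pattern (an illustration under those assumptions):

#include <stdint.h>

// Build the FP16 bit pattern for one E2M1 code (low nibble), mirroring the
// dst_bias/dst_0p5/dst_m_bits arithmetic in dequantize_block_mxfp4.
static uint16_t e2m1_to_half_bits(uint8_t q) {
    const uint16_t dst_bias   = 15;     // FP16 exponent bias
    const uint16_t dst_0p5    = 0x3800; // FP16 encoding of 0.5
    const uint16_t dst_m_bits = 10;     // FP16 mantissa width

    const uint16_t em = q & 0x07;                      // exponent+mantissa bits
    uint16_t h = (uint16_t)((em << (dst_m_bits - 1))   // e/m into exp/mantissa
                 | ((q & 0x08) << 12));                // sign into bit 15
    if (em & 0x06) {                                   // normal, non-zero
        h += (uint16_t)((dst_bias - 1) << dst_m_bits); // rebias exponent
    } else if (em == 0x01) {                           // subnormal code
        h = dst_0p5 | (h & 0x8000);                    // map to +-0.5
    }
    // em == 0: zero, bit pattern already correct
    return h;
}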
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 28ccf4be..bb19b06e 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -21,6 +21,7 @@
#include "ggml-cuda/im2col.cuh"
#include "ggml-cuda/mmq.cuh"
#include "ggml-cuda/mmv.cuh"
+#include "ggml-cuda/mmvmxfp4.cuh"
#include "ggml-cuda/mmvq.cuh"
#include "ggml-cuda/norm.cuh"
#include "ggml-cuda/opt-step-adamw.cuh"
@@ -1202,7 +1203,7 @@ static void ggml_cuda_op_mul_mat_cublas(
const int cc = ggml_cuda_info().devices[id].cc;
- const bool use_fp16 = (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) && ggml_is_contiguous(src0) && row_diff == src0->ne[1] && dst->op_params[0] == GGML_PREC_DEFAULT;
+ const bool use_fp16 = (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) && ggml_is_contiguous(src0) && row_diff == src0->ne[1] && dst->op_params[0] == GGML_PREC_DEFAULT && src0->type != GGML_TYPE_MXFP4;
if (src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) {
ggml_cuda_pool_alloc<nv_bfloat16> src1_as_bf16(ctx.pool(id));
@@ -1924,7 +1925,11 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
&& src0->ne[0] % 2 == 0 && src1->ne[1] == 1;
bool use_mul_mat_vec_q = ggml_is_quantized(src0->type) && !bad_padding_clear
&& src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32
- && src1->ne[1] <= MMVQ_MAX_BATCH_SIZE;
+ && src1->ne[1] <= MMVQ_MAX_BATCH_SIZE
+ && src0->type != GGML_TYPE_MXFP4;
+ bool use_mul_mat_vec_mxfp4 = src0->type == GGML_TYPE_MXFP4
+ && src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32
+ && src0->ne[0] % 2 == 0 && src1->ne[1] == 1;
bool use_mul_mat_q = ggml_is_quantized(src0->type) && !bad_padding_clear
&& src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32;
@@ -1978,6 +1983,8 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_vec_q, quantize_row_q8_1_cuda);
} else if (use_mul_mat_q) {
ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_q, quantize_mmq_q8_1_cuda);
+ } else if (use_mul_mat_vec_mxfp4) {
+ ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_vec_mxfp4, nullptr);
} else {
ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_cublas, nullptr);
}
@@ -1997,6 +2004,10 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
if (src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32) {
+ if (ne2 == 1 && src0->type == GGML_TYPE_MXFP4) {
+ ggml_cuda_mul_mat_vec_mxfp4(ctx, src0, src1, ids, dst);
+ return;
+ }
if (ne2 == 1) {
if (ggml_is_quantized(src0->type)) {
ggml_cuda_mul_mat_vec_q(ctx, src0, src1, ids, dst);
@@ -3056,6 +3067,7 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
case GGML_TYPE_IQ4_NL:
case GGML_TYPE_IQ4_XS:
case GGML_TYPE_BF16:
+ case GGML_TYPE_MXFP4:
#ifdef GGML_USE_MUSA
if (a->type == GGML_TYPE_Q3_K) {
return false;
diff --git a/ggml/src/ggml-cuda/mmvmxfp4.cu b/ggml/src/ggml-cuda/mmvmxfp4.cu
new file mode 100644
index 00000000..da62062b
--- /dev/null
+++ b/ggml/src/ggml-cuda/mmvmxfp4.cu
@@ -0,0 +1,307 @@
+#include "ggml.h"
+#include "common.cuh"
+#include "mmvmxfp4.cuh"
+
+// MXFP4 implementation derived from mmv.cu float32 code paths
+typedef union {
+ half f16;
+ uint16_t u16;
+} f16_t;
+
+template <typename type_acc, int block_size> // TODO type_acc unused - consider bf16 support
+static __global__ void mul_mat_vec_mxfp4(
+ const block_mxfp4 * __restrict__ x, const float * __restrict__ y, const int32_t * __restrict__ ids, float * __restrict__ dst,
+ const int64_t ncols2, const int64_t nchannels_y, const int64_t stride_row,
+ const int64_t channel_ratio, const int64_t stride_channel_x, const int64_t stride_channel_y, const int64_t stride_channel_dst,
+ const int64_t sample_ratio, const int64_t stride_sample_x, const int64_t stride_sample_y, const int64_t stride_sample_dst) {
+ const int64_t row = blockIdx.x;
+ const int64_t channel_dst = blockIdx.y;
+ const int64_t channel_x = ids ? ids[channel_dst] : channel_dst / channel_ratio;
+ const int64_t channel_y = ids ? channel_dst % nchannels_y : channel_dst;
+ const int64_t sample_dst = blockIdx.z;
+ const int64_t sample_x = sample_dst / sample_ratio;
+ const int64_t sample_y = sample_dst;
+ const int tid = threadIdx.x;
+ constexpr int warp_size = ggml_cuda_get_physical_warp_size();
+
+ const uint16_t dst_bias = 15;
+ const uint16_t dst_0p5 = 0x3800;
+ const uint16_t dst_m_bits = 10;
+
+ x += sample_x *stride_sample_x + channel_x *stride_channel_x + row*stride_row;
+ y += sample_y *stride_sample_y + channel_y *stride_channel_y;
+ dst += sample_dst*stride_sample_dst + channel_dst*stride_channel_dst;
+
+ const float2 * y2 = (const float2 *) y;
+
+ extern __shared__ char data_mmv[]; // allocated in GPU shared memory: warp_size*sizeof(float)
+ float * buf_iw = (float *) data_mmv;
+
+ if (block_size > warp_size) {
+ if (tid < warp_size) {
+ buf_iw[tid] = 0.0f;
+ }
+ __syncthreads();
+ }
+
+ float sumf = 0.0f;
+
+ for (int64_t col2 = tid; col2 < ncols2; col2 += block_size) {
+ int offset0 = col2 / (MXFP4/2);
+ int i = col2 % (MXFP4/2);
+ const block_mxfp4 *x2 = x+offset0;
+
+ union {
+ uint32_t as_bits;
+ float as_value;
+ } scale;
+ scale.as_bits = (((uint32_t)x2->d) << 23);
+ uint16_t em0 = x2->qs[i] & 0x07;
+ uint16_t em1 = x2->qs[i] & 0x70;
+ // float16 values
+ f16_t x0;
+ f16_t x1;
+ x0.u16 = (em0 << (dst_m_bits - 1)) | ((x2->qs[i] & 0x08) << 12);
+ x1.u16 = (em1 << (dst_m_bits - 5)) | ((x2->qs[i] & 0x80) << 8);
+
+ // Three cases:
+ // x is normal and non-zero: Correct bias
+ if ((em0 & 0x06) != 0) {
+ x0.u16 = x0.u16 + ((dst_bias - 1) << dst_m_bits);
+ }
+ if ((em1 & 0x60) != 0) {
+ x1.u16 = x1.u16 + ((dst_bias - 1) << dst_m_bits);
+ }
+ // x is subnormal (x == 0bs001 where s is the sign): Map to +-0.5 in the dst type
+ if (em0 == 0x01) {
+ x0.u16 = dst_0p5 | (x0.u16 & 0x8000);
+ }
+ if (em1 == 0x10) {
+ x1.u16 = dst_0p5 | (x1.u16 & 0x8000);
+ }
+ // x is zero, do nothing
+
+ if (isnan(scale.as_value)) {
+ sumf = scale.as_value;
+ break;
+ }
+
+ const float2 tmpx = {x0.f16, x1.f16};
+ const float2 tmpy = y2[col2];
+ sumf += tmpx.x*tmpy.x*scale.as_value;
+ sumf += tmpx.y*tmpy.y*scale.as_value;
+ }
+
+ sumf = warp_reduce_sum<warp_size>(sumf);
+
+ if (block_size > warp_size) {
+ buf_iw[tid/warp_size] = sumf;
+ __syncthreads();
+ if (tid >= warp_size) {
+ return;
+ }
+ sumf = buf_iw[tid];
+ sumf = warp_reduce_sum<warp_size>(sumf);
+ }
+
+ if (tid != 0) {
+ return;
+ }
+
+ dst[row] = sumf;
+}
+
+template <typename type_acc>
+static void launch_mul_mat_vec_cuda_mxfp4(
+ const block_mxfp4 * x, const float * y, const int32_t * ids, float * dst,
+ const int64_t ncols, const int64_t nrows, const int64_t stride_row, const int64_t nchannels_x, const int64_t nchannels_y, const int64_t nchannels_dst,
+ const int64_t stride_channel_x, const int64_t stride_channel_y, const int64_t stride_channel_dst, const int64_t nsamples_x,
+ const int64_t nsamples_dst, const int64_t stride_sample_x, const int64_t stride_sample_y, const int64_t stride_sample_dst,
+ cudaStream_t stream) {
+ GGML_ASSERT(ncols % 2 == 0);
+ // GGML_ASSERT(stride_row % 2 == 0); // TODO
+ GGML_ASSERT(ids || nchannels_dst % nchannels_x == 0);
+ GGML_ASSERT( nsamples_dst % nsamples_x == 0);
+ const int64_t channel_ratio = nchannels_dst / nchannels_x;
+ const int64_t sample_ratio = nsamples_dst / nsamples_x;
+ int device;
+ int warp_size;
+
+ CUDA_CHECK(cudaGetDevice(&device));
+ warp_size = ggml_cuda_info().devices[device].warp_size;
+
+ int64_t block_size_best = warp_size;
+ int64_t niter_best = (ncols + 2*warp_size - 1) / (2*warp_size);
+ int64_t max_block_size = 256;
+ if(ggml_cuda_info().devices[device].cc > GGML_CUDA_CC_OFFSET_AMD && ggml_cuda_info().devices[device].cc < GGML_CUDA_CC_RDNA1) {
+ max_block_size = 128;
+ }
+ for (int64_t block_size = 2*warp_size; block_size <= max_block_size; block_size += warp_size) {
+ const int64_t niter = (ncols + 2*block_size - 1) / (2*block_size);
+ if (niter < niter_best) {
+ niter_best = niter;
+ block_size_best = block_size;
+ }
+ }
+
+ const int smem = warp_size*sizeof(float);
+ const dim3 block_nums(nrows, nchannels_dst, nsamples_dst);
+ const dim3 block_dims(block_size_best, 1, 1);
+
+ switch (block_size_best) {
+ case 32: {
+ mul_mat_vec_mxfp4<type_acc, 32><<<block_nums, block_dims, smem, stream>>>
+ (x, y, ids, dst, ncols/2, nchannels_y, stride_row, channel_ratio, stride_channel_x, stride_channel_y,
+ stride_channel_dst, sample_ratio, stride_sample_x, stride_sample_y, stride_sample_dst);
+ } break;
+ case 64: {
+ mul_mat_vec_mxfp4<type_acc, 64><<<block_nums, block_dims, smem, stream>>>
+ (x, y, ids, dst, ncols/2, nchannels_y, stride_row, channel_ratio, stride_channel_x, stride_channel_y,
+ stride_channel_dst, sample_ratio, stride_sample_x, stride_sample_y, stride_sample_dst);
+ } break;
+ case 96: {
+ mul_mat_vec_mxfp4<type_acc, 96><<<block_nums, block_dims, smem, stream>>>
+ (x, y, ids, dst, ncols/2, nchannels_y, stride_row, channel_ratio, stride_channel_x, stride_channel_y,
+ stride_channel_dst, sample_ratio, stride_sample_x, stride_sample_y, stride_sample_dst);
+ } break;
+ case 128: {
+ mul_mat_vec_mxfp4<type_acc, 128><<<block_nums, block_dims, smem, stream>>>
+ (x, y, ids, dst, ncols/2, nchannels_y, stride_row, channel_ratio, stride_channel_x, stride_channel_y,
+ stride_channel_dst, sample_ratio, stride_sample_x, stride_sample_y, stride_sample_dst);
+ } break;
+ case 160: {
+ mul_mat_vec_mxfp4<type_acc, 160><<<block_nums, block_dims, smem, stream>>>
+ (x, y, ids, dst, ncols/2, nchannels_y, stride_row, channel_ratio, stride_channel_x, stride_channel_y,
+ stride_channel_dst, sample_ratio, stride_sample_x, stride_sample_y, stride_sample_dst);
+ } break;
+ case 192: {
+ mul_mat_vec_mxfp4<type_acc, 192><<<block_nums, block_dims, smem, stream>>>
+ (x, y, ids, dst, ncols/2, nchannels_y, stride_row, channel_ratio, stride_channel_x, stride_channel_y,
+ stride_channel_dst, sample_ratio, stride_sample_x, stride_sample_y, stride_sample_dst);
+ } break;
+ case 224: {
+ mul_mat_vec_mxfp4<type_acc, 224><<<block_nums, block_dims, smem, stream>>>
+ (x, y, ids, dst, ncols/2, nchannels_y, stride_row, channel_ratio, stride_channel_x, stride_channel_y,
+ stride_channel_dst, sample_ratio, stride_sample_x, stride_sample_y, stride_sample_dst);
+ } break;
+ case 256: {
+ mul_mat_vec_mxfp4<type_acc, 256><<<block_nums, block_dims, smem, stream>>>
+ (x, y, ids, dst, ncols/2, nchannels_y, stride_row, channel_ratio, stride_channel_x, stride_channel_y,
+ stride_channel_dst, sample_ratio, stride_sample_x, stride_sample_y, stride_sample_dst);
+ } break;
+ default: {
+ GGML_ABORT("fatal error");
+ } break;
+ }
+}
+
+static void mul_mat_vec_cuda_mxfp4(
+ const block_mxfp4 * x, const float * y, const int32_t * ids, float * dst,
+ const int64_t ncols, const int64_t nrows, const int64_t stride_row, const int64_t nchannels_x, const int64_t nchannels_y, const int64_t nchannels_dst,
+ const int64_t stride_channel_x, const int64_t stride_channel_y, const int64_t stride_channel_dst, const int64_t nsamples_x,
+ const int64_t nsamples_dst, const int64_t stride_sample_x, const int64_t stride_sample_y, const int64_t stride_sample_dst,
+ enum ggml_prec prec, cudaStream_t stream) {
+ launch_mul_mat_vec_cuda_mxfp4<float>
+ (x, y, ids, dst, ncols, nrows, stride_row, nchannels_x, nchannels_y, nchannels_dst, stride_channel_x, stride_channel_y,
+ stride_channel_dst, nsamples_x, nsamples_dst, stride_sample_x, stride_sample_y, stride_sample_dst, stream);
+}
+
+void ggml_cuda_mul_mat_vec_mxfp4(ggml_backend_cuda_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, const ggml_tensor * ids, ggml_tensor * dst) {
+ GGML_ASSERT( src1->type == GGML_TYPE_F32);
+ GGML_ASSERT(!ids || ids->type == GGML_TYPE_I32);
+ GGML_ASSERT( dst->type == GGML_TYPE_F32);
+
+ GGML_TENSOR_BINARY_OP_LOCALS;
+
+ const size_t ts_src0 = ggml_type_size(src0->type);
+ const size_t ts_src1 = ggml_type_size(src1->type);
+ const size_t ts_dst = ggml_type_size(dst->type);
+
+ GGML_ASSERT(!ids || ne12 == 1); // Implementation is only correct for batch size 1.
+ GGML_ASSERT(ne13 == ne3);
+
+ // GGML_ASSERT( nb00 == ts_src0); // TODO adjust for block sizing logic
+ GGML_ASSERT( nb10 == ts_src1);
+ GGML_ASSERT(!ids || ids->nb[0] == ggml_type_size(ids->type));
+ GGML_ASSERT( nb0 == ts_dst);
+
+ const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
+ const enum ggml_prec prec = fast_fp16_available(cc) ? ggml_prec(dst->op_params[0]) : GGML_PREC_F32;
+
+ const float * src1_d = (const float *) src1->data;
+ const int32_t * ids_d = ids ? (const int32_t *) ids->data : nullptr;
+ float * dst_d = (float *) dst->data;
+
+ const int64_t stride_row = src0->nb[1] / ts_src0;
+ const int64_t s11 = src1->nb[1] / ts_src1;
+ const int64_t s1 = dst->nb[1] / ts_dst;
+ const int64_t stride_channel_x = src0->nb[2] / ts_src0;
+ const int64_t s12 = src1->nb[2] / ts_src1;
+ const int64_t s2 = dst->nb[2] / ts_dst;
+ const int64_t stride_sample_x = src0->nb[3] / ts_src0;
+ const int64_t stride_sample_y = src1->nb[3] / ts_src1;
+ const int64_t stride_sample_dst = dst->nb[3] / ts_dst;
+ const int64_t nsamples_dst = ne3;
+ const int64_t nsamples_x = ne03;
+ const int64_t nchannels_x = ne02;
+ const int64_t nrows = ne01;
+ const int64_t ncols = ne00;
+
+ // For MUL_MAT_ID the memory layout is different than for MUL_MAT:
+ const int64_t ncols_dst = ids ? ne2 : ne1;
+ const int64_t nchannels_y = ids ? ne11 : ne12;
+ const int64_t nchannels_dst = ids ? ne1 : ne2;
+ const int64_t stride_channel_dst = ids ? s1 : s2;
+ const int64_t stride_channel_y = ids ? s11 : s12;
+
+ GGML_ASSERT(ncols_dst == 1);
+
+ const block_mxfp4 * src0_d = (const block_mxfp4 *) src0->data;
+ mul_mat_vec_cuda_mxfp4(src0_d, src1_d, ids_d, dst_d, ncols, nrows, stride_row,
+ nchannels_x, nchannels_y, nchannels_dst, stride_channel_x, stride_channel_y, stride_channel_dst,
+ nsamples_x, nsamples_dst, stride_sample_x, stride_sample_y, stride_sample_dst, prec, ctx.stream());
+}
+
+void ggml_cuda_op_mul_mat_vec_mxfp4(
+ ggml_backend_cuda_context & ctx,
+ const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, const char * src0_dd_i, const float * src1_ddf_i,
+ const char * src1_ddq_i, float * dst_dd_i, const int64_t row_low, const int64_t row_high, const int64_t src1_ncols,
+ const int64_t src1_padded_row_size, cudaStream_t stream) {
+
+ GGML_ASSERT(src1->type == GGML_TYPE_F32);
+ GGML_ASSERT(dst->type == GGML_TYPE_F32);
+
+ const int64_t ne00 = src0->ne[0];
+ const int64_t row_diff = row_high - row_low;
+
+ GGML_ASSERT(src1_ncols == 1);
+
+ const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
+ const enum ggml_prec prec = fast_fp16_available(cc) ? ggml_prec(dst->op_params[0]) : GGML_PREC_F32;
+
+ // ggml_cuda_op provides single, contiguous matrices
+ const int64_t stride_row = ne00 / MXFP4;
+ const int64_t nchannels_x = 1;
+ const int64_t nchannels_y = 1;
+ const int64_t nchannels_dst = 1;
+ const int64_t stride_channel_x = 0;
+ const int64_t stride_channel_y = 0;
+ const int64_t stride_channel_dst = 0;
+ const int64_t nsamples_x = 1;
+ const int64_t nsamples_dst = 1;
+ const int64_t stride_sample_x = 0;
+ const int64_t stride_sample_y = 0;
+ const int64_t stride_sample_dst = 0;
+
+ const block_mxfp4 * src0_d = (const block_mxfp4 *) src0_dd_i;
+ mul_mat_vec_cuda_mxfp4(src0_d, src1_ddf_i, nullptr, dst_dd_i, ne00, row_diff, stride_row,
+ nchannels_x, nchannels_y, nchannels_dst, stride_channel_x, stride_channel_y, stride_channel_dst,
+ nsamples_x, nsamples_dst, stride_sample_x, stride_sample_y, stride_sample_dst, prec, stream);
+
+ GGML_UNUSED(ctx);
+ GGML_UNUSED(src1);
+ GGML_UNUSED(dst);
+ GGML_UNUSED(src1_ddq_i);
+ GGML_UNUSED(src1_ncols);
+ GGML_UNUSED(src1_padded_row_size);
+}
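launch_mul_mat_vec_cuda_mxfp4 picks the threadblock size that minimizes per-thread iterations over column pairs, stepping in warp-size multiples up to 256 (128 on pre-RDNA AMD). A host-side sketch of just that search, with no CUDA dependency (a standalone illustration of the loop above):

#include <stdint.h>

// Mirrors the block-size selection in launch_mul_mat_vec_cuda_mxfp4: each
// thread consumes 2 columns per step, so pick the block size with the
// fewest iterations over ncols.
static int64_t pick_block_size(int64_t ncols, int64_t warp_size, int64_t max_block_size) {
    int64_t best_bs   = warp_size;
    int64_t best_iter = (ncols + 2*warp_size - 1) / (2*warp_size);
    for (int64_t bs = 2*warp_size; bs <= max_block_size; bs += warp_size) {
        const int64_t niter = (ncols + 2*bs - 1) / (2*bs);
        if (niter < best_iter) {
            best_iter = niter;
            best_bs   = bs;
        }
    }
    return best_bs; // one of 32, 64, ..., 256 for warp_size 32
}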
diff --git a/ggml/src/ggml-cuda/mmvmxfp4.cuh b/ggml/src/ggml-cuda/mmvmxfp4.cuh
new file mode 100644
index 00000000..a08fc780
--- /dev/null
+++ b/ggml/src/ggml-cuda/mmvmxfp4.cuh
@@ -0,0 +1,9 @@
+#include "common.cuh"
+
+void ggml_cuda_mul_mat_vec_mxfp4(ggml_backend_cuda_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, const ggml_tensor * ids, ggml_tensor * dst);
+
+void ggml_cuda_op_mul_mat_vec_mxfp4(
+ ggml_backend_cuda_context & ctx,
+ const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, const char * src0_dd_i, const float * src1_ddf_i,
+ const char * src1_ddq_i, float * dst_dd_i, const int64_t row_low, const int64_t row_high, const int64_t src1_ncols,
+ const int64_t src1_padded_row_size, cudaStream_t stream);
diff --git a/ggml/src/ggml-metal/ggml-metal-impl.h b/ggml/src/ggml-metal/ggml-metal-impl.h
index 17eab976..938386ba 100644
--- a/ggml/src/ggml-metal/ggml-metal-impl.h
+++ b/ggml/src/ggml-metal/ggml-metal-impl.h
@@ -65,6 +65,9 @@
#define N_R0_IQ4_XS 2
#define N_SG_IQ4_XS 2
+#define N_R0_MXFP4 4
+#define N_SG_MXFP4 2
+
// kernel argument structs
//
// - element counters (e.g. ne00) typically use int32_t to reduce register usage
diff --git a/ggml/src/ggml-metal/ggml-metal.m b/ggml/src/ggml-metal/ggml-metal.m
index ab46f6e3..d8e05a21 100644
--- a/ggml/src/ggml-metal/ggml-metal.m
+++ b/ggml/src/ggml-metal/ggml-metal.m
@@ -40,6 +40,7 @@ static const NSInteger MTLGPUFamilyMetal3_GGML = 5001;
static struct ggml_backend_reg g_ggml_backend_metal_reg;
static struct ggml_backend_device g_ggml_backend_metal_device;
+
// information about a Metal device
// note: assumes single GPU device - the default one
// TODO: support multiple GPU devices
@@ -209,6 +210,7 @@ enum ggml_metal_kernel_type {
GGML_METAL_KERNEL_TYPE_MUL_MV_Q5_0_F32,
GGML_METAL_KERNEL_TYPE_MUL_MV_Q5_1_F32,
GGML_METAL_KERNEL_TYPE_MUL_MV_Q8_0_F32,
+ GGML_METAL_KERNEL_TYPE_MUL_MV_MXFP4_F32,
GGML_METAL_KERNEL_TYPE_MUL_MV_EXT_F16_F32_R1_2,
GGML_METAL_KERNEL_TYPE_MUL_MV_EXT_F16_F32_R1_3,
GGML_METAL_KERNEL_TYPE_MUL_MV_EXT_F16_F32_R1_4,
@@ -288,6 +290,7 @@ enum ggml_metal_kernel_type {
GGML_METAL_KERNEL_TYPE_MUL_MV_ID_IQ1_M_F32,
GGML_METAL_KERNEL_TYPE_MUL_MV_ID_IQ4_NL_F32,
GGML_METAL_KERNEL_TYPE_MUL_MV_ID_IQ4_XS_F32,
+ GGML_METAL_KERNEL_TYPE_MUL_MV_ID_MXFP4_F32,
GGML_METAL_KERNEL_TYPE_MUL_MM_F32_F32,
GGML_METAL_KERNEL_TYPE_MUL_MM_F16_F32,
GGML_METAL_KERNEL_TYPE_MUL_MM_BF16_F32,
@@ -310,6 +313,7 @@ enum ggml_metal_kernel_type {
GGML_METAL_KERNEL_TYPE_MUL_MM_IQ1_M_F32,
GGML_METAL_KERNEL_TYPE_MUL_MM_IQ4_NL_F32,
GGML_METAL_KERNEL_TYPE_MUL_MM_IQ4_XS_F32,
+ GGML_METAL_KERNEL_TYPE_MUL_MM_MXFP4_F32,
GGML_METAL_KERNEL_TYPE_MUL_MM_ID_MAP0_F16,
GGML_METAL_KERNEL_TYPE_MUL_MM_ID_MAP1_F32,
GGML_METAL_KERNEL_TYPE_MUL_MM_ID_F32_F16,
@@ -334,6 +338,7 @@ enum ggml_metal_kernel_type {
GGML_METAL_KERNEL_TYPE_MUL_MM_ID_IQ1_M_F16,
GGML_METAL_KERNEL_TYPE_MUL_MM_ID_IQ4_NL_F16,
GGML_METAL_KERNEL_TYPE_MUL_MM_ID_IQ4_XS_F16,
+ GGML_METAL_KERNEL_TYPE_MUL_MM_ID_MXFP4_F16,
GGML_METAL_KERNEL_TYPE_ROPE_NORM_F32,
GGML_METAL_KERNEL_TYPE_ROPE_NORM_F16,
GGML_METAL_KERNEL_TYPE_ROPE_MULTI_F32,
@@ -934,7 +939,7 @@ static id<MTLLibrary> ggml_metal_load_library(id<MTLDevice> device, bool use_bfl
MTLCompileOptions * options = [MTLCompileOptions new];
options.preprocessorMacros = prep;
-
+
//[options setFastMathEnabled:false];
metal_library = [device newLibraryWithSource:src options:options error:&error];
@@ -1157,6 +1162,7 @@ static struct ggml_backend_metal_context * ggml_metal_init(ggml_backend_dev_t de
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MV_Q5_0_F32, mul_mv_q5_0_f32, has_simdgroup_reduction);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MV_Q5_1_F32, mul_mv_q5_1_f32, has_simdgroup_reduction);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MV_Q8_0_F32, mul_mv_q8_0_f32, has_simdgroup_reduction);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MV_MXFP4_F32, mul_mv_mxfp4_f32, has_simdgroup_reduction);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MV_EXT_F16_F32_R1_2, mul_mv_ext_f16_f32_r1_2, has_simdgroup_reduction);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MV_EXT_F16_F32_R1_3, mul_mv_ext_f16_f32_r1_3, has_simdgroup_reduction);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MV_EXT_F16_F32_R1_4, mul_mv_ext_f16_f32_r1_4, has_simdgroup_reduction);
@@ -1236,6 +1242,7 @@ static struct ggml_backend_metal_context * ggml_metal_init(ggml_backend_dev_t de
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MV_ID_IQ1_M_F32, mul_mv_id_iq1_m_f32, has_simdgroup_reduction);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MV_ID_IQ4_NL_F32, mul_mv_id_iq4_nl_f32, has_simdgroup_reduction);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MV_ID_IQ4_XS_F32, mul_mv_id_iq4_xs_f32, has_simdgroup_reduction);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MV_ID_MXFP4_F32, mul_mv_id_mxfp4_f32, has_simdgroup_reduction);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_F32_F32, mul_mm_f32_f32, has_simdgroup_mm);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_F16_F32, mul_mm_f16_f32, has_simdgroup_mm);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_BF16_F32, mul_mm_bf16_f32, has_simdgroup_mm && use_bfloat);
@@ -1258,6 +1265,7 @@ static struct ggml_backend_metal_context * ggml_metal_init(ggml_backend_dev_t de
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_IQ1_M_F32, mul_mm_iq1_m_f32, has_simdgroup_mm);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_IQ4_NL_F32, mul_mm_iq4_nl_f32, has_simdgroup_mm);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_IQ4_XS_F32, mul_mm_iq4_xs_f32, has_simdgroup_mm);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_MXFP4_F32, mul_mm_mxfp4_f32, has_simdgroup_mm);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_ID_MAP0_F16, mul_mm_id_map0_f16, has_simdgroup_mm);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_ID_MAP1_F32, mul_mm_id_map1_f32, has_simdgroup_mm);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_ID_F32_F16, mul_mm_id_f32_f16, has_simdgroup_mm);
@@ -1282,6 +1290,7 @@ static struct ggml_backend_metal_context * ggml_metal_init(ggml_backend_dev_t de
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_ID_IQ1_M_F16, mul_mm_id_iq1_m_f16, has_simdgroup_mm);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_ID_IQ4_NL_F16, mul_mm_id_iq4_nl_f16, has_simdgroup_mm);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_ID_IQ4_XS_F16, mul_mm_id_iq4_xs_f16, has_simdgroup_mm);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MUL_MM_ID_MXFP4_F16, mul_mm_id_mxfp4_f16, has_simdgroup_mm);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_ROPE_NORM_F32, rope_norm_f32, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_ROPE_NORM_F16, rope_norm_f16, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_ROPE_MULTI_F32, rope_multi_f32, true);
@@ -3007,6 +3016,7 @@ static bool ggml_metal_encode_node(
case GGML_TYPE_IQ1_M: pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MM_IQ1_M_F32 ].pipeline; break;
case GGML_TYPE_IQ4_NL: pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MM_IQ4_NL_F32 ].pipeline; break;
case GGML_TYPE_IQ4_XS: pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MM_IQ4_XS_F32 ].pipeline; break;
+ case GGML_TYPE_MXFP4: pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MM_MXFP4_F32 ].pipeline; break;
default: GGML_ABORT("MUL MAT-MAT not implemented");
}
@@ -3212,6 +3222,12 @@ static bool ggml_metal_encode_node(
smem = 32*sizeof(float);
pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MV_IQ4_XS_F32].pipeline;
} break;
+ case GGML_TYPE_MXFP4:
+ {
+ nsg = N_SG_MXFP4;
+ nr0 = N_R0_MXFP4;
+ pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MV_MXFP4_F32].pipeline;
+ } break;
default:
{
GGML_LOG_ERROR("Asserting on type %d\n", (int)src0t);
@@ -3396,6 +3412,7 @@ static bool ggml_metal_encode_node(
case GGML_TYPE_IQ1_M: pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MM_ID_IQ1_M_F16 ].pipeline; break;
case GGML_TYPE_IQ4_NL: pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MM_ID_IQ4_NL_F16 ].pipeline; break;
case GGML_TYPE_IQ4_XS: pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MM_ID_IQ4_XS_F16 ].pipeline; break;
+ case GGML_TYPE_MXFP4: pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MM_ID_MXFP4_F16 ].pipeline; break;
default: GGML_ABORT("MUL_MAT_ID not implemented");
}
@@ -3607,6 +3624,12 @@ static bool ggml_metal_encode_node(
smem = 32*sizeof(float);
pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MV_ID_IQ4_XS_F32].pipeline;
} break;
+ case GGML_TYPE_MXFP4:
+ {
+ nsg = N_SG_MXFP4;
+ nr0 = N_R0_MXFP4;
+ pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MV_ID_MXFP4_F32].pipeline;
+ } break;
default:
{
GGML_LOG_ERROR("Asserting on type %d\n", (int)src2t);
diff --git a/ggml/src/ggml-metal/ggml-metal.metal b/ggml/src/ggml-metal/ggml-metal.metal
index 08e8d807..69fa17de 100644
--- a/ggml/src/ggml-metal/ggml-metal.metal
+++ b/ggml/src/ggml-metal/ggml-metal.metal
@@ -1902,16 +1902,16 @@ void mul_vec_q_n_f32_impl(
device const char * src1,
device char * dst,
threadgroup char * shmem,
- uint3 tgpig,
- ushort tiisg,
- ushort sgitg) {
- const int nb = args.ne00/QK4_0;
+ uint3 tgpig, // Threadgroup Position in Grid
+ ushort tiisg, // Thread Index in SIMD Group
+ ushort sgitg) { // SIMD Group Index in ThreadGroup
+ const int nb = args.ne00/QK4_0; // src0->ne[0] / 32
const int r0 = tgpig.x;
const int r1 = tgpig.y;
const int im = tgpig.z;
- const int first_row = (r0 * nsg + sgitg) * nr0;
+ const int first_row = (r0 * nsg + sgitg) * nr0; // nsg=2 nr0=4
const uint i12 = im%args.ne12;
const uint i13 = im/args.ne12;
@@ -6744,6 +6744,49 @@ kernel void kernel_mul_mm_id(
}
}
+template <typename type4x4>
+void dequantize_mxfp4(device const block_mxfp4 * xb, short il, thread type4x4 & reg) {
+ float4x4 reg_f;
+ const ushort dst_bias = 15;
+ const ushort dst_0p5 = 0x3800;
+ const ushort dst_m_bits = 10;
+ const half scale = (half)(as_type<float>(((uint32_t)xb->d) << 23));
+ // il:0 first 16, il:1 last 16
+ for (int i = 0; i < 8; i++) {
+ ushort em0 = xb->qs[il*8 + i] & 0x07;
+ ushort em1 = xb->qs[il*8 + i] & 0x70;
+ // float16 values
+ ushort x0 = (em0 << (dst_m_bits - 1)) | ((xb->qs[il*8 + i] & 0x08) << 12);
+ ushort x1 = (em1 << (dst_m_bits - 5)) | ((xb->qs[il*8 + i] & 0x80) << 8);
+
+ // Three cases:
+ // x is normal and non-zero: Correct bias
+ if ((em0 & 0x06) != 0) {
+ x0 = x0 + ((dst_bias - 1) << dst_m_bits);
+ }
+ if ((em1 & 0x60) != 0) {
+ x1 = x1 + ((dst_bias - 1) << dst_m_bits);
+ }
+ // x is subnormal (x == 0bs001 where s is the sign): Map to +-0.5 in the dst type
+ if (em0 == 0x01) {
+ x0 = dst_0p5 | (x0 & 0x8000);
+ }
+ if (em1 == 0x10) {
+ x1 = dst_0p5 | (x1 & 0x8000);
+ }
+ // x is zero, do nothing
+
+ if (isnan(scale)) {
+ reg_f[i/2][2*(i%2) + 0] = scale;
+ reg_f[i/2][2*(i%2) + 1] = scale;
+ } else {
+ reg_f[i/2][2*(i%2) + 0] = scale * as_type<half>(x0);
+ reg_f[i/2][2*(i%2) + 1] = scale * as_type<half>(x1);
+ }
+ }
+ reg = (type4x4) reg_f;
+}
+
#define QK_NL 16
//
@@ -6811,6 +6854,8 @@ template [[host_name("kernel_mul_mm_iq1_m_f32")]] kernel mul_mm_t kernel_mul_m
template [[host_name("kernel_mul_mm_iq4_nl_f32")]] kernel mul_mm_t kernel_mul_mm<half, half4x4, simdgroup_half8x8, block_iq4_nl, 2, dequantize_iq4_nl>;
template [[host_name("kernel_mul_mm_iq4_xs_f32")]] kernel mul_mm_t kernel_mul_mm<half, half4x4, simdgroup_half8x8, block_iq4_xs, QK_NL, dequantize_iq4_xs>;
+template [[host_name("kernel_mul_mm_mxfp4_f32")]] kernel mul_mm_t kernel_mul_mm<half, half4x4, simdgroup_half8x8, block_mxfp4, 2, dequantize_mxfp4>;
+
//
// indirect matrix-matrix multiplication
//
@@ -6842,6 +6887,8 @@ template [[host_name("kernel_mul_mm_id_iq1_m_f16")]] kernel mul_mm_id kernel_m
template [[host_name("kernel_mul_mm_id_iq4_nl_f16")]] kernel mul_mm_id kernel_mul_mm_id<half, half4x4, simdgroup_half8x8, block_iq4_nl, 2, dequantize_iq4_nl>;
template [[host_name("kernel_mul_mm_id_iq4_xs_f16")]] kernel mul_mm_id kernel_mul_mm_id<half, half4x4, simdgroup_half8x8, block_iq4_xs, QK_NL, dequantize_iq4_xs>;
+template [[host_name("kernel_mul_mm_id_mxfp4_f16")]] kernel mul_mm_id kernel_mul_mm_id<half, half4x4, simdgroup_half8x8, block_mxfp4, 2, dequantize_mxfp4>;
+
//
// matrix-vector multiplication
@@ -6958,6 +7005,120 @@ kernel void kernel_mul_mv_id(
sgitg);
}
+// MXFP4 implementation derived from mul_vec_q_n_f32_impl and block_q_n_dot_y
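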
+void mul_mv_mxfp4_f32_impl(
+ ggml_metal_kargs_mul_mv args,
+ device const char * src0,
+ device const char * src1,
+ device char * dst,
+ threadgroup char * shmem,
+ uint3 tgpig,
+ ushort tiisg,
+ ushort sgitg) {
+ const ushort dst_bias = 15;
+ const ushort dst_0p5 = 0x3800;
+ const ushort dst_m_bits = 10;
+ const int nr0 = N_R0_MXFP4;
+ const int nsg = N_SG_MXFP4;
+ const int nw = N_SIMDWIDTH;
+ const int nb = args.ne00/MXFP4;
+
+ const int r0 = tgpig.x;
+ const int r1 = tgpig.y;
+ const int im = tgpig.z;
+
+ const int first_row = (r0 * nsg + sgitg) * nr0;
+
+ const uint i12 = im%args.ne12;
+ const uint i13 = im/args.ne12;
+
+ const uint64_t offset1 = r1*args.nb11 + (i12 )*args.nb12 + (i13 )*args.nb13;
+
+ device const float * y = (device const float *) (src1 + offset1);
+
+ // pointers to src0 rows
+ device const block_mxfp4 * ax[nr0];
+ for (int row = 0; row < nr0; ++row) {
+ const uint64_t offset0 = (first_row + row)*args.nb01 + (i12/args.r2)*args.nb02 + (i13/args.r3)*args.nb03;
+
+ ax[row] = (device const block_mxfp4 *) ((device char *) src0 + offset0);
+ }
+
+ float yl[16]; // src1 vector cache
+ float sumf[nr0] = {0.f};
+
+ const short ix = (tiisg/2);
+ const short il = (tiisg%2)*16;
+
+ device const float * yb = y + ix*MXFP4 + il;
+
+ // each thread in a SIMD group deals with half a block.
+ for (int ib = ix; ib < nb; ib += nw/2) {
+
+#pragma unroll
+ for (short row = 0; row < nr0; row++) {
+ // Processes 16 items
+ device const block_mxfp4 * qb_curr = ax[row] + ib;
+ float d = as_type<float>(((uint32_t)(ax[row] + ib)->d) << 23);
+ // il = 0 or 16
+ device const uint8_t *qs = ((device const uint8_t *) qb_curr + 1 + il/2);
+ for (int i = 0; i < 8; ++i) {
+ ushort em0 = qs[i] & 0x07;
+ ushort em1 = qs[i] & 0x70;
+ ushort x0 = (em0 << (dst_m_bits - 1)) | ((qs[i] & 0x08) << 12);
+ ushort x1 = (em1 << (dst_m_bits - 5)) | ((qs[i] & 0x80) << 8);
+ // Three cases:
+ // x is normal and non-zero: Correct bias
+ if ((em0 & 0x06) != 0) {
+ x0 = x0 + ((dst_bias - 1) << dst_m_bits);
+ }
+ if ((em1 & 0x60) != 0) {
+ x1 = x1 + ((dst_bias - 1) << dst_m_bits);
+ }
+ // x is subnormal (x == 0bs001 where s is the sign): Map to +-0.5 in the dst type
+ if (em0 == 0x01) {
+ x0 = dst_0p5 | (x0 & 0x8000);
+ }
+ if (em1 == 0x10) {
+ x1 = dst_0p5 | (x1 & 0x8000);
+ }
+ // x is zero, do nothing
+ if (!isnan(d)) {
+ sumf[row] += yb[i*2] * as_type<half>(x0) * d
+ + yb[i*2+1] * as_type<half>(x1) * d;
+ } else {
+ sumf[row] = d;
+ }
+ }
+ }
+
+ yb += MXFP4 * 16;
+ }
+
+ device float * dst_f32 = (device float *) dst + im*args.ne0*args.ne1 + r1*args.ne0;
+
+ for (int row = 0; row < nr0; ++row) {
+ const float tot = simd_sum(sumf[row]);
+
+ if (tiisg == 0 && first_row + row < args.ne01) {
+ dst_f32[first_row + row] = tot;
+ }
+ }
+}
+
+[[host_name("kernel_mul_mv_mxfp4_f32")]]
+kernel void kernel_mul_mv_mxfp4_f32(
+ constant ggml_metal_kargs_mul_mv & args,
+ device const char * src0,
+ device const char * src1,
+ device char * dst,
+ threadgroup char * shmem [[threadgroup(0)]],
+ uint3 tgpig[[threadgroup_position_in_grid]],
+ ushort tiisg[[thread_index_in_simdgroup]],
+ ushort sgitg[[simdgroup_index_in_threadgroup]]) {
+ mul_mv_mxfp4_f32_impl(args, src0, src1, dst, shmem, tgpig, tiisg, sgitg);
+}
+
typedef decltype(kernel_mul_mv_id<mmv_fn<kernel_mul_mv_impl<float, float4, float, float4>>>) kernel_mul_mv_id_t;
template [[host_name("kernel_mul_mv_id_f32_f32")]] kernel kernel_mul_mv_id_t kernel_mul_mv_id<mmv_fn<kernel_mul_mv_impl<float, float4, float, float4>>>;
@@ -6987,6 +7148,8 @@ template [[host_name("kernel_mul_mv_id_iq2_s_f32")]] kernel kernel_mul_mv_id_t
template [[host_name("kernel_mul_mv_id_iq4_nl_f32")]] kernel kernel_mul_mv_id_t kernel_mul_mv_id<mmv_fn<kernel_mul_mv_iq4_nl_f32_impl <N_R0_IQ4_NL, N_SG_IQ4_NL, N_SIMDWIDTH>>>;
template [[host_name("kernel_mul_mv_id_iq4_xs_f32")]] kernel kernel_mul_mv_id_t kernel_mul_mv_id<mmv_fn<kernel_mul_mv_iq4_xs_f32_impl <N_R0_IQ4_XS, N_SG_IQ4_XS, N_SIMDWIDTH>>>;
+template [[host_name("kernel_mul_mv_id_mxfp4_f32")]] kernel kernel_mul_mv_id_t kernel_mul_mv_id<mmv_fn<mul_mv_mxfp4_f32_impl>>;
+
kernel void kernel_pool_2d_max_f32(
device const float * src0,
device float * dst,
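In the Metal kernel above, each SIMD lane owns half of an MXFP4 block: ix = tiisg/2 selects the starting block and il = (tiisg%2)*16 selects which 16 elements within it, so a 32-wide SIMD group covers 16 blocks (512 elements) per loop iteration. A tiny host-side sketch of that mapping (a standalone C illustration, not Metal code):

#include <stdio.h>

#define MXFP4 32
#define N_SIMDWIDTH 32

int main(void) {
    // For each lane, print the first block and element range it handles;
    // blocks then advance by N_SIMDWIDTH/2 = 16 per iteration (yb += MXFP4*16).
    for (int tiisg = 0; tiisg < N_SIMDWIDTH; ++tiisg) {
        const int ix = tiisg / 2;         // starting block
        const int il = (tiisg % 2) * 16;  // first or second half of the block
        printf("lane %2d -> block %2d, elements [%3d..%3d)\n",
               tiisg, ix, ix*MXFP4 + il, ix*MXFP4 + il + 16);
    }
    return 0;
}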
diff --git a/ggml/src/ggml-quants.c b/ggml/src/ggml-quants.c
index 84ec6dfe..17c308aa 100644
--- a/ggml/src/ggml-quants.c
+++ b/ggml/src/ggml-quants.c
@@ -4925,6 +4925,144 @@ void quantize_row_iq2_s_ref(const float * GGML_RESTRICT x, block_iq2_s * GGML_RE
quantize_iq2_s(x, y, 1, k, NULL);
}
+// =============================== mxfp4 (de)-quantization
+
+void quantize_row_mxfp4_ref(const float * GGML_RESTRICT x, block_mxfp4 * GGML_RESTRICT y, int64_t k) {
+ static const int qk = MXFP4;
+ static const uint32_t E8_BIAS = 127;
+ static const uint32_t E2_BIAS = 1;
+
+ assert(k % qk == 0);
+
+ const int nb = k / qk;
+
+ for (int i = 0; i < nb; i++) {
+ float amax = 0.0f; // absolute max
+
+ for (int j = 0; j < qk; j++) {
+ const float v = x[i*qk + j];
+ if (amax < fabsf(v)) {
+ amax = fabsf(v);
+ }
+ }
+
+ const float dequant_scale = amax / 6.0f;
+ uint32_t dequant_scale_exponent = 0;
+ memcpy(&dequant_scale_exponent, &dequant_scale, sizeof(dequant_scale_exponent));
+
+ // Rounding up
+ dequant_scale_exponent = (dequant_scale_exponent + 0x007FFFFF) & 0x7F800000;
+ // Rounding down
+ // dequant_scale_exponent = dequant_scale_exponent & 0x7F800000;
+
+ float dequant_scale_rounded = 0.0f;
+ memcpy(&dequant_scale_rounded, &dequant_scale_exponent, sizeof(dequant_scale_rounded));
+ float quant_scale = 0.0f;
+ if (dequant_scale_rounded != 0.0f) {
+ quant_scale = 1.0f / dequant_scale_rounded;
+ }
+
+ y[i].d = (uint8_t)(dequant_scale_exponent >> 23);
+
+ for (int j = 0; j < qk/2; ++j) {
+ const float x0 = x[i*qk + j*2]*quant_scale;
+ const float x1 = x[i*qk + j*2+1]*quant_scale;
+
+ uint32_t xi0 = 0;
+ uint32_t xi1 = 0;
+ memcpy(&xi0, &x0, sizeof(xi0));
+ memcpy(&xi1, &x1, sizeof(xi1));
+
+ uint32_t s0 = xi0 & 0x80000000;
+ uint32_t s1 = xi1 & 0x80000000;
+ uint32_t e0 = (xi0 >> 23) & 0xFF;
+ uint32_t e1 = (xi1 >> 23) & 0xFF;
+ uint32_t m0 = (xi0 & 0x7FFFFF);
+ uint32_t m1 = (xi1 & 0x7FFFFF);
+
+ // 0.25 <= x < 0.75 maps to 0.5, a denormal number
+ // Move implicit bit 1 at the beginning to mantissa for denormals
+ // adjusted_exponents
+ uint32_t ae0 = E8_BIAS - (e0 + 1);
+ uint32_t ae1 = E8_BIAS - (e1 + 1);
+ if (e0 < E8_BIAS) {
+ m0 = (0x400000 | (m0 >> 1)) >> ae0;
+ }
+ if (e1 < E8_BIAS) {
+ m1 = (0x400000 | (m1 >> 1)) >> ae1;
+ }
+
+ // For normal numbers, we change the bias from 127 to 1, and for subnormals, we keep exponent as 0.
+ e0 = MAX(e0, E8_BIAS - E2_BIAS) - (E8_BIAS - E2_BIAS);
+ e1 = MAX(e1, E8_BIAS - E2_BIAS) - (E8_BIAS - E2_BIAS);
+
+ // Combine sign, exponent, and mantissa, while saturating
+ // rounding nearest with tie breaking up by adding +1 to one bit right of the LSB, then shift right
+ uint32_t tmp0 = MIN((((e0 << 2) | (m0 >> 21)) + 1) >> 1, 0x7);
+ uint32_t tmp1 = MIN((((e1 << 2) | (m1 >> 21)) + 1) >> 1, 0x7);
+ uint8_t v0 = (uint8_t)((s0 >> 28) | tmp0);
+ uint8_t v1 = (uint8_t)((s1 >> 28) | tmp1);
+ y[i].qs[j] = v0;
+ y[i].qs[j] |= v1 << 4;
+ }
+ }
+}
+
+void dequantize_row_mxfp4(const block_mxfp4 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k) {
+ assert(k % MXFP4 == 0);
+
+ const int nb = k / MXFP4;
+ const uint16_t dst_bias = 15;
+ const uint16_t dst_0p5 = 0x3800;
+ const uint16_t dst_m_bits = 10;
+
+ for (int i = 0; i < nb; i++) {
+ union {
+ uint32_t as_bits;
+ float as_value;
+ } scale;
+ scale.as_bits = (((uint32_t)x[i].d) << 23);
+ for (int j = 0; j < MXFP4/2; ++j) {
+ uint16_t em0 = x[i].qs[j] & 0x07;
+ uint16_t em1 = x[i].qs[j] & 0x70;
+ // float16 values
+ uint16_t x0 = (em0 << (dst_m_bits - 1)) | ((x[i].qs[j] & 0x08) << 12);
+ uint16_t x1 = (em1 << (dst_m_bits - 5)) | ((x[i].qs[j] & 0x80) << 8);
+
+ // Three cases:
+ // x is normal and non-zero: Correct bias
+ if ((em0 & 0x06) != 0) {
+ x0 = x0 + ((dst_bias - 1) << dst_m_bits);
+ }
+ if ((em1 & 0x60) != 0) {
+ x1 = x1 + ((dst_bias - 1) << dst_m_bits);
+ }
+ // x is subnormal (x == 0bs001 where s is the sign): Map to +-0.5 in the dst type
+ if (em0 == 0x01) {
+ x0 = dst_0p5 | (x0 & 0x8000);
+ }
+ if (em1 == 0x10) {
+ x1 = dst_0p5 | (x1 & 0x8000);
+ }
+ // x is zero, do nothing
+
+ if (isnan(scale.as_value)) {
+ y[i*MXFP4 + j*2] = scale.as_value;
+ y[i*MXFP4 + j*2+1] = scale.as_value;
+ } else {
+ y[i*MXFP4 + j*2] = GGML_FP16_TO_FP32(x0)*scale.as_value;
+ y[i*MXFP4 + j*2+1] = GGML_FP16_TO_FP32(x1)*scale.as_value;
+ }
+ }
+ }
+}
+
+
+size_t quantize_mxfp4(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrow, int64_t n_per_row, const float * quant_weights) {
+ quantize_row_mxfp4_ref(src, dst, (int64_t)nrow*n_per_row);
+ return nrow * ggml_row_size(GGML_TYPE_MXFP4, n_per_row);
+}
+
// =============================== data validation
static bool validate_float(float f, size_t i) {
@@ -5214,7 +5352,9 @@ bool ggml_validate_row_data(enum ggml_type type, const void * data, size_t nbyte
{
VALIDATE_ROW_DATA_D_F16_IMPL(block_iq4_nl, data, nb);
} break;
-
+ case GGML_TYPE_MXFP4:
+ // TODO - anything to validate?
+ break;
case GGML_TYPE_I8:
case GGML_TYPE_I16:
case GGML_TYPE_I32:
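With the reference (de)quantizers in place, a round trip through MXFP4 is straightforward; values land on the nearest point of the shared-scale E2M1 grid. A usage sketch, assuming the declarations above are visible via ggml-quants.h:

#include <stdio.h>
#include "ggml-quants.h" // quantize_row_mxfp4_ref, dequantize_row_mxfp4, block_mxfp4

int main(void) {
    float src[32], out[32];
    for (int i = 0; i < 32; ++i) {
        src[i] = 0.1f * (float)(i - 16); // -1.6 .. 1.5
    }

    block_mxfp4 blk;                     // one block covers exactly 32 values
    quantize_row_mxfp4_ref(src, &blk, 32);
    dequantize_row_mxfp4(&blk, out, 32);

    for (int i = 0; i < 32; ++i) {
        printf("%8.4f -> %8.4f\n", src[i], out[i]);
    }
    return 0;
}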
diff --git a/ggml/src/ggml-quants.h b/ggml/src/ggml-quants.h
index d09173e1..2fc40f75 100644
--- a/ggml/src/ggml-quants.h
+++ b/ggml/src/ggml-quants.h
@@ -37,6 +37,8 @@ GGML_API void quantize_row_iq4_xs_ref (const float * GGML_RESTRICT x, block_iq4_
GGML_API void quantize_row_iq3_s_ref (const float * GGML_RESTRICT x, block_iq3_s * GGML_RESTRICT y, int64_t k);
GGML_API void quantize_row_iq2_s_ref (const float * GGML_RESTRICT x, block_iq2_s * GGML_RESTRICT y, int64_t k);
+GGML_API void quantize_row_mxfp4_ref(const float * GGML_RESTRICT x, block_mxfp4 * GGML_RESTRICT y, int64_t k);
+
// Dequantization
GGML_API void dequantize_row_q4_0(const block_q4_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
GGML_API void dequantize_row_q4_1(const block_q4_1 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
@@ -65,6 +67,8 @@ GGML_API void dequantize_row_iq4_nl (const block_iq4_nl * GGML_RESTRICT x, floa
GGML_API void dequantize_row_iq4_xs (const block_iq4_xs * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
GGML_API void dequantize_row_iq3_s (const block_iq3_s * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
+GGML_API void dequantize_row_mxfp4(const block_mxfp4 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
+
// Quantization utilizing an importance matrix (a.k.a. "Activation aWare Quantization")
GGML_API size_t quantize_iq2_xxs(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrows, int64_t n_per_row, const float * imatrix);
GGML_API size_t quantize_iq2_xs (const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrows, int64_t n_per_row, const float * imatrix);
@@ -90,6 +94,8 @@ GGML_API size_t quantize_q5_0(const float * GGML_RESTRICT src, void * GGML_RESTR
GGML_API size_t quantize_q5_1(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrows, int64_t n_per_row, const float * imatrix);
GGML_API size_t quantize_q8_0(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrows, int64_t n_per_row, const float * imatrix);
+GGML_API size_t quantize_mxfp4(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrows, int64_t n_per_row, const float * imatrix);
+
GGML_API void iq2xs_init_impl(enum ggml_type type);
GGML_API void iq2xs_free_impl(enum ggml_type type);
GGML_API void iq3xs_init_impl(int grid_size);
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index 8a654624..0f3c9834 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -589,11 +589,13 @@ static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT] = {
.to_float = (ggml_to_float_t) dequantize_row_q4_1,
.from_float_ref = (ggml_from_float_t) quantize_row_q4_1_ref,
},
- [4] = { // GGML_TYPE_Q4_2
- .type_name = "DEPRECATED",
- .blck_size = 0,
- .type_size = 0,
- .is_quantized = false,
+ [GGML_TYPE_MXFP4] = { // formerly deprecated GGML_TYPE_Q4_2
+ .type_name = "mxfp4",
+ .blck_size = MXFP4,
+ .type_size = sizeof(block_mxfp4),
+ .is_quantized = true,
+ .to_float = (ggml_to_float_t) dequantize_row_mxfp4,
+ .from_float_ref = (ggml_from_float_t) quantize_row_mxfp4_ref,
},
[5] = { // GGML_TYPE_Q4_3
.type_name = "DEPRECATED",
@@ -6446,6 +6448,7 @@ size_t ggml_quantize_chunk(
case GGML_TYPE_IQ1_M: result = quantize_iq1_m (src + start, (char *) dst + start_row * row_size, nrows, n_per_row, imatrix); break;
case GGML_TYPE_IQ4_NL: result = quantize_iq4_nl (src + start, (char *) dst + start_row * row_size, nrows, n_per_row, imatrix); break;
case GGML_TYPE_IQ4_XS: result = quantize_iq4_xs (src + start, (char *) dst + start_row * row_size, nrows, n_per_row, imatrix); break;
+ case GGML_TYPE_MXFP4: result = quantize_mxfp4 (src + start, (char *) dst + start_row * row_size, nrows, n_per_row, imatrix); break;
case GGML_TYPE_F16:
{
size_t elemsize = sizeof(ggml_fp16_t);
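Registering the type in type_traits and ggml_quantize_chunk is what lets ggml's generic conversion paths treat MXFP4 like any other quantized format. A sketch of quantizing a small matrix through the public entry point (assumes linking against ggml, with the signature as declared in ggml.h):

#include <stdlib.h>
#include "ggml.h"

int main(void) {
    const int64_t n_per_row = 64, nrows = 4; // multiple of the 32-wide block
    float * src = malloc(sizeof(float) * n_per_row * nrows);
    for (int64_t i = 0; i < n_per_row * nrows; ++i) {
        src[i] = (float)(i % 7) - 3.0f;
    }

    const size_t row_size = ggml_row_size(GGML_TYPE_MXFP4, n_per_row);
    void * dst = malloc(row_size * nrows);

    // start = 0, no importance matrix (quantize_mxfp4 ignores it anyway)
    const size_t written = ggml_quantize_chunk(GGML_TYPE_MXFP4, src, dst,
                                               0, nrows, n_per_row, NULL);
    free(src);
    free(dst);
    return written == row_size * nrows ? 0 : 1;
}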