fix: use proper unified diff and Dockerfile for kimi patch (#7435)

Signed-off-by: Anant Sharma <anants@nvidia.com>

Unverified commit 15f978c1, authored Mar 16, 2026 by Anant Sharma, committed by GitHub Mar 16, 2026
5 changed files
--- a/recipes/kimi-k2.5/README.md
+++ b/recipes/kimi-k2.5/README.md
@@ -96,10 +96,10 @@ The nvidia variant supports text inference with reasoning parsing (`--dyn-reason
 The nvidia deploy manifests (`deploy.yaml`, `deploy-kvbm.yaml`) ship with a placeholder image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`.
 Before deploying, you must:

-1. Run the [patch script](trtllm/agg/nvidia/patch/) to build a patched image (appends `-patched` to the tag).
+1. Build a patched image via `docker build` with the `trtllm/agg/nvidia/patch/` context and `BASE_IMAGE` build-arg (see command below).
 2. Update the `image:` fields in the deploy YAML to reference the patched image.

-See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for full details on what the patch does.
+See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for details on what the patch does.

 ```bash
 # Set namespace
@@ -115,11 +115,10 @@ kubectl create secret generic hf-token-secret \
 kubectl apply -f model-cache/nvidia/ -n ${NAMESPACE}
 kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

-# Patch the container image (required — upstream support not yet available)
-# This produces: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
-cd trtllm/agg/nvidia/patch
-./patch-container.sh nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
-cd -
+# Patch the container image (required for nvidia weights)
+docker build --build-arg BASE_IMAGE=nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag \
+  -t nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched \
+  trtllm/agg/nvidia/patch/

 # Update the image in the deploy manifest to use the patched tag


--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/Dockerfile
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/Dockerfile
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Patches TensorRT-LLM with KimiK25ForConditionalGeneration support.
+# Upstream tracking PR: https://github.com/NVIDIA/TensorRT-LLM/pull/11816
+#
+# Usage:
+#   docker build --build-arg BASE_IMAGE=<image> -t <image>-patched .
+
+ARG BASE_IMAGE
+FROM ${BASE_IMAGE}
+
+USER root
+
+COPY kimi.patch /tmp/kimi.patch
+
+# Apply upstream diff — idempotent, fails if target file has diverged
+RUN SITE_PKGS=$(python3 -c "import sysconfig; print(sysconfig.get_path('purelib'))") && \
+    TARGET="$SITE_PKGS/tensorrt_llm/_torch/models/modeling_deepseekv3.py" && \
+    cd "$SITE_PKGS" && \
+    if patch -p1 --forward --fuzz=0 --dry-run < /tmp/kimi.patch > /dev/null 2>&1; then \
+        patch -p1 --forward --fuzz=0 < /tmp/kimi.patch; \
+    elif patch -p1 --reverse --fuzz=0 --dry-run < /tmp/kimi.patch > /dev/null 2>&1; then \
+        echo "Patch already applied, skipping."; \
+    else \
+        echo "ERROR: Patch failed — the target file may have changed upstream." >&2; \
+        echo "Try updating kimi.patch from https://github.com/NVIDIA/TensorRT-LLM/pull/11816" >&2; \
+        exit 1; \
+    fi && \
+    rm -f /tmp/kimi.patch
+
+# Smoke test
+RUN SITE_PKGS=$(python3 -c "import sysconfig; print(sysconfig.get_path('purelib'))") && \
+    grep -q '@register_auto_model("KimiK25ForConditionalGeneration")' \
+        "$SITE_PKGS/tensorrt_llm/_torch/models/modeling_deepseekv3.py" || \
+    { echo "ERROR: KimiK25ForConditionalGeneration not registered after patching" >&2; exit 1; }
+
+USER dynamo
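
Note: the forward/reverse dry-run decision in the Dockerfile's `RUN` step above can be sketched in Python as follows. This is an illustrative sketch, not part of the commit. The names `gnu_patch_probe`, `decide_patch_action`, and `is_registered` are hypothetical, and the probe assumes GNU `patch` is available on `PATH`. Only the `patch` flags and the grep string are taken from the Dockerfile.

```python
import subprocess
from typing import Callable

# Hypothetical type alias: a probe takes a direction flag ("--forward" or
# "--reverse") and returns True if the corresponding dry run succeeds.
Probe = Callable[[str], bool]


def gnu_patch_probe(patch_file: str, cwd: str) -> Probe:
    """Build a probe that dry-runs GNU patch with the Dockerfile's flags."""

    def probe(direction: str) -> bool:
        with open(patch_file, "rb") as fh:
            result = subprocess.run(
                ["patch", "-p1", direction, "--fuzz=0", "--dry-run"],
                stdin=fh,
                cwd=cwd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        return result.returncode == 0

    return probe


def decide_patch_action(probe: Probe) -> str:
    """Mirror the Dockerfile's if/elif/else branches."""
    if probe("--forward"):
        return "apply"  # the patch applies cleanly, so apply it for real
    if probe("--reverse"):
        return "skip"   # the reverse applies, so the patch is already in place
    return "fail"       # neither direction applies: the target has diverged


def is_registered(source: str) -> bool:
    """Equivalent of the smoke test's grep for the registration decorator."""
    return '@register_auto_model("KimiK25ForConditionalGeneration")' in source
```

Passing the probe in as a callable keeps the three-way decision testable without a real `patch` binary; `decide_patch_action(gnu_patch_probe("/tmp/kimi.patch", site_pkgs))` would exercise the real tool, where `site_pkgs` stands for the directory the Dockerfile resolves via `sysconfig`.
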
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/README.md
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/README.md
 # Kimi K2.5 TensorRT-LLM Patch

-Kimi K2.5 support has not yet been released in TensorRT-LLM ([tracking branch](https://github.com/NVIDIA/TensorRT-LLM/compare/main...feat/k25-demo)).
+Kimi K2.5 support has not yet been released in TensorRT-LLM ([tracking PR](https://github.com/NVIDIA/TensorRT-LLM/pull/11816)).

-This directory contains an append-only patch that registers `KimiK25ForConditionalGeneration` on top of the existing DeepSeek-V3 model code, letting you run Kimi K2.5 on TensorRT-LLM today.
+This directory contains a unified diff that registers `KimiK25ForConditionalGeneration` on top of the existing DeepSeek-V3 model code, letting you run Kimi K2.5 on TensorRT-LLM today.

 ## Quick start

-Patch a Dynamo docker image by running:
+Build a patched image:

 ```bash
-./patch-container.sh <docker-image>
+docker build --build-arg BASE_IMAGE=nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0 \
+  -t nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0-patched \
+  recipes/kimi-k2.5/trtllm/agg/nvidia/patch/
 ```

-For example:
-
-```bash
-./patch-container.sh nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
-# produces image:    nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
-```
-
-If `KimiK25ForConditionalGeneration` is already registered, the patch is skipped. The script is idempotent -- re-running it on an already-patched image is a no-op.
+The patch is applied via `patch -p1 --fuzz=0`:
+- If the target file has changed upstream, the build **fails loudly** instead of silently producing broken code.
+- If the patch is already applied, it is skipped (idempotent).
+- A smoke test verifies the class is registered before the build completes.

 ## Files

 | File | Description |
 |------|-------------|
-| `patch-container.sh` | Builds a patched docker image from a base Dynamo image |
-| `kimi.patch` | Appended to `modeling_deepseekv3.py` inside the container -- adds a thin `DeepseekV3ForCausalLM` subclass that extracts the Kimi text backbone config and remaps weight prefixes |
+| `Dockerfile` | Single-stage build that applies the patch to a base Dynamo image |
+| `kimi.patch` | Unified diff from [upstream PR #11816](https://github.com/NVIDIA/TensorRT-LLM/pull/11816) — adds `KimiK25ForConditionalGeneration` to `modeling_deepseekv3.py` |
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/kimi.patch
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/kimi.patch
-
-@register_auto_model("KimiK25ForConditionalGeneration")
-class KimiK25ForConditionalGeneration(DeepseekV3ForCausalLM):
-    """Kimi-K2.5 multimodal model (text-only path).
-
-    Extracts the DeepSeek-V3 text backbone from the composite config
-    and strips the ``language_model.`` weight prefix so that the
-    standard DeepseekV3ForCausalLM loading path works unchanged.
-
-    NOTE: Kimi-K2.5's text backbone sets ``num_nextn_predict_layers = 0``,
-    so MTP-based speculative decoding is not applicable to this model.
-    """
-
-    _LANG_PREFIX = "language_model."
-
-    def __init__(self, model_config: ModelConfig[PretrainedConfig]):
-        model_config = copy.copy(model_config)
-        if hasattr(model_config.pretrained_config, 'text_config'):
-            model_config._frozen = False
-            model_config.pretrained_config = model_config.pretrained_config.text_config
-            if model_config.quant_config.exclude_modules:
-                model_config.quant_config = copy.copy(model_config.quant_config)
-                p = self._LANG_PREFIX
-                mapped = []
-                for m in model_config.quant_config.exclude_modules:
-                    if m.startswith(p):
-                        rest = m[len(p):]
-                        if rest.startswith('layers.'):
-                            rest = 'model.' + rest
-                        mapped.append(rest)
+diff --git a/tensorrt_llm/_torch/models/modeling_deepseekv3.py b/tensorrt_llm/_torch/models/modeling_deepseekv3.py
+--- a/tensorrt_llm/_torch/models/modeling_deepseekv3.py
+++ b/tensorrt_llm/_torch/models/modeling_deepseekv3.py
+@@ -1866,3 +1866,46 @@ def post_load_weights(self):
             else:
-                        mapped.append(m)
-                model_config.quant_config.exclude_modules = mapped
-            model_config._frozen = True
-        super().__init__(model_config)
-
-    def load_weights(self, weights: ConsumableWeightsDict):
-        has_prefix = any(k.startswith("language_model.") for k in weights)
-        if has_prefix:
-            weights = filter_weights("language_model", weights)
-            weights = ConsumableWeightsDict(weights)
-        super().load_weights(weights)
\ No newline at end of file
+                 layer.next_layer_layernorm = self.model.layers[
+                     idx + 1].input_layernorm
+
+
+@register_auto_model("KimiK25ForConditionalGeneration")
+class KimiK25ForConditionalGeneration(DeepseekV3ForCausalLM):
+    """Kimi-K2.5 multimodal model (text-only path).
+
+    Extracts the DeepSeek-V3 text backbone from the composite config
+    and strips the ``language_model.`` weight prefix so that the
+    standard DeepseekV3ForCausalLM loading path works unchanged.
+
+    NOTE: Kimi-K2.5's text backbone sets ``num_nextn_predict_layers = 0``,
+    so MTP-based speculative decoding is not applicable to this model.
+    """
+
+    _LANG_PREFIX = "language_model."
+
+    def __init__(self, model_config: ModelConfig[PretrainedConfig]):
+        model_config = copy.copy(model_config)
+        if hasattr(model_config.pretrained_config, 'text_config'):
+            model_config._frozen = False
+            model_config.pretrained_config = model_config.pretrained_config.text_config
+            if model_config.quant_config.exclude_modules:
+                model_config.quant_config = copy.copy(model_config.quant_config)
+                p = self._LANG_PREFIX
+                mapped = []
+                for m in model_config.quant_config.exclude_modules:
+                    if m.startswith(p):
+                        rest = m[len(p):]
+                        if rest.startswith('layers.'):
+                            rest = 'model.' + rest
+                        mapped.append(rest)
+                    else:
+                        mapped.append(m)
+                model_config.quant_config.exclude_modules = mapped
+            model_config._frozen = True
+        super().__init__(model_config)
+
+    def load_weights(self, weights: ConsumableWeightsDict):
+        has_prefix = any(k.startswith("language_model.") for k in weights)
+        if has_prefix:
+            weights = filter_weights("language_model", weights)
+            weights = ConsumableWeightsDict(weights)
+        super().load_weights(weights)
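
Note: the prefix handling in the patched class above can be illustrated in isolation. The sketch below is not part of the commit. `remap_exclude_modules` restates the loop from `__init__` as a standalone function, and `strip_language_prefix` is a hypothetical stand-in for TensorRT-LLM's `filter_weights`, written under the assumption that it keeps only keys under the given prefix and removes that prefix. The real helper's behavior should be checked against the TensorRT-LLM source.

```python
from typing import Dict, List, TypeVar

T = TypeVar("T")

LANG_PREFIX = "language_model."


def remap_exclude_modules(modules: List[str], prefix: str = LANG_PREFIX) -> List[str]:
    """Strip the language-model prefix from quantization exclude entries.

    Entries that start with 'layers.' after stripping get a 'model.' prefix,
    matching the loop in the patched __init__. Other entries pass through.
    """
    mapped = []
    for name in modules:
        if name.startswith(prefix):
            rest = name[len(prefix):]
            if rest.startswith("layers."):
                rest = "model." + rest
            mapped.append(rest)
        else:
            mapped.append(name)
    return mapped


def strip_language_prefix(weights: Dict[str, T], prefix: str = LANG_PREFIX) -> Dict[str, T]:
    """Assumed behavior of the weight filtering step (hypothetical stand-in).

    If no key carries the prefix, the mapping is returned unchanged, as in
    the patched load_weights. Otherwise only prefixed keys are kept, with
    the prefix removed.
    """
    if not any(key.startswith(prefix) for key in weights):
        return dict(weights)
    return {
        key[len(prefix):]: value
        for key, value in weights.items()
        if key.startswith(prefix)
    }


if __name__ == "__main__":
    print(remap_exclude_modules(
        ["language_model.layers.0.mlp", "language_model.lm_head", "vision_tower.proj"]
    ))
    # prints ['model.layers.0.mlp', 'lm_head', 'vision_tower.proj']
```
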
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/patch-container.sh
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/patch-container.sh
-#!/usr/bin/env bash
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-set -euo pipefail
-
-if [[ $# -ne 1 ]]; then
-    echo "Usage: $0 <docker-image>"
-    echo "  Patches modeling_deepseekv3.py with KimiK25ForConditionalGeneration class."
-    echo "  Outputs: <docker-image>-patched"
-    exit 1
-fi
-
-SRC_IMAGE="$1"
-DST_IMAGE="${SRC_IMAGE}-patched"
-TARGET_FILE="/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/models/modeling_deepseekv3.py"
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-PATCH_FILE="${SCRIPT_DIR}/kimi.patch"
-
-if [[ ! -f "$PATCH_FILE" ]]; then
-    echo "ERROR: Patch file not found: $PATCH_FILE"
-    exit 1
-fi
-
-TMPDIR="$(mktemp -d)"
-trap 'rm -rf "$TMPDIR"' EXIT
-
-cp "$PATCH_FILE" "$TMPDIR/kimi.patch"
-
-cat > "$TMPDIR/Dockerfile" <<'DOCKERFILE'
-ARG BASE_IMAGE
-FROM ${BASE_IMAGE}
-
-ARG TARGET_FILE
-
-USER root
-
-COPY kimi.patch /opt/kimi.patch
-
-RUN if grep -q 'KimiK25ForConditionalGeneration' "${TARGET_FILE}"; then \
-        echo "Patch already applied, skipping."; \
-    else \
-        if ! head -50 "${TARGET_FILE}" | grep -q '^import copy'; then \
-            sed -i '1s/^/import copy\n/' "${TARGET_FILE}"; \
-        fi && \
-        echo "" >> "${TARGET_FILE}" && \
-        cat /opt/kimi.patch >> "${TARGET_FILE}"; \
-    fi && \
-    rm -f /opt/kimi.patch
-
-USER 1000
-DOCKERFILE
-
-echo "Building patched image: ${DST_IMAGE}"
-docker build \
-    --build-arg BASE_IMAGE="$SRC_IMAGE" \
-    --build-arg TARGET_FILE="$TARGET_FILE" \
-    -t "$DST_IMAGE" \
-    "$TMPDIR"
-
-echo "Done. Patched image: ${DST_IMAGE}"