fix(recipes): remove container patch requirements for kimi k2.5 recipes (#8199)

Unverified commit be70dae0, authored Apr 14, 2026 by Karen Chung, committed by GitHub Apr 14, 2026
8 changed files
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -66,7 +66,7 @@ These recipes are under active development and may require additional setup step
 | Model | Framework | Mode | GPUs | Deployment | Notes |
 |-------|-----------|------|------|------------|-------|
 | **[GLM-5-NVFP4](glm-5-nvfp4/sglang/disagg/)** | SGLang | Disagg Prefill/Decode | 20x GB200 | ✅ | NVFP4, EAGLE speculative decoding, TP16 decode + TP4 prefill. Requires [custom container build](glm-5-nvfp4/). |
-| **[nvidia/Kimi-K2.5-NVFP4](kimi-k2.5/trtllm/agg/nvidia/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling. Requires [container patch](kimi-k2.5/trtllm/agg/nvidia/patch/). Vision input not yet functional with the patch. |
+| **[nvidia/Kimi-K2.5-NVFP4](kimi-k2.5/trtllm/agg/nvidia/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling. Vision input not yet functional. |
 ## Recipe Structure

--- a/recipes/kimi-k2.5/README.md
+++ b/recipes/kimi-k2.5/README.md
@@ -9,7 +9,7 @@ There are two model weight variants, each with its own model download and deploy
 | Variant | Model | Status | Modality | Deploy Configs | Notes |
 |---------|-------|--------|----------|---------------|-------|
 | **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | Functional | Text only | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image, not yet performance-optimized |
-| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml), and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/) for `deploy.yaml` and `deploy-kvbm.yaml`, while `deploy-specdec.yaml` works with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional |
+| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml), and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | All configs are compatible with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional |
 All configurations use TP8, EP8, aggregated mode with KV-aware routing.
@@ -81,22 +81,15 @@ curl http://localhost:8000/v1/chat/completions \
 **Status:** Functional | **Modality:** Text only upstream support
-> **Experimental for standard and KVBM deployments**: Upstream TensorRT-LLM does not yet include native support for Kimi K2.5. This recipe works around that limitation by directly patching the container image with an append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path. See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch) for the patch script and full instructions.
-> **Functional**: [Speculative Decoding recipe](trtllm/agg/nvidia/deploy-specdec.yaml) doesn't need the patch and is optimized for performance.
 > **Text only:** Current upstream TensorRT-LLM supports Kimi-K2.5 models by loading the DeepSeek-V3
 > text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
 > processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
-The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`. The standard and KVBM deployments still require the Kimi patched TRT-LLM image, while the speculative decoding deployment in `deploy-specdec.yaml` works with a current top-of-tree Dynamo TRT-LLM image.
+The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`, as well as a deployment `deploy-specdec.yaml` that uses speculative decoding.
 ### Quick Start
-The nvidia deploy manifests use two image flows:
+The nvidia deploy manifests use the placeholder top-of-tree image: `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`
-- `deploy.yaml` and `deploy-kvbm.yaml` use the placeholder patched image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`
-- `deploy-specdec.yaml` uses the placeholder top-of-tree image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`
 Before deploying, update the `image:` fields in the manifest you plan to use.
@@ -114,12 +107,6 @@ kubectl create secret generic hf-token-secret \
 kubectl apply -f model-cache/nvidia/ -n ${NAMESPACE}
 kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
-# Patch the container image (required for nvidia weights)
-# Skip this step for Speculative Decoding recipe `deploy-specdec.yaml`
-docker build --build-arg BASE_IMAGE=nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag \
-  -t nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched \
-  trtllm/agg/nvidia/patch/
 # Update the image in the deploy manifest to use the container tag (or the patched tag)
 # Deploy
@@ -252,4 +239,3 @@ If `tool_calls` is missing with raw `<|tool_calls_section_begin|>` tokens in `co
 ## Notes
 - Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
-- The two basic recipes in the nvidia variant requires a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream in TensorRT-LLM
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/README.md
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/README.md
 # Kimi-K2.5 nvidia/Kimi-K2.5-NVFP4 — Aggregated Deployments on Kubernetes
-> Upstream TensorRT-LLM does not yet include native support for Kimi K2.5. This recipe works around that limitation by directly patching the container image with an append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path. See [`patch/`](patch/) for the patch script and full instructions.
-> **Note**: The two standard deployment (`deploy.yaml` and `deploy-kvbm.yaml`) for nvidia/Kimi-K2.5-NVFP4 model requires a patched TensorRT-LLM container image because upstream TRT-LLM support for Kimi K2.5 has not yet been released. You must build the patched image before deploying either configuration below. See patch/ for the script and instructions. **`deploy-specdec.yaml` speculative decoding recipe doesn't need the image patch**.
 > **Text only:** Current upstream TensorRT-LLM supports Kimi-K2.5 models by loading the DeepSeek-V3
 > text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
 > processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
@@ -22,7 +18,6 @@ This directory contains three aggregated deployment configurations for the `nvid
 - 1x8 B200 GPUs or 8x4 GB200 GPUs
 - A `hf-token-secret` Secret containing your Hugging Face token
 - A pre-existing `model-cache` PVC
-- `deploy.yaml` and `deploy-kvbm.yaml` require a patched image tag such as `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`. You must build a patched image and update the `image:` fields before deploying. See [patch instructions](patch/) for details.
 - `deploy-specdec.yaml` uses `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag` and works with a current top-of-tree Dynamo TRT-LLM image
 ---
@@ -32,7 +27,6 @@ This directory contains three aggregated deployment configurations for the `nvid
 Uses [`deploy.yaml`](deploy.yaml). This is the simpler configuration -- aggregated serving with KV-aware routing, no CPU-offloaded KV cache.
 ```bash
-# Update the image in deploy.yaml to your patched image, then:
 kubectl apply -f deploy.yaml -n ${NAMESPACE}
 ```
@@ -47,7 +41,6 @@ This creates:
 Uses [`deploy-kvbm.yaml`](deploy-kvbm.yaml). This configuration adds CPU-offloaded KV cache via the KV Block Manager (KVBM), which allows larger effective context by spilling KV cache to host memory.
 ```bash
-# Update the image in deploy-kvbm.yaml to your patched image, then:
 kubectl apply -f deploy-kvbm.yaml -n ${NAMESPACE}
 ```
@@ -83,7 +76,7 @@ This scrapes `/metrics` on port `6880` (named `kvbm`) every 5 seconds from worke
 ## Aggregated Deployment with EAGLE Speculative Decoding and KV-aware routing
-Uses [`deploy-specdec.yaml`](deploy-specdec.yaml). This performant configuration runs KV-aware aggregated serving with EAGLE speculative decoding on GB200 and does not require the patched image used by the standard and KVBM manifests.
+Uses [`deploy-specdec.yaml`](deploy-specdec.yaml). This performant configuration runs KV-aware aggregated serving with EAGLE speculative decoding on GB200.
 ### Speculative Decoding Prerequisites

--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy-kvbm.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy-kvbm.yaml
@@ -55,7 +55,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
      replicas: 1
    TrtllmWorker:
      componentType: worker
@@ -95,10 +95,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          # REQUIRED: replace with your patched image tag (run patch/patch-container.sh first).
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
-          # Upstream TRT-LLM does not support KimiK25ForConditionalGeneration without the patch.
-          # Example: ./patch/patch-container.sh <your-image> -> produces <your-image>-patched
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
          env:
          - name: TRTLLM_ENABLE_PDL
            value: "1"

--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy.yaml
@@ -51,7 +51,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
      replicas: 1
    TrtllmWorker:
      componentType: worker
@@ -84,10 +84,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          # REQUIRED: replace with your patched image tag (run patch/patch-container.sh first).
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
-          # Upstream TRT-LLM does not support KimiK25ForConditionalGeneration without the patch.
-          # Example: ./patch/patch-container.sh <your-image> -> produces <your-image>-patched
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
          env:
          - name: TRTLLM_ENABLE_PDL
            value: "1"
@@ -111,4 +108,4 @@ spec:
        limits:
          gpu: "8"
        requests:
          gpu: "8"
\ No newline at end of file
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/Dockerfile
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/Dockerfile
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Patches TensorRT-LLM with KimiK25ForConditionalGeneration support.
-# Upstream tracking PR: https://github.com/NVIDIA/TensorRT-LLM/pull/11816
-#
-# Usage:
-#   docker build --build-arg BASE_IMAGE=<image> -t <image>-patched .
-ARG BASE_IMAGE
-FROM ${BASE_IMAGE}
-USER root
-COPY kimi.patch /tmp/kimi.patch
-# Apply upstream diff — idempotent, fails if target file has diverged
-RUN SITE_PKGS=$(python3 -c "import sysconfig; print(sysconfig.get_path('purelib'))") && \
-    TARGET="$SITE_PKGS/tensorrt_llm/_torch/models/modeling_deepseekv3.py" && \
-    cd "$SITE_PKGS" && \
-    if patch -p1 --forward --fuzz=0 --dry-run < /tmp/kimi.patch > /dev/null 2>&1; then \
-        patch -p1 --forward --fuzz=0 < /tmp/kimi.patch; \
-    elif patch -p1 --reverse --fuzz=0 --dry-run < /tmp/kimi.patch > /dev/null 2>&1; then \
-        echo "Patch already applied, skipping."; \
-    else \
-        echo "ERROR: Patch failed — the target file may have changed upstream." >&2; \
-        echo "Try updating kimi.patch from https://github.com/NVIDIA/TensorRT-LLM/pull/11816" >&2; \
-        exit 1; \
-    fi && \
-    rm -f /tmp/kimi.patch
-# Smoke test
-RUN SITE_PKGS=$(python3 -c "import sysconfig; print(sysconfig.get_path('purelib'))") && \
-    grep -q '@register_auto_model("KimiK25ForConditionalGeneration")' \
-        "$SITE_PKGS/tensorrt_llm/_torch/models/modeling_deepseekv3.py" || \
-    { echo "ERROR: KimiK25ForConditionalGeneration not registered after patching" >&2; exit 1; }
-USER dynamo
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/README.md
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/README.md
-# Kimi K2.5 TensorRT-LLM Patch
-Kimi K2.5 support has not yet been released in TensorRT-LLM ([tracking PR](https://github.com/NVIDIA/TensorRT-LLM/pull/11816)).
-This directory contains a unified diff that registers `KimiK25ForConditionalGeneration` on top of the existing DeepSeek-V3 model code, letting you run Kimi K2.5 on TensorRT-LLM today.
-## Quick start
-Build a patched image:
-```bash
-docker build --build-arg BASE_IMAGE=nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0 \
-  -t nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0-patched \
-  recipes/kimi-k2.5/trtllm/agg/nvidia/patch/
-```
-The patch is applied via `patch -p1 --fuzz=0`:
- If the target file has changed upstream, the build **fails loudly** instead of silently producing broken code.
- If the patch is already applied, it is skipped (idempotent).
- A smoke test verifies the class is registered before the build completes.
-## Files
-| File | Description |
-|------|-------------|
-| `Dockerfile` | Single-stage build that applies the patch to a base Dynamo image |
-| `kimi.patch` | Unified diff from [upstream PR #11816](https://github.com/NVIDIA/TensorRT-LLM/pull/11816) — adds `KimiK25ForConditionalGeneration` to `modeling_deepseekv3.py` |
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/kimi.patch
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/kimi.patch
-diff --git a/tensorrt_llm/_torch/models/modeling_deepseekv3.py b/tensorrt_llm/_torch/models/modeling_deepseekv3.py
--- a/tensorrt_llm/_torch/models/modeling_deepseekv3.py
-+++ b/tensorrt_llm/_torch/models/modeling_deepseekv3.py
-@@ -1866,3 +1866,46 @@ def post_load_weights(self):
-             else:
-                 layer.next_layer_layernorm = self.model.layers[
-                     idx + 1].input_layernorm
-+
-+
-+@register_auto_model("KimiK25ForConditionalGeneration")
-+class KimiK25ForConditionalGeneration(DeepseekV3ForCausalLM):
-+    """Kimi-K2.5 multimodal model (text-only path).
-+
-+    Extracts the DeepSeek-V3 text backbone from the composite config
-+    and strips the ``language_model.`` weight prefix so that the
-+    standard DeepseekV3ForCausalLM loading path works unchanged.
-+
-+    NOTE: Kimi-K2.5's text backbone sets ``num_nextn_predict_layers = 0``,
-+    so MTP-based speculative decoding is not applicable to this model.
-+    """
-+
-+    _LANG_PREFIX = "language_model."
-+
-+    def __init__(self, model_config: ModelConfig[PretrainedConfig]):
-+        model_config = copy.copy(model_config)
-+        if hasattr(model_config.pretrained_config, 'text_config'):
-+            model_config._frozen = False
-+            model_config.pretrained_config = model_config.pretrained_config.text_config
-+            if model_config.quant_config.exclude_modules:
-+                model_config.quant_config = copy.copy(model_config.quant_config)
-+                p = self._LANG_PREFIX
-+                mapped = []
-+                for m in model_config.quant_config.exclude_modules:
-+                    if m.startswith(p):
-+                        rest = m[len(p):]
-+                        if rest.startswith('layers.'):
-+                            rest = 'model.' + rest
-+                        mapped.append(rest)
-+                    else:
-+                        mapped.append(m)
-+                model_config.quant_config.exclude_modules = mapped
-+            model_config._frozen = True
-+        super().__init__(model_config)
-+
-+    def load_weights(self, weights: ConsumableWeightsDict):
-+        has_prefix = any(k.startswith("language_model.") for k in weights)
-+        if has_prefix:
-+            weights = filter_weights("language_model", weights)
-+            weights = ConsumableWeightsDict(weights)
-+        super().load_weights(weights)
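
For reviewers who want to see what the removed `kimi.patch` did to `quant_config.exclude_modules`, the prefix remapping can be reproduced in isolation. This is a minimal sketch, not TensorRT-LLM code: the function name `remap_exclude_modules` and the module names in the example are hypothetical, and only the `language_model.` prefix and the `layers.` to `model.layers.` rule are taken from the patch above.

```python
LANG_PREFIX = "language_model."  # prefix the removed patch stripped


def remap_exclude_modules(exclude_modules, prefix=LANG_PREFIX):
    """Map composite-checkpoint module names onto the text-backbone layout.

    Hypothetical standalone helper that mirrors the loop in the removed patch.
    """
    mapped = []
    for name in exclude_modules:
        if name.startswith(prefix):
            rest = name[len(prefix):]
            # Bare "layers.*" entries are re-rooted under "model.".
            if rest.startswith("layers."):
                rest = "model." + rest
            mapped.append(rest)
        else:
            # Names without the prefix pass through unchanged.
            mapped.append(name)
    return mapped


if __name__ == "__main__":
    # Illustrative module names only; not taken from a real checkpoint.
    example = [
        "language_model.layers.0.mlp",
        "language_model.lm_head",
        "vision_tower.encoder",
    ]
    print(remap_exclude_modules(example))
```

With this commit the remapping no longer lives in the recipe; per the updated READMEs, all three manifests are expected to work with a current top-of-tree Dynamo TRT-LLM image.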