feat: kimi2.5 with nvidia's model weights and trtllm patch (#6842)

Unverified Commit 01baf4a3 authored Mar 04, 2026 by Biswa Panda Committed by GitHub Mar 04, 2026
9 changed files
--- a/recipes/kimi-k2.5/README.md
+++ b/recipes/kimi-k2.5/README.md
@@ -2,13 +2,18 @@
 Deployment recipe for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware routing.
-> **Note:** Support for the official **`nvidia/Kimi-K2.5-NVFP4`** checkpoint is in progress and will be added soon. The current recipe uses **`baseten-admin/Kimi-2.5-text-nvfp4-v3`**, a text-only variant where users can experience Kimi-K2.5 and its tool calling and reasoning capabilities.
 ## Available Configurations
-| Configuration | GPUs | Mode | Description |
+There are two model weight variants, each with its own model download and deploy manifests:
-|--------------|------|------|-------------|
-| [**trtllm/agg**](trtllm/agg/) | 8x GPU | Aggregated | TP8, EP8, KV-aware routing |
+| Variant | Model | Deploy Configs | Notes |
+|---------|-------|---------------|-------|
+| **nvidia** 🚧 | `nvidia/Kimi-K2.5-NVFP4` | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/) |
+| **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image |
+All configurations use TP8, EP8, aggregated mode with KV-aware routing.
+The **nvidia** variant also ships a KVBM (KV Block Manager) deployment, `deploy-kvbm.yaml`, which enables CPU offloading of the KV cache.
 ## Prerequisites
@@ -16,7 +21,7 @@ Deployment recipe for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware ro
 2. **GPU cluster** with B200 GPUs (8x per worker)
 3. **HuggingFace token** with access to the model
-## Quick Start
+## Quick Start (nvidia variant)
 ```bash
 # Set namespace
@@ -29,13 +34,20 @@ kubectl create secret generic hf-token-secret \
  -n ${NAMESPACE}
 # Download model (update storageClassName in model-cache/model-cache.yaml first!)
-kubectl apply -f model-cache/ -n ${NAMESPACE}
+kubectl apply -f model-cache/nvidia/ -n ${NAMESPACE}
 kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
+# Patch the container image (required for nvidia weights)
+cd trtllm/agg/nvidia/patch
+./patch-container.sh nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+cd -
 # Deploy
-kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
+kubectl apply -f trtllm/agg/nvidia/deploy.yaml -n ${NAMESPACE}
 ```
+For baseten weights, use `model-cache/baseten/` and `trtllm/agg/baseten/deploy.yaml` instead — no image patch needed.
 ## Test the Deployment
 ```bash
@@ -46,7 +58,7 @@ kubectl port-forward svc/kimi-k25-agg-frontend 8000:8000 -n ${NAMESPACE}
 curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
-    "model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
+    "model": "nvidia/Kimi-K2.5-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
@@ -54,11 +66,11 @@ curl http://localhost:8000/v1/chat/completions \
 ## Model Details
-- **Model**: `baseten-admin/Kimi-2.5-text-nvfp4-v3` (NV FP4 quantized, text-only)
+- **Model**: `nvidia/Kimi-K2.5-NVFP4` (NV FP4 quantized, text-only)
 - **Architecture**: MoE (Mixture-of-Experts), based on DeepSeek-V3 architecture
 - **Backend**: TensorRT-LLM (PyTorch backend)
 - **Parallelism**: TP8, EP8 (Expert Parallel)
- **Features**: Reasoning (chain-of-thought), tool calling (function calling)
 ## Hardware Requirements
@@ -74,7 +86,7 @@ The deployment uses `--dyn-reasoning-parser kimi_k25` to extract the model's cha
 curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
-    "model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
+    "model": "nvidia/Kimi-K2.5-NVFP4",
    "messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
    "max_tokens": 200
  }' | python3 -m json.tool
@@ -111,7 +123,7 @@ The deployment uses `--dyn-tool-call-parser kimi_k2` to extract function calls i
 curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
-    "model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
+    "model": "nvidia/Kimi-K2.5-NVFP4",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
@@ -166,4 +178,5 @@ If `tool_calls` is missing with raw `<|tool_calls_section_begin|>` tokens in `co
 ## Notes
 - Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
\ No newline at end of file
+- The nvidia variant requires a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream
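The curl smoke test in the README hunk above can also be scripted. The following is a minimal sketch using only the Python standard library; the endpoint URL and model name come from the README, while the helper names `build_chat_request` and `send` are hypothetical and not part of this commit.

```python
import json
import urllib.request

# Endpoint and model name are copied from the README's curl example.
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "nvidia/Kimi-K2.5-NVFP4"


def build_chat_request(prompt: str, max_tokens: int = 100,
                       model: str = MODEL,
                       endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build (but do not send) the POST request that the curl example issues."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the `kubectl port-forward` from the README active, `send(build_chat_request("Hello!"))` should return the decoded chat completion. For the baseten variant, pass `model="baseten-admin/Kimi-2.5-text-nvfp4-v3"` instead.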
--- a/recipes/kimi-k2.5/model-cache/model-download.yaml
+++ b/recipes/kimi-k2.5/model-cache/model-download.yaml
@@ -23,7 +23,7 @@ spec:
                name: hf-token-secret
          env:
            - name: MODEL_NAME
-              value: baseten-admin/Kimi-2.5-text-nvfp4-v3  #  text-only variant
+              value: baseten-admin/Kimi-2.5-text-nvfp4-v3
            - name: HF_HOME
              value: /model-store
            - name: HF_HUB_ENABLE_HF_TRANSFER

--- a/recipes/kimi-k2.5/model-cache/nvidia/model-download.yaml
+++ b/recipes/kimi-k2.5/model-cache/nvidia/model-download.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: model-download
+spec:
+  backoffLimit: 3
+  completions: 1
+  parallelism: 1
+  template:
+    metadata:
+      labels:
+        app: model-download
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: model-download
+          image: python:3.10-slim
+          command: ["sh", "-c"]
+          envFrom:
+            - secretRef:
+                name: hf-token-secret
+          env:
+            - name: MODEL_NAME
+              value: nvidia/Kimi-K2.5-NVFP4
+            - name: HF_HOME
+              value: /model-store
+            - name: HF_HUB_ENABLE_HF_TRANSFER
+              value: "1"
+          args:
+            - |
+              set -eux
+              pip install --no-cache-dir huggingface_hub hf_transfer
+              hf download $MODEL_NAME
+          volumeMounts:
+            - name: model-cache
+              mountPath: /model-store
+      volumes:
+      - name: model-cache
+        persistentVolumeClaim:
+          claimName: model-cache
\ No newline at end of file
--- a/recipes/kimi-k2.5/trtllm/agg/baseten/deploy.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/baseten/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: llm-config
+data:
+  config.yaml: |
+    max_batch_size: 128
+    max_num_tokens: 8448
+    max_seq_len: 8212
+    tensor_parallel_size: 8
+    moe_expert_parallel_size: 8
+    enable_attention_dp: true
+    pipeline_parallel_size: 1
+    print_iter_log: true
+    kv_cache_config:
+      free_gpu_memory_fraction: 0.75
+      dtype: fp8
+    cache_transceiver_config:
+      backend: UCX
+      max_tokens_in_buffer: 8448
+    trust_remote_code: true
+---
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: kimi-k25-agg
+spec:
+  backendFramework: trtllm
+  pvcs:
+    - name: model-cache
+      create: false
+  services:
+    Frontend:
+      componentType: frontend
+      extraPodSpec:
+        affinity:
+          podAntiAffinity:
+            requiredDuringSchedulingIgnoredDuringExecution:
+            - labelSelector:
+                matchExpressions:
+                - key: nvidia.com/dynamo-graph-deployment-name
+                  operator: In
+                  values:
+                  - kimi-k25-agg-frontend
+              topologyKey: kubernetes.io/hostname
+        mainContainer:
+          args:
+          - python3 -m dynamo.frontend --router-mode kv --http-port 8000
+          command:
+          - /bin/sh
+          - -c
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+      replicas: 1
+    TrtllmWorker:
+      componentType: worker
+      envFromSecret: hf-token-secret
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      sharedMemory:
+        size: 80Gi
+      extraPodSpec:
+        affinity:
+          nodeAffinity:
+            requiredDuringSchedulingIgnoredDuringExecution:
+              nodeSelectorTerms:
+              - matchExpressions:
+                - key: nvidia.com/gpu.present
+                  operator: In
+                  values:
+                  - "true"
+        mainContainer:
+          args:
+          - |
+            python3 -m dynamo.trtllm \
+              --model-path "${MODEL_NAME}" \
+              --served-model-name "${MODEL_NAME}" \
+              --extra-engine-args "${ENGINE_ARGS}" \
+              --tensor-parallel-size 8 \
+              --dyn-reasoning-parser kimi_k25 \
+              --dyn-tool-call-parser kimi_k2
+          command:
+          - /bin/sh
+          - -c
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          env:
+          - name: TRTLLM_ENABLE_PDL
+            value: "1"
+          - name: MODEL_NAME
+            value: baseten-admin/Kimi-2.5-text-nvfp4-v3
+          - name: ENGINE_ARGS
+            value: /opt/dynamo/configs/config.yaml
+          - name: HF_HOME
+            value: /opt/models
+          volumeMounts:
+          - mountPath: /opt/dynamo/configs
+            name: llm-config
+            readOnly: true
+          workingDir: /workspace/examples/backends/trtllm
+        volumes:
+        - configMap:
+            name: llm-config
+          name: llm-config
+      replicas: 1
+      resources:
+        limits:
+          gpu: "8"
+        requests:
+          gpu: "8"
\ No newline at end of file
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy-kvbm.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy-kvbm.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: llm-config-kimi-agg-kvbm
+data:
+  config.yaml: |
+    max_batch_size: 128
+    max_num_tokens: 8448
+    max_seq_len: 8212
+    tensor_parallel_size: 8
+    moe_expert_parallel_size: 8
+    enable_attention_dp: true
+    pipeline_parallel_size: 1
+    print_iter_log: true
+    kv_cache_config:
+      free_gpu_memory_fraction: 0.75
+      dtype: fp8
+    cache_transceiver_config:
+      backend: UCX
+      max_tokens_in_buffer: 8448
+    trust_remote_code: true
+    kv_connector_config:
+      connector_module: kvbm.trtllm_integration.connector
+      connector_scheduler_class: DynamoKVBMConnectorLeader
+      connector_worker_class: DynamoKVBMConnectorWorker
+---
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: kimi-k25-agg-kvbm
+spec:
+  backendFramework: trtllm
+  pvcs:
+    - name: model-cache
+      create: false
+  services:
+    Frontend:
+      componentType: frontend
+      extraPodSpec:
+        affinity:
+          podAntiAffinity:
+            requiredDuringSchedulingIgnoredDuringExecution:
+            - labelSelector:
+                matchExpressions:
+                - key: nvidia.com/dynamo-graph-deployment-name
+                  operator: In
+                  values:
+                  - kimi-k25-agg-kvbm-frontend
+              topologyKey: kubernetes.io/hostname
+        mainContainer:
+          args:
+          - python3 -m dynamo.frontend --router-mode kv --http-port 8000
+          command:
+          - /bin/sh
+          - -c
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+      replicas: 1
+    TrtllmWorker:
+      componentType: worker
+      envFromSecret: hf-token-secret
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      sharedMemory:
+        size: 80Gi
+      extraPodSpec:
+        affinity:
+          nodeAffinity:
+            requiredDuringSchedulingIgnoredDuringExecution:
+              nodeSelectorTerms:
+              - matchExpressions:
+                - key: nvidia.com/gpu.present
+                  operator: In
+                  values:
+                  - "true"
+        mainContainer:
+          args:
+          - |
+            python3 -m dynamo.trtllm \
+              --model-path "${MODEL_NAME}" \
+              --served-model-name "${MODEL_NAME}" \
+              --extra-engine-args "${ENGINE_ARGS}" \
+              --tensor-parallel-size 8 \
+              --dyn-reasoning-parser kimi_k25 \
+              --dyn-tool-call-parser kimi_k2
+          command:
+          - /bin/sh
+          - -c
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          env:
+          - name: TRTLLM_ENABLE_PDL
+            value: "1"
+          - name: MODEL_NAME
+            value: nvidia/Kimi-K2.5-NVFP4
+          - name: ENGINE_ARGS
+            value: /opt/dynamo/configs/config.yaml
+          - name: HF_HOME
+            value: /opt/models
+          # Adjust CPU cache size as needed
+          - name: DYN_KVBM_CPU_CACHE_GB
+            value: "100"
+          volumeMounts:
+          - mountPath: /opt/dynamo/configs
+            name: llm-config-kimi-agg-kvbm
+            readOnly: true
+          workingDir: /workspace/examples/backends/trtllm
+        volumes:
+        - configMap:
+            name: llm-config-kimi-agg-kvbm
+          name: llm-config-kimi-agg-kvbm
+      replicas: 1
+      resources:
+        limits:
+          gpu: "8"
+        requests:
+          gpu: "8"
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: llm-config
+data:
+  config.yaml: |
+    max_batch_size: 128
+    max_num_tokens: 8448
+    max_seq_len: 8212
+    tensor_parallel_size: 8
+    moe_expert_parallel_size: 8
+    enable_attention_dp: true
+    pipeline_parallel_size: 1
+    print_iter_log: true
+    kv_cache_config:
+      free_gpu_memory_fraction: 0.75
+      dtype: fp8
+    cache_transceiver_config:
+      backend: UCX
+      max_tokens_in_buffer: 8448
+    trust_remote_code: true
+---
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: kimi-k25-agg
+spec:
+  backendFramework: trtllm
+  pvcs:
+    - name: model-cache
+      create: false
+  services:
+    Frontend:
+      componentType: frontend
+      extraPodSpec:
+        affinity:
+          podAntiAffinity:
+            requiredDuringSchedulingIgnoredDuringExecution:
+            - labelSelector:
+                matchExpressions:
+                - key: nvidia.com/dynamo-graph-deployment-name
+                  operator: In
+                  values:
+                  - kimi-k25-agg-frontend
+              topologyKey: kubernetes.io/hostname
+        mainContainer:
+          args:
+          - python3 -m dynamo.frontend --router-mode kv --http-port 8000
+          command:
+          - /bin/sh
+          - -c
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+      replicas: 1
+    TrtllmWorker:
+      componentType: worker
+      envFromSecret: hf-token-secret
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      sharedMemory:
+        size: 80Gi
+      extraPodSpec:
+        affinity:
+          nodeAffinity:
+            requiredDuringSchedulingIgnoredDuringExecution:
+              nodeSelectorTerms:
+              - matchExpressions:
+                - key: nvidia.com/gpu.present
+                  operator: In
+                  values:
+                  - "true"
+        mainContainer:
+          args:
+          - |
+            python3 -m dynamo.trtllm \
+              --model-path "${MODEL_NAME}" \
+              --served-model-name "${MODEL_NAME}" \
+              --extra-engine-args "${ENGINE_ARGS}" \
+              --tensor-parallel-size 8 \
+              --dyn-reasoning-parser kimi_k25 \
+              --dyn-tool-call-parser kimi_k2
+          command:
+          - /bin/sh
+          - -c
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          env:
+          - name: TRTLLM_ENABLE_PDL
+            value: "1"
+          - name: MODEL_NAME
+            value: nvidia/Kimi-K2.5-NVFP4
+          - name: ENGINE_ARGS
+            value: /opt/dynamo/configs/config.yaml
+          - name: HF_HOME
+            value: /opt/models
+          volumeMounts:
+          - mountPath: /opt/dynamo/configs
+            name: llm-config
+            readOnly: true
+          workingDir: /workspace/examples/backends/trtllm
+        volumes:
+        - configMap:
+            name: llm-config
+          name: llm-config
+      replicas: 1
+      resources:
+        limits:
+          gpu: "8"
+        requests:
+          gpu: "8"
\ No newline at end of file
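Note that this manifest references the placeholder tag `tensorrtllm-runtime:my-tag`, whereas `patch-container.sh` emits `<docker-image>-patched`. Assuming the nvidia manifests are meant to run the patched image, the `image:` fields would need to be updated before applying. A small sketch of that rewrite follows; `use_patched_image` is a hypothetical helper, not something this commit provides.

```python
import re


def use_patched_image(manifest: str, base_image: str) -> str:
    """Point every `image: <base_image>` line at `<base_image>-patched`.

    Lines that already carry the `-patched` suffix do not match the pattern,
    so applying the function twice gives the same result as applying it once.
    """
    pattern = re.compile(
        r"^([ \t]*image:[ \t]*)" + re.escape(base_image) + r"[ \t]*$",
        re.MULTILINE,
    )
    return pattern.sub(lambda m: m.group(1) + base_image + "-patched", manifest)
```

For example, calling it with the text of `deploy.yaml` and `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag` rewrites both the Frontend and TrtllmWorker image lines. Whether the Frontend also needs the patched image is not stated in the diff.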
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/README.md
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/README.md
+# Kimi K2.5 TensorRT-LLM Patch
+Kimi K2.5 support has not yet been released in TensorRT-LLM ([tracking branch](https://github.com/NVIDIA/TensorRT-LLM/compare/main...feat/k25-demo)).
+This directory contains an append-only patch that registers `KimiK25ForConditionalGeneration` on top of the existing DeepSeek-V3 model code, letting you run Kimi K2.5 on TensorRT-LLM today.
+## Quick start
+Patch a Dynamo docker image by running:
+```bash
+./patch-container.sh <docker-image>
+```
+For example:
+```bash
+./patch-container.sh nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+# produces image:    nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
+```
+The script is idempotent: if `KimiK25ForConditionalGeneration` is already present in `modeling_deepseekv3.py`, the patch step is skipped, so re-running it on an already-patched image leaves the file unchanged.
+## Files
+| File | Description |
+|------|-------------|
+| `patch-container.sh` | Builds a patched docker image from a base Dynamo image |
+| `kimi.patch` | Appended to `modeling_deepseekv3.py` inside the container -- adds a thin `DeepseekV3ForCausalLM` subclass that extracts the Kimi text backbone config and remaps weight prefixes |
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/kimi.patch
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/kimi.patch
+@register_auto_model("KimiK25ForConditionalGeneration")
+class KimiK25ForConditionalGeneration(DeepseekV3ForCausalLM):
+    """Kimi-K2.5 multimodal model (text-only path).
+    Extracts the DeepSeek-V3 text backbone from the composite config
+    and strips the ``language_model.`` weight prefix so that the
+    standard DeepseekV3ForCausalLM loading path works unchanged.
+    NOTE: Kimi-K2.5's text backbone sets ``num_nextn_predict_layers = 0``,
+    so MTP-based speculative decoding is not applicable to this model.
+    """
+    _LANG_PREFIX = "language_model."
+    def __init__(self, model_config: ModelConfig[PretrainedConfig]):
+        model_config = copy.copy(model_config)
+        if hasattr(model_config.pretrained_config, 'text_config'):
+            model_config._frozen = False
+            model_config.pretrained_config = model_config.pretrained_config.text_config
+            if model_config.quant_config.exclude_modules:
+                model_config.quant_config = copy.copy(model_config.quant_config)
+                p = self._LANG_PREFIX
+                mapped = []
+                for m in model_config.quant_config.exclude_modules:
+                    if m.startswith(p):
+                        rest = m[len(p):]
+                        if rest.startswith('layers.'):
+                            rest = 'model.' + rest
+                        mapped.append(rest)
+                    else:
+                        mapped.append(m)
+                model_config.quant_config.exclude_modules = mapped
+            model_config._frozen = True
+        super().__init__(model_config)
+    def load_weights(self, weights: ConsumableWeightsDict):
+        has_prefix = any(k.startswith("language_model.") for k in weights)
+        if has_prefix:
+            weights = filter_weights("language_model", weights)
+            weights = ConsumableWeightsDict(weights)
+        super().load_weights(weights)
\ No newline at end of file
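The prefix handling in `kimi.patch` can be exercised outside TensorRT-LLM. The sketch below isolates the two remapping steps as plain functions: `remap_exclude_modules` mirrors the loop in `__init__`, and `strip_language_prefix` is a hypothetical stand-in for the `filter_weights` call, operating on an ordinary `dict` rather than a `ConsumableWeightsDict`. The module names in the usage example are made up for illustration.

```python
LANG_PREFIX = "language_model."


def remap_exclude_modules(exclude_modules, prefix=LANG_PREFIX):
    """Rewrite quantization exclude entries to the text backbone's naming.

    Entries under `prefix` lose the prefix; if the remainder starts with
    `layers.`, it is re-rooted under `model.`. Other entries pass through.
    """
    mapped = []
    for name in exclude_modules:
        if name.startswith(prefix):
            rest = name[len(prefix):]
            if rest.startswith("layers."):
                rest = "model." + rest
            mapped.append(rest)
        else:
            mapped.append(name)
    return mapped


def strip_language_prefix(weights, prefix=LANG_PREFIX):
    """Keep only keys under `prefix` and drop it; return a copy if absent."""
    if not any(key.startswith(prefix) for key in weights):
        return dict(weights)
    return {
        key[len(prefix):]: value
        for key, value in weights.items()
        if key.startswith(prefix)
    }
```

As a usage example, `remap_exclude_modules(["language_model.layers.0.mlp", "language_model.lm_head", "vision_tower.proj"])` yields `["model.layers.0.mlp", "lm_head", "vision_tower.proj"]`, and `strip_language_prefix` drops any non-language keys once at least one prefixed key is present.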
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/patch-container.sh
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/patch-container.sh
+#!/usr/bin/env bash
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+set -euo pipefail
+if [[ $# -ne 1 ]]; then
+    echo "Usage: $0 <docker-image>"
+    echo "  Patches modeling_deepseekv3.py with KimiK25ForConditionalGeneration class."
+    echo "  Outputs: <docker-image>-patched"
+    exit 1
+fi
+SRC_IMAGE="$1"
+DST_IMAGE="${SRC_IMAGE}-patched"
+TARGET_FILE="/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/models/modeling_deepseekv3.py"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PATCH_FILE="${SCRIPT_DIR}/kimi.patch"
+if [[ ! -f "$PATCH_FILE" ]]; then
+    echo "ERROR: Patch file not found: $PATCH_FILE"
+    exit 1
+fi
+TMPDIR="$(mktemp -d)"
+trap 'rm -rf "$TMPDIR"' EXIT
+cp "$PATCH_FILE" "$TMPDIR/kimi.patch"
+cat > "$TMPDIR/Dockerfile" <<'DOCKERFILE'
+ARG BASE_IMAGE
+FROM ${BASE_IMAGE}
+ARG TARGET_FILE
+USER root
+COPY kimi.patch /opt/kimi.patch
+RUN if grep -q 'KimiK25ForConditionalGeneration' "${TARGET_FILE}"; then \
+        echo "Patch already applied, skipping."; \
+    else \
+        if ! head -50 "${TARGET_FILE}" | grep -q '^import copy'; then \
+            sed -i '1s/^/import copy\n/' "${TARGET_FILE}"; \
+        fi && \
+        echo "" >> "${TARGET_FILE}" && \
+        cat /opt/kimi.patch >> "${TARGET_FILE}"; \
+    fi && \
+    rm -f /opt/kimi.patch
+USER 1000
+DOCKERFILE
+echo "Building patched image: ${DST_IMAGE}"
+docker build \
+    --build-arg BASE_IMAGE="$SRC_IMAGE" \
+    --build-arg TARGET_FILE="$TARGET_FILE" \
+    -t "$DST_IMAGE" \
+    "$TMPDIR"
+echo "Done. Patched image: ${DST_IMAGE}"
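The `RUN` step in the generated Dockerfile is what makes the script idempotent. Its logic can be sketched in Python on in-memory strings as follows; `append_patch` is a hypothetical name, and the function only approximates the `grep`/`sed`/`cat` pipeline rather than replacing it.

```python
MARKER = "KimiK25ForConditionalGeneration"


def append_patch(source: str, patch: str, marker: str = MARKER) -> str:
    """Append `patch` to `source` unless `marker` is already present.

    Mirrors the Dockerfile step: skip when the marker is found, prepend
    `import copy` if no line among the first 50 starts with it, then add a
    blank line followed by the patch text.
    """
    if marker in source:
        return source  # already patched: leave the file untouched
    head = source.splitlines()[:50]
    if not any(line.startswith("import copy") for line in head):
        source = "import copy\n" + source
    return source + "\n" + patch
```

Applying the function twice returns the same text as applying it once, which is the property the patch README relies on. One caveat worth checking against the real file: prepending `import copy` at line 1 would break a module that begins with a `from __future__` import, since those must come first.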