Commit 15f978c1 authored by Anant Sharma, committed by GitHub

fix: use proper unified diff and Dockerfile for kimi patch (#7435)


Signed-off-by: Anant Sharma <anants@nvidia.com>
parent 9df692c1
@@ -96,10 +96,10 @@ The nvidia variant supports text inference with reasoning parsing (`--dyn-reason
 The nvidia deploy manifests (`deploy.yaml`, `deploy-kvbm.yaml`) ship with a placeholder image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`.
 Before deploying, you must:
-1. Run the [patch script](trtllm/agg/nvidia/patch/) to build a patched image (appends `-patched` to the tag).
+1. Build a patched image via `docker build` with the `trtllm/agg/nvidia/patch/` context and `BASE_IMAGE` build-arg (see command below).
 2. Update the `image:` fields in the deploy YAML to reference the patched image.
-See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for full details on what the patch does.
+See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for details on what the patch does.
 ```bash
 # Set namespace
@@ -115,11 +115,10 @@ kubectl create secret generic hf-token-secret \
 kubectl apply -f model-cache/nvidia/ -n ${NAMESPACE}
 kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
-# Patch the container image (required — upstream support not yet available)
-# This produces: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
-cd trtllm/agg/nvidia/patch
-./patch-container.sh nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
-cd -
+# Patch the container image (required for nvidia weights)
+docker build --build-arg BASE_IMAGE=nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag \
+  -t nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched \
+  trtllm/agg/nvidia/patch/
 # Update the image in the deploy manifest to use the patched tag
......
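The README leaves the second step, pointing the `image:` fields at the patched tag, as a manual edit. A minimal Python sketch of one way to script it, assuming the manifest references the placeholder image literally; `retag_images` is a hypothetical helper and not part of the recipe:

```python
import re


def retag_images(manifest_text: str, base_image: str, suffix: str = "-patched") -> str:
    """Append `suffix` to every `image:` value that equals `base_image`.

    Hypothetical helper for illustration. It edits the raw text rather than
    parsing YAML, so comments and formatting in the manifest are preserved.
    """
    pattern = re.compile(
        r"^(?P<head>[ \t]*(?:-[ \t]*)?image:[ \t]*)"
        + re.escape(base_image)
        + r"(?P<tail>[ \t]*)$",
        re.MULTILINE,
    )
    # An already-suffixed value does not match, so re-running is a no-op.
    return pattern.sub(
        lambda m: m.group("head") + base_image + suffix + m.group("tail"),
        manifest_text,
    )
```

Calling `retag_images(text, "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag")` on the contents of `deploy.yaml` would rewrite only lines whose `image:` value is exactly the placeholder.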
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Patches TensorRT-LLM with KimiK25ForConditionalGeneration support.
# Upstream tracking PR: https://github.com/NVIDIA/TensorRT-LLM/pull/11816
#
# Usage:
# docker build --build-arg BASE_IMAGE=<image> -t <image>-patched .
ARG BASE_IMAGE
FROM ${BASE_IMAGE}
USER root
COPY kimi.patch /tmp/kimi.patch
# Apply upstream diff — idempotent, fails if target file has diverged
RUN SITE_PKGS=$(python3 -c "import sysconfig; print(sysconfig.get_path('purelib'))") && \
    TARGET="$SITE_PKGS/tensorrt_llm/_torch/models/modeling_deepseekv3.py" && \
    cd "$SITE_PKGS" && \
    if patch -p1 --forward --fuzz=0 --dry-run < /tmp/kimi.patch > /dev/null 2>&1; then \
        patch -p1 --forward --fuzz=0 < /tmp/kimi.patch; \
    elif patch -p1 --reverse --fuzz=0 --dry-run < /tmp/kimi.patch > /dev/null 2>&1; then \
        echo "Patch already applied, skipping."; \
    else \
        echo "ERROR: Patch failed — the target file may have changed upstream." >&2; \
        echo "Try updating kimi.patch from https://github.com/NVIDIA/TensorRT-LLM/pull/11816" >&2; \
        exit 1; \
    fi && \
    rm -f /tmp/kimi.patch
# Smoke test
RUN SITE_PKGS=$(python3 -c "import sysconfig; print(sysconfig.get_path('purelib'))") && \
    grep -q '@register_auto_model("KimiK25ForConditionalGeneration")' \
        "$SITE_PKGS/tensorrt_llm/_torch/models/modeling_deepseekv3.py" || \
    { echo "ERROR: KimiK25ForConditionalGeneration not registered after patching" >&2; exit 1; }
USER dynamo
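The `RUN` step in the Dockerfile above makes a three-way decision from two dry runs: apply, skip, or fail. A minimal Python sketch of the same logic follows. The names `decide` and `dry_run` are hypothetical, and `dry_run` assumes GNU `patch` is on `PATH` and that `patch_file` is an absolute path:

```python
import subprocess
from pathlib import Path


def decide(forward_ok: bool, reverse_ok: bool) -> str:
    """Map the two dry-run results to an action, mirroring the if/elif/else."""
    if forward_ok:
        return "apply"  # patch applies cleanly: the file is still unpatched
    if reverse_ok:
        return "skip"   # patch reverses cleanly: it was already applied
    return "fail"       # neither direction works: the target has diverged


def dry_run(patch_file: Path, workdir: Path, reverse: bool = False) -> bool:
    """Return True if `patch --dry-run` succeeds in the given direction."""
    direction = "--reverse" if reverse else "--forward"
    cmd = ["patch", "-p1", direction, "--fuzz=0", "--dry-run", "-i", str(patch_file)]
    return subprocess.run(cmd, cwd=workdir, capture_output=True).returncode == 0
```

Checking the forward direction first matters: only when the forward dry run fails is a successful reverse dry run taken to mean "already applied". With `--fuzz=0`, any drift in the context lines counts as a failure rather than being matched approximately.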
 # Kimi K2.5 TensorRT-LLM Patch
-Kimi K2.5 support has not yet been released in TensorRT-LLM ([tracking branch](https://github.com/NVIDIA/TensorRT-LLM/compare/main...feat/k25-demo)).
+Kimi K2.5 support has not yet been released in TensorRT-LLM ([tracking PR](https://github.com/NVIDIA/TensorRT-LLM/pull/11816)).
-This directory contains an append-only patch that registers `KimiK25ForConditionalGeneration` on top of the existing DeepSeek-V3 model code, letting you run Kimi K2.5 on TensorRT-LLM today.
+This directory contains a unified diff that registers `KimiK25ForConditionalGeneration` on top of the existing DeepSeek-V3 model code, letting you run Kimi K2.5 on TensorRT-LLM today.
 ## Quick start
-Patch a Dynamo docker image by running:
+Build a patched image:
 ```bash
-./patch-container.sh <docker-image>
+docker build --build-arg BASE_IMAGE=nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0 \
+  -t nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0-patched \
+  recipes/kimi-k2.5/trtllm/agg/nvidia/patch/
 ```
-For example:
-```bash
-./patch-container.sh nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
-# produces image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
-```
-If `KimiK25ForConditionalGeneration` is already registered, the patch is skipped. The script is idempotent -- re-running it on an already-patched image is a no-op.
+The patch is applied via `patch -p1 --fuzz=0`:
+- If the target file has changed upstream, the build **fails loudly** instead of silently producing broken code.
+- If the patch is already applied, it is skipped (idempotent).
+- A smoke test verifies the class is registered before the build completes.
 ## Files
 | File | Description |
 |------|-------------|
-| `patch-container.sh` | Builds a patched docker image from a base Dynamo image |
+| `Dockerfile` | Single-stage build that applies the patch to a base Dynamo image |
-| `kimi.patch` | Appended to `modeling_deepseekv3.py` inside the container -- adds a thin `DeepseekV3ForCausalLM` subclass that extracts the Kimi text backbone config and remaps weight prefixes |
+| `kimi.patch` | Unified diff from [upstream PR #11816](https://github.com/NVIDIA/TensorRT-LLM/pull/11816) — adds `KimiK25ForConditionalGeneration` to `modeling_deepseekv3.py` |
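The smoke test mentioned in the README can be reproduced outside the image build. The Python sketch below mirrors the `grep` check in the Dockerfile: it resolves the site-packages directory with `sysconfig` and looks for the registration decorator in the target file. The function names are hypothetical:

```python
import sysconfig
from pathlib import Path

# The exact string the Dockerfile's smoke test greps for.
MARKER = '@register_auto_model("KimiK25ForConditionalGeneration")'


def default_target() -> Path:
    """Path of the patched module inside the active environment's site-packages."""
    site_pkgs = Path(sysconfig.get_path("purelib"))
    return site_pkgs / "tensorrt_llm" / "_torch" / "models" / "modeling_deepseekv3.py"


def is_registered(target: Path, marker: str = MARKER) -> bool:
    """Return True if the target file exists and contains the decorator line."""
    return target.is_file() and marker in target.read_text(encoding="utf-8")
```

Running `is_registered(default_target())` inside a patched container should return `True`. Like the `grep`, this is a textual check only and does not import `tensorrt_llm`.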
-@register_auto_model("KimiK25ForConditionalGeneration")
-class KimiK25ForConditionalGeneration(DeepseekV3ForCausalLM):
-    """Kimi-K2.5 multimodal model (text-only path).
-
-    Extracts the DeepSeek-V3 text backbone from the composite config
-    and strips the ``language_model.`` weight prefix so that the
-    standard DeepseekV3ForCausalLM loading path works unchanged.
-
-    NOTE: Kimi-K2.5's text backbone sets ``num_nextn_predict_layers = 0``,
-    so MTP-based speculative decoding is not applicable to this model.
-    """
-
-    _LANG_PREFIX = "language_model."
-
-    def __init__(self, model_config: ModelConfig[PretrainedConfig]):
-        model_config = copy.copy(model_config)
-        if hasattr(model_config.pretrained_config, 'text_config'):
-            model_config._frozen = False
-            model_config.pretrained_config = model_config.pretrained_config.text_config
-            if model_config.quant_config.exclude_modules:
-                model_config.quant_config = copy.copy(model_config.quant_config)
-                p = self._LANG_PREFIX
-                mapped = []
-                for m in model_config.quant_config.exclude_modules:
-                    if m.startswith(p):
-                        rest = m[len(p):]
-                        if rest.startswith('layers.'):
-                            rest = 'model.' + rest
-                        mapped.append(rest)
-                    else:
-                        mapped.append(m)
-                model_config.quant_config.exclude_modules = mapped
-            model_config._frozen = True
-        super().__init__(model_config)
-
-    def load_weights(self, weights: ConsumableWeightsDict):
-        has_prefix = any(k.startswith("language_model.") for k in weights)
-        if has_prefix:
-            weights = filter_weights("language_model", weights)
-            weights = ConsumableWeightsDict(weights)
-        super().load_weights(weights)
\ No newline at end of file
+diff --git a/tensorrt_llm/_torch/models/modeling_deepseekv3.py b/tensorrt_llm/_torch/models/modeling_deepseekv3.py
+--- a/tensorrt_llm/_torch/models/modeling_deepseekv3.py
++++ b/tensorrt_llm/_torch/models/modeling_deepseekv3.py
+@@ -1866,3 +1866,46 @@ def post_load_weights(self):
+             else:
+                 layer.next_layer_layernorm = self.model.layers[
+                     idx + 1].input_layernorm
++
++
++@register_auto_model("KimiK25ForConditionalGeneration")
++class KimiK25ForConditionalGeneration(DeepseekV3ForCausalLM):
++    """Kimi-K2.5 multimodal model (text-only path).
++
++    Extracts the DeepSeek-V3 text backbone from the composite config
++    and strips the ``language_model.`` weight prefix so that the
++    standard DeepseekV3ForCausalLM loading path works unchanged.
++
++    NOTE: Kimi-K2.5's text backbone sets ``num_nextn_predict_layers = 0``,
++    so MTP-based speculative decoding is not applicable to this model.
++    """
++
++    _LANG_PREFIX = "language_model."
++
++    def __init__(self, model_config: ModelConfig[PretrainedConfig]):
++        model_config = copy.copy(model_config)
++        if hasattr(model_config.pretrained_config, 'text_config'):
++            model_config._frozen = False
++            model_config.pretrained_config = model_config.pretrained_config.text_config
++            if model_config.quant_config.exclude_modules:
++                model_config.quant_config = copy.copy(model_config.quant_config)
++                p = self._LANG_PREFIX
++                mapped = []
++                for m in model_config.quant_config.exclude_modules:
++                    if m.startswith(p):
++                        rest = m[len(p):]
++                        if rest.startswith('layers.'):
++                            rest = 'model.' + rest
++                        mapped.append(rest)
++                    else:
++                        mapped.append(m)
++                model_config.quant_config.exclude_modules = mapped
++            model_config._frozen = True
++        super().__init__(model_config)
++
++    def load_weights(self, weights: ConsumableWeightsDict):
++        has_prefix = any(k.startswith("language_model.") for k in weights)
++        if has_prefix:
++            weights = filter_weights("language_model", weights)
++            weights = ConsumableWeightsDict(weights)
++        super().load_weights(weights)
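The prefix handling in `KimiK25ForConditionalGeneration` can be tried in isolation. The sketch below copies the `exclude_modules` remapping loop from the patch into a standalone function. It also adds `strip_prefix`, a hypothetical stand-in for the `filter_weights` call that assumes that call keeps only keys under the prefix and removes the prefix. Neither function depends on TensorRT-LLM:

```python
LANG_PREFIX = "language_model."


def remap_exclude_modules(modules, prefix=LANG_PREFIX):
    """Rewrite quantization exclude-module names the way the patch's __init__ does."""
    mapped = []
    for name in modules:
        if name.startswith(prefix):
            rest = name[len(prefix):]
            # Decoder layers live under `model.` in the DeepSeek-V3 module tree.
            if rest.startswith("layers."):
                rest = "model." + rest
            mapped.append(rest)
        else:
            # Names outside the language model are passed through unchanged.
            mapped.append(name)
    return mapped


def strip_prefix(weights, prefix=LANG_PREFIX):
    """Keep only keys under `prefix`, with the prefix removed (assumed behavior)."""
    return {k[len(prefix):]: v for k, v in weights.items() if k.startswith(prefix)}
```

For example, `remap_exclude_modules(["language_model.layers.0.mlp", "language_model.lm_head"])` yields `["model.layers.0.mlp", "lm_head"]`, which are the names the text backbone would see after the wrapper prefix is removed.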
#!/usr/bin/env bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -euo pipefail
if [[ $# -ne 1 ]]; then
    echo "Usage: $0 <docker-image>"
    echo " Patches modeling_deepseekv3.py with KimiK25ForConditionalGeneration class."
    echo " Outputs: <docker-image>-patched"
    exit 1
fi
SRC_IMAGE="$1"
DST_IMAGE="${SRC_IMAGE}-patched"
TARGET_FILE="/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/models/modeling_deepseekv3.py"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PATCH_FILE="${SCRIPT_DIR}/kimi.patch"
if [[ ! -f "$PATCH_FILE" ]]; then
    echo "ERROR: Patch file not found: $PATCH_FILE"
    exit 1
fi
TMPDIR="$(mktemp -d)"
trap 'rm -rf "$TMPDIR"' EXIT
cp "$PATCH_FILE" "$TMPDIR/kimi.patch"
cat > "$TMPDIR/Dockerfile" <<'DOCKERFILE'
ARG BASE_IMAGE
FROM ${BASE_IMAGE}
ARG TARGET_FILE
USER root
COPY kimi.patch /opt/kimi.patch
RUN if grep -q 'KimiK25ForConditionalGeneration' "${TARGET_FILE}"; then \
        echo "Patch already applied, skipping."; \
    else \
        if ! head -50 "${TARGET_FILE}" | grep -q '^import copy'; then \
            sed -i '1s/^/import copy\n/' "${TARGET_FILE}"; \
        fi && \
        echo "" >> "${TARGET_FILE}" && \
        cat /opt/kimi.patch >> "${TARGET_FILE}"; \
    fi && \
    rm -f /opt/kimi.patch
USER 1000
DOCKERFILE
echo "Building patched image: ${DST_IMAGE}"
docker build \
    --build-arg BASE_IMAGE="$SRC_IMAGE" \
    --build-arg TARGET_FILE="$TARGET_FILE" \
    -t "$DST_IMAGE" \
    "$TMPDIR"
echo "Done. Patched image: ${DST_IMAGE}"