Unverified commit be70dae0, authored by Karen Chung, committed by GitHub

fix(recipes): remove container patch requirements for kimi k2.5 recipes (#8199)

parent d185c881
......@@ -66,7 +66,7 @@ These recipes are under active development and may require additional setup step
| Model | Framework | Mode | GPUs | Deployment | Notes |
|-------|-----------|------|------|------------|-------|
| **[GLM-5-NVFP4](glm-5-nvfp4/sglang/disagg/)** | SGLang | Disagg Prefill/Decode | 20x GB200 | ✅ | NVFP4, EAGLE speculative decoding, TP16 decode + TP4 prefill. Requires [custom container build](glm-5-nvfp4/). |
| **[nvidia/Kimi-K2.5-NVFP4](kimi-k2.5/trtllm/agg/nvidia/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling. Requires [container patch](kimi-k2.5/trtllm/agg/nvidia/patch/). Vision input not yet functional with the patch. |
| **[nvidia/Kimi-K2.5-NVFP4](kimi-k2.5/trtllm/agg/nvidia/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling. Vision input not yet functional. |
## Recipe Structure
......
......@@ -9,7 +9,7 @@ There are two model weight variants, each with its own model download and deploy
| Variant | Model | Status | Modality | Deploy Configs | Notes |
|---------|-------|--------|----------|---------------|-------|
| **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | Functional | Text only | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image, not yet performance-optimized |
| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml), and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/) for `deploy.yaml` and `deploy-kvbm.yaml`, while `deploy-specdec.yaml` works with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional |
| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml), and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | All configs are compatible with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional |
All configurations use TP8, EP8, aggregated mode with KV-aware routing.
......@@ -81,22 +81,15 @@ curl http://localhost:8000/v1/chat/completions \
**Status:** Functional | **Modality:** Text only upstream support
> **Experimental for standard and KVBM deployments**: Upstream TensorRT-LLM does not yet include native support for Kimi K2.5. This recipe works around that limitation by directly patching the container image with an append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path. See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch) for the patch script and full instructions.
> **Functional**: [Speculative Decoding recipe](trtllm/agg/nvidia/deploy-specdec.yaml) doesn't need the patch and is optimized for performance.
> **Text only:** Current upstream TensorRT-LLM supports Kimi-K2.5 models by loading the DeepSeek-V3
> text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
> processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`. The standard and KVBM deployments still require the Kimi patched TRT-LLM image, while the speculative decoding deployment in `deploy-specdec.yaml` works with a current top-of-tree Dynamo TRT-LLM image.
The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`, as well as a deployment `deploy-specdec.yaml` that uses speculative decoding.
### Quick Start
The nvidia deploy manifests use two image flows:
- `deploy.yaml` and `deploy-kvbm.yaml` use the placeholder patched image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`
- `deploy-specdec.yaml` uses the placeholder top-of-tree image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`
The nvidia deploy manifests use the placeholder top-of-tree image: `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`
Before deploying, update the `image:` fields in the manifest you plan to use.
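The `image:` update described above can be scripted. The following is a minimal sketch, not part of the recipe: the helper name `set_image` and the target image `registry.example.com/trtllm-runtime:1.2.3` are hypothetical, and it assumes the manifest uses the exact placeholder string shown above on its own `image:` line.

```python
import re

# Placeholder image string used by the nvidia deploy manifests.
PLACEHOLDER = "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag"


def set_image(manifest_text: str, image: str) -> str:
    """Replace every `image:` line that holds the placeholder with `image`.

    Hypothetical helper for illustration only. Matching is anchored to the
    end of the line, so longer tags that merely start with the placeholder
    are left untouched.
    """
    pattern = re.compile(
        r"^([ \t]*image:[ \t]*)" + re.escape(PLACEHOLDER) + r"[ \t]*$",
        re.MULTILINE,
    )
    return pattern.sub(lambda m: m.group(1) + image, manifest_text)


if __name__ == "__main__":
    sample = (
        "      image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag\n"
        "      replicas: 1\n"
    )
    # The registry and tag below are made up for the example.
    print(set_image(sample, "registry.example.com/trtllm-runtime:1.2.3"))
```

Operating on the manifest as text rather than parsing the YAML keeps comments and formatting intact, at the cost of only matching the literal placeholder.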
......@@ -114,12 +107,6 @@ kubectl create secret generic hf-token-secret \
kubectl apply -f model-cache/nvidia/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
# Patch the container image (required for nvidia weights)
# Skip this step for Speculative Decoding recipe `deploy-specdec.yaml`
docker build --build-arg BASE_IMAGE=nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag \
-t nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched \
trtllm/agg/nvidia/patch/
# Update the image in the deploy manifest to use the container tag (or the patched tag)
# Deploy
......@@ -252,4 +239,3 @@ If `tool_calls` is missing with raw `<|tool_calls_section_begin|>` tokens in `co
## Notes
- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
- The two basic recipes in the nvidia variant require a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream in TensorRT-LLM
# Kimi-K2.5 nvidia/Kimi-K2.5-NVFP4 — Aggregated Deployments on Kubernetes
> Upstream TensorRT-LLM does not yet include native support for Kimi K2.5. This recipe works around that limitation by directly patching the container image with an append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path. See [`patch/`](patch/) for the patch script and full instructions.
> **Note**: The two standard deployments (`deploy.yaml` and `deploy-kvbm.yaml`) for the nvidia/Kimi-K2.5-NVFP4 model require a patched TensorRT-LLM container image because upstream TRT-LLM support for Kimi K2.5 has not yet been released. You must build the patched image before deploying either configuration below. See patch/ for the script and instructions. **The `deploy-specdec.yaml` speculative decoding recipe doesn't need the image patch**.
> **Text only:** Current upstream TensorRT-LLM supports Kimi-K2.5 models by loading the DeepSeek-V3
> text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
> processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
......@@ -22,7 +18,6 @@ This directory contains three aggregated deployment configurations for the `nvid
- 1x8 B200 GPUs or 8x4 GB200 GPUs
- A `hf-token-secret` Secret containing your Hugging Face token
- A pre-existing `model-cache` PVC
- `deploy.yaml` and `deploy-kvbm.yaml` require a patched image tag such as `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`. You must build a patched image and update the `image:` fields before deploying. See [patch instructions](patch/) for details.
- `deploy-specdec.yaml` uses `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag` and works with a current top-of-tree Dynamo TRT-LLM image
---
......@@ -32,7 +27,6 @@ This directory contains three aggregated deployment configurations for the `nvid
Uses [`deploy.yaml`](deploy.yaml). This is the simpler configuration -- aggregated serving with KV-aware routing, no CPU-offloaded KV cache.
```bash
# Update the image in deploy.yaml to your patched image, then:
kubectl apply -f deploy.yaml -n ${NAMESPACE}
```
......@@ -47,7 +41,6 @@ This creates:
Uses [`deploy-kvbm.yaml`](deploy-kvbm.yaml). This configuration adds CPU-offloaded KV cache via the KV Block Manager (KVBM), which allows larger effective context by spilling KV cache to host memory.
```bash
# Update the image in deploy-kvbm.yaml to your patched image, then:
kubectl apply -f deploy-kvbm.yaml -n ${NAMESPACE}
```
......@@ -83,7 +76,7 @@ This scrapes `/metrics` on port `6880` (named `kvbm`) every 5 seconds from worke
## Aggregated Deployment with EAGLE Speculative Decoding and KV-aware routing
Uses [`deploy-specdec.yaml`](deploy-specdec.yaml). This performant configuration runs KV-aware aggregated serving with EAGLE speculative decoding on GB200 and does not require the patched image used by the standard and KVBM manifests.
Uses [`deploy-specdec.yaml`](deploy-specdec.yaml). This performant configuration runs KV-aware aggregated serving with EAGLE speculative decoding on GB200.
### Speculative Decoding Prerequisites
......
......@@ -55,7 +55,7 @@ spec:
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
replicas: 1
TrtllmWorker:
componentType: worker
......@@ -95,10 +95,7 @@ spec:
command:
- /bin/sh
- -c
# REQUIRED: replace with your patched image tag (run patch/patch-container.sh first).
# Upstream TRT-LLM does not support KimiK25ForConditionalGeneration without the patch.
# Example: ./patch/patch-container.sh <your-image> -> produces <your-image>-patched
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
env:
- name: TRTLLM_ENABLE_PDL
value: "1"
......
......@@ -51,7 +51,7 @@ spec:
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
replicas: 1
TrtllmWorker:
componentType: worker
......@@ -84,10 +84,7 @@ spec:
command:
- /bin/sh
- -c
# REQUIRED: replace with your patched image tag (run patch/patch-container.sh first).
# Upstream TRT-LLM does not support KimiK25ForConditionalGeneration without the patch.
# Example: ./patch/patch-container.sh <your-image> -> produces <your-image>-patched
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
env:
- name: TRTLLM_ENABLE_PDL
value: "1"
......@@ -111,4 +108,4 @@ spec:
limits:
gpu: "8"
requests:
gpu: "8"
\ No newline at end of file
gpu: "8"
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Patches TensorRT-LLM with KimiK25ForConditionalGeneration support.
# Upstream tracking PR: https://github.com/NVIDIA/TensorRT-LLM/pull/11816
#
# Usage:
# docker build --build-arg BASE_IMAGE=<image> -t <image>-patched .
ARG BASE_IMAGE
FROM ${BASE_IMAGE}
USER root
COPY kimi.patch /tmp/kimi.patch
# Apply upstream diff — idempotent, fails if target file has diverged
RUN SITE_PKGS=$(python3 -c "import sysconfig; print(sysconfig.get_path('purelib'))") && \
TARGET="$SITE_PKGS/tensorrt_llm/_torch/models/modeling_deepseekv3.py" && \
cd "$SITE_PKGS" && \
if patch -p1 --forward --fuzz=0 --dry-run < /tmp/kimi.patch > /dev/null 2>&1; then \
patch -p1 --forward --fuzz=0 < /tmp/kimi.patch; \
elif patch -p1 --reverse --fuzz=0 --dry-run < /tmp/kimi.patch > /dev/null 2>&1; then \
echo "Patch already applied, skipping."; \
else \
echo "ERROR: Patch failed — the target file may have changed upstream." >&2; \
echo "Try updating kimi.patch from https://github.com/NVIDIA/TensorRT-LLM/pull/11816" >&2; \
exit 1; \
fi && \
rm -f /tmp/kimi.patch
# Smoke test
RUN SITE_PKGS=$(python3 -c "import sysconfig; print(sysconfig.get_path('purelib'))") && \
grep -q '@register_auto_model("KimiK25ForConditionalGeneration")' \
"$SITE_PKGS/tensorrt_llm/_torch/models/modeling_deepseekv3.py" || \
{ echo "ERROR: KimiK25ForConditionalGeneration not registered after patching" >&2; exit 1; }
USER dynamo
# Kimi K2.5 TensorRT-LLM Patch
Kimi K2.5 support has not yet been released in TensorRT-LLM ([tracking PR](https://github.com/NVIDIA/TensorRT-LLM/pull/11816)).
This directory contains a unified diff that registers `KimiK25ForConditionalGeneration` on top of the existing DeepSeek-V3 model code, letting you run Kimi K2.5 on TensorRT-LLM today.
## Quick start
Build a patched image:
```bash
docker build --build-arg BASE_IMAGE=nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0 \
-t nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0-patched \
recipes/kimi-k2.5/trtllm/agg/nvidia/patch/
```
The patch is applied via `patch -p1 --fuzz=0`:
- If the target file has changed upstream, the build **fails loudly** instead of silently producing broken code.
- If the patch is already applied, it is skipped (idempotent).
- A smoke test verifies the class is registered before the build completes.
## Files
| File | Description |
|------|-------------|
| `Dockerfile` | Single-stage build that applies the patch to a base Dynamo image |
| `kimi.patch` | Unified diff from [upstream PR #11816](https://github.com/NVIDIA/TensorRT-LLM/pull/11816) — adds `KimiK25ForConditionalGeneration` to `modeling_deepseekv3.py` |
diff --git a/tensorrt_llm/_torch/models/modeling_deepseekv3.py b/tensorrt_llm/_torch/models/modeling_deepseekv3.py
--- a/tensorrt_llm/_torch/models/modeling_deepseekv3.py
+++ b/tensorrt_llm/_torch/models/modeling_deepseekv3.py
@@ -1866,3 +1866,46 @@ def post_load_weights(self):
else:
layer.next_layer_layernorm = self.model.layers[
idx + 1].input_layernorm
+
+
+@register_auto_model("KimiK25ForConditionalGeneration")
+class KimiK25ForConditionalGeneration(DeepseekV3ForCausalLM):
+ """Kimi-K2.5 multimodal model (text-only path).
+
+ Extracts the DeepSeek-V3 text backbone from the composite config
+ and strips the ``language_model.`` weight prefix so that the
+ standard DeepseekV3ForCausalLM loading path works unchanged.
+
+ NOTE: Kimi-K2.5's text backbone sets ``num_nextn_predict_layers = 0``,
+ so MTP-based speculative decoding is not applicable to this model.
+ """
+
+ _LANG_PREFIX = "language_model."
+
+ def __init__(self, model_config: ModelConfig[PretrainedConfig]):
+ model_config = copy.copy(model_config)
+ if hasattr(model_config.pretrained_config, 'text_config'):
+ model_config._frozen = False
+ model_config.pretrained_config = model_config.pretrained_config.text_config
+ if model_config.quant_config.exclude_modules:
+ model_config.quant_config = copy.copy(model_config.quant_config)
+ p = self._LANG_PREFIX
+ mapped = []
+ for m in model_config.quant_config.exclude_modules:
+ if m.startswith(p):
+ rest = m[len(p):]
+ if rest.startswith('layers.'):
+ rest = 'model.' + rest
+ mapped.append(rest)
+ else:
+ mapped.append(m)
+ model_config.quant_config.exclude_modules = mapped
+ model_config._frozen = True
+ super().__init__(model_config)
+
+ def load_weights(self, weights: ConsumableWeightsDict):
+ has_prefix = any(k.startswith("language_model.") for k in weights)
+ if has_prefix:
+ weights = filter_weights("language_model", weights)
+ weights = ConsumableWeightsDict(weights)
+ super().load_weights(weights)
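
The `exclude_modules` remapping in the patched `__init__` can be read in isolation. The sketch below re-expresses that loop as a standalone function so its behaviour is easy to check. The function name `remap_exclude_modules` and the sample module names are hypothetical, and nothing here imports TensorRT-LLM.

```python
LANG_PREFIX = "language_model."


def remap_exclude_modules(modules):
    """Strip the `language_model.` prefix from quantization exclude entries.

    Mirrors the loop in the patch: entries under the prefix lose it, and
    entries that then start with `layers.` gain a `model.` prefix so they
    line up with the text backbone's module names. Other entries pass
    through unchanged.
    """
    mapped = []
    for name in modules:
        if name.startswith(LANG_PREFIX):
            rest = name[len(LANG_PREFIX):]
            if rest.startswith("layers."):
                rest = "model." + rest
            mapped.append(rest)
        else:
            mapped.append(name)
    return mapped


if __name__ == "__main__":
    # Sample names are illustrative, not taken from a real checkpoint.
    print(remap_exclude_modules([
        "language_model.layers.0.mlp",
        "language_model.lm_head",
        "vision_tower.proj",
    ]))
```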