docs: add performance-optimized recipe detail for Kimi-K2.5 with Speculative Decoding recipe (#7555)

Signed-off-by: Yunzhou Liu <232973175+yunzhoul-nv@users.noreply.github.com>

Unverified Commit 5631db75 authored Mar 23, 2026 by yunzhoul-nv Committed by GitHub Mar 23, 2026
4 changed files
--- a/recipes/kimi-k2.5/README.md
+++ b/recipes/kimi-k2.5/README.md
@@ -9,14 +9,14 @@ There are two model weight variants, each with its own model download and deploy
 | Variant | Model | Status | Modality | Deploy Configs | Notes |
 |---------|-------|--------|----------|---------------|-------|
 | **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | Functional | Text only | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image, not yet performance-optimized |
-| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/). Vision input is not yet functional — the patch loads the text backbone only. |
+| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml), and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/) for `deploy.yaml` and `deploy-kvbm.yaml`, while `deploy-specdec.yaml` works with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional. |

 All configurations use TP8, EP8, aggregated mode with KV-aware routing.

 ## Prerequisites

 1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
-2. **GPU cluster** with B200 GPUs (8x per worker)
+2. **GPU cluster** with B200 GPUs (8x per worker) or GB200 GPUs (4 workers, 2x4 per worker)
 3. **HuggingFace token** with access to the model

 ## Hardware Requirements
@@ -24,6 +24,7 @@ All configurations use TP8, EP8, aggregated mode with KV-aware routing.
 | Configuration | GPUs |
 |--------------|------|
 | Aggregated | 8x B200 |
+| Aggregated Speculative Decoding | 8x4 GB200 (4 workers, each worker spanning 2 nodes) |

 ---

@@ -78,28 +79,26 @@ curl http://localhost:8000/v1/chat/completions \

 ## nvidia/Kimi-K2.5-NVFP4

-**Status:** Experimental | **Modality:** Text only upstream support
+**Status:** Functional | **Modality:** Text only upstream support

-> **Experimental:** Upstream TensorRT-LLM does not yet include native support for Kimi K2.5.
-> This recipe works around that limitation by directly patching the container image with an
-> append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path.
-> See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for the patch script and full instructions.
+> **Experimental for standard and KVBM deployments**: Upstream TensorRT-LLM does not yet include native support for Kimi K2.5. This recipe works around that limitation by directly patching the container image with an append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path. See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch) for the patch script and full instructions.

-> **Text only:** The patch loads the DeepSeek-V3 text backbone from the Kimi K2.5 config
-> (`text_config`). The vision encoder is not loaded, so image inputs are not processed.
-> Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
+> **Functional**: [Speculative Decoding recipe](trtllm/agg/nvidia/deploy-specdec.yaml) doesn't need the patch and is optimized for performance.

-The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`.
+> **Text only:** Current upstream TensorRT-LLM supports Kimi-K2.5 models by loading the DeepSeek-V3
+> text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
+> processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
+
+The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`. The standard and KVBM deployments still require the Kimi patched TRT-LLM image, while the speculative decoding deployment in `deploy-specdec.yaml` works with a current top-of-tree Dynamo TRT-LLM image.

 ### Quick Start

-The nvidia deploy manifests (`deploy.yaml`, `deploy-kvbm.yaml`) ship with a placeholder image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`.
-Before deploying, you must:
+The nvidia deploy manifests use two image flows:

-1. Build a patched image via `docker build` with the `trtllm/agg/nvidia/patch/` context and `BASE_IMAGE` build-arg (see command below).
-2. Update the `image:` fields in the deploy YAML to reference the patched image.
+- `deploy.yaml` and `deploy-kvbm.yaml` use the placeholder patched image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`
+- `deploy-specdec.yaml` uses the placeholder top-of-tree image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`

-See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for details on what the patch does.
+Before deploying, update the `image:` fields in the manifest you plan to use.

 ```bash
 # Set namespace
@@ -116,11 +115,12 @@ kubectl apply -f model-cache/nvidia/ -n ${NAMESPACE}
 kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

 # Patch the container image (required for nvidia weights)
+# Skip this step for the Speculative Decoding recipe (`deploy-specdec.yaml`)
 docker build --build-arg BASE_IMAGE=nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag \
  -t nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched \
  trtllm/agg/nvidia/patch/

-# Update the image in the deploy manifest to use the patched tag
+# Update the `image:` fields in the deploy manifest: the patched tag for deploy.yaml and deploy-kvbm.yaml, the top-of-tree tag for deploy-specdec.yaml

 # Deploy
 kubectl apply -f trtllm/agg/nvidia/deploy.yaml -n ${NAMESPACE}
@@ -252,4 +252,4 @@ If `tool_calls` is missing with raw `<|tool_calls_section_begin|>` tokens in `co
 ## Notes

 - Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
-- The nvidia variant requires a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream in TensorRT-LLM
+- The two basic recipes in the nvidia variant (`deploy.yaml` and `deploy-kvbm.yaml`) require a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream in TensorRT-LLM
--- a/recipes/kimi-k2.5/model-cache/nvidia/eagle-download.yaml
+++ b/recipes/kimi-k2.5/model-cache/nvidia/eagle-download.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: eagle-download
+spec:
+  backoffLimit: 3
+  completions: 1
+  parallelism: 1
+  template:
+    metadata:
+      labels:
+        app: eagle-download
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: eagle-download
+          image: python:3.10-slim
+          command: ["sh", "-c"]
+          envFrom:
+            - secretRef:
+                name: hf-token-secret
+          env:
+            - name: MODEL_NAME
+              value: nvidia/Kimi-K2.5-Thinking-Eagle3
+            - name: MODEL_REVISION
+              value: 0b0c6ac039089ad2c2418c91c039553381a302d9
+            - name: HF_HOME
+              value: /model-store
+            - name: HF_HUB_ENABLE_HF_TRANSFER
+              value: "1"
+          args:
+            - |
+              set -eux
+              pip install --no-cache-dir huggingface_hub hf_transfer
+              hf download "$MODEL_NAME" --revision "$MODEL_REVISION"
+          volumeMounts:
+            - name: model-cache
+              mountPath: /model-store
+      volumes:
+        - name: model-cache
+          persistentVolumeClaim:
+            claimName: model-cache
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/README.md
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/README.md
 # Kimi-K2.5 nvidia/Kimi-K2.5-NVFP4 — Aggregated Deployments on Kubernetes

-> **Note:** The `nvidia/Kimi-K2.5-NVFP4` model requires a patched TensorRT-LLM container image because
-> upstream TRT-LLM support for Kimi K2.5 has not yet been released. You must build the patched image before
-> deploying either configuration below. See [`patch/`](patch/) for the script and instructions.
+> Upstream TensorRT-LLM does not yet include native support for Kimi K2.5. This recipe works around that limitation by directly patching the container image with an append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path. See [`patch/`](patch/) for the patch script and full instructions.

-> **Text only:** The patch registers `KimiK25ForConditionalGeneration` by loading the DeepSeek-V3
+> **Note**: The two standard deployments (`deploy.yaml` and `deploy-kvbm.yaml`) for the `nvidia/Kimi-K2.5-NVFP4` model require a patched TensorRT-LLM container image because upstream TRT-LLM support for Kimi K2.5 has not yet been released. You must build the patched image before deploying either of them. See [`patch/`](patch/) for the script and instructions. **The `deploy-specdec.yaml` speculative decoding recipe does not need the image patch.**
+
+> **Text only:** Current upstream TensorRT-LLM supports Kimi-K2.5 models by loading the DeepSeek-V3
 > text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
 > processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.

-This directory contains two aggregated deployment configurations for the `nvidia/Kimi-K2.5-NVFP4` model:
+This directory contains three aggregated deployment configurations for the `nvidia/Kimi-K2.5-NVFP4` model.

-| Deployment | Manifest | Description |
-|-----------|----------|-------------|
-| **Standard Aggregated** | [`deploy.yaml`](deploy.yaml) | Basic aggregated serving with KV-aware routing |
-| **Aggregated + KVBM** | [`deploy-kvbm.yaml`](deploy-kvbm.yaml) | Aggregated serving with CPU-offloaded KV cache (KV Block Manager) |
+| Deployment | Manifest | Description | Hardware Requirement |
+|-----------|----------|-------------|----|
+| **Standard Aggregated** | [`deploy.yaml`](deploy.yaml) | Basic aggregated serving with KV-aware routing | 1x8 B200 node |
+| **Aggregated + KVBM** | [`deploy-kvbm.yaml`](deploy-kvbm.yaml) | Aggregated serving with CPU-offloaded KV cache (KV Block Manager) | 1x8 B200 node |
+| **Aggregated + EAGLE SpecDec** | [`deploy-specdec.yaml`](deploy-specdec.yaml) | Performant aggregated deployment with EAGLE speculative decoding and KV-aware routing | 8x4 GB200 nodes |

 ## Prerequisites

 - A Kubernetes cluster with the [Dynamo Operator](https://docs.nvidia.com/dynamo/) installed
-- 8x B200 GPUs
+- 1x8 B200 GPUs or 8x4 GB200 GPUs
 - A `hf-token-secret` Secret containing your Hugging Face token
-- A pre-existing `model-cache` PVC with the downloaded model
-- A **patched container image** -- the deploy manifests ship with a placeholder `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`. You must build a patched image and update the `image:` fields before deploying. See [patch instructions](patch/) for details.
+- A pre-existing `model-cache` PVC
+- `deploy.yaml` and `deploy-kvbm.yaml` require a patched image tag such as `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`. You must build a patched image and update the `image:` fields before deploying. See [patch instructions](patch/) for details.
+- `deploy-specdec.yaml` uses `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag` and works with a current top-of-tree Dynamo TRT-LLM image

 ---

@@ -76,3 +78,63 @@ This scrapes `/metrics` on port `6880` (named `kvbm`) every 5 seconds from worke
 - `nvidia.com/metrics-enabled: "true"`

 > **Note:** If your Prometheus Operator watches a namespace other than `monitoring` for PodMonitors, change `metadata.namespace` in `podmonitor-kvbm.yaml` accordingly.
+
+---
+
+## Aggregated Deployment with EAGLE Speculative Decoding and KV-aware routing
+
+Uses [`deploy-specdec.yaml`](deploy-specdec.yaml). This performant configuration runs KV-aware aggregated serving with EAGLE speculative decoding on GB200 and does not require the patched image used by the standard and KVBM manifests.
+
+### Speculative Decoding Prerequisites
+
+- 8 GB200 nodes with 4 GPUs each
+- Update the placeholder image tag `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag` in [`deploy-specdec.yaml`](deploy-specdec.yaml) before deploying.
+
+### Additional Model Assets
+
+This deployment needs both the base Kimi weights and the Eagle draft model on the `model-cache` PVC.
+
+Download the base model:
+
+```bash
+kubectl apply -f ../../../model-cache/model-cache.yaml -n ${NAMESPACE}
+kubectl apply -f ../../../model-cache/nvidia/model-download.yaml -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
+```
+
+Download the Eagle draft model:
+
+```bash
+kubectl apply -f ../../../model-cache/nvidia/eagle-download.yaml -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/eagle-download -n ${NAMESPACE} --timeout=6000s
+```
+
+The worker config loads the draft model from:
+
+```yaml
+speculative_config:
+  decoding_type: Eagle
+  max_draft_len: 3
+  speculative_model_dir: /opt/models/hub/models--nvidia--Kimi-K2.5-Thinking-Eagle3/snapshots/0b0c6ac039089ad2c2418c91c039553381a302d9
+```
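
Reviewer note: the `speculative_model_dir` above appears to follow the standard Hugging Face hub cache layout (`<HF_HOME>/hub/models--<org>--<name>/snapshots/<commit>`), with the PVC mounted at `/opt/models` in the worker. The sketch below shows how that path can be derived, so the pinned `MODEL_REVISION` in `eagle-download.yaml` and the path in the config can be cross-checked. The helper name `hf_snapshot_dir` is hypothetical and not part of this recipe; it also assumes the revision is a full commit hash, since the snapshot directory is named after the resolved commit rather than a branch or tag.

```python
from pathlib import PurePosixPath


def hf_snapshot_dir(hf_home: str, repo_id: str, revision: str) -> str:
    """Build the hub-cache snapshot directory for a model repo.

    Hypothetical helper for illustration. Assumes `revision` is a full
    commit hash and that the cache root is `<hf_home>/hub`.
    """
    # The hub cache stores "org/name" repos under "models--org--name".
    repo_folder = "models--" + repo_id.replace("/", "--")
    return str(PurePosixPath(hf_home) / "hub" / repo_folder / "snapshots" / revision)


# Values taken from eagle-download.yaml and the worker's HF_HOME.
eagle_dir = hf_snapshot_dir(
    "/opt/models",
    "nvidia/Kimi-K2.5-Thinking-Eagle3",
    "0b0c6ac039089ad2c2418c91c039553381a302d9",
)
print(eagle_dir)
```

If the download job's `MODEL_REVISION` is ever bumped, the `speculative_model_dir` in `deploy-specdec.yaml` would need the same change, because the snapshot directory name is the commit hash.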
+
+### Speculative Decoding Deployment Topology
+
+The manifest runs one aggregated frontend and four aggregated worker replicas. Each worker spans two nodes:
+
+- `multinode.nodeCount: 2`
+- `resources.limits.gpu: "4"` per node
+- `tensor_parallel_size: 8`
+- `moe_expert_parallel_size: 8`
+
+This is an 8-node deployment in total for the workers.
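
Reviewer note: the topology arithmetic above can be sanity-checked with a few lines of Python. This is only an illustrative sketch; the `WorkerTopology` class and its field names are hypothetical and simply mirror the manifest values (`replicas: 4`, `multinode.nodeCount: 2`, `gpu: "4"`, `tensor_parallel_size: 8`). It assumes that each worker's GPU count should equal its tensor-parallel size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerTopology:
    """Hypothetical summary of the worker settings in deploy-specdec.yaml."""

    replicas: int              # TrtllmWorker.replicas
    nodes_per_worker: int      # multinode.nodeCount
    gpus_per_node: int         # resources.limits.gpu
    tensor_parallel_size: int  # tensor_parallel_size in the engine config

    @property
    def gpus_per_worker(self) -> int:
        return self.nodes_per_worker * self.gpus_per_node

    @property
    def total_nodes(self) -> int:
        return self.replicas * self.nodes_per_worker

    @property
    def total_gpus(self) -> int:
        return self.replicas * self.gpus_per_worker

    def is_consistent(self) -> bool:
        # Assumption: one GPU per tensor-parallel rank within a worker.
        return self.gpus_per_worker == self.tensor_parallel_size


specdec = WorkerTopology(replicas=4, nodes_per_worker=2, gpus_per_node=4,
                         tensor_parallel_size=8)
print(specdec.gpus_per_worker, specdec.total_nodes, specdec.total_gpus,
      specdec.is_consistent())
```

With the manifest values this gives 8 GPUs per worker, 8 worker nodes, and 32 GPUs overall, which matches the "8x4 GB200" figure in the hardware tables.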
+
+### Deployment
+
+```bash
+kubectl apply -f deploy-specdec.yaml -n ${NAMESPACE}
+```
+
+This creates:
+- A **ConfigMap** (`llm-config-specdec`) with the TRT-LLM speculative decoding config
+- A **DynamoGraphDeployment** (`kimi-k25-agg-specdec`) with a KV-aware router frontend and four multinode TRT-LLM worker replicas serving `nvidia/Kimi-K2.5-NVFP4`
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy-specdec.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy-specdec.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: llm-config-specdec
+data:
+  config.yaml: |
+    backend: pytorch
+    trust_remote_code: true
+    allreduce_strategy: MNNVL
+    max_num_tokens: 148000
+    enable_chunked_prefill: false
+    enable_attention_dp: false
+    max_batch_size: 16
+    cuda_graph_config:
+      batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+      enable_padding: true
+    kv_cache_config:
+      dtype: fp8
+      enable_block_reuse: true
+      free_gpu_memory_fraction: 0.75
+    moe_config:
+      backend: TRTLLM
+      max_num_tokens: 2048
+    moe_expert_parallel_size: 8
+    num_postprocess_workers: 8
+    print_iter_log: true
+    stream_interval: 10
+    tensor_parallel_size: 8
+    speculative_config:
+      decoding_type: Eagle
+      max_draft_len: 3
+      speculative_model_dir: /opt/models/hub/models--nvidia--Kimi-K2.5-Thinking-Eagle3/snapshots/0b0c6ac039089ad2c2418c91c039553381a302d9
+      allow_advanced_sampling: true
+---
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: kimi-k25-agg-specdec
+spec:
+  backendFramework: trtllm
+  pvcs:
+    - name: model-cache
+      create: false
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - python3 -m dynamo.frontend --router-mode kv --router-reset-states --http-port 8000 --discovery-backend kubernetes --request-plane nats
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+    TrtllmWorker:
+      componentType: worker
+      envFromSecret: hf-token-secret
+      replicas: 4
+      multinode:
+        nodeCount: 2
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      sharedMemory:
+        size: 179Gi
+      resources:
+        limits:
+          gpu: "4"
+        requests:
+          gpu: "4"
+      extraPodSpec:
+        mainContainer:
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - |
+              python3 -m dynamo.trtllm \
+                --model-path "${MODEL_NAME}" \
+                --served-model-name "${SERVED_MODEL_NAME}" \
+                --extra-engine-args "${ENGINE_ARGS}" \
+                --discovery-backend kubernetes \
+                --request-plane nats \
+                --publish-events-and-metrics \
+                --kv-block-size 32 \
+                --dyn-tool-call-parser kimi_k2 \
+                --dyn-reasoning-parser kimi_k25
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          env:
+            - name: MODEL_NAME
+              value: nvidia/Kimi-K2.5-NVFP4
+            - name: SERVED_MODEL_NAME
+              value: nvidia/Kimi-K2.5-NVFP4
+            - name: ENGINE_ARGS
+              value: /opt/dynamo/configs/config.yaml
+            - name: HF_HOME
+              value: /opt/models
+            - name: NCCL_DEBUG
+              value: INFO
+            - name: TRTLLM_ENABLE_PDL
+              value: "1"
+            - name: ENABLE_CONFIGURABLE_MOE
+              value: "1"
+            - name: TLLM_LOG_LEVEL
+              value: INFO
+            - name: TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENS
+              value: "2"
+          volumeMounts:
+            - name: llm-config-specdec
+              mountPath: /opt/dynamo/configs
+              readOnly: true
+          workingDir: /workspace/
+        volumes:
+          - name: llm-config-specdec
+            configMap:
+              name: llm-config-specdec