Commit 5631db75 authored by yunzhoul-nv, committed by GitHub


docs: add performance-optimized recipe detail for Kimi-K2.5 with Speculative Decoding recipe (#7555)
Signed-off-by: Yunzhou Liu <232973175+yunzhoul-nv@users.noreply.github.com>
parent cbdab502
......@@ -9,14 +9,14 @@ There are two model weight variants, each with its own model download and deploy
| Variant | Model | Status | Modality | Deploy Configs | Notes |
|---------|-------|--------|----------|---------------|-------|
| **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | Functional | Text only | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image, not yet performance-optimized |
| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/). Vision input is not yet functional — the patch loads the text backbone only. |
| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml), and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | `deploy.yaml` and `deploy-kvbm.yaml` require a [patched image](trtllm/agg/nvidia/patch/); `deploy-specdec.yaml` works with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional. |
All configurations use TP8, EP8, aggregated mode with KV-aware routing.
## Prerequisites
1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GPU cluster** with B200 GPUs (8x per worker)
2. **GPU cluster** with B200 GPUs (8 per worker) or GB200 GPUs (4 workers, each spanning 2 nodes with 4 GPUs per node)
3. **HuggingFace token** with access to the model
## Hardware Requirements
......@@ -24,6 +24,7 @@ All configurations use TP8, EP8, aggregated mode with KV-aware routing.
| Configuration | GPUs |
|--------------|------|
| Aggregated | 8x B200 |
| Aggregated Speculative Decoding | 32x GB200 (8 nodes with 4 GPUs each; 4 workers, each spanning 2 nodes) |
---
......@@ -78,28 +79,26 @@ curl http://localhost:8000/v1/chat/completions \
## nvidia/Kimi-K2.5-NVFP4
**Status:** Experimental | **Modality:** Text only upstream support
**Status:** Functional | **Modality:** Text only upstream support
> **Experimental:** Upstream TensorRT-LLM does not yet include native support for Kimi K2.5.
> This recipe works around that limitation by directly patching the container image with an
> append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path.
> See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for the patch script and full instructions.
> **Experimental for standard and KVBM deployments**: Upstream TensorRT-LLM does not yet include native support for Kimi K2.5. This recipe works around that limitation by directly patching the container image with an append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path. See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch) for the patch script and full instructions.
> **Text only:** The patch loads the DeepSeek-V3 text backbone from the Kimi K2.5 config
> (`text_config`). The vision encoder is not loaded, so image inputs are not processed.
> Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
> **Functional**: The [speculative decoding recipe](trtllm/agg/nvidia/deploy-specdec.yaml) does not need the patch and is optimized for performance.
The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`.
> **Text only:** Current upstream TensorRT-LLM supports Kimi-K2.5 models by loading the DeepSeek-V3
> text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
> processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`. The standard and KVBM deployments still require the Kimi patched TRT-LLM image, while the speculative decoding deployment in `deploy-specdec.yaml` works with a current top-of-tree Dynamo TRT-LLM image.
### Quick Start
The nvidia deploy manifests (`deploy.yaml`, `deploy-kvbm.yaml`) ship with a placeholder image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`.
Before deploying, you must:
The nvidia deploy manifests use two image flows:
1. Build a patched image via `docker build` with the `trtllm/agg/nvidia/patch/` context and `BASE_IMAGE` build-arg (see command below).
2. Update the `image:` fields in the deploy YAML to reference the patched image.
- `deploy.yaml` and `deploy-kvbm.yaml` use the placeholder patched image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`
- `deploy-specdec.yaml` uses the placeholder top-of-tree image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`
See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for details on what the patch does.
Before deploying, update the `image:` fields in the manifest you plan to use.
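One way to update the `image:` fields is a small `sed` substitution. The following is a minimal sketch, not part of the recipe: the replacement image name is hypothetical, and the script works on a scratch file so it can be tried safely. The pattern is anchored at end of line so that it does not also rewrite the `my-tag-patched` placeholder.

```bash
#!/bin/sh
set -eu

# Placeholder shipped in deploy-specdec.yaml (from the recipe).
PLACEHOLDER="nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag"
# Hypothetical image you built or pulled; replace with your own.
NEW_IMAGE="my-registry.example.com/tensorrtllm-runtime:my-real-tag"

# Scratch file standing in for a deploy manifest, so the sketch is self-contained.
MANIFEST="$(mktemp)"
printf '          image: %s\n' "$PLACEHOLDER" > "$MANIFEST"

# Replace the placeholder; the trailing "$" anchor skips "my-tag-patched".
sed "s|image: ${PLACEHOLDER}\$|image: ${NEW_IMAGE}|" "$MANIFEST" > "${MANIFEST}.new"
mv "${MANIFEST}.new" "$MANIFEST"

grep 'image:' "$MANIFEST"
```

To apply the same substitution to a real manifest, point `MANIFEST` at the file you plan to deploy (for example `trtllm/agg/nvidia/deploy-specdec.yaml`) instead of the scratch file, and review the result before running `kubectl apply`.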
```bash
# Set namespace
......@@ -116,11 +115,12 @@ kubectl apply -f model-cache/nvidia/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
# Patch the container image (required for nvidia weights)
# Skip this step for the speculative decoding recipe (`deploy-specdec.yaml`)
docker build --build-arg BASE_IMAGE=nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag \
-t nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched \
trtllm/agg/nvidia/patch/
# Update the image in the deploy manifest to use the patched tag
# Update the image in the deploy manifest: the patched tag for deploy.yaml and deploy-kvbm.yaml, the top-of-tree tag for deploy-specdec.yaml
# Deploy
kubectl apply -f trtllm/agg/nvidia/deploy.yaml -n ${NAMESPACE}
......@@ -252,4 +252,4 @@ If `tool_calls` is missing with raw `<|tool_calls_section_begin|>` tokens in `co
## Notes
- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
- The nvidia variant requires a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream in TensorRT-LLM
- The two basic recipes in the nvidia variant (`deploy.yaml` and `deploy-kvbm.yaml`) require a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream in TensorRT-LLM
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: eagle-download
spec:
  backoffLimit: 3
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: eagle-download
    spec:
      restartPolicy: Never
      containers:
        - name: eagle-download
          image: python:3.10-slim
          command: ["sh", "-c"]
          envFrom:
            - secretRef:
                name: hf-token-secret
          env:
            - name: MODEL_NAME
              value: nvidia/Kimi-K2.5-Thinking-Eagle3
            - name: MODEL_REVISION
              value: 0b0c6ac039089ad2c2418c91c039553381a302d9
            - name: HF_HOME
              value: /model-store
            - name: HF_HUB_ENABLE_HF_TRANSFER
              value: "1"
          args:
            - |
              set -eux
              pip install --no-cache-dir huggingface_hub hf_transfer
              hf download "$MODEL_NAME" --revision "$MODEL_REVISION"
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
# Kimi-K2.5 nvidia/Kimi-K2.5-NVFP4 — Aggregated Deployments on Kubernetes
> **Note:** The `nvidia/Kimi-K2.5-NVFP4` model requires a patched TensorRT-LLM container image because
> upstream TRT-LLM support for Kimi K2.5 has not yet been released. You must build the patched image before
> deploying either configuration below. See [`patch/`](patch/) for the script and instructions.
> Upstream TensorRT-LLM does not yet include native support for Kimi K2.5. This recipe works around that limitation by directly patching the container image with an append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path. See [`patch/`](patch/) for the patch script and full instructions.
> **Text only:** The patch registers `KimiK25ForConditionalGeneration` by loading the DeepSeek-V3
> **Note**: The two standard deployments (`deploy.yaml` and `deploy-kvbm.yaml`) for the `nvidia/Kimi-K2.5-NVFP4` model require a patched TensorRT-LLM container image because upstream TRT-LLM support for Kimi K2.5 has not yet been released. You must build the patched image before deploying either of them. See [`patch/`](patch/) for the script and instructions. **The `deploy-specdec.yaml` speculative decoding recipe does not need the image patch.**
> **Text only:** Current upstream TensorRT-LLM supports Kimi-K2.5 models by loading the DeepSeek-V3
> text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
> processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
This directory contains two aggregated deployment configurations for the `nvidia/Kimi-K2.5-NVFP4` model:
This directory contains three aggregated deployment configurations for the `nvidia/Kimi-K2.5-NVFP4` model.
| Deployment | Manifest | Description |
|-----------|----------|-------------|
| **Standard Aggregated** | [`deploy.yaml`](deploy.yaml) | Basic aggregated serving with KV-aware routing |
| **Aggregated + KVBM** | [`deploy-kvbm.yaml`](deploy-kvbm.yaml) | Aggregated serving with CPU-offloaded KV cache (KV Block Manager) |
| Deployment | Manifest | Description | Hardware Requirement |
|-----------|----------|-------------|----------------------|
| **Standard Aggregated** | [`deploy.yaml`](deploy.yaml) | Basic aggregated serving with KV-aware routing | 1 B200 node (8 GPUs) |
| **Aggregated + KVBM** | [`deploy-kvbm.yaml`](deploy-kvbm.yaml) | Aggregated serving with CPU-offloaded KV cache (KV Block Manager) | 1 B200 node (8 GPUs) |
| **Aggregated + EAGLE SpecDec** | [`deploy-specdec.yaml`](deploy-specdec.yaml) | Performance-optimized aggregated deployment with EAGLE speculative decoding and KV-aware routing | 8 GB200 nodes (4 GPUs each) |
## Prerequisites
- A Kubernetes cluster with the [Dynamo Operator](https://docs.nvidia.com/dynamo/) installed
- 8x B200 GPUs
- 8x B200 GPUs (1 node) or 32x GB200 GPUs (8 nodes with 4 GPUs each)
- A `hf-token-secret` Secret containing your Hugging Face token
- A pre-existing `model-cache` PVC with the downloaded model
- A **patched container image** -- the deploy manifests ship with a placeholder `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`. You must build a patched image and update the `image:` fields before deploying. See [patch instructions](patch/) for details.
- A pre-existing `model-cache` PVC
- `deploy.yaml` and `deploy-kvbm.yaml` require a patched image tag such as `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`. You must build a patched image and update the `image:` fields before deploying. See [patch instructions](patch/) for details.
- `deploy-specdec.yaml` uses `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag` and works with a current top-of-tree Dynamo TRT-LLM image
---
......@@ -76,3 +78,63 @@ This scrapes `/metrics` on port `6880` (named `kvbm`) every 5 seconds from worke
- `nvidia.com/metrics-enabled: "true"`
> **Note:** If your Prometheus Operator watches a namespace other than `monitoring` for PodMonitors, change `metadata.namespace` in `podmonitor-kvbm.yaml` accordingly.
---
## Aggregated Deployment with EAGLE Speculative Decoding and KV-aware Routing
Uses [`deploy-specdec.yaml`](deploy-specdec.yaml). This performant configuration runs KV-aware aggregated serving with EAGLE speculative decoding on GB200 and does not require the patched image used by the standard and KVBM manifests.
### Speculative Decoding Prerequisites
- 8 GB200 nodes with 4 GPUs each
- Update the placeholder image tag `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag` in [`deploy-specdec.yaml`](deploy-specdec.yaml) before deploying.
### Additional Model Assets
This deployment needs both the base Kimi weights and the Eagle draft model on the `model-cache` PVC.
Download the base model:
```bash
kubectl apply -f ../../../model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f ../../../model-cache/nvidia/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
```
Download the Eagle draft model:
```bash
kubectl apply -f ../../../model-cache/nvidia/eagle-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/eagle-download -n ${NAMESPACE} --timeout=6000s
```
The worker config loads the draft model from:
```yaml
speculative_config:
  decoding_type: Eagle
  max_draft_len: 3
  speculative_model_dir: /opt/models/hub/models--nvidia--Kimi-K2.5-Thinking-Eagle3/snapshots/0b0c6ac039089ad2c2418c91c039553381a302d9
```
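The `speculative_model_dir` path follows the Hugging Face Hub cache layout: under `$HF_HOME/hub`, each repository is stored as `models--<org>--<name>`, with one directory per revision under `snapshots/`. The download job writes to the PVC with `HF_HOME=/model-store`, and the worker mounts the same PVC at `/opt/models`. The sketch below rebuilds the path from those values; the variable names are illustrative and not part of the recipe.

```bash
#!/bin/sh
set -eu

# Values taken from eagle-download.yaml and deploy-specdec.yaml.
MOUNT_POINT="/opt/models"   # where the worker mounts the model-cache PVC
MODEL_NAME="nvidia/Kimi-K2.5-Thinking-Eagle3"
MODEL_REVISION="0b0c6ac039089ad2c2418c91c039553381a302d9"

# Hub cache folder name: "models--" plus the repo id with "/" replaced by "--".
REPO_DIR="models--$(printf '%s' "$MODEL_NAME" | sed 's|/|--|g')"

# Snapshot directory for the pinned revision.
SNAPSHOT_DIR="${MOUNT_POINT}/hub/${REPO_DIR}/snapshots/${MODEL_REVISION}"

echo "$SNAPSHOT_DIR"
```

If you pin a different `MODEL_REVISION` in the download job, the path in the worker config has to change with it, since the revision hash is part of the directory name.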
### Speculative Decoding Deployment Topology
The manifest runs one aggregated frontend and four aggregated worker replicas. Each worker spans two nodes:
- `multinode.nodeCount: 2`
- `resources.limits.gpu: "4"` per node
- `tensor_parallel_size: 8`
- `moe_expert_parallel_size: 8`
In total, the four workers occupy 8 nodes (32 GPUs).
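The topology numbers can be cross-checked with a few lines of shell arithmetic. This is a sanity-check sketch using the values from the manifest; the variable names are illustrative. It verifies that the GPUs available to one worker match `tensor_parallel_size`, then reports the cluster-wide totals.

```bash
#!/bin/sh
set -eu

# Values from deploy-specdec.yaml.
REPLICAS=4               # TrtllmWorker replicas
NODES_PER_WORKER=2       # multinode.nodeCount
GPUS_PER_NODE=4          # resources.limits.gpu
TENSOR_PARALLEL_SIZE=8   # tensor_parallel_size in the engine config

GPUS_PER_WORKER=$((NODES_PER_WORKER * GPUS_PER_NODE))
TOTAL_NODES=$((REPLICAS * NODES_PER_WORKER))
TOTAL_GPUS=$((REPLICAS * GPUS_PER_WORKER))

# Each worker must own exactly as many GPUs as the tensor-parallel size.
if [ "$GPUS_PER_WORKER" -ne "$TENSOR_PARALLEL_SIZE" ]; then
  echo "GPU count per worker ($GPUS_PER_WORKER) != TP size ($TENSOR_PARALLEL_SIZE)" >&2
  exit 1
fi

echo "nodes=${TOTAL_NODES} gpus=${TOTAL_GPUS} gpus_per_worker=${GPUS_PER_WORKER}"
```

With the values above, the check passes and the totals are 8 nodes and 32 GPUs. Changing `replicas` scales the totals without affecting the per-worker check.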
### Deployment
```bash
kubectl apply -f deploy-specdec.yaml -n ${NAMESPACE}
```
This creates:
- A **ConfigMap** (`llm-config-specdec`) with the TRT-LLM speculative decoding config
- A **DynamoGraphDeployment** (`kimi-k25-agg-specdec`) with a KV-aware router frontend and four multinode TRT-LLM worker replicas serving `nvidia/Kimi-K2.5-NVFP4`
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config-specdec
data:
  config.yaml: |
    backend: pytorch
    trust_remote_code: true
    allreduce_strategy: MNNVL
    max_num_tokens: 148000
    enable_chunked_prefill: false
    enable_attention_dp: false
    max_batch_size: 16
    cuda_graph_config:
      batch_sizes:
        - 1
        - 2
        - 4
        - 8
        - 16
      enable_padding: true
    kv_cache_config:
      dtype: fp8
      enable_block_reuse: true
      free_gpu_memory_fraction: 0.75
    moe_config:
      backend: TRTLLM
      max_num_tokens: 2048
    moe_expert_parallel_size: 8
    num_postprocess_workers: 8
    print_iter_log: true
    stream_interval: 10
    tensor_parallel_size: 8
    speculative_config:
      decoding_type: Eagle
      max_draft_len: 3
      speculative_model_dir: /opt/models/hub/models--nvidia--Kimi-K2.5-Thinking-Eagle3/snapshots/0b0c6ac039089ad2c2418c91c039553381a302d9
      allow_advanced_sampling: true
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: kimi-k25-agg-specdec
spec:
  backendFramework: trtllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.frontend --router-mode kv --router-reset-states --http-port 8000 --discovery-backend kubernetes --request-plane nats
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
    TrtllmWorker:
      componentType: worker
      envFromSecret: hf-token-secret
      replicas: 4
      multinode:
        nodeCount: 2
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 179Gi
      resources:
        limits:
          gpu: "4"
        requests:
          gpu: "4"
      extraPodSpec:
        mainContainer:
          command:
            - /bin/sh
            - -c
          args:
            - |
              python3 -m dynamo.trtllm \
                --model-path "${MODEL_NAME}" \
                --served-model-name "${SERVED_MODEL_NAME}" \
                --extra-engine-args "${ENGINE_ARGS}" \
                --discovery-backend kubernetes \
                --request-plane nats \
                --publish-events-and-metrics \
                --kv-block-size 32 \
                --dyn-tool-call-parser kimi_k2 \
                --dyn-reasoning-parser kimi_k25
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
          env:
            - name: MODEL_NAME
              value: nvidia/Kimi-K2.5-NVFP4
            - name: SERVED_MODEL_NAME
              value: nvidia/Kimi-K2.5-NVFP4
            - name: ENGINE_ARGS
              value: /opt/dynamo/configs/config.yaml
            - name: HF_HOME
              value: /opt/models
            - name: NCCL_DEBUG
              value: INFO
            - name: TRTLLM_ENABLE_PDL
              value: "1"
            - name: ENABLE_CONFIGURABLE_MOE
              value: "1"
            - name: TLLM_LOG_LEVEL
              value: INFO
            - name: TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENS
              value: "2"
          volumeMounts:
            - name: llm-config-specdec
              mountPath: /opt/dynamo/configs
              readOnly: true
          workingDir: /workspace/
        volumes:
          - name: llm-config-specdec
            configMap:
              name: llm-config-specdec