docs: add experimental recipes details for Kimi-K2.5 recipe (#7412)

Commit b22a9d76, authored Mar 16, 2026 by Biswa Panda; committed by GitHub on Mar 16, 2026
8 changed files
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -42,7 +42,6 @@ These recipes demonstrate aggregated or disaggregated serving:
 | **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅ | ❌ | TP=16, multi-node. Use `model-download-sglang.yaml` | ❌ |
 | **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 36x GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |
 | **[DeepSeek-R1](deepseek-r1/)** | vLLM | Disagg DEP16 | 32x H200 | ✅ | ❌ | Multi-node, data-expert parallel | ❌ |
-| **[Kimi-K2.5](kimi-k2.5/)** 🚧 | TensorRT-LLM | Aggregated | 8x B200 | ✅ | ❌ | Experimental — MoE model, TP8×EP8, reasoning + tool calling | ❌ |
 **Legend:**
 - **Deployment**: ✅ = Complete `deploy.yaml` manifest available
@@ -58,6 +57,15 @@ These recipes demonstrate functional deployments with Dynamo features, but have
 | **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/agg/)** | SGLang | Aggregated | 4x H100/H200 | ✅ | TP=4, KV-aware routing, 1.0+ |
 | **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, UCX KV transfer |
 | **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/disagg/)** | SGLang | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, nixl KV transfer, 1.0+ |
+| **[Kimi-K2.5 (Baseten)](kimi-k2.5/trtllm/agg/baseten/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling |
+### Experimental Recipes
+These recipes are under active development and may require additional setup steps (e.g., container patching). They are functional but not yet fully validated for production use.
+| Model | Framework | Mode | GPUs | Deployment | Notes |
+|-------|-----------|------|------|------------|-------|
+| **[nvidia/Kimi-K2.5-NVFP4](kimi-k2.5/trtllm/agg/nvidia/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling. Requires [container patch](kimi-k2.5/trtllm/agg/nvidia/patch/). Vision input not yet functional with the patch. |
 ## Recipe Structure

--- a/recipes/kimi-k2.5/README.md
+++ b/recipes/kimi-k2.5/README.md
 # Kimi-K2.5 Recipes
-> 🚧 **Work-in-Progress — Experimental Recipe**
+Deployment recipes for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware routing.
->
-> The TensorRT-LLM Python package used for Dynamo's TRT-LLM integration does not yet include
-> native Kimi K2.5 support. This recipe is an **experimental** effort to bring
-> Kimi K2.5 to Dynamo ahead of upstream availability. It needs to patch the container image on top of released dynamo image.
-Deployment recipe for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware routing.
 ## Available Configurations
 There are two model weight variants, each with its own model download and deploy manifests:
-| Variant | Model | Deploy Configs | Notes |
+| Variant | Model | Status | Modality | Deploy Configs | Notes |
-|---------|-------|---------------|-------|
+|---------|-------|--------|----------|---------------|-------|
-| **nvidia** 🚧 | `nvidia/Kimi-K2.5-NVFP4` | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/) |
+| **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | Functional | Text only | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image, not yet performance-optimized |
-| **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image |
+| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/). Vision input is not yet functional — the patch loads the text backbone only. |
 All configurations use TP8, EP8, aggregated mode with KV-aware routing.
-The **nvidia** variant also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`.
 ## Prerequisites
 1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
 2. **GPU cluster** with B200 GPUs (8x per worker)
 3. **HuggingFace token** with access to the model
-## Quick Start (nvidia variant)
+## Hardware Requirements
+| Configuration | GPUs |
+|--------------|------|
+| Aggregated | 8x B200 |
+---
+## baseten-admin/Kimi-2.5-text-nvfp4-v3
+**Status:** Functional (not yet performance-optimized) | **Modality:** Text only
+The baseten variant uses a text-only backend built on the underlying DeepSeek-V3 architecture, which means it works out of the box with the stock TensorRT-LLM container image -- no patching or custom builds required. This recipe is functional for text-based inference with reasoning and tool calling, but has not yet been performance-tuned or benchmarked.
+### Quick Start
+The baseten deploy manifest ships with a placeholder image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`.
+Update the `image:` fields in [`trtllm/agg/baseten/deploy.yaml`](trtllm/agg/baseten/deploy.yaml) to your actual Dynamo release tag before deploying.
+```bash
+# Set namespace
+export NAMESPACE=dynamo-demo
+kubectl create namespace ${NAMESPACE}
+# Create HuggingFace token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token-here" \
+  -n ${NAMESPACE}
+# Download model (update storageClassName in model-cache/model-cache.yaml first!)
+kubectl apply -f model-cache/baseten/ -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
+# Update the image tag in trtllm/agg/baseten/deploy.yaml to your Dynamo release tag
+# Deploy
+kubectl apply -f trtllm/agg/baseten/deploy.yaml -n ${NAMESPACE}
+```
+### Test the Deployment
+```bash
+# Port-forward the frontend
+kubectl port-forward svc/kimi-k25-agg-frontend 8000:8000 -n ${NAMESPACE}
+# Send a test request
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 100
+  }'
+```
+---
+## nvidia/Kimi-K2.5-NVFP4
+**Status:** Experimental | **Modality:** Text only
+> **Experimental:** Upstream TensorRT-LLM does not yet include native support for Kimi K2.5.
+> This recipe works around that limitation by directly patching the container image with an
+> append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path.
+> See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for the patch script and full instructions.
+> **Text only:** The patch loads the DeepSeek-V3 text backbone from the Kimi K2.5 config
+> (`text_config`). The vision encoder is not loaded, so image inputs are not processed.
+> Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
+The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`.
+### Quick Start
+The nvidia deploy manifests (`deploy.yaml`, `deploy-kvbm.yaml`) ship with a placeholder image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`.
+Before deploying, you must:
+1. Run the [patch script](trtllm/agg/nvidia/patch/) to build a patched image (appends `-patched` to the tag).
+2. Update the `image:` fields in the deploy YAML to reference the patched image.
+See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for full details on what the patch does.
 ```bash
 # Set namespace
@@ -43,18 +115,19 @@ kubectl create secret generic hf-token-secret \
 kubectl apply -f model-cache/nvidia/ -n ${NAMESPACE}
 kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
-# Patch the container image (required for nvidia weights)
+# Patch the container image (required — upstream support not yet available)
+# This produces: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
 cd trtllm/agg/nvidia/patch
 ./patch-container.sh nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
 cd -
+# Update the image in the deploy manifest to use the patched tag
 # Deploy
 kubectl apply -f trtllm/agg/nvidia/deploy.yaml -n ${NAMESPACE}
 ```
-For baseten weights, use `model-cache/baseten/` and `trtllm/agg/baseten/deploy.yaml` instead — no image patch needed.
+### Test the Deployment
-## Test the Deployment
 ```bash
 # Port-forward the frontend
@@ -70,19 +143,14 @@ curl http://localhost:8000/v1/chat/completions \
  }'
 ```
+---
 ## Model Details
-- **Model**: `nvidia/Kimi-K2.5-NVFP4` (NV FP4 quantized, text-only)
 - **Architecture**: MoE (Mixture-of-Experts), based on DeepSeek-V3 architecture
 - **Backend**: TensorRT-LLM (PyTorch backend)
 - **Parallelism**: TP8, EP8 (Expert Parallel)
+- **Quantization**: NV FP4
-## Hardware Requirements
-| Configuration | GPUs |
-|--------------|------|
-| Aggregated | 8x B200 |
 ## Verifying Reasoning
@@ -185,4 +253,4 @@ If `tool_calls` is missing with raw `<|tool_calls_section_begin|>` tokens in `co
 ## Notes
 - Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
-- The nvidia variant requires a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream
+- The nvidia variant requires a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream in TensorRT-LLM
--- a/recipes/kimi-k2.5/trtllm/agg/baseten/deploy.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/baseten/deploy.yaml
@@ -51,7 +51,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
      replicas: 1
    TrtllmWorker:
      componentType: worker
@@ -84,7 +84,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
          env:
          - name: TRTLLM_ENABLE_PDL
            value: "1"
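
Both `image:` fields in this manifest now carry the placeholder tag `my-tag`, so the tag has to be substituted before `kubectl apply`. One way to do that is sketched below with `sed`. The helper name `set_image_tag` and the release tag passed to it are assumptions for illustration and are not part of the recipe.

```bash
# set_image_tag is a hypothetical helper, not part of the recipe.
# Usage: set_image_tag <manifest> <release-tag>
set_image_tag() {
  manifest="$1"
  tag="$2"
  tmp="$(mktemp)" || return 1
  # Match only image lines that end in the placeholder ":my-tag";
  # a tag such as "my-tag-patched" is left untouched.
  # Assumes the tag contains no "|", "&", or "\" characters.
  sed "s|\(image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:\)my-tag\$|\1${tag}|" \
    "$manifest" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Copy back instead of mv so the manifest keeps its permissions.
  cat "$tmp" > "$manifest"
  rm -f "$tmp"
}
```

For example, `set_image_tag trtllm/agg/baseten/deploy.yaml <your-release-tag>` would rewrite both the Frontend and the TrtllmWorker image lines in one pass.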

--- a/recipes/kimi-k2.5/trtllm/agg/deploy.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/deploy.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-apiVersion: v1
-kind: ConfigMap
-metadata:
-  name: llm-config
-data:
-  config.yaml: |
-    max_batch_size: 128
-    max_num_tokens: 8448
-    max_seq_len: 8212
-    tensor_parallel_size: 8
-    moe_expert_parallel_size: 8
-    enable_attention_dp: true
-    pipeline_parallel_size: 1
-    print_iter_log: true
-    kv_cache_config:
-      free_gpu_memory_fraction: 0.75
-      dtype: fp8
-    cache_transceiver_config:
-      backend: UCX
-      max_tokens_in_buffer: 8448
-    trust_remote_code: true
---
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: kimi-k25-agg
-spec:
-  backendFramework: trtllm
-  pvcs:
-    - name: model-cache
-      create: false
-  services:
-    Frontend:
-      componentType: frontend
-      extraPodSpec:
-        affinity:
-          podAntiAffinity:
-            requiredDuringSchedulingIgnoredDuringExecution:
-            - labelSelector:
-                matchExpressions:
-                - key: nvidia.com/dynamo-graph-deployment-name
-                  operator: In
-                  values:
-                  - kimi-k25-agg-frontend
-              topologyKey: kubernetes.io/hostname
-        mainContainer:
-          args:
-          - python3 -m dynamo.frontend --router-mode kv --http-port 8000
-          command:
-          - /bin/sh
-          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
-      replicas: 1
-    TrtllmWorker:
-      componentType: worker
-      envFromSecret: hf-token-secret
-      volumeMounts:
-        - name: model-cache
-          mountPoint: /opt/models
-      sharedMemory:
-        size: 80Gi
-      extraPodSpec:
-        affinity:
-          nodeAffinity:
-            requiredDuringSchedulingIgnoredDuringExecution:
-              nodeSelectorTerms:
-              - matchExpressions:
-                - key: nvidia.com/gpu.present
-                  operator: In
-                  values:
-                  - "true"
-        mainContainer:
-          args:
-          - |
-            python3 -m dynamo.trtllm \
-              --model-path "${MODEL_NAME}" \
-              --served-model-name "${MODEL_NAME}" \
-              --extra-engine-args "${ENGINE_ARGS}" \
-              --tensor-parallel-size 8 \
-              --dyn-reasoning-parser kimi_k25 \
-              --dyn-tool-call-parser kimi_k2
-          command:
-          - /bin/sh
-          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
-          env:
-          - name: TRTLLM_ENABLE_PDL
-            value: "1"
-          - name: MODEL_NAME
-            value: baseten-admin/Kimi-2.5-text-nvfp4-v3
-          - name: ENGINE_ARGS
-            value: /opt/dynamo/configs/config.yaml
-          - name: HF_HOME
-            value: /opt/models
-          volumeMounts:
-          - mountPath: /opt/dynamo/configs
-            name: llm-config
-            readOnly: true
-          workingDir: /workspace/examples/backends/trtllm
-        volumes:
-        - configMap:
-            name: llm-config
-          name: llm-config
-      replicas: 1
-      resources:
-        limits:
-          gpu: "8"
-        requests:
-          gpu: "8"
\ No newline at end of file
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/README.md
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/README.md
-# Kimi-K2.5 Aggregated Deployment with KVBM on Kubernetes
+# Kimi-K2.5 nvidia/Kimi-K2.5-NVFP4 — Aggregated Deployments on Kubernetes
+> **Note:** The `nvidia/Kimi-K2.5-NVFP4` model requires a patched TensorRT-LLM container image because
+> upstream TRT-LLM support for Kimi K2.5 has not yet been released. You must build the patched image before
+> deploying either configuration below. See [`patch/`](patch/) for the script and instructions.
+> **Text only:** The patch registers `KimiK25ForConditionalGeneration` by loading the DeepSeek-V3
+> text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
+> processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
+This directory contains two aggregated deployment configurations for the `nvidia/Kimi-K2.5-NVFP4` model:
+| Deployment | Manifest | Description |
+|-----------|----------|-------------|
+| **Standard Aggregated** | [`deploy.yaml`](deploy.yaml) | Basic aggregated serving with KV-aware routing |
+| **Aggregated + KVBM** | [`deploy-kvbm.yaml`](deploy-kvbm.yaml) | Aggregated serving with CPU-offloaded KV cache (KV Block Manager) |
 ## Prerequisites
 - A Kubernetes cluster with the [Dynamo Operator](https://docs.nvidia.com/dynamo/) installed
-- 8× GPU nodes (e.g. H100/H200)
+- 8x B200 GPUs
 - A `hf-token-secret` Secret containing your Hugging Face token
-- A pre-existing `model-cache` PVC
+- A pre-existing `model-cache` PVC with the downloaded model
-- Replace the placeholder image tag `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag` in `deploy-kvbm.yaml` with your actual image
+- A **patched container image** -- the deploy manifests ship with a placeholder `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`. You must build a patched image and update the `image:` fields before deploying. See [patch instructions](patch/) for details.
-## Deploy
+---
+## Standard Aggregated Deployment
+Uses [`deploy.yaml`](deploy.yaml). This is the simpler configuration -- aggregated serving with KV-aware routing, no CPU-offloaded KV cache.
 ```bash
-kubectl apply -f deploy-kvbm.yaml
+# Update the image in deploy.yaml to your patched image, then:
+kubectl apply -f deploy.yaml -n ${NAMESPACE}
+```
+This creates:
+- A **ConfigMap** (`llm-config`) with TRT-LLM engine parameters (TP=8, EP=8, FP8 KV-cache).
+- A **DynamoGraphDeployment** (`kimi-k25-agg`) with a Frontend (KV-router mode) and a TrtllmWorker serving `nvidia/Kimi-K2.5-NVFP4`.
+---
+## Aggregated Deployment with KVBM
+Uses [`deploy-kvbm.yaml`](deploy-kvbm.yaml). This configuration adds CPU-offloaded KV cache via the KV Block Manager (KVBM), which allows larger effective context by spilling KV cache to host memory.
+```bash
+# Update the image in deploy-kvbm.yaml to your patched image, then:
+kubectl apply -f deploy-kvbm.yaml -n ${NAMESPACE}
 ```
 This creates:
 - A **ConfigMap** (`llm-config-kimi-agg-kvbm`) with TRT-LLM engine parameters (TP=8, EP=8, FP8 KV-cache, KVBM connector).
 - A **DynamoGraphDeployment** (`kimi-k25-agg-kvbm`) with a Frontend (KV-router mode) and a TrtllmWorker serving `nvidia/Kimi-K2.5-NVFP4`.
+### KVBM Configuration
 Key environment variables on the worker:
 | Variable | Default | Description |
@@ -26,7 +63,7 @@ Key environment variables on the worker:
 | `DYN_KVBM_METRICS` | `true` | Enable Prometheus metrics endpoint |
 | `DYN_KVBM_METRICS_PORT` | `6880` | Port for the metrics endpoint |
-## Enable Prometheus Metrics Scraping
+### Enable Prometheus Metrics Scraping
 If you have the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) installed, apply the PodMonitor:

--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy-kvbm.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy-kvbm.yaml
@@ -55,7 +55,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
      replicas: 1
    TrtllmWorker:
      componentType: worker
@@ -95,7 +95,10 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+          # REQUIRED: replace with your patched image tag (run patch/patch-container.sh first).
+          # Upstream TRT-LLM does not support KimiK25ForConditionalGeneration without the patch.
+          # Example: ./patch/patch-container.sh <your-image> -> produces <your-image>-patched
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
          env:
          - name: TRTLLM_ENABLE_PDL
            value: "1"

--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy.yaml
@@ -51,7 +51,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
      replicas: 1
    TrtllmWorker:
      componentType: worker
@@ -84,7 +84,10 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+          # REQUIRED: replace with your patched image tag (run patch/patch-container.sh first).
+          # Upstream TRT-LLM does not support KimiK25ForConditionalGeneration without the patch.
+          # Example: ./patch/patch-container.sh <your-image> -> produces <your-image>-patched
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
          env:
          - name: TRTLLM_ENABLE_PDL
            value: "1"
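
Since the manifests now ship with placeholder tags (`my-tag` for baseten, `my-tag-patched` for nvidia), applying one unmodified would reference an image that does not exist. A small pre-flight guard along the following lines could catch that. The function `check_no_placeholder` is a hypothetical helper, not something this change adds.

```bash
# check_no_placeholder is a hypothetical helper, not part of the recipe.
# Returns 0 when the manifest is free of placeholder tags, 1 when a
# placeholder is found, and 2 when the file cannot be read.
check_no_placeholder() {
  [ -r "$1" ] || { echo "error: cannot read $1" >&2; return 2; }
  # grep -n prints offending lines with line numbers and exits 0 on a match.
  # The pattern covers both "my-tag" and "my-tag-patched".
  if grep -n 'tensorrtllm-runtime:my-tag' "$1" >&2; then
    echo "error: $1 still references a placeholder image tag" >&2
    return 1
  fi
  return 0
}
```

It could be chained in front of the deploy step, for instance `check_no_placeholder deploy.yaml && kubectl apply -f deploy.yaml -n ${NAMESPACE}`.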

--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/README.md
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/patch/README.md
@@ -16,7 +16,7 @@ For example:
 ```bash
 ./patch-container.sh nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
-# produces image:    nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0-patched
+# produces image:    nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
 ```
 If `KimiK25ForConditionalGeneration` is already registered, the patch is skipped. The script is idempotent -- re-running it on an already-patched image is a no-op.
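
The naming convention shown above (the script appends `-patched` to the tag of the image it is given) can be captured in a one-line helper, so the same value is reused when editing the `image:` fields. This is a sketch: `patched_image_name` is a hypothetical function, and it only mirrors the documented naming, not the patch script's internals.

```bash
# patched_image_name is a hypothetical helper that mirrors the documented
# "<image>-patched" naming convention; it does not build or inspect images.
patched_image_name() {
  printf '%s-patched\n' "$1"
}

BASE_IMAGE="nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag"
PATCHED_IMAGE="$(patched_image_name "$BASE_IMAGE")"
echo "$PATCHED_IMAGE"
# prints nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
```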