| **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | Functional | Text only | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image, not yet performance-optimized |
| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml), and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/) for `deploy.yaml` and `deploy-kvbm.yaml`, while `deploy-specdec.yaml` works with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional |
| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml), and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | All configs are compatible with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional |
All configurations use TP8, EP8, aggregated mode with KV-aware routing.
**Status:** Functional | **Modality:** Text only upstream support
> **Experimental for standard and KVBM deployments**: Upstream TensorRT-LLM does not yet include native support for Kimi K2.5. This recipe works around that limitation by directly patching the container image with an append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path. See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch) for the patch script and full instructions.
> **Functional**: [Speculative Decoding recipe](trtllm/agg/nvidia/deploy-specdec.yaml) doesn't need the patch and is optimized for performance.
> **Text only:** Current upstream TensorRT-LLM supports Kimi-K2.5 models by loading the DeepSeek-V3
> text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
> processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`. The standard and KVBM deployments still require the Kimi-patched TRT-LLM image, while the speculative decoding deployment in `deploy-specdec.yaml` works with a current top-of-tree Dynamo TRT-LLM image.
The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`, as well as a deployment, `deploy-specdec.yaml`, that uses speculative decoding.
### Quick Start
The nvidia deploy manifests use two image flows:
The nvidia deploy manifests use the placeholder top-of-tree image: `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`
- `deploy.yaml` and `deploy-kvbm.yaml` use the placeholder patched image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`
- `deploy-specdec.yaml` uses the placeholder top-of-tree image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`
Before deploying, update the `image:` fields in the manifest you plan to use.
# Update the image in the deploy manifest to use the container tag (or the patched tag)
# Deploy
...
@@ -252,4 +239,3 @@ If `tool_calls` is missing with raw `<|tool_calls_section_begin|>` tokens in `co
## Notes
- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
- The two basic recipes in the nvidia variant require a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream in TensorRT-LLM
# Kimi-K2.5 nvidia/Kimi-K2.5-NVFP4 — Aggregated Deployments on Kubernetes
> Upstream TensorRT-LLM does not yet include native support for Kimi K2.5. This recipe works around that limitation by directly patching the container image with an append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path. See [`patch/`](patch/) for the patch script and full instructions.
> **Note**: The two standard deployments (`deploy.yaml` and `deploy-kvbm.yaml`) for the `nvidia/Kimi-K2.5-NVFP4` model require a patched TensorRT-LLM container image because upstream TRT-LLM support for Kimi K2.5 has not yet been released. You must build the patched image before deploying either of these configurations. See [`patch/`](patch/) for the script and instructions. **The `deploy-specdec.yaml` speculative decoding recipe does not need the patched image.**
> **Text only:** Current upstream TensorRT-LLM supports Kimi-K2.5 models by loading the DeepSeek-V3
> text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
> processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
...
@@ -22,7 +18,6 @@ This directory contains three aggregated deployment configurations for the `nvid
- 1x8 B200 GPUs or 8x4 GB200 GPUs
- A `hf-token-secret` Secret containing your Hugging Face token
- A pre-existing `model-cache` PVC
- `deploy.yaml` and `deploy-kvbm.yaml` require a patched image tag such as `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`. You must build a patched image and update the `image:` fields before deploying. See [patch instructions](patch/) for details.
- `deploy-specdec.yaml` uses `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag` and works with a current top-of-tree Dynamo TRT-LLM image
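
The image placeholders mentioned above can be replaced by hand. As a convenience, the following is a minimal sketch of a shell helper that rewrites every `image:` field in a manifest. The function name `set_manifest_image` is hypothetical and not part of this repository. The sketch assumes that every container in the manifest should use the same image and that each `image:` key sits on its own line.

```bash
# Hypothetical helper (not part of this repo): point every `image:` field
# in a manifest at a single image.
# Assumption: all containers in the manifest should use the same image, and
# each `image:` key is on its own line (not written as `- image: ...`).
set_manifest_image() {
  manifest="$1"
  new_image="$2"
  # -i.bak edits the file in place and keeps a backup next to it.
  sed -E -i.bak "s|^([[:space:]]*image:[[:space:]]*).*|\1${new_image}|" "${manifest}"
}

# Example usage (registry path and tag are placeholders):
# set_manifest_image deploy-specdec.yaml my-registry.example.com/tensorrtllm-runtime:my-tag
# grep -n 'image:' deploy-specdec.yaml
```

Review the result with `grep` or `git diff` before applying the manifest, since a blanket rewrite is wrong if different components are meant to run different images.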
---
...
@@ -32,7 +27,6 @@ This directory contains three aggregated deployment configurations for the `nvid
Uses [`deploy.yaml`](deploy.yaml). This is the simpler configuration -- aggregated serving with KV-aware routing, no CPU-offloaded KV cache.
```bash
# Update the image in deploy.yaml to your patched image, then:
kubectl apply -f deploy.yaml -n ${NAMESPACE}
```
...
@@ -47,7 +41,6 @@ This creates:
Uses [`deploy-kvbm.yaml`](deploy-kvbm.yaml). This configuration adds CPU-offloaded KV cache via the KV Block Manager (KVBM), which allows larger effective context by spilling KV cache to host memory.
```bash
# Update the image in deploy-kvbm.yaml to your patched image, then:
kubectl apply -f deploy-kvbm.yaml -n ${NAMESPACE}
```
...
@@ -83,7 +76,7 @@ This scrapes `/metrics` on port `6880` (named `kvbm`) every 5 seconds from worke
## Aggregated Deployment with EAGLE Speculative Decoding and KV-aware routing
Uses [`deploy-specdec.yaml`](deploy-specdec.yaml). This performant configuration runs KV-aware aggregated serving with EAGLE speculative decoding on GB200 and does not require the patched image used by the standard and KVBM manifests.
Uses [`deploy-specdec.yaml`](deploy-specdec.yaml). This performant configuration runs KV-aware aggregated serving with EAGLE speculative decoding on GB200.
Kimi K2.5 support has not yet been released in TensorRT-LLM ([tracking PR](https://github.com/NVIDIA/TensorRT-LLM/pull/11816)).
This directory contains a unified diff that registers `KimiK25ForConditionalGeneration` on top of the existing DeepSeek-V3 model code, letting you run Kimi K2.5 on TensorRT-LLM today.