These recipes are under active development and may require additional setup steps (e.g., container patching). They are functional but not yet fully validated for production use.
| Variant | Model | Status | Modality | Deployment | Notes |
|---------|-------|--------|----------|------------|-------|
| **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | Functional | Text only | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image, not yet performance-optimized |
| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/). Vision input is not yet functional: the patch loads the text backbone only. |
All configurations use TP8, EP8, aggregated mode with KV-aware routing.
The **nvidia** variant also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`.
## Prerequisites
1. **Dynamo Platform installed**: see the [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GPU cluster** with B200 GPUs (8x per worker)
3. **HuggingFace token** with access to the model
## Hardware Requirements
| Configuration | GPUs |
|--------------|------|
| Aggregated | 8x B200 |
---
## baseten-admin/Kimi-2.5-text-nvfp4-v3
**Status:** Functional (not yet performance-optimized) | **Modality:** Text only
The baseten variant uses a text-only backend built on the underlying DeepSeek-V3 architecture, which means it works out of the box with the stock TensorRT-LLM container image -- no patching or custom builds required. This recipe is functional for text-based inference with reasoning and tool calling, but has not yet been performance-tuned or benchmarked.
### Quick Start
The baseten deploy manifest ships with a placeholder image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`.
Update the `image:` fields in [`trtllm/agg/baseten/deploy.yaml`](trtllm/agg/baseten/deploy.yaml) to your actual Dynamo release tag before deploying.
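One way to make that update is a `sed` substitution over the manifest. The sketch below is illustrative only: the `set_image` helper is hypothetical, and it assumes the placeholder string appears verbatim at the end of each `image:` line.

```bash
# Placeholder image as quoted in this README.
PLACEHOLDER_IMAGE="nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag"

# Hypothetical helper: replace the placeholder image in a manifest.
# Usage: set_image <manifest> <new-image>
set_image() {
  local manifest="$1" new_image="$2"
  # '|' is the sed delimiter because image references contain '/'.
  # The trailing '$' anchor keeps longer tags (e.g. my-tag-patched) untouched.
  # A .bak copy of the original manifest is left next to it.
  # Assumes new_image contains no '|' or '&' characters.
  sed -i.bak "s|image: ${PLACEHOLDER_IMAGE}\$|image: ${new_image}|" "${manifest}"
}
```

For example, `set_image trtllm/agg/baseten/deploy.yaml <your-image>` rewrites the manifest in place; review the result with `diff` against the `.bak` copy before applying it.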
```bash
# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}
# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}
# Download model (update storageClassName in model-cache/model-cache.yaml first!)
```
---
## nvidia/Kimi-K2.5-NVFP4
**Status:** Experimental | **Modality:** Text only
> **Experimental:** Upstream TensorRT-LLM does not yet include native support for Kimi K2.5.
> This recipe works around that limitation by directly patching the container image with an
> append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path.
> See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for the patch script and full instructions.
>
> **Text only:** The patch loads the DeepSeek-V3 text backbone from the Kimi K2.5 config
> (`text_config`). The vision encoder is not loaded, so image inputs are not processed.
> Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`.
### Quick Start
The nvidia deploy manifests (`deploy.yaml`, `deploy-kvbm.yaml`) ship with a placeholder image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`.
Before deploying, you must:
1. Run the [patch script](trtllm/agg/nvidia/patch/) to build a patched image (appends `-patched` to the tag).
2. Update the `image:` fields in the deploy YAML to reference the patched image.
See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for full details on what the patch does.
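The image reference needed in step 2 follows from the naming rule in step 1 (`-patched` appended to the tag). A minimal sketch of that derivation, assuming the rule is a plain suffix; the `patched_image` helper is hypothetical and is not part of the patch script.

```bash
# Hypothetical helper: derive the patched image reference from a base image.
# Usage: patched_image <image:tag>
patched_image() {
  local image="$1"
  case "${image}" in
    *-patched) printf '%s\n' "${image}" ;;          # already carries the suffix
    *)         printf '%s-patched\n' "${image}" ;;  # append the suffix to the tag
  esac
}
```

Calling it on an already-suffixed reference returns the reference unchanged, so it is safe to use in scripts that may run more than once.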
- A Kubernetes cluster with the [Dynamo Operator](https://docs.nvidia.com/dynamo/) installed
- 8x B200 GPUs
- A `hf-token-secret` Secret containing your Hugging Face token
- A pre-existing `model-cache` PVC with the downloaded model
- A **patched container image** -- the deploy manifests ship with a placeholder `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`. You must build a patched image and update the `image:` fields before deploying. See [patch instructions](patch/) for details.
---
## Standard Aggregated Deployment
Uses [`deploy.yaml`](deploy.yaml). This is the simpler configuration -- aggregated serving with KV-aware routing, no CPU-offloaded KV cache.
```bash
# Update the image in deploy.yaml to your patched image, then:
kubectl apply -f deploy.yaml -n ${NAMESPACE}
```
This creates:
- A **ConfigMap** (`llm-config`) with TRT-LLM engine parameters (TP=8, EP=8, FP8 KV-cache).
- A **DynamoGraphDeployment** (`kimi-k25-agg`) with a Frontend (KV-router mode) and a TrtllmWorker serving `nvidia/Kimi-K2.5-NVFP4`.
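To confirm that both resources exist after applying the manifest, something like the following can be used. This is a sketch under assumptions: `check_deploy` and the `KUBECTL` override are hypothetical conveniences, and it assumes the installed operator exposes the custom resource under the name `dynamographdeployment`.

```bash
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands
# without contacting a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: check that the ConfigMap and the graph deployment exist.
# Usage: check_deploy <namespace> <configmap-name> <deployment-name>
check_deploy() {
  local namespace="$1" configmap="$2" dgd="$3"
  # The second lookup runs only if the first one succeeds.
  ${KUBECTL} get configmap "${configmap}" -n "${namespace}" &&
    ${KUBECTL} get dynamographdeployment "${dgd}" -n "${namespace}"
}
```

For this deployment the call would be `check_deploy "${NAMESPACE}" llm-config kimi-k25-agg`.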
---
## Aggregated Deployment with KVBM
Uses [`deploy-kvbm.yaml`](deploy-kvbm.yaml). This configuration adds CPU-offloaded KV cache via the KV Block Manager (KVBM), which allows larger effective context by spilling KV cache to host memory.
```bash
# Update the image in deploy-kvbm.yaml to your patched image, then:
kubectl apply -f deploy-kvbm.yaml -n ${NAMESPACE}
```
This creates:
- A **ConfigMap** (`llm-config-kimi-agg-kvbm`) with TRT-LLM engine parameters (TP=8, EP=8, FP8 KV-cache, KVBM connector).
- A **DynamoGraphDeployment** (`kimi-k25-agg-kvbm`) with a Frontend (KV-router mode) and a TrtllmWorker serving `nvidia/Kimi-K2.5-NVFP4`.
### KVBM Configuration
Key environment variables on the worker:
| Variable | Default | Description |
|----------|---------|-------------|
...
If `KimiK25ForConditionalGeneration` is already registered, the patch is skipped. The script is idempotent -- re-running it on an already-patched image is a no-op.
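The skip-if-registered behaviour described above is a common guard for append-only patches. The sketch below illustrates the pattern only; it is not the actual patch script, and `apply_patch_once` and its file arguments are hypothetical.

```bash
# Hypothetical helper: append a snippet to a file unless a marker is already present.
# Usage: apply_patch_once <target-file> <snippet-file> <marker-string>
apply_patch_once() {
  local target="$1" snippet="$2" marker="$3"
  # -F treats the marker as a fixed string, -q suppresses output.
  if grep -qF -- "${marker}" "${target}"; then
    echo "already patched, skipping"
    return 0
  fi
  # Append only; existing content in the target is never modified.
  cat "${snippet}" >> "${target}"
  echo "patch applied"
}
```

Because the guard checks for the marker before appending, a second run leaves the target file byte-for-byte unchanged, which is the idempotence property stated above.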