| **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | Functional | Text only | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image, not yet performance-optimized |
| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml), and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | All configs are compatible with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional |
| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml) and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | All configs are compatible with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional |
All configurations use TP8, EP8, aggregated mode with KV-aware routing.
> text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
> processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`, as well as a deployment`deploy-specdec.yaml` that uses speculative decoding.
The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also ships`deploy-specdec.yaml` that uses speculative decoding.
Uses [`deploy-kvbm.yaml`](deploy-kvbm.yaml). This configuration adds CPU-offloaded KV cache via the KV Block Manager (KVBM), which allows larger effective context by spilling KV cache to host memory.
```bash
kubectl apply -f deploy-kvbm.yaml -n${NAMESPACE}
```
This creates:
- A **ConfigMap** (`llm-config-kimi-agg-kvbm`) with TRT-LLM engine parameters (TP=8, EP=8, FP8 KV-cache, KVBM connector).
- A **DynamoGraphDeployment** (`kimi-k25-agg-kvbm`) with a Frontend (KV-router mode) and a TrtllmWorker serving `nvidia/Kimi-K2.5-NVFP4`.
### KVBM Configuration
Key environment variables on the worker:
| Variable | Default | Description |
|---|---|---|
| `DYN_KVBM_CPU_CACHE_GB` | `10` | CPU cache size in GB for KVBM |
This scrapes `/metrics` on port `6880` (named `kvbm`) every 5 seconds from worker pods labeled with:
-`nvidia.com/dynamo-component-type: worker`
-`nvidia.com/metrics-enabled: "true"`
> **Note:** If your Prometheus Operator watches a namespace other than `monitoring` for PodMonitors, change `metadata.namespace` in `podmonitor-kvbm.yaml` accordingly.
---
## Aggregated Deployment with EAGLE Speculative Decoding and KV-aware routing
Uses [`deploy-specdec.yaml`](deploy-specdec.yaml). This performant configuration runs KV-aware aggregated serving with EAGLE speculative decoding on GB200.
| Variable | Default in this recipe | Description |
|---|---|---|
| `DYN_KVBM_CPU_CACHE_GB` | `100` | CPU memory reserved for offloaded KV blocks. Raise for longer contexts or higher reuse; if you change this, also bump `resources.requests.memory` / `limits.memory` on the worker by roughly the same delta. |
### (Optional) Prometheus metrics
Metrics are **off** by default. To expose them, add the following to the
worker's `env` and `mainContainer.ports` in `deploy.yaml`:
```yaml
env:
-name:DYN_KVBM_METRICS
value:"true"
-name:DYN_KVBM_METRICS_PORT
value:"6880"
ports:
-name:kvbm
containerPort:6880
```
Once enabled, scrape `:6880/metrics` for counters like