Deployment recipe for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware routing.
Deployment recipe for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware routing.
> **Note:** Support for the official **`nvidia/Kimi-K2.5-NVFP4`** checkpoint is in progress and will be added soon. The current recipe uses **`baseten-admin/Kimi-2.5-text-nvfp4-v3`**, a text-only variant where users can experience Kimi-K2.5 and its tool calling and reasoning capabilities.
## Available Configurations
## Available Configurations
| Configuration | GPUs | Mode | Description |
There are two model weight variants, each with its own model download and deploy manifests:
Kimi K2.5 support has not yet been released in TensorRT-LLM ([tracking branch](https://github.com/NVIDIA/TensorRT-LLM/compare/main...feat/k25-demo)).
This directory contains an append-only patch that registers `KimiK25ForConditionalGeneration` on top of the existing DeepSeek-V3 model code, letting you run Kimi K2.5 on TensorRT-LLM today.
If `KimiK25ForConditionalGeneration` is already registered, the patch is skipped. The script is idempotent -- re-running it on an already-patched image is a no-op.
## Files
| File | Description |
|------|-------------|
| `patch-container.sh` | Builds a patched docker image from a base Dynamo image |
| `kimi.patch` | Appended to `modeling_deepseekv3.py` inside the container -- adds a thin `DeepseekV3ForCausalLM` subclass that extracts the Kimi text backbone config and remaps weight prefixes |