feat: add kimi k2.5 model recipe with baseten's model (#6602)

Signed-off-by: Biswa Panda <biswa.panda@gmail.com>

Commit 62ec9f5b, authored Mar 03, 2026 by Biswa Panda; committed by GitHub Mar 03, 2026
5 changed files
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -36,6 +36,7 @@ These recipes demonstrate aggregated or disaggregated serving:
 | **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅*1 | ❌ | TP=16 per worker, multi-node | ❌ |
 | **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |
 | **[DeepSeek-R1](deepseek-r1/vllm/disagg/)** | vLLM | Disagg DEP16 | 32x H200 | ✅ | ❌ | Multi-node, data-expert parallel | ❌ |
+| **[Kimi-K2.5](kimi-k2.5/trtllm/agg/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | ❌ | MoE model, TP8×EP8, reasoning + tool calling | ❌ |

 *1: Please use `deepseek-r1/model-cache/model-download-sglang.yaml` to download the model into the PVC.


--- a/recipes/kimi-k2.5/README.md
+++ b/recipes/kimi-k2.5/README.md
+# Kimi-K2.5 Recipes
+
+Deployment recipe for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware routing.
+
+> **Note:** Support for the official **`nvidia/Kimi-K2.5-NVFP4`** checkpoint is in progress and will be added soon. Until then, this recipe uses **`baseten-admin/Kimi-2.5-text-nvfp4-v3`**, a text-only variant that still lets you try Kimi-K2.5's tool calling and reasoning capabilities.
+
+## Available Configurations
+
+| Configuration | GPUs | Mode | Description |
+|--------------|------|------|-------------|
+| [**trtllm/agg**](trtllm/agg/) | 8x B200 | Aggregated | TP8, EP8, KV-aware routing |
+
+## Prerequisites
+
+1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
+2. **GPU cluster** with B200 GPUs (8x per worker)
+3. **HuggingFace token** with access to the model
+
+## Quick Start
+
+```bash
+# Run these commands from recipes/kimi-k2.5/ (the paths below are relative to it)
+
+# Set namespace
+export NAMESPACE=dynamo-demo
+kubectl create namespace ${NAMESPACE}
+
+# Create HuggingFace token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token-here" \
+  -n ${NAMESPACE}
+
+# Download model (update storageClassName in model-cache/model-cache.yaml first!)
+kubectl apply -f model-cache/ -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
+
+# Deploy
+kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
+```
+
+## Test the Deployment
+
+```bash
+# Port-forward the frontend (this command blocks; send requests from a second terminal)
+kubectl port-forward svc/kimi-k25-agg-frontend 8000:8000 -n ${NAMESPACE}
+
+# Send a test request
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 100
+  }'
+```
+
+## Model Details
+
+- **Model**: `baseten-admin/Kimi-2.5-text-nvfp4-v3` (NVFP4-quantized, text-only)
+- **Architecture**: MoE (Mixture-of-Experts), based on DeepSeek-V3 architecture
+- **Backend**: TensorRT-LLM (PyTorch backend)
+- **Parallelism**: TP8, EP8 (Expert Parallel)
+- **Features**: Reasoning (chain-of-thought), tool calling (function calling)
+
+## Hardware Requirements
+
+| Configuration | GPUs |
+|--------------|------|
+| Aggregated | 8x B200 |
+
+## Verifying Reasoning
+
+The deployment uses `--dyn-reasoning-parser kimi_k25` to extract the model's chain-of-thought into a separate `reasoning_content` field. Verify that reasoning is properly separated from the final answer:
+
+```bash
+curl -s http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
+    "messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
+    "max_tokens": 200
+  }' | python3 -m json.tool
+```
+
+**Expected behavior:**
+
+- `message.reasoning_content` contains the model's thinking process
+- `message.content` contains only the final answer (e.g., `"4"`)
+- No raw `</think>` tags appear in either field
+
+**Example response:**
+
+```json
+{
+  "choices": [{
+    "message": {
+      "content": "4",
+      "role": "assistant",
+      "reasoning_content": "The user is asking a simple math question: \"What is 2+2?\" and wants a brief answer.\n\n2+2 equals 4.\n\nI should answer briefly as requested."
+    },
+    "finish_reason": "stop"
+  }]
+}
+```
+
+If `reasoning_content` is `null` and raw `</think>` tags appear in `content`, the reasoning parser is not configured. Ensure the worker is started with `--dyn-reasoning-parser kimi_k25`.
+
+## Verifying Tool Calling
+
+The deployment uses `--dyn-tool-call-parser kimi_k2` to extract function calls into OpenAI-compatible structured `tool_calls`. Send a request with tool definitions:
+
+```bash
+curl -s http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
+    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
+    "tools": [{
+      "type": "function",
+      "function": {
+        "name": "get_weather",
+        "description": "Get the current weather for a location",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "location": {"type": "string", "description": "City name"}
+          },
+          "required": ["location"]
+        }
+      }
+    }],
+    "max_tokens": 300
+  }' | python3 -m json.tool
+```
+
+**Expected behavior:**
+
+- `message.tool_calls` contains a structured array; each entry has an `id` and a `function` object with `name` and `arguments`
+- `message.content` contains only the natural language portion
+- `message.reasoning_content` contains the model's reasoning about which tool to call
+- `finish_reason` is `"tool_calls"`
+- No raw `<|tool_calls_section_begin|>` tokens in `content`
+
+**Example response:**
+
+```json
+{
+  "choices": [{
+    "message": {
+      "content": "I'll check the weather in San Francisco for you.",
+      "tool_calls": [{
+        "id": "functions.get_weather:0",
+        "type": "function",
+        "function": {
+          "name": "get_weather",
+          "arguments": "{\"location\":\"San Francisco\"}"
+        }
+      }],
+      "role": "assistant",
+      "reasoning_content": "The user is asking for the weather in San Francisco. I have a function called get_weather that can retrieve weather information. I need to call this function with \"San Francisco\" as the location parameter."
+    },
+    "finish_reason": "tool_calls"
+  }]
+}
+```
+
+If `tool_calls` is missing and raw `<|tool_calls_section_begin|>` tokens appear in `content`, the tool call parser is not configured. Ensure the worker is started with `--dyn-tool-call-parser kimi_k2`.
+
+## Notes
+
+- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
\ No newline at end of file
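
The "Expected behavior" lists in the README's Verifying Reasoning and Verifying Tool Calling sections can also be checked programmatically instead of by eye. The following is a minimal sketch, not part of this commit or of Dynamo: `check_reasoning` and `check_tool_calls` are hypothetical helper names, and the code assumes the response has the shape shown in the README's example responses.

```python
import json

# Raw markers that should never reach the client when the parsers are configured.
THINK_TAG = "</think>"
TOOL_SECTION_TOKEN = "<|tool_calls_section_begin|>"


def check_reasoning(response: dict) -> list[str]:
    """Return the problems found in a chat completion response (empty if none)."""
    problems = []
    message = response["choices"][0]["message"]
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    if not reasoning:
        problems.append("reasoning_content is missing or empty")
    if THINK_TAG in content or THINK_TAG in reasoning:
        problems.append("raw </think> tag leaked into the output")
    return problems


def check_tool_calls(response: dict) -> list[str]:
    """Return the problems found in a tool-calling response (empty if none)."""
    problems = []
    choice = response["choices"][0]
    message = choice["message"]
    content = message.get("content") or ""
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        problems.append("tool_calls is missing or empty")
    for call in tool_calls:
        function = call.get("function") or {}
        if not call.get("id") or not function.get("name"):
            problems.append("tool call lacks an id or a function name")
        try:
            # In the README example, arguments is a JSON-encoded string.
            json.loads(function.get("arguments") or "")
        except json.JSONDecodeError:
            problems.append("tool call arguments are not valid JSON")
    if choice.get("finish_reason") != "tool_calls":
        problems.append("finish_reason is not 'tool_calls'")
    if TOOL_SECTION_TOKEN in content:
        problems.append("raw tool-call token leaked into content")
    return problems


if __name__ == "__main__":
    # Trimmed copy of the README's example tool-calling response.
    sample = {
        "choices": [{
            "message": {
                "content": "I'll check the weather in San Francisco for you.",
                "tool_calls": [{
                    "id": "functions.get_weather:0",
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        "arguments": "{\"location\":\"San Francisco\"}",
                    },
                }],
                "role": "assistant",
                "reasoning_content": "The user is asking for the weather.",
            },
            "finish_reason": "tool_calls",
        }]
    }
    print(check_reasoning(sample))   # []
    print(check_tool_calls(sample))  # []
```

To use it against a live deployment, save the output of either `curl` command to a file, load it with `json.load`, and pass the resulting dict to the matching function. An empty list means every documented expectation held.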
--- a/recipes/kimi-k2.5/model-cache/model-cache.yaml
+++ b/recipes/kimi-k2.5/model-cache/model-cache.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: model-cache
+spec:
+  accessModes:
+    - ReadWriteMany
+  storageClassName: "your-storage-class-name"
+  resources:
+    requests:
+      storage: 700Gi
\ No newline at end of file
--- a/recipes/kimi-k2.5/model-cache/model-download.yaml
+++ b/recipes/kimi-k2.5/model-cache/model-download.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: model-download
+spec:
+  backoffLimit: 3
+  completions: 1
+  parallelism: 1
+  template:
+    metadata:
+      labels:
+        app: model-download
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: model-download
+          image: python:3.10-slim
+          command: ["sh", "-c"]
+          envFrom:
+            - secretRef:
+                name: hf-token-secret
+          env:
+            - name: MODEL_NAME
+              value: baseten-admin/Kimi-2.5-text-nvfp4-v3  #  text-only variant
+            - name: HF_HOME
+              value: /model-store
+            - name: HF_HUB_ENABLE_HF_TRANSFER
+              value: "1"
+          args:
+            - |
+              set -eux
+              pip install --no-cache-dir huggingface_hub hf_transfer
+              hf download "$MODEL_NAME"
+          volumeMounts:
+            - name: model-cache
+              mountPath: /model-store
+      volumes:
+      - name: model-cache
+        persistentVolumeClaim:
+          claimName: model-cache
\ No newline at end of file
--- a/recipes/kimi-k2.5/trtllm/agg/deploy.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: llm-config
+data:
+  config.yaml: |
+    max_batch_size: 128
+    max_num_tokens: 8448
+    max_seq_len: 8212
+    tensor_parallel_size: 8
+    moe_expert_parallel_size: 8
+    enable_attention_dp: true
+    pipeline_parallel_size: 1
+    print_iter_log: true
+    kv_cache_config:
+      free_gpu_memory_fraction: 0.75
+      dtype: fp8
+    cache_transceiver_config:
+      backend: UCX
+      max_tokens_in_buffer: 8448
+    trust_remote_code: true
+---
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: kimi-k25-agg
+spec:
+  backendFramework: trtllm
+  pvcs:
+    - name: model-cache
+      create: false
+  services:
+    Frontend:
+      componentType: frontend
+      extraPodSpec:
+        affinity:
+          podAntiAffinity:
+            requiredDuringSchedulingIgnoredDuringExecution:
+            - labelSelector:
+                matchExpressions:
+                - key: nvidia.com/dynamo-graph-deployment-name
+                  operator: In
+                  values:
+                  - kimi-k25-agg-frontend
+              topologyKey: kubernetes.io/hostname
+        mainContainer:
+          args:
+          - python3 -m dynamo.frontend --router-mode kv --http-port 8000
+          command:
+          - /bin/sh
+          - -c
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+      replicas: 1
+    TrtllmWorker:
+      componentType: worker
+      envFromSecret: hf-token-secret
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      sharedMemory:
+        size: 80Gi
+      extraPodSpec:
+        affinity:
+          nodeAffinity:
+            requiredDuringSchedulingIgnoredDuringExecution:
+              nodeSelectorTerms:
+              - matchExpressions:
+                - key: nvidia.com/gpu.present
+                  operator: In
+                  values:
+                  - "true"
+        mainContainer:
+          args:
+          - |
+            python3 -m dynamo.trtllm \
+              --model-path "${MODEL_NAME}" \
+              --served-model-name "${MODEL_NAME}" \
+              --extra-engine-args "${ENGINE_ARGS}" \
+              --tensor-parallel-size 8 \
+              --dyn-reasoning-parser kimi_k25 \
+              --dyn-tool-call-parser kimi_k2
+          command:
+          - /bin/sh
+          - -c
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          env:
+          - name: TRTLLM_ENABLE_PDL
+            value: "1"
+          - name: MODEL_NAME
+            value: baseten-admin/Kimi-2.5-text-nvfp4-v3
+          - name: ENGINE_ARGS
+            value: /opt/dynamo/configs/config.yaml
+          - name: HF_HOME
+            value: /opt/models
+          volumeMounts:
+          - mountPath: /opt/dynamo/configs
+            name: llm-config
+            readOnly: true
+          workingDir: /workspace/examples/backends/trtllm
+        volumes:
+        - configMap:
+            name: llm-config
+          name: llm-config
+      replicas: 1
+      resources:
+        limits:
+          gpu: "8"
+        requests:
+          gpu: "8"
\ No newline at end of file