"docs/kubernetes/multinode-deployment.md" did not exist on "9e6972a548c44e78361ca1296d36f862bbe4dbae"
Commit 62ec9f5b authored by Biswa Panda, committed by GitHub

feat: add kimi k2.5 model recipe with baseten's model (#6602)


Signed-off-by: Biswa Panda <biswa.panda@gmail.com>
parent 90d74637
......@@ -36,6 +36,7 @@ These recipes demonstrate aggregated or disaggregated serving:
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅*1 | ❌ | TP=16 per worker, multi-node | ❌ |
| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |
| **[DeepSeek-R1](deepseek-r1/vllm/disagg/)** | vLLM | Disagg DEP16 | 32x H200 | ✅ | ❌ | Multi-node, data-expert parallel | ❌ |
| **[Kimi-K2.5](kimi-k2.5/trtllm/agg/)** | TensorRT-LLM | Aggregated | 8x GPU | ✅ | ❌ | MoE model, TP8×EP8, reasoning + tool calling | ❌ |
*1: Please use `deepseek-r1/model-cache/model-download-sglang.yaml` to download the model into the PVC.
......
# Kimi-K2.5 Recipes
Deployment recipe for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware routing.
> **Note:** Support for the official **`nvidia/Kimi-K2.5-NVFP4`** checkpoint is in progress. The current recipe uses **`baseten-admin/Kimi-2.5-text-nvfp4-v3`**, a text-only variant that still exposes Kimi-K2.5's reasoning and tool-calling capabilities.
## Available Configurations
| Configuration | GPUs | Mode | Description |
|--------------|------|------|-------------|
| [**trtllm/agg**](trtllm/agg/) | 8x GPU | Aggregated | TP8, EP8, KV-aware routing |
## Prerequisites
1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GPU cluster** with B200 GPUs (8x per worker)
3. **HuggingFace token** with access to the model
## Quick Start
```bash
# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}
# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token-here" \
-n ${NAMESPACE}
# Download model (update storageClassName in model-cache/model-cache.yaml first!)
kubectl apply -f model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
# Deploy
kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
```
## Test the Deployment
```bash
# Port-forward the frontend (this command blocks; run it in a separate terminal)
kubectl port-forward svc/kimi-k25-agg-frontend 8000:8000 -n ${NAMESPACE}
# Send a test request
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
```
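The same smoke test can be scripted. The following is a minimal sketch using only the Python standard library; it assumes the frontend is port-forwarded to `localhost:8000` as above, and the helper names `build_chat_request` and `send_chat_request` are hypothetical, not part of Dynamo.

```python
import json
import urllib.request

# Assumption: the frontend is reachable through the port-forward shown above.
BASE_URL = "http://localhost:8000"
MODEL = "baseten-admin/Kimi-2.5-text-nvfp4-v3"


def build_chat_request(prompt, max_tokens=100, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(prompt, max_tokens=100, timeout=120):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the port-forward running, `send_chat_request("Hello!")["choices"][0]["message"]["content"]` should return the model's reply text.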
## Model Details
- **Model**: `baseten-admin/Kimi-2.5-text-nvfp4-v3` (NVFP4 quantized, text-only)
- **Architecture**: MoE (Mixture-of-Experts), based on DeepSeek-V3 architecture
- **Backend**: TensorRT-LLM (PyTorch backend)
- **Parallelism**: TP8, EP8 (Expert Parallel)
- **Features**: Reasoning (chain-of-thought), tool calling (function calling)
## Hardware Requirements
| Configuration | GPUs |
|--------------|------|
| Aggregated | 8x B200 |
## Verifying Reasoning
The deployment uses `--dyn-reasoning-parser kimi_k25` to extract the model's chain-of-thought into a separate `reasoning_content` field. Verify that reasoning is properly separated from the final answer:
```bash
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
"messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
"max_tokens": 200
}' | python3 -m json.tool
```
**Expected behavior:**
- `message.reasoning_content` contains the model's thinking process
- `message.content` contains only the final answer (e.g., `"4"`)
- No raw `</think>` tags appear in either field
**Example response:**
```json
{
"choices": [{
"message": {
"content": "4",
"role": "assistant",
"reasoning_content": "The user is asking a simple math question: \"What is 2+2?\" and wants a brief answer.\n\n2+2 equals 4.\n\nI should answer briefly as requested."
},
"finish_reason": "stop"
}]
}
```
If `reasoning_content` is `null` and raw `</think>` tags appear in `content`, the reasoning parser is not active. Ensure the worker is started with `--dyn-reasoning-parser kimi_k25`.
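The expected behavior listed above can also be checked programmatically. This is a small sketch, assuming the response has already been parsed into a Python dict (for example with `json.loads`); `check_reasoning` is a hypothetical helper name, not a Dynamo API.

```python
def check_reasoning(response):
    """Return a list of problems found in a chat completion response dict.

    An empty list means reasoning was separated from the final answer.
    """
    problems = []
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    content = message.get("content") or ""

    # The parser should populate reasoning_content for this model.
    if not reasoning:
        problems.append("reasoning_content is missing or empty")

    # Raw think tags indicate the parser did not strip them.
    for field, text in (("content", content), ("reasoning_content", reasoning)):
        if "</think>" in text:
            problems.append(f"raw </think> tag found in {field}")

    return problems
```

An empty return value corresponds to the expected behavior; any returned strings describe which check failed.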
## Verifying Tool Calling
The deployment uses `--dyn-tool-call-parser kimi_k2` to extract function calls into OpenAI-compatible structured `tool_calls`. Send a request with tool definitions:
```bash
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}],
"max_tokens": 300
}' | python3 -m json.tool
```
**Expected behavior:**
- `message.tool_calls` contains a structured array with `name`, `arguments`, and `id`
- `message.content` contains only the natural language portion
- `message.reasoning_content` contains the model's reasoning about which tool to call
- `finish_reason` is `"tool_calls"`
- No raw `<|tool_calls_section_begin|>` tokens in `content`
**Example response:**
```json
{
"choices": [{
"message": {
"content": "I'll check the weather in San Francisco for you.",
"tool_calls": [{
"id": "functions.get_weather:0",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\":\"San Francisco\"}"
}
}],
"role": "assistant",
"reasoning_content": "The user is asking for the weather in San Francisco. I have a function called get_weather that can retrieve weather information. I need to call this function with \"San Francisco\" as the location parameter."
},
"finish_reason": "tool_calls"
}]
}
```
If `tool_calls` is missing and raw `<|tool_calls_section_begin|>` tokens appear in `content`, the tool call parser is not active. Ensure the worker is started with `--dyn-tool-call-parser kimi_k2`.
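The tool-calling checks above can be automated in the same way. The sketch below assumes the response is already a Python dict in the shape of the example response; `check_tool_call` and its `expected_name` parameter are hypothetical names chosen for illustration.

```python
import json


def check_tool_call(response, expected_name="get_weather"):
    """Return a list of problems found in a tool-calling response dict."""
    problems = []
    choice = response["choices"][0]
    message = choice["message"]
    tool_calls = message.get("tool_calls") or []

    if not tool_calls:
        problems.append("tool_calls is missing or empty")
    if choice.get("finish_reason") != "tool_calls":
        problems.append("finish_reason is not 'tool_calls'")
    # Raw section tokens in content indicate the parser did not run.
    if "<|tool_calls_section_begin|>" in (message.get("content") or ""):
        problems.append("raw tool-call tokens found in content")

    for call in tool_calls:
        function = call.get("function") or {}
        if not call.get("id"):
            problems.append("tool call has no id")
        if function.get("name") != expected_name:
            problems.append(f"unexpected function name: {function.get('name')}")
        # In the OpenAI-compatible format, arguments is a JSON-encoded string.
        try:
            json.loads(function.get("arguments") or "")
        except json.JSONDecodeError:
            problems.append("arguments is not valid JSON")

    return problems
```

As with the reasoning check, an empty list means the response matches the expected behavior.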
## Notes
- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "your-storage-class-name"
  resources:
    requests:
      storage: 700Gi
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  backoffLimit: 3
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: model-download
    spec:
      restartPolicy: Never
      containers:
        - name: model-download
          image: python:3.10-slim
          command: ["sh", "-c"]
          envFrom:
            - secretRef:
                name: hf-token-secret
          env:
            - name: MODEL_NAME
              value: baseten-admin/Kimi-2.5-text-nvfp4-v3  # text-only variant
            - name: HF_HOME
              value: /model-store
            - name: HF_HUB_ENABLE_HF_TRANSFER
              value: "1"
          args:
            - |
              set -eux
              pip install --no-cache-dir huggingface_hub hf_transfer
              hf download $MODEL_NAME
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config
data:
  config.yaml: |
    max_batch_size: 128
    max_num_tokens: 8448
    max_seq_len: 8212
    tensor_parallel_size: 8
    moe_expert_parallel_size: 8
    enable_attention_dp: true
    pipeline_parallel_size: 1
    print_iter_log: true
    kv_cache_config:
      free_gpu_memory_fraction: 0.75
      dtype: fp8
    cache_transceiver_config:
      backend: UCX
      max_tokens_in_buffer: 8448
    trust_remote_code: true
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: kimi-k25-agg
spec:
  backendFramework: trtllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      extraPodSpec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: nvidia.com/dynamo-graph-deployment-name
                      operator: In
                      values:
                        - kimi-k25-agg-frontend
                topologyKey: kubernetes.io/hostname
        mainContainer:
          args:
            - python3 -m dynamo.frontend --router-mode kv --http-port 8000
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
      replicas: 1
    TrtllmWorker:
      componentType: worker
      envFromSecret: hf-token-secret
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 80Gi
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: nvidia.com/gpu.present
                      operator: In
                      values:
                        - "true"
        mainContainer:
          args:
            - |
              python3 -m dynamo.trtllm \
                --model-path "${MODEL_NAME}" \
                --served-model-name "${MODEL_NAME}" \
                --extra-engine-args "${ENGINE_ARGS}" \
                --tensor-parallel-size 8 \
                --dyn-reasoning-parser kimi_k25 \
                --dyn-tool-call-parser kimi_k2
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
          env:
            - name: TRTLLM_ENABLE_PDL
              value: "1"
            - name: MODEL_NAME
              value: baseten-admin/Kimi-2.5-text-nvfp4-v3
            - name: ENGINE_ARGS
              value: /opt/dynamo/configs/config.yaml
            - name: HF_HOME
              value: /opt/models
          volumeMounts:
            - mountPath: /opt/dynamo/configs
              name: llm-config
              readOnly: true
          workingDir: /workspace/examples/backends/trtllm
        volumes:
          - configMap:
              name: llm-config
            name: llm-config
      replicas: 1
      resources:
        limits:
          gpu: "8"
        requests:
          gpu: "8"
\ No newline at end of file