# Kimi-K2.5 Recipes

Deployment recipes for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware routing.

## Available Configurations

There are two model weight variants, each with its own model download and deploy manifests:

| Variant | Model | Status | Modality | Deploy Configs | Notes |
|---------|-------|--------|----------|---------------|-------|
| **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | Functional | Text only | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image, not yet performance-optimized |
| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml), and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | All configs are compatible with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional |

All configurations use TP8, EP8, aggregated mode with KV-aware routing.

## Prerequisites

1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GPU cluster** with B200 GPUs (8 per worker) or GB200 GPUs (4 workers, each spanning 2 nodes with 4 GPUs per node)
3. **HuggingFace token** with access to the model

## Hardware Requirements

| Configuration | GPUs |
|--------------|------|
| Aggregated | 8x B200 |
| Aggregated Speculative Decoding | 32x GB200 (4 workers x 8 GPUs, each worker spanning 2 nodes) |

---

## baseten-admin/Kimi-2.5-text-nvfp4-v3

**Status:** Functional (not yet performance-optimized) | **Modality:** Text only

The baseten variant uses a text-only backend built on the underlying DeepSeek-V3 architecture, so it works out of the box with the stock TensorRT-LLM container image, with no patching or custom builds required. This recipe is functional for text-based inference with reasoning and tool calling, but it has not yet been performance-tuned or benchmarked.

### Quick Start

The baseten deploy manifest ships with a placeholder image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`.
Update the `image:` fields in [`trtllm/agg/baseten/deploy.yaml`](trtllm/agg/baseten/deploy.yaml) to your actual Dynamo release tag before deploying.

```bash
# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}

# Download model (update storageClassName in model-cache/model-cache.yaml first!)
kubectl apply -f model-cache/baseten/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

# Update the image tag in trtllm/agg/baseten/deploy.yaml to your Dynamo release tag

# Deploy
kubectl apply -f trtllm/agg/baseten/deploy.yaml -n ${NAMESPACE}
```
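
The image update step above can be scripted rather than edited by hand. The following is a minimal sketch, assuming the manifest still contains the placeholder string exactly as shipped; the helper name `set_image_tag` and the example tag are hypothetical, and the tag is assumed not to contain `|` or `&`, which are special to `sed`.

```bash
# Replace the placeholder image tag in a deploy manifest with a real release tag.
# Usage: set_image_tag <manifest> <tag>
# NOTE: set_image_tag is a hypothetical helper, not part of the recipe.
set_image_tag() {
  local manifest="$1"
  local tag="$2"
  local placeholder="nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag"
  local image="nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:${tag}"
  local tmp

  # Write to a temporary file first so a failed sed cannot truncate the manifest.
  tmp="$(mktemp)" || return 1
  if sed "s|${placeholder}|${image}|g" "${manifest}" > "${tmp}"; then
    # Copy the contents back in place to keep the manifest's permissions.
    cat "${tmp}" > "${manifest}"
  fi
  rm -f "${tmp}"
}

# Example (hypothetical tag, substitute your own release tag):
#   set_image_tag trtllm/agg/baseten/deploy.yaml 1.2.3
#   grep 'image:' trtllm/agg/baseten/deploy.yaml
```

Running `grep 'image:'` on the manifest afterwards is a quick way to confirm that no `my-tag` placeholder remains before applying it.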

### Test the Deployment

```bash
# Port-forward the frontend
kubectl port-forward svc/kimi-k25-agg-frontend 8000:8000 -n ${NAMESPACE}

# Send a test request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

---

## nvidia/Kimi-K2.5-NVFP4

**Status:** Functional | **Modality:** Text only (per current upstream support)

> **Text only:** Current upstream TensorRT-LLM supports Kimi-K2.5 models by loading the DeepSeek-V3
> text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
> processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.

The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). Two additional manifests are provided: `deploy-kvbm.yaml` enables CPU-offloaded KV cache through KVBM (KV Block Manager), and `deploy-specdec.yaml` uses speculative decoding.

### Quick Start

The nvidia deploy manifests use the placeholder top-of-tree image: `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`

Before deploying, update the `image:` fields in the manifest you plan to use.

```bash
# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}

# Download model (update storageClassName in model-cache/model-cache.yaml first!)
kubectl apply -f model-cache/nvidia/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

# Update the image in the deploy manifest to use the container tag (or the patched tag)

# Deploy
kubectl apply -f trtllm/agg/nvidia/deploy.yaml -n ${NAMESPACE}
```
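
The commands above deploy the base `deploy.yaml`. The KVBM and speculative decoding manifests listed under Available Configurations would be applied the same way, with only the path changing. As a rough sketch, a small helper can map a short name to the manifest path; the function name `nvidia_manifest` and the short names `base`, `kvbm`, and `specdec` are hypothetical conveniences, not part of the recipe.

```bash
# Map a short config name to its manifest path under trtllm/agg/nvidia/.
# NOTE: the short names (base, kvbm, specdec) are made up for this sketch.
nvidia_manifest() {
  case "$1" in
    base)    echo "trtllm/agg/nvidia/deploy.yaml" ;;
    kvbm)    echo "trtllm/agg/nvidia/deploy-kvbm.yaml" ;;
    specdec) echo "trtllm/agg/nvidia/deploy-specdec.yaml" ;;
    *)       echo "unknown config: $1" >&2; return 1 ;;
  esac
}

# Example:
#   kubectl apply -f "$(nvidia_manifest kvbm)" -n "${NAMESPACE}"
```

Remember to update the `image:` fields in whichever manifest is selected, and check the Hardware Requirements table before choosing the speculative decoding manifest.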

### Test the Deployment

```bash
# Port-forward the frontend
kubectl port-forward svc/kimi-k25-agg-frontend 8000:8000 -n ${NAMESPACE}

# Send a test request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Kimi-K2.5-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
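
A large model can take a while to load, so the first test request may fail simply because the worker has not registered yet. The sketch below polls the frontend until the model name appears. It assumes the frontend exposes an OpenAI-style `/v1/models` listing on the port-forwarded address; the helper name `wait_for_model` and its default retry values are hypothetical.

```bash
# Poll the frontend until the given model name is listed, or give up.
# Usage: wait_for_model <model> [attempts] [delay_seconds]
# NOTE: wait_for_model is a hypothetical helper; /v1/models is assumed to
# list the served models in the OpenAI-compatible format.
wait_for_model() {
  local model="$1"
  local attempts="${2:-30}"
  local delay="${3:-10}"
  local i=1

  while [ "${i}" -le "${attempts}" ]; do
    # Look for the quoted model name anywhere in the listing.
    if curl -s http://localhost:8000/v1/models | grep -qF "\"${model}\""; then
      echo "ready: ${model}"
      return 0
    fi
    sleep "${delay}"
    i=$((i + 1))
  done

  echo "timed out waiting for ${model}"
  return 1
}

# Example (run while the port-forward is active):
#   wait_for_model "nvidia/Kimi-K2.5-NVFP4"
```

The same helper works for the baseten variant by passing `baseten-admin/Kimi-2.5-text-nvfp4-v3` as the model name.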

---

## Model Details

- **Architecture**: MoE (Mixture-of-Experts), based on DeepSeek-V3 architecture
- **Backend**: TensorRT-LLM (PyTorch backend)
- **Parallelism**: TP8, EP8 (Expert Parallel)
- **Quantization**: NVFP4

## Verifying Reasoning

The deployment uses `--dyn-reasoning-parser kimi_k25` to extract the model's chain-of-thought into a separate `reasoning_content` field. Verify that reasoning is properly separated from the final answer:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Kimi-K2.5-NVFP4",
    "messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
    "max_tokens": 200
  }' | python3 -m json.tool
```

**Expected behavior:**

- `message.reasoning_content` contains the model's thinking process
- `message.content` contains only the final answer (e.g., `"4"`)
- No raw `</think>` tags appear in either field

**Example response:**

```json
{
  "choices": [{
    "message": {
      "content": "4",
      "role": "assistant",
      "reasoning_content": "The user is asking a simple math question: \"What is 2+2?\" and wants a brief answer.\n\n2+2 equals 4.\n\nI should answer briefly as requested."
    },
    "finish_reason": "stop"
  }]
}
```

If `reasoning_content` is `null` and raw `</think>` tags appear in `content`, the reasoning parser is not configured. Ensure the worker is started with `--dyn-reasoning-parser kimi_k25`.
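
The checks above can be automated against a saved response. This is a minimal sketch using plain `grep` text matching rather than a JSON parser, so it only approximates the checks; the helper name `check_reasoning` and the file name `response.json` are hypothetical.

```bash
# Check a saved chat completion response for reasoning-parser symptoms.
# Usage: check_reasoning <response.json>
# NOTE: check_reasoning is a hypothetical helper that matches on text only.
check_reasoning() {
  local response="$1"

  # A raw closing think tag means the parser did not strip the reasoning.
  if grep -qF '</think>' "${response}"; then
    echo "FAIL: raw </think> tag found"
    return 1
  fi

  # reasoning_content should be present and must not be null.
  if ! grep -qF '"reasoning_content"' "${response}" ||
     grep -Eq '"reasoning_content"[[:space:]]*:[[:space:]]*null' "${response}"; then
    echo "FAIL: reasoning_content is missing or null"
    return 1
  fi

  echo "OK: reasoning is separated from the final answer"
}

# Example: save the curl output from the request above, then check it.
#   curl -s http://localhost:8000/v1/chat/completions ... > response.json
#   check_reasoning response.json
```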

## Verifying Tool Calling

The deployment uses `--dyn-tool-call-parser kimi_k2` to extract function calls into OpenAI-compatible structured `tool_calls`. Send a request with tool definitions:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Kimi-K2.5-NVFP4",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "max_tokens": 300
  }' | python3 -m json.tool
```

**Expected behavior:**

- `message.tool_calls` contains a structured array; each entry has an `id` and a `function` object with `name` and `arguments`
- `message.content` contains only the natural language portion
- `message.reasoning_content` contains the model's reasoning about which tool to call
- `finish_reason` is `"tool_calls"`
- No raw `<|tool_calls_section_begin|>` tokens in `content`

**Example response:**

```json
{
  "choices": [{
    "message": {
      "content": "I'll check the weather in San Francisco for you.",
      "tool_calls": [{
        "id": "functions.get_weather:0",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"San Francisco\"}"
        }
      }],
      "role": "assistant",
      "reasoning_content": "The user is asking for the weather in San Francisco. I have a function called get_weather that can retrieve weather information. I need to call this function with \"San Francisco\" as the location parameter."
    },
    "finish_reason": "tool_calls"
  }]
}
```

If `tool_calls` is missing and raw `<|tool_calls_section_begin|>` tokens appear in `content`, the tool call parser is not configured. Ensure the worker is started with `--dyn-tool-call-parser kimi_k2`.
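
A saved tool-calling response can be checked for the two most telling symptoms: leftover raw tokens and the `finish_reason` value. As with any text-based check, this sketch is approximate and does not validate the full JSON structure; the helper name `check_tool_calls` and the file name `tool_response.json` are hypothetical.

```bash
# Check a saved chat completion response for tool-call-parser symptoms.
# Usage: check_tool_calls <response.json>
# NOTE: check_tool_calls is a hypothetical helper that matches on text only.
check_tool_calls() {
  local response="$1"

  # Raw section tokens mean the tool call parser did not run.
  if grep -qF '<|tool_calls_section_begin|>' "${response}"; then
    echo "FAIL: raw tool call tokens found"
    return 1
  fi

  # A parsed tool call is expected to set finish_reason to "tool_calls".
  if ! grep -Eq '"finish_reason"[[:space:]]*:[[:space:]]*"tool_calls"' "${response}"; then
    echo "FAIL: finish_reason is not \"tool_calls\""
    return 1
  fi

  echo "OK: structured tool_calls returned"
}

# Example: save the curl output from the request above, then check it.
#   curl -s http://localhost:8000/v1/chat/completions ... > tool_response.json
#   check_tool_calls tool_response.json
```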

## Notes

- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying