# Kimi-K2.5 Recipes

> 🚧 **Work-in-Progress — Experimental Recipe**
>
> The TensorRT-LLM Python package used for Dynamo's TRT-LLM integration does not yet include
> native Kimi K2.5 support. This recipe is an **experimental** effort to bring
> Kimi K2.5 to Dynamo ahead of upstream availability. The nvidia variant requires patching the container image on top of a released Dynamo image.

Deployment recipe for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware routing.

## Available Configurations

There are two model weight variants, each with its own model download and deploy manifests:

| Variant | Model | Deploy Configs | Notes |
|---------|-------|---------------|-------|
| **nvidia** 🚧 | `nvidia/Kimi-K2.5-NVFP4` | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/) |
| **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image |

All configurations use TP8, EP8, aggregated mode with KV-aware routing.

The **nvidia** variant also has a KVBM (KV Block Manager) deploy manifest, `deploy-kvbm.yaml`, which enables CPU offloading of the KV cache.

## Prerequisites

1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GPU cluster** with B200 GPUs (8x per worker)
3. **HuggingFace token** with access to the model

## Quick Start (nvidia variant)

```bash
# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}

# Download model (update storageClassName in model-cache/model-cache.yaml first!)
kubectl apply -f model-cache/nvidia/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

# Patch the container image (required for nvidia weights)
cd trtllm/agg/nvidia/patch
./patch-container.sh nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
cd -

# Deploy
kubectl apply -f trtllm/agg/nvidia/deploy.yaml -n ${NAMESPACE}
```

For the baseten weights, apply `model-cache/baseten/` and `trtllm/agg/baseten/deploy.yaml` instead and skip the image patch step; the stock image works.

## Test the Deployment

```bash
# Port-forward the frontend
kubectl port-forward svc/kimi-k25-agg-frontend 8000:8000 -n ${NAMESPACE}

# Send a test request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Kimi-K2.5-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
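
For scripted checks, the same request can be built from Python. The following is a minimal sketch using only the standard library; it assumes the frontend is port-forwarded to `localhost:8000` as above, and the helper names `build_chat_request` and `send_chat_request` are illustrative, not part of Dynamo.

```python
import json
import urllib.request

# Assumption: the frontend is reachable through the port-forward shown above.
BASE_URL = "http://localhost:8000"


def build_chat_request(prompt, model="nvidia/Kimi-K2.5-NVFP4",
                       max_tokens=100, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=300):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the port-forward running, `send_chat_request(build_chat_request("Hello!"))` returns the parsed response as a dictionary. Keeping request construction separate from sending makes the payload easy to inspect without a live cluster.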

## Model Details

- **Model**: `nvidia/Kimi-K2.5-NVFP4` (NVFP4-quantized, text-only)
- **Architecture**: MoE (Mixture-of-Experts), based on DeepSeek-V3 architecture
- **Backend**: TensorRT-LLM (PyTorch backend)
- **Parallelism**: TP8 (tensor parallel), EP8 (expert parallel)


## Hardware Requirements

| Configuration | GPUs |
|--------------|------|
| Aggregated | 8x B200 |

## Verifying Reasoning

The deployment uses `--dyn-reasoning-parser kimi_k25` to extract the model's chain-of-thought into a separate `reasoning_content` field. Verify that reasoning is properly separated from the final answer:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Kimi-K2.5-NVFP4",
    "messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
    "max_tokens": 200
  }' | python3 -m json.tool
```

**Expected behavior:**

- `message.reasoning_content` contains the model's thinking process
- `message.content` contains only the final answer (e.g., `"4"`)
- No raw `</think>` tags appear in either field

**Example response:**

```json
{
  "choices": [{
    "message": {
      "content": "4",
      "role": "assistant",
      "reasoning_content": "The user is asking a simple math question: \"What is 2+2?\" and wants a brief answer.\n\n2+2 equals 4.\n\nI should answer briefly as requested."
    },
    "finish_reason": "stop"
  }]
}
```

If `reasoning_content` is `null` and raw `</think>` tags appear in `content`, the reasoning parser is not configured. Ensure the worker is started with `--dyn-reasoning-parser kimi_k25`.
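
The expected behavior listed above can also be checked programmatically. The sketch below assumes the response has already been parsed into a Python dictionary (for example with `json.loads`); the function name `check_reasoning` is hypothetical and only encodes the checks described in this section.

```python
def check_reasoning(response):
    """Return a list of problems found in a chat completion (empty if none)."""
    problems = []
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content")
    content = message.get("content") or ""

    # The reasoning parser should populate reasoning_content.
    if not reasoning:
        problems.append("reasoning_content is missing or empty")

    # Neither field should still contain a raw closing think tag.
    for field, text in (("content", content),
                        ("reasoning_content", reasoning or "")):
        if "</think>" in text:
            problems.append(f"raw </think> tag found in {field}")
    return problems
```

An empty list means the response matches the expected behavior; any entries point to a missing or misconfigured reasoning parser.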

## Verifying Tool Calling

The deployment uses `--dyn-tool-call-parser kimi_k2` to extract function calls into OpenAI-compatible structured `tool_calls`. Send a request with tool definitions:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Kimi-K2.5-NVFP4",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "max_tokens": 300
  }' | python3 -m json.tool
```

**Expected behavior:**

- `message.tool_calls` contains a structured array with `name`, `arguments`, and `id`
- `message.content` contains only the natural language portion
- `message.reasoning_content` contains the model's reasoning about which tool to call
- `finish_reason` is `"tool_calls"`
- No raw `<|tool_calls_section_begin|>` tokens in `content`

**Example response:**

```json
{
  "choices": [{
    "message": {
      "content": "I'll check the weather in San Francisco for you.",
      "tool_calls": [{
        "id": "functions.get_weather:0",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"San Francisco\"}"
        }
      }],
      "role": "assistant",
      "reasoning_content": "The user is asking for the weather in San Francisco. I have a function called get_weather that can retrieve weather information. I need to call this function with \"San Francisco\" as the location parameter."
    },
    "finish_reason": "tool_calls"
  }]
}
```

If `tool_calls` is missing and raw `<|tool_calls_section_begin|>` tokens appear in `content`, the tool call parser is not configured. Ensure the worker is started with `--dyn-tool-call-parser kimi_k2`.
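
The tool-calling checks above can be automated in the same way. This is a minimal sketch, assuming the response is already parsed into a dictionary; `check_tool_calls` is a hypothetical helper that mirrors the expected-behavior list and additionally confirms that `arguments` is a valid JSON string.

```python
import json

RAW_TOOL_TOKEN = "<|tool_calls_section_begin|>"


def check_tool_calls(response):
    """Return a list of problems found in a tool-calling response (empty if none)."""
    problems = []
    choice = response["choices"][0]
    message = choice["message"]
    tool_calls = message.get("tool_calls") or []

    if not tool_calls:
        problems.append("tool_calls is missing or empty")

    for call in tool_calls:
        function = call.get("function") or {}
        if not call.get("id"):
            problems.append("tool call has no id")
        if not function.get("name"):
            problems.append("tool call has no function name")
        try:
            # arguments is a JSON-encoded string, as in the example response.
            json.loads(function.get("arguments", ""))
        except (TypeError, ValueError):
            problems.append("tool call arguments are not valid JSON")

    if choice.get("finish_reason") != "tool_calls":
        problems.append("finish_reason is not 'tool_calls'")
    if RAW_TOOL_TOKEN in (message.get("content") or ""):
        problems.append("raw tool-call token found in content")
    return problems
```

An empty list means the structured `tool_calls` were extracted as expected; otherwise the entries indicate which check failed.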

## Notes

- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
- The nvidia variant requires a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream