Commit 01baf4a3 authored by Biswa Panda, committed by GitHub

feat: kimi2.5 with nvidia's model weights and trtllm patch (#6842)

parent 3c0a0d75
@@ -2,13 +2,18 @@
 Deployment recipe for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware routing.
-> **Note:** Support for the official **`nvidia/Kimi-K2.5-NVFP4`** checkpoint is in progress and will be added soon. The current recipe uses **`baseten-admin/Kimi-2.5-text-nvfp4-v3`**, a text-only variant where users can experience Kimi-K2.5 and its tool calling and reasoning capabilities.
 ## Available Configurations
-| Configuration | GPUs | Mode | Description |
-|--------------|------|------|-------------|
-| [**trtllm/agg**](trtllm/agg/) | 8x GPU | Aggregated | TP8, EP8, KV-aware routing |
+There are two model weight variants, each with its own model download and deploy manifests:
+| Variant | Model | Deploy Configs | Notes |
+|---------|-------|---------------|-------|
+| **nvidia** 🚧 | `nvidia/Kimi-K2.5-NVFP4` | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/) |
+| **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image |
+All configurations use TP8, EP8, aggregated mode with KV-aware routing.
+The **nvidia** variant also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`.
 ## Prerequisites
@@ -16,7 +21,7 @@ Deployment recipe for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware ro
 2. **GPU cluster** with B200 GPUs (8x per worker)
 3. **HuggingFace token** with access to the model
-## Quick Start
+## Quick Start (nvidia variant)
 ```bash
 # Set namespace
@@ -29,13 +34,20 @@ kubectl create secret generic hf-token-secret \
   -n ${NAMESPACE}
 # Download model (update storageClassName in model-cache/model-cache.yaml first!)
-kubectl apply -f model-cache/ -n ${NAMESPACE}
+kubectl apply -f model-cache/nvidia/ -n ${NAMESPACE}
 kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
+# Patch the container image (required for nvidia weights)
+cd trtllm/agg/nvidia/patch
+./patch-container.sh nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+cd -
 # Deploy
-kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
+kubectl apply -f trtllm/agg/nvidia/deploy.yaml -n ${NAMESPACE}
 ```
+For baseten weights, use `model-cache/baseten/` and `trtllm/agg/baseten/deploy.yaml` instead — no image patch needed.
 ## Test the Deployment
 ```bash
@@ -46,7 +58,7 @@ kubectl port-forward svc/kimi-k25-agg-frontend 8000:8000 -n ${NAMESPACE}
 curl http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
-    "model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
+    "model": "nvidia/Kimi-K2.5-NVFP4",
     "messages": [{"role": "user", "content": "Hello!"}],
     "max_tokens": 100
   }'
...@@ -54,11 +66,11 @@ curl http://localhost:8000/v1/chat/completions \ ...@@ -54,11 +66,11 @@ curl http://localhost:8000/v1/chat/completions \
## Model Details ## Model Details
- **Model**: `baseten-admin/Kimi-2.5-text-nvfp4-v3` (NV FP4 quantized, text-only) - **Model**: `nvidia/Kimi-K2.5-NVFP4` (NV FP4 quantized, text-only)
- **Architecture**: MoE (Mixture-of-Experts), based on DeepSeek-V3 architecture - **Architecture**: MoE (Mixture-of-Experts), based on DeepSeek-V3 architecture
- **Backend**: TensorRT-LLM (PyTorch backend) - **Backend**: TensorRT-LLM (PyTorch backend)
- **Parallelism**: TP8, EP8 (Expert Parallel) - **Parallelism**: TP8, EP8 (Expert Parallel)
- **Features**: Reasoning (chain-of-thought), tool calling (function calling)
## Hardware Requirements ## Hardware Requirements
@@ -74,7 +86,7 @@ The deployment uses `--dyn-reasoning-parser kimi_k25` to extract the model's cha
 curl -s http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
-    "model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
+    "model": "nvidia/Kimi-K2.5-NVFP4",
     "messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
     "max_tokens": 200
   }' | python3 -m json.tool
@@ -111,7 +123,7 @@ The deployment uses `--dyn-tool-call-parser kimi_k2` to extract function calls i
 curl -s http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
-    "model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
+    "model": "nvidia/Kimi-K2.5-NVFP4",
     "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
     "tools": [{
       "type": "function",
@@ -166,4 +178,5 @@ If `tool_calls` is missing with raw `<|tool_calls_section_begin|>` tokens in `co
 ## Notes
-- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
\ No newline at end of file
+- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
+- The nvidia variant requires a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream
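The test requests in this README all post the same JSON shape to the frontend's `/v1/chat/completions` endpoint. As a minimal sketch, the same request can be built from Python with only the standard library. `build_chat_request` is a hypothetical helper name, and the base URL assumes the `kubectl port-forward` to port 8000 shown in the README is active.

```python
import json
import urllib.request


def build_chat_request(model, prompt, max_tokens=100,
                       base_url="http://localhost:8000"):
    """Build (but do not send) a chat completion request.

    The body mirrors the curl example: model, a single user message,
    and max_tokens. base_url is an assumption (port-forwarded frontend).
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request("nvidia/Kimi-K2.5-NVFP4", "Hello!")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it against a running deployment:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```

The `model` value must match the `--served-model-name` of the deployed worker, so pass `baseten-admin/Kimi-2.5-text-nvfp4-v3` when testing the baseten variant.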
@@ -23,7 +23,7 @@ spec:
                 name: hf-token-secret
           env:
             - name: MODEL_NAME
-              value: baseten-admin/Kimi-2.5-text-nvfp4-v3 # text-only variant
+              value: baseten-admin/Kimi-2.5-text-nvfp4-v3
             - name: HF_HOME
               value: /model-store
             - name: HF_HUB_ENABLE_HF_TRANSFER
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  backoffLimit: 3
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: model-download
    spec:
      restartPolicy: Never
      containers:
        - name: model-download
          image: python:3.10-slim
          command: ["sh", "-c"]
          envFrom:
            - secretRef:
                name: hf-token-secret
          env:
            - name: MODEL_NAME
              value: nvidia/Kimi-K2.5-NVFP4
            - name: HF_HOME
              value: /model-store
            - name: HF_HUB_ENABLE_HF_TRANSFER
              value: "1"
          args:
            - |
              set -eux
              pip install --no-cache-dir huggingface_hub hf_transfer
              hf download $MODEL_NAME
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config
data:
  config.yaml: |
    max_batch_size: 128
    max_num_tokens: 8448
    max_seq_len: 8212
    tensor_parallel_size: 8
    moe_expert_parallel_size: 8
    enable_attention_dp: true
    pipeline_parallel_size: 1
    print_iter_log: true
    kv_cache_config:
      free_gpu_memory_fraction: 0.75
      dtype: fp8
    cache_transceiver_config:
      backend: UCX
      max_tokens_in_buffer: 8448
    trust_remote_code: true
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: kimi-k25-agg
spec:
  backendFramework: trtllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      extraPodSpec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: nvidia.com/dynamo-graph-deployment-name
                      operator: In
                      values:
                        - kimi-k25-agg-frontend
                topologyKey: kubernetes.io/hostname
        mainContainer:
          args:
            - python3 -m dynamo.frontend --router-mode kv --http-port 8000
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
      replicas: 1
    TrtllmWorker:
      componentType: worker
      envFromSecret: hf-token-secret
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 80Gi
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: nvidia.com/gpu.present
                      operator: In
                      values:
                        - "true"
        mainContainer:
          args:
            - |
              python3 -m dynamo.trtllm \
                --model-path "${MODEL_NAME}" \
                --served-model-name "${MODEL_NAME}" \
                --extra-engine-args "${ENGINE_ARGS}" \
                --tensor-parallel-size 8 \
                --dyn-reasoning-parser kimi_k25 \
                --dyn-tool-call-parser kimi_k2
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
          env:
            - name: TRTLLM_ENABLE_PDL
              value: "1"
            - name: MODEL_NAME
              value: baseten-admin/Kimi-2.5-text-nvfp4-v3
            - name: ENGINE_ARGS
              value: /opt/dynamo/configs/config.yaml
            - name: HF_HOME
              value: /opt/models
          volumeMounts:
            - mountPath: /opt/dynamo/configs
              name: llm-config
              readOnly: true
          workingDir: /workspace/examples/backends/trtllm
        volumes:
          - configMap:
              name: llm-config
            name: llm-config
      replicas: 1
      resources:
        limits:
          gpu: "8"
        requests:
          gpu: "8"
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config-kimi-agg-kvbm
data:
  config.yaml: |
    max_batch_size: 128
    max_num_tokens: 8448
    max_seq_len: 8212
    tensor_parallel_size: 8
    moe_expert_parallel_size: 8
    enable_attention_dp: true
    pipeline_parallel_size: 1
    print_iter_log: true
    kv_cache_config:
      free_gpu_memory_fraction: 0.75
      dtype: fp8
    cache_transceiver_config:
      backend: UCX
      max_tokens_in_buffer: 8448
    trust_remote_code: true
    kv_connector_config:
      connector_module: kvbm.trtllm_integration.connector
      connector_scheduler_class: DynamoKVBMConnectorLeader
      connector_worker_class: DynamoKVBMConnectorWorker
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: kimi-k25-agg-kvbm
spec:
  backendFramework: trtllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      extraPodSpec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: nvidia.com/dynamo-graph-deployment-name
                      operator: In
                      values:
                        - kimi-k25-agg-kvbm-frontend
                topologyKey: kubernetes.io/hostname
        mainContainer:
          args:
            - python3 -m dynamo.frontend --router-mode kv --http-port 8000
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
      replicas: 1
    TrtllmWorker:
      componentType: worker
      envFromSecret: hf-token-secret
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 80Gi
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: nvidia.com/gpu.present
                      operator: In
                      values:
                        - "true"
        mainContainer:
          args:
            - |
              python3 -m dynamo.trtllm \
                --model-path "${MODEL_NAME}" \
                --served-model-name "${MODEL_NAME}" \
                --extra-engine-args "${ENGINE_ARGS}" \
                --tensor-parallel-size 8 \
                --dyn-reasoning-parser kimi_k25 \
                --dyn-tool-call-parser kimi_k2
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
          env:
            - name: TRTLLM_ENABLE_PDL
              value: "1"
            - name: MODEL_NAME
              value: nvidia/Kimi-K2.5-NVFP4
            - name: ENGINE_ARGS
              value: /opt/dynamo/configs/config.yaml
            - name: HF_HOME
              value: /opt/models
            # Adjust CPU cache size as needed
            - name: DYN_KVBM_CPU_CACHE_GB
              value: "100"
          volumeMounts:
            - mountPath: /opt/dynamo/configs
              name: llm-config-kimi-agg-kvbm
              readOnly: true
          workingDir: /workspace/examples/backends/trtllm
        volumes:
          - configMap:
              name: llm-config-kimi-agg-kvbm
            name: llm-config-kimi-agg-kvbm
      replicas: 1
      resources:
        limits:
          gpu: "8"
        requests:
          gpu: "8"
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config
data:
  config.yaml: |
    max_batch_size: 128
    max_num_tokens: 8448
    max_seq_len: 8212
    tensor_parallel_size: 8
    moe_expert_parallel_size: 8
    enable_attention_dp: true
    pipeline_parallel_size: 1
    print_iter_log: true
    kv_cache_config:
      free_gpu_memory_fraction: 0.75
      dtype: fp8
    cache_transceiver_config:
      backend: UCX
      max_tokens_in_buffer: 8448
    trust_remote_code: true
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: kimi-k25-agg
spec:
  backendFramework: trtllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      extraPodSpec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: nvidia.com/dynamo-graph-deployment-name
                      operator: In
                      values:
                        - kimi-k25-agg-frontend
                topologyKey: kubernetes.io/hostname
        mainContainer:
          args:
            - python3 -m dynamo.frontend --router-mode kv --http-port 8000
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
      replicas: 1
    TrtllmWorker:
      componentType: worker
      envFromSecret: hf-token-secret
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 80Gi
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: nvidia.com/gpu.present
                      operator: In
                      values:
                        - "true"
        mainContainer:
          args:
            - |
              python3 -m dynamo.trtllm \
                --model-path "${MODEL_NAME}" \
                --served-model-name "${MODEL_NAME}" \
                --extra-engine-args "${ENGINE_ARGS}" \
                --tensor-parallel-size 8 \
                --dyn-reasoning-parser kimi_k25 \
                --dyn-tool-call-parser kimi_k2
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
          env:
            - name: TRTLLM_ENABLE_PDL
              value: "1"
            - name: MODEL_NAME
              value: nvidia/Kimi-K2.5-NVFP4
            - name: ENGINE_ARGS
              value: /opt/dynamo/configs/config.yaml
            - name: HF_HOME
              value: /opt/models
          volumeMounts:
            - mountPath: /opt/dynamo/configs
              name: llm-config
              readOnly: true
          workingDir: /workspace/examples/backends/trtllm
        volumes:
          - configMap:
              name: llm-config
            name: llm-config
      replicas: 1
      resources:
        limits:
          gpu: "8"
        requests:
          gpu: "8"
\ No newline at end of file
# Kimi K2.5 TensorRT-LLM Patch
Kimi K2.5 support has not yet been released in TensorRT-LLM ([tracking branch](https://github.com/NVIDIA/TensorRT-LLM/compare/main...feat/k25-demo)).
This directory contains an append-only patch that registers `KimiK25ForConditionalGeneration` on top of the existing DeepSeek-V3 model code, letting you run Kimi K2.5 on TensorRT-LLM today.
## Quick start
Patch a Dynamo docker image by running:
```bash
./patch-container.sh <docker-image>
```
For example:
```bash
./patch-container.sh nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
# produces image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
```
If `KimiK25ForConditionalGeneration` is already registered, the patch is skipped. The script is idempotent -- re-running it on an already-patched image is a no-op.
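The skip-if-present behaviour can be sketched in a few lines of Python. This mirrors the logic of the `RUN` step that `patch-container.sh` generates, but it operates on local files rather than inside an image build. `apply_append_patch` is a hypothetical helper name, not part of the recipe, and the file contents in the demo are placeholders.

```python
import tempfile
from pathlib import Path

MARKER = "KimiK25ForConditionalGeneration"


def apply_append_patch(target: Path, patch: Path, marker: str = MARKER) -> bool:
    """Append `patch` to `target` unless `marker` is already present.

    Returns True if the file was modified, False if the patch was skipped.
    """
    source = target.read_text()
    if marker in source:
        return False  # already patched: re-running is a no-op

    # The appended class calls copy.copy, so ensure `import copy` appears
    # within the first 50 lines (the same window the script checks).
    head = source.splitlines()[:50]
    if not any(line.startswith("import copy") for line in head):
        source = "import copy\n" + source

    # Separator newline, then the patch body, as the script does.
    target.write_text(source + "\n" + patch.read_text())
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "modeling_deepseekv3.py"
        patch = Path(tmp) / "kimi.patch"
        target.write_text("import torch\n")
        patch.write_text("class KimiK25ForConditionalGeneration:\n    pass\n")
        print(apply_append_patch(target, patch))  # first run modifies the file
        print(apply_append_patch(target, patch))  # second run is skipped
```

Checking for the class name rather than tracking state elsewhere keeps the check self-contained: the patched file itself is the record that the patch was applied.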
## Files
| File | Description |
|------|-------------|
| `patch-container.sh` | Builds a patched docker image from a base Dynamo image |
| `kimi.patch` | Appended to `modeling_deepseekv3.py` inside the container -- adds a thin `DeepseekV3ForCausalLM` subclass that extracts the Kimi text backbone config and remaps weight prefixes |
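The prefix remapping that `kimi.patch` performs can be illustrated in isolation. The following is a minimal sketch, not the patch itself: `remap_exclude_modules` and `strip_weight_prefix` are hypothetical helper names, the sample module and weight names are made up for illustration, and `strip_weight_prefix` is only a plain-dict stand-in for the step the patch delegates to TensorRT-LLM's `filter_weights`.

```python
from typing import Dict, List, TypeVar

T = TypeVar("T")
LANG_PREFIX = "language_model."


def remap_exclude_modules(modules: List[str],
                          prefix: str = LANG_PREFIX) -> List[str]:
    """Rewrite quantization exclude-module names for the text backbone.

    Names under `prefix` lose the prefix; names that then start with
    `layers.` gain a `model.` prefix. Other names pass through unchanged.
    """
    mapped = []
    for name in modules:
        if name.startswith(prefix):
            rest = name[len(prefix):]
            if rest.startswith("layers."):
                rest = "model." + rest
            mapped.append(rest)
        else:
            mapped.append(name)
    return mapped


def strip_weight_prefix(weights: Dict[str, T],
                        prefix: str = LANG_PREFIX) -> Dict[str, T]:
    """Keep only weights under `prefix` and drop the prefix from their keys.

    If no key carries the prefix, the weights are returned as a copy.
    """
    if not any(key.startswith(prefix) for key in weights):
        return dict(weights)
    return {key[len(prefix):]: value
            for key, value in weights.items()
            if key.startswith(prefix)}


if __name__ == "__main__":
    # Illustrative names only; real checkpoints use their own module names.
    print(remap_exclude_modules([
        "language_model.layers.0.mlp.gate",
        "language_model.lm_head",
        "vision_tower.encoder",
    ]))
    print(strip_weight_prefix({
        "language_model.model.embed_tokens.weight": 1,
        "vision_tower.patch_embed.weight": 2,
    }))
```

Both helpers leave their input untouched and return new containers, which matches the patch's approach of copying the config before modifying it.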
@register_auto_model("KimiK25ForConditionalGeneration")
class KimiK25ForConditionalGeneration(DeepseekV3ForCausalLM):
    """Kimi-K2.5 multimodal model (text-only path).

    Extracts the DeepSeek-V3 text backbone from the composite config
    and strips the ``language_model.`` weight prefix so that the
    standard DeepseekV3ForCausalLM loading path works unchanged.

    NOTE: Kimi-K2.5's text backbone sets ``num_nextn_predict_layers = 0``,
    so MTP-based speculative decoding is not applicable to this model.
    """

    _LANG_PREFIX = "language_model."

    def __init__(self, model_config: ModelConfig[PretrainedConfig]):
        model_config = copy.copy(model_config)
        if hasattr(model_config.pretrained_config, 'text_config'):
            model_config._frozen = False
            model_config.pretrained_config = model_config.pretrained_config.text_config
            if model_config.quant_config.exclude_modules:
                model_config.quant_config = copy.copy(model_config.quant_config)
                p = self._LANG_PREFIX
                mapped = []
                for m in model_config.quant_config.exclude_modules:
                    if m.startswith(p):
                        rest = m[len(p):]
                        if rest.startswith('layers.'):
                            rest = 'model.' + rest
                        mapped.append(rest)
                    else:
                        mapped.append(m)
                model_config.quant_config.exclude_modules = mapped
            model_config._frozen = True
        super().__init__(model_config)

    def load_weights(self, weights: ConsumableWeightsDict):
        has_prefix = any(k.startswith("language_model.") for k in weights)
        if has_prefix:
            weights = filter_weights("language_model", weights)
            weights = ConsumableWeightsDict(weights)
        super().load_weights(weights)
\ No newline at end of file
#!/usr/bin/env bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -euo pipefail

if [[ $# -ne 1 ]]; then
    echo "Usage: $0 <docker-image>"
    echo " Patches modeling_deepseekv3.py with KimiK25ForConditionalGeneration class."
    echo " Outputs: <docker-image>-patched"
    exit 1
fi

SRC_IMAGE="$1"
DST_IMAGE="${SRC_IMAGE}-patched"
TARGET_FILE="/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/models/modeling_deepseekv3.py"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PATCH_FILE="${SCRIPT_DIR}/kimi.patch"

if [[ ! -f "$PATCH_FILE" ]]; then
    echo "ERROR: Patch file not found: $PATCH_FILE"
    exit 1
fi

TMPDIR="$(mktemp -d)"
trap 'rm -rf "$TMPDIR"' EXIT
cp "$PATCH_FILE" "$TMPDIR/kimi.patch"

cat > "$TMPDIR/Dockerfile" <<'DOCKERFILE'
ARG BASE_IMAGE
FROM ${BASE_IMAGE}
ARG TARGET_FILE
USER root
COPY kimi.patch /opt/kimi.patch
RUN if grep -q 'KimiK25ForConditionalGeneration' "${TARGET_FILE}"; then \
        echo "Patch already applied, skipping."; \
    else \
        if ! head -50 "${TARGET_FILE}" | grep -q '^import copy'; then \
            sed -i '1s/^/import copy\n/' "${TARGET_FILE}"; \
        fi && \
        echo "" >> "${TARGET_FILE}" && \
        cat /opt/kimi.patch >> "${TARGET_FILE}"; \
    fi && \
    rm -f /opt/kimi.patch
USER 1000
DOCKERFILE

echo "Building patched image: ${DST_IMAGE}"
docker build \
    --build-arg BASE_IMAGE="$SRC_IMAGE" \
    --build-arg TARGET_FILE="$TARGET_FILE" \
    -t "$DST_IMAGE" \
    "$TMPDIR"
echo "Done. Patched image: ${DST_IMAGE}"