Unverified Commit d4ac0b59 authored by Ziqi Fan's avatar Ziqi Fan Committed by GitHub
Browse files

chore: rm kimi kvbm recipe | add qwen3 kvbm recipe (#8475)


Signed-off-by: default avatarZiqi Fan <ziqif@nvidia.com>
parent 55a949cb
...@@ -9,7 +9,7 @@ There are two model weight variants, each with its own model download and deploy ...@@ -9,7 +9,7 @@ There are two model weight variants, each with its own model download and deploy
| Variant | Model | Status | Modality | Deploy Configs | Notes | | Variant | Model | Status | Modality | Deploy Configs | Notes |
|---------|-------|--------|----------|---------------|-------| |---------|-------|--------|----------|---------------|-------|
| **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | Functional | Text only | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image, not yet performance-optimized | | **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | Functional | Text only | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image, not yet performance-optimized |
| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml), and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | All configs are compatible with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional | | **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml) and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | All configs are compatible with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional |
All configurations use TP8, EP8, aggregated mode with KV-aware routing. All configurations use TP8, EP8, aggregated mode with KV-aware routing.
...@@ -85,7 +85,7 @@ curl http://localhost:8000/v1/chat/completions \ ...@@ -85,7 +85,7 @@ curl http://localhost:8000/v1/chat/completions \
> text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not > text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
> processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5. > processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`, as well as a deployment `deploy-specdec.yaml` that uses speculative decoding. The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also ships `deploy-specdec.yaml` that uses speculative decoding.
### Quick Start ### Quick Start
......
...@@ -4,12 +4,11 @@ ...@@ -4,12 +4,11 @@
> text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not > text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
> processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5. > processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
This directory contains three aggregated deployment configurations for the `nvidia/Kimi-K2.5-NVFP4` model. This directory contains two aggregated deployment configurations for the `nvidia/Kimi-K2.5-NVFP4` model.
| Deployment | Manifest | Description | Hardware Requirement | Deployment | Manifest | Description | Hardware Requirement
|-----------|----------|-------------|----| |-----------|----------|-------------|----|
| **Standard Aggregated** | [`deploy.yaml`](deploy.yaml) | Basic aggregated serving with KV-aware routing | 1x8 B200 node | | **Standard Aggregated** | [`deploy.yaml`](deploy.yaml) | Basic aggregated serving with KV-aware routing | 1x8 B200 node |
| **Aggregated + KVBM** | [`deploy-kvbm.yaml`](deploy-kvbm.yaml) | Aggregated serving with CPU-offloaded KV cache (KV Block Manager) | 1x8 B200 node |
| **Aggregated + EAGLE SpecDec** | [`deploy-specdec.yaml`](deploy-specdec.yaml) | Performant aggregated deployment with EAGLE speculative decoding and KV-aware routing | 8x4 GB200 nodes | | **Aggregated + EAGLE SpecDec** | [`deploy-specdec.yaml`](deploy-specdec.yaml) | Performant aggregated deployment with EAGLE speculative decoding and KV-aware routing | 8x4 GB200 nodes |
## Prerequisites ## Prerequisites
...@@ -36,44 +35,6 @@ This creates: ...@@ -36,44 +35,6 @@ This creates:
--- ---
## Aggregated Deployment with KVBM
Uses [`deploy-kvbm.yaml`](deploy-kvbm.yaml). This configuration adds CPU-offloaded KV cache via the KV Block Manager (KVBM), which allows larger effective context by spilling KV cache to host memory.
```bash
kubectl apply -f deploy-kvbm.yaml -n ${NAMESPACE}
```
This creates:
- A **ConfigMap** (`llm-config-kimi-agg-kvbm`) with TRT-LLM engine parameters (TP=8, EP=8, FP8 KV-cache, KVBM connector).
- A **DynamoGraphDeployment** (`kimi-k25-agg-kvbm`) with a Frontend (KV-router mode) and a TrtllmWorker serving `nvidia/Kimi-K2.5-NVFP4`.
### KVBM Configuration
Key environment variables on the worker:
| Variable | Default | Description |
|---|---|---|
| `DYN_KVBM_CPU_CACHE_GB` | `10` | CPU cache size in GB for KVBM |
| `DYN_KVBM_METRICS` | `true` | Enable Prometheus metrics endpoint |
| `DYN_KVBM_METRICS_PORT` | `6880` | Port for the metrics endpoint |
### Enable Prometheus Metrics Scraping
If you have the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) installed, apply the PodMonitor:
```bash
kubectl apply -f podmonitor-kvbm.yaml -n monitoring
```
This scrapes `/metrics` on port `6880` (named `kvbm`) every 5 seconds from worker pods labeled with:
- `nvidia.com/dynamo-component-type: worker`
- `nvidia.com/metrics-enabled: "true"`
> **Note:** If your Prometheus Operator watches a namespace other than `monitoring` for PodMonitors, change `metadata.namespace` in `podmonitor-kvbm.yaml` accordingly.
---
## Aggregated Deployment with EAGLE Speculative Decoding and KV-aware routing ## Aggregated Deployment with EAGLE Speculative Decoding and KV-aware routing
Uses [`deploy-specdec.yaml`](deploy-specdec.yaml). This performant configuration runs KV-aware aggregated serving with EAGLE speculative decoding on GB200. Uses [`deploy-specdec.yaml`](deploy-specdec.yaml). This performant configuration runs KV-aware aggregated serving with EAGLE speculative decoding on GB200.
......
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: ConfigMap
metadata:
name: llm-config-kimi-agg-kvbm
data:
config.yaml: |
max_batch_size: 128
max_num_tokens: 8448
max_seq_len: 8212
tensor_parallel_size: 8
moe_expert_parallel_size: 8
enable_attention_dp: true
pipeline_parallel_size: 1
print_iter_log: true
kv_cache_config:
free_gpu_memory_fraction: 0.75
dtype: fp8
cache_transceiver_config:
backend: UCX
max_tokens_in_buffer: 8448
trust_remote_code: true
kv_connector_config:
connector_module: kvbm.trtllm_integration.connector
connector_scheduler_class: DynamoKVBMConnectorLeader
connector_worker_class: DynamoKVBMConnectorWorker
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: kimi-k25-agg-kvbm
spec:
backendFramework: trtllm
pvcs:
- name: model-cache
create: false
services:
Frontend:
componentType: frontend
extraPodSpec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: nvidia.com/dynamo-graph-deployment-name
operator: In
values:
- kimi-k25-agg-kvbm-frontend
topologyKey: kubernetes.io/hostname
mainContainer:
args:
- python3 -m dynamo.frontend --router-mode kv --http-port 8000
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
replicas: 1
TrtllmWorker:
componentType: worker
envFromSecret: hf-token-secret
volumeMounts:
- name: model-cache
mountPoint: /opt/models
sharedMemory:
size: 80Gi
extraPodSpec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.present
operator: In
values:
- "true"
mainContainer:
ports:
- name: system
containerPort: 9090
- name: nixl
containerPort: 19090
- name: kvbm
containerPort: 6880
args:
- |
python3 -m dynamo.trtllm \
--model-path "${MODEL_NAME}" \
--served-model-name "${MODEL_NAME}" \
--extra-engine-args "${ENGINE_ARGS}" \
--tensor-parallel-size 8 \
--dyn-reasoning-parser kimi_k25 \
--dyn-tool-call-parser kimi_k2
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
env:
- name: TRTLLM_ENABLE_PDL
value: "1"
- name: MODEL_NAME
value: nvidia/Kimi-K2.5-NVFP4
- name: ENGINE_ARGS
value: /opt/dynamo/configs/config.yaml
- name: HF_HOME
value: /opt/models
# Adjust CPU cache size as needed; start small for faster startup
- name: DYN_KVBM_CPU_CACHE_GB
value: "10"
# Enable KVBM metrics
- name: DYN_KVBM_METRICS
value: "true"
- name: DYN_KVBM_METRICS_PORT
value: "6880"
volumeMounts:
- mountPath: /opt/dynamo/configs
name: llm-config-kimi-agg-kvbm
readOnly: true
workingDir: /workspace/examples/backends/trtllm
volumes:
- configMap:
name: llm-config-kimi-agg-kvbm
name: llm-config-kimi-agg-kvbm
replicas: 1
resources:
limits:
gpu: "8"
requests:
gpu: "8"
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Standalone PodMonitor for KVBM metrics (port 6880).
# Apply this if you cannot upgrade the platform Helm chart.
#
# Usage: kubectl apply -f podmonitor-kvbm.yaml -n monitoring
#
# Scrapes KVBM metrics (port 6880) from worker pods in any namespace.
# Only workers with the kvbm port exposed (e.g. DYN_KVBM_METRICS=true) are scraped.
#
# If your Prometheus Operator watches a different namespace for PodMonitors,
# change metadata.namespace and apply there.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: dynamo-worker-kvbm
namespace: monitoring
spec:
namespaceSelector:
any: true
podMetricsEndpoints:
- interval: 5s
path: /metrics
port: kvbm
selector:
matchLabels:
nvidia.com/dynamo-component-type: worker
nvidia.com/metrics-enabled: "true"
# Qwen3-32B: Aggregated + KVBM (single GPU)
Single-GPU aggregated deployment of `Qwen/Qwen3-32B` with the KV Block Manager
(KVBM) enabled. KVBM offloads cold KV cache blocks to host memory so the
effective cache footprint extends beyond GPU HBM, which improves prefix-reuse
hit rate on long or repeated prompts without adding GPUs.
## Hardware
- **1x NVIDIA H200 (141 GB) or B200 (192 GB)**. Qwen3-32B in BF16 is ~64 GB of
weights plus KV cache and activations, so 80 GB H100 leaves very little room
and is likely to OOM under real load. If you only have H100 80 GB, see
`../../../qwen3-32b-fp8/` for the FP8 variant.
- **≥ ~150 GiB of host memory on the node**. `DYN_KVBM_CPU_CACHE_GB=100` is
pinned as page-locked host memory for KVBM's G2 tier. The worker declares
`resources.requests.memory: 150Gi` and `resources.limits.memory: 200Gi`
(100 GiB pinned KV pool + ~50 GiB headroom for Python, weight-loader
working memory, and CUDA/NCCL buffers). If you raise `DYN_KVBM_CPU_CACHE_GB`,
scale these up by roughly the same delta.
## Prerequisites
Same as the sibling recipes in this directory:
1. **Dynamo Platform installed** — see the [Kubernetes Deployment Guide](../../../../docs/kubernetes/README.md).
2. **Pre-existing `model-cache` and `compilation-cache` PVCs** — see
[`../../model-cache/cache.yaml`](../../model-cache/cache.yaml) and
[`../../model-cache/model-download.yaml`](../../model-cache/model-download.yaml).
3. **HuggingFace token Secret** named `hf-token-secret` in your namespace.
```bash
export NAMESPACE=your-namespace
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token" \
-n ${NAMESPACE}
```
## Deploy
```bash
kubectl apply -f deploy.yaml -n ${NAMESPACE}
kubectl wait --for=condition=ready pod \
-l nvidia.com/dynamo-graph-deployment-name=agg-kvbm-qwen3-32b \
-n ${NAMESPACE} --timeout=1200s
```
## Verify
```bash
kubectl port-forward svc/agg-kvbm-qwen3-32b-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-32B",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 64
}'
```
## KVBM configuration
The connector is selected through the worker's `--kv-transfer-config`:
```json
{"kv_connector":"DynamoConnector","kv_role":"kv_both","kv_connector_module_path":"kvbm.vllm_integration.connector"}
```
Worker env var set by this recipe:
| Variable | Default in this recipe | Description |
|---|---|---|
| `DYN_KVBM_CPU_CACHE_GB` | `100` | CPU memory reserved for offloaded KV blocks. Raise for longer contexts or higher reuse; if you change this, also bump `resources.requests.memory` / `limits.memory` on the worker by roughly the same delta. |
### (Optional) Prometheus metrics
Metrics are **off** by default. To expose them, add the following to the
worker's `env` and `mainContainer.ports` in `deploy.yaml`:
```yaml
env:
- name: DYN_KVBM_METRICS
value: "true"
- name: DYN_KVBM_METRICS_PORT
value: "6880"
ports:
- name: kvbm
containerPort: 6880
```
Once enabled, scrape `:6880/metrics` for counters like
`kvbm_offload_blocks_d2h`, `kvbm_onboard_blocks_h2d`, `kvbm_matched_tokens`,
`kvbm_host_cache_hit_rate`, plus per-route transfer counters. If you run the
Prometheus Operator, add a `PodMonitor` selecting pods with
`nvidia.com/dynamo-component-type: worker` and port `kvbm`.
## Cleanup
```bash
kubectl delete dynamographdeployment agg-kvbm-qwen3-32b -n ${NAMESPACE}
```
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: agg-kvbm-qwen3-32b
spec:
pvcs:
- create: false
name: model-cache
- create: false
name: compilation-cache
services:
Frontend:
componentType: frontend
envs:
- name: HF_HOME
value: /home/dynamo/.cache/huggingface
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
workingDir: /workspace
command:
- python3
- -m
- dynamo.frontend
args:
- --router-reset-states
replicas: 1
subComponentType: null
VllmDecodeWorker:
componentType: worker
envFromSecret: hf-token-secret
volumeMounts:
- name: model-cache
mountPoint: /home/dynamo/.cache/huggingface
- name: compilation-cache
mountPoint: /home/dynamo/.cache/vllm
useAsCompilationCache: true
extraPodSpec:
mainContainer:
args:
- --model
- Qwen/Qwen3-32B
- --kv-transfer-config
- '{"kv_connector":"DynamoConnector","kv_role":"kv_both","kv_connector_module_path":"kvbm.vllm_integration.connector"}'
command:
- python3
- -m
- dynamo.vllm
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
env:
- name: DYN_HEALTH_CHECK_ENABLED
value: "false"
- name: HF_HOME
value: /home/dynamo/.cache/huggingface
# NOTE: change this to tune the CPU cache size for your system
- name: DYN_KVBM_CPU_CACHE_GB
value: "100"
workingDir: /workspace
replicas: 1
resources:
limits:
gpu: "1"
memory: "200Gi"
requests:
gpu: "1"
memory: "150Gi"
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment