Commit a620a9cf authored by Ziqi Fan, committed by GitHub

feat: enable KVBM metrics on k8s for kimi k2.5 recipe (#6963)


Signed-off-by: Ziqi Fan <ziqif@nvidia.com>
parent b97fde10
# Kimi-K2.5 Aggregated Deployment with KVBM on Kubernetes
## Prerequisites
- A Kubernetes cluster with the [Dynamo Operator](https://docs.nvidia.com/dynamo/) installed
- 8× GPU nodes (e.g. H100/H200)
- A `hf-token-secret` Secret containing your Hugging Face token
- A pre-existing `model-cache` PVC
- Replace the placeholder image tag `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag` in `deploy-kvbm.yaml` with your actual image
## Deploy
```bash
kubectl apply -f deploy-kvbm.yaml
```
This creates:
- A **ConfigMap** (`llm-config-kimi-agg-kvbm`) with TRT-LLM engine parameters (TP=8, EP=8, FP8 KV-cache, KVBM connector).
- A **DynamoGraphDeployment** (`kimi-k25-agg-kvbm`) with a Frontend (KV-router mode) and a TrtllmWorker serving `nvidia/Kimi-K2.5-NVFP4`.
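
After applying the manifest, it can help to confirm that both objects exist and that worker pods were scheduled. The following is a minimal sketch, not part of the recipe: `check_kimi_kvbm_deploy` is a hypothetical helper name, the namespace is whatever you deployed into, and it assumes the Dynamo Operator's CRD lets `kubectl` resolve `dynamographdeployment` as a resource name.

```bash
# Hypothetical helper (not part of the recipe): list the objects created by deploy-kvbm.yaml.
# Usage: check_kimi_kvbm_deploy [namespace]
check_kimi_kvbm_deploy() {
  local ns="${1:-default}"
  kubectl get configmap llm-config-kimi-agg-kvbm -n "$ns"
  # Assumes the CRD is installed so kubectl can resolve this resource name.
  kubectl get dynamographdeployment kimi-k25-agg-kvbm -n "$ns"
  # Worker pods carry the nvidia.com/dynamo-component-type=worker label.
  kubectl get pods -n "$ns" -l nvidia.com/dynamo-component-type=worker
}
```

Calling `check_kimi_kvbm_deploy my-namespace` prints one `kubectl get` listing per resource; an error on any of the three points at the object that is missing.
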
Key environment variables on the worker:
| Variable | Value in this recipe | Description |
|---|---|---|
| `DYN_KVBM_CPU_CACHE_GB` | `10` | CPU cache size in GB for KVBM |
| `DYN_KVBM_METRICS` | `true` | Enable Prometheus metrics endpoint |
| `DYN_KVBM_METRICS_PORT` | `6880` | Port for the metrics endpoint |
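
To check which of these values a running worker actually received, one option is to read the container environment directly. This is a sketch under assumptions: `show_kvbm_env` is a hypothetical helper, the pod name and namespace are yours to supply, and it relies on `kubectl exec` picking the pod's default container.

```bash
# Hypothetical helper (not part of the recipe): print the KVBM settings a worker pod sees.
# Usage: show_kvbm_env <pod-name> [namespace]
show_kvbm_env() {
  local pod="$1" ns="${2:-default}"
  # kubectl exec targets the pod's default container unless -c is given.
  kubectl exec -n "$ns" "$pod" -- env | grep '^DYN_KVBM_' | sort
}
```
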
## Enable Prometheus Metrics Scraping
If you have the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) installed, apply the PodMonitor:
```bash
kubectl apply -f podmonitor-kvbm.yaml -n monitoring
```
This scrapes `/metrics` on port `6880` (named `kvbm`) every 5 seconds from worker pods labeled with:
- `nvidia.com/dynamo-component-type: worker`
- `nvidia.com/metrics-enabled: "true"`
> **Note:** If your Prometheus Operator watches a namespace other than `monitoring` for PodMonitors, change `metadata.namespace` in `podmonitor-kvbm.yaml` accordingly.
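
Before relying on Prometheus, you may want to confirm that a worker really serves metrics on the `kvbm` port. The sketch below port-forwards port `6880` from one pod and lists the metric names it exposes. It is illustrative only: `metric_names` and `check_kvbm_metrics` are hypothetical helpers, the pod name and namespace are arguments you supply, and the two-second wait is an arbitrary choice.

```bash
# Hypothetical helpers (not part of the recipe) for a quick manual check.

# Read Prometheus text-format metrics on stdin and print the distinct metric names.
metric_names() {
  grep -v -e '^#' -e '^[[:space:]]*$' | sed -E 's/[{ ].*$//' | sort -u
}

# Port-forward the kvbm port of one worker pod and list the metrics it serves.
# Usage: check_kvbm_metrics <pod-name> [namespace]
check_kvbm_metrics() {
  local pod="$1" ns="${2:-default}"
  kubectl port-forward -n "$ns" "pod/$pod" 6880:6880 >/dev/null 2>&1 &
  local pf_pid=$!
  sleep 2   # give the port-forward a moment to come up
  curl -s http://localhost:6880/metrics | metric_names
  kill "$pf_pid" 2>/dev/null
}
```

If the list comes back empty, check that `DYN_KVBM_METRICS` is `"true"` on the worker and that the `kvbm` container port is declared. To confirm the PodMonitor itself was created, `kubectl get podmonitor dynamo-worker-kvbm -n monitoring` should return it.
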
@@ -76,6 +76,13 @@ spec:
         values:
           - "true"
 mainContainer:
+  ports:
+    - name: system
+      containerPort: 9090
+    - name: nixl
+      containerPort: 19090
+    - name: kvbm
+      containerPort: 6880
   args:
     - |
       python3 -m dynamo.trtllm \
@@ -98,9 +105,14 @@ spec:
       value: /opt/dynamo/configs/config.yaml
     - name: HF_HOME
       value: /opt/models
-    # Adjust CPU cache size as needed
+    # Adjust CPU cache size as needed; start small for faster startup
     - name: DYN_KVBM_CPU_CACHE_GB
-      value: "100"
+      value: "10"
+    # Enable KVBM metrics
+    - name: DYN_KVBM_METRICS
+      value: "true"
+    - name: DYN_KVBM_METRICS_PORT
+      value: "6880"
   volumeMounts:
     - mountPath: /opt/dynamo/configs
       name: llm-config-kimi-agg-kvbm
......
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Standalone PodMonitor for KVBM metrics (port 6880).
# Apply this if you cannot upgrade the platform Helm chart.
#
# Usage: kubectl apply -f podmonitor-kvbm.yaml -n monitoring
#
# Scrapes KVBM metrics (port 6880) from worker pods in any namespace.
# Only workers with the kvbm port exposed (e.g. DYN_KVBM_METRICS=true) are scraped.
#
# If your Prometheus Operator watches a different namespace for PodMonitors,
# change metadata.namespace and apply there.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: dynamo-worker-kvbm
  namespace: monitoring
spec:
  namespaceSelector:
    any: true
  podMetricsEndpoints:
    - interval: 5s
      path: /metrics
      port: kvbm
  selector:
    matchLabels:
      nvidia.com/dynamo-component-type: worker
      nvidia.com/metrics-enabled: "true"