feat: enable KVBM metrics on k8s for kimi k2.5 recipe (#6963)

Signed-off-by: Ziqi Fan <ziqif@nvidia.com>

Commit a620a9cf (unverified), authored Mar 05, 2026 by Ziqi Fan; committed by GitHub Mar 06, 2026
3 changed files
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/README.md
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/README.md
+# Kimi-K2.5 Aggregated Deployment with KVBM on Kubernetes
+## Prerequisites
+- A Kubernetes cluster with the [Dynamo Operator](https://docs.nvidia.com/dynamo/) installed
+- 8× GPU nodes (e.g. H100/H200)
+- A `hf-token-secret` Secret containing your Hugging Face token
+- A pre-existing `model-cache` PVC
+- Replace the placeholder image tag `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag` in `deploy-kvbm.yaml` with your actual image
+## Deploy
+```bash
+kubectl apply -f deploy-kvbm.yaml
+```
+This creates:
+- A **ConfigMap** (`llm-config-kimi-agg-kvbm`) with TRT-LLM engine parameters (TP=8, EP=8, FP8 KV-cache, KVBM connector).
+- A **DynamoGraphDeployment** (`kimi-k25-agg-kvbm`) with a Frontend (KV-router mode) and a TrtllmWorker serving `nvidia/Kimi-K2.5-NVFP4`.
+Key environment variables on the worker:
+| Variable | Default | Description |
+|---|---|---|
+| `DYN_KVBM_CPU_CACHE_GB` | `10` | CPU cache size in GB for KVBM |
+| `DYN_KVBM_METRICS` | `true` | Enable Prometheus metrics endpoint |
+| `DYN_KVBM_METRICS_PORT` | `6880` | Port for the metrics endpoint |
+## Enable Prometheus Metrics Scraping
+If you have the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) installed, apply the PodMonitor:
+```bash
+kubectl apply -f podmonitor-kvbm.yaml -n monitoring
+```
+This scrapes `/metrics` on port `6880` (named `kvbm`) every 5 seconds from worker pods labeled with:
+- `nvidia.com/dynamo-component-type: worker`
+- `nvidia.com/metrics-enabled: "true"`
+> **Note:** If your Prometheus Operator watches a namespace other than `monitoring` for PodMonitors, change `metadata.namespace` in `podmonitor-kvbm.yaml` accordingly.
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy-kvbm.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy-kvbm.yaml
@@ -76,6 +76,13 @@ spec:
                  values:
                  - "true"
        mainContainer:
+          ports:
+          - name: system
+            containerPort: 9090
+          - name: nixl
+            containerPort: 19090
+          - name: kvbm
+            containerPort: 6880
          args:
          - |
            python3 -m dynamo.trtllm \
@@ -98,9 +105,14 @@ spec:
            value: /opt/dynamo/configs/config.yaml
          - name: HF_HOME
            value: /opt/models
-          # Adjust CPU cache size as needed
+          # Adjust CPU cache size as needed; start small for faster startup
          - name: DYN_KVBM_CPU_CACHE_GB
-            value: "100"
+            value: "10"
+          # Enable KVBM metrics
+          - name: DYN_KVBM_METRICS
+            value: "true"
+          - name: DYN_KVBM_METRICS_PORT
+            value: "6880"
          volumeMounts:
          - mountPath: /opt/dynamo/configs
            name: llm-config-kimi-agg-kvbm

--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/podmonitor-kvbm.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/podmonitor-kvbm.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Standalone PodMonitor for KVBM metrics (port 6880).
+# Apply this if you cannot upgrade the platform Helm chart.
+#
+# Usage: kubectl apply -f podmonitor-kvbm.yaml -n monitoring
+#
+# Scrapes KVBM metrics (port 6880) from worker pods in any namespace.
+# Only workers with the kvbm port exposed (e.g. DYN_KVBM_METRICS=true) are scraped.
+#
+# If your Prometheus Operator watches a different namespace for PodMonitors,
+# change metadata.namespace and apply there.
+apiVersion: monitoring.coreos.com/v1
+kind: PodMonitor
+metadata:
+  name: dynamo-worker-kvbm
+  namespace: monitoring
+spec:
+  namespaceSelector:
+    any: true
+  podMetricsEndpoints:
+  - interval: 5s
+    path: /metrics
+    port: kvbm
+  selector:
+    matchLabels:
+      nvidia.com/dynamo-component-type: worker
+      nvidia.com/metrics-enabled: "true"
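
Reviewer note: the PodMonitor above only produces a scrape target when a pod carries both selector labels and also exposes a container port named `kvbm`. The snippet below is a minimal sketch of that rule for reasoning about which pods get scraped. It is not Prometheus Operator code, and the helper name `is_scraped` and the sample pods are hypothetical.

```python
# Sketch of the selection rule implied by podmonitor-kvbm.yaml.
# Assumption: a pod becomes a target only if every matchLabels entry is
# present with the same value AND it has a container port named "kvbm".

REQUIRED_LABELS = {
    "nvidia.com/dynamo-component-type": "worker",
    "nvidia.com/metrics-enabled": "true",
}
ENDPOINT_PORT = "kvbm"


def is_scraped(pod_labels: dict, container_port_names: list) -> bool:
    """Return True if a pod with these labels and named ports would be scraped."""
    labels_match = all(
        pod_labels.get(key) == value for key, value in REQUIRED_LABELS.items()
    )
    return labels_match and ENDPOINT_PORT in container_port_names


# Hypothetical worker pod matching the ports added in deploy-kvbm.yaml.
kvbm_worker_labels = {
    "nvidia.com/dynamo-component-type": "worker",
    "nvidia.com/metrics-enabled": "true",
}
kvbm_worker_ports = ["system", "nixl", "kvbm"]

# Hypothetical worker that has the labels but no kvbm port declared.
plain_worker_ports = ["system", "nixl"]

print(is_scraped(kvbm_worker_labels, kvbm_worker_ports))   # True
print(is_scraped(kvbm_worker_labels, plain_worker_ports))  # False
```

This is why the `ports` block added to `mainContainer` matters: setting `DYN_KVBM_METRICS=true` alone starts the endpoint, but the PodMonitor refers to the port by its name.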