chore: rm kimi kvbm recipe | add qwen3 kvbm recipe (#8475)

Signed-off-by: Ziqi Fan <ziqif@nvidia.com>

Commit d4ac0b59, authored Apr 21, 2026 by Ziqi Fan, committed by GitHub Apr 21, 2026
6 changed files
--- a/recipes/kimi-k2.5/README.md
+++ b/recipes/kimi-k2.5/README.md
@@ -9,7 +9,7 @@ There are two model weight variants, each with its own model download and deploy
 | Variant | Model | Status | Modality | Deploy Configs | Notes |
 |---------|-------|--------|----------|---------------|-------|
 | **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | Functional | Text only | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image, not yet performance-optimized |
-| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml), and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | All configs are compatible with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional |
+| **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml) and [`deploy-specdec.yaml`](trtllm/agg/nvidia/deploy-specdec.yaml) | All configs are compatible with a current top-of-tree Dynamo TRT-LLM image. Vision input is not yet functional |

 All configurations use TP8, EP8, aggregated mode with KV-aware routing.

@@ -85,7 +85,7 @@ curl http://localhost:8000/v1/chat/completions \
 > text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
 > processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.

-The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`, as well as a deployment `deploy-specdec.yaml` that uses speculative decoding.
+The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also ships `deploy-specdec.yaml` that uses speculative decoding.

 ### Quick Start


--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/README.md
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/README.md
@@ -4,12 +4,11 @@
 > text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
 > processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.

-This directory contains three aggregated deployment configurations for the `nvidia/Kimi-K2.5-NVFP4` model.
+This directory contains two aggregated deployment configurations for the `nvidia/Kimi-K2.5-NVFP4` model.

 | Deployment | Manifest | Description | Hardware Requirement
 |-----------|----------|-------------|----|
 | **Standard Aggregated** | [`deploy.yaml`](deploy.yaml) | Basic aggregated serving with KV-aware routing | 1x8 B200 node |
-| **Aggregated + KVBM** | [`deploy-kvbm.yaml`](deploy-kvbm.yaml) | Aggregated serving with CPU-offloaded KV cache (KV Block Manager) | 1x8 B200 node |
 | **Aggregated + EAGLE SpecDec** | [`deploy-specdec.yaml`](deploy-specdec.yaml) | Performant aggregated deployment with EAGLE speculative decoding and KV-aware routing | 8x4 GB200 nodes |

 ## Prerequisites
@@ -36,44 +35,6 @@ This creates:

 ---

-## Aggregated Deployment with KVBM
-
-Uses [`deploy-kvbm.yaml`](deploy-kvbm.yaml). This configuration adds CPU-offloaded KV cache via the KV Block Manager (KVBM), which allows larger effective context by spilling KV cache to host memory.
-
-```bash
-kubectl apply -f deploy-kvbm.yaml -n ${NAMESPACE}
-```
-
-This creates:
- A **ConfigMap** (`llm-config-kimi-agg-kvbm`) with TRT-LLM engine parameters (TP=8, EP=8, FP8 KV-cache, KVBM connector).
- A **DynamoGraphDeployment** (`kimi-k25-agg-kvbm`) with a Frontend (KV-router mode) and a TrtllmWorker serving `nvidia/Kimi-K2.5-NVFP4`.
-
-### KVBM Configuration
-
-Key environment variables on the worker:
-
-| Variable | Default | Description |
-|---|---|---|
-| `DYN_KVBM_CPU_CACHE_GB` | `10` | CPU cache size in GB for KVBM |
-| `DYN_KVBM_METRICS` | `true` | Enable Prometheus metrics endpoint |
-| `DYN_KVBM_METRICS_PORT` | `6880` | Port for the metrics endpoint |
-
-### Enable Prometheus Metrics Scraping
-
-If you have the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) installed, apply the PodMonitor:
-
-```bash
-kubectl apply -f podmonitor-kvbm.yaml -n monitoring
-```
-
-This scrapes `/metrics` on port `6880` (named `kvbm`) every 5 seconds from worker pods labeled with:
- `nvidia.com/dynamo-component-type: worker`
- `nvidia.com/metrics-enabled: "true"`
-
-> **Note:** If your Prometheus Operator watches a namespace other than `monitoring` for PodMonitors, change `metadata.namespace` in `podmonitor-kvbm.yaml` accordingly.
-
---
-
 ## Aggregated Deployment with EAGLE Speculative Decoding and KV-aware routing

 Uses [`deploy-specdec.yaml`](deploy-specdec.yaml). This performant configuration runs KV-aware aggregated serving with EAGLE speculative decoding on GB200.

--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy-kvbm.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/deploy-kvbm.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-apiVersion: v1
-kind: ConfigMap
-metadata:
-  name: llm-config-kimi-agg-kvbm
-data:
-  config.yaml: |
-    max_batch_size: 128
-    max_num_tokens: 8448
-    max_seq_len: 8212
-    tensor_parallel_size: 8
-    moe_expert_parallel_size: 8
-    enable_attention_dp: true
-    pipeline_parallel_size: 1
-    print_iter_log: true
-    kv_cache_config:
-      free_gpu_memory_fraction: 0.75
-      dtype: fp8
-    cache_transceiver_config:
-      backend: UCX
-      max_tokens_in_buffer: 8448
-    trust_remote_code: true
-    kv_connector_config:
-      connector_module: kvbm.trtllm_integration.connector
-      connector_scheduler_class: DynamoKVBMConnectorLeader
-      connector_worker_class: DynamoKVBMConnectorWorker
---
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: kimi-k25-agg-kvbm
-spec:
-  backendFramework: trtllm
-  pvcs:
-    - name: model-cache
-      create: false
-  services:
-    Frontend:
-      componentType: frontend
-      extraPodSpec:
-        affinity:
-          podAntiAffinity:
-            requiredDuringSchedulingIgnoredDuringExecution:
-            - labelSelector:
-                matchExpressions:
-                - key: nvidia.com/dynamo-graph-deployment-name
-                  operator: In
-                  values:
-                  - kimi-k25-agg-kvbm-frontend
-              topologyKey: kubernetes.io/hostname
-        mainContainer:
-          args:
-          - python3 -m dynamo.frontend --router-mode kv --http-port 8000
-          command:
-          - /bin/sh
-          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
-      replicas: 1
-    TrtllmWorker:
-      componentType: worker
-      envFromSecret: hf-token-secret
-      volumeMounts:
-        - name: model-cache
-          mountPoint: /opt/models
-      sharedMemory:
-        size: 80Gi
-      extraPodSpec:
-        affinity:
-          nodeAffinity:
-            requiredDuringSchedulingIgnoredDuringExecution:
-              nodeSelectorTerms:
-              - matchExpressions:
-                - key: nvidia.com/gpu.present
-                  operator: In
-                  values:
-                  - "true"
-        mainContainer:
-          ports:
-          - name: system
-            containerPort: 9090
-          - name: nixl
-            containerPort: 19090
-          - name: kvbm
-            containerPort: 6880
-          args:
-          - |
-            python3 -m dynamo.trtllm \
-              --model-path "${MODEL_NAME}" \
-              --served-model-name "${MODEL_NAME}" \
-              --extra-engine-args "${ENGINE_ARGS}" \
-              --tensor-parallel-size 8 \
-              --dyn-reasoning-parser kimi_k25 \
-              --dyn-tool-call-parser kimi_k2
-          command:
-          - /bin/sh
-          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
-          env:
-          - name: TRTLLM_ENABLE_PDL
-            value: "1"
-          - name: MODEL_NAME
-            value: nvidia/Kimi-K2.5-NVFP4
-          - name: ENGINE_ARGS
-            value: /opt/dynamo/configs/config.yaml
-          - name: HF_HOME
-            value: /opt/models
-          # Adjust CPU cache size as needed; start small for faster startup
-          - name: DYN_KVBM_CPU_CACHE_GB
-            value: "10"
-          # Enable KVBM metrics
-          - name: DYN_KVBM_METRICS
-            value: "true"
-          - name: DYN_KVBM_METRICS_PORT
-            value: "6880"
-          volumeMounts:
-          - mountPath: /opt/dynamo/configs
-            name: llm-config-kimi-agg-kvbm
-            readOnly: true
-          workingDir: /workspace/examples/backends/trtllm
-        volumes:
-        - configMap:
-            name: llm-config-kimi-agg-kvbm
-          name: llm-config-kimi-agg-kvbm
-      replicas: 1
-      resources:
-        limits:
-          gpu: "8"
-        requests:
-          gpu: "8"
--- a/recipes/kimi-k2.5/trtllm/agg/nvidia/podmonitor-kvbm.yaml
+++ b/recipes/kimi-k2.5/trtllm/agg/nvidia/podmonitor-kvbm.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Standalone PodMonitor for KVBM metrics (port 6880).
-# Apply this if you cannot upgrade the platform Helm chart.
-#
-# Usage: kubectl apply -f podmonitor-kvbm.yaml -n monitoring
-#
-# Scrapes KVBM metrics (port 6880) from worker pods in any namespace.
-# Only workers with the kvbm port exposed (e.g. DYN_KVBM_METRICS=true) are scraped.
-#
-# If your Prometheus Operator watches a different namespace for PodMonitors,
-# change metadata.namespace and apply there.
-apiVersion: monitoring.coreos.com/v1
-kind: PodMonitor
-metadata:
-  name: dynamo-worker-kvbm
-  namespace: monitoring
-spec:
-  namespaceSelector:
-    any: true
-  podMetricsEndpoints:
-  - interval: 5s
-    path: /metrics
-    port: kvbm
-  selector:
-    matchLabels:
-      nvidia.com/dynamo-component-type: worker
-      nvidia.com/metrics-enabled: "true"
--- a/recipes/qwen3-32b/vllm/agg-kvbm/README.md
+++ b/recipes/qwen3-32b/vllm/agg-kvbm/README.md
+# Qwen3-32B: Aggregated + KVBM (single GPU)
+
+Single-GPU aggregated deployment of `Qwen/Qwen3-32B` with the KV Block Manager
+(KVBM) enabled. KVBM offloads cold KV cache blocks to host memory so the
+effective cache footprint extends beyond GPU HBM, which improves prefix-reuse
+hit rate on long or repeated prompts without adding GPUs.
+
+## Hardware
+
+- **1x NVIDIA H200 (141 GB) or B200 (192 GB)**. Qwen3-32B in BF16 is ~64 GB of
+  weights plus KV cache and activations, so 80 GB H100 leaves very little room
+  and is likely to OOM under real load. If you only have H100 80 GB, see
+  `../../../qwen3-32b-fp8/` for the FP8 variant.
+- **≥ ~150 GiB of host memory on the node**. `DYN_KVBM_CPU_CACHE_GB=100` is
+  pinned as page-locked host memory for KVBM's G2 tier. The worker declares
+  `resources.requests.memory: 150Gi` and `resources.limits.memory: 200Gi`
+  (100 GiB pinned KV pool + ~50 GiB headroom for Python, weight-loader
+  working memory, and CUDA/NCCL buffers). If you raise `DYN_KVBM_CPU_CACHE_GB`,
+  scale these up by roughly the same delta.
+
+## Prerequisites
+
+Same as the sibling recipes in this directory:
+
+1. **Dynamo Platform installed** — see the [Kubernetes Deployment Guide](../../../../docs/kubernetes/README.md).
+2. **Pre-existing `model-cache` and `compilation-cache` PVCs** — see
+   [`../../model-cache/cache.yaml`](../../model-cache/cache.yaml) and
+   [`../../model-cache/model-download.yaml`](../../model-cache/model-download.yaml).
+3. **HuggingFace token Secret** named `hf-token-secret` in your namespace.
+
+```bash
+export NAMESPACE=your-namespace
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token" \
+  -n ${NAMESPACE}
+```
+
+## Deploy
+
+```bash
+kubectl apply -f deploy.yaml -n ${NAMESPACE}
+
+kubectl wait --for=condition=ready pod \
+  -l nvidia.com/dynamo-graph-deployment-name=agg-kvbm-qwen3-32b \
+  -n ${NAMESPACE} --timeout=1200s
+```
+
+## Verify
+
+```bash
+kubectl port-forward svc/agg-kvbm-qwen3-32b-frontend 8000:8000 -n ${NAMESPACE}
+
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-32B",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 64
+  }'
+```
+
+## KVBM configuration
+
+The connector is selected through the worker's `--kv-transfer-config`:
+
+```json
+{"kv_connector":"DynamoConnector","kv_role":"kv_both","kv_connector_module_path":"kvbm.vllm_integration.connector"}
+```
+
+Worker env var set by this recipe:
+
+| Variable | Default in this recipe | Description |
+|---|---|---|
+| `DYN_KVBM_CPU_CACHE_GB` | `100` | CPU memory reserved for offloaded KV blocks. Raise for longer contexts or higher reuse; if you change this, also bump `resources.requests.memory` / `limits.memory` on the worker by roughly the same delta. |
+
+### (Optional) Prometheus metrics
+
+Metrics are **off** by default. To expose them, add the following to the
+worker's `env` and `mainContainer.ports` in `deploy.yaml`:
+
+```yaml
+env:
+  - name: DYN_KVBM_METRICS
+    value: "true"
+  - name: DYN_KVBM_METRICS_PORT
+    value: "6880"
+ports:
+  - name: kvbm
+    containerPort: 6880
+```
+
+Once enabled, scrape `:6880/metrics` for counters like
+`kvbm_offload_blocks_d2h`, `kvbm_onboard_blocks_h2d`, `kvbm_matched_tokens`,
+`kvbm_host_cache_hit_rate`, plus per-route transfer counters. If you run the
+Prometheus Operator, add a `PodMonitor` selecting pods with
+`nvidia.com/dynamo-component-type: worker` and port `kvbm`.
+
+## Cleanup
+
+```bash
+kubectl delete dynamographdeployment agg-kvbm-qwen3-32b -n ${NAMESPACE}
+```
--- a/recipes/qwen3-32b/vllm/agg-kvbm/deploy.yaml
+++ b/recipes/qwen3-32b/vllm/agg-kvbm/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: agg-kvbm-qwen3-32b
+spec:
+  pvcs:
+  - create: false
+    name: model-cache
+  - create: false
+    name: compilation-cache
+  services:
+    Frontend:
+      componentType: frontend
+      envs:
+        - name: HF_HOME
+          value: /home/dynamo/.cache/huggingface
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
+          workingDir: /workspace
+          command:
+            - python3
+            - -m
+            - dynamo.frontend
+          args:
+            - --router-reset-states
+      replicas: 1
+      subComponentType: null
+    VllmDecodeWorker:
+      componentType: worker
+      envFromSecret: hf-token-secret
+      volumeMounts:
+      - name: model-cache
+        mountPoint: /home/dynamo/.cache/huggingface
+      - name: compilation-cache
+        mountPoint: /home/dynamo/.cache/vllm
+        useAsCompilationCache: true
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --model
+          - Qwen/Qwen3-32B
+          - --kv-transfer-config
+          - '{"kv_connector":"DynamoConnector","kv_role":"kv_both","kv_connector_module_path":"kvbm.vllm_integration.connector"}'
+          command:
+          - python3
+          - -m
+          - dynamo.vllm
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
+          env:
+          - name: DYN_HEALTH_CHECK_ENABLED
+            value: "false"
+          - name: HF_HOME
+            value: /home/dynamo/.cache/huggingface
+          # NOTE: change this to tune the CPU cache size for your system
+          - name: DYN_KVBM_CPU_CACHE_GB
+            value: "100"
+          workingDir: /workspace
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+          memory: "200Gi"
+        requests:
+          gpu: "1"
+          memory: "150Gi"
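
Note on the sizing rule in the new `recipes/qwen3-32b/vllm/agg-kvbm/README.md`: the recipe pairs `DYN_KVBM_CPU_CACHE_GB=100` with `requests.memory: 150Gi` and `limits.memory: 200Gi`, and advises scaling both by roughly the same delta when the cache size changes. The sketch below is one possible way to express that arithmetic. It is not part of this commit. The helper name `kvbm_worker_memory` and the 50 GiB limit margin are assumptions inferred from the recipe's values, and, like the README, it treats the `GB` value as GiB.

```python
def kvbm_worker_memory(cpu_cache_gb: int,
                       headroom_gib: int = 50,
                       limit_margin_gib: int = 50) -> dict:
    """Suggest worker memory requests/limits for a given KVBM CPU cache size.

    Hypothetical helper, not part of Dynamo or this recipe.
    headroom_gib:     ~50 GiB the README reserves for Python, weight-loader
                      working memory, and CUDA/NCCL buffers.
    limit_margin_gib: assumed gap between requests and limits, inferred
                      from the recipe's 150Gi / 200Gi pair.
    """
    if cpu_cache_gb < 0:
        raise ValueError("cpu_cache_gb must be non-negative")
    request_gib = cpu_cache_gb + headroom_gib
    limit_gib = request_gib + limit_margin_gib
    return {"requests": f"{request_gib}Gi", "limits": f"{limit_gib}Gi"}


# The recipe's default of 100 reproduces the values in deploy.yaml.
print(kvbm_worker_memory(100))  # {'requests': '150Gi', 'limits': '200Gi'}
```

Raising the cache to 150, for example, would suggest `200Gi` / `250Gi` under the same assumptions. Treat the result as a starting point and confirm against actual node capacity.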