feat: Deepseek V3.2 TRTLLM Recipe (#6688)

Signed-off-by: Karen Chung <karenc@nvidia.com>
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>

Unverified Commit dc9f80b3 authored Mar 05, 2026 by Karen Chung Committed by GitHub Mar 05, 2026
8 changed files
--- a/recipes/deepseek-v32-fp4/README.md
+++ b/recipes/deepseek-v32-fp4/README.md
+# DeepSeek V3.2 NVFP4: Aggregated Round Robin vs Disaggregated KV Routing with WideEP
+
+This **GB200 NVL72** recipe for DeepSeek V3.2 demonstrates the performance difference between **aggregated (round-robin) routing** and **disaggregated (KV-aware) routing + WideEP** on a synthetic trace dataset adapted from the [Mooncake FAST25 paper](https://github.com/kvcache-ai/Mooncake).
+
+## Results
+
+https://github.com/user-attachments/assets/fcdb703c-7c1a-4109-a7ca-54196fcef885
+
+## Experiment Overview
+
+We compare two deployment modes on **32x GB200 GPUs across 8 nodes**:
+
+| Mode | Routing | Configuration |
+|------|---------|---------------|
+| **Aggregated** | Round-robin | 4x DEP8 workers |
+| **Disaggregated** | KV-aware | 2x prefill + 2x decode with WideEP (DEP8) |
+
+## Dataset: Mooncake-based Synthetic Coding Trace
+
+The benchmark uses a trace that simulates coding workloads. We synthesize it by increasing the input sequence length and prefix reuse rate of the original [Mooncake conversation trace](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/traces/conversation_trace.jsonl).
+
+To reproduce our benchmark, run Dynamo's [prefix data generator tool](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/prefix_data_generator) on the Mooncake `conversation_trace.jsonl`:
+```bash
+datagen synthesize \
+    --input-file conversation_trace.jsonl \
+    --prefix-len-multiplier 16 \
+    --prompt-len-multiplier 10 \
+    --max-isl 110000 \
+    --num-requests 10000
+# synthesizes `conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl`
+```
+
+The ISL, OSL, and cache-hit statistics of our trace are shown below.
+
+<details>
+<summary>Dataset statistics: Mooncake-based Synthetic Trace</summary>
+
+```
+============================================================
+  DATASET ANALYSIS: Mooncake-based Synthetic Trace
+  ============================================================
+  OVERVIEW
+  ----------------------------------------
+    Total Requests:      10,000
+    Unique Hash Blocks:  430,838
+    Total Hash Blocks:   770,934
+  INPUT SEQUENCE LENGTH (ISL)
+  ----------------------------------------
+    Average:             39,186 tokens
+    Maximum:             109,459 tokens
+    Minimum:             12,801 tokens
+  OUTPUT SEQUENCE LENGTH (OSL)
+  ----------------------------------------
+    Average:             344 tokens
+    Maximum:             2,000 tokens
+    Minimum:             1 tokens
+  KV CACHE / PREFIX REUSE
+  ----------------------------------------
+    Block-level Hit Rate: 44.1%
+    Token-level Hit Rate: 44.0%
+    Avg Context (shared): 22,400 tokens/req
+    Avg Unique Prompt:    16,786 tokens/req
+    Shared Prefix Ratio:  57.2%
+  ============================================================
+
+  Summary:
+  • ~44% KV cache hit rate (block/token level) based on hash_id overlap across requests
+  • ~57% of input tokens come from shared context prefixes
+  • Long-context workload: avg 39K input tokens, up to 109K max
+```
+
+</details>
+
+
+## Prerequisites
+
+1. **Dynamo Platform installed** - See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
+2. **32x GB200 GPUs** across 8 nodes
+3. **Hugging Face token** configured:
+   ```bash
+   export NAMESPACE=your-namespace
+   kubectl create secret generic hf-token-secret \
+     --from-literal=HF_TOKEN="your-token" \
+     -n ${NAMESPACE}
+   ```
+
+## Quick Start
+
+### 1. Create Storage
+
+> **Note:** Edit `model-cache/model-cache.yaml` first and update `storageClassName` to match your cluster (run `kubectl get storageclass` to find available options).
+
+```bash
+kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
+```
+
+### 2. Configure K8s Benchmarking Environment
+For multinode Kubernetes deployments, your cluster may require a ComputeDomain in your namespace so that the DRA scheduler can co-locate worker pods on MNNVL-connected nodes. Without it, internode GPU peer memory access would fail.
+```bash
+kubectl apply -f model-cache/compute-domain.yaml -n ${NAMESPACE}
+```
+If you rename anything in this file, apply the same names in the deployment YAMLs under `extraPodSpec.resourceClaims` and `mainContainer.resources.claims`.
+
+
+### 3. Setup Model and Data
+We use NVIDIA's official NVFP4-quantized checkpoint ([Hugging Face](https://huggingface.co/nvidia/DeepSeek-V3.2-NVFP4)). Copy it into the PVC storage:
+
+```bash
+kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
+```
+Similarly, copy the trace file for the benchmark into the PVC:
+```bash
+# conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl in our case
+kubectl cp <local_trace.jsonl> ${NAMESPACE}/<helper-pod>:/model-cache/traces/
+```
+
+### 4. Deploy & Benchmark
+
+**Option A: Aggregated (Round-Robin Baseline)**
+
+```bash
+# Deploy
+kubectl apply -f trtllm/agg-round-robin/deploy.yaml -n ${NAMESPACE}
+
+# Wait for ready
+kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=agg-round-robin-dsv32-nvfp4 \
+  -n ${NAMESPACE} --timeout=1200s
+
+# Run benchmark
+kubectl apply -f trtllm/agg-round-robin/perf.yaml -n ${NAMESPACE}
+```
+
+**Option B: Disaggregated (KV-Aware Routing)**
+
+```bash
+# Deploy
+kubectl apply -f trtllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}
+
+# Wait for ready
+kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-kv-dsv32-nvfp4 \
+  -n ${NAMESPACE} --timeout=1200s
+
+# Run benchmark
+kubectl apply -f trtllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}
+```
+
+### 5. Monitor Benchmark Progress
+
+The benchmark job writes its progress to the pod log:
+
+```bash
+# Find the benchmark pod (job names end in `-bench`)
+kubectl get pods -n ${NAMESPACE} | grep bench
+
+# Stream the log to see intermediate results
+kubectl logs -f -n ${NAMESPACE} <benchmark-pod-name>
+```
+
+### 6. View Results
+
+Results are saved to the `model-cache` PVC under `/model-cache/perf/<epoch>_<job-name>/`. `kubectl exec` and `kubectl cp` need a running pod, so once the benchmark job has completed, run them against any pod that mounts the `model-cache` PVC:
+
+```bash
+# Check artifact directory
+kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- ls -la /model-cache/perf/
+
+# Copy results to local machine
+kubectl cp ${NAMESPACE}/<benchmark-pod-name>:/model-cache/perf ./benchmark-results
+```
+
+## Expected Results
+
+Since the benchmark uses `--fixed-schedule` (replaying requests at their original timestamps), **throughput metrics are fixed by the trace**, so latency metrics are what we compare:
+
+| Metric | Why It Matters |
+|--------|----------------|
+| **TTFT** (Time to First Token) | KV-aware routing reduces prefill compute via prefix cache hits |
+| **ITL** (Inter-Token Latency) | Disaggregated serving isolates decode from prefill interference |
+| **Total Request Latency** | Combined benefit of both optimizations |
+
+For production contexts, we can further evaluate the deployments with **goodput**, i.e. the rate of requests that satisfy a predetermined service level agreement (SLA). For our experiments, we set the SLA thresholds to TTFT = 20 s and ITL = 50 ms (`TTFT_THRESHOLD_MS=20000` and `ITL_THRESHOLD_MS=50` in `perf.yaml`).
+
+## Cleanup
+
+```bash
+# Delete benchmark jobs
+kubectl delete job agg-round-robin-dsv32-nvfp4-bench disagg-kv-dsv32-nvfp4-bench -n ${NAMESPACE}
+
+# Delete deployments
+kubectl delete dynamographdeployment agg-round-robin-dsv32-nvfp4 -n ${NAMESPACE}
+kubectl delete dynamographdeployment disagg-kv-dsv32-nvfp4 -n ${NAMESPACE}
+```
+
+## References
+
+- [Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving](https://github.com/kvcache-ai/Mooncake) - FAST25 paper and trace data
+- [Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.html) - TRTLLM tech blog on available optimizations for DSV3.2 on GB200
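
The dataset statistics in the README can be cross-checked with a few lines of arithmetic. The sketch below is not the analysis tool that produced the table; it is one plausible reading of "hit rate based on hash_id overlap across requests", assuming an unbounded cache with no eviction and that each trace record carries a `hash_ids` list as in the Mooncake trace format. The function name `block_hit_rate` is hypothetical.

```python
from typing import Iterable, Sequence


def block_hit_rate(requests: Iterable[Sequence[int]]) -> float:
    """Fraction of hash blocks that were already seen earlier in the trace.

    Assumes an unbounded cache with no eviction, so the result is an upper
    bound on the hit rate a real deployment can reach.
    """
    seen: set[int] = set()
    total = 0
    hits = 0
    for hash_ids in requests:
        for block in hash_ids:
            total += 1
            if block in seen:
                hits += 1
            else:
                seen.add(block)
    return hits / total if total else 0.0


# Toy trace: three requests sharing a common prefix of blocks.
toy_trace = [[1, 2, 3], [1, 2, 4], [1, 5]]
print(block_hit_rate(toy_trace))  # 3 repeated blocks out of 8

# Figures copied from the dataset statistics in the README.
TOTAL_BLOCKS = 770_934
UNIQUE_BLOCKS = 430_838
AVG_SHARED_TOKENS = 22_400
AVG_UNIQUE_TOKENS = 16_786

# Under the no-eviction assumption, hits = total blocks - unique blocks.
reported_hit_rate = 1 - UNIQUE_BLOCKS / TOTAL_BLOCKS
shared_prefix_ratio = AVG_SHARED_TOKENS / (AVG_SHARED_TOKENS + AVG_UNIQUE_TOKENS)

print(f"block-level hit rate: {reported_hit_rate:.1%}")
print(f"shared prefix ratio:  {shared_prefix_ratio:.1%}")
```

Both derived values agree with the reported 44.1% block-level hit rate and 57.2% shared prefix ratio, and the shared and unique token averages sum to the reported average ISL of 39,186 tokens.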
+
--- a/recipes/deepseek-v32-fp4/model-cache/compute-domain.yaml
+++ b/recipes/deepseek-v32-fp4/model-cache/compute-domain.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: resource.nvidia.com/v1beta1
+kind: ComputeDomain
+metadata:
+  name: your-compute-domain
+  namespace: your-namespace
+spec:
+  # 0 = on-demand allocation (nodes assigned when pods request them via resourceClaims).
+  numNodes: 0
+  channel:
+    resourceClaimTemplate:
+      name: your-compute-domain-channel
\ No newline at end of file
--- a/recipes/deepseek-v32-fp4/model-cache/model-cache.yaml
+++ b/recipes/deepseek-v32-fp4/model-cache/model-cache.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: model-cache
+spec:
+  accessModes:
+    - ReadWriteMany
+  resources:
+    requests:
+      storage: 400Gi
+  storageClassName: "your-storage-class-name"
\ No newline at end of file
--- a/recipes/deepseek-v32-fp4/model-cache/model-download.yaml
+++ b/recipes/deepseek-v32-fp4/model-cache/model-download.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: model-download
+spec:
+  backoffLimit: 3
+  completions: 1
+  parallelism: 1
+  template:
+    metadata:
+      labels:
+        app: model-download
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: model-download
+          image: python:3.10-slim
+          securityContext:
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop: ["ALL"]
+            seccompProfile:
+              type: RuntimeDefault
+          command: ["sh", "-c"]
+          envFrom:
+            - secretRef:
+                name: hf-token-secret
+          env:
+            - name: MODEL_NAME
+              value: nvidia/DeepSeek-V3.2-NVFP4
+            - name: HF_HOME
+              value: /model-store
+            - name: HF_HUB_ENABLE_HF_TRANSFER
+              value: "1"
+          args:
+            - |
+              set -eux
+              pip install --no-cache-dir huggingface_hub hf_transfer
+              hf download $MODEL_NAME
+          volumeMounts:
+            - name: model-cache
+              mountPath: /model-store
+      volumes:
+      - name: model-cache
+        persistentVolumeClaim:
+          claimName: model-cache
--- a/recipes/deepseek-v32-fp4/trtllm/agg-round-robin/deploy.yaml
+++ b/recipes/deepseek-v32-fp4/trtllm/agg-round-robin/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: agg-round-robin-dsv32-nvfp4
+spec:
+  services:
+    Frontend:
+      componentType: frontend
+      extraPodSpec:
+        containers: null
+        mainContainer:
+          command:
+            - python3
+          args:
+            - -m
+            - dynamo.frontend
+            - --router-mode
+            - round-robin
+            - --router-reset-states
+            - --request-plane
+            - nats
+          env:
+            - name: POD_UID
+              valueFrom:
+                fieldRef:
+                  fieldPath: metadata.uid
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+          name: ""
+          resources: {}
+        # tolerations:     # uncomment to populate any tolerations for the gpu nodes
+      replicas: 1
+    agg:
+      componentType: worker
+      envFromSecret: hf-token-secret
+      extraPodSpec:
+        containers: null
+        mainContainer:
+          args:
+            - --model-path
+            - nvidia/DeepSeek-V3.2-NVFP4
+            - --served-model-name
+            - nvidia/DeepSeek-V3.2-NVFP4
+            - --extra-engine-args
+            - /config/aggregated.yaml
+            - --publish-events-and-metrics
+            - --request-plane
+            - nats
+            - --kv-block-size
+            - "64"
+          command:
+            - python3
+            - -m
+            - dynamo.trtllm
+          env:
+            - name: POD_UID
+              valueFrom:
+                fieldRef:
+                  fieldPath: metadata.uid
+            - name: HF_HOME
+              value: /model-cache
+            - name: TRITON_CACHE_DIR
+              value: /model-cache/.triton-cache
+            - name: NCCL_DEBUG
+              value: INFO
+            - name: NCCL_MNNVL_ENABLE
+              value: "1"
+            - name: NCCL_CUMEM_ENABLE
+              value: "1"
+            - name: NCCL_NVLS_ENABLE
+              value: "1"
+            - name: NVIDIA_GDRCOPY
+              value: "1"
+            - name: UCX_CUDA_IPC_ENABLE_MNNVL
+              value: "1"
+            - name: NCCL_SOCKET_IFNAME
+              value: eth0
+            - name: GLOO_SOCKET_IFNAME
+              value: eth0
+            - name: NCCL_STORE_TIMEOUT
+              value: "7200"
+            - name: TRTLLM_MOE_ENABLE_ALLTOALL_WITHOUT_ALLGATHER
+              value: "1"
+            - name: TRTLLM_ENABLE_PDL
+              value: "1"
+            - name: TRTLLM_SERVER_DISABLE_GC
+              value: "1"
+            - name: TRTLLM_WORKER_DISABLE_GC
+              value: "1"
+            - name: NCCL_GRAPH_MIXING_SUPPORT
+              value: "0"
+            - name: TRTLLM_FORCE_COMM_METHOD
+              value: NVLINK_TWO_SIDED
+            - name: ENABLE_CONFIGURABLE_MOE
+              value: "1"
+            - name: TLLM_LOG_LEVEL
+              value: "INFO"
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+          name: ""
+          resources: {}
+          securityContext:
+            runAsUser: 0
+          startupProbe:
+            failureThreshold: 60
+            httpGet:
+              path: /live
+              port: 9090
+            periodSeconds: 60
+            timeoutSeconds: 5
+          volumeMounts:
+            - mountPath: /model-cache
+              name: model-cache
+            - mountPath: /config
+              name: trtllm-config
+              readOnly: true
+          workingDir: /workspace/
+        nodeSelector:
+          kubernetes.io/arch: arm64
+        # tolerations:     # uncomment to populate any tolerations for the gpu nodes
+        resourceClaims:
+          - name: compute-domain-channel
+            resourceClaimTemplateName: your-compute-domain-channel
+        volumes:
+          - name: model-cache
+            persistentVolumeClaim:
+              claimName: model-cache
+          - name: trtllm-config
+            configMap:
+              name: dsv32-trtllm-config
+      multinode:
+        nodeCount: 2
+      replicas: 4
+      resources:
+        limits:
+          gpu: "4"
+        claims:
+          - name: compute-domain-channel
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: dsv32-trtllm-config
+data:
+  aggregated.yaml: |
+    allreduce_strategy: MNNVL
+    cache_transceiver_config:
+      backend: UCX
+      max_tokens_in_buffer: 120000
+    max_num_tokens: 8192
+    enable_chunked_prefill: true
+    disable_overlap_scheduler: true
+    cuda_graph_config:
+      max_batch_size: 8
+      enable_padding: true
+    enable_attention_dp: true
+    kv_cache_config:
+      dtype: fp8
+      enable_block_reuse: false
+      free_gpu_memory_fraction: 0.9
+      tokens_per_block: 64
+    max_batch_size: 8
+    max_seq_len: 121000
+    moe_config:
+      backend: TRTLLM
+      use_low_precision_moe_combine: true
+    moe_expert_parallel_size: 8
+    num_postprocess_workers: 8
+    print_iter_log: true
+    stream_interval: 10
+    tensor_parallel_size: 8
\ No newline at end of file
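
The replica, node, and GPU counts in the two `deploy.yaml` manifests can be checked against the "32x GB200 GPUs across 8 nodes" figure from the README. The sketch below is a sanity check only. It assumes that `resources.limits.gpu` is a per-node limit and that each replica spans `multinode.nodeCount` nodes; the `WorkerGroup` class and `summarize` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerGroup:
    """One worker service from a DynamoGraphDeployment manifest."""

    name: str
    replicas: int           # spec.services.<name>.replicas
    nodes_per_replica: int  # spec.services.<name>.multinode.nodeCount
    gpus_per_node: int      # spec.services.<name>.resources.limits.gpu (assumed per node)

    @property
    def gpus_per_replica(self) -> int:
        return self.nodes_per_replica * self.gpus_per_node

    @property
    def total_nodes(self) -> int:
        return self.replicas * self.nodes_per_replica

    @property
    def total_gpus(self) -> int:
        return self.replicas * self.gpus_per_replica


def summarize(groups: list[WorkerGroup]) -> tuple[int, int]:
    """Return (total nodes, total GPUs) for a deployment."""
    return (
        sum(g.total_nodes for g in groups),
        sum(g.total_gpus for g in groups),
    )


# Values copied from the two deploy.yaml manifests in this recipe.
aggregated = [WorkerGroup("agg", replicas=4, nodes_per_replica=2, gpus_per_node=4)]
disaggregated = [
    WorkerGroup("prefill", replicas=2, nodes_per_replica=2, gpus_per_node=4),
    WorkerGroup("dec", replicas=2, nodes_per_replica=2, gpus_per_node=4),
]
TENSOR_PARALLEL_SIZE = 8  # tensor_parallel_size in the engine configs

for label, groups in (("aggregated", aggregated), ("disaggregated", disaggregated)):
    nodes, gpus = summarize(groups)
    print(f"{label}: {nodes} nodes, {gpus} GPUs")
    for g in groups:
        # Each replica should expose exactly one GPU per parallel rank.
        assert g.gpus_per_replica == TENSOR_PARALLEL_SIZE, g.name
```

Under these assumptions both modes use the same hardware budget, which is what makes the latency comparison between them meaningful.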
--- a/recipes/deepseek-v32-fp4/trtllm/agg-round-robin/perf.yaml
+++ b/recipes/deepseek-v32-fp4/trtllm/agg-round-robin/perf.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# AIPerf trace-replay benchmark for DeepSeek-V3.2 NVFP4 (agg).
+#
+# Replays requests from a Mooncake-format trace file at their original timestamps
+# using aiperf --custom-dataset-type mooncake_trace --fixed-schedule.
+#
+# Prerequisites:
+#   - DGD deployed and in "normal" or "successful" state
+#   - model-cache PVC exists in your namespace
+#   - Trace file copied to PVC: /model-cache/traces/<trace>.jsonl
+#
+# Results: /model-cache/perf/<epoch>_<job-name>/
+#
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: agg-round-robin-dsv32-nvfp4-bench
+  namespace: your-namespace
+spec:
+  backoffLimit: 1
+  completions: 1
+  parallelism: 1
+  template:
+    metadata:
+      labels:
+        app: agg-round-robin-dsv32-nvfp4-bench
+    spec:
+      affinity:
+        podAntiAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            - labelSelector:
+                matchExpressions:
+                  - key: nvidia.com/dynamo-graph-deployment-name
+                    operator: In
+                    values:
+                      - agg-round-robin-dsv32-nvfp4
+              topologyKey: kubernetes.io/hostname
+      containers:
+      - command:
+        - /bin/bash
+        - -c
+        - |
+          set -euo pipefail
+          ulimit -n 600000
+          echo "File descriptor limit set to: $(ulimit -n)"
+          echo 2097152 > /proc/sys/fs/inotify/max_user_watches 2>/dev/null || true
+          echo 1024 > /proc/sys/fs/inotify/max_user_instances 2>/dev/null || true
+          apt-get update && apt-get install -y curl jq procps git && apt-get clean
+          # pip install git+https://github.com/ai-dynamo/aiperf.git
+          pip install aiperf==0.5.0
+          echo "aiperf installation completed"
+          sysctl -w net.ipv4.ip_local_port_range="1024 65000" 2>/dev/null || true
+          export COLUMNS=200
+          EPOCH=$(date +%s)
+
+          wait_for_model_ready() {
+            echo "Waiting for model '$TARGET_MODEL' at $ENDPOINT/v1/models (checking every 5s)..."
+            while ! curl -sf "http://$ENDPOINT/v1/models" | jq -e --arg model "$TARGET_MODEL" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
+              echo "[$(date '+%H:%M:%S')] Model not ready yet, sleeping 5s..."
+              sleep 5
+            done
+            echo "Model '$TARGET_MODEL' is now available!"
+            curl -s "http://$ENDPOINT/v1/models" | jq .
+          }
+
+          wait_for_model_ready
+          mkdir -p "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}"
+
+          # Validate trace file
+          if [ ! -f "${TRACE_FILE}" ]; then
+            echo "ERROR: Trace file not found: ${TRACE_FILE}"
+            echo "Copy trace to PVC first: kubectl cp <local_trace> your-namespace/<pod>:/model-cache/traces/"
+            exit 1
+          fi
+          TRACE_LINES=$(wc -l < "${TRACE_FILE}")
+          echo "Trace contains ${TRACE_LINES} requests"
+
+          printf '{"deployment":"agg-round-robin-dsv32-nvfp4","model":"%s","trace_file":"%s","trace_requests":%d,"ttft_threshold_ms":%s,"itl_threshold_ms":%s,"endpoint":"%s"}\n' \
+            "nvidia/DeepSeek-V3.2-NVFP4" "${TRACE_FILE}" "${TRACE_LINES}" "${TTFT_THRESHOLD_MS}" "${ITL_THRESHOLD_MS}" "${ENDPOINT}" \
+            > "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/input_config.json"
+
+          TRACE_BASE_NAME="$(basename "${TRACE_FILE}" .jsonl)"
+          export ARTIFACT_DIR="${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/${TRACE_BASE_NAME}"
+          mkdir -p "$ARTIFACT_DIR"
+
+          # Server metrics args
+          SERVER_METRICS_ARGS=()
+          if [ -n "${AIPERF_SERVER_METRICS_URLS:-}" ]; then
+            IFS=',' read -r -a server_metrics_urls <<< "${AIPERF_SERVER_METRICS_URLS}"
+            if [ ${#server_metrics_urls[@]} -gt 0 ]; then
+              SERVER_METRICS_ARGS+=(--server-metrics "${server_metrics_urls[@]}")
+            fi
+          fi
+
+          echo "=============================================="
+          echo "Trace Replay Benchmark (aiperf)"
+          echo "=============================================="
+          echo "Endpoint: http://${ENDPOINT}"
+          echo "Model: nvidia/DeepSeek-V3.2-NVFP4"
+          echo "Trace file: Mooncake-based Synthetic Coding Trace"
+          echo "TTFT Threshold: ${TTFT_THRESHOLD_MS}ms"
+          echo "ITL Threshold: ${ITL_THRESHOLD_MS}ms"
+          echo "Artifact dir: ${ARTIFACT_DIR}"
+          echo "=============================================="
+
+          echo ""
+          echo "Running warmup benchmark..."
+          set +e
+          aiperf profile \
+            -m "nvidia/DeepSeek-V3.2-NVFP4" \
+            --tokenizer "nvidia/DeepSeek-V3.2-NVFP4" \
+            --url "http://${ENDPOINT}" \
+            --streaming \
+            --ui dashboard \
+            --synthetic-input-tokens-mean 10000 \
+            --synthetic-input-tokens-stddev 0 \
+            --output-tokens-mean 200 \
+            --output-tokens-stddev 0 \
+            --extra-inputs "max_tokens:200" \
+            --extra-inputs "min_tokens:200" \
+            --extra-inputs "ignore_eos:true" \
+            --concurrency 4 \
+            --request-count 10
+          echo "Warmup complete"
+
+          # Trace replay
+          echo ""
+          echo "$(date '+%Y-%m-%d %H:%M:%S') - Starting trace replay benchmark"
+          aiperf profile \
+            -m "nvidia/DeepSeek-V3.2-NVFP4" \
+            --tokenizer "nvidia/DeepSeek-V3.2-NVFP4" \
+            --input-file "${TRACE_FILE}" \
+            --custom-dataset-type mooncake_trace \
+            --fixed-schedule \
+            --url "http://${ENDPOINT}" \
+            --streaming \
+            --random-seed 42 \
+            --ui dashboard \
+            --artifact-dir "${ARTIFACT_DIR}" \
+            --workers-max 200 \
+            --request-timeout-seconds 1000 \
+            --profile-export-level records \
+            --record-processors 8 \
+            "${SERVER_METRICS_ARGS[@]}" \
+            --goodput "time_to_first_token:${TTFT_THRESHOLD_MS} inter_token_latency:${ITL_THRESHOLD_MS}"
+
+          BENCH_EXIT_CODE=$?
+          echo ""
+          echo "$(date '+%Y-%m-%d %H:%M:%S') - Benchmark complete (exit code: ${BENCH_EXIT_CODE})"
+          echo "Results: ${ARTIFACT_DIR}"
+          ls -la "${ARTIFACT_DIR}" 2>/dev/null || true
+          echo "Benchmark complete!"
+          exit $BENCH_EXIT_CODE
+        env:
+        - name: TARGET_MODEL
+          value: nvidia/DeepSeek-V3.2-NVFP4
+        - name: ENDPOINT
+          value: agg-round-robin-dsv32-nvfp4-frontend:8000
+        - name: TRACE_FILE
+          value: /model-cache/traces/conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl
+        - name: TTFT_THRESHOLD_MS
+          value: "20000"
+        - name: ITL_THRESHOLD_MS
+          value: "50"
+        - name: AIPERF_HTTP_CONNECTION_LIMIT
+          value: "200"
+        - name: AIPERF_HTTP_SO_RCVTIMEO
+          value: "120"
+        - name: AIPERF_SERVER_METRICS_URLS
+          value: "http://agg-round-robin-dsv32-nvfp4-dec-0-dec-wkr:9090/metrics,http://agg-round-robin-dsv32-nvfp4-prefill-0:9090/metrics"
+        - name: JOB_NAME
+          valueFrom:
+            fieldRef:
+              apiVersion: v1
+              fieldPath: metadata.labels['job-name']
+        - name: ROOT_ARTIFACT_DIR
+          value: /model-cache/perf
+        - name: HF_HOME
+          value: /model-cache
+        - name: PYTHONUNBUFFERED
+          value: "1"
+        image: python:3.12-slim
+        imagePullPolicy: IfNotPresent
+        name: perf
+        securityContext:
+          privileged: true
+        volumeMounts:
+        - name: model-cache
+          mountPath: /model-cache
+        workingDir: /workspace
+      restartPolicy: Never
+      volumes:
+      - name: model-cache
+        persistentVolumeClaim:
+          claimName: model-cache
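
The `--goodput` thresholds passed above (TTFT 20000 ms, ITL 50 ms) define which requests count as "good". The sketch below illustrates the idea on hand-written records; it is not aiperf's implementation. It assumes a request is good when both its TTFT and its mean ITL are at or below the thresholds, and the `RequestRecord` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RequestRecord:
    ttft_ms: float  # time to first token
    itl_ms: float   # mean inter-token latency for the request


def is_good(record: RequestRecord, ttft_threshold_ms: float, itl_threshold_ms: float) -> bool:
    """A request is good only if it meets every SLA threshold (inclusive, by assumption)."""
    return record.ttft_ms <= ttft_threshold_ms and record.itl_ms <= itl_threshold_ms


def goodput(
    records: Iterable[RequestRecord],
    duration_s: float,
    ttft_threshold_ms: float = 20_000.0,
    itl_threshold_ms: float = 50.0,
) -> tuple[float, float]:
    """Return (fraction of good requests, good requests per second)."""
    records = list(records)
    if not records or duration_s <= 0:
        return 0.0, 0.0
    good = sum(is_good(r, ttft_threshold_ms, itl_threshold_ms) for r in records)
    return good / len(records), good / duration_s


# Made-up sample records, for illustration only (not benchmark results).
sample = [
    RequestRecord(ttft_ms=1_500, itl_ms=30),    # good
    RequestRecord(ttft_ms=25_000, itl_ms=20),   # TTFT too slow
    RequestRecord(ttft_ms=900, itl_ms=65),      # ITL too slow
    RequestRecord(ttft_ms=20_000, itl_ms=50),   # exactly on both thresholds
]
fraction, per_second = goodput(sample, duration_s=10.0)
print(f"good fraction: {fraction:.2f}, goodput: {per_second:.2f} req/s")
```

Because the trace is replayed on a fixed schedule, the offered request rate is identical for both deployments, so a difference in goodput reflects latency alone.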
--- a/recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/deploy.yaml
+++ b/recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: disagg-kv-dsv32-nvfp4
+spec:
+  services:
+    Frontend:
+      componentType: frontend
+      extraPodSpec:
+        containers: null
+        mainContainer:
+          command:
+            - python3
+          args:
+            - -m
+            - dynamo.frontend
+            - --router-mode
+            - kv
+            - --router-reset-states
+            - --request-plane
+            - nats
+          env:
+            - name: POD_UID
+              valueFrom:
+                fieldRef:
+                  fieldPath: metadata.uid
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+          name: ""
+          resources: {}
+        nodeSelector:
+          kubernetes.io/arch: arm64
+      replicas: 1
+    prefill:
+      componentType: worker
+      subComponentType: prefill
+      envFromSecret: hf-token-secret
+      extraPodSpec:
+        containers: null
+        mainContainer:
+          args:
+            - --model-path
+            - nvidia/DeepSeek-V3.2-NVFP4
+            - --served-model-name
+            - nvidia/DeepSeek-V3.2-NVFP4
+            - --extra-engine-args
+            - /config/prefill.yaml
+            - --disaggregation-mode
+            - prefill
+            - --publish-events-and-metrics
+            - --request-plane
+            - nats
+            - --kv-block-size
+            - "64"
+          command:
+            - python3
+            - -m
+            - dynamo.trtllm
+          env:
+            - name: POD_UID
+              valueFrom:
+                fieldRef:
+                  fieldPath: metadata.uid
+            - name: HF_HOME
+              value: /model-cache
+            - name: TRITON_CACHE_DIR
+              value: /model-cache/.triton-cache
+            - name: NCCL_DEBUG
+              value: INFO
+            - name: NCCL_MNNVL_ENABLE
+              value: "1"
+            - name: NCCL_CUMEM_ENABLE
+              value: "1"
+            - name: NCCL_NVLS_ENABLE
+              value: "1"
+            - name: NVIDIA_GDRCOPY
+              value: "1"
+            - name: UCX_CUDA_IPC_ENABLE_MNNVL
+              value: "1"
+            - name: NCCL_SOCKET_IFNAME
+              value: eth0
+            - name: GLOO_SOCKET_IFNAME
+              value: eth0
+            - name: NCCL_STORE_TIMEOUT
+              value: "7200"
+            - name: TRTLLM_MOE_ENABLE_ALLTOALL_WITHOUT_ALLGATHER
+              value: "1"
+            - name: TRTLLM_ENABLE_PDL
+              value: "1"
+            - name: TRTLLM_SERVER_DISABLE_GC
+              value: "1"
+            - name: TRTLLM_WORKER_DISABLE_GC
+              value: "1"
+            - name: NCCL_GRAPH_MIXING_SUPPORT
+              value: "0"
+            - name: TLLM_LOG_LEVEL
+              value: "INFO"
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+          name: ""
+          resources: {}
+          securityContext:
+            runAsUser: 0
+          startupProbe:
+            failureThreshold: 60
+            httpGet:
+              path: /live
+              port: 9090
+            periodSeconds: 60
+            timeoutSeconds: 5
+          volumeMounts:
+            - mountPath: /model-cache
+              name: model-cache
+            - mountPath: /config
+              name: trtllm-config
+              readOnly: true
+          workingDir: /workspace/
+        nodeSelector:
+          kubernetes.io/arch: arm64
+        tolerations:
+          - effect: NoSchedule
+            key: dedicated
+            operator: Equal
+            value: user-workload
+          - effect: NoExecute
+            key: dedicated
+            operator: Equal
+            value: user-workload
+          - effect: NoSchedule
+            key: dedicated
+            operator: Equal
+            value: system-workload
+          - effect: NoExecute
+            key: dedicated
+            operator: Equal
+            value: system-workload
+        resourceClaims:
+          - name: compute-domain-channel
+            resourceClaimTemplateName: your-compute-domain-channel
+        volumes:
+          - name: model-cache
+            persistentVolumeClaim:
+              claimName: model-cache
+          - name: trtllm-config
+            configMap:
+              name: dsv32-trtllm-config
+      multinode:
+        nodeCount: 2
+      replicas: 2
+      resources:
+        limits:
+          gpu: "4"
+        claims:
+          - name: compute-domain-channel
+    dec:
+      componentType: worker
+      subComponentType: decode
+      envFromSecret: hf-token-secret
+      extraPodSpec:
+        containers: null
+        mainContainer:
+          args:
+            - --model-path
+            - nvidia/DeepSeek-V3.2-NVFP4
+            - --served-model-name
+            - nvidia/DeepSeek-V3.2-NVFP4
+            - --extra-engine-args
+            - /config/decode.yaml
+            - --disaggregation-mode
+            - decode
+            - --publish-events-and-metrics
+            - --request-plane
+            - nats
+            - --kv-block-size
+            - "64"
+          command:
+            - python3
+            - -m
+            - dynamo.trtllm
+          env:
+            - name: POD_UID
+              valueFrom:
+                fieldRef:
+                  fieldPath: metadata.uid
+            - name: HF_HOME
+              value: /model-cache
+            - name: TRITON_CACHE_DIR
+              value: /model-cache/.triton-cache
+            - name: NCCL_DEBUG
+              value: INFO
+            - name: NCCL_MNNVL_ENABLE
+              value: "1"
+            - name: NCCL_CUMEM_ENABLE
+              value: "1"
+            - name: NCCL_NVLS_ENABLE
+              value: "1"
+            - name: NVIDIA_GDRCOPY
+              value: "1"
+            - name: UCX_CUDA_IPC_ENABLE_MNNVL
+              value: "1"
+            - name: NCCL_SOCKET_IFNAME
+              value: eth0
+            - name: GLOO_SOCKET_IFNAME
+              value: eth0
+            - name: NCCL_STORE_TIMEOUT
+              value: "7200"
+            - name: TRTLLM_MOE_ENABLE_ALLTOALL_WITHOUT_ALLGATHER
+              value: "1"
+            - name: TRTLLM_ENABLE_PDL
+              value: "1"
+            - name: TRTLLM_SERVER_DISABLE_GC
+              value: "1"
+            - name: TRTLLM_WORKER_DISABLE_GC
+              value: "1"
+            - name: ENROOT_ALLOW_DEV
+              value: "yes"
+            - name: NCCL_GRAPH_MIXING_SUPPORT
+              value: "0"
+            - name: TRTLLM_FORCE_COMM_METHOD
+              value: NVLINK_TWO_SIDED
+            - name: ENABLE_CONFIGURABLE_MOE
+              value: "1"
+            - name: TLLM_LOG_LEVEL
+              value: "INFO"
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+          name: ""
+          resources: {}
+          securityContext:
+            runAsUser: 0
+          startupProbe:
+            failureThreshold: 60
+            httpGet:
+              path: /live
+              port: 9090
+            periodSeconds: 60
+            timeoutSeconds: 5
+          volumeMounts:
+            - mountPath: /model-cache
+              name: model-cache
+            - mountPath: /config
+              name: trtllm-config
+              readOnly: true
+          workingDir: /workspace/
+        nodeSelector:
+          kubernetes.io/arch: arm64
+        resourceClaims:
+          - name: compute-domain-channel
+            resourceClaimTemplateName: your-compute-domain-channel
+        volumes:
+          - name: model-cache
+            persistentVolumeClaim:
+              claimName: model-cache
+          - name: trtllm-config
+            configMap:
+              name: dsv32-trtllm-config
+      multinode:
+        nodeCount: 2
+      replicas: 2
+      resources:
+        limits:
+          gpu: "4"
+        claims:
+          - name: compute-domain-channel
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: dsv32-trtllm-config
+data:
+  prefill.yaml: |
+    cache_transceiver_config:
+      backend: UCX
+      max_tokens_in_buffer: 120000
+    max_num_tokens: 8192
+    enable_chunked_prefill: true
+    cuda_graph_config:
+      max_batch_size: 32
+      enable_padding: true
+    disable_overlap_scheduler: true
+    enable_attention_dp: true
+    kv_cache_config:
+      dtype: fp8
+      enable_block_reuse: true
+      free_gpu_memory_fraction: 0.9
+      tokens_per_block: 64
+    max_batch_size: 32
+    max_seq_len: 121000
+    moe_config:
+      backend: TRTLLM
+    moe_expert_parallel_size: 8
+    print_iter_log: true
+    tensor_parallel_size: 8
+  decode.yaml: |
+    allreduce_strategy: MNNVL
+    cache_transceiver_config:
+      backend: UCX
+      max_tokens_in_buffer: 120000
+    max_num_tokens: 8192
+    cuda_graph_config:
+      max_batch_size: 8
+      enable_padding: true
+    enable_attention_dp: true
+    kv_cache_config:
+      dtype: fp8
+      enable_block_reuse: false
+      free_gpu_memory_fraction: 0.9
+      tokens_per_block: 64
+    max_batch_size: 8
+    max_seq_len: 121000
+    moe_config:
+      backend: WIDEEP
+      use_low_precision_moe_combine: true
+    moe_expert_parallel_size: 8
+    num_postprocess_workers: 8
+    print_iter_log: true
+    stream_interval: 10
+    tensor_parallel_size: 8
\ No newline at end of file
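The parallelism settings in the manifest and ConfigMap above have to line up: each worker spans `nodeCount: 2` nodes with `gpu: "4"` per node, and both engine configs set `tensor_parallel_size: 8`. The worker is also started with `--kv-block-size "64"` while the engine uses `tokens_per_block: 64`, and these presumably need to agree. The following is a minimal sketch of that arithmetic, not part of the recipe; the variable names and the `check` helper are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Values copied from the manifest and ConfigMap above.
# Variable names and the check() helper are hypothetical, for illustration only.
node_count=2            # multinode.nodeCount
gpus_per_node=4         # resources.limits.gpu
replicas=2              # replicas of this worker service
tensor_parallel_size=8  # prefill.yaml / decode.yaml
kv_block_size=64        # --kv-block-size worker argument
tokens_per_block=64     # kv_cache_config.tokens_per_block

gpus_per_worker=$((node_count * gpus_per_node))
gpus_for_service=$((gpus_per_worker * replicas))

check() {
  # check <description> <actual> <expected>
  if [ "$2" -eq "$3" ]; then
    echo "OK   $1 ($2)"
  else
    echo "FAIL $1 ($2 != $3)"
    return 1
  fi
}

check "GPUs per worker match tensor_parallel_size" "$gpus_per_worker" "$tensor_parallel_size"
check "worker block size matches engine block size" "$kv_block_size" "$tokens_per_block"
echo "GPUs used by this service: $gpus_for_service"
```

With two such services (prefill and decode) at 16 GPUs each, the total comes to the 32 GPUs the README describes.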
--- a/recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/perf.yaml
+++ b/recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/perf.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# AIPerf trace-replay benchmark for DeepSeek-V3.2 NVFP4 (disagg).
+#
+# Replays requests from a Mooncake-format trace file at their original timestamps
+# using aiperf --custom-dataset-type mooncake_trace --fixed-schedule.
+#
+# Prerequisites:
+#   - DynamoGraphDeployment (DGD) deployed and in "normal" or "successful" state
+#   - model-cache PVC exists in your namespace
+#   - Trace file copied to PVC: /model-cache/traces/<trace>.jsonl
+#
+# Results: /model-cache/perf/<epoch>_<job-name>/
+#
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: disagg-kv-dsv32-nvfp4-bench
+  namespace: your-namespace
+spec:
+  backoffLimit: 1
+  completions: 1
+  parallelism: 1
+  template:
+    metadata:
+      labels:
+        app: disagg-kv-dsv32-nvfp4-bench
+    spec:
+      affinity:
+        podAntiAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            - labelSelector:
+                matchExpressions:
+                  - key: nvidia.com/dynamo-graph-deployment-name
+                    operator: In
+                    values:
+                      - disagg-kv-dsv32-nvfp4
+              topologyKey: kubernetes.io/hostname
+      containers:
+      - command:
+        - /bin/bash
+        - -c
+        - |
+          set -euo pipefail
+          ulimit -n 600000
+          echo "File descriptor limit set to: $(ulimit -n)"
+          echo 2097152 > /proc/sys/fs/inotify/max_user_watches 2>/dev/null || true
+          echo 1024 > /proc/sys/fs/inotify/max_user_instances 2>/dev/null || true
+          apt-get update && apt-get install -y curl jq procps git && apt-get clean
+          # pip install git+https://github.com/ai-dynamo/aiperf.git
+          pip install aiperf==0.5.0
+          echo "aiperf installation completed"
+          sysctl -w net.ipv4.ip_local_port_range="1024 65000" 2>/dev/null || true
+          export COLUMNS=200
+          EPOCH=$(date +%s)
+
+          wait_for_model_ready() {
+            echo "Waiting for model '$TARGET_MODEL' at $ENDPOINT/v1/models (checking every 5s)..."
+            while ! curl -sf "http://$ENDPOINT/v1/models" | jq -e --arg model "$TARGET_MODEL" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
+              echo "[$(date '+%H:%M:%S')] Model not ready yet, sleeping 5s..."
+              sleep 5
+            done
+            echo "Model '$TARGET_MODEL' is now available!"
+            curl -s "http://$ENDPOINT/v1/models" | jq .
+          }
+
+          wait_for_model_ready
+          mkdir -p "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}"
+
+          # Validate trace file
+          if [ ! -f "${TRACE_FILE}" ]; then
+            echo "ERROR: Trace file not found: ${TRACE_FILE}"
+            echo "Copy trace to PVC first: kubectl cp <local_trace> your-namespace/<pod>:/model-cache/traces/"
+            exit 1
+          fi
+          TRACE_LINES=$(wc -l < "${TRACE_FILE}")
+          echo "Trace contains ${TRACE_LINES} requests"
+
+          printf '{"deployment":"disagg-kv-dsv32-nvfp4","model":"%s","trace_file":"%s","trace_requests":%d,"ttft_threshold_ms":%s,"itl_threshold_ms":%s,"endpoint":"%s"}\n' \
+            "nvidia/DeepSeek-V3.2-NVFP4" "${TRACE_FILE}" "${TRACE_LINES}" "${TTFT_THRESHOLD_MS}" "${ITL_THRESHOLD_MS}" "${ENDPOINT}" \
+            > "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/input_config.json"
+
+          TRACE_BASE_NAME="$(basename "${TRACE_FILE}" .jsonl)"
+          export ARTIFACT_DIR="${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/${TRACE_BASE_NAME}"
+          mkdir -p "$ARTIFACT_DIR"
+
+          # Server metrics args
+          SERVER_METRICS_ARGS=()
+          if [ -n "${AIPERF_SERVER_METRICS_URLS:-}" ]; then
+            IFS=',' read -r -a server_metrics_urls <<< "${AIPERF_SERVER_METRICS_URLS}"
+            if [ ${#server_metrics_urls[@]} -gt 0 ]; then
+              SERVER_METRICS_ARGS+=(--server-metrics "${server_metrics_urls[@]}")
+            fi
+          fi
+
+          echo "=============================================="
+          echo "Trace Replay Benchmark (aiperf)"
+          echo "=============================================="
+          echo "Endpoint: http://${ENDPOINT}"
+          echo "Model: nvidia/DeepSeek-V3.2-NVFP4"
+          echo "Trace file: ${TRACE_FILE} (Mooncake-based Synthetic Coding Trace)"
+          echo "TTFT Threshold: ${TTFT_THRESHOLD_MS}ms"
+          echo "ITL Threshold: ${ITL_THRESHOLD_MS}ms"
+          echo "Artifact dir: ${ARTIFACT_DIR}"
+          echo "=============================================="
+
+          echo ""
+          echo "Running warmup benchmark..."
+          aiperf profile \
+            -m "nvidia/DeepSeek-V3.2-NVFP4" \
+            --tokenizer "nvidia/DeepSeek-V3.2-NVFP4" \
+            --url "http://${ENDPOINT}" \
+            --streaming \
+            --ui dashboard \
+            --synthetic-input-tokens-mean 10000 \
+            --synthetic-input-tokens-stddev 0 \
+            --output-tokens-mean 200 \
+            --output-tokens-stddev 0 \
+            --extra-inputs "max_tokens:200" \
+            --extra-inputs "min_tokens:200" \
+            --extra-inputs "ignore_eos:true" \
+            --concurrency 4 \
+            --request-count 10
+          echo "Warmup complete"
+
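+          # Trace replay. Disable errexit so a non-zero aiperf exit code is
+          # captured in BENCH_EXIT_CODE below instead of aborting the script.
+          set +e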
+          echo ""
+          echo "$(date '+%Y-%m-%d %H:%M:%S') - Starting trace replay benchmark"
+          aiperf profile \
+            -m "nvidia/DeepSeek-V3.2-NVFP4" \
+            --tokenizer "nvidia/DeepSeek-V3.2-NVFP4" \
+            --input-file "${TRACE_FILE}" \
+            --custom-dataset-type mooncake_trace \
+            --fixed-schedule \
+            --url "http://${ENDPOINT}" \
+            --streaming \
+            --random-seed 42 \
+            --ui dashboard \
+            --artifact-dir "${ARTIFACT_DIR}" \
+            --workers-max 200 \
+            --request-timeout-seconds 1000 \
+            --profile-export-level records \
+            --record-processors 8 \
+            "${SERVER_METRICS_ARGS[@]}" \
+            --goodput "time_to_first_token:${TTFT_THRESHOLD_MS} inter_token_latency:${ITL_THRESHOLD_MS}"
+
+          BENCH_EXIT_CODE=$?
+          echo ""
+          echo "$(date '+%Y-%m-%d %H:%M:%S') - Benchmark complete (exit code: ${BENCH_EXIT_CODE})"
+          echo "Results: ${ARTIFACT_DIR}"
+          ls -la "${ARTIFACT_DIR}" 2>/dev/null || true
+          echo "Benchmark complete!"
+          set -e
+          exit $BENCH_EXIT_CODE
+        env:
+        - name: TARGET_MODEL
+          value: nvidia/DeepSeek-V3.2-NVFP4
+        - name: ENDPOINT
+          value: disagg-kv-dsv32-nvfp4-frontend:8000
+        - name: TRACE_FILE
+          value: /model-cache/traces/conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl
+        - name: TTFT_THRESHOLD_MS
+          value: "20000"
+        - name: ITL_THRESHOLD_MS
+          value: "50"
+        - name: AIPERF_HTTP_CONNECTION_LIMIT
+          value: "200"
+        - name: AIPERF_HTTP_SO_RCVTIMEO
+          value: "120"
+        - name: AIPERF_SERVER_METRICS_URLS
+          value: "http://disagg-kv-dsv32-nvfp4-dec-0-dec-wkr:9090/metrics,http://disagg-kv-dsv32-nvfp4-prefill-0:9090/metrics"
+        - name: JOB_NAME
+          valueFrom:
+            fieldRef:
+              apiVersion: v1
+              fieldPath: metadata.labels['job-name']
+        - name: ROOT_ARTIFACT_DIR
+          value: /model-cache/perf
+        - name: HF_HOME
+          value: /model-cache
+        - name: PYTHONUNBUFFERED
+          value: "1"
+        image: python:3.12-slim
+        imagePullPolicy: IfNotPresent
+        name: perf
+        securityContext:
+          privileged: true
+        volumeMounts:
+        - name: model-cache
+          mountPath: /model-cache
+        workingDir: /workspace
+      restartPolicy: Never
+      volumes:
+      - name: model-cache
+        persistentVolumeClaim:
+          claimName: model-cache
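The benchmark script enables `set -e` at the top and later reads `$?` after the trace replay. Under `set -e`, a failing command normally ends the script before the next line runs, so the status has to be captured in a context that errexit ignores. The sketch below shows one way to do that with an `||` list; `run_and_capture` and `RC` are hypothetical names, not part of the recipe.

```shell
#!/usr/bin/env bash
# The Job script uses `set -euo pipefail`; only errexit (-e) matters for this sketch.
set -e

run_and_capture() {
  # Runs "$@" and stores its exit status in RC.
  # A command on the left side of `||` does not trigger errexit,
  # so the script keeps running even when the command fails.
  RC=0
  "$@" || RC=$?
}

run_and_capture false
echo "exit code: ${RC}"   # prints "exit code: 1"

run_and_capture true
echo "exit code: ${RC}"   # prints "exit code: 0"
```

The same pattern could wrap the `aiperf profile` invocation so that the artifact listing and the final `exit` still run after a failed benchmark.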