Commit dc9f80b3 authored by Karen Chung, committed by GitHub

feat: Deepseek V3.2 TRTLLM Recipe (#6688)


Signed-off-by: Karen Chung <karenc@nvidia.com>
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
parent c24882ff
# DeepSeek V3.2 NVFP4: Aggregated Round Robin vs Disaggregated KV Routing with WideEP
This **GB200 NVL72** recipe for DeepSeek V3.2 demonstrates the performance difference between **aggregated (round-robin) routing** and **disaggregated (KV-aware) routing + WideEP** on a synthetic trace dataset adapted from the [Mooncake FAST25 paper](https://github.com/kvcache-ai/Mooncake).
## Results
https://github.com/user-attachments/assets/fcdb703c-7c1a-4109-a7ca-54196fcef885
## Experiment Overview
We compare two deployment modes on **32x GB200 GPUs across 8 nodes**:
| Mode | Routing | Configuration |
|------|---------|---------------|
| **Aggregated** | Round-robin | 4x DEP8 workers |
| **Disaggregated** | KV-aware | 2x prefill + 2x decode w/ WideEP (DEP8) |
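As a quick sanity check on the GPU budget, the arithmetic below multiplies out the worker counts from the table. The per-worker figures (2 nodes per worker, 4 GPUs per node) are taken from the `multinode.nodeCount` and `resources.limits.gpu` fields of the deployment manifests in this recipe; the variable names are illustrative only.

```bash
# GPU budget check; values come from this recipe's deploy.yaml files.
gpus_per_node=4        # resources.limits.gpu
nodes_per_worker=2     # multinode.nodeCount
agg_workers=4          # aggregated: 4x DEP8 workers
prefill_workers=2      # disaggregated: 2x prefill workers
decode_workers=2       # disaggregated: 2x decode workers

gpus_per_worker=$(( gpus_per_node * nodes_per_worker ))
agg_gpus=$(( agg_workers * gpus_per_worker ))
disagg_gpus=$(( (prefill_workers + decode_workers) * gpus_per_worker ))
agg_nodes=$(( agg_workers * nodes_per_worker ))

echo "GPUs per worker: ${gpus_per_worker}"
echo "Aggregated:      ${agg_gpus} GPUs on ${agg_nodes} nodes"
echo "Disaggregated:   ${disagg_gpus} GPUs"
```

Both modes occupy the same hardware, so any latency difference comes from routing and disaggregation rather than from extra GPUs.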
## Dataset: Mooncake-based Synthetic Coding Trace
The benchmark uses a trace which simulates coding workloads. We synthesize the trace by increasing the input sequence length and prefix reuse rate of the original [Mooncake conversation trace](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/traces/conversation_trace.jsonl).
To reproduce our benchmark, run Dynamo's [prefix data generator tool](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/prefix_data_generator) on the Mooncake `conversation_trace.jsonl`:
```bash
datagen synthesize \
--input-file conversation_trace.jsonl \
--prefix-len-multiplier 16 \
--prompt-len-multiplier 10 \
--max-isl 110000 \
--num-requests 10000
# synthesizes `conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl`
```
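The `--max-isl 110000` cap has to fit within the engine's sequence limit. The sketch below checks that headroom, assuming the `max_seq_len: 121000` and `max_num_tokens: 8192` values from this recipe's engine configs and the 2,000-token maximum output length reported in the dataset statistics. The chunk count is a rough upper-bound estimate that ignores batching with other requests and any prefix cache hits; the variable names are illustrative.

```bash
max_isl=110000         # --max-isl passed to datagen
max_osl=2000           # maximum OSL from the dataset statistics
max_seq_len=121000     # max_seq_len in the engine config
max_num_tokens=8192    # max_num_tokens in the engine config

# Tokens left over after the longest possible prompt plus the longest output.
headroom=$(( max_seq_len - (max_isl + max_osl) ))

# Ceiling division: prefill iterations needed for a max-length prompt
# when each iteration processes at most max_num_tokens tokens.
prefill_chunks=$(( (max_isl + max_num_tokens - 1) / max_num_tokens ))

echo "Sequence headroom: ${headroom} tokens"
echo "Prefill chunks for a max-length prompt: ${prefill_chunks}"
```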
The ISL, OSL, and cache hit statistics of our trace are shown below.
<details>
<summary>Dataset statistics: Mooncake-based Synthetic Trace</summary>
```
============================================================
DATASET ANALYSIS: Mooncake-based Synthetic Trace
============================================================
OVERVIEW
----------------------------------------
Total Requests: 10,000
Unique Hash Blocks: 430,838
Total Hash Blocks: 770,934
INPUT SEQUENCE LENGTH (ISL)
----------------------------------------
Average: 39,186 tokens
Maximum: 109,459 tokens
Minimum: 12,801 tokens
OUTPUT SEQUENCE LENGTH (OSL)
----------------------------------------
Average: 344 tokens
Maximum: 2,000 tokens
Minimum: 1 tokens
KV CACHE / PREFIX REUSE
----------------------------------------
Block-level Hit Rate: 44.1%
Token-level Hit Rate: 44.0%
Avg Context (shared): 22,400 tokens/req
Avg Unique Prompt: 16,786 tokens/req
Shared Prefix Ratio: 57.2%
============================================================
Summary:
• ~44% KV cache hit rate (block/token level) based on hash_id overlap across requests
• ~57% of input tokens come from shared context prefixes
• Long-context workload: avg 39K input tokens, up to 109K max
```
</details>
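Two of the summary figures can be re-derived from the raw counts in the report. The sketch below assumes that the block-level hit rate is the fraction of hash blocks that repeat an earlier block (one minus unique over total) and that the shared prefix ratio is the average shared context divided by the average ISL. These formulas are an assumption about how the report was computed, chosen because they reproduce the printed numbers.

```bash
# Raw counts copied from the dataset analysis above.
unique_blocks=430838
total_blocks=770934
avg_shared_tokens=22400
avg_isl_tokens=39186

# Assumed formula: hit rate = 1 - unique / total.
block_hit_rate=$(awk -v u="$unique_blocks" -v t="$total_blocks" \
  'BEGIN { printf "%.1f", (1 - u / t) * 100 }')

# Assumed formula: shared prefix ratio = avg shared context / avg ISL.
shared_prefix_ratio=$(awk -v s="$avg_shared_tokens" -v i="$avg_isl_tokens" \
  'BEGIN { printf "%.1f", s / i * 100 }')

echo "Block-level hit rate: ${block_hit_rate}%"
echo "Shared prefix ratio:  ${shared_prefix_ratio}%"
```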
## Prerequisites
1. **Dynamo Platform installed** - See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **32x GB200 GPUs** across 8 nodes
3. **HuggingFace token** configured:
```bash
export NAMESPACE=your-namespace
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token" \
-n ${NAMESPACE}
```
## Quick Start
### 1. Create Storage
> **Note:** Edit `model-cache/model-cache.yaml` first and update `storageClassName` to match your cluster (run `kubectl get storageclass` to find available options).
```bash
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
```
### 2. Configure Kubernetes Benchmarking Environment
For multinode Kubernetes deployments, your cluster may require a ComputeDomain in your namespace so that the DRA (Dynamic Resource Allocation) scheduler can co-locate worker pods on nodes connected by multi-node NVLink (MNNVL). Without it, inter-node GPU peer memory access would fail.
```bash
kubectl apply -f model-cache/compute-domain.yaml -n ${NAMESPACE}
```
If you rename the ComputeDomain or its channel in this file, apply the same names in the deployment YAMLs under `extraPodSpec.resourceClaims` and `mainContainer.resources.claims`.
### 3. Setup Model and Data
We use NVIDIA's official NVFP4-quantized checkpoint ([Hugging Face](https://huggingface.co/nvidia/DeepSeek-V3.2-NVFP4)). Download it into the PVC storage:
```bash
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
```
Similarly, copy the trace file for the benchmark into the PVC:
```bash
# conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl in our case
kubectl cp <local_trace.jsonl> ${NAMESPACE}/<helper-pod>:/model-cache/traces/
```
### 4. Deploy & Benchmark
**Option A: Aggregated (Round-Robin Baseline)**
```bash
# Deploy
kubectl apply -f trtllm/agg-round-robin/deploy.yaml -n ${NAMESPACE}
# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=agg-round-robin-dsv32-nvfp4 \
-n ${NAMESPACE} --timeout=1200s
# Run benchmark
kubectl apply -f trtllm/agg-round-robin/perf.yaml -n ${NAMESPACE}
```
**Option B: Disaggregated (KV-Aware Routing)**
```bash
# Deploy
kubectl apply -f trtllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}
# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-kv-dsv32-nvfp4 \
-n ${NAMESPACE} --timeout=1200s
# Run benchmark
kubectl apply -f trtllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}
```
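The `--timeout=1200s` passed to `kubectl wait` is shorter than the startup window the worker pods are allowed. The deployment manifests in this recipe set the worker `startupProbe` to `failureThreshold: 60` with `periodSeconds: 60`, so under standard Kubernetes probe semantics a worker may take roughly an hour to load before it is restarted. A small sketch of that comparison (variable names are illustrative):

```bash
failure_threshold=60       # startupProbe.failureThreshold in deploy.yaml
period_seconds=60          # startupProbe.periodSeconds in deploy.yaml
wait_timeout_seconds=1200  # --timeout passed to kubectl wait

# Approximate time a pod may spend starting before the probe gives up.
startup_budget_seconds=$(( failure_threshold * period_seconds ))

if (( wait_timeout_seconds < startup_budget_seconds )); then
  echo "kubectl wait (${wait_timeout_seconds}s) can expire before the startup budget (${startup_budget_seconds}s)"
fi
```

If `kubectl wait` times out while the pods are still starting, re-running the same command is harmless.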
### 5. Monitor Benchmark Progress
The benchmark runs as a Kubernetes Job, so you can follow its progress through the pod logs:
```bash
# Find the benchmark pod (job names end in -bench)
kubectl get pods -n ${NAMESPACE} | grep bench
# Stream the benchmark output to see intermediate results
kubectl logs -f -n ${NAMESPACE} <benchmark-pod-name>
```
### 6. View Results
Results are saved to the `model-cache` PVC under `/model-cache/perf/<epoch>_<job-name>/`. The benchmark pod exits when the run finishes, so read the results from any pod that mounts the PVC (for example, the helper pod used to copy the trace):
```bash
# Check artifact directory
kubectl exec -it -n ${NAMESPACE} <helper-pod> -- ls -la /model-cache/perf/
# Copy results to local machine
kubectl cp ${NAMESPACE}/<helper-pod>:/model-cache/perf ./benchmark-results
```
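The benchmark job manifests in this recipe build the artifact path from `ROOT_ARTIFACT_DIR`, the start time in epoch seconds, the job name, and the trace file's base name. The sketch below reproduces that composition outside the cluster so you know which directory to look for; the `EPOCH` value is a made-up placeholder, since the job uses `$(date +%s)` at run time.

```bash
ROOT_ARTIFACT_DIR=/model-cache/perf
JOB_NAME=agg-round-robin-dsv32-nvfp4-bench
TRACE_FILE=/model-cache/traces/conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl
EPOCH=1700000000   # hypothetical placeholder for $(date +%s)

# Strip the directory and the .jsonl suffix, as the job script does.
TRACE_BASE_NAME="$(basename "${TRACE_FILE}" .jsonl)"
ARTIFACT_DIR="${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/${TRACE_BASE_NAME}"

echo "${ARTIFACT_DIR}"
```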
## Expected Results
Since the benchmark uses `--fixed-schedule` (replaying requests at their original timestamps), **throughput metrics are fixed by the trace**, so latency metrics are what we compare:
| Metric | Why It Matters |
|--------|----------------|
| **TTFT** (Time to First Token) | KV-aware routing reduces prefill compute via prefix cache hits |
| **ITL** (Inter-Token Latency) | Disaggregated serving isolates decode from prefill interference |
| **Total Request Latency** | Combined benefit of both optimizations |
For production contexts, we can further evaluate the deployments with **goodput**, i.e. the rate of requests that satisfy a predetermined service level agreement (SLA). For our experiments, the SLA is a TTFT of at most 20 s and an ITL of at most 50 ms.
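To make the goodput criterion concrete, here is a minimal sketch that counts how many requests meet both thresholds. The input format (one `<ttft_ms> <itl_ms>` pair per line) and the `goodput_count` helper are hypothetical and are not part of aiperf, which computes goodput itself from the `--goodput` flag. Dividing the resulting count by the benchmark duration would give a rate.

```bash
# Hypothetical helper: reads "<ttft_ms> <itl_ms>" lines on stdin and prints
# "<requests meeting both thresholds>/<total requests>".
goodput_count() {
  local ttft_max_ms=$1 itl_max_ms=$2
  awk -v t="$ttft_max_ms" -v i="$itl_max_ms" '
    BEGIN { good = 0; total = 0 }
    NF == 2 { total++; if ($1 <= t && $2 <= i) good++ }
    END { printf "%d/%d\n", good, total }'
}

# Four made-up requests checked against the 20000 ms TTFT and 50 ms ITL SLA.
printf '%s\n' "1500 32" "25000 41" "18000 55" "900 48" | goodput_count 20000 50
```

A request counts only if it meets both thresholds, so a fast first token does not compensate for slow decoding.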
## Cleanup
```bash
# Delete benchmark jobs
kubectl delete job agg-round-robin-dsv32-nvfp4-bench disagg-kv-dsv32-nvfp4-bench -n ${NAMESPACE}
# Delete deployments
kubectl delete dynamographdeployment agg-round-robin-dsv32-nvfp4 -n ${NAMESPACE}
kubectl delete dynamographdeployment disagg-kv-dsv32-nvfp4 -n ${NAMESPACE}
```
## References
- [Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving](https://github.com/kvcache-ai/Mooncake) - FAST25 paper and trace data
- [Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.html) - TRTLLM tech blog on available optimizations for DSV3.2 on GB200
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: your-compute-domain
  namespace: your-namespace
spec:
  # 0 = on-demand allocation (nodes assigned when pods request them via resourceClaims).
  numNodes: 0
  channel:
    resourceClaimTemplate:
      name: your-compute-domain-channel
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 400Gi
  storageClassName: "your-storage-class-name"
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  backoffLimit: 3
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: model-download
    spec:
      restartPolicy: Never
      containers:
        - name: model-download
          image: python:3.10-slim
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
            seccompProfile:
              type: RuntimeDefault
          command: ["sh", "-c"]
          envFrom:
            - secretRef:
                name: hf-token-secret
          env:
            - name: MODEL_NAME
              value: nvidia/DeepSeek-V3.2-NVFP4
            - name: HF_HOME
              value: /model-store
            - name: HF_HUB_ENABLE_HF_TRANSFER
              value: "1"
          args:
            - |
              set -eux
              pip install --no-cache-dir huggingface_hub hf_transfer
              hf download $MODEL_NAME
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: agg-round-robin-dsv32-nvfp4
spec:
services:
Frontend:
componentType: frontend
extraPodSpec:
containers: null
mainContainer:
command:
- python3
args:
- -m
- dynamo.frontend
- --router-mode
- round-robin
- --router-reset-states
- --request-plane
- nats
env:
- name: POD_UID
valueFrom:
fieldRef:
fieldPath: metadata.uid
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
name: ""
resources: {}
# tolerations: # uncomment to populate any tolerations for the gpu nodes
replicas: 1
agg:
componentType: worker
envFromSecret: hf-token-secret
extraPodSpec:
containers: null
mainContainer:
args:
- --model-path
- nvidia/DeepSeek-V3.2-NVFP4
- --served-model-name
- nvidia/DeepSeek-V3.2-NVFP4
- --extra-engine-args
- /config/aggregated.yaml
- --publish-events-and-metrics
- --request-plane
- nats
- --kv-block-size
- "64"
command:
- python3
- -m
- dynamo.trtllm
env:
- name: POD_UID
valueFrom:
fieldRef:
fieldPath: metadata.uid
- name: HF_HOME
value: /model-cache
- name: TRITON_CACHE_DIR
value: /model-cache/.triton-cache
- name: NCCL_DEBUG
value: INFO
- name: NCCL_MNNVL_ENABLE
value: "1"
- name: NCCL_CUMEM_ENABLE
value: "1"
- name: NCCL_NVLS_ENABLE
value: "1"
- name: NVIDIA_GDRCOPY
value: "1"
- name: UCX_CUDA_IPC_ENABLE_MNNVL
value: "1"
- name: NCCL_SOCKET_IFNAME
value: eth0
- name: GLOO_SOCKET_IFNAME
value: eth0
- name: NCCL_STORE_TIMEOUT
value: "7200"
- name: TRTLLM_MOE_ENABLE_ALLTOALL_WITHOUT_ALLGATHER
value: "1"
- name: TRTLLM_ENABLE_PDL
value: "1"
- name: TRTLLM_SERVER_DISABLE_GC
value: "1"
- name: TRTLLM_WORKER_DISABLE_GC
value: "1"
- name: NCCL_GRAPH_MIXING_SUPPORT
value: "0"
- name: TRTLLM_FORCE_COMM_METHOD
value: NVLINK_TWO_SIDED
- name: ENABLE_CONFIGURABLE_MOE
value: "1"
- name: TLLM_LOG_LEVEL
value: "INFO"
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
name: ""
resources: {}
securityContext:
runAsUser: 0
startupProbe:
failureThreshold: 60
httpGet:
path: /live
port: 9090
periodSeconds: 60
timeoutSeconds: 5
volumeMounts:
- mountPath: /model-cache
name: model-cache
- mountPath: /config
name: trtllm-config
readOnly: true
workingDir: /workspace/
nodeSelector:
kubernetes.io/arch: arm64
# tolerations: # uncomment to populate any tolerations for the gpu nodes
resourceClaims:
- name: compute-domain-channel
resourceClaimTemplateName: your-compute-domain-channel
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache
- name: trtllm-config
configMap:
name: dsv32-trtllm-config
multinode:
nodeCount: 2
replicas: 4
resources:
limits:
gpu: "4"
claims:
- name: compute-domain-channel
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dsv32-trtllm-config
data:
  aggregated.yaml: |
    allreduce_strategy: MNNVL
    cache_transceiver_config:
      backend: UCX
      max_tokens_in_buffer: 120000
    max_num_tokens: 8192
    enable_chunked_prefill: true
    disable_overlap_scheduler: true
    cuda_graph_config:
      max_batch_size: 8
      enable_padding: true
    enable_attention_dp: true
    kv_cache_config:
      dtype: fp8
      enable_block_reuse: false
      free_gpu_memory_fraction: 0.9
      tokens_per_block: 64
    max_batch_size: 8
    max_seq_len: 121000
    moe_config:
      backend: TRTLLM
      use_low_precision_moe_combine: true
    moe_expert_parallel_size: 8
    num_postprocess_workers: 8
    print_iter_log: true
    stream_interval: 10
    tensor_parallel_size: 8
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# AIPerf trace-replay benchmark for DeepSeek-V3.2 NVFP4 (agg).
#
# Replays requests from a Mooncake-format trace file at their original timestamps
# using aiperf --custom-dataset-type mooncake_trace --fixed-schedule.
#
# Prerequisites:
# - DGD deployed and in "normal" or "successful" state
# - model-cache PVC exists in your namespace
# - Trace file copied to PVC: /model-cache/traces/<trace>.jsonl
#
# Results: /model-cache/perf/<epoch>_<job-name>/
#
apiVersion: batch/v1
kind: Job
metadata:
name: agg-round-robin-dsv32-nvfp4-bench
namespace: your-namespace
spec:
backoffLimit: 1
completions: 1
parallelism: 1
template:
metadata:
labels:
app: agg-round-robin-dsv32-nvfp4-bench
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: nvidia.com/dynamo-graph-deployment-name
operator: In
values:
- agg-round-robin-dsv32-nvfp4
topologyKey: kubernetes.io/hostname
containers:
- command:
- /bin/bash
- -c
- |
set -euo pipefail
ulimit -n 600000
echo "File descriptor limit set to: $(ulimit -n)"
echo 2097152 > /proc/sys/fs/inotify/max_user_watches 2>/dev/null || true
echo 1024 > /proc/sys/fs/inotify/max_user_instances 2>/dev/null || true
apt-get update && apt-get install -y curl jq procps git && apt-get clean
# pip install git+https://github.com/ai-dynamo/aiperf.git
pip install aiperf==0.5.0
echo "aiperf installation completed"
sysctl -w net.ipv4.ip_local_port_range="1024 65000" 2>/dev/null || true
export COLUMNS=200
EPOCH=$(date +%s)
wait_for_model_ready() {
echo "Waiting for model '$TARGET_MODEL' at $ENDPOINT/v1/models (checking every 5s)..."
while ! curl -sf "http://$ENDPOINT/v1/models" | jq -e --arg model "$TARGET_MODEL" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
echo "[$(date '+%H:%M:%S')] Model not ready yet, sleeping 5s..."
sleep 5
done
echo "Model '$TARGET_MODEL' is now available!"
curl -s "http://$ENDPOINT/v1/models" | jq .
}
wait_for_model_ready
mkdir -p "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}"
# Validate trace file
if [ ! -f "${TRACE_FILE}" ]; then
echo "ERROR: Trace file not found: ${TRACE_FILE}"
echo "Copy trace to PVC first: kubectl cp <local_trace> your-namespace/<pod>:/model-cache/traces/"
exit 1
fi
TRACE_LINES=$(wc -l < "${TRACE_FILE}")
echo "Trace contains ${TRACE_LINES} requests"
printf '{"deployment":"agg-round-robin-dsv32-nvfp4","model":"%s","trace_file":"%s","trace_requests":%d,"ttft_threshold_ms":%s,"itl_threshold_ms":%s,"endpoint":"%s"}\n' \
"nvidia/DeepSeek-V3.2-NVFP4" "${TRACE_FILE}" "${TRACE_LINES}" "${TTFT_THRESHOLD_MS}" "${ITL_THRESHOLD_MS}" "${ENDPOINT}" \
> "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/input_config.json"
TRACE_BASE_NAME="$(basename "${TRACE_FILE}" .jsonl)"
export ARTIFACT_DIR="${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/${TRACE_BASE_NAME}"
mkdir -p "$ARTIFACT_DIR"
# Server metrics args
SERVER_METRICS_ARGS=()
if [ -n "${AIPERF_SERVER_METRICS_URLS:-}" ]; then
IFS=',' read -r -a server_metrics_urls <<< "${AIPERF_SERVER_METRICS_URLS}"
if [ ${#server_metrics_urls[@]} -gt 0 ]; then
SERVER_METRICS_ARGS+=(--server-metrics "${server_metrics_urls[@]}")
fi
fi
echo "=============================================="
echo "Trace Replay Benchmark (aiperf)"
echo "=============================================="
echo "Endpoint: http://${ENDPOINT}"
echo "Model: nvidia/DeepSeek-V3.2-NVFP4"
echo "Trace file: Mooncake-based Synthetic Coding Trace"
echo "TTFT Threshold: ${TTFT_THRESHOLD_MS}ms"
echo "ITL Threshold: ${ITL_THRESHOLD_MS}ms"
echo "Artifact dir: ${ARTIFACT_DIR}"
echo "=============================================="
echo ""
echo "Running warmup benchmark..."
set +e
aiperf profile \
-m "nvidia/DeepSeek-V3.2-NVFP4" \
--tokenizer "nvidia/DeepSeek-V3.2-NVFP4" \
--url "http://${ENDPOINT}" \
--streaming \
--ui dashboard \
--synthetic-input-tokens-mean 10000 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 200 \
--output-tokens-stddev 0 \
--extra-inputs "max_tokens:200" \
--extra-inputs "min_tokens:200" \
--extra-inputs "ignore_eos:true" \
--concurrency 4 \
--request-count 10
echo "Warmup complete"
# Trace replay
echo ""
echo "$(date '+%Y-%m-%d %H:%M:%S') - Starting trace replay benchmark"
aiperf profile \
-m "nvidia/DeepSeek-V3.2-NVFP4" \
--tokenizer "nvidia/DeepSeek-V3.2-NVFP4" \
--input-file "${TRACE_FILE}" \
--custom-dataset-type mooncake_trace \
--fixed-schedule \
--url "http://${ENDPOINT}" \
--streaming \
--random-seed 42 \
--ui dashboard \
--artifact-dir "${ARTIFACT_DIR}" \
--workers-max 200 \
--request-timeout-seconds 1000 \
--profile-export-level records \
--record-processors 8 \
"${SERVER_METRICS_ARGS[@]}" \
--goodput "time_to_first_token:${TTFT_THRESHOLD_MS} inter_token_latency:${ITL_THRESHOLD_MS}"
BENCH_EXIT_CODE=$?
echo ""
echo "$(date '+%Y-%m-%d %H:%M:%S') - Benchmark complete (exit code: ${BENCH_EXIT_CODE})"
echo "Results: ${ARTIFACT_DIR}"
ls -la "${ARTIFACT_DIR}" 2>/dev/null || true
echo "Benchmark complete!"
set -e
exit $BENCH_EXIT_CODE
env:
- name: TARGET_MODEL
value: nvidia/DeepSeek-V3.2-NVFP4
- name: ENDPOINT
value: agg-round-robin-dsv32-nvfp4-frontend:8000
- name: TRACE_FILE
value: /model-cache/traces/conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl
- name: TTFT_THRESHOLD_MS
value: "20000"
- name: ITL_THRESHOLD_MS
value: "50"
- name: AIPERF_HTTP_CONNECTION_LIMIT
value: "200"
- name: AIPERF_HTTP_SO_RCVTIMEO
value: "120"
- name: AIPERF_SERVER_METRICS_URLS
value: "http://agg-round-robin-dsv32-nvfp4-dec-0-dec-wkr:9090/metrics,http://agg-round-robin-dsv32-nvfp4-prefill-0:9090/metrics"
- name: JOB_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.labels['job-name']
- name: ROOT_ARTIFACT_DIR
value: /model-cache/perf
- name: HF_HOME
value: /model-cache
- name: PYTHONUNBUFFERED
value: "1"
image: python:3.12-slim
imagePullPolicy: IfNotPresent
name: perf
securityContext:
privileged: true
volumeMounts:
- name: model-cache
mountPath: /model-cache
workingDir: /workspace
restartPolicy: Never
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: disagg-kv-dsv32-nvfp4
spec:
services:
Frontend:
componentType: frontend
extraPodSpec:
containers: null
mainContainer:
command:
- python3
args:
- -m
- dynamo.frontend
- --router-mode
- kv
- --router-reset-states
- --request-plane
- nats
env:
- name: POD_UID
valueFrom:
fieldRef:
fieldPath: metadata.uid
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
name: ""
resources: {}
nodeSelector:
kubernetes.io/arch: arm64
replicas: 1
prefill:
componentType: worker
subComponentType: prefill
envFromSecret: hf-token-secret
extraPodSpec:
containers: null
mainContainer:
args:
- --model-path
- nvidia/DeepSeek-V3.2-NVFP4
- --served-model-name
- nvidia/DeepSeek-V3.2-NVFP4
- --extra-engine-args
- /config/prefill.yaml
- --disaggregation-mode
- prefill
- --publish-events-and-metrics
- --request-plane
- nats
- --kv-block-size
- "64"
command:
- python3
- -m
- dynamo.trtllm
env:
- name: POD_UID
valueFrom:
fieldRef:
fieldPath: metadata.uid
- name: HF_HOME
value: /model-cache
- name: TRITON_CACHE_DIR
value: /model-cache/.triton-cache
- name: NCCL_DEBUG
value: INFO
- name: NCCL_MNNVL_ENABLE
value: "1"
- name: NCCL_CUMEM_ENABLE
value: "1"
- name: NCCL_NVLS_ENABLE
value: "1"
- name: NVIDIA_GDRCOPY
value: "1"
- name: UCX_CUDA_IPC_ENABLE_MNNVL
value: "1"
- name: NCCL_SOCKET_IFNAME
value: eth0
- name: GLOO_SOCKET_IFNAME
value: eth0
- name: NCCL_STORE_TIMEOUT
value: "7200"
- name: TRTLLM_MOE_ENABLE_ALLTOALL_WITHOUT_ALLGATHER
value: "1"
- name: TRTLLM_ENABLE_PDL
value: "1"
- name: TRTLLM_SERVER_DISABLE_GC
value: "1"
- name: TRTLLM_WORKER_DISABLE_GC
value: "1"
- name: NCCL_GRAPH_MIXING_SUPPORT
value: "0"
- name: TLLM_LOG_LEVEL
value: "INFO"
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
name: ""
resources: {}
securityContext:
runAsUser: 0
startupProbe:
failureThreshold: 60
httpGet:
path: /live
port: 9090
periodSeconds: 60
timeoutSeconds: 5
volumeMounts:
- mountPath: /model-cache
name: model-cache
- mountPath: /config
name: trtllm-config
readOnly: true
workingDir: /workspace/
nodeSelector:
kubernetes.io/arch: arm64
tolerations:
- effect: NoSchedule
key: dedicated
operator: Equal
value: user-workload
- effect: NoExecute
key: dedicated
operator: Equal
value: user-workload
- effect: NoSchedule
key: dedicated
operator: Equal
value: system-workload
- effect: NoExecute
key: dedicated
operator: Equal
value: system-workload
resourceClaims:
- name: compute-domain-channel
resourceClaimTemplateName: your-compute-domain-channel
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache
- name: trtllm-config
configMap:
name: dsv32-trtllm-config
multinode:
nodeCount: 2
replicas: 2
resources:
limits:
gpu: "4"
claims:
- name: compute-domain-channel
dec:
componentType: worker
subComponentType: decode
envFromSecret: hf-token-secret
extraPodSpec:
containers: null
mainContainer:
args:
- --model-path
- nvidia/DeepSeek-V3.2-NVFP4
- --served-model-name
- nvidia/DeepSeek-V3.2-NVFP4
- --extra-engine-args
- /config/decode.yaml
- --disaggregation-mode
- decode
- --publish-events-and-metrics
- --request-plane
- nats
- --kv-block-size
- "64"
command:
- python3
- -m
- dynamo.trtllm
env:
- name: POD_UID
valueFrom:
fieldRef:
fieldPath: metadata.uid
- name: HF_HOME
value: /model-cache
- name: TRITON_CACHE_DIR
value: /model-cache/.triton-cache
- name: NCCL_DEBUG
value: INFO
- name: NCCL_MNNVL_ENABLE
value: "1"
- name: NCCL_CUMEM_ENABLE
value: "1"
- name: NCCL_NVLS_ENABLE
value: "1"
- name: NVIDIA_GDRCOPY
value: "1"
- name: UCX_CUDA_IPC_ENABLE_MNNVL
value: "1"
- name: NCCL_SOCKET_IFNAME
value: eth0
- name: GLOO_SOCKET_IFNAME
value: eth0
- name: NCCL_STORE_TIMEOUT
value: "7200"
- name: TRTLLM_MOE_ENABLE_ALLTOALL_WITHOUT_ALLGATHER
value: "1"
- name: TRTLLM_ENABLE_PDL
value: "1"
- name: TRTLLM_SERVER_DISABLE_GC
value: "1"
- name: TRTLLM_WORKER_DISABLE_GC
value: "1"
- name: ENROOT_ALLOW_DEV
value: "yes"
- name: NCCL_GRAPH_MIXING_SUPPORT
value: "0"
- name: TRTLLM_FORCE_COMM_METHOD
value: NVLINK_TWO_SIDED
- name: ENABLE_CONFIGURABLE_MOE
value: "1"
- name: TLLM_LOG_LEVEL
value: "INFO"
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
name: ""
resources: {}
securityContext:
runAsUser: 0
startupProbe:
failureThreshold: 60
httpGet:
path: /live
port: 9090
periodSeconds: 60
timeoutSeconds: 5
volumeMounts:
- mountPath: /model-cache
name: model-cache
- mountPath: /config
name: trtllm-config
readOnly: true
workingDir: /workspace/
nodeSelector:
kubernetes.io/arch: arm64
resourceClaims:
- name: compute-domain-channel
resourceClaimTemplateName: your-compute-domain-channel
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache
- name: trtllm-config
configMap:
name: dsv32-trtllm-config
multinode:
nodeCount: 2
replicas: 2
resources:
limits:
gpu: "4"
claims:
- name: compute-domain-channel
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dsv32-trtllm-config
data:
  prefill.yaml: |
    cache_transceiver_config:
      backend: UCX
      max_tokens_in_buffer: 120000
    max_num_tokens: 8192
    enable_chunked_prefill: true
    cuda_graph_config:
      max_batch_size: 32
      enable_padding: true
    disable_overlap_scheduler: true
    enable_attention_dp: true
    kv_cache_config:
      dtype: fp8
      enable_block_reuse: true
      free_gpu_memory_fraction: 0.9
      tokens_per_block: 64
    max_batch_size: 32
    max_seq_len: 121000
    moe_config:
      backend: TRTLLM
    moe_expert_parallel_size: 8
    print_iter_log: true
    tensor_parallel_size: 8
  decode.yaml: |
    allreduce_strategy: MNNVL
    cache_transceiver_config:
      backend: UCX
      max_tokens_in_buffer: 120000
    max_num_tokens: 8192
    cuda_graph_config:
      max_batch_size: 8
      enable_padding: true
    enable_attention_dp: true
    kv_cache_config:
      dtype: fp8
      enable_block_reuse: false
      free_gpu_memory_fraction: 0.9
      tokens_per_block: 64
    max_batch_size: 8
    max_seq_len: 121000
    moe_config:
      backend: WIDEEP
      use_low_precision_moe_combine: true
    moe_expert_parallel_size: 8
    num_postprocess_workers: 8
    print_iter_log: true
    stream_interval: 10
    tensor_parallel_size: 8
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# AIPerf trace-replay benchmark for DeepSeek-V3.2 NVFP4 (disagg).
#
# Replays requests from a Mooncake-format trace file at their original timestamps
# using aiperf --custom-dataset-type mooncake_trace --fixed-schedule.
#
# Prerequisites:
# - DGD deployed and in "normal" or "successful" state
# - model-cache PVC exists in your namespace
# - Trace file copied to PVC: /model-cache/traces/<trace>.jsonl
#
# Results: /model-cache/perf/<epoch>_<job-name>/
#
apiVersion: batch/v1
kind: Job
metadata:
name: disagg-kv-dsv32-nvfp4-bench
namespace: your-namespace
spec:
backoffLimit: 1
completions: 1
parallelism: 1
template:
metadata:
labels:
app: disagg-kv-dsv32-nvfp4-bench
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: nvidia.com/dynamo-graph-deployment-name
operator: In
values:
- disagg-kv-dsv32-nvfp4
topologyKey: kubernetes.io/hostname
containers:
- command:
- /bin/bash
- -c
- |
set -euo pipefail
ulimit -n 600000
echo "File descriptor limit set to: $(ulimit -n)"
echo 2097152 > /proc/sys/fs/inotify/max_user_watches 2>/dev/null || true
echo 1024 > /proc/sys/fs/inotify/max_user_instances 2>/dev/null || true
apt-get update && apt-get install -y curl jq procps git && apt-get clean
# pip install git+https://github.com/ai-dynamo/aiperf.git
pip install aiperf==0.5.0
echo "aiperf installation completed"
sysctl -w net.ipv4.ip_local_port_range="1024 65000" 2>/dev/null || true
export COLUMNS=200
EPOCH=$(date +%s)
wait_for_model_ready() {
echo "Waiting for model '$TARGET_MODEL' at $ENDPOINT/v1/models (checking every 5s)..."
while ! curl -sf "http://$ENDPOINT/v1/models" | jq -e --arg model "$TARGET_MODEL" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
echo "[$(date '+%H:%M:%S')] Model not ready yet, sleeping 5s..."
sleep 5
done
echo "Model '$TARGET_MODEL' is now available!"
curl -s "http://$ENDPOINT/v1/models" | jq .
}
wait_for_model_ready
mkdir -p "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}"
# Validate trace file
if [ ! -f "${TRACE_FILE}" ]; then
echo "ERROR: Trace file not found: ${TRACE_FILE}"
echo "Copy trace to PVC first: kubectl cp <local_trace> your-namespace/<pod>:/model-cache/traces/"
exit 1
fi
TRACE_LINES=$(wc -l < "${TRACE_FILE}")
echo "Trace contains ${TRACE_LINES} requests"
printf '{"deployment":"disagg-kv-dsv32-nvfp4","model":"%s","trace_file":"%s","trace_requests":%d,"ttft_threshold_ms":%s,"itl_threshold_ms":%s,"endpoint":"%s"}\n' \
"nvidia/DeepSeek-V3.2-NVFP4" "${TRACE_FILE}" "${TRACE_LINES}" "${TTFT_THRESHOLD_MS}" "${ITL_THRESHOLD_MS}" "${ENDPOINT}" \
> "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/input_config.json"
TRACE_BASE_NAME="$(basename "${TRACE_FILE}" .jsonl)"
export ARTIFACT_DIR="${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/${TRACE_BASE_NAME}"
mkdir -p "$ARTIFACT_DIR"
# Server metrics args
SERVER_METRICS_ARGS=()
if [ -n "${AIPERF_SERVER_METRICS_URLS:-}" ]; then
IFS=',' read -r -a server_metrics_urls <<< "${AIPERF_SERVER_METRICS_URLS}"
if [ ${#server_metrics_urls[@]} -gt 0 ]; then
SERVER_METRICS_ARGS+=(--server-metrics "${server_metrics_urls[@]}")
fi
fi
echo "=============================================="
echo "Trace Replay Benchmark (aiperf)"
echo "=============================================="
echo "Endpoint: http://${ENDPOINT}"
echo "Model: nvidia/DeepSeek-V3.2-NVFP4"
echo "Trace file: Mooncake-based Synthetic Coding Trace"
echo "TTFT Threshold: ${TTFT_THRESHOLD_MS}ms"
echo "ITL Threshold: ${ITL_THRESHOLD_MS}ms"
echo "Artifact dir: ${ARTIFACT_DIR}"
echo "=============================================="
echo ""
echo "Running warmup benchmark..."
# Disable errexit so the aiperf exit code can be captured in BENCH_EXIT_CODE below.
set +e
aiperf profile \
-m "nvidia/DeepSeek-V3.2-NVFP4" \
--tokenizer "nvidia/DeepSeek-V3.2-NVFP4" \
--url "http://${ENDPOINT}" \
--streaming \
--ui dashboard \
--synthetic-input-tokens-mean 10000 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 200 \
--output-tokens-stddev 0 \
--extra-inputs "max_tokens:200" \
--extra-inputs "min_tokens:200" \
--extra-inputs "ignore_eos:true" \
--concurrency 4 \
--request-count 10
echo "Warmup complete"
# Trace replay
echo ""
echo "$(date '+%Y-%m-%d %H:%M:%S') - Starting trace replay benchmark"
aiperf profile \
-m "nvidia/DeepSeek-V3.2-NVFP4" \
--tokenizer "nvidia/DeepSeek-V3.2-NVFP4" \
--input-file "${TRACE_FILE}" \
--custom-dataset-type mooncake_trace \
--fixed-schedule \
--url "http://${ENDPOINT}" \
--streaming \
--random-seed 42 \
--ui dashboard \
--artifact-dir "${ARTIFACT_DIR}" \
--workers-max 200 \
--request-timeout-seconds 1000 \
--profile-export-level records \
--record-processors 8 \
"${SERVER_METRICS_ARGS[@]}" \
--goodput "time_to_first_token:${TTFT_THRESHOLD_MS} inter_token_latency:${ITL_THRESHOLD_MS}"
BENCH_EXIT_CODE=$?
echo ""
echo "$(date '+%Y-%m-%d %H:%M:%S') - Benchmark complete (exit code: ${BENCH_EXIT_CODE})"
echo "Results: ${ARTIFACT_DIR}"
ls -la "${ARTIFACT_DIR}" 2>/dev/null || true
echo "Benchmark complete!"
set -e
exit $BENCH_EXIT_CODE
env:
- name: TARGET_MODEL
value: nvidia/DeepSeek-V3.2-NVFP4
- name: ENDPOINT
value: disagg-kv-dsv32-nvfp4-frontend:8000
- name: TRACE_FILE
value: /model-cache/traces/conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl
- name: TTFT_THRESHOLD_MS
value: "20000"
- name: ITL_THRESHOLD_MS
value: "50"
- name: AIPERF_HTTP_CONNECTION_LIMIT
value: "200"
- name: AIPERF_HTTP_SO_RCVTIMEO
value: "120"
- name: AIPERF_SERVER_METRICS_URLS
value: "http://disagg-kv-dsv32-nvfp4-dec-0-dec-wkr:9090/metrics,http://disagg-kv-dsv32-nvfp4-prefill-0:9090/metrics"
- name: JOB_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.labels['job-name']
- name: ROOT_ARTIFACT_DIR
value: /model-cache/perf
- name: HF_HOME
value: /model-cache
- name: PYTHONUNBUFFERED
value: "1"
image: python:3.12-slim
imagePullPolicy: IfNotPresent
name: perf
securityContext:
privileged: true
volumeMounts:
- name: model-cache
mountPath: /model-cache
workingDir: /workspace
restartPolicy: Never
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache