Commit edd50f64 authored by Neal Vaidya, committed by GitHub

docs: Add Nemotron-3-Super-FP8 deployment recipes (#7216)


Signed-off-by: Neal Vaidya <nealv@nvidia.com>
parent 5101f08c
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Dynamo Production-Ready Recipes
Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo.
...@@ -40,9 +45,20 @@ These recipes demonstrate aggregated or disaggregated serving:
*1: Please use `deepseek-r1/model-cache/model-download-sglang.yaml` to download the model into the PVC.
### Non-Optimized Recipes
These recipes demonstrate functional deployments with Dynamo features, but have not yet been tuned for best performance or paired with benchmark manifests.
| Model | Framework | Mode | GPUs | Deployment | Notes |
|-------|-----------|-------|------|------------|-------|
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | TP=4, KV-aware routing |
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/agg/)** | SGLang | Aggregated | 4x H100/H200 | ✅ | TP=4, KV-aware routing, requires Dynamo 1.0+ |
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, UCX KV transfer |
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/disagg/)** | SGLang | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, nixl KV transfer, requires Dynamo 1.0+ |
**Legend:**
- **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
- **Benchmark Recipe**: In the production-ready table above, ✅ = Includes `perf.yaml` for running AIPerf benchmarks | ❌ = No benchmark recipe provided
## Recipe Structure
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Nemotron-3-Super FP8 Recipes
Functional deployments for **nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8** (~124B hybrid Mamba/Attention/MoE) across multiple backends.
These recipes target **Dynamo 1.0**. See [Dynamo 0.9.1 Compatibility](#dynamo-091-compatibility) for notes on running with older containers.
## Available Configurations
| Configuration | GPUs | Backend | Mode | Description |
|--------------|------|---------|------|-------------|
| [**vllm/agg**](vllm/agg/) | 4x H100/H200 | vLLM | Aggregated | TP=4, KV-aware routing |
| [**sglang/agg**](sglang/agg/) | 4x H100/H200 | SGLang | Aggregated | TP=4, KV-aware routing (not working on 0.9.1) |
| [**trtllm/disagg**](trtllm/disagg/) | 4x H100/H200 | TensorRT-LLM | Disaggregated | TP=2 P/D split, UCX KV transfer |
| [**sglang/disagg**](sglang/disagg/) | 4x H100/H200 | SGLang | Disaggregated | TP=2 P/D split, nixl KV transfer (not working on 0.9.1) |
## Prerequisites
1. **Dynamo Platform installed** -- See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GPU cluster** with 4x H100 80GB (or H200) GPUs
3. **HuggingFace token** with access to NVIDIA models
## Quick Start
```bash
# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}
# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token-here" \
-n ${NAMESPACE}
# Download model (update storageClassName in model-cache.yaml first!)
kubectl apply -f model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
# Deploy (choose one configuration)
kubectl apply -f vllm/agg/deploy.yaml -n ${NAMESPACE}
# OR: kubectl apply -f trtllm/disagg/deploy.yaml -n ${NAMESPACE}
# OR: kubectl apply -f sglang/agg/deploy.yaml -n ${NAMESPACE}
# OR: kubectl apply -f sglang/disagg/deploy.yaml -n ${NAMESPACE}
```
## Test the Deployment
```bash
# Port-forward the frontend
# If deployed vllm/agg:
kubectl port-forward svc/nemotron-super-fp8-vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
# If deployed trtllm/disagg:
# kubectl port-forward svc/nemotron-super-fp8-trtllm-disagg-frontend 8000:8000 -n ${NAMESPACE}
# If deployed sglang/agg:
# kubectl port-forward svc/nemotron-super-fp8-sglang-agg-frontend 8000:8000 -n ${NAMESPACE}
# If deployed sglang/disagg:
# kubectl port-forward svc/nemotron-super-fp8-sglang-disagg-frontend 8000:8000 -n ${NAMESPACE}
# Basic chat (with reasoning)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
# Tool calling
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
"messages": [{"role": "user", "content": "What is the weather in SF?"}],
"tools": [{"type": "function", "function": {"name": "get_weather", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}, "required": ["location"]}}}],
"max_tokens": 256
}'
# Disable thinking (only works with nemotron_nano reasoning parser in 1.0+)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
"messages": [{"role": "user", "content": "What is 2+2?"}],
"chat_template_kwargs": {"enable_thinking": false},
"max_tokens": 64
}'
```
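The same requests can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `send_chat` are hypothetical, and it assumes the port-forward above is active on `localhost:8000`.

```python
import json
import urllib.request

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8"


def build_chat_payload(prompt, max_tokens=100, enable_thinking=None):
    """Build the JSON body used by the curl examples above."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    # Only send chat_template_kwargs when the caller asks for it,
    # mirroring the "Disable thinking" curl example.
    if enable_thinking is not None:
        payload["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return payload


def send_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to the frontend and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)
```

For example, `send_chat(build_chat_payload("What is 2+2?", max_tokens=64, enable_thinking=False))` corresponds to the last curl call.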
## Model Details
- **Model**: `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8`
- **Architecture**: Nemotron-H (hybrid Mamba/Attention/MoE, 88 layers)
- **Parameters**: ~124B total (~119B FP8, ~4.7B BF16)
- **Quantization**: ModelOpt FP8 (F8_E4M3) with FP8 KV cache
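As a rough sanity check on GPU sizing, the parameter split above can be turned into a weights-only memory estimate. This sketch assumes 1 byte per FP8 parameter and 2 bytes per BF16 parameter, and deliberately ignores KV cache, Mamba state, activations, and quantization scales, so treat the numbers as a lower bound rather than a measurement.

```python
GB = 1e9  # decimal gigabytes

FP8_PARAMS = 119e9   # ~119B parameters stored in FP8 (1 byte each)
BF16_PARAMS = 4.7e9  # ~4.7B parameters stored in BF16 (2 bytes each)


def weight_bytes(fp8_params, bf16_params):
    """Approximate bytes needed for the weights alone."""
    return fp8_params * 1 + bf16_params * 2


def per_gpu_weight_gb(total_bytes, tensor_parallel_size):
    """Weights per GPU if they are sharded evenly across the TP group."""
    return total_bytes / tensor_parallel_size / GB


total = weight_bytes(FP8_PARAMS, BF16_PARAMS)
for tp in (4, 2):
    print(f"TP={tp}: ~{per_gpu_weight_gb(total, tp):.1f} GB of weights per GPU")
```

Under these assumptions the weights come to roughly 128 GB in total, which is about 32 GB per GPU at TP=4 and about 64 GB per GPU at TP=2.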
## Parser Configuration
All recipes include tool call and reasoning parsers:
- `--dyn-reasoning-parser nemotron_nano` -- Extracts `<think>...</think>` into `reasoning_content`. Correctly handles both `enable_thinking: true` and `enable_thinking: false`.
- `--dyn-tool-call-parser nemotron_nano` -- Parses `<tool_call><function=name>` into structured `tool_calls`.
To disable reasoning at request time, pass `"chat_template_kwargs": {"enable_thinking": false}`. The model also supports `"chat_template_kwargs": {"low_effort": true}` for lighter-weight reasoning.
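To illustrate what the reasoning parser produces, here is a simplified sketch of splitting `<think>...</think>` out of raw model text. It is not Dynamo's `nemotron_nano` parser; the function name `split_reasoning` is hypothetical, and it handles only a single complete block in a non-streaming response.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Separate a leading reasoning block from the user-visible answer."""
    match = _THINK_RE.search(text)
    if match is None:
        # No reasoning block, e.g. when thinking is disabled.
        return {"reasoning_content": None, "content": text.strip()}
    reasoning = match.group(1).strip()
    content = (text[:match.start()] + text[match.end():]).strip()
    # Treat an empty <think></think> pair as "no reasoning".
    return {"reasoning_content": reasoning or None, "content": content}
```

With thinking enabled, the text inside the tags lands in `reasoning_content`; with it disabled, `reasoning_content` stays empty and the full answer remains in `content`.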
## Routing
- **vLLM** and **SGLang** recipes use **approximate KV-aware routing** (`--router-mode kv --no-kv-events` on the frontend). The frontend uses prefix hashing to route requests to workers most likely to have relevant KV cache blocks, which helps workloads with shared system prompts or multi-turn conversations.
- The **TensorRT-LLM** disaggregated recipe uses **round-robin routing**. Nemotron-H on TRT-LLM still requires `enable_block_reuse: false`, so KV overlap routing does not provide a real cache-reuse benefit here and only adds misleading overlap bookkeeping.
The vLLM and SGLang variants use approximate (hash-based) routing because hybrid Mamba+Attention models do not yet have a reliable KV-event path in these recipes. The event-based path would otherwise be configured with `--kv-events-config` for vLLM/SGLang or `--publish-events-and-metrics` for TRT-LLM.
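The idea behind approximate routing can be sketched as follows. This is an illustration of prefix hashing in general, not Dynamo's router: the class `ApproxKvRouter`, the block size, and the tie-breaking rule are all assumptions. The router remembers which block-prefix hashes it has already sent to each worker and prefers the worker with the longest matching prefix, with no feedback from the backend.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block (hypothetical value for illustration)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chained hashes: hash i covers all tokens in blocks 0..i."""
    hashes = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % block_size  # ignore a partial last block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        running.update((",".join(map(str, block)) + ";").encode("utf-8"))
        hashes.append(running.copy().hexdigest())
    return hashes


class ApproxKvRouter:
    def __init__(self, workers):
        self.seen = {w: set() for w in workers}  # hashes routed to each worker
        self.load = {w: 0 for w in workers}      # requests routed so far

    def _overlap(self, worker, hashes):
        """Number of leading blocks this worker has probably cached."""
        count = 0
        for h in hashes:
            if h not in self.seen[worker]:
                break
            count += 1
        return count

    def route(self, tokens):
        hashes = block_hashes(tokens)
        # Prefer the longest predicted prefix hit, then the least-loaded worker.
        best = max(self.seen, key=lambda w: (self._overlap(w, hashes), -self.load[w]))
        self.seen[best].update(hashes)
        self.load[best] += 1
        return best
```

Two requests that share a system prompt produce the same leading hashes and therefore land on the same worker, while an unrelated request falls through to the less-loaded one.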
## Backend-Specific Notes
### vLLM
- No connector flags needed in 1.0 (default is no connector)
- Requires `--is-decode-worker` to skip KV event publisher setup
- Requires `--mamba-cache-mode align` to work around [vllm#34865](https://github.com/vllm-project/vllm/issues/34865): prefix caching with the default `mamba_cache_mode="all"` produces NaN logprobs and garbage tokens for Nemotron-H. Fixed in vLLM 0.17.0 ([vllm#34874](https://github.com/vllm-project/vllm/pull/34874)); the 1.0 container ships vLLM 0.16.0, so the workaround is needed.
- **Attention backend**: On Hopper the default (`FLASH_ATTN`) is safe. On Blackwell, vLLM defaults to FlashInfer, which has a [stale NaN bug](https://github.com/vllm-project/vllm/issues/35138) with hybrid Mamba models ([vllm#35219](https://github.com/vllm-project/vllm/pull/35219)). For Blackwell, specify `--attention-backend FLASH_ATTN` or `--attention-backend TRITON_ATTN` to avoid the issue.
- Sets `VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm` to avoid a [hang during CUDA graph capture](https://github.com/vllm-project/vllm/issues/35772) with TP>1. This is the [new default](https://github.com/vllm-project/vllm/pull/35793) in later vLLM versions but must be set explicitly in 0.16.0.
### TensorRT-LLM
- Uses PyTorch backend (`backend: pytorch` in engine config)
- Block reuse is still not supported for Nemotron-H / Mamba hybrid cache. Set `enable_block_reuse: false` explicitly in all TRT-LLM Nemotron configs. If the field is omitted, current TRT-LLM builds may still start only because the Nemotron model class silently applies a model default of `enable_block_reuse: false`; block reuse is not actually active.
- The TRT-LLM disaggregated recipe uses `--router-mode round-robin` rather than KV routing. With block reuse disabled, KV-overlap scoring does not correspond to a real runtime win for Nemotron-H.
- **Disaggregated mode** requires `cache_transceiver_config: backend: UCX`. NIXL and MOONCAKE backends do not support hybrid models with Mamba SSM state — only UCX (or MPI) can transfer both attention KV cache and Mamba conv/SSM state between workers.
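The two TRT-LLM constraints above can be checked before deploying. The sketch below assumes the engine config has already been parsed into a Python dict (for example with a YAML library); the function name `check_trtllm_hybrid_config` is hypothetical, and the keys are the ones used in this recipe's ConfigMaps.

```python
def check_trtllm_hybrid_config(cfg):
    """Return a list of problems for a Nemotron-H disaggregated engine config."""
    problems = []

    # Block reuse must be explicitly disabled for the Mamba hybrid cache.
    kv_cfg = cfg.get("kv_cache_config", {})
    if kv_cfg.get("enable_block_reuse") is not False:
        problems.append("kv_cache_config.enable_block_reuse must be set to false")

    # Only UCX or MPI can transfer both attention KV cache and Mamba state.
    backend = cfg.get("cache_transceiver_config", {}).get("backend")
    if backend not in ("UCX", "MPI"):
        problems.append("cache_transceiver_config.backend must be UCX or MPI")

    return problems
```

An empty list means the config satisfies both constraints; a config that omits `enable_block_reuse` is flagged on purpose, since the recipe asks for the field to be explicit.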
### SGLang
- Requires SGLang >= v0.5.9 (Dynamo 1.0 ships v0.5.9; Dynamo 0.9.1 ships v0.5.8, which has blocking bugs)
- **Disaggregated mode works** with nixl KV transfer (TP=2 per worker, 2 GPUs each). Mooncake (`--disaggregation-transfer-backend mooncake`) is also supported as an alternative transfer backend.
- Known issue: prefill warmup logs `Prefill warmup failed: 'SamplingParams' object is not subscriptable` -- non-blocking, does not affect functionality
## Dynamo 0.9.1 Compatibility
These recipes target Dynamo 1.0. To run on 0.9.1 containers, the following changes are needed:
### vLLM (`vllm-runtime:0.9.1`)
- Change image tags from `:1.0.0` to `:0.9.1`
- **Add** `--connector none` to worker args (required in 0.9.1 to disable nixl KV connector; rejected in 1.0)
- Change `--dyn-reasoning-parser` from `nemotron_nano` to `deepseek_r1` (nemotron_nano reasoning parser is broken in 0.9.1)
- `enable_thinking: false` will **not work** with `deepseek_r1` parser (response content goes to `reasoning_content`, `content` is null)
- `--mamba-cache-mode align` is still needed (0.9.1 ships vLLM 0.14.1, also affected by [vllm#34865](https://github.com/vllm-project/vllm/issues/34865))
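The argument changes for vLLM on 0.9.1 can be expressed as a small transformation over the worker `args` list. This is only a sketch with a hypothetical helper name, `vllm_args_for_091`; it does not touch image tags, which still need to be changed in the manifest.

```python
def vllm_args_for_091(args):
    """Rewrite Dynamo 1.0 vLLM worker args for a 0.9.1 container."""
    out = list(args)  # do not mutate the caller's list

    # The nemotron_nano reasoning parser is broken in 0.9.1; fall back to deepseek_r1.
    for i in range(len(out) - 1):
        if out[i] == "--dyn-reasoning-parser" and out[i + 1] == "nemotron_nano":
            out[i + 1] = "deepseek_r1"

    # 0.9.1 needs the connector disabled explicitly (1.0 rejects this flag).
    if "--connector" not in out:
        out += ["--connector", "none"]

    return out
```

The tool-call parser and `--mamba-cache-mode align` are left unchanged, matching the list above.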
### TensorRT-LLM (`tensorrtllm-runtime:0.9.1`)
- Change image tags from `:1.0.0` to `:0.9.1`
- Change `--dyn-reasoning-parser` from `nemotron_nano` to `deepseek_r1`
- Same `enable_thinking: false` caveat as vLLM above
- Keep `enable_block_reuse: false` in `kv_cache_config` in the ConfigMap. This is still the effective setting for Nemotron-H on current TRT-LLM builds; omitting the field can appear to work only because TRT-LLM silently applies the same model default later.
### SGLang (`sglang-runtime:0.9.1`)
- **Not supported.** The bundled sglang v0.5.8 has two blocking bugs:
1. FP8 quantization bug (`ModelOptFp8LinearMethod.create_weights()` signature mismatch)
2. Config format mismatch (`hybrid_override_pattern` vs `layers_block_type`)
- Both are fixed in sglang v0.5.9 but the 0.9.1 container ships v0.5.8
## Notes
- **Disaggregated mode**: Supported with TRT-LLM via UCX (`trtllm/disagg`) and SGLang via nixl or mooncake (`sglang/disagg`). Not supported with vLLM due to hybrid KV cache incompatibilities. TRT-LLM disagg requires UCX because NIXL/MOONCAKE cannot transfer Mamba SSM state.
- **Storage class**: Update `storageClassName` in `model-cache/model-cache.yaml` before deploying.
- **Model size**: ~240GB download; expect 30-60 minutes depending on bandwidth.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 300Gi
  storageClassName: "your-storage-class-name"
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  backoffLimit: 3
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: model-download
    spec:
      restartPolicy: Never
      containers:
        - name: model-download
          image: python:3.10-slim
          command: ["sh", "-c"]
          envFrom:
            - secretRef:
                name: hf-token-secret
          env:
            - name: MODEL_NAME
              value: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
            - name: HF_HOME
              value: /model-store
            - name: HF_HUB_ENABLE_HF_TRANSFER
              value: "1"
          args:
            - |
              set -eux
              pip install --no-cache-dir huggingface_hub hf_transfer
              hf download $MODEL_NAME
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# NOTE: This recipe requires dynamo 1.0+ with sglang >= v0.5.9.
#
# NOT working on dynamo 0.9.1 (sglang v0.5.8) due to two blocking bugs:
#
# 1. FP8 quantization bug: ModelOptFp8LinearMethod.create_weights() missing
#    input_size/output_size parameters, causing TypeError on model load.
#    Fixed in sglang v0.5.9 (commit 0ff24159a5).
#
# 2. Config format mismatch: sglang expects hybrid_override_pattern (string)
#    but the model provides layers_block_type (list). Workaround: patch the
#    model's config.json to add a hybrid_override_pattern field.
#
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: nemotron-super-fp8-sglang-agg
spec:
  backendFramework: sglang
  envs:
    - name: HF_HOME
      value: /opt/models
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0
          # Approximate KV-aware routing: uses prefix hashing to route
          # requests to workers likely to have relevant KV cache, without
          # requiring KV events from the backend.
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.frontend --router-mode kv --no-kv-events --http-port 8000
    SglangWorker:
      componentType: worker
      envFromSecret: hf-token-secret
      replicas: 1
      resources:
        limits:
          gpu: "4"
        requests:
          gpu: "4"
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 16Gi
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0
          workingDir: /workspace
          command:
            - python3
            - -m
            - dynamo.sglang
          args:
            - --model-path
            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
            - --served-model-name
            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
            - --tp
            - "4"
            - --trust-remote-code
            - --dyn-tool-call-parser
            - nemotron_nano
            - --dyn-reasoning-parser
            - nemotron_nano
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Disaggregated SGLang deployment: prefill/decode split with nixl KV transfer.
# Tested with dynamo 1.0 (SGLang 0.5.9).
#
# Uses TP=2 per worker (prefill: 2 GPUs, decode: 2 GPUs) for a total of 4 GPUs.
# KV cache is transferred between workers via nixl (GPU-direct).
#
# NOT working on dynamo 0.9.1: same blocking bugs as sglang/agg.
#
# Known issue: Prefill warmup logs a non-blocking warning:
#   "Prefill warmup failed: 'SamplingParams' object is not subscriptable"
# This does not affect functionality.
#
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: nemotron-super-fp8-sglang-disagg
spec:
  backendFramework: sglang
  envs:
    - name: HF_HOME
      value: /opt/models
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.frontend --router-mode kv --no-kv-events --http-port 8000
    prefill:
      componentType: worker
      subComponentType: prefill
      envFromSecret: hf-token-secret
      replicas: 1
      resources:
        limits:
          gpu: "2"
        requests:
          gpu: "2"
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 16Gi
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0
          workingDir: /workspace
          command:
            - python3
            - -m
            - dynamo.sglang
          args:
            - --model-path
            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
            - --served-model-name
            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
            - --tp
            - "2"
            - --trust-remote-code
            - --disaggregation-mode
            - prefill
            - --disaggregation-bootstrap-port
            - "12345"
            - --disaggregation-transfer-backend
            - nixl
            - --host
            - 0.0.0.0
            - --dyn-tool-call-parser
            - nemotron_nano
            - --dyn-reasoning-parser
            - nemotron_nano
    decode:
      componentType: worker
      subComponentType: decode
      envFromSecret: hf-token-secret
      replicas: 1
      resources:
        limits:
          gpu: "2"
        requests:
          gpu: "2"
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 16Gi
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0
          workingDir: /workspace
          command:
            - python3
            - -m
            - dynamo.sglang
          args:
            - --model-path
            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
            - --served-model-name
            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
            - --tp
            - "2"
            - --trust-remote-code
            - --disaggregation-mode
            - decode
            - --disaggregation-bootstrap-port
            - "12345"
            - --disaggregation-transfer-backend
            - nixl
            - --host
            - 0.0.0.0
            - --dyn-tool-call-parser
            - nemotron_nano
            - --dyn-reasoning-parser
            - nemotron_nano
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Disaggregated TRT-LLM deployment: prefill/decode split with UCX KV transfer.
# Tested with dynamo 1.0 (TRT-LLM PyTorch backend).
#
# Uses TP=2 per worker (prefill: 2 GPUs, decode: 2 GPUs) for a total of 4 GPUs.
# KV cache (attention + Mamba SSM state) is transferred between workers via UCX.
#
# IMPORTANT: Must use UCX backend, not NIXL. NIXL and MOONCAKE backends do not
# support hybrid models with Mamba SSM state:
#   ValueError: NIXL or MOONCAKE backend does not support hybrid models with
#   RNN (Mamba) states. Please use UCX or MPI backend for cache transfer with
#   hybrid models.
#
# Dynamo 0.9.1 compatibility notes:
# - Change image tags from :1.0.0 to :0.9.1
# - Change --dyn-reasoning-parser from nemotron_nano to deepseek_r1
# - With deepseek_r1 parser, enable_thinking: false will not work correctly
# - Keep enable_block_reuse: false in both kv_cache_config blocks. Current
#   TRT-LLM builds still disable block reuse for Nemotron-H / Mamba hybrid cache.
#
apiVersion: v1
kind: ConfigMap
metadata:
  name: nemotron-super-prefill-config
data:
  config.yaml: |
    backend: pytorch
    tensor_parallel_size: 2
    moe_expert_parallel_size: 1
    enable_attention_dp: false
    enable_chunked_prefill: true
    max_batch_size: 16
    max_num_tokens: 8192
    trust_remote_code: true
    kv_cache_config:
      free_gpu_memory_fraction: 0.85
      # Nemotron-H uses a Mamba hybrid cache. Block reuse is still unsupported,
      # and explicit true still trips:
      # "mamba hybrid cache requires block reuse to be disabled"
      # Keep this explicit instead of relying on TRT-LLM's silent model default.
      enable_block_reuse: false
    moe_config:
      backend: TRTLLM
    cache_transceiver_config:
      # UCX is required for hybrid Mamba+Attention models.
      # NIXL/MOONCAKE do not support Mamba SSM state transfer.
      backend: UCX
    cuda_graph_config:
      enable_padding: true
      max_batch_size: 16
    disable_overlap_scheduler: true
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nemotron-super-decode-config
data:
  config.yaml: |
    backend: pytorch
    tensor_parallel_size: 2
    moe_expert_parallel_size: 1
    enable_attention_dp: false
    enable_chunked_prefill: true
    max_batch_size: 16
    max_num_tokens: 8192
    trust_remote_code: true
    kv_cache_config:
      free_gpu_memory_fraction: 0.85
      # Nemotron-H uses a Mamba hybrid cache. Block reuse is still unsupported,
      # and explicit true still trips:
      # "mamba hybrid cache requires block reuse to be disabled"
      # Keep this explicit instead of relying on TRT-LLM's silent model default.
      enable_block_reuse: false
    moe_config:
      backend: TRTLLM
    cache_transceiver_config:
      # UCX is required for hybrid Mamba+Attention models.
      # NIXL/MOONCAKE do not support Mamba SSM state transfer.
      backend: UCX
    cuda_graph_config:
      enable_padding: true
      max_batch_size: 16
    disable_overlap_scheduler: false
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: nemotron-super-fp8-trtllm-disagg
spec:
  backendFramework: trtllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          # Round-robin routing is the simplest correct choice here.
          # Nemotron-H on TRT-LLM has block reuse disabled, so KV-overlap
          # routing does not provide a real cache reuse benefit.
          args:
            - python3 -m dynamo.frontend --router-mode round-robin --http-port 8000
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
    TrtllmPrefillWorker:
      componentType: worker
      subComponentType: prefill
      envFromSecret: hf-token-secret
      replicas: 1
      resources:
        limits:
          gpu: "2"
        requests:
          gpu: "2"
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 16Gi
      extraPodSpec:
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 10
            timeoutSeconds: 10
            # TRT-LLM startup is slow (~7 min) due to CUDA graph compilation
            failureThreshold: 600
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
          env:
            - name: HF_HOME
              value: "/opt/models"
            - name: ENGINE_ARGS
              value: "/opt/dynamo/configs/config.yaml"
            - name: MODEL_PATH
              value: "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8"
          command:
            - /bin/sh
            - -c
          args:
            - |
              python3 -m dynamo.trtllm \
                --model-path "${MODEL_PATH}" \
                --served-model-name "${MODEL_PATH}" \
                --extra-engine-args "${ENGINE_ARGS}" \
                --disaggregation-mode prefill \
                --dyn-tool-call-parser nemotron_nano \
                --dyn-reasoning-parser nemotron_nano
          volumeMounts:
            - mountPath: /opt/dynamo/configs
              name: nemotron-super-prefill-config
              readOnly: true
        volumes:
          - configMap:
              name: nemotron-super-prefill-config
            name: nemotron-super-prefill-config
    TrtllmDecodeWorker:
      componentType: worker
      subComponentType: decode
      envFromSecret: hf-token-secret
      replicas: 1
      resources:
        limits:
          gpu: "2"
        requests:
          gpu: "2"
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 16Gi
      extraPodSpec:
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 10
            timeoutSeconds: 10
            failureThreshold: 600
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
          env:
            - name: HF_HOME
              value: "/opt/models"
            - name: ENGINE_ARGS
              value: "/opt/dynamo/configs/config.yaml"
            - name: MODEL_PATH
              value: "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8"
          command:
            - /bin/sh
            - -c
          args:
            - |
              python3 -m dynamo.trtllm \
                --model-path "${MODEL_PATH}" \
                --served-model-name "${MODEL_PATH}" \
                --extra-engine-args "${ENGINE_ARGS}" \
                --disaggregation-mode decode \
                --dyn-tool-call-parser nemotron_nano \
                --dyn-reasoning-parser nemotron_nano
          volumeMounts:
            - mountPath: /opt/dynamo/configs
              name: nemotron-super-decode-config
              readOnly: true
        volumes:
          - configMap:
              name: nemotron-super-decode-config
            name: nemotron-super-decode-config
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Tested with dynamo 1.0 (vLLM 0.16.0).
#
# Dynamo 0.9.1 compatibility notes:
# - Change image tags from :1.0.0 to :0.9.1
# - Add `--connector none` to args (required in 0.9.1, rejected in 1.0)
# - Change --dyn-reasoning-parser from nemotron_nano to deepseek_r1
#   (nemotron_nano reasoning parser is broken in 0.9.1)
# - With deepseek_r1 parser, enable_thinking: false will not work correctly
#
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: nemotron-super-fp8-vllm-agg
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      extraPodSpec:
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            timeoutSeconds: 1800
            failureThreshold: 60
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          # Approximate KV-aware routing: uses prefix hashing to route
          # requests to workers likely to have relevant KV cache, without
          # requiring KV events from the backend.
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.frontend --router-mode kv --no-kv-events --http-port 8000
    VllmWorker:
      componentType: worker
      envFromSecret: hf-token-secret
      replicas: 1
      resources:
        limits:
          gpu: "4"
        requests:
          gpu: "4"
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 16Gi
      extraPodSpec:
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 10
            timeoutSeconds: 10
            failureThreshold: 120
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          env:
            - name: HF_HOME
              value: "/opt/models"
            # Workaround for vllm/vllm#35772: FlashInfer allreduce can hang during
            # CUDA graph capture with TP>1. Fixed in vllm/vllm#35793 (default changed
            # to trtllm). The 1.0 container ships vLLM 0.16.0, so set explicitly.
            - name: VLLM_FLASHINFER_ALLREDUCE_BACKEND
              value: "trtllm"
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
            - --served-model-name
            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
            - --tensor-parallel-size
            - "4"
            - --trust-remote-code
            # On Blackwell, vLLM defaults to FlashInfer, which has a stale NaN bug
            # with hybrid Mamba models (vllm/vllm#35138). FlashInfer's multiply-by-zero
            # masking doesn't clear NaN from stale Mamba fp32 blocks reused by attention
            # layers, causing progressive accuracy degradation.
            # On Hopper, the default (FLASH_ATTN) is safe and this can be omitted.
            # On Blackwell, use FLASH_ATTN or TRITON_ATTN to avoid the bug:
            #   --attention-backend FLASH_ATTN
            # or:
            #   --attention-backend TRITON_ATTN
            # Workaround for vllm/vllm#34865: prefix caching with mamba_cache_mode="all"
            # (the default for Nemotron-H) produces NaN logprobs and garbage tokens.
            # Fixed in vLLM 0.17.0 (vllm/vllm#34874). Use "align" until then.
            - --mamba-cache-mode
            - align
            # --connector none is no longer needed in 1.0 (default is no connector).
            # In 0.9.1, you must add: --connector none
            #
            # --is-decode-worker also automatically disables KV event publishing,
            # which pairs with --no-kv-events on the frontend for approximate routing.
            - --is-decode-worker
            - --dyn-tool-call-parser
            - nemotron_nano
            # nemotron_nano reasoning parser handles both enable_thinking: true and false.
            # In 0.9.1, use deepseek_r1 instead (nemotron_nano reasoning parser is broken),
            # but note that enable_thinking: false will not work with deepseek_r1.
            - --dyn-reasoning-parser
            - nemotron_nano