docs: Add Nemotron-3-Super-FP8 deployment recipes (#7216)

Signed-off-by: Neal Vaidya <nealv@nvidia.com>

Commit edd50f64 (unverified), authored Mar 11, 2026 by Neal Vaidya; committed by GitHub Mar 11, 2026
8 changed files
--- a/recipes/README.md
+++ b/recipes/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
 # Dynamo Production-Ready Recipes
 Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo.
@@ -40,9 +45,20 @@ These recipes demonstrate aggregated or disaggregated serving:
 *1: Please use `deepseek-r1/model-cache/model-download-sglang.yaml` to download the model into the PVC.
+### Non-Optimized Recipes
+These recipes demonstrate functional deployments with Dynamo features, but have not yet been tuned for best performance or paired with benchmark manifests.
+| Model | Framework | Mode | GPUs | Deployment | Notes |
+|-------|-----------|-------|------|------------|-------|
+| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | TP=4, KV-aware routing |
+| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/agg/)** | SGLang | Aggregated | 4x H100/H200 | ✅ | TP=4, KV-aware routing, requires Dynamo 1.0+ |
+| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, UCX KV transfer |
+| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/disagg/)** | SGLang | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, nixl KV transfer, requires Dynamo 1.0+ |
 **Legend:**
 - **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
- **Benchmark Recipe**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks | ❌ = No benchmark recipe provided
+- **Benchmark Recipe**: In the production-ready table above, ✅ = Includes `perf.yaml` for running AIPerf benchmarks | ❌ = No benchmark recipe provided
 ## Recipe Structure

--- a/recipes/nemotron-3-super-fp8/README.md
+++ b/recipes/nemotron-3-super-fp8/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+# Nemotron-3-Super FP8 Recipes
+Functional deployments for **nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8** (~124B hybrid Mamba/Attention/MoE) across multiple backends.
+These recipes target **Dynamo 1.0**. See [Dynamo 0.9.1 Compatibility](#dynamo-091-compatibility) for notes on running with older containers.
+## Available Configurations
+| Configuration | GPUs | Backend | Mode | Description |
+|--------------|------|---------|------|-------------|
+| [**vllm/agg**](vllm/agg/) | 4x H100/H200 | vLLM | Aggregated | TP=4, KV-aware routing |
+| [**sglang/agg**](sglang/agg/) | 4x H100/H200 | SGLang | Aggregated | TP=4, KV-aware routing (not working on 0.9.1) |
+| [**trtllm/disagg**](trtllm/disagg/) | 4x H100/H200 | TensorRT-LLM | Disaggregated | TP=2 P/D split, UCX KV transfer |
+| [**sglang/disagg**](sglang/disagg/) | 4x H100/H200 | SGLang | Disaggregated | TP=2 P/D split, nixl KV transfer (not working on 0.9.1) |
+## Prerequisites
+1. **Dynamo Platform installed** -- See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
+2. **GPU cluster** with 4x H100 80GB (or H200) GPUs
+3. **HuggingFace token** with access to NVIDIA models
+## Quick Start
+```bash
+# Set namespace
+export NAMESPACE=dynamo-demo
+kubectl create namespace ${NAMESPACE}
+# Create HuggingFace token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token-here" \
+  -n ${NAMESPACE}
+# Download model (update storageClassName in model-cache.yaml first!)
+kubectl apply -f model-cache/ -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
+# Deploy (choose one configuration)
+kubectl apply -f vllm/agg/deploy.yaml -n ${NAMESPACE}
+# OR: kubectl apply -f trtllm/disagg/deploy.yaml -n ${NAMESPACE}
+# OR: kubectl apply -f sglang/agg/deploy.yaml -n ${NAMESPACE}
+# OR: kubectl apply -f sglang/disagg/deploy.yaml -n ${NAMESPACE}
+```
+## Test the Deployment
+```bash
+# Port-forward the frontend
+# If deployed vllm/agg:
+kubectl port-forward svc/nemotron-super-fp8-vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
+# If deployed trtllm/disagg:
+# kubectl port-forward svc/nemotron-super-fp8-trtllm-disagg-frontend 8000:8000 -n ${NAMESPACE}
+# If deployed sglang/agg:
+# kubectl port-forward svc/nemotron-super-fp8-sglang-agg-frontend 8000:8000 -n ${NAMESPACE}
+# If deployed sglang/disagg:
+# kubectl port-forward svc/nemotron-super-fp8-sglang-disagg-frontend 8000:8000 -n ${NAMESPACE}
+# Basic chat (with reasoning)
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 100
+  }'
+# Tool calling
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
+    "messages": [{"role": "user", "content": "What is the weather in SF?"}],
+    "tools": [{"type": "function", "function": {"name": "get_weather", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}, "required": ["location"]}}}],
+    "max_tokens": 256
+  }'
+# Disable thinking (only works with nemotron_nano reasoning parser in 1.0+)
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
+    "messages": [{"role": "user", "content": "What is 2+2?"}],
+    "chat_template_kwargs": {"enable_thinking": false},
+    "max_tokens": 64
+  }'
+```
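The three `curl` requests above can also be issued from a script. The following is a minimal sketch, not part of the recipes, using only the Python standard library. It assumes the port-forward to `localhost:8000` is active and that responses follow the OpenAI chat-completions shape (`choices[0].message`); the helper names `build_chat_request`, `send_chat_request`, and `split_message` are hypothetical.

```python
import json
import urllib.request

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8"
BASE_URL = "http://localhost:8000"  # assumption: the port-forward above is running


def build_chat_request(prompt, max_tokens=100, enable_thinking=None, tools=None):
    """Build an OpenAI-style chat completion payload (hypothetical helper)."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if tools is not None:
        payload["tools"] = tools
    if enable_thinking is not None:
        # Same field as the "Disable thinking" curl example.
        payload["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return payload


def send_chat_request(payload, base_url=BASE_URL, timeout=120):
    """POST the payload to /v1/chat/completions and return the decoded JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def split_message(response):
    """Return (reasoning_content, content) from the first choice."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")
```

For example, `split_message(send_chat_request(build_chat_request("What is 2+2?", max_tokens=64, enable_thinking=False)))` mirrors the last `curl` call and returns the reasoning and answer fields separately.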
+## Model Details
+- **Model**: `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8`
+- **Architecture**: Nemotron-H (hybrid Mamba/Attention/MoE, 88 layers)
+- **Parameters**: ~124B total (~119B FP8, ~4.7B BF16)
+- **Quantization**: ModelOpt FP8 (F8_E4M3) with FP8 KV cache
+## Parser Configuration
+All recipes include tool call and reasoning parsers:
+- `--dyn-reasoning-parser nemotron_nano` -- Extracts `<think>...</think>` into `reasoning_content`. Correctly handles both `enable_thinking: true` and `enable_thinking: false`.
+- `--dyn-tool-call-parser nemotron_nano` -- Parses `<tool_call><function=name>` into structured `tool_calls`.
+To disable reasoning at request time, pass `"chat_template_kwargs": {"enable_thinking": false}`. The model also supports `"chat_template_kwargs": {"low_effort": true}` for lighter-weight reasoning.
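To illustrate what "extracts `<think>...</think>` into `reasoning_content`" means, here is a toy sketch of the split on a complete, non-streamed response. It is not the `nemotron_nano` parser's implementation: the function name `split_reasoning` is hypothetical, and a real parser also has to handle streamed, partial output, which this ignores.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Split raw model text into (reasoning_content, content).

    Illustrative only: with no <think> block (for example when thinking is
    disabled), all text is returned as content and reasoning is None.
    """
    match = THINK_BLOCK.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    content = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, content
```

With thinking enabled, `split_reasoning("<think>2+2 is 4</think>The answer is 4.")` separates the two parts; with thinking disabled and no tags in the output, the whole text stays in `content`.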
+## Routing
+- **vLLM** and **SGLang** recipes use **approximate KV-aware routing** (`--router-mode kv --no-kv-events` on the frontend). The frontend uses prefix hashing to route requests to workers most likely to have relevant KV cache blocks, which helps workloads with shared system prompts or multi-turn conversations.
+- The **TensorRT-LLM** disaggregated recipe uses **round-robin routing**. Nemotron-H on TRT-LLM still requires `enable_block_reuse: false`, so KV overlap routing does not provide a real cache-reuse benefit here and only adds misleading overlap bookkeeping.
+The vLLM and SGLang variants use approximate (hash-based) routing rather than event-driven KV routing because hybrid Mamba+Attention models do not yet have a reliable KV-event path in these recipes (`--kv-events-config` for vLLM/SGLang, `--publish-events-and-metrics` for TRT-LLM).
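The idea behind approximate, prefix-hash routing can be sketched with a toy router. This is an illustration under stated assumptions, not Dynamo's router: the class `ApproxKvRouter`, the block size of 4 tokens, and the use of a request count as a stand-in for load are all hypothetical. The router predicts cache contents from the requests it has already sent, so it needs no KV events from the workers.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; illustrative value, not a Dynamo default


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes = []
    parent = b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode("utf-8")).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class ApproxKvRouter:
    """Toy router that guesses cache contents from its own routing history."""

    def __init__(self, workers):
        self.seen = {worker: set() for worker in workers}
        self.load = {worker: 0 for worker in workers}  # requests routed so far

    def overlap(self, worker, hashes):
        """Count leading blocks the worker is predicted to have cached."""
        count = 0
        for block_hash in hashes:
            if block_hash not in self.seen[worker]:
                break
            count += 1
        return count

    def route(self, tokens):
        hashes = block_hashes(tokens)
        # Prefer the longest predicted prefix overlap; break ties by lowest load.
        best = max(self.seen, key=lambda w: (self.overlap(w, hashes), -self.load[w]))
        self.seen[best].update(hashes)
        self.load[best] += 1
        return best
```

Two requests that share a system prompt hash to the same leading blocks, so the second is sent to the worker that served the first, which is the behavior the multi-turn and shared-prompt workloads benefit from.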
+## Backend-Specific Notes
+### vLLM
+- No connector flags needed in 1.0 (default is no connector)
+- Requires `--is-decode-worker` to skip KV event publisher setup
+- Requires `--mamba-cache-mode align` to work around [vllm#34865](https://github.com/vllm-project/vllm/issues/34865): prefix caching with the default `mamba_cache_mode="all"` produces NaN logprobs and garbage tokens for Nemotron-H. Fixed in vLLM 0.17.0 ([vllm#34874](https://github.com/vllm-project/vllm/pull/34874)); the 1.0 container ships vLLM 0.16.0, so the workaround is needed.
+- **Attention backend**: On Hopper the default (`FLASH_ATTN`) is safe. On Blackwell, vLLM defaults to FlashInfer, which has a [stale NaN bug](https://github.com/vllm-project/vllm/issues/35138) with hybrid Mamba models ([vllm#35219](https://github.com/vllm-project/vllm/pull/35219)). For Blackwell, specify `--attention-backend FLASH_ATTN` or `--attention-backend TRITON_ATTN` to avoid the issue.
+- Sets `VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm` to avoid a [hang during CUDA graph capture](https://github.com/vllm-project/vllm/issues/35772) with TP>1. This is the [new default](https://github.com/vllm-project/vllm/pull/35793) in later vLLM versions but must be set explicitly in 0.16.0.
+### TensorRT-LLM
+- Uses PyTorch backend (`backend: pytorch` in engine config)
+- Block reuse is still not supported for the Nemotron-H / Mamba hybrid cache. Set `enable_block_reuse: false` explicitly in all TRT-LLM Nemotron configs. If the field is omitted, current TRT-LLM builds may still start, but only because the Nemotron model class silently applies a model default of `enable_block_reuse: false`; block reuse is not active in either case.
+- The TRT-LLM disaggregated recipe uses `--router-mode round-robin` rather than KV routing. With block reuse disabled, KV-overlap scoring does not correspond to a real runtime win for Nemotron-H.
+- **Disaggregated mode** requires `cache_transceiver_config: backend: UCX`. NIXL and MOONCAKE backends do not support hybrid models with Mamba SSM state — only UCX (or MPI) can transfer both attention KV cache and Mamba conv/SSM state between workers.
+### SGLang
+- Requires sglang >= v0.5.9 (1.0 ships v0.5.9; 0.9.1 ships v0.5.8, which has blocking bugs)
+- **Disaggregated mode works** with nixl KV transfer (TP=2 per worker, 2 GPUs each). Mooncake (`--disaggregation-transfer-backend mooncake`) is also supported as an alternative transfer backend.
+- Known issue: prefill warmup logs `Prefill warmup failed: 'SamplingParams' object is not subscriptable` -- non-blocking, does not affect functionality
+## Dynamo 0.9.1 Compatibility
+These recipes target Dynamo 1.0. To run on 0.9.1 containers, the following changes are needed:
+### vLLM (`vllm-runtime:0.9.1`)
+- Change image tags from `:1.0.0` to `:0.9.1`
+- **Add** `--connector none` to worker args (required in 0.9.1 to disable nixl KV connector; rejected in 1.0)
+- Change `--dyn-reasoning-parser` from `nemotron_nano` to `deepseek_r1` (nemotron_nano reasoning parser is broken in 0.9.1)
+- `enable_thinking: false` will **not work** with the `deepseek_r1` parser (the response text lands in `reasoning_content` and `content` is null)
+- `--mamba-cache-mode align` is still needed (0.9.1 ships vLLM 0.14.1, also affected by [vllm#34865](https://github.com/vllm-project/vllm/issues/34865))
+### TensorRT-LLM (`tensorrtllm-runtime:0.9.1`)
+- Change image tags from `:1.0.0` to `:0.9.1`
+- Change `--dyn-reasoning-parser` from `nemotron_nano` to `deepseek_r1`
+- Same `enable_thinking: false` caveat as vLLM above
+- Keep `enable_block_reuse: false` in `kv_cache_config` in the ConfigMap. This is still the effective setting for Nemotron-H on current TRT-LLM builds; omitting the field appears to work only because TRT-LLM silently applies the same model default later.
+### SGLang (`sglang-runtime:0.9.1`)
+- **Not supported.** The bundled sglang v0.5.8 has two blocking bugs:
+  1. FP8 quantization bug (`ModelOptFp8LinearMethod.create_weights()` signature mismatch)
+  2. Config format mismatch (`hybrid_override_pattern` vs `layers_block_type`)
+- Both are fixed in sglang v0.5.9 but the 0.9.1 container ships v0.5.8
+## Notes
+- **Disaggregated mode**: Supported with TRT-LLM via UCX (`trtllm/disagg`) and SGLang via nixl or mooncake (`sglang/disagg`). Not supported with vLLM due to hybrid KV cache incompatibilities. TRT-LLM disagg requires UCX because NIXL/MOONCAKE cannot transfer Mamba SSM state.
+- **Storage class**: Update `storageClassName` in `model-cache/model-cache.yaml` before deploying.
+- **Model size**: ~240GB download; expect 30-60 minutes depending on bandwidth.
--- a/recipes/nemotron-3-super-fp8/model-cache/model-cache.yaml
+++ b/recipes/nemotron-3-super-fp8/model-cache/model-cache.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: model-cache
+spec:
+  accessModes:
+    - ReadWriteMany
+  resources:
+    requests:
+      storage: 300Gi
+  storageClassName: "your-storage-class-name"
--- a/recipes/nemotron-3-super-fp8/model-cache/model-download.yaml
+++ b/recipes/nemotron-3-super-fp8/model-cache/model-download.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: model-download
+spec:
+  backoffLimit: 3
+  completions: 1
+  parallelism: 1
+  template:
+    metadata:
+      labels:
+        app: model-download
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: model-download
+          image: python:3.10-slim
+          command: ["sh", "-c"]
+          envFrom:
+            - secretRef:
+                name: hf-token-secret
+          env:
+            - name: MODEL_NAME
+              value: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
+            - name: HF_HOME
+              value: /model-store
+            - name: HF_HUB_ENABLE_HF_TRANSFER
+              value: "1"
+          args:
+            - |
+              set -eux
+              pip install --no-cache-dir huggingface_hub hf_transfer
+              hf download $MODEL_NAME
+          volumeMounts:
+            - name: model-cache
+              mountPath: /model-store
+      volumes:
+      - name: model-cache
+        persistentVolumeClaim:
+          claimName: model-cache
--- a/recipes/nemotron-3-super-fp8/sglang/agg/deploy.yaml
+++ b/recipes/nemotron-3-super-fp8/sglang/agg/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# NOTE: This recipe requires dynamo 1.0+ with sglang >= v0.5.9.
+#
+# NOT working on dynamo 0.9.1 (sglang v0.5.8) due to two blocking bugs:
+#
+# 1. FP8 quantization bug: ModelOptFp8LinearMethod.create_weights() missing
+#    input_size/output_size parameters, causing TypeError on model load.
+#    Fixed in sglang v0.5.9 (commit 0ff24159a5).
+#
+# 2. Config format mismatch: sglang expects hybrid_override_pattern (string)
+#    but the model provides layers_block_type (list). Workaround: patch the
+#    model's config.json to add a hybrid_override_pattern field.
+#
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: nemotron-super-fp8-sglang-agg
+spec:
+  backendFramework: sglang
+  envs:
+    - name: HF_HOME
+      value: /opt/models
+  pvcs:
+    - name: model-cache
+      create: false
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: 1
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0
+          # Approximate KV-aware routing: uses prefix hashing to route
+          # requests to workers likely to have relevant KV cache, without
+          # requiring KV events from the backend.
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - python3 -m dynamo.frontend --router-mode kv --no-kv-events --http-port 8000
+    SglangWorker:
+      componentType: worker
+      envFromSecret: hf-token-secret
+      replicas: 1
+      resources:
+        limits:
+          gpu: "4"
+        requests:
+          gpu: "4"
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      sharedMemory:
+        size: 16Gi
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0
+          workingDir: /workspace
+          command:
+            - python3
+            - -m
+            - dynamo.sglang
+          args:
+            - --model-path
+            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
+            - --served-model-name
+            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
+            - --tp
+            - "4"
+            - --trust-remote-code
+            - --dyn-tool-call-parser
+            - nemotron_nano
+            - --dyn-reasoning-parser
+            - nemotron_nano
--- a/recipes/nemotron-3-super-fp8/sglang/disagg/deploy.yaml
+++ b/recipes/nemotron-3-super-fp8/sglang/disagg/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Disaggregated SGLang deployment: prefill/decode split with nixl KV transfer.
+# Tested with dynamo 1.0 (SGLang 0.5.9).
+#
+# Uses TP=2 per worker (prefill: 2 GPUs, decode: 2 GPUs) for a total of 4 GPUs.
+# KV cache is transferred between workers via nixl (GPU-direct).
+#
+# NOT working on dynamo 0.9.1 — same blocking bugs as sglang/agg.
+#
+# Known issue: Prefill warmup logs a non-blocking warning:
+#   "Prefill warmup failed: 'SamplingParams' object is not subscriptable"
+# This does not affect functionality.
+#
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: nemotron-super-fp8-sglang-disagg
+spec:
+  backendFramework: sglang
+  envs:
+    - name: HF_HOME
+      value: /opt/models
+  pvcs:
+    - name: model-cache
+      create: false
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: 1
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - python3 -m dynamo.frontend --router-mode kv --no-kv-events --http-port 8000
+    prefill:
+      componentType: worker
+      subComponentType: prefill
+      envFromSecret: hf-token-secret
+      replicas: 1
+      resources:
+        limits:
+          gpu: "2"
+        requests:
+          gpu: "2"
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      sharedMemory:
+        size: 16Gi
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0
+          workingDir: /workspace
+          command:
+            - python3
+            - -m
+            - dynamo.sglang
+          args:
+            - --model-path
+            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
+            - --served-model-name
+            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
+            - --tp
+            - "2"
+            - --trust-remote-code
+            - --disaggregation-mode
+            - prefill
+            - --disaggregation-bootstrap-port
+            - "12345"
+            - --disaggregation-transfer-backend
+            - nixl
+            - --host
+            - 0.0.0.0
+            - --dyn-tool-call-parser
+            - nemotron_nano
+            - --dyn-reasoning-parser
+            - nemotron_nano
+    decode:
+      componentType: worker
+      subComponentType: decode
+      envFromSecret: hf-token-secret
+      replicas: 1
+      resources:
+        limits:
+          gpu: "2"
+        requests:
+          gpu: "2"
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      sharedMemory:
+        size: 16Gi
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0
+          workingDir: /workspace
+          command:
+            - python3
+            - -m
+            - dynamo.sglang
+          args:
+            - --model-path
+            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
+            - --served-model-name
+            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
+            - --tp
+            - "2"
+            - --trust-remote-code
+            - --disaggregation-mode
+            - decode
+            - --disaggregation-bootstrap-port
+            - "12345"
+            - --disaggregation-transfer-backend
+            - nixl
+            - --host
+            - 0.0.0.0
+            - --dyn-tool-call-parser
+            - nemotron_nano
+            - --dyn-reasoning-parser
+            - nemotron_nano
--- a/recipes/nemotron-3-super-fp8/trtllm/disagg/deploy.yaml
+++ b/recipes/nemotron-3-super-fp8/trtllm/disagg/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Disaggregated TRT-LLM deployment: prefill/decode split with UCX KV transfer.
+# Tested with dynamo 1.0 (TRT-LLM PyTorch backend).
+#
+# Uses TP=2 per worker (prefill: 2 GPUs, decode: 2 GPUs) for a total of 4 GPUs.
+# KV cache (attention + Mamba SSM state) is transferred between workers via UCX.
+#
+# IMPORTANT: Must use UCX backend, not NIXL. NIXL and MOONCAKE backends do not
+# support hybrid models with Mamba SSM state:
+#   ValueError: NIXL or MOONCAKE backend does not support hybrid models with
+#   RNN (Mamba) states. Please use UCX or MPI backend for cache transfer with
+#   hybrid models.
+#
+# Dynamo 0.9.1 compatibility notes:
+#   - Change image tags from :1.0.0 to :0.9.1
+#   - Change --dyn-reasoning-parser from nemotron_nano to deepseek_r1
+#   - With deepseek_r1 parser, enable_thinking: false will not work correctly
+#   - Keep enable_block_reuse: false in both kv_cache_config blocks. Current
+#     TRT-LLM builds still disable block reuse for Nemotron-H / Mamba hybrid cache.
+#
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: nemotron-super-prefill-config
+data:
+  config.yaml: |
+    backend: pytorch
+    tensor_parallel_size: 2
+    moe_expert_parallel_size: 1
+    enable_attention_dp: false
+    enable_chunked_prefill: true
+    max_batch_size: 16
+    max_num_tokens: 8192
+    trust_remote_code: true
+    kv_cache_config:
+      free_gpu_memory_fraction: 0.85
+      # Nemotron-H uses a Mamba hybrid cache. Block reuse is still unsupported,
+      # and setting it to true explicitly still fails with:
+      #   "mamba hybrid cache requires block reuse to be disabled"
+      # Keep this explicit instead of relying on TRT-LLM's silent model default.
+      enable_block_reuse: false
+    moe_config:
+      backend: TRTLLM
+    cache_transceiver_config:
+      # UCX is required for hybrid Mamba+Attention models.
+      # NIXL/MOONCAKE do not support Mamba SSM state transfer.
+      backend: UCX
+    cuda_graph_config:
+      enable_padding: true
+      max_batch_size: 16
+    disable_overlap_scheduler: true
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: nemotron-super-decode-config
+data:
+  config.yaml: |
+    backend: pytorch
+    tensor_parallel_size: 2
+    moe_expert_parallel_size: 1
+    enable_attention_dp: false
+    enable_chunked_prefill: true
+    max_batch_size: 16
+    max_num_tokens: 8192
+    trust_remote_code: true
+    kv_cache_config:
+      free_gpu_memory_fraction: 0.85
+      # Nemotron-H uses a Mamba hybrid cache. Block reuse is still unsupported,
+      # and setting it to true explicitly still fails with:
+      #   "mamba hybrid cache requires block reuse to be disabled"
+      # Keep this explicit instead of relying on TRT-LLM's silent model default.
+      enable_block_reuse: false
+    moe_config:
+      backend: TRTLLM
+    cache_transceiver_config:
+      # UCX is required for hybrid Mamba+Attention models.
+      # NIXL/MOONCAKE do not support Mamba SSM state transfer.
+      backend: UCX
+    cuda_graph_config:
+      enable_padding: true
+      max_batch_size: 16
+    disable_overlap_scheduler: false
+---
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: nemotron-super-fp8-trtllm-disagg
+spec:
+  backendFramework: trtllm
+  pvcs:
+    - name: model-cache
+      create: false
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          # Round-robin routing is the simplest correct choice here.
+          # Nemotron-H on TRT-LLM has block reuse disabled, so KV-overlap
+          # routing does not provide a real cache reuse benefit.
+          args:
+          - python3 -m dynamo.frontend --router-mode round-robin --http-port 8000
+          command:
+          - /bin/sh
+          - -c
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+    TrtllmPrefillWorker:
+      componentType: worker
+      subComponentType: prefill
+      envFromSecret: hf-token-secret
+      replicas: 1
+      resources:
+        limits:
+          gpu: "2"
+        requests:
+          gpu: "2"
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      sharedMemory:
+        size: 16Gi
+      extraPodSpec:
+        mainContainer:
+          startupProbe:
+            httpGet:
+              path: /health
+              port: 9090
+            periodSeconds: 10
+            timeoutSeconds: 10
+            # TRT-LLM startup is slow (~7 min) due to CUDA graph compilation
+            failureThreshold: 600
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+          env:
+          - name: HF_HOME
+            value: "/opt/models"
+          - name: ENGINE_ARGS
+            value: "/opt/dynamo/configs/config.yaml"
+          - name: MODEL_PATH
+            value: "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8"
+          command:
+          - /bin/sh
+          - -c
+          args:
+          - |
+            python3 -m dynamo.trtllm \
+              --model-path "${MODEL_PATH}" \
+              --served-model-name "${MODEL_PATH}" \
+              --extra-engine-args "${ENGINE_ARGS}" \
+              --disaggregation-mode prefill \
+              --dyn-tool-call-parser nemotron_nano \
+              --dyn-reasoning-parser nemotron_nano
+          volumeMounts:
+          - mountPath: /opt/dynamo/configs
+            name: nemotron-super-prefill-config
+            readOnly: true
+        volumes:
+        - configMap:
+            name: nemotron-super-prefill-config
+          name: nemotron-super-prefill-config
+    TrtllmDecodeWorker:
+      componentType: worker
+      subComponentType: decode
+      envFromSecret: hf-token-secret
+      replicas: 1
+      resources:
+        limits:
+          gpu: "2"
+        requests:
+          gpu: "2"
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      sharedMemory:
+        size: 16Gi
+      extraPodSpec:
+        mainContainer:
+          startupProbe:
+            httpGet:
+              path: /health
+              port: 9090
+            periodSeconds: 10
+            timeoutSeconds: 10
+            failureThreshold: 600
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
+          env:
+          - name: HF_HOME
+            value: "/opt/models"
+          - name: ENGINE_ARGS
+            value: "/opt/dynamo/configs/config.yaml"
+          - name: MODEL_PATH
+            value: "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8"
+          command:
+          - /bin/sh
+          - -c
+          args:
+          - |
+            python3 -m dynamo.trtllm \
+              --model-path "${MODEL_PATH}" \
+              --served-model-name "${MODEL_PATH}" \
+              --extra-engine-args "${ENGINE_ARGS}" \
+              --disaggregation-mode decode \
+              --dyn-tool-call-parser nemotron_nano \
+              --dyn-reasoning-parser nemotron_nano
+          volumeMounts:
+          - mountPath: /opt/dynamo/configs
+            name: nemotron-super-decode-config
+            readOnly: true
+        volumes:
+        - configMap:
+            name: nemotron-super-decode-config
+          name: nemotron-super-decode-config
--- a/recipes/nemotron-3-super-fp8/vllm/agg/deploy.yaml
+++ b/recipes/nemotron-3-super-fp8/vllm/agg/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Tested with dynamo 1.0 (vLLM 0.16.0).
+#
+# Dynamo 0.9.1 compatibility notes:
+#   - Change image tags from :1.0.0 to :0.9.1
+#   - Add `--connector none` to args (required in 0.9.1, rejected in 1.0)
+#   - Change --dyn-reasoning-parser from nemotron_nano to deepseek_r1
+#     (nemotron_nano reasoning parser is broken in 0.9.1)
+#   - With deepseek_r1 parser, enable_thinking: false will not work correctly
+#
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: nemotron-super-fp8-vllm-agg
+spec:
+  backendFramework: vllm
+  pvcs:
+    - name: model-cache
+      create: false
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: 1
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      extraPodSpec:
+        mainContainer:
+          startupProbe:
+            httpGet:
+              path: /health
+              port: 8000
+            periodSeconds: 10
+            timeoutSeconds: 1800
+            failureThreshold: 60
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
+          # Approximate KV-aware routing: uses prefix hashing to route
+          # requests to workers likely to have relevant KV cache, without
+          # requiring KV events from the backend.
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - python3 -m dynamo.frontend --router-mode kv --no-kv-events --http-port 8000
+    VllmWorker:
+      componentType: worker
+      envFromSecret: hf-token-secret
+      replicas: 1
+      resources:
+        limits:
+          gpu: "4"
+        requests:
+          gpu: "4"
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      sharedMemory:
+        size: 16Gi
+      extraPodSpec:
+        mainContainer:
+          startupProbe:
+            httpGet:
+              path: /health
+              port: 9090
+            periodSeconds: 10
+            timeoutSeconds: 10
+            failureThreshold: 120
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
+          env:
+            - name: HF_HOME
+              value: "/opt/models"
+            # Workaround for vllm/vllm#35772: FlashInfer allreduce can hang during
+            # CUDA graph capture with TP>1. Fixed in vllm/vllm#35793 (default changed
+            # to trtllm). The 1.0 container ships vLLM 0.16.0, so set explicitly.
+            - name: VLLM_FLASHINFER_ALLREDUCE_BACKEND
+              value: "trtllm"
+          command:
+            - python3
+            - -m
+            - dynamo.vllm
+          args:
+            - --model
+            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
+            - --served-model-name
+            - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
+            - --tensor-parallel-size
+            - "4"
+            - --trust-remote-code
+            # On Blackwell, vLLM defaults to FlashInfer, which has a stale NaN bug
+            # with hybrid Mamba models (vllm/vllm#35138). FlashInfer's multiply-by-zero
+            # masking doesn't clear NaN from stale Mamba fp32 blocks reused by attention
+            # layers, causing progressive accuracy degradation.
+            # On Hopper, the default (FLASH_ATTN) is safe and this can be omitted.
+            # On Blackwell, use FLASH_ATTN or TRITON_ATTN to avoid the bug:
+            #   --attention-backend FLASH_ATTN
+            # or:
+            #   --attention-backend TRITON_ATTN
+            # Workaround for vllm/vllm#34865: prefix caching with mamba_cache_mode="all"
+            # (the default for Nemotron-H) produces NaN logprobs and garbage tokens.
+            # Fixed in vLLM 0.17.0 (vllm/vllm#34874). Use "align" until then.
+            - --mamba-cache-mode
+            - align
+            # --connector none is no longer needed in 1.0 (default is no connector).
+            # In 0.9.1, you must add: --connector none
+            #
+            # --is-decode-worker also automatically disables KV event publishing,
+            # which pairs with --no-kv-events on the frontend for approximate routing.
+            - --is-decode-worker
+            - --dyn-tool-call-parser
+            - nemotron_nano
+            # nemotron_nano reasoning parser handles both enable_thinking: true and false.
+            # In 0.9.1, use deepseek_r1 instead (nemotron_nano reasoning parser is broken),
+            # but note that enable_thinking: false will not work with deepseek_r1.
+            - --dyn-reasoning-parser
+            - nemotron_nano