fix: gpt-oss-120b disagg recipe — fix MODEL_PATH and update README (#8133)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Commit 423c89ee authored Apr 15, 2026 by Ben Hamm; committed by GitHub Apr 15, 2026
4 changed files
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -38,6 +38,7 @@ These recipes demonstrate aggregated or disaggregated serving:
 | **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x H100/H200 | ✅ | ✅ | MoE model, TP4×EP4 | ❌ |
 | **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x H100/H200 | ✅ | ✅ | MoE model, Prefill + Decode | ❌ |
 | **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ |
+| **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 5x Blackwell (GB200/B200) | ✅ | ✅ | Prefill/Decode split | ❌ |
 | **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | TP=8, single-node. Use `model-download-sglang.yaml` | ❌ |
 | **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅ | ❌ | TP=16, multi-node. Use `model-download-sglang.yaml` | ❌ |
 | **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 36x GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |

--- a/recipes/gpt-oss-120b/README.md
+++ b/recipes/gpt-oss-120b/README.md
@@ -7,8 +7,7 @@ Production-ready deployment for **GPT-OSS-120B** using TensorRT-LLM on Blackwell
 | Configuration | GPUs | Mode | Description |
 |--------------|------|------|-------------|
 | [**trtllm/agg**](trtllm/agg/) | 4x GB200 | Aggregated | WideEP, ARM64 |
-
-> **Note:** A [disaggregated configuration](trtllm/disagg/) exists with engine configs but is not yet production-ready. See [trtllm/disagg/README.md](trtllm/disagg/README.md) for details.
+| [**trtllm/disagg**](trtllm/disagg/) | 5x Blackwell (GB200/B200) | Disaggregated | Prefill/Decode split |

 ## Prerequisites


--- a/recipes/gpt-oss-120b/trtllm/disagg/README.md
+++ b/recipes/gpt-oss-120b/trtllm/disagg/README.md
-# GPT-OSS-120B Disaggregated Mode
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->

-> **⚠️ INCOMPLETE**: This directory contains only engine configuration files and is not ready for Kubernetes deployment.
+# GPT-OSS-120B Disaggregated Prefill/Decode

-## Current Status
+Serves [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using TensorRT-LLM with
+disaggregated prefill/decode via Dynamo on GB200 nodes.

-This directory contains TensorRT-LLM engine configurations for disaggregated serving:
- `decode.yaml` - Decode worker engine configuration
- `prefill.yaml` - Prefill worker engine configuration
+## Topology

-## Missing Components
+| Role    | Nodes | GPUs/node | Total GPUs | Parallelism |
+|---------|-------|-----------|------------|-------------|
+| Prefill | 1     | 1         | 1          | TP1         |
+| Decode  | 1     | 4         | 4          | TP4         |

-To complete this recipe, the following files are needed:
- `deploy.yaml` - Kubernetes DynamoGraphDeployment manifest
- `perf.yaml` - Performance benchmarking job (optional)
+## Prerequisites

-## Alternative
+1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../../../docs/kubernetes/README.md)
+2. **Blackwell GPU nodes** (GB200 or B200)
+3. **HuggingFace token** with access to the model

-For a production-ready GPT-OSS-120B deployment, use the **aggregated mode**:
- [gpt-oss-120b/trtllm/agg/](../agg/) - Complete with `deploy.yaml` and `perf.yaml`
+## Deploy

-## Contributing
+Follow the [top-level Quick Start](../../README.md) to set up the namespace, HuggingFace
+token secret, and model download. Then:

-If you'd like to complete this recipe, see [recipes/CONTRIBUTING.md](../../../CONTRIBUTING.md) for guidelines on creating proper Kubernetes deployment manifests.
+```bash
+kubectl apply -f trtllm/disagg/deploy.yaml -n ${NAMESPACE}
+```

+Monitor startup (model loading takes ~15–30 minutes depending on storage speed):
+
+```bash
+kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/part-of=gpt-oss-disagg -w
+```
+
+## Test
+
+```bash
+kubectl port-forward svc/gpt-oss-disagg-frontend 8000:8000 -n ${NAMESPACE} &
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"openai/gpt-oss-120b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
+```
+
+## Benchmark (optional)
+
+Edit `perf.yaml` to set your namespace and PVC, then run:
+
+```bash
+kubectl apply -f trtllm/disagg/perf.yaml -n ${NAMESPACE}
+kubectl logs -f -l job-name=gpt-oss-120b-disagg-bench -n ${NAMESPACE}
+```
+
+## Key Configuration Notes
+
+### Engine Configs
+
+The `deploy.yaml` includes a ConfigMap with separate engine configurations for
+prefill and decode workers. Key differences:
+
+- **Prefill**: TP1, `max_batch_size=64`, `free_gpu_memory_fraction=0.8`, overlap scheduler disabled
+- **Decode**: TP4, `max_batch_size=1280`, `free_gpu_memory_fraction=0.85`, overlap scheduler enabled
+
+### KV Transfer
+
+Uses a UCX-based cache transceiver (`max_tokens_in_buffer=9216`) for KV cache
+transfer between prefill and decode workers.
+
+### Quantization
+
+Uses `W4A8_MXFP4_MXFP8` quantization via the `OVERRIDE_QUANT_ALGO` environment variable.
--- a/recipes/gpt-oss-120b/trtllm/disagg/deploy.yaml
+++ b/recipes/gpt-oss-120b/trtllm/disagg/deploy.yaml
@@ -139,7 +139,7 @@ spec:
          - name: ENGINE_ARGS
            value: "/opt/dynamo/configs/prefill.yaml"
          - name: MODEL_PATH
-            value: "/opt/models/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a"
+            value: "openai/gpt-oss-120b"
          - name: HF_HOME
            value: /opt/models
          volumeMounts:
@@ -204,7 +204,7 @@ spec:
          - name: ENGINE_ARGS
            value: "/opt/dynamo/configs/decode.yaml"
          - name: MODEL_PATH
-            value: "/opt/models/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a"
+            value: "openai/gpt-oss-120b"
          - name: HF_HOME
            value: /opt/models
          volumeMounts:
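
The `MODEL_PATH` change above replaces a pinned snapshot directory with the Hub repo ID `openai/gpt-oss-120b`, while `HF_HOME` stays at `/opt/models`. As a minimal sketch of how the two forms relate, the helper below maps a repo ID to the directory the Hugging Face Hub cache uses for it. The function name `hf_cache_dir` is hypothetical and not part of the recipe, and whether the worker resolves the repo ID from this cache at startup is an assumption based on `HF_HOME` being set in the manifest.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the recipe): map a Hugging Face repo ID to
# the directory the Hub cache uses for it under HF_HOME.
hf_cache_dir() {
  local repo_id="$1"
  # Second argument overrides HF_HOME; the final fallback is the Hub default.
  local hf_home="${2:-${HF_HOME:-$HOME/.cache/huggingface}}"
  # The Hub cache prefixes "models--" and replaces each "/" in the ID with "--".
  printf '%s/hub/models--%s\n' "$hf_home" "${repo_id//\//--}"
}

# Same values as the manifest: repo ID from MODEL_PATH, root from HF_HOME.
hf_cache_dir "openai/gpt-oss-120b" "/opt/models"
```

The printed directory is the prefix of the old hard-coded value. The removed path additionally pinned one `snapshots/<revision>` subdirectory, so it only worked if the download had fetched exactly that revision.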