Commit 423c89ee authored by Ben Hamm, committed by GitHub

fix: gpt-oss-120b disagg recipe — fix MODEL_PATH and update README (#8133)


Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent 59f474a2
......@@ -38,6 +38,7 @@ These recipes demonstrate aggregated or disaggregated serving:
| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x H100/H200 | ✅ | ✅ | MoE model, TP4×EP4 | ❌ |
| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x H100/H200 | ✅ | ✅ | MoE model, Prefill + Decode | ❌ |
| **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ |
| **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 5x Blackwell (GB200/B200) | ✅ | ✅ | Prefill/Decode split | ❌ |
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | TP=8, single-node. Use `model-download-sglang.yaml` | ❌ |
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅ | ❌ | TP=16, multi-node. Use `model-download-sglang.yaml` | ❌ |
| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 36x GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |
......
......@@ -7,8 +7,7 @@ Production-ready deployment for **GPT-OSS-120B** using TensorRT-LLM on Blackwell
| Configuration | GPUs | Mode | Description |
|--------------|------|------|-------------|
| [**trtllm/agg**](trtllm/agg/) | 4x GB200 | Aggregated | WideEP, ARM64 |
> **Note:** A [disaggregated configuration](trtllm/disagg/) exists with engine configs but is not yet production-ready. See [trtllm/disagg/README.md](trtllm/disagg/README.md) for details.
| [**trtllm/disagg**](trtllm/disagg/) | 5x Blackwell (GB200/B200) | Disaggregated | Prefill/Decode split |
## Prerequisites
......
# GPT-OSS-120B Disaggregated Mode
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
> **⚠️ INCOMPLETE**: This directory contains only engine configuration files and is not ready for Kubernetes deployment.
# GPT-OSS-120B Disaggregated Prefill/Decode
## Current Status
Serves [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using TensorRT-LLM with
disaggregated prefill/decode via Dynamo on GB200 nodes.
This directory contains TensorRT-LLM engine configurations for disaggregated serving:
- `decode.yaml` - Decode worker engine configuration
- `prefill.yaml` - Prefill worker engine configuration
## Topology
## Missing Components
| Role | Nodes | GPUs/node | Total GPUs | Parallelism |
|---------|-------|-----------|------------|-------------|
| Prefill | 1 | 1 | 1 | TP1 |
| Decode | 1 | 4 | 4 | TP4 |
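
The "5x Blackwell" figure quoted for this recipe follows from the topology table: each role contributes nodes × GPUs per node. A minimal sketch of that arithmetic, where the helper name `total_gpus` is hypothetical and not part of the recipe:

```bash
#!/usr/bin/env bash
# Hypothetical helper: sums nodes * gpus_per_node over (nodes, gpus) pairs.
total_gpus() {
  local total=0 nodes gpus
  while [ "$#" -ge 2 ]; do
    nodes=$1
    gpus=$2
    shift 2
    total=$(( total + nodes * gpus ))
  done
  echo "$total"
}

# Prefill: 1 node x 1 GPU (TP1); Decode: 1 node x 4 GPUs (TP4)
total_gpus 1 1 1 4   # prints 5
```
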
To complete this recipe, the following files are needed:
- `deploy.yaml` - Kubernetes DynamoGraphDeployment manifest
- `perf.yaml` - Performance benchmarking job (optional)
## Prerequisites
## Alternative
1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../../../docs/kubernetes/README.md)
2. **Blackwell GPU nodes** (GB200 or B200)
3. **HuggingFace token** with access to the model
For a production-ready GPT-OSS-120B deployment, use the **aggregated mode**:
- [gpt-oss-120b/trtllm/agg/](../agg/) - Complete with `deploy.yaml` and `perf.yaml`
## Deploy
## Contributing
Follow the [top-level Quick Start](../../README.md) to set up the namespace, HuggingFace
token secret, and model download. Then:
If you'd like to complete this recipe, see [recipes/CONTRIBUTING.md](../../../CONTRIBUTING.md) for guidelines on creating proper Kubernetes deployment manifests.
```bash
kubectl apply -f trtllm/disagg/deploy.yaml -n ${NAMESPACE}
```
Monitor startup (model loading takes ~15–30 minutes depending on storage speed):
```bash
kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/part-of=gpt-oss-disagg -w
```
## Test
```bash
kubectl port-forward svc/gpt-oss-disagg-frontend 8000:8000 -n ${NAMESPACE} &
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"openai/gpt-oss-120b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
```
## Benchmark (optional)
Edit `perf.yaml` to set your namespace and PVC, then run:
```bash
kubectl apply -f trtllm/disagg/perf.yaml -n ${NAMESPACE}
kubectl logs -f -l job-name=gpt-oss-120b-disagg-bench -n ${NAMESPACE}
```
## Key Configuration Notes
### Engine Configs
The `deploy.yaml` includes a ConfigMap with separate engine configurations for
prefill and decode workers. Key differences:
- **Prefill**: TP1, `max_batch_size=64`, `free_gpu_memory_fraction=0.8`, overlap scheduler disabled
- **Decode**: TP4, `max_batch_size=1280`, `free_gpu_memory_fraction=0.85`, overlap scheduler enabled
### KV Transfer
Uses UCX-based cache transceiver (`max_tokens_in_buffer=9216`) for KV cache
transfer between prefill and decode workers.
### Quantization
Uses `W4A8_MXFP4_MXFP8` quantization via the `OVERRIDE_QUANT_ALGO` environment variable.
......@@ -139,7 +139,7 @@ spec:
- name: ENGINE_ARGS
value: "/opt/dynamo/configs/prefill.yaml"
- name: MODEL_PATH
value: "/opt/models/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a"
value: "openai/gpt-oss-120b"
- name: HF_HOME
value: /opt/models
volumeMounts:
......@@ -204,7 +204,7 @@ spec:
- name: ENGINE_ARGS
value: "/opt/dynamo/configs/decode.yaml"
- name: MODEL_PATH
value: "/opt/models/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a"
value: "openai/gpt-oss-120b"
- name: HF_HOME
value: /opt/models
volumeMounts:
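
Both hunks replace a pinned snapshot directory with the repository ID `openai/gpt-oss-120b`, leaving `HF_HOME` at `/opt/models`. The old value follows the Hugging Face Hub cache layout, `<HF_HOME>/hub/models--<org>--<name>/snapshots/<revision>`. The sketch below shows how a repository ID maps to that cache directory. The helper name `hf_cache_dir` is hypothetical, and whether the worker resolves the ID in exactly this way is an assumption that the commit does not confirm.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the Hugging Face Hub cache directory for a repo ID.
# Usage: hf_cache_dir <org/name> [hf_home]
# Falls back to $HF_HOME, then to the default ~/.cache/huggingface.
hf_cache_dir() {
  local repo_id=$1
  local hf_home=${2:-${HF_HOME:-$HOME/.cache/huggingface}}
  # The cache layout replaces every "/" in the repo ID with "--".
  printf '%s/hub/models--%s\n' "$hf_home" "${repo_id//\//--}"
}

hf_cache_dir openai/gpt-oss-120b /opt/models
# prints /opt/models/hub/models--openai--gpt-oss-120b
```

Listing `snapshots/` under that directory on the model volume shows which revisions were actually downloaded, which is one way to check whether a hard-coded snapshot hash such as the one removed here still exists.
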
......