SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running gpt-oss-120b Disaggregated with vLLM
Dynamo supports disaggregated serving of gpt-oss-120b with vLLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single H100 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
## Overview
This deployment uses disaggregated serving in vLLM where:
- **Prefill Worker**: Processes input prompts efficiently using 4 GPUs with tensor parallelism
- **Decode Worker**: Generates output tokens using 4 GPUs, optimized for token generation throughput
- **Frontend**: Provides OpenAI-compatible API endpoint with round-robin routing
## Prerequisites
This guide assumes you already know how to deploy Dynamo disaggregated serving with vLLM, as illustrated in [README.md](/components/backends/vllm/README.md).
## Instructions
### 1. Launch the Deployment
GPT-OSS is a reasoning model with tool calling support. To
ensure responses are processed correctly, launch each worker
with the appropriate `--dyn-reasoning-parser` and `--dyn-tool-call-parser` values.
If only one worker endpoint is listed, the other may still be starting up. Monitor the worker logs to track startup progress.
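Waiting for both workers can be automated with a small polling loop. The sketch below is generic: `list_worker_endpoints` is a hypothetical callable that you supply to return the currently listed worker endpoints (this guide does not define such a function), and the default count of 2 reflects the one prefill worker and one decode worker used here.

```python
import time


def wait_for_workers(list_worker_endpoints, expected=2,
                     timeout_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Poll until at least `expected` endpoints are listed or the timeout elapses.

    `list_worker_endpoints` is a hypothetical, caller-supplied callable.
    Returns the list of endpoints seen on success; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        endpoints = list(list_worker_endpoints())
        if len(endpoints) >= expected:
            return endpoints
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {len(endpoints)} of {expected} worker endpoints listed"
            )
        # Not ready yet: wait before polling again.
        sleep(interval_s)
```

The timeout and interval defaults are arbitrary; loading a 120B-parameter model can take several minutes, so choose values that match your hardware.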
### 3. Test the Deployment
Send a test request to verify the deployment:
```bash
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "input": "Explain the concept of disaggregated serving in LLM inference in 3 sentences.",
    "max_output_tokens": 200,
    "stream": false
  }'
```
The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters such as `max_output_tokens` and `temperature` according to your needs.
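The same request can also be issued from Python. The following is a minimal sketch using only the standard library. It assumes the frontend is reachable at `http://localhost:8000`, as in the curl example, and the helper names `build_responses_request` and `send_request` are illustrative rather than part of Dynamo.

```python
import json
import urllib.request

# Assumption: same host and port as the curl example.
BASE_URL = "http://localhost:8000"


def build_responses_request(prompt, model="openai/gpt-oss-120b",
                            max_output_tokens=200, base_url=BASE_URL):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": model,
        "input": prompt,
        "max_output_tokens": max_output_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(request, timeout=60.0):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Against a running deployment, `send_request(build_responses_request("Hello"))` returns the parsed response body. Keeping request construction separate from sending makes the payload easy to inspect or log before anything goes over the network.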
### 4. Reasoning and Tool Calling
Dynamo supports reasoning and tool calling in the OpenAI Chat Completions endpoint. In a typical workflow, an application built on top of Dynamo
defines a set of tools that help the assistant give accurate answers. The interaction is usually
multi-turn, because it involves tool selection followed by generation based on the tool result. Below is an example
of sending multi-round requests to complete a user query with reasoning and tool calling:
**Application setup (pseudocode)**
```Python
# The tool defined by the application
def get_system_health():
    for component in system.components:
        if not component.health():
            return False
    return True

# The JSON representation of the declaration in ChatCompletion tool style
tool_choice = '''{
  "type": "function",
  "function": {
    "name": "get_system_health",
    "description": "Returns the current health status of the LLM runtime; use before critical operations to verify the service is live.",
    "parameters": {
      "type": "object",
      "properties": {}
    }
  }
}'''

# On user query, perform the workflow below.
def user_query(app_request):
    # first round
    # create chat completion with prompt and tool choice
    request = ...
    response = send(request)

    if response["finish_reason"] == "tool_calls":
        # second round
        function, params = parse_tool_call(response)
        function_result = function(params)
        # create request with prompt, assistant response, and function result
"content": "All systems are green—everything’s up and running smoothly! 🚀 Let me know if you need anything else.",
"role": "assistant",
"reasoning_content": "The user asks: \"Hey, quick check: is everything up and running?\" We have just checked system health, it's ok. Provide friendly response confirming everything's up."