---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: GPT-OSS
---

Dynamo supports disaggregated serving of gpt-oss-120b with vLLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single H100 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.

## Overview

This deployment uses disaggregated serving in vLLM where:

- **Prefill Worker**: Processes input prompts efficiently using 4 GPUs with tensor parallelism
- **Decode Worker**: Generates output tokens using 4 GPUs, optimized for token generation throughput
- **Frontend**: Provides an OpenAI-compatible API endpoint with round-robin routing

## Prerequisites

This guide assumes you already know how to deploy Dynamo disaggregated serving with vLLM, as illustrated in [the vLLM Backend guide](README.md).

## Instructions

### 1. Launch the Deployment

GPT-OSS is a reasoning model with tool calling support. To ensure responses are processed correctly, launch each worker with the appropriate `--dyn-reasoning-parser` and `--dyn-tool-call-parser` options.

**Start frontend**

```bash
python3 -m dynamo.frontend --http-port 8000 &
```

**Run decode worker**

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m dynamo.vllm \
  --model openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --dyn-reasoning-parser gpt_oss \
  --dyn-tool-call-parser harmony
```

**Run prefill worker**

```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m dynamo.vllm \
  --model openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --disaggregation-mode prefill \
  --dyn-reasoning-parser gpt_oss \
  --dyn-tool-call-parser harmony
```

### 2. Verify the Deployment is Ready

Poll the `/health` endpoint to verify that both the prefill and decode worker endpoints have started:

```bash
curl http://localhost:8000/health
```

Make sure that both of the `generate` endpoints are available before sending an inference request:

```
{
  "status": "healthy",
  "endpoints": [
    "dyn://dynamo.backend.generate"
  ],
  "instances": [
    {
      "component": "backend",
      "endpoint": "generate",
      "namespace": "dynamo",
      "instance_id": 7587889712474989333,
      "transport": {
        "nats_tcp": "dynamo_backend.generate-694d997dbae9a315"
      }
    },
    {
      "component": "prefill",
      "endpoint": "generate",
      "namespace": "dynamo",
      "instance_id": 7587889712474989350,
      "transport": {
        "nats_tcp": "dynamo_prefill.generate-694d997dbae9a326"
      }
    },
    ...
  ]
}
```

If only one worker endpoint is listed, the other may still be starting up. Monitor the worker logs to track startup progress.

### 3. Test the Deployment

Send a test request to verify the deployment:

```bash
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "input": "Explain the concept of disaggregated serving in LLM inference in 3 sentences.",
    "max_output_tokens": 200,
    "stream": false
  }'
```

The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters such as `max_output_tokens` and `temperature` according to your needs.

### 4. Reasoning and Tool Calling

Dynamo supports reasoning and tool calling in the OpenAI Chat Completions endpoint. In a typical application built on top of Dynamo, the application provides a set of tools that help the assistant give accurate answers. The interaction is usually multi-turn, because it involves tool selection followed by generation based on the tool result.
Below is an example of sending multi-round requests to complete a user query with reasoning and tool calling:

**Application setup (pseudocode)**

```Python
# The tool defined by the application
def get_system_health():
    for component in system.components:
        if not component.health():
            return False
    return True

# The declaration of the tool in Chat Completions tool style
# (serialized as JSON inside the request body)
tool_declaration = {
    "type": "function",
    "function": {
        "name": "get_system_health",
        "description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
        "parameters": {
            "type": "object",
            "properties": {}
        }
    }
}

# On user query, perform the workflow below.
def user_query(app_request):
    # First round:
    # create a chat completion request with the prompt and the tool declaration
    request = ...
    response = send(request)

    if response["choices"][0]["finish_reason"] == "tool_calls":
        # Second round: run the tool that the model selected
        function, params = parse_tool_call(response)
        function_result = function(**params)
        # Create a request with the prompt, the assistant response, and the function result
        request = ...
        response = send(request)
    return app_response(response)
```

**First request with tools**

```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
{
  "model": "openai/gpt-oss-120b",
  "messages": [
    {
      "role": "user",
      "content": "Hey, quick check: is everything up and running?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_system_health",
        "description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
        "parameters": {
          "type": "object",
          "properties": {}
        }
      }
    }
  ],
  "response_format": {
    "type": "text"
  },
  "stream": false,
  "max_tokens": 300
}'
```

**First response with tool call**

```JSON
{
  "id": "chatcmpl-d1c12219-6298-4c83-a6e3-4e7cef16e1a9",
  "choices": [
    {
      "index": 0,
      "message": {
        "tool_calls": [
          {
            "id": "call-1",
            "type": "function",
            "function": {
              "name": "get_system_health",
              "arguments": "{}"
            }
          }
        ],
        "role": "assistant",
        "reasoning_content": "We need to check system health. Use function."
      },
      "finish_reason": "tool_calls"
    }
  ],
  "created": 1758758741,
  "model": "openai/gpt-oss-120b",
  "object": "chat.completion",
  "usage": null
}
```

**Second request with tool calling result**

```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
{
  "model": "openai/gpt-oss-120b",
  "messages": [
    {
      "role": "user",
      "content": "Hey, quick check: is everything up and running?"
    },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "call-1",
          "type": "function",
          "function": {
            "name": "get_system_health",
            "arguments": "{}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call-1",
      "content": "{\"status\":\"ok\",\"uptime_seconds\":372045}"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_system_health",
        "description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
        "parameters": {
          "type": "object",
          "properties": {}
        }
      }
    }
  ],
  "response_format": {
    "type": "text"
  },
  "stream": false,
  "max_tokens": 300
}'
```

**Second response with final message**

```JSON
{
  "id": "chatcmpl-9ebfe64a-68b9-4c1d-9742-644cf770ad0e",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "All systems are green—everything’s up and running smoothly! 🚀 Let me know if you need anything else.",
        "role": "assistant",
        "reasoning_content": "The user asks: \"Hey, quick check: is everything up and running?\" We have just checked system health, it's ok. Provide friendly response confirming everything's up."
      },
      "finish_reason": "stop"
    }
  ],
  "created": 1758758853, 
  "model": "openai/gpt-oss-120b",
  "object": "chat.completion",
  "usage": null
}
```
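
**Building the second-round messages (sketch)**

The second request repeats the original user message, then appends the assistant's tool call and a `tool` message carrying the result. The following is a minimal Python sketch of that bookkeeping, assuming the response has the same shape as the first response in this guide. The helper name `build_followup_messages` and the `tools` registry are hypothetical and are not part of Dynamo; the stand-in tool simply returns the result used in the second request.

```python
import json


def build_followup_messages(messages, response, tools):
    """Return the message list for the next round, or None if no tool was requested.

    `messages` is the list sent in the previous request, `response` is the parsed
    chat completion JSON, and `tools` maps tool names to Python callables
    (a hypothetical registry owned by the application).
    """
    choice = response["choices"][0]
    if choice["finish_reason"] != "tool_calls":
        return None

    tool_calls = choice["message"]["tool_calls"]
    # Echo the assistant turn back so the model sees which calls it made.
    followup = list(messages) + [{"role": "assistant", "tool_calls": tool_calls}]

    for call in tool_calls:
        name = call["function"]["name"]
        # Arguments arrive as a JSON-encoded string, e.g. "{}".
        arguments = json.loads(call["function"]["arguments"] or "{}")
        result = tools[name](**arguments)
        followup.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result, separators=(",", ":")),
            }
        )
    return followup


# Example input: the first request and response from this guide,
# trimmed to the fields the helper reads.
first_messages = [
    {"role": "user", "content": "Hey, quick check: is everything up and running?"}
]
first_response = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call-1",
                        "type": "function",
                        "function": {"name": "get_system_health", "arguments": "{}"},
                    }
                ],
            },
            "finish_reason": "tool_calls",
        }
    ]
}
# Stand-in tool implementation (assumption): returns a fixed health report.
tools = {"get_system_health": lambda: {"status": "ok", "uptime_seconds": 372045}}

second_messages = build_followup_messages(first_messages, first_response, tools)
```

The returned list can be sent as the `messages` field of the second request, together with the same `tools` array. When the model answers directly (`finish_reason` is not `tool_calls`), the helper returns `None` and no second round is needed.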