Commit f9050aae authored by Jonathan Tong, committed by GitHub

docs: migrate existing docs to fern (#5445)


Signed-off-by: Jont828 <jt572@cornell.edu>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Co-authored-by: Neal Vaidya <nealv@nvidia.com>
parent f238d23a
{
"organization": "ai-dynamo",
"version": "3.29.1"
}
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Tool Calling with Dynamo"
---
You can connect Dynamo to external tools and services using function calling (also known as tool calling). When you provide a list of available functions, the model served by Dynamo can choose
to output arguments for the relevant function(s), which you can then execute to augment the prompt with relevant external information.
Tool calling is controlled using the `tool_choice` and `tools` request parameters.
## Prerequisites
To enable this feature, set the following flag when launching the backend worker:
- `--dyn-tool-call-parser`: selects the tool call parser. List the available parsers with the command below.
```bash
# <backend> can be vllm, sglang, trtllm, etc. based on your installation
python -m dynamo.<backend> --help
```
<Note>
If no tool call parser is provided by the user, Dynamo will try to use default tool call parsing based on `<TOOLCALL>` and `<|python_tag|>` tool tags.
</Note>
<Tip>
If your model's default chat template doesn't support tool calling, but the model itself does, you can specify a custom chat template per worker
with `python -m dynamo.<backend> --custom-jinja-template </path/to/template.jinja>`.
</Tip>
### Parser to Model Mapping
| Parser Name | Supported Models |
|-------------|-----------------------------------------------------------------------|
| hermes | Qwen/Qwen2.5-*, Qwen/QwQ-32B, NousResearch/Hermes-2-Pro-*, NousResearch/Hermes-2-Theta-*, NousResearch/Hermes-3-* |
| mistral | mistralai/Mistral-7B-Instruct-v0.3; other Mistral function-calling models are compatible as well |
| llama3_json | meta-llama/Llama-3.1-*, meta-llama/Llama-3.2-* |
| harmony | openai/gpt-oss-* |
| nemotron_deci | nvidia/nemotron-* |
| phi4 | Phi-4-* |
| deepseek_v3 | deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-R1-0528 |
| deepseek_v3_1 | deepseek-ai/DeepSeek-V3.1 |
| pythonic | meta-llama/Llama-4-* |
| jamba | ai21labs/AI21-Jamba-*-1.5, ai21labs/AI21-Jamba-*-1.6, ai21labs/AI21-Jamba-*-1.7 |
## Examples
### Launch Dynamo Frontend and Backend
```bash
# launch backend worker
python -m dynamo.vllm --model openai/gpt-oss-20b --dyn-tool-call-parser harmony
# launch frontend worker
python -m dynamo.frontend
```
### Tool Calling Request Examples
- Example 1
```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8081/v1", api_key="dummy")


def get_weather(location: str, unit: str):
    return f"Getting the weather for {location} in {unit}..."


# Map tool names to the local functions that implement them.
tool_functions = {"get_weather": get_weather}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location", "unit"]
        }
    }
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather like in San Francisco in Celsius?"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=10000
)
print(f"{response}")

# Execute the first tool call requested by the model.
tool_call = response.choices[0].message.tool_calls[0].function
print(f"Function called: {tool_call.name}")
print(f"Arguments: {tool_call.arguments}")
print(f"Result: {tool_functions[tool_call.name](**json.loads(tool_call.arguments))}")
```
- Example 2
```python
# Use tools defined in example 1
time_tool = {
    "type": "function",
    "function": {
        "name": "get_current_time_nyc",
        "description": "Get the current time in NYC.",
        "parameters": {}
    }
}
tools.append(time_tool)

messages = [
    {"role": "user", "content": "What's the current time in New York?"}
]

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # client.models.list().data[1].id,
    messages=messages,
    tools=tools,
    tool_choice="auto",
    max_tokens=100,
)
print(f"{response}")

tool_call = response.choices[0].message.tool_calls[0].function
print(f"Function called: {tool_call.name}")
print(f"Arguments: {tool_call.arguments}")
```
- Example 3
```python
# Uses the client defined in example 1
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_tourist_attractions",
            "description": "Get a list of top tourist attractions for a given city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The name of the city to find attractions for.",
                    }
                },
                "required": ["city"],
            },
        },
    },
]


def get_messages():
    return [
        {
            "role": "user",
            "content": (
                "I'm planning a trip to Tokyo next week. What are some top tourist attractions in Tokyo?"
            ),
        },
    ]


messages = get_messages()

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    max_tokens=100,
)
print(f"{response}")

tool_call = response.choices[0].message.tool_calls[0].function
print(f"Function called: {tool_call.name}")
print(f"Arguments: {tool_call.arguments}")
```
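The examples above index `tool_calls[0]` directly, which fails when the model answers in plain text and returns no tool calls. The following is a minimal sketch of a more defensive dispatch step; it is not part of Dynamo. The helper `run_tool_calls` and the hand-built `fake_message` (a stand-in for `response.choices[0].message`, so the sketch runs without a server) are hypothetical. The `tool`-role message shape follows the OpenAI chat completions convention.

```python
import json
from types import SimpleNamespace


def get_weather(location: str, unit: str) -> str:
    return f"Getting the weather for {location} in {unit}..."


def run_tool_calls(message, registry):
    """Run every tool call on an assistant message and return `tool` role messages."""
    results = []
    # `tool_calls` is None (or empty) when the model answered in plain text.
    for call in message.tool_calls or []:
        func = registry.get(call.function.name)
        if func is None:
            content = f"error: unknown tool {call.function.name!r}"
        else:
            try:
                # Arguments arrive as a JSON string and may be malformed.
                content = str(func(**json.loads(call.function.arguments or "{}")))
            except (json.JSONDecodeError, TypeError) as exc:
                content = f"error: bad arguments: {exc}"
        results.append({"role": "tool", "tool_call_id": call.id, "content": content})
    return results


# Hypothetical stand-in for `response.choices[0].message`.
fake_message = SimpleNamespace(
    content=None,
    tool_calls=[
        SimpleNamespace(
            id="call_0",
            function=SimpleNamespace(
                name="get_weather",
                arguments='{"location": "San Francisco, CA", "unit": "celsius"}',
            ),
        )
    ],
)

tool_messages = run_tool_calls(fake_message, {"get_weather": get_weather})
print(tool_messages[0]["content"])
```

In a real exchange, the returned messages would be appended to `messages` after the assistant message, and the request sent again so the model can produce its final answer.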
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Dynamo NIXL Connect"
---
Dynamo NIXL Connect specializes in moving data between models/workers in a Dynamo graph, particularly for use cases where memory registration and memory regions need to be dynamic.
It provides utilities for such use cases through a set of Python classes built on the NIXL-based I/O subsystem.
The relaxed registration comes with some performance overhead, but simplifies the integration process.
For larger data transfer operations, such as those between models in a multi-model graph, the overhead should be marginal.
The `dynamo.nixl_connect` library can be imported by any application hosted in a Dynamo container.
<Note>
Dynamo NIXL Connect will pick the best method of data transfer available to it.
The available methods depend on the hardware and software configuration of the machines and network running the graph.
GPU Direct RDMA operations require that both ends of the operation have:
- NIC and GPU capable of performing RDMA operations
- Device drivers that support GPU-NIC direct interactions (aka "zero copy") and RDMA operations
- Network that supports InfiniBand or RoCE
If any of the above is not satisfied, GPU Direct RDMA will not be available to the graph's workers, and less-optimal methods will be used to ensure basic functionality.
For additional information, please read this [GPUDirect RDMA](https://docs.nvidia.com/cuda/pdf/GPUDirect_RDMA.pdf) document.
</Note>
```python
import dynamo.nixl_connect
```
All operations using the NIXL Connect library begin with the [`Connector`](connector.md) class and the type of operation required.
There are four types of supported operations:
1. **Register local readable memory**:
Register local memory buffer(s) with the NIXL subsystem to enable a remote worker to read from.
2. **Register local writable memory**:
Register local memory buffer(s) with the NIXL subsystem to enable a remote worker to write to.
3. **Read from registered, remote memory**:
Read remote memory buffer(s), registered by a remote worker to be readable, into local memory buffer(s).
4. **Write to registered, remote memory**:
Write local memory buffer(s) to remote memory buffer(s) registered by a remote worker to be writable.
When available, high-throughput GPU Direct RDMA data transfers can be completed by connecting correctly paired operations.
Given the list above, the correct pairings are 1 & 3 or 2 & 4:
one side is a "(read|write)-able operation" and the other is its correctly paired "(read|write) operation".
Specifically, a read operation must be paired with a readable operation, and a write operation must be paired with a writable operation.
```mermaid
sequenceDiagram
participant LocalWorker
participant RemoteWorker
participant NIXL
LocalWorker ->> NIXL: Register memory (Descriptor)
RemoteWorker ->> NIXL: Register memory (Descriptor)
LocalWorker ->> LocalWorker: Create Readable/WritableOperation
LocalWorker ->> RemoteWorker: Send NIXL metadata (via HTTP/TCP+NATS)
RemoteWorker ->> NIXL: Begin Read/WriteOperation with metadata
NIXL -->> RemoteWorker: Data transfer
RemoteWorker -->> LocalWorker: Notify completion (unblock awaiter)
```
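The handshake in the diagram can be illustrated with a toy, in-process simulation. This is only a sketch under loud assumptions: `ToyWritable`, `remote_worker`, and `local_worker` are hypothetical names, an `asyncio.Queue` stands in for the HTTP or TCP+NATS sideband, and a plain `bytearray` copy stands in for the NIXL transfer. None of it uses or mirrors the actual `dynamo.nixl_connect` implementation.

```python
import asyncio


class ToyWritable:
    """Toy stand-in for a writable operation: a buffer plus a completion event."""

    def __init__(self, size: int) -> None:
        self.buffer = bytearray(size)
        self._done = asyncio.Event()

    def metadata(self) -> dict:
        # Real NIXL metadata carries addresses and keys; here it is just a handle.
        return {"kind": "writable", "target": self}

    def complete(self) -> None:
        self._done.set()

    async def wait_for_completion(self) -> None:
        await self._done.wait()


async def remote_worker(sideband: asyncio.Queue) -> None:
    metadata = await sideband.get()          # metadata arrives out of band
    if metadata["kind"] != "writable":       # a write must pair with a writable
        raise ValueError("write requires writable metadata")
    target = metadata["target"]
    payload = b"embeddings"
    target.buffer[: len(payload)] = payload  # stands in for the data transfer
    target.complete()                        # notify the waiting side


async def local_worker() -> bytes:
    sideband: asyncio.Queue = asyncio.Queue()
    op = ToyWritable(size=10)                # expose local memory for writing
    remote = asyncio.create_task(remote_worker(sideband))
    await sideband.put(op.metadata())        # send metadata to the remote side
    await op.wait_for_completion()           # unblocked once the write lands
    await remote
    return bytes(op.buffer)


result = asyncio.run(local_worker())
print(result)
```

The point of the sketch is the ordering: memory is exposed first, metadata travels over a separate channel, and the exposing side simply awaits completion while the other side drives the transfer.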
## Examples
### Generic Example
In the diagram below, Local creates a [`WritableOperation`](writable-operation.md) intended to receive data from Remote.
Local then sends metadata about the requested operation to Remote.
Remote then uses the metadata to create a [`WriteOperation`](write-operation.md) which will perform the GPU Direct RDMA memory transfer, when available, from Remote's GPU memory to Local's GPU memory.
```mermaid
---
title: Write Operation Between Two Workers (RDMA available)
---
flowchart LR
c1[Remote] --"3: .begin_write()"--- WriteOperation
WriteOperation e1@=="4: GPU Direct RDMA"==> WritableOperation
WritableOperation --"1: .create_writable()"--- c2[Local]
c2 e2@--"2: RDMA Metadata via HTTP"--> c1
e1@{ animate: true; }
e2@{ animate: true; }
```
<Note>
When RDMA isn't available, the NIXL data transfer will still complete using non-accelerated methods.
</Note>
### Multimodal Example
In the case of the [Dynamo Multimodal Disaggregated Example](../../multimodal/vllm.md):
1. The HTTP frontend accepts a text prompt and a URL to an image.
2. The prompt and URL are then enqueued with the Processor before being dispatched to the first available Decode Worker.
3. Decode Worker then requests a Prefill Worker to provide key-value data for the LLM powering the Decode Worker.
4. Prefill Worker then requests that the image be processed and provided as embeddings by the Encode Worker.
5. Encode Worker acquires the image, processes it, performs inference on the image using a specialized vision model, and finally provides the embeddings to Prefill Worker.
6. Prefill Worker receives the embeddings from Encode Worker and generates a key-value cache (KV$) update for Decode Worker's LLM and writes the update directly to the GPU memory reserved for the data.
7. Finally, Decode Worker performs the requested inference.
```mermaid
---
title: Multimodal Disaggregated Workflow
---
flowchart LR
p0[HTTP Frontend] i0@--"text prompt"-->p1[Processor]
p0 i1@--"url"-->p1
p1 i2@--"prompt"-->dw[Decode Worker]
p1 i3@--"url"-->dw
dw i4@--"prompt"-->pw[Prefill Worker]
dw i5@--"url"-->pw
pw i6@--"url"-->ew[Encode Worker]
ew o0@=="image embeddings"==>pw
pw o1@=="kv_cache updates"==>dw
dw o2@--"inference results"-->p0
i0@{ animate: true; }
i1@{ animate: true; }
i2@{ animate: true; }
i3@{ animate: true; }
i4@{ animate: true; }
i5@{ animate: true; }
i6@{ animate: true; }
o0@{ animate: true; }
o1@{ animate: true; }
o2@{ animate: true; }
```
<Note>
In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo NIXL Connect library.
The KV Cache transfer between Decode Worker and Prefill Worker utilizes a different connector that also uses the NIXL-based I/O subsystem underneath.
</Note>
#### Code Examples
See [MultimodalPDWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) or [MultimodalDecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) from our Multimodal example,
for how they coordinate directly with the Encode Worker by creating a [`WritableOperation`](writable-operation.md),
sending the operation's metadata via Dynamo's round-robin dispatcher, and awaiting the operation for completion before making use of the transferred data.
See [MultimodalEncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) from our Multimodal example,
for how the resulting embeddings are registered with the NIXL subsystem by creating a [`Descriptor`](descriptor.md),
a [`WriteOperation`](write-operation.md) is created using the metadata provided by the requesting worker,
and the worker awaits completion of the data transfer before yielding a response.
## Python Classes
- [Connector](connector.md)
- [Descriptor](descriptor.md)
- [Device](device.md)
- [ReadOperation](read-operation.md)
- [ReadableOperation](readable-operation.md)
- [WritableOperation](writable-operation.md)
- [WriteOperation](write-operation.md)
## References
- [NVIDIA Dynamo](https://developer.nvidia.com/dynamo) @ [GitHub](https://github.com/ai-dynamo/dynamo)
- [NVIDIA Inference Transfer Library (NIXL)](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/#nvidia_inference_transfer_library_nixl_low-latency_hardware-agnostic_communication%C2%A0) @ [GitHub](https://github.com/ai-dynamo/nixl)
- [Dynamo Multimodal Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal)
- [NVIDIA GPU Direct](https://developer.nvidia.com/gpudirect)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.Connector"
---
Core class for managing the connection between workers in a distributed environment.
Use this class to create readable and writable operations, or read and write data to remote workers.
This class provides a "pythonic" interface to the NIXL library, enabling data transfers between models hosted by different workers in a Dynamo graph that are accelerated by GPU Direct RDMA when available.
The connector provides two methods of moving data between workers:
- Preparing local memory to be written to by a remote worker.
- Preparing local memory to be read by a remote worker.
In both cases, local memory is registered with the NIXL-based I/O subsystem via the [`Descriptor`](descriptor.md) class and provided to the connector.
When RDMA is available, the connector then configures the RDMA subsystem to expose the memory for the requested operation and returns an operation control object;
otherwise the connector will select the best available RDMA alternative.
The operation control object, either a [`ReadableOperation`](readable-operation.md) or a [`WritableOperation`](writable-operation.md),
provides NIXL metadata ([RdmaMetadata](rdma-metadata.md)) via its `.metadata()` method, functionality to query the operation's current state, as well as the ability to cancel the operation prior to its completion.
The NIXL metadata must be provided to the remote worker expected to complete the operation.
The metadata contains required information (identifiers, keys, etc.) which enables the remote worker to interact with the provided memory.
<Warning>
NIXL metadata contains a worker's address as well as security keys to access specific registered memory descriptors.
This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly.
</Warning>
## Example Usage
```python
@async_on_start
async def async_init(self):
    self.connector = dynamo.nixl_connect.Connector()
```
<Tip>
See [`ReadOperation`](read-operation.md#example-usage), [`ReadableOperation`](readable-operation.md#example-usage),
[`WritableOperation`](writable-operation.md#example-usage), and [`WriteOperation`](write-operation.md#example-usage)
for additional examples.
</Tip>
## Methods
### `begin_read`
```python
async def begin_read(
    self,
    remote_metadata: RdmaMetadata,
    local_descriptors: Descriptor | list[Descriptor],
) -> ReadOperation:
```
Creates a [`ReadOperation`](read-operation.md) for transferring data from a remote worker.
To create the operation, provide the serialized request from a remote worker's [`ReadableOperation`](readable-operation.md)
together with a matching set of local memory descriptors that reference the memory intended to receive data from the remote worker.
The serialized request must be transferred from the remote to the local worker via a secondary channel, most likely HTTP or TCP+NATS.
Once created, data transfer will begin immediately.
Disposal of the object will instruct the NIXL subsystem to cancel the operation,
therefore the operation should be awaited until completed unless cancellation is intended.
Use [`.wait_for_completion()`](read-operation.md#wait_for_completion) to block the caller until the operation has completed or encountered an error.
### `begin_write`
```python
async def begin_write(
    self,
    local_descriptors: Descriptor | list[Descriptor],
    remote_metadata: RdmaMetadata,
) -> WriteOperation:
```
Creates a [`WriteOperation`](write-operation.md) for transferring data to a remote worker.
To create the operation, provide the serialized request from a remote worker's [`WritableOperation`](writable-operation.md)
together with a matching set of local memory descriptors that reference the memory to be transferred to the remote worker.
The serialized request must be transferred from the remote to the local worker via a secondary channel, most likely HTTP or TCP+NATS.
Once created, data transfer will begin immediately.
Disposal of the object will instruct the NIXL subsystem to cancel the operation,
therefore the operation should be awaited until completed unless cancellation is intended.
Use [`.wait_for_completion()`](write-operation.md#wait_for_completion) to block the caller until the operation has completed or encountered an error.
### `create_readable`
```python
async def create_readable(
    self,
    local_descriptors: Descriptor | list[Descriptor],
) -> ReadableOperation:
```
Creates a [`ReadableOperation`](readable-operation.md) for transferring data to a remote worker.
To create the operation, a set of local memory descriptors must be provided that reference memory intended to be transferred to a remote worker.
Once created, the memory referenced by the provided descriptors becomes immediately readable by a remote worker with the necessary metadata.
The metadata required to access the memory referenced by the provided descriptors is accessible via the operation's `.metadata()` method.
Once acquired, the metadata needs to be provided to a remote worker via a secondary channel, most likely HTTP or TCP+NATS.
Disposal of the object will instruct the NIXL subsystem to cancel the operation,
therefore the operation should be awaited until completed unless cancellation is intended.
Use [`.wait_for_completion()`](readable-operation.md#wait_for_completion) to block the caller until the operation has completed or encountered an error.
### `create_writable`
```python
async def create_writable(
    self,
    local_descriptors: Descriptor | list[Descriptor],
) -> WritableOperation:
```
Creates a [`WritableOperation`](writable-operation.md) for transferring data from a remote worker.
To create the operation, a set of local memory descriptors must be provided which reference memory intended to receive data from a remote worker.
Once created, the memory referenced by the provided descriptors becomes immediately writable by a remote worker with the necessary metadata.
The metadata required to access the memory referenced by the provided descriptors is accessible via the operation's `.metadata()` method.
Once acquired, the metadata needs to be provided to a remote worker via a secondary channel, most likely HTTP or TCP+NATS.
Disposal of the object will instruct the NIXL subsystem to cancel the operation,
therefore the operation should be awaited until completed unless cancellation is intended.
Use [`.wait_for_completion()`](writable-operation.md#wait_for_completion) to block the caller until the operation has completed or encountered an error.
## Properties
### `hostname`
```python
@property
def hostname(self) -> str:
```
Gets the name of the current worker's host.
### `is_cuda_available`
```python
@cached_property
def is_cuda_available(self) -> bool:
```
Gets `True` when CUDA is available for the selected array module (most likely CuPy); otherwise `False`.
### `name`
```python
@property
def name(self) -> str | None:
```
Gets the Dynamo component name used by the connector.
## Related Classes
- [Descriptor](descriptor.md)
- [Device](device.md)
- [OperationStatus](operation-status.md)
- [RdmaMetadata](rdma-metadata.md)
- [ReadOperation](read-operation.md)
- [ReadableOperation](readable-operation.md)
- [WritableOperation](writable-operation.md)
- [WriteOperation](write-operation.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.Descriptor"
---
Memory descriptor that ensures memory is registered with the NIXL-based I/O subsystem.
Memory must be registered with the NIXL subsystem to enable interaction with the memory.
Descriptor objects are administrative and do not copy, move, or otherwise modify the registered memory.
There are four ways to create a descriptor:
1. From a `torch.Tensor` object. Device information will be derived from the provided object.
2. From a `tuple` containing either a NumPy or CuPy `ndarray` and information describing where the memory resides (Host/CPU vs GPU).
3. From a Python `bytes` object. Memory is assumed to reside in CPU addressable host memory.
4. From a `tuple` comprised of the address of the memory, its size in bytes, and device information.
An optional reference to a Python object can be provided to avoid garbage collection issues.
## Methods
### `register_memory`
```python
def register_memory(self, connector: Connector) -> None:
```
Instructs the descriptor to register its memory buffer with the NIXL-based I/O subsystem.
Calling this method more than once on the same descriptor has no effect.
When the descriptor is assigned to a NIXL operation, it will be automatically registered if it was not explicitly registered.
## Properties
### `device`
```python
@property
def device(self) -> Device:
```
Gets a reference to the [`Device`](device.md) that contains the buffer the descriptor represents.
### `size`
```python
@property
def size(self) -> int:
```
Gets the size of the memory allocation the descriptor represents.
## Related Classes
- [Connector](connector.md)
- [Device](device.md)
- [OperationStatus](operation-status.md)
- [RdmaMetadata](rdma-metadata.md)
- [ReadOperation](read-operation.md)
- [ReadableOperation](readable-operation.md)
- [WritableOperation](writable-operation.md)
- [WriteOperation](write-operation.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.DeviceKind(IntEnum)"
---
Identifies the kind of device that a [`Device`](device.md) object represents.
## Values
### `CUDA`
CUDA addressable device (GPU) memory.
### `HOST`
System (CPU) memory.
## Related Classes
- [Connector](connector.md)
- [Descriptor](descriptor.md)
- [Device](device.md)
- [OperationStatus](operation-status.md)
- [RdmaMetadata](rdma-metadata.md)
- [ReadOperation](read-operation.md)
- [ReadableOperation](readable-operation.md)
- [WritableOperation](writable-operation.md)
- [WriteOperation](write-operation.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.Device"
---
The `Device` class describes the device on which a given allocation resides,
usually host (`"cpu"`) or GPU (`"cuda"`) memory.
When a system contains multiple GPU devices, a specific GPU can be identified by including its ordinal index number.
For example, `"cuda:1"` references the second GPU in a system.
By default, `"cuda"` is assumed to mean `"cuda:0"`, the first GPU enumerated by the system.
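The naming rules above can be sketched as a small parser. This is an illustration only: `parse_device` and `ToyDeviceKind` are hypothetical names, the enum's numeric values are arbitrary, and the real `Device` class may validate its input differently.

```python
from enum import IntEnum
from typing import Tuple


class ToyDeviceKind(IntEnum):
    # Hypothetical stand-in for DeviceKind; the numeric values are arbitrary.
    HOST = 0
    CUDA = 1


def parse_device(spec: str) -> Tuple[ToyDeviceKind, int]:
    """Split a device string such as "cuda:1" into (kind, ordinal)."""
    name, _, index = spec.strip().lower().partition(":")
    if index and not index.isdigit():
        raise ValueError(f"invalid device index in {spec!r}")
    if name == "cpu":
        # Host memory always reports ordinal 0.
        return (ToyDeviceKind.HOST, 0)
    if name == "cuda":
        # A bare "cuda" defaults to the first GPU.
        return (ToyDeviceKind.CUDA, int(index) if index else 0)
    raise ValueError(f"unsupported device: {spec!r}")


print(parse_device("cuda"))
print(parse_device("cuda:1"))
print(parse_device("cpu"))
```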
## Properties
### `id`
```python
@property
def id(self) -> int:
```
Gets the identity, or ordinal, of the device.
When the device is the [`HOST`](device-kind.md#host), this value is always `0`.
When the device is a [`GPU`](device-kind.md#cuda), this value identifies a specific GPU.
### `kind`
```python
@property
def kind(self) -> DeviceKind:
```
Gets the [`DeviceKind`](device-kind.md) of the device that the instance references.
## Related Classes
- [Connector](connector.md)
- [Descriptor](descriptor.md)
- [OperationStatus](operation-status.md)
- [ReadOperation](read-operation.md)
- [ReadableOperation](readable-operation.md)
- [RdmaMetadata](rdma-metadata.md)
- [WritableOperation](writable-operation.md)
- [WriteOperation](write-operation.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.OperationStatus(IntEnum)"
---
Represents the current state or status of an operation.
## Values
### `CANCELLED`
The operation has been cancelled by the user or system.
### `COMPLETE`
The operation has been completed successfully.
### `ERRORED`
The operation has encountered an error and cannot be completed.
### `IN_PROGRESS`
The operation has been initialized and is in progress (not completed, errored, or cancelled).
### `INITIALIZED`
The operation has been initialized and is ready to be processed.
### `UNINITIALIZED`
The operation has not been initialized yet and is not in a valid state.
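When polling an operation's status, it can help to separate states that may still change from states that are final. The sketch below assumes that `COMPLETE`, `CANCELLED`, and `ERRORED` are the final states, which is inferred from the descriptions above rather than confirmed by the library. `ToyOperationStatus` and `is_terminal` are hypothetical names, and the numeric values and member order are placeholders.

```python
from enum import IntEnum, auto


class ToyOperationStatus(IntEnum):
    # Member names mirror this page; the numeric values are placeholders.
    UNINITIALIZED = auto()
    INITIALIZED = auto()
    IN_PROGRESS = auto()
    COMPLETE = auto()
    CANCELLED = auto()
    ERRORED = auto()


# Assumption: these states are final and will not change again.
TERMINAL_STATES = frozenset({
    ToyOperationStatus.COMPLETE,
    ToyOperationStatus.CANCELLED,
    ToyOperationStatus.ERRORED,
})


def is_terminal(status: ToyOperationStatus) -> bool:
    """Return True when no further state change is expected."""
    return status in TERMINAL_STATES


for status in ToyOperationStatus:
    print(status.name, is_terminal(status))
```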
## Related Classes
- [Connector](connector.md)
- [Descriptor](descriptor.md)
- [Device](device.md)
- [RdmaMetadata](rdma-metadata.md)
- [ReadOperation](read-operation.md)
- [ReadableOperation](readable-operation.md)
- [WritableOperation](writable-operation.md)
- [WriteOperation](write-operation.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.RdmaMetadata"
---
A Pydantic type intended to provide JSON serialized NIXL metadata about a [`ReadableOperation`](readable-operation.md) or [`WritableOperation`](writable-operation.md) object.
NIXL metadata contains detailed information about a worker process and how to access memory regions registered with the corresponding agent.
This data is required to perform data transfers using the NIXL-based I/O subsystem.
<Warning>
NIXL metadata contains information to connect corresponding backends across agents, as well as identification keys to access specific registered memory regions.
This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly.
</Warning>
Use the respective class's `.metadata()` method to generate an `RdmaMetadata` object for an operation.
<Tip>
Classes using `RdmaMetadata` objects must be paired correctly.
[`ReadableOperation`](readable-operation.md) with [`ReadOperation`](read-operation.md), and
[`WritableOperation`](writable-operation.md) with [`WriteOperation`](write-operation.md).
Incorrect pairing will result in an error being raised.
</Tip>
## Related Classes
- [Connector](connector.md)
- [Descriptor](descriptor.md)
- [Device](device.md)
- [OperationStatus](operation-status.md)
- [ReadOperation](read-operation.md)
- [ReadableOperation](readable-operation.md)
- [WritableOperation](writable-operation.md)
- [WriteOperation](write-operation.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.ReadOperation"
---
An operation which transfers data from a remote worker to the local worker.
To create the operation, NIXL metadata ([RdmaMetadata](rdma-metadata.md)) from a remote worker's [`ReadableOperation`](readable-operation.md)
along with a matching set of local [`Descriptor`](descriptor.md) objects which reference memory intended to receive data from the remote worker must be provided.
The NIXL metadata must be transferred from the remote to the local worker via a secondary channel, most likely HTTP or TCP+NATS.
Once created, data transfer will begin immediately.
Disposal of the object will instruct the NIXL subsystem to cancel the operation,
therefore the operation should be awaited until completed unless cancellation is intended.
## Example Usage
```python
async def read_from_remote(
    self,
    remote_metadata: dynamo.nixl_connect.RdmaMetadata,
    local_tensor: torch.Tensor
) -> None:
    descriptor = dynamo.nixl_connect.Descriptor(local_tensor)

    with await self.connector.begin_read(remote_metadata, descriptor) as read_op:
        # Wait for the operation to complete writing data from the remote worker to local_tensor.
        await read_op.wait_for_completion()
```
## Methods
### `cancel`
```python
def cancel(self) -> None:
```
Instructs the NIXL subsystem to cancel the operation.
Completed operations cannot be cancelled.
### `wait_for_completion`
```python
async def wait_for_completion(self) -> None:
```
Blocks the caller until the memory from the remote worker has been transferred to the provided buffers.
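Because `wait_for_completion()` has no documented timeout, a caller may want to bound the wait. The following is a minimal sketch that uses the standard library's `asyncio.wait_for` and falls back to `cancel()` on timeout. `wait_or_cancel` and `StubOperation` are hypothetical; the stub only imitates the two methods documented on this page so the sketch runs without Dynamo. Whether the real operation tolerates having its awaitable cancelled this way is an assumption to verify.

```python
import asyncio


async def wait_or_cancel(operation, timeout_s: float) -> bool:
    """Return True if the operation finished in time; else cancel it and return False."""
    try:
        await asyncio.wait_for(operation.wait_for_completion(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        operation.cancel()
        return False


class StubOperation:
    """Hypothetical stand-in exposing only wait_for_completion() and cancel()."""

    def __init__(self, duration_s: float) -> None:
        self.duration_s = duration_s
        self.cancelled = False

    async def wait_for_completion(self) -> None:
        await asyncio.sleep(self.duration_s)

    def cancel(self) -> None:
        self.cancelled = True


async def demo():
    fast, slow = StubOperation(0.01), StubOperation(5.0)
    finished_fast = await wait_or_cancel(fast, 1.0)
    finished_slow = await wait_or_cancel(slow, 0.05)
    return (finished_fast, finished_slow, slow.cancelled)


outcome = asyncio.run(demo())
print(outcome)  # (True, False, True)
```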
## Properties
### `status`
```python
@property
def status(self) -> OperationStatus:
```
Returns the [`OperationStatus`](operation-status.md), which provides the current state of the operation.
## Related Classes
- [Connector](connector.md)
- [Descriptor](descriptor.md)
- [Device](device.md)
- [OperationStatus](operation-status.md)
- [RdmaMetadata](rdma-metadata.md)
- [ReadableOperation](readable-operation.md)
- [WritableOperation](writable-operation.md)
- [WriteOperation](write-operation.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.ReadableOperation"
---
An operation which enables a remote worker to read data from the local worker.
To create the operation, a set of local [`Descriptor`](descriptor.md) objects must be provided that reference memory intended to be transferred to a remote worker.
Once created, the memory referenced by the provided descriptors becomes immediately readable by a remote worker with the necessary metadata.
The NIXL metadata ([RdmaMetadata](rdma-metadata.md)) required to access the memory referenced by the provided descriptors is accessible via the operation's `.metadata()` method.
Once acquired, the metadata needs to be provided to a remote worker via a secondary channel, most likely HTTP or TCP+NATS.
Disposal of the object will instruct the NIXL subsystem to cancel the operation,
therefore the operation should be awaited until completed unless cancellation is intended.
## Example Usage
```python
async def send_data(
    self,
    local_tensor: torch.Tensor
) -> None:
    descriptor = dynamo.nixl_connect.Descriptor(local_tensor)

    with await self.connector.create_readable(descriptor) as read_op:
        op_metadata = read_op.metadata()

        # Send the metadata to the remote worker via sideband communication.
        await self.notify_remote_data(op_metadata)
        # Wait for the remote worker to complete its read operation of local_tensor.
        # AKA send data to remote worker.
        await read_op.wait_for_completion()
```
## Methods
### `metadata`
```python
def metadata(self) -> RdmaMetadata:
```
Generates and returns the NIXL metadata ([RdmaMetadata](rdma-metadata.md)) required for a remote worker to read from the operation.
Once acquired, the metadata needs to be provided to a remote worker via a secondary channel, most likely HTTP or TCP+NATS.
### `wait_for_completion`
```python
async def wait_for_completion(self) -> None:
```
Blocks the caller until the operation has received a completion signal from a remote worker.
## Properties
### `status`
```python
@property
def status(self) -> OperationStatus:
```
Returns the [`OperationStatus`](operation-status.md), which provides the current state of the operation.
## Related Classes
- [Connector](connector.md)
- [Descriptor](descriptor.md)
- [Device](device.md)
- [OperationStatus](operation-status.md)
- [RdmaMetadata](rdma-metadata.md)
- [ReadOperation](read-operation.md)
- [WritableOperation](writable-operation.md)
- [WriteOperation](write-operation.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.WritableOperation"
---
An operation which enables a remote worker to write data to the local worker.
To create the operation, a set of local [`Descriptor`](descriptor.md) objects must be provided which reference memory intended to receive data from a remote worker.
Once created, the memory referenced by the provided descriptors becomes immediately writable by a remote worker with the necessary metadata.
The NIXL metadata ([RdmaMetadata](rdma-metadata.md)) required to access the memory referenced by the provided descriptors is accessible via the operation's `.metadata()` method.
Once acquired, the metadata needs to be provided to a remote worker via a secondary channel, most likely HTTP or TCP+NATS.
Disposal of the object will instruct the NIXL subsystem to cancel the operation,
therefore the operation should be awaited until completed unless cancellation is intended.
Cancellation is handled asynchronously.
## Example Usage
```python
async def recv_data(
    self,
    local_tensor: torch.Tensor
) -> None:
    descriptor = dynamo.nixl_connect.Descriptor(local_tensor)
    with await self.connector.create_writable(descriptor) as write_op:
        op_metadata = write_op.metadata()
        # Send the metadata to the remote worker via sideband communication.
        await self.request_remote_data(op_metadata)
        # Wait for the remote worker to complete its write operation to local_tensor.
        # AKA receive data from remote worker.
        await write_op.wait_for_completion()
```
## Methods
### `metadata`
```python
def metadata(self) -> RdmaMetadata:
```
Generates and returns the NIXL metadata ([RdmaMetadata](rdma-metadata.md)) required for a remote worker to write to the operation.
Once acquired, the metadata needs to be provided to a remote worker via a secondary channel, most likely HTTP or TCP+NATS.
### `wait_for_completion`
```python
async def wait_for_completion(self) -> None:
```
Blocks the caller until the operation has received a completion signal from a remote worker.
## Properties
### `status`
```python
@property
def status(self) -> OperationStatus:
```
Returns the [`OperationStatus`](operation-status.md) that describes the current state of the operation.
## Related Classes
- [Connector](connector.md)
- [Descriptor](descriptor.md)
- [Device](device.md)
- [OperationStatus](operation-status.md)
- [RdmaMetadata](rdma-metadata.md)
- [ReadOperation](read-operation.md)
- [ReadableOperation](readable-operation.md)
- [WriteOperation](write-operation.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.WriteOperation"
---
An operation which transfers data from the local worker to a remote worker.
To create the operation, NIXL metadata ([RdmaMetadata](rdma-metadata.md)) from a remote worker's [`WritableOperation`](writable-operation.md)
along with a matching set of local [`Descriptor`](descriptor.md) objects which reference memory to be transferred to the remote worker must be provided.
The NIXL metadata must be transferred from the remote to the local worker via a secondary channel, most likely HTTP or TCP+NATS.
Once created, data transfer will begin immediately.
Disposal of the object will instruct the NIXL subsystem to cancel the operation,
therefore the operation should be awaited until completed unless cancellation is intended.
Cancellation is handled asynchronously.
## Example Usage
```python
async def write_to_remote(
    self,
    remote_metadata: dynamo.nixl_connect.RdmaMetadata,
    local_tensor: torch.Tensor
) -> None:
    descriptor = dynamo.nixl_connect.Descriptor(local_tensor)
    with await self.connector.begin_write(descriptor, remote_metadata) as write_op:
        # Wait for the operation to complete writing local_tensor to the remote worker.
        await write_op.wait_for_completion()
```
## Methods
### `cancel`
```python
def cancel(self) -> None:
```
Instructs the NIXL subsystem to cancel the operation.
Completed operations cannot be cancelled.
### `wait_for_completion`
```python
async def wait_for_completion(self) -> None:
```
Blocks the caller until all provided buffers have been transferred to the remote worker.
## Properties
### `status`
```python
@property
def status(self) -> OperationStatus:
```
Returns the [`OperationStatus`](operation-status.md) that describes the current state of the operation.
## Related Classes
- [Connector](connector.md)
- [Descriptor](descriptor.md)
- [Device](device.md)
- [OperationStatus](operation-status.md)
- [RdmaMetadata](rdma-metadata.md)
- [ReadOperation](read-operation.md)
- [ReadableOperation](readable-operation.md)
- [WritableOperation](writable-operation.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Running SGLang with Dynamo"
---
## Use the Latest Release
We recommend using the latest stable release of Dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
---
## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Dynamo SGLang Integration](#dynamo-sglang-integration)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Aggregated Serving](#aggregated-serving)
- [Disaggregated Serving](#disaggregated-serving)
- [Deploy on SLURM or Kubernetes](#deployment)
## Feature Support Matrix
### Core Dynamo Features
| Feature | SGLang | Notes |
|---------|--------|-------|
| [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../router/kv-cache-routing.md) | ✅ | |
| [**SLA-Based Planner**](../../planner/sla-planner.md) | ✅ | |
| [**Multimodal Support**](../../multimodal/sglang.md) | ✅ | |
| [**KVBM**](../../kvbm/kvbm-architecture.md) | ❌ | Planned |
## Dynamo SGLang Integration
Dynamo SGLang integrates SGLang engines into Dynamo's distributed runtime, enabling advanced features like disaggregated serving, KV-aware routing, and request migration while maintaining full compatibility with SGLang's engine arguments.
### Argument Handling
Dynamo SGLang uses SGLang's native argument parser, so **most SGLang engine arguments work identically**. You can pass any SGLang argument (like `--model-path`, `--tp`, `--trust-remote-code`) directly to `dynamo.sglang`.
#### Dynamo-Specific Arguments
| Argument | Description | Default | SGLang Equivalent |
|----------|-------------|---------|-------------------|
| `--endpoint` | Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
| `--migration-limit` | Max times a request can migrate between workers for fault tolerance. See [Request Migration Architecture](../../fault-tolerance/request-migration.md). | `0` (disabled) | N/A |
| `--dyn-tool-call-parser` | Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) | `None` | `--tool-call-parser` |
| `--dyn-reasoning-parser` | Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) | `None` | `--reasoning-parser` |
| `--use-sglang-tokenizer` | Use SGLang's tokenizer instead of Dynamo's | `False` | N/A |
| `--custom-jinja-template` | Use custom chat template for that model (takes precedence over default chat template in model repo) | `None` | `--chat-template` |
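
To make the `--endpoint` format concrete, here is a minimal sketch that splits a `dyn://namespace.component.endpoint` string into its three parts. The `parse_endpoint` helper, the `DynEndpoint` type, and the example value are hypothetical and written only for illustration; Dynamo's own parsing and validation may differ.

```python
from typing import NamedTuple


class DynEndpoint(NamedTuple):
    """Hypothetical container for the three parts of a Dynamo endpoint."""
    namespace: str
    component: str
    endpoint: str


def parse_endpoint(uri: str) -> DynEndpoint:
    """Split 'dyn://namespace.component.endpoint' into its parts (illustrative only)."""
    scheme = "dyn://"
    if not uri.startswith(scheme):
        raise ValueError(f"expected a URI starting with {scheme!r}, got {uri!r}")
    parts = uri[len(scheme):].split(".")
    # Exactly three non-empty, dot-separated fields are expected.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {uri!r}")
    return DynEndpoint(*parts)


# The endpoint value below is a made-up example, not a Dynamo default.
print(parse_endpoint("dyn://my_namespace.backend.generate"))
```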
#### Tokenizer Behavior
- **Default (`--use-sglang-tokenizer` not set)**: Dynamo handles tokenization/detokenization in its frontend and passes `input_ids` to SGLang
- **With `--use-sglang-tokenizer`**: SGLang handles tokenization/detokenization, Dynamo passes raw prompts
<Note>
When using `--use-sglang-tokenizer`, only `v1/chat/completions` is available through Dynamo's frontend.
</Note>
### Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
#### Cancellation Support Matrix
| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ⚠️ | ✅ |
<Warning>
⚠️ SGLang backend currently does not support cancellation during remote prefill phase in disaggregated mode.
</Warning>
For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request-cancellation.md) documentation.
## Installation
### Install latest release
We suggest using uv to install the latest release of ai-dynamo[sglang]. You can install it with `curl -LsSf https://astral.sh/uv/install.sh | sh`
<details>
<summary>Expand for instructions</summary>
```bash
# create a virtual env
uv venv --python 3.12 --seed
# install the latest release (which comes bundled with a stable sglang version)
uv pip install "ai-dynamo[sglang]"
```
</details>
### Install editable version for development
<details>
<summary>Expand for instructions</summary>
This requires Rust to be installed. We also recommend a proper installation of the CUDA toolkit, as SGLang requires `nvcc` to be available.
```bash
# create a virtual env
uv venv --python 3.12 --seed
# build dynamo runtime bindings
uv pip install maturin
cd $DYNAMO_HOME/lib/bindings/python
maturin develop --uv
cd $DYNAMO_HOME
# installs sglang supported version along with dynamo
# include the prerelease flag to install flashinfer rc versions
uv pip install -e .
# install any sglang version >= 0.5.3.post2
uv pip install "sglang[all]==0.5.3.post2"
```
</details>
### Using docker containers
<details>
<summary>Expand for instructions</summary>
We are in the process of shipping pre-built docker containers that contain installations of DeepEP, DeepGEMM, and NVSHMEM in order to support WideEP and P/D. For now, you can quickly build the container from source with the following command.
```bash
cd $DYNAMO_ROOT
./container/build.sh \
  --framework SGLANG \
  --tag dynamo-sglang:latest
```
And then run it using
```bash
docker run \
--gpus all \
-it \
--rm \
--network host \
--shm-size=10G \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
dynamo-sglang:latest
```
</details>
## Quick Start
Below we provide a guide that lets you run all of our common deployment patterns on a single node.
### Start NATS and ETCD in the background
Start using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml)
```bash
docker compose -f deploy/docker-compose.yml up -d
```
<Tip>
Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
Additionally - because we use sglang's argument parser, you can pass in any argument that sglang supports to the worker!
</Tip>
### Aggregated Serving
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg.sh
```
### Aggregated Serving with KV Routing
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg_router.sh
```
### Aggregated Serving for Embedding Models
Here's an example that uses the [Qwen/Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) model.
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg_embed.sh
```
<details>
<summary>Send the following request to verify your deployment:</summary>
```bash
curl localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Embedding-4B",
"input": "Hello, world!"
}'
```
</details>
### Disaggregated Serving
See [SGLang Disaggregation](sglang-disaggregation.md) to learn more about how sglang and dynamo handle disaggregated serving.
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg.sh
```
### Disaggregated Serving with KV Aware Prefill Routing
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg_router.sh
```
### Disaggregated Serving with Mixture-of-Experts (MoE) models and DP attention
You can use this configuration to test out disaggregated serving with dp attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.
```bash
# note this will require 4 GPUs
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg_dp_attn.sh
```
### Testing the Deployment
Send a test request to verify your deployment:
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
}
],
"stream": true,
"max_tokens": 30
}'
```
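
The same test request can be built from Python. The sketch below uses only the standard library and assumes the frontend is listening on `localhost:8000` as in the `curl` example; the `build_chat_request` and `send` helper names are hypothetical, and `stream` is set to `False` here so the response arrives as a single JSON body in the usual OpenAI-compatible shape.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "Qwen/Qwen3-0.6B",
                       base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # non-streaming, unlike the curl example
        "max_tokens": 30,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the first choice's message content."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a deployment running, `print(send(build_chat_request("Hello")))` should print the model's reply.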
## Deployment
We currently provide deployment examples for Kubernetes and SLURM.
### Kubernetes
- **[Deploying Dynamo with SGLang on Kubernetes](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)**
### SLURM
- **[Deploying Dynamo with SGLang on SLURM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/slurm_jobs/README.md)**
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Expert Parallelism Load Balancer (EPLB) in SGLang"
---
Mixture-of-Experts (MoE) models utilize a technique called Expert Parallelism (EP), where experts are distributed across multiple GPUs. While this allows for much larger and more powerful models, it can lead to an uneven workload distribution. Because the load on different experts may vary depending on the workload, some GPUs can become bottlenecks, forcing the entire system to wait. This imbalance leads to wasted compute cycles and increased memory usage.
To address this, SGLang implements an Expert Parallelism Load Balancer (EPLB) inspired by the work in the DeepSeek-V3 paper. EPLB analyzes expert usage patterns and dynamically re-arranges the experts across the available GPUs to ensure a more balanced workload.
## The EPLB Algorithm: Core Concepts
The load balancing algorithm revolves around a few key ideas to achieve an optimal distribution of work.
### Redundant Experts for Flexibility
The core strategy is to create **redundant experts**. Instead of being limited to the model's original number of experts, EPLB can create duplicates of heavily-loaded experts. For example, if a model has 256 experts, you can configure EPLB to create an additional 32 "redundant" experts, bringing the total to 288. This pool of replicated experts is then strategically packed onto the available GPUs. A popular expert might be duplicated multiple times, while a moderately used expert might be grouped with several rarely used ones on a single GPU.
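
To illustrate the idea of redundant experts, the sketch below hands out extra replica slots greedily: each slot goes to the expert whose per-replica load is currently highest. This is a simplified heuristic written for this guide, not SGLang's or DeepSeek's actual EPLB code; the function name and the load numbers are made up.

```python
import heapq


def replicate_experts(loads: list[float], num_redundant: int) -> list[int]:
    """Return the replica count per expert after adding `num_redundant` copies.

    Greedy heuristic: each extra slot goes to the expert whose per-replica
    load is currently the highest.
    """
    replicas = [1] * len(loads)
    # Max-heap keyed on per-replica load (negated because heapq is a min-heap).
    heap = [(-load, idx) for idx, load in enumerate(loads)]
    heapq.heapify(heap)
    for _ in range(num_redundant):
        _, idx = heapq.heappop(heap)
        replicas[idx] += 1
        heapq.heappush(heap, (-loads[idx] / replicas[idx], idx))
    return replicas


# Made-up loads: expert 0 is far more popular than the others.
loads = [90.0, 30.0, 10.0, 10.0]
print(replicate_experts(loads, num_redundant=2))  # [3, 1, 1, 1]
```

In this toy run both redundant slots go to expert 0, which brings its per-replica load down from 90 to 30, level with the next most popular expert.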
### Group-Limited Routing for Efficiency
Modern MoE models like DeepSeek-V3 use **group-limited expert routing**. In this design, experts are organized into groups, and routing decisions are constrained within these groups. EPLB can take advantage of this structure to reduce inter-node data traffic by attempting to place all experts from the same group onto the same node whenever possible.
### Load Balancing Policies
The algorithm comes with two policies for different scenarios:
1. **Hierarchical Load Balancing**: This policy is used when the number of server nodes evenly divides the number of expert groups. It first harnesses the group-limited routing by packing expert groups onto nodes to balance the load between nodes. Then, within each node, it replicates and packs the experts onto individual GPUs to balance the load locally. This is often used during prefill where the expert-parallel size might be smaller.
2. **Global Load Balancing**: In all other cases, a global policy is used. It replicates experts globally without regard to their group affiliation and packs them onto individual GPUs. This policy is more general and can be adopted during the decoding stage with a larger expert-parallel size.
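
The choice between the two policies, and the packing step they share, can be sketched as follows. This is an illustrative simplification with hypothetical function names: the packing here is a plain "heaviest item into the lightest bin" heuristic and, unlike a real placement, does not force every GPU to hold the same number of experts.

```python
def choose_policy(num_groups: int, num_nodes: int) -> str:
    """Hierarchical when the node count evenly divides the group count, else global."""
    return "hierarchical" if num_groups % num_nodes == 0 else "global"


def pack_greedy(loads: list[float], num_bins: int) -> list[list[int]]:
    """Assign items to bins, heaviest first, always into the currently lightest bin."""
    bins: list[list[int]] = [[] for _ in range(num_bins)]
    totals = [0.0] * num_bins
    for idx in sorted(range(len(loads)), key=lambda i: loads[i], reverse=True):
        target = min(range(num_bins), key=lambda b: totals[b])
        bins[target].append(idx)
        totals[target] += loads[idx]
    return bins


print(choose_policy(num_groups=8, num_nodes=4))  # hierarchical
print(choose_policy(num_groups=8, num_nodes=3))  # global
# Made-up loads for 8 experts packed onto 2 GPUs.
print(pack_greedy([8, 7, 6, 5, 4, 3, 2, 1], num_bins=2))
```

In the hierarchical case the same packing idea is applied twice: first expert groups onto nodes, then replicated experts onto the GPUs within each node.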
## How SGLang Implements EPLB
SGLang provides a robust implementation of EPLB, allowing for dynamic, online rebalancing of expert locations based on real-world traffic.
### Dynamic Rebalancing
You can enable dynamic rebalancing by setting the `--enable-eplb` flag. When enabled, the `EPLBManager` runs in the background. It periodically triggers a rebalance after a set number of iterations, configured with `--eplb-rebalance-num-iterations`. At each rebalance, it computes a new expert placement plan based on the latest usage statistics and updates the model's expert locations on the fly.
### Expert Usage Recording
To make intelligent balancing decisions, SGLang needs to collect data on expert usage. The `ExpertDistributionRecorder` is responsible for this, and its behavior is controlled by the `--expert-distribution-recorder-mode` flag. This flag determines the granularity of the collected data. When `--enable-eplb` is set, this mode defaults to `stat` to gather statistics for rebalancing. The available modes are:
- **`per_token`**: This is the most detailed mode. It records the specific expert choices for every single token processed by the model. While it provides the richest data, it also has the highest performance overhead. The raw, unaggregated data for each forward pass is stored.
- **`per_pass`**: In this mode, SGLang records the aggregated expert usage counts for each individual forward pass. The data is not aggregated across different passes, giving you a snapshot of expert popularity for each batch of requests.
- **`stat`**: This mode also records the exact expert usage counts for each forward pass, but it then aggregates these counts across multiple passes (the number of passes is determined by `--expert-distribution-recorder-buffer-size`). This provides a moving average of expert usage statistics and is the default when EPLB is enabled.
- **`stat_approx`**: This mode is similar to `stat` but gathers _approximate_ statistics, usually from the DeepEP dispatcher. This method has lower overhead than `stat` but is less precise, especially for small batch sizes. It is a good choice when performance is critical.
The collected statistics are then fed into the rebalancing algorithm to generate a new expert placement plan.
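
The difference between `per_pass` and `stat` can be shown with a toy recorder. The class below is a hypothetical stand-in, not SGLang's `ExpertDistributionRecorder`: it counts expert selections per forward pass and sums those counts over a fixed-size window, which plays the role of the buffer size. It sums rather than averages, purely to keep the example short.

```python
from collections import deque


class StatRecorder:
    """Toy stand-in for `stat` mode: sum per-pass counts over a sliding window."""

    def __init__(self, num_experts: int, buffer_size: int) -> None:
        self.num_experts = num_experts
        self.window: deque[list[int]] = deque(maxlen=buffer_size)

    def record_pass(self, expert_ids: list[int]) -> list[int]:
        """Count expert selections for one forward pass (the `per_pass` view)."""
        counts = [0] * self.num_experts
        for eid in expert_ids:
            counts[eid] += 1
        self.window.append(counts)  # the oldest pass drops out once the window is full
        return counts

    def aggregate(self) -> list[int]:
        """Sum counts across the buffered passes (the `stat` view)."""
        totals = [0] * self.num_experts
        for counts in self.window:
            for eid, count in enumerate(counts):
                totals[eid] += count
        return totals


rec = StatRecorder(num_experts=4, buffer_size=2)
print(rec.record_pass([0, 0, 1]))  # [2, 1, 0, 0]
print(rec.record_pass([1, 2]))     # [0, 1, 1, 0]
print(rec.record_pass([3]))        # [0, 0, 0, 1]
print(rec.aggregate())             # [0, 1, 1, 1]  (only the last two passes remain)
```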
### Initializing with a Pre-computed Distribution
While SGLang can start with a simple default layout and learn a better one over time, you can also provide it with a pre-computed expert distribution to start with. The `--init-expert-location` flag allows you to specify a file path (`.pt` or `.json`) or a JSON string containing an expert layout. This is useful if you have already analyzed a representative workload offline and want the server to start immediately with a balanced configuration. If this flag is not set, it defaults to a `trivial` sequential layout.
### References and further reading
- [SGLang Large Scale P/D + WideEP Deployment](https://lmsys.org/blog/2025-05-05-large-scale-ep/#expert-parallelism-load-balancer)
- [DeepSeek's EPLB repository](https://github.com/deepseek-ai/EPLB)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Running gpt-oss-120b Disaggregated with SGLang"
---
The gpt-oss-120b guide for SGLang is largely identical to the [guide for vLLM](../vllm/gpt-oss.md).
Please use the vLLM guide as a reference, with the different deployment steps highlighted below:
# Launch the Deployment
Note that GPT-OSS is a reasoning model with tool calling support. To
ensure the response is being processed correctly, the worker should be
launched with proper `--dyn-reasoning-parser` and `--dyn-tool-call-parser`.
**Start frontend**
```bash
python3 -m dynamo.frontend --http-port 8000 &
```
**Run decode worker**
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m dynamo.sglang \
--model-path openai/gpt-oss-120b \
--served-model-name openai/gpt-oss-120b \
--tp 4 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--dyn-reasoning-parser gpt_oss \
--dyn-tool-call-parser harmony
```
**Run prefill workers**
```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m dynamo.sglang \
--model-path openai/gpt-oss-120b \
--served-model-name openai/gpt-oss-120b \
--tp 4 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--dyn-reasoning-parser gpt_oss \
--dyn-tool-call-parser harmony
```
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Profiling SGLang Workers in Dynamo"
---
Dynamo exposes profiling endpoints for SGLang workers via the system server's `/engine/*` routes. This allows you to start and stop PyTorch profiling on running inference workers without restarting them.
These endpoints wrap SGLang's internal `TokenizerManager.start_profile()` and `stop_profile()` methods. See SGLang's documentation for the full list of supported parameters.
## Quick Start
1. **Start profiling:**
```bash
curl -X POST http://localhost:9090/engine/start_profile \
-H "Content-Type: application/json" \
-d '{"output_dir": "/tmp/profiler_output"}'
```
2. **Run some inference requests to generate profiling data**
3. **Stop profiling:**
```bash
curl -X POST http://localhost:9090/engine/stop_profile
```
4. **View the traces:**
The profiler outputs Chrome trace files in the specified `output_dir`. You can view them using:
- Chrome's `chrome://tracing`
- [Perfetto UI](https://ui.perfetto.dev/)
- TensorBoard with the PyTorch Profiler plugin
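
If you prefer to drive the endpoints from Python rather than `curl`, a minimal standard-library sketch might look like the following. It assumes the system server is reachable at `localhost:9090` as in the commands above; the helper names are hypothetical and this is not the provided test script.

```python
import json
import urllib.request
from typing import Callable

SYSTEM_URL = "http://localhost:9090"  # system server address used in the curl examples


def start_profile_request(output_dir: str) -> urllib.request.Request:
    """Build the POST request that starts profiling."""
    body = json.dumps({"output_dir": output_dir}).encode("utf-8")
    return urllib.request.Request(
        f"{SYSTEM_URL}/engine/start_profile",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stop_profile_request() -> urllib.request.Request:
    """Build the POST request that stops profiling."""
    return urllib.request.Request(f"{SYSTEM_URL}/engine/stop_profile", method="POST")


def profile(output_dir: str, workload: Callable[[], None]) -> None:
    """Start profiling, run `workload()`, then always stop profiling."""
    urllib.request.urlopen(start_profile_request(output_dir)).close()
    try:
        workload()
    finally:
        urllib.request.urlopen(stop_profile_request()).close()
```

Wrapping the stop call in `finally` ensures profiling is stopped even if the workload raises.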
## Test Script
A test script is provided at [`examples/backends/sglang/test_sglang_profile.py`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/test_sglang_profile.py) that demonstrates the full profiling workflow:
```bash
python examples/backends/sglang/test_sglang_profile.py
```
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "SGLang Prometheus Metrics"
---
## Overview
When running SGLang through Dynamo, SGLang engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint, served on the system port set by `DYN_SYSTEM_PORT` (port 8081 in the examples below). This allows you to access both SGLang engine metrics (prefixed with `sglang:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
**For the complete and authoritative list of all SGLang metrics**, always refer to the [official SGLang Production Metrics documentation](https://docs.sglang.ai/references/production_metrics.html).
**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
**For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).
## Environment Variables
| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` |
## Getting Started Quickly
This is a single machine example.
### Start Observability Stack
For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.
### Launch Dynamo Components
Launch a frontend and SGLang backend to test metrics:
```bash
# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
$ python -m dynamo.frontend
# Enable system metrics server on port 8081
$ DYN_SYSTEM_PORT=8081 python -m dynamo.sglang --model <model_name> --enable-metrics
```
Wait for the SGLang worker to start, then send requests and check metrics:
```bash
# Send a request
curl -H 'Content-Type: application/json' \
-d '{
"model": "<model_name>",
"max_completion_tokens": 100,
"messages": [{"role": "user", "content": "Hello"}]
}' \
http://localhost:8000/v1/chat/completions
# Check metrics from the worker
curl -s localhost:8081/metrics | grep "^sglang:"
```
## Exposed Metrics
SGLang exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All SGLang engine metrics use the `sglang:` prefix and include labels (e.g., `model_name`, `engine_type`, `tp_rank`, `pp_rank`) to identify the source.
**Example Prometheus Exposition Format text:**
```
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8128902.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7557572.0
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
```
**Note:** The specific metrics shown above are examples and may vary depending on your SGLang version. Always inspect your actual `/metrics` endpoint or refer to the [official documentation](https://docs.sglang.ai/references/production_metrics.html) for the current list.
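
As a rough illustration of consuming this format, the sketch below extracts the `sglang:` samples from exposition text. It is a deliberately simple parser written for this guide (it assumes no trailing timestamps), and the `dynamo_example_metric` line is a made-up placeholder; for production use, a full parser such as the one in `prometheus_client` is a better fit.

```python
def parse_sglang_metrics(text: str) -> dict[str, float]:
    """Map each 'sglang:' sample line (name plus labels) to its float value."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if not line.startswith("sglang:"):
            continue  # keep only SGLang engine metrics
        # The value is the last space-separated field; this simple split
        # assumes the line carries no trailing timestamp.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


SAMPLE = """\
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
dynamo_example_metric 1.0
"""

for series, value in parse_sglang_metrics(SAMPLE).items():
    print(series, value)
```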
### Metric Categories
SGLang provides metrics in the following categories (all prefixed with `sglang:`):
- **Throughput metrics** - Token processing rates
- **Resource usage** - System resource consumption
- **Latency metrics** - Request and token latency measurements
- **Disaggregation metrics** - Metrics specific to disaggregated deployments (when enabled)
**Note:** Specific metrics are subject to change between SGLang versions. Always refer to the [official documentation](https://docs.sglang.ai/references/production_metrics.html) or inspect the `/metrics` endpoint for your SGLang version.
## Available Metrics
The official SGLang documentation includes complete metric definitions with:
- HELP and TYPE descriptions
- Counter, Gauge, and Histogram metric types
- Metric labels (e.g., `model_name`, `engine_type`, `tp_rank`, `pp_rank`)
- Setup guide for Prometheus + Grafana monitoring
- Troubleshooting tips and configuration examples
For the complete and authoritative list of all SGLang metrics, see the [official SGLang Production Metrics documentation](https://docs.sglang.ai/references/production_metrics.html).
## Implementation Details
- SGLang uses multiprocess metrics collection via `prometheus_client.multiprocess.MultiProcessCollector`
- Metrics are filtered by the `sglang:` prefix before being exposed
- The integration uses Dynamo's `register_engine_metrics_callback()` function
- Metrics appear after SGLang engine initialization completes
## Related Documentation
### SGLang Metrics
- [Official SGLang Production Metrics](https://docs.sglang.ai/references/production_metrics.html)
- [SGLang GitHub - Metrics Collector](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/metrics/collector.py)
### Dynamo Metrics
- [Dynamo Metrics Guide](../../observability/metrics.md) - Complete documentation on Dynamo runtime metrics
- [Prometheus and Grafana Setup](../../observability/prometheus-grafana.md) - Visualization setup instructions
- Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside SGLang metrics
- Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
- Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
- Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Enable SGLang Hierarchical Cache (HiCache)"
---
This guide shows how to enable SGLang's Hierarchical Cache (HiCache) inside Dynamo.
## 1) Start the SGLang worker with HiCache enabled
```bash
python -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--host 0.0.0.0 --port 8000 \
--page-size 64 \
--enable-hierarchical-cache \
--hicache-ratio 2 \
--hicache-write-policy write_through \
--hicache-storage-backend nixl \
--log-level debug \
--skip-tokenizer-init
```
- **--enable-hierarchical-cache**: Enables hierarchical KV cache/offload
- **--hicache-ratio**: The ratio of the size of host KV cache memory pool to the size of device pool. Lower this number if your machine has less CPU memory.
- **--hicache-write-policy**: Write policy (e.g., `write_through` for synchronous host writes)
- **--hicache-storage-backend**: Host storage backend for HiCache (e.g., `nixl`). NIXL selects the concrete store automatically; see [PR #8488](https://github.com/sgl-project/sglang/pull/8488)
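
Since `--hicache-ratio` is defined as host pool size divided by device pool size, the host memory a given ratio implies is a simple product. The numbers below are hypothetical and only illustrate the arithmetic:

```python
def host_pool_size_gib(device_pool_gib: float, hicache_ratio: float) -> float:
    """Host KV pool size implied by the ratio (host pool / device pool)."""
    return device_pool_gib * hicache_ratio


# Hypothetical numbers: a 40 GiB device KV pool with --hicache-ratio 2.
print(host_pool_size_gib(40.0, 2))  # 80.0
```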
Then, start the frontend:
```bash
python -m dynamo.frontend --http-port 8000
```
## 2) Send a single request
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
}
],
"stream": false,
"max_tokens": 30
}'
```
## 3) (Optional) Benchmarking
Run the perf script:
```bash
bash -x $DYNAMO_ROOT/benchmarks/llm/perf.sh \
--model Qwen/Qwen3-0.6B \
--tensor-parallelism 1 \
--data-parallelism 1 \
--concurrency "2,4,8" \
--input-sequence-length 2048 \
--output-sequence-length 256
```
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "SGLang Disaggregated Serving"
---
This document explains how SGLang's disaggregated prefill-decode architecture works, both standalone and within Dynamo.
## Overview
Disaggregated serving separates the prefill and decode phases of LLM inference into different workers. This architecture allows for:
- Independent scaling of prefill and decode resources
- Better resource utilization (prefill is compute-bound, decode is memory-bound)
- Efficient KV cache transfer between workers using RDMA
## How Dynamo Integrates with SGLang Disaggregation
**SGLang's standalone approach:**
1. The load balancer receives a request from the client
2. A random `(prefill, decode)` pair is selected from the pool of available workers
3. Request is sent to both `prefill` and `decode` workers via asyncio tasks
4. Internally disaggregation is done from prefill → decode
**Dynamo's approach:**
Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead:
1. Route to a decode worker first
2. Choose a prefill worker via round-robin or KV-aware selection
3. Send the request to both workers
4. SGLang's bootstrap server (part of the `tokenizer_manager`) is used in conjunction with NIXL/Mooncake to handle the KV transfer
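
The worker-selection part of Dynamo's approach (steps 1 to 3) can be sketched with a toy router. The `DisaggRouter` class and the worker names are hypothetical; only round-robin prefill selection is shown, with KV-aware selection as the alternative.

```python
import itertools


class DisaggRouter:
    """Toy router: the request lands on a decode worker, prefill is picked round-robin."""

    def __init__(self, prefill_workers: list[str]) -> None:
        self._prefill_cycle = itertools.cycle(prefill_workers)

    def route(self, decode_worker: str) -> tuple[str, str]:
        # Step 1 has already happened: the request reached `decode_worker`.
        # Step 2: choose a prefill worker (round-robin in this sketch).
        prefill_worker = next(self._prefill_cycle)
        # Step 3: the caller sends the request to both workers.
        return decode_worker, prefill_worker


router = DisaggRouter(["prefill-0", "prefill-1"])
print(router.route("decode-0"))  # ('decode-0', 'prefill-0')
print(router.route("decode-1"))  # ('decode-1', 'prefill-1')
print(router.route("decode-0"))  # ('decode-0', 'prefill-0')
```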
## Disaggregation Flow
The following diagram shows the complete request flow for disaggregated serving:
```mermaid
sequenceDiagram
participant Client
participant Decode
participant Prefill
Note over Decode,Prefill: 0. Setup Phase (One-Time)
Decode->>Prefill: Register RDMA connection info (base GPU memory pointers)
Note over Client,Prefill: Per-Request Phase
Client->>Decode: 1. Send request
Decode->>Prefill: 2. Forward request + get bootstrap_room
Prefill-->>Decode: Return bootstrap_room ID
Note over Decode: 3. Allocate GPU memory for KV cache
Decode->>Prefill: Send allocation info (page indices, metadata buffer)
Note over Prefill: 4. Prefill forward pass
par Decode polls
loop Poll transfer
Note over Decode: 5. Poll for KV arrival
end
and Prefill transfers
Note over Prefill: 6. RDMA write KV to decode
Prefill->>Decode: Transfer KV cache + metadata
end
Note over Prefill: 7. Poll RDMA handles
Note over Prefill: Transfer complete, deallocate metadata
Note over Decode: 8. KV received, start decode
loop Generate tokens
Note over Decode: Decode forward pass
Decode-->>Client: Stream output token
end
```
### Key Steps Explained
**Setup Phase (One-Time)**
- Decode workers register their RDMA connection information with prefill workers
- This includes base GPU memory pointers for direct memory access
**Per-Request Flow**
1. **Request initiation**: Client sends request to decode worker
2. **Bootstrap room allocation**: Decode forwards the request to prefill and receives a `bootstrap_room` ID used to coordinate the transfer
3. **Memory allocation**: Decode allocates GPU memory pages for incoming KV cache
4. **Prefill execution**: Prefill worker processes the prompt and generates KV cache
5. **KV transfer**: Prefill uses RDMA to write KV cache directly to decode's GPU memory (while decode polls for completion)
6. **Cleanup**: Prefill deallocates transfer metadata after confirming completion
7. **Decode phase**: Decode worker generates tokens using the transferred KV cache
8. **Streaming**: Tokens are streamed back to the client as they're generated
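The per-request flow can be mimicked in a single process to make the ordering concrete. The sketch below is a toy model under loud assumptions: plain Python dictionaries stand in for GPU memory, a boolean flag stands in for the metadata buffer, and an in-process write stands in for RDMA. The `DecodeWorker`, `PrefillWorker`, and `handle_request` names are hypothetical, and the `bootstrap_room` exchange (step 2) is omitted. Step numbers in the comments refer to the list above.

```python
import asyncio


class DecodeWorker:
    def __init__(self):
        self.kv_pages = {}     # page index -> KV data; stands in for GPU memory
        self.kv_ready = False  # stands in for the metadata buffer signalling arrival

    def allocate(self, num_pages):
        # Step 3: reserve pages for the incoming KV cache.
        self.kv_pages = {i: None for i in range(num_pages)}
        return list(self.kv_pages)

    async def wait_for_kv(self, interval=0.001):
        # Step 5 (decode side): poll until the transfer is marked complete.
        while not self.kv_ready:
            await asyncio.sleep(interval)

    def generate(self, max_tokens):
        # Steps 7-8: emit placeholder tokens once the KV cache is present.
        for i in range(max_tokens):
            yield f"token-{i}"


class PrefillWorker:
    async def prefill_and_transfer(self, prompt, decode, page_indices):
        # Step 4: "compute" one KV entry per allocated page (placeholder values).
        kv = {idx: f"kv({prompt})[{idx}]" for idx in page_indices}
        await asyncio.sleep(0.005)  # pretend the forward pass takes time
        # Step 5 (prefill side): write directly into decode's pages.
        decode.kv_pages.update(kv)
        decode.kv_ready = True
        # Step 6: a real worker would release its transfer metadata here.


async def handle_request(prompt, num_pages=2, max_tokens=3):
    decode, prefill = DecodeWorker(), PrefillWorker()
    pages = decode.allocate(num_pages)
    # Prefill transfers while decode polls, mirroring the parallel block in the diagram.
    await asyncio.gather(
        prefill.prefill_and_transfer(prompt, decode, pages),
        decode.wait_for_kv(),
    )
    return decode.kv_pages, list(decode.generate(max_tokens))


kv_pages, tokens = asyncio.run(handle_request("hello"))
print(kv_pages)
print(tokens)
```

The point of the sketch is the ordering: decode allocates before prefill writes, and decode starts generating only after the polled flag flips.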
### Performance Characteristics
- **RDMA transfer**: Zero-copy GPU-to-GPU transfer with minimal CPU involvement
- **Parallel operations**: Decode can poll while prefill transfers data
- **One-time setup**: RDMA connections established once, reused for all requests