Unverified Commit 7ca6a562 authored by Jonathan Tong's avatar Jonathan Tong Committed by GitHub
Browse files

docs: update Fern docs for main branch (#5706)


Signed-off-by: default avatarJont828 <jt572@cornell.edu>
parent 704c1dad
{
"organization": "ai-dynamo",
"version": "3.29.1"
"organization": "ai-dynamo",
"version": "3.52.0"
}
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Tool Calling with Dynamo"
---
# Tool Calling with Dynamo
You can connect Dynamo to external tools and services using function calling (also known as tool calling). By providing a list of available functions, Dynamo can choose
to output function arguments for the relevant function(s) which you can execute to augment the prompt with relevant external information.
......@@ -21,14 +22,12 @@ To enable this feature, you should set the following flag while launching the ba
python -m dynamo.<backend> --help"
```
<Note>
If no tool call parser is provided by the user, Dynamo will try to use default tool call parsing based on `<TOOLCALL>` and `<|python_tag|>` tool tags.
</Note>
> [!NOTE]
> If no tool call parser is provided by the user, Dynamo will try to use default tool call parsing based on `<TOOLCALL>` and `<|python_tag|>` tool tags.
<Tip>
If your model's default chat template doesn't support tool calling, but the model itself does, you can specify a custom chat template per worker
with `python -m dynamo.<backend> --custom-jinja-template </path/to/template.jinja>`.
</Tip>
> [!TIP]
> If your model's default chat template doesn't support tool calling, but the model itself does, you can specify a custom chat template per worker
> with `python -m dynamo.<backend> --custom-jinja-template </path/to/template.jinja>`.
Parser to Model Mapping
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Dynamo NIXL Connect"
---
# Dynamo NIXL Connect
Dynamo NIXL Connect specializes in moving data between models/workers in a Dynamo Graph, and for the use cases where registration and memory regions need to be dynamic.
Dynamo connect provides utilities for such use cases, using the NIXL-based I/O subsystem via a set of Python classes.
The relaxed registration comes with some performance overheads, but simplifies the integration process.
Especially for larger data transfer operations, such as between models in a multi-model graph, the overhead would be marginal.
The `dynamo.nixl_connect` library can be imported by any Dynamo container hosted application.
<Note>
Dynamo NIXL Connect will pick the best available method of data transfer available to it.
The available methods depend on the hardware and software configuration of the machines and network running the graph.
GPU Direct RDMA operations require that both ends of the operation have:
- NIC and GPU capable of performing RDMA operations
- Device drivers that support GPU-NIC direct interactions (aka "zero copy") and RDMA operations
- Network that supports InfiniBand or RoCE
With any of the above not satisfied, GPU Direct RDMA will not be available to the graph's workers, and less-optimal methods will be utilized to ensure basic functionality.
For additional information, please read this [GPUDirect RDMA](https://docs.nvidia.com/cuda/pdf/GPUDirect_RDMA.pdf) document.
</Note>
> [!NOTE]
> Dynamo NIXL Connect will pick the best available method of data transfer available to it.
> The available methods depend on the hardware and software configuration of the machines and network running the graph.
> GPU Direct RDMA operations require that both ends of the operation have:
> - NIC and GPU capable of performing RDMA operations
> - Device drivers that support GPU-NIC direct interactions (aka "zero copy") and RDMA operations
> - Network that supports InfiniBand or RoCE
> With any of the above not satisfied, GPU Direct RDMA will not be available to the graph's workers, and less-optimal methods will be utilized to ensure basic functionality.
> For additional information, please read this [GPUDirect RDMA](https://docs.nvidia.com/cuda/pdf/GPUDirect_RDMA.pdf) document.
```python
import dynamo.nixl_connect
......@@ -85,9 +85,8 @@ flowchart LR
e2@{ animate: true; }
```
<Note>
When RDMA isn't available, the NIXL data transfer will still complete using non-accelerated methods.
</Note>
> [!NOTE]
> When RDMA isn't available, the NIXL data transfer will still complete using non-accelerated methods.
### Multimodal Example
......@@ -135,10 +134,9 @@ flowchart LR
o2@{ animate: true; }
```
<Note>
In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo NIXL Connect library.
The KV Cache transfer between Decode Worker and Prefill Worker utilizes a different connector that also uses the NIXL-based I/O subsystem underneath.
</Note>
> [!NOTE]
> In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo NIXL Connect library.
> The KV Cache transfer between Decode Worker and Prefill Worker utilizes a different connector that also uses the NIXL-based I/O subsystem underneath.
#### Code Examples
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.Connector"
---
# dynamo.nixl_connect.Connector
Core class for managing the connection between workers in a distributed environment.
Use this class to create readable and writable operations, or read and write data to remote workers.
......@@ -23,10 +24,9 @@ provides NIXL metadata ([RdmaMetadata](rdma-metadata.md)) via its `.metadata()`
The NIXL metadata must be provided to the remote worker expected to complete the operation.
The metadata contains required information (identifiers, keys, etc.) which enables the remote worker to interact with the provided memory.
<Warning>
NIXL metadata contains a worker's address as well as security keys to access specific registered memory descriptors.
This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly.
</Warning>
> [!WARNING]
> NIXL metadata contains a worker's address as well as security keys to access specific registered memory descriptors.
> This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly.
## Example Usage
......@@ -37,11 +37,10 @@ This data provides direct memory access between workers, and should be considere
self.connector = dynamo.nixl_connect.Connector()
```
<Tip>
See [`ReadOperation`](read-operation.md#example-usage), [`ReadableOperation`](readable-operation.md#example-usage),
[`WritableOperation`](writable-operation.md#example-usage), and [`WriteOperation`](write-operation.md#example-usage)
for additional examples.
</Tip>
> [!TIP]
> See [`ReadOperation`](read-operation.md#example-usage), [`ReadableOperation`](readable-operation.md#example-usage),
> [`WritableOperation`](writable-operation.md#example-usage), and [`WriteOperation`](write-operation.md#example-usage)
> for additional examples.
## Methods
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.Descriptor"
---
# dynamo.nixl_connect.Descriptor
Memory descriptor that ensures memory is registered with the NIXL-base I/O subsystem.
Memory must be registered with the NIXL subsystem to enable interaction with the memory.
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.DeviceKind(IntEnum)"
---
# dynamo.nixl_connect.DeviceKind(IntEnum)
Represents the kind of device a [`Device`](device.md) object represents.
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.Device"
---
# dynamo.nixl_connect.Device
`Device` class describes the device a given allocation resides in.
Usually host (`"cpu"`) or GPU (`"cuda"`) memory.
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.OperationStatus(IntEnum)"
---
# dynamo.nixl_connect.OperationStatus(IntEnum)
Represents the current state or status of an operation.
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.RdmaMetadata"
---
# dynamo.nixl_connect.RdmaMetadata
A Pydantic type intended to provide JSON serialized NIXL metadata about a [`ReadableOperation`](readable-operation.md) or [`WritableOperation`](writable-operation.md) object.
NIXL metadata contains detailed information about a worker process and how to access memory regions registered with the corresponding agent.
This data is required to perform data transfers using the NIXL-based I/O subsystem.
<Warning>
NIXL metadata contains information to connect corresponding backends across agents, as well as identification keys to access specific registered memory regions.
This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly.
</Warning>
> [!WARNING]
> NIXL metadata contains information to connect corresponding backends across agents, as well as identification keys to access specific registered memory regions.
> This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly.
Use the respective class's `.metadata()` method to generate an `RdmaMetadata` object for an operation.
<Tip>
Classes using `RdmaMetadata` objects must be paired correctly.
[`ReadableOperation`](readable-operation.md) with [`ReadOperation`](read-operation.md), and
[`WritableOperation`](write-operation.md) with [`WriteOperation`](write-operation.md).
Incorrect pairing will result in an error being raised.
</Tip>
> [!TIP]
> Classes using `RdmaMetadata` objects must be paired correctly.
> [`ReadableOperation`](readable-operation.md) with [`ReadOperation`](read-operation.md), and
> [`WritableOperation`](write-operation.md) with [`WriteOperation`](write-operation.md).
> Incorrect pairing will result in an error being raised.
## Related Classes
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.ReadOperation"
---
# dynamo.nixl_connect.ReadOperation
An operation which transfers data from a remote worker to the local worker.
To create the operation, NIXL metadata ([RdmaMetadata](rdma-metadata.md)) from a remote worker's [`ReadableOperation`](readable-operation.md)
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.ReadableOperation"
---
# dynamo.nixl_connect.ReadableOperation
An operation which enables a remote worker to read data from the local worker.
To create the operation, a set of local [`Descriptor`](descriptor.md) objects must be provided that reference memory intended to be transferred to a remote worker.
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.WritableOperation"
---
# dynamo.nixl_connect.WritableOperation
An operation which enables a remote worker to write data to the local worker.
To create the operation, a set of local [`Descriptor`](descriptor.md) objects must be provided which reference memory intended to receive data from a remote worker.
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "dynamo.nixl_connect.WriteOperation"
---
# dynamo.nixl_connect.WriteOperation
An operation which transfers data from the local worker to a remote worker.
To create the operation, NIXL metadata ([RdmaMetadata](rdma-metadata.md)) from a remote worker's [`WritableOperation`](writable-operation.md)
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Running SGLang with Dynamo"
---
# Running SGLang with Dynamo
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
......@@ -65,9 +66,8 @@ Dynamo SGLang uses SGLang's native argument parser, so **most SGLang engine argu
- **Default (`--use-sglang-tokenizer` not set)**: Dynamo handles tokenization/detokenization via our blazing fast frontend and passes `input_ids` to SGLang
- **With `--use-sglang-tokenizer`**: SGLang handles tokenization/detokenization, Dynamo passes raw prompts
<Note>
When using `--use-sglang-tokenizer`, only `v1/chat/completions` is available through Dynamo's frontend.
</Note>
> [!NOTE]
> When using `--use-sglang-tokenizer`, only `v1/chat/completions` is available through Dynamo's frontend.
### Request Cancellation
......@@ -80,9 +80,8 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ⚠️ | ✅ |
<Warning>
⚠️ SGLang backend currently does not support cancellation during remote prefill phase in disaggregated mode.
</Warning>
> [!WARNING]
> ⚠️ SGLang backend currently does not support cancellation during remote prefill phase in disaggregated mode.
For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request-cancellation.md) documentation.
......@@ -164,18 +163,22 @@ docker run \
Below we provide a guide that lets you run all of our common deployment patterns on a single node.
### Start NATS and ETCD in the background
### Start Infrastructure Services (Local Development Only)
Start using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml)
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml):
```bash
docker compose -f deploy/docker-compose.yml up -d
```
<Tip>
Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
Additionally - because we use sglang's argument parser, you can pass in any argument that sglang supports to the worker!
</Tip>
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
> [!TIP]
> Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
> Additionally - because we use sglang's argument parser, you can pass in any argument that sglang supports to the worker!
### Aggregated Serving
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Expert Parallelism Load Balancer (EPLB) in SGLang"
---
# Expert Parallelism Load Balancer (EPLB) in SGLang
Mixture-of-Experts (MoE) models utilize a technique called Expert Parallelism (EP), where experts are distributed across multiple GPUs. While this allows for much larger and more powerful models, it can lead to an uneven workload distribution. Because the load on different experts may vary depending on the workload, some GPUs can become bottlenecks, forcing the entire system to wait. This imbalance leads to wasted compute cycles and increased memory usage.
To address this, SGLang implements an Expert Parallelism Load Balancer (EPLB) inspired by the work in the DeepSeek-V3 paper. EPLB analyzes expert usage patterns and dynamically re-arranges the experts across the available GPUs to ensure a more balanced workload.
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Running gpt-oss-120b Disaggregated with SGLang"
---
# Running gpt-oss-120b Disaggregated with SGLang
The gpt-oss-120b guide for SGLang is largely identical to the [guide for vLLM](../vllm/gpt-oss.md),
please ues the vLLM guide as a reference with the different deployment steps as highlighted below:
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Profiling SGLang Workers in Dynamo"
---
# Profiling SGLang Workers in Dynamo
Dynamo exposes profiling endpoints for SGLang workers via the system server's `/engine/*` routes. This allows you to start and stop PyTorch profiling on running inference workers without restarting them.
These endpoints wrap SGLang's internal `TokenizerManager.start_profile()` and `stop_profile()` methods. See SGLang's documentation for the full list of supported parameters.
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "SGLang Prometheus Metrics"
---
# SGLang Prometheus Metrics
## Overview
When running SGLang through Dynamo, SGLang engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both SGLang engine metrics (prefixed with `sglang:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Enable SGLang Hierarchical Cache (HiCache)"
---
# Enable SGLang Hierarchical Cache (HiCache)
This guide shows how to enable SGLang's Hierarchical Cache (HiCache) inside Dynamo.
## 1) Start the SGLang worker with HiCache enabled
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "SGLang Disaggregated Serving"
---
# SGLang Disaggregated Serving
This document explains how SGLang's disaggregated prefill-decode architecture works, both standalone and within Dynamo.
## Overview
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment