Commit 39d645e5 authored by Jonathan Tong, committed by GitHub

docs: migrate Fern docs from fern/ into docs/ (#6206)


Signed-off-by: Jont828 <jt572@cornell.edu>
parent d381e6ff
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running SGLang with Dynamo
## Use the Latest Release
We recommend using the latest stable release of Dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding tag with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
---
## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Dynamo SGLang Integration](#dynamo-sglang-integration)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Single Node Examples](#run-single-node-examples)
- [Multi-Node and Advanced Examples](#advanced-examples)
- [Deploy on SLURM or Kubernetes](#deployment)
## Feature Support Matrix
### Core Dynamo Features
| Feature | SGLang | Notes |
|---------|--------|-------|
| [**Disaggregated Serving**](../../design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../components/planner/planner_guide.md) | ✅ | |
| [**Multimodal Support**](../../features/multimodal/multimodal_sglang.md) | ✅ | |
| [**KVBM**](../../components/kvbm/README.md) | ❌ | Planned |
## Dynamo SGLang Integration
Dynamo SGLang integrates SGLang engines into Dynamo's distributed runtime, enabling advanced features like disaggregated serving, KV-aware routing, and request migration while maintaining full compatibility with SGLang's engine arguments.
### Argument Handling
Dynamo SGLang uses SGLang's native argument parser, so **most SGLang engine arguments work identically**. You can pass any SGLang argument (like `--model-path`, `--tp`, `--trust-remote-code`) directly to `dynamo.sglang`.
#### Dynamo-Specific Arguments
| Argument | Description | Default | SGLang Equivalent |
|----------|-------------|---------|-------------------|
| `--endpoint` | Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
| `--dyn-tool-call-parser` | Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) | `None` | `--tool-call-parser` |
| `--dyn-reasoning-parser` | Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) | `None` | `--reasoning-parser` |
| `--use-sglang-tokenizer` | Use SGLang's tokenizer instead of Dynamo's | `False` | N/A |
| `--custom-jinja-template` | Use custom chat template for that model (takes precedence over default chat template in model repo) | `None` | `--chat-template` |
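As a small illustration of the `dyn://namespace.component.endpoint` format accepted by `--endpoint`, the sketch below splits such a string into its three parts. `parse_dyn_endpoint` and `DynEndpoint` are hypothetical helper names used only for this example; they are not part of Dynamo's API, and Dynamo's own parsing rules may differ.

```python
from typing import NamedTuple


class DynEndpoint(NamedTuple):
    """The three dot-separated parts of a dyn:// endpoint string."""
    namespace: str
    component: str
    endpoint: str


def parse_dyn_endpoint(uri: str) -> DynEndpoint:
    """Split a dyn://namespace.component.endpoint string (hypothetical helper)."""
    prefix = "dyn://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a {prefix} URI, got {uri!r}")
    parts = uri[len(prefix):].split(".")
    # Exactly three non-empty parts are required by the documented format.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {uri!r}")
    return DynEndpoint(*parts)


parsed = parse_dyn_endpoint("dyn://dynamo.backend.generate")
print(parsed.namespace, parsed.component, parsed.endpoint)
```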
#### Tokenizer Behavior
- **Default (`--use-sglang-tokenizer` not set)**: Dynamo's frontend handles tokenization/detokenization and passes `input_ids` to SGLang
- **With `--use-sglang-tokenizer`**: SGLang handles tokenization/detokenization, Dynamo passes raw prompts
> [!NOTE]
> When using `--use-sglang-tokenizer`, only `v1/chat/completions` is available through Dynamo's frontend.
### Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
#### Cancellation Support Matrix
| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ⚠️ | ✅ |
> [!WARNING]
> ⚠️ The SGLang backend currently does not support cancellation during the remote prefill phase in disaggregated mode.
For more details, see the [Request Cancellation Architecture](../../fault_tolerance/request_cancellation.md) documentation.
## Installation
### Install latest release
We suggest using uv to install the latest release of `ai-dynamo[sglang]`. If you do not have uv yet, you can install it with `curl -LsSf https://astral.sh/uv/install.sh | sh`
<details>
<summary>Expand for instructions</summary>
```bash
# create a virtual env
uv venv --python 3.12 --seed
# install the latest release (which comes bundled with a stable sglang version)
uv pip install "ai-dynamo[sglang]"
```
</details>
### Install editable version for development
<details>
<summary>Expand for instructions</summary>
This requires Rust to be installed. We also recommend a proper installation of the CUDA toolkit, as sglang requires `nvcc` to be available.
```bash
# create a virtual env
uv venv --python 3.12 --seed
# build dynamo runtime bindings
uv pip install maturin
cd $DYNAMO_HOME/lib/bindings/python
maturin develop --uv
cd $DYNAMO_HOME
# install dynamo along with its supported sglang version
# (add --prerelease=allow if you need flashinfer rc versions)
uv pip install -e .
# install any sglang version >= 0.5.3.post2
uv pip install "sglang[all]==0.5.3.post2"
```
</details>
### Using docker containers
<details>
<summary>Expand for instructions</summary>
We are in the process of shipping pre-built docker containers that contain installations of DeepEP, DeepGEMM, and NVSHMEM in order to support WideEP and P/D. For now, you can quickly build the container from source with the following command.
```bash
cd $DYNAMO_HOME
python container/render.py --framework=sglang --target=runtime --output-short-filename
docker build -t dynamo:sglang-latest -f container/rendered.Dockerfile .
```
And then run it using
```bash
docker run \
--gpus all \
-it \
--rm \
--network host \
--shm-size=10G \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
dynamo:sglang-latest
```
</details>
## Quick Start
Below we provide a guide that lets you run all of our common deployment patterns on a single node.
### Start Infrastructure Services (Local Development Only)
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):
```bash
docker compose -f deploy/docker-compose.yml up -d
```
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
> - **NATS** is optional and only needed when using KV routing with events (the default). You can disable it with the `--no-kv-events` flag to use prediction-based routing instead.
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
> [!TIP]
> Each example corresponds to a simple bash script that runs the OpenAI-compatible server, processor, and optional router (written in Rust) together with the LLM engine (written in Python) in a single terminal. You can also take each command and run it in a separate terminal.
>
> Additionally - because we use sglang's argument parser, you can pass in any argument that sglang supports to the worker!
### Aggregated Serving
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg.sh
```
### Aggregated Serving with KV Routing
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg_router.sh
```
### Aggregated Serving for Embedding Models
Here's an example that uses the [Qwen/Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) model.
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg_embed.sh
```
<details>
<summary>Send the following request to verify your deployment:</summary>
```bash
curl localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Embedding-4B",
"input": "Hello, world!"
}'
```
</details>
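Once the request above returns, a quick sanity check is to compare embeddings with cosine similarity. The sketch below assumes the frontend returns the OpenAI embeddings response shape (a `data` list whose items carry `index` and `embedding`); the sample response is made up for illustration and uses tiny two-dimensional vectors rather than real model output.

```python
import math


def extract_embeddings(response: dict) -> list[list[float]]:
    """Return embedding vectors ordered by their `index` field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Made-up response in the assumed OpenAI-style shape.
sample_response = {
    "data": [
        {"index": 1, "embedding": [0.0, 1.0]},
        {"index": 0, "embedding": [1.0, 0.0]},
    ]
}
first, second = extract_embeddings(sample_response)
print(cosine_similarity(first, second))
```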
### Disaggregated Serving
See [SGLang Disaggregation](sglang-disaggregation.md) to learn more about how sglang and dynamo handle disaggregated serving.
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg.sh
```
### Disaggregated Serving with KV Aware Prefill Routing
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg_router.sh
```
### Disaggregated Serving with Mixture-of-Experts (MoE) models and DP attention
You can use this configuration to test out disaggregated serving with dp attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.
```bash
# note this will require 4 GPUs
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg_dp_attn.sh
```
### Testing the Deployment
Send a test request to verify your deployment:
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
}
],
"stream": true,
"max_tokens": 30
}'
```
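Because the request sets `"stream": true`, the response arrives as server-sent events rather than a single JSON body. The sketch below shows one way to reassemble the generated text, assuming the frontend follows the OpenAI chat-completions streaming format (`data: {...}` lines ending with `data: [DONE]`). `iter_stream_text` is a hypothetical helper, and the sample lines are invented for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Invented sample of what a streamed response could look like.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Roger "}}]}',
    'data: {"choices": [{"delta": {"content": "Federer"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

In practice the `lines` argument would be the decoded lines of the HTTP response body from `localhost:8000/v1/chat/completions`.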
## Deployment
We currently provide deployment examples for Kubernetes and SLURM.
### Kubernetes
- **[Deploying Dynamo with SGLang on Kubernetes](../../../examples/backends/sglang/deploy/README.md)**
### SLURM
- **[Deploying Dynamo with SGLang on SLURM](../../../examples/backends/sglang/slurm_jobs/README.md)**
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running Diffusion LMs with SGLang
Diffusion Language Models (Diffusion LMs) are a class of generative models that use diffusion processes for text generation. This guide shows how to deploy diffusion models like LLaDA2.0 using SGLang as the backend with Dynamo. Diffusion LMs work differently from autoregressive models - they iteratively refine generated text through a diffusion process.
## Launch the Deployment
### Using the Launch Script (Recommended)
The easiest way to start the diffusion LM service is using the provided launch script:
```bash
bash examples/backends/sglang/launch/diffusion_llada.sh
```
### Manual Launch Steps
If you prefer to launch components manually:
**Start frontend**
```bash
python -m dynamo.frontend --http-port 8001 &
```
**Run diffusion worker**
```bash
export CUDA_VISIBLE_DEVICES=0,1
python -m dynamo.sglang \
--model-path inclusionAI/LLaDA2.0-mini-preview \
--tp-size 2 \
--skip-tokenizer-init \
--trust-remote-code \
--endpoint dyn://dynamo.backend.generate \
--enable-metrics \
--disable-cuda-graph \
--disable-overlap-schedule \
--attention-backend triton \
--dllm-algorithm LowConfidence
```
## Diffusion Algorithms
The diffusion worker uses the **LowConfidence** algorithm for the iterative refinement process. This algorithm refines tokens with low confidence scores, progressively replacing masked tokens with the model's predictions until confidence thresholds are met.
For more details on diffusion algorithms and configuration options, refer to the [SGLang Diffusion Language Models documentation](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/diffusion_language_models.md).
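To make the idea of confidence-based refinement concrete, here is a toy sketch of the loop described above: start fully masked, commit every prediction whose confidence clears a threshold, and fall back to the single most confident prediction when none does so that each step makes progress. This is an illustration written from the description in this guide, not SGLang's implementation; `low_confidence_decode`, `toy_predict`, and the confidence numbers are all hypothetical.

```python
MASK = None  # placeholder for a position that has not been decoded yet


def low_confidence_decode(length, predict, threshold=0.9):
    """Iteratively unmask positions; predict(tokens, i) returns (token, confidence)."""
    tokens = [MASK] * length
    while MASK in tokens:
        proposals = {i: predict(tokens, i) for i, t in enumerate(tokens) if t is MASK}
        accepted = [i for i, (_, conf) in proposals.items() if conf >= threshold]
        if not accepted:
            # Nothing is confident enough: commit only the best guess this round.
            accepted = [max(proposals, key=lambda i: proposals[i][1])]
        for i in accepted:
            tokens[i] = proposals[i][0]
    return tokens


# Toy stand-in for a model: confidence grows as more context is filled in.
TARGET = ["the", "cat", "sat", "down"]
BASE_CONFIDENCE = [0.95, 0.5, 0.6, 0.4]


def toy_predict(tokens, position):
    filled = sum(t is not MASK for t in tokens)
    return TARGET[position], min(1.0, BASE_CONFIDENCE[position] + 0.3 * filled)


result = low_confidence_decode(len(TARGET), toy_predict)
print(result)
```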
## Testing the Deployment
Once deployed, you can test the service using curl:
```bash
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "inclusionAI/LLaDA2.0-mini-preview",
"messages": [
{
"role": "user",
"content": "Hello! How are you?"
}
],
"temperature": 0.7,
"max_tokens": 512
}'
```
Or use the completions endpoint:
```bash
curl -X POST http://localhost:8001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "inclusionAI/LLaDA2.0-mini-preview",
"prompt": "Once upon a time",
"max_tokens": 256
}'
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Expert Parallelism Load Balancer (EPLB) in SGLang
Mixture-of-Experts (MoE) models utilize a technique called Expert Parallelism (EP), where experts are distributed across multiple GPUs. While this allows for much larger and more powerful models, it can lead to an uneven workload distribution. Because the load on different experts may vary depending on the workload, some GPUs can become bottlenecks, forcing the entire system to wait. This imbalance leads to wasted compute cycles and increased memory usage.
To address this, SGLang implements an Expert Parallelism Load Balancer (EPLB) inspired by the work in the DeepSeek-V3 paper. EPLB analyzes expert usage patterns and dynamically re-arranges the experts across the available GPUs to ensure a more balanced workload.
## The EPLB Algorithm: Core Concepts
The load balancing algorithm revolves around a few key ideas to achieve an optimal distribution of work.
### Redundant Experts for Flexibility
The core strategy is to create **redundant experts**. Instead of being limited to the model's original number of experts, EPLB can create duplicates of heavily-loaded experts. For example, if a model has 256 experts, you can configure EPLB to create an additional 32 "redundant" experts, bringing the total to 288. This pool of replicated experts is then strategically packed onto the available GPUs. A popular expert might be duplicated multiple times, while a moderately used expert might be grouped with several rarely used ones on a single GPU.
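A simplified way to picture how redundant slots get handed out is a greedy loop that always gives the next replica to the expert with the highest load per replica. The sketch below is an illustration of that idea only; `replicate_experts` and the load numbers are hypothetical, and the real EPLB code additionally handles groups, nodes, and GPU packing.

```python
def replicate_experts(loads: list[float], num_redundant: int) -> list[int]:
    """Return the replica count per expert after adding num_redundant copies."""
    replicas = [1] * len(loads)
    for _ in range(num_redundant):
        # The expert whose replicas each carry the most load gets the next copy.
        hottest = max(range(len(loads)), key=lambda i: loads[i] / replicas[i])
        replicas[hottest] += 1
    return replicas


loads = [90.0, 30.0, 20.0, 10.0]  # made-up per-expert load
counts = replicate_experts(loads, num_redundant=2)
print(counts)  # [3, 1, 1, 1]: the hot expert is split three ways
```

With these numbers the hottest expert ends up carrying 30 units per replica, matching the second-busiest expert.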
### Group-Limited Routing for Efficiency
Modern MoE models like DeepSeek-V3 use **group-limited expert routing**. In this design, experts are organized into groups, and routing decisions are constrained within these groups. EPLB can take advantage of this structure to reduce inter-node data traffic by attempting to place all experts from the same group onto the same node whenever possible.
### Load Balancing Policies
The algorithm comes with two policies for different scenarios:
1. **Hierarchical Load Balancing**: This policy is used when the number of server nodes evenly divides the number of expert groups. It first harnesses the group-limited routing by packing expert groups onto nodes to balance the load between nodes. Then, within each node, it replicates and packs the experts onto individual GPUs to balance the load locally. This is often used during prefill where the expert-parallel size might be smaller.
2. **Global Load Balancing**: In all other cases, a global policy is used. It replicates experts globally without regard to their group affiliation and packs them onto individual GPUs. This policy is more general and can be adopted during the decoding stage with a larger expert-parallel size.
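The choice between the two policies, and the first step of the hierarchical one, can be sketched as follows. This is a simplified illustration under stated assumptions: `choose_policy` and `pack_groups` are hypothetical names, the group loads are made up, and the packing is a plain greedy heuristic (heaviest group first, onto the least-loaded node that still has room) rather than the exact algorithm SGLang uses.

```python
def choose_policy(num_groups: int, num_nodes: int) -> str:
    """Hierarchical when nodes evenly divide the expert groups, else global."""
    if num_nodes > 0 and num_groups % num_nodes == 0:
        return "hierarchical"
    return "global"


def pack_groups(group_loads: list[float], num_nodes: int) -> list[list[int]]:
    """Greedily assign an equal number of groups to each node, balancing load."""
    per_node = len(group_loads) // num_nodes
    node_load = [0.0] * num_nodes
    node_groups: list[list[int]] = [[] for _ in range(num_nodes)]
    # Place the heaviest groups first.
    for g in sorted(range(len(group_loads)), key=lambda g: -group_loads[g]):
        open_nodes = [n for n in range(num_nodes) if len(node_groups[n]) < per_node]
        target = min(open_nodes, key=lambda n: node_load[n])
        node_groups[target].append(g)
        node_load[target] += group_loads[g]
    return node_groups


group_loads = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]  # made-up loads
print(choose_policy(num_groups=8, num_nodes=2))
print(pack_groups(group_loads, num_nodes=2))
```

In this example both nodes end up with a total load of 18, after which the per-node replication and GPU packing would proceed locally.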
## How SGLang Implements EPLB
SGLang provides a robust implementation of EPLB, allowing for dynamic, online rebalancing of expert locations based on real-world traffic.
### Dynamic Rebalancing
You can enable dynamic rebalancing by setting the `--enable-eplb` flag. When enabled, the `EPLBManager` runs in the background and triggers a rebalance after a fixed number of iterations, configured with `--eplb-rebalance-num-iterations`. At each rebalance, it computes a new expert placement plan based on the latest usage statistics and updates the model's expert locations on the fly.
### Expert Usage Recording
To make intelligent balancing decisions, SGLang needs to collect data on expert usage. The `ExpertDistributionRecorder` is responsible for this, and its behavior is controlled by the `--expert-distribution-recorder-mode` flag, which determines the granularity of the collected data. When `--enable-eplb` is set, this mode defaults to `stat` to gather statistics for rebalancing. The available modes are:
- **`per_token`**: This is the most detailed mode. It records the specific expert choices for every single token processed by the model. While it provides the richest data, it also has the highest performance overhead. The raw, unaggregated data for each forward pass is stored.
- **`per_pass`**: In this mode, SGLang records the aggregated expert usage counts for each individual forward pass. The data is not aggregated across different passes, giving you a snapshot of expert popularity for each batch of requests.
- **`stat`**: This mode also records the exact expert usage counts for each forward pass, but it then aggregates these counts across multiple passes (the number of passes is determined by `--expert-distribution-recorder-buffer-size`). This provides a moving average of expert usage statistics and is the default when EPLB is enabled.
- **`stat_approx`**: This mode is similar to `stat` but gathers _approximate_ statistics, usually from the DeepEP dispatcher. This method has lower overhead than `stat` but is less precise, especially for small batch sizes. It is a good choice when performance is critical.
The collected statistics are then fed into the rebalancing algorithm to generate a new expert placement plan.
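The difference in granularity between the modes can be sketched with plain Python data structures. This is an illustration of the concepts only, not SGLang's `ExpertDistributionRecorder`: `count_pass` and `StatBuffer` are hypothetical names, the token-to-expert choices are made up, and treating the buffer as a sliding window over the most recent passes is an assumption.

```python
from collections import deque


def count_pass(token_choices: list[list[int]], num_experts: int) -> list[int]:
    """per_pass-style record: how often each expert was chosen in one pass.

    token_choices is the per_token-style raw data: the experts picked per token.
    """
    counts = [0] * num_experts
    for experts in token_choices:
        for e in experts:
            counts[e] += 1
    return counts


class StatBuffer:
    """stat-style record: per-pass counts summed over the last buffer_size passes."""

    def __init__(self, num_experts: int, buffer_size: int):
        self.num_experts = num_experts
        self.passes = deque(maxlen=buffer_size)  # oldest pass drops out when full

    def add(self, counts: list[int]) -> None:
        self.passes.append(counts)

    def totals(self) -> list[int]:
        return [sum(p[e] for p in self.passes) for e in range(self.num_experts)]


# Three made-up forward passes, two tokens each, top-2 routing over 4 experts.
passes = [
    [[0, 1], [0, 2]],
    [[0, 3], [1, 3]],
    [[2, 3], [2, 3]],
]
buf = StatBuffer(num_experts=4, buffer_size=2)
for token_choices in passes:
    buf.add(count_pass(token_choices, 4))
print(buf.totals())  # counts from the two most recent passes only
```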
### Initializing with a Pre-computed Distribution
While SGLang can start with a simple default layout and learn a better one over time, you can also provide it with a pre-computed expert distribution to start with. The `--init-expert-location` flag allows you to specify a file path (`.pt` or `.json`) or a JSON string containing an expert layout. This is useful if you have already analyzed a representative workload offline and want the server to start immediately with a balanced configuration. If this flag is not set, it defaults to a `trivial` sequential layout.
### References and further reading
- [SGLang Large Scale P/D + WideEP Deployment](https://lmsys.org/blog/2025-05-05-large-scale-ep/#expert-parallelism-load-balancer)
- [Deepseek's EPLB repository](https://github.com/deepseek-ai/EPLB)
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running gpt-oss-120b Disaggregated with SGLang
The gpt-oss-120b guide for SGLang is largely identical to the [guide for vLLM](/docs/backends/vllm/gpt-oss.md).
Please use the vLLM guide as a reference, with the different deployment steps highlighted below:
## Launch the Deployment
Note that GPT-OSS is a reasoning model with tool calling support. To
ensure the response is being processed correctly, the worker should be
launched with proper `--dyn-reasoning-parser` and `--dyn-tool-call-parser`.
**Start frontend**
```bash
python3 -m dynamo.frontend --http-port 8000 &
```
**Run decode worker**
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m dynamo.sglang \
--model-path openai/gpt-oss-120b \
--served-model-name openai/gpt-oss-120b \
--tp 4 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--dyn-reasoning-parser gpt_oss \
--dyn-tool-call-parser harmony
```
**Run prefill worker**
```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m dynamo.sglang \
--model-path openai/gpt-oss-120b \
--served-model-name openai/gpt-oss-120b \
--tp 4 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--dyn-reasoning-parser gpt_oss \
--dyn-tool-call-parser harmony
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Profiling SGLang Workers in Dynamo
> [!NOTE]
> **See also**: [Profiler Component Overview](/docs/components/profiler/README.md) for SLA-driven profiling and deployment optimization.
Dynamo exposes profiling endpoints for SGLang workers via the system server's `/engine/*` routes. This allows you to start and stop PyTorch profiling on running inference workers without restarting them.
These endpoints wrap SGLang's internal `TokenizerManager.start_profile()` and `stop_profile()` methods. See SGLang's documentation for the full list of supported parameters.
## Quick Start
1. **Start profiling:**
```bash
curl -X POST http://localhost:9090/engine/start_profile \
-H "Content-Type: application/json" \
-d '{"output_dir": "/tmp/profiler_output"}'
```
2. **Run some inference requests to generate profiling data**
3. **Stop profiling:**
```bash
curl -X POST http://localhost:9090/engine/stop_profile
```
4. **View the traces:**
The profiler outputs Chrome trace files in the specified `output_dir`. You can view them using:
- Chrome's `chrome://tracing`
- [Perfetto UI](https://ui.perfetto.dev/)
- TensorBoard with the PyTorch Profiler plugin
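The same start/run/stop sequence can be scripted so that profiling is always stopped, even if the workload fails. The sketch below uses only the standard library and the two routes shown above; `build_profile_request` and `profile_workload` are hypothetical helpers (not the provided test script), and the address `localhost:9090` is taken from the curl examples and may differ in your deployment.

```python
import json
import urllib.request
from typing import Callable, Optional

BASE_URL = "http://localhost:9090"  # system server address used in the curl examples


def build_profile_request(action: str, output_dir: Optional[str] = None) -> urllib.request.Request:
    """Build a POST request for /engine/start_profile or /engine/stop_profile."""
    headers = {}
    data = b""
    if output_dir is not None:
        data = json.dumps({"output_dir": output_dir}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        f"{BASE_URL}/engine/{action}", data=data, headers=headers, method="POST"
    )


def profile_workload(run_workload: Callable[[], None], output_dir: str = "/tmp/profiler_output") -> None:
    """Start profiling, run the workload, and always stop profiling afterwards."""
    urllib.request.urlopen(build_profile_request("start_profile", output_dir)).close()
    try:
        run_workload()  # e.g. send a few inference requests to the frontend
    finally:
        urllib.request.urlopen(build_profile_request("stop_profile")).close()
```

Calling `profile_workload(my_requests)` with a function that sends a few inference requests would then leave trace files in `output_dir`.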
## Test Script
A test script is provided at [`examples/backends/sglang/test_sglang_profile.py`](../../../examples/backends/sglang/test_sglang_profile.py) that demonstrates the full profiling workflow:
```bash
python examples/backends/sglang/test_sglang_profile.py
```