See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture. SGLang currently supports aggregated and disaggregated serving. KV routing support is coming soon!
## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
Below we provide a guide that lets you run all of the common deployment patterns on a single node. See our different [architectures](../llm/README.md#deployment-architectures) for a high-level overview of each pattern and its architecture diagram.
### Start NATS and ETCD in the background
Start both services using [Docker Compose](../../deploy/metrics/docker-compose.yml):
```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
```
### Build docker
### Build container
```bash
# On an x86 machine - sglang does not support ARM yet
```
Note: The above architecture illustrates all the components. The final components
that get spawned depend upon the chosen graph.
### Example architectures
## Run Single Node Examples
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script simply runs `dynamo-run` to start the ingress and uses `python3` to start the workers. You can easily take each command and run them in separate terminals.
> Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
>
> Additionally - because we use sglang's argument parser, you can pass in any argument that sglang supports to the worker!
#### Aggregated
### Aggregated Serving
```bash
cd $DYNAMO_ROOT/examples/sglang
./launch/agg.sh
```
#### Aggregated serving with KV Routing
### Aggregated Serving with KV Routing
> [!NOTE]
> The current implementation of `examples/sglang/components/worker.py` publishes _placeholder_ engine metrics to keep the Dynamo KV-router happy. Real-time metrics will be surfaced directly from the SGLang engine once the following pull requests are merged:
...
...
@@ -112,10 +117,10 @@ cd $DYNAMO_ROOT/examples/sglang
./launch/agg_router.sh
```
#### Disaggregated serving
### Disaggregated serving
<details>
<summary>SGLang Load Balancer vs Dynamo Discovery</summary>
<summary>Under the hood: SGLang Load Balancer vs Dynamo Discovery</summary>
SGLang uses a mini load balancer to route requests when serving in disaggregated mode. The load balancer functions as follows:
...
...
@@ -136,9 +141,9 @@ cd $DYNAMO_ROOT/examples/sglang
./launch/disagg.sh
```
##### Disaggregated with MoE models and DP attention
### Disaggregated Serving with Mixture-of-Experts (MoE) models and DP attention
SGLang also supports DP attention for MoE models. We provide an example config for this in `configs/disagg-dp-attention.yaml` which is based on the [DeepSeek-R1-Small-2layers](https://huggingface.co/silence09/DeepSeek-R1-Small-2layers) model. You can use this configuration to test out disaggregated serving on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.
You can use this configuration to test out disaggregated serving with dp attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.
```bash
# note this will require 4 GPUs
...
...
@@ -146,8 +151,32 @@ cd $DYNAMO_ROOT/examples/sglang
./launch/disagg_dp_attn.sh
```
In order to scale to the full DeepSeek-R1 model, you can follow the instructions in the [multinode-examples.md](./multinode-examples.md) file.
## Advanced Examples
Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
### Run on multi-node
- **[Run a multi-node model](docs/multinode-examples.md)**
### Large scale P/D disaggregation with WideEP
- **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
- **[Run DeepSeek-R1 on GB200s](docs/dsr1-wideep-gb200.md)**
### Speculative Decoding
- **[Deploying DeepSeek-R1 with MTP - coming soon!](.)**
### Structured Output and Tool Calling
- **[Tool calling with Dynamo - coming soon!](.)**
### Supporting SGLang's native endpoints via Dynamo
- **[HTTP Server for native SGLang endpoints](docs/sgl-http-server.md)**
## Deployment
We currently provide deployment examples for Kubernetes (coming soon!) and SLURM.
##### Disaggregated with WideEP
## Kubernetes
- **[Deploying Dynamo with SGLang on Kubernetes - coming soon!](.)**
Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can find detailed deployment and benchmarking instructions [here](./dsr1-wideep.md)
## SLURM
- **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**
@@ -15,7 +15,7 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
# Running DeepSeek-R1 Disaggregated with WideEP
# Running DeepSeek-R1 Disaggregated with WideEP on H100s
Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://www.nvidia.com/en-us/technologies/ai/deepseek-r1-large-scale-p-d-with-wide-expert-parallelism/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).
...
...
@@ -26,16 +26,16 @@ Dynamo supports SGLang's implementation of wide expert parallelism and large sca
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Supporting SGLang's native endpoints via HTTP Server
# Introduction
The SGLang HTTP server provides a REST API interface for managing and monitoring SGLang components running in a dynamo distributed environment. It leverages dynamo's service discovery mechanism to automatically find and communicate with SGLang workers across the cluster.
## Architecture Overview
The HTTP server (`sgl_http_server.py`) is built on FastAPI and integrates with dynamo's `DistributedRuntime` to discover and interact with SGLang components. It uses the following discovery flow:
1. **Service Discovery**: Queries dynamo's etcd instance to find components that expose specific endpoints
2. **Dynamic Targeting**: Automatically discovers all matching components across namespaces without requiring manual configuration
3. **Direct Communication**: Establishes direct connections to discovered component instances using dynamo's client infrastructure
## Discovery Mechanism
The server uses dynamo's hierarchical service discovery structure:
- **DistributedRuntime**: Maintains connections to etcd (service discovery) and NATS (messaging)
- **Namespace**: Logical grouping of components (default: "dynamo")
- **Component**: Individual SGLang workers or services
- **Endpoint**: Specific functionality exposed by each component
The discovery process queries etcd with the prefix `instances/` to find all registered components that expose the target endpoint. Components are identified by their namespace, component name, and endpoint, allowing the server to dynamically scale operations across multiple instances.
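As a rough illustration of the lookup described above, the sketch below filters a list of etcd keys by namespace, component, and endpoint. The key layout `instances/<namespace>/<component>/<endpoint>:<instance-id>` and the helper names `parse_instance_key` and `find_instances` are assumptions made for this example; the key format that dynamo actually writes to etcd may differ.

```python
from typing import Iterable, List, NamedTuple, Optional

PREFIX = "instances/"  # prefix named in the text above


class Instance(NamedTuple):
    namespace: str
    component: str
    endpoint: str
    instance_id: str


def parse_instance_key(key: str) -> Optional[Instance]:
    """Parse a key of the ASSUMED form
    instances/<namespace>/<component>/<endpoint>:<instance-id>.
    Returns None for keys that do not match that shape."""
    if not key.startswith(PREFIX):
        return None
    parts = key[len(PREFIX):].split("/")
    if len(parts) != 3 or ":" not in parts[2]:
        return None
    endpoint, instance_id = parts[2].split(":", 1)
    return Instance(parts[0], parts[1], endpoint, instance_id)


def find_instances(
    keys: Iterable[str],
    namespace: str,
    endpoint: str,
    component: Optional[str] = None,
) -> List[Instance]:
    """Keep instances in `namespace` that expose `endpoint`.
    With component=None, every component is considered."""
    found = []
    for key in keys:
        inst = parse_instance_key(key)
        if inst is None:
            continue
        if inst.namespace != namespace or inst.endpoint != endpoint:
            continue
        if component is not None and inst.component != component:
            continue
        found.append(inst)
    return found
```

Leaving `component` unset mirrors the "discover all" behavior: every component in the namespace that exposes the target endpoint is returned.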
## Supported Endpoints
### Current Endpoints
#### POST /flush_cache
Flushes the radix cache across all discovered SGLang components.
**Behavior:**
- Discovers all components in the specified namespace that expose the `flush_cache` endpoint
- Sends flush requests to all instances of each discovered component
- Returns success/failure status with details about the operation
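The fan-out behavior listed above can be sketched as follows. This is a minimal illustration, not the code in `sgl_http_server.py`: `flush_all` and the `send_flush` callback are hypothetical names, instances are represented as plain strings, and the failure message is invented for the example.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def flush_all(
    instances: List[str],
    send_flush: Callable[[str], Awaitable[None]],
) -> Dict[str, object]:
    """Send a flush request to every instance concurrently and
    summarize the outcome in a single status dictionary."""
    # return_exceptions=True keeps one failing instance from
    # cancelling the requests sent to the others.
    results = await asyncio.gather(
        *(send_flush(inst) for inst in instances),
        return_exceptions=True,
    )
    failed = [
        inst
        for inst, res in zip(instances, results)
        if isinstance(res, BaseException)
    ]
    if failed:
        # Hypothetical failure payload; only the success payload
        # is documented in the source.
        return {
            "message": "Cache flush failed for: " + ", ".join(failed),
            "success": False,
        }
    return {"message": "Cache flush initiated", "success": True}
```

Collecting per-instance errors instead of raising on the first one lets the caller report which instances were not flushed.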
**Response:**
```json
{
  "message": "Cache flush initiated",
  "success": true
}
```
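A client for this endpoint might look like the sketch below, which uses only the Python standard library. It assumes the server is reachable at `http://localhost:9001` (the port is the documented default; the host is an assumption) and that the request needs no body. The helper names are illustrative.

```python
import json
import urllib.request


def build_flush_request(
    base_url: str = "http://localhost:9001",
) -> urllib.request.Request:
    """Build a POST request for /flush_cache with an empty body (assumed)."""
    url = base_url.rstrip("/") + "/flush_cache"
    return urllib.request.Request(url, data=b"", method="POST")


def parse_flush_response(body: bytes) -> bool:
    """Return the `success` flag from the JSON response shown above."""
    payload = json.loads(body)
    return bool(payload.get("success", False))


def flush_cache(
    base_url: str = "http://localhost:9001", timeout: float = 10.0
) -> bool:
    """Send the request; urlopen raises on non-2xx responses."""
    request = build_flush_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_flush_response(response.read())


if __name__ == "__main__":
    # Requires a running server; prints True when the flush was accepted.
    print(flush_cache())
```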
### Upcoming Endpoints
The following endpoints will be supported in future releases:
#### POST /start_expert_distribution_record
Begins recording expert distribution metrics across SGLang components.
#### POST /stop_expert_distribution_record
Stops the expert distribution recording process.
#### GET /dump_expert_distribution_record
Retrieves the collected expert distribution data.
## Configuration
The server accepts the following command-line arguments:
- `--port`: HTTP server port (default: 9001)
- `--ns/--namespace`: Target dynamo namespace (default: "dynamo")
- `--comp/--component`: Specific component name to target (default: discover all)
- `--endpoint`: Endpoint name to discover (default: "flush_cache")
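For reference, the flags above map naturally onto an `argparse` parser like the one sketched here. This is not the parser from `sgl_http_server.py`; it assumes that `--ns` and `--comp` are short spellings of `--namespace` and `--component`, and it represents "discover all" as `None`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="HTTP server for SGLang components (sketch)"
    )
    parser.add_argument("--port", type=int, default=9001,
                        help="HTTP server port")
    parser.add_argument("--ns", "--namespace", dest="namespace",
                        default="dynamo", help="Target dynamo namespace")
    parser.add_argument("--comp", "--component", dest="component",
                        default=None,
                        help="Component to target (default: discover all)")
    parser.add_argument("--endpoint", default="flush_cache",
                        help="Endpoint name to discover")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```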
## Usage
Start the server:
```bash
python sgl_http_server.py --port 9001 --namespace dynamo
```
The server will automatically discover all SGLang components in the specified namespace and provide HTTP endpoints for managing them.