Commit 80d8aa19, authored by ishandhanani, committed by GitHub

feat(sglang): unify entry point for SGLang backend architecture (#2493)

parent 28400714
......@@ -226,9 +226,6 @@ Below we provide a selected list of advanced examples. Please open up an issue i
- **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
- **[Run DeepSeek-R1 on GB200s](docs/dsr1-wideep-gb200.md)**
### Supporting SGLang's native endpoints via Dynamo
- **[HTTP Server for native SGLang endpoints](docs/sgl-http-server.md)**
### Hierarchical Cache (HiCache)
- **[Enable SGLang Hierarchical Cache (HiCache)](docs/sgl-hicache-example.md)**
......
......@@ -66,7 +66,7 @@ extraPodSpec:
args:
- "python3"
- "-m"
- "dynamo.sglang.worker"
- "dynamo.sglang"
# Model-specific arguments
```
......
......@@ -31,7 +31,7 @@ spec:
- -c
args:
- >-
python3 -m dynamo.sglang.worker
python3 -m dynamo.sglang
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--page-size 16
......
......@@ -34,7 +34,7 @@ spec:
- -c
args:
- >-
python3 -m dynamo.sglang.worker
python3 -m dynamo.sglang
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--page-size 16
......
......@@ -40,7 +40,7 @@ spec:
command: ["sh", "-c"]
args:
- >-
python3 -m dynamo.sglang.decode_worker
python3 -m dynamo.sglang
--model-path meta-llama/Llama-3.3-70B-Instruct
--served-model-name meta-llama/Llama-3.3-70B-Instruct
--tp-size 8
......@@ -67,7 +67,7 @@ spec:
command: ["sh", "-c"]
args:
- >-
python3 -m dynamo.sglang.worker
python3 -m dynamo.sglang
--model-path meta-llama/Llama-3.3-70B-Instruct
--served-model-name meta-llama/Llama-3.3-70B-Instruct
--tp-size 8
......
......@@ -31,7 +31,7 @@ spec:
- -c
args:
- >-
python3 -m dynamo.sglang.decode_worker
python3 -m dynamo.sglang
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--page-size 16
......@@ -58,7 +58,7 @@ spec:
- -c
args:
- >-
python3 -m dynamo.sglang.worker
python3 -m dynamo.sglang
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--page-size 16
......
......@@ -115,7 +115,7 @@ spec:
- -c
args:
- >-
python3 -m dynamo.sglang.decode_worker
python3 -m dynamo.sglang
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--page-size 16
......@@ -141,7 +141,7 @@ spec:
- -c
args:
- >-
python3 -m dynamo.sglang.worker
python3 -m dynamo.sglang
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--page-size 16
......
......@@ -80,7 +80,7 @@ NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 -m dynamo.sglang.worker \
python3 -m dynamo.sglang \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
......@@ -131,7 +131,7 @@ NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 -m dynamo.sglang.decode_worker \
python3 -m dynamo.sglang \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
......
......@@ -55,7 +55,7 @@ python3 -m dynamo.frontend --http-port=8000 &
# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
python3 utils/sgl_http_server.py --ns dynamo &
# run prefill worker
python3 -m dynamo.sglang.worker \
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--skip-tokenizer-init \
......@@ -90,7 +90,7 @@ On the other prefill node (since this example has 4 total prefill nodes), run th
5. Run the decode worker on the head decode node
```bash
python3 -m dynamo.sglang.decode_worker \
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--skip-tokenizer-init \
......
......@@ -21,7 +21,7 @@ Node 1: Run HTTP ingress, processor, and 8 shards of the prefill worker
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# run prefill worker
python3 -m dynamo.sglang.worker \
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
......@@ -40,7 +40,7 @@ python3 -m dynamo.sglang.worker \
Node 2: Run the remaining 8 shards of the prefill worker
```bash
python3 -m dynamo.sglang.worker \
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
......@@ -59,7 +59,7 @@ python3 -m dynamo.sglang.worker \
Node 3: Run the first 8 shards of the decode worker
```bash
python3 -m dynamo.sglang.decode_worker \
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
......
......@@ -10,7 +10,7 @@ This guide shows how to enable SGLang's Hierarchical Cache (HiCache) inside Dyna
## 1) Start the SGLang worker with HiCache enabled
```bash
python -m dynamo.sglang.worker \
python -m dynamo.sglang \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--host 0.0.0.0 --port 8000 \
--page-size 64 \
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Supporting SGLang's native endpoints via HTTP Server
## Introduction
The SGLang HTTP server provides a REST API for managing and monitoring SGLang components running in a Dynamo distributed environment. It uses Dynamo's service discovery mechanism to find and communicate with SGLang workers across the cluster automatically.
<details>
<summary>How it works under the hood</summary>
## Architecture Overview
The HTTP server (`sgl_http_server.py`) is built on FastAPI and integrates with Dynamo's `DistributedRuntime` to discover and interact with SGLang components. It uses the following discovery flow:
1. **Service Discovery**: Queries Dynamo's etcd instance to find components that expose specific endpoints
2. **Dynamic Targeting**: Automatically discovers all matching components across namespaces without requiring manual configuration
3. **Direct Communication**: Establishes direct connections to discovered component instances using Dynamo's client infrastructure
## Discovery Mechanism
The server uses Dynamo's hierarchical service discovery structure:
- **DistributedRuntime**: Maintains connections to etcd (service discovery) and NATS (messaging)
- **Namespace**: Logical grouping of components (default: "dynamo")
- **Component**: Individual SGLang workers or services
- **Endpoint**: Specific functionality exposed by each component
The discovery process queries etcd with the prefix `instances/` to find all registered components that expose the target endpoint. Components are identified by their namespace, component name, and endpoint, allowing the server to dynamically scale operations across multiple instances.
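To make the identification step concrete, here is a minimal sketch of filtering registered instances by namespace and endpoint. The key layout `instances/<namespace>/<component>/<endpoint>:<instance_id>` is an assumption made for illustration, as are the names `InstanceRef`, `parse_instance_key`, and `find_targets`; the real server goes through `DistributedRuntime` rather than parsing keys by hand.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

INSTANCE_PREFIX = "instances/"  # prefix named in the text above


@dataclass(frozen=True)
class InstanceRef:
    """Hypothetical record for one registered instance."""

    namespace: str
    component: str
    endpoint: str
    instance_id: str


def parse_instance_key(key: str) -> Optional[InstanceRef]:
    """Parse an assumed 'instances/<ns>/<component>/<endpoint>:<id>' key."""
    if not key.startswith(INSTANCE_PREFIX):
        return None
    parts = key[len(INSTANCE_PREFIX):].split("/")
    if len(parts) != 3:
        return None
    namespace, component, tail = parts
    endpoint, sep, instance_id = tail.partition(":")
    if not sep or not endpoint or not instance_id:
        return None
    return InstanceRef(namespace, component, endpoint, instance_id)


def find_targets(keys: Iterable[str], namespace: str, endpoint: str) -> List[InstanceRef]:
    """Return every instance in `namespace` that exposes `endpoint`."""
    refs = (parse_instance_key(k) for k in keys)
    return [
        r for r in refs
        if r is not None and r.namespace == namespace and r.endpoint == endpoint
    ]
```

Because the filter works on whatever keys are currently registered, adding or removing workers changes the target list without any configuration change.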
</details>
## Supported Endpoints
All of these endpoints accept a POST request and can be called with:
```bash
curl -X POST http://<ip>:9001/<endpoint>
```
#### `/flush_cache`
Flushes the KV cache across all SGLang components. Useful for resetting state after a warmup or a benchmarking run.
#### `/start_expert_distribution_record`
Begins recording expert distribution metrics across SGLang components.
#### `/stop_expert_distribution_record`
Stops the expert distribution recording process.
#### `/dump_expert_distribution_record`
Dumps the collected expert distribution data.
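The same calls can be scripted. The following is a minimal sketch using only the Python standard library; the helper names `build_request` and `call_endpoint` are hypothetical, and the sketch assumes the endpoints need no request body, as the `curl` command above suggests.

```python
import urllib.request

# Endpoint names taken from the list above.
ENDPOINTS = (
    "flush_cache",
    "start_expert_distribution_record",
    "stop_expert_distribution_record",
    "dump_expert_distribution_record",
)


def build_request(host: str, endpoint: str, port: int = 9001) -> urllib.request.Request:
    """Build a POST request for one of the documented endpoints."""
    name = endpoint.lstrip("/")
    if name not in ENDPOINTS:
        raise ValueError(f"Unknown endpoint: {endpoint!r}")
    # An empty body is an assumption; the curl example sends none.
    return urllib.request.Request(
        f"http://{host}:{port}/{name}", data=b"", method="POST"
    )


def call_endpoint(host: str, endpoint: str, port: int = 9001, timeout: float = 30.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_request(host, endpoint, port), timeout=timeout) as resp:
        return resp.status
```

For example, `call_endpoint("127.0.0.1", "flush_cache")` would issue the same request as the `curl` command, provided the server is reachable at that address.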
## Configuration
The server accepts the following command-line arguments:
- `--port`: HTTP server port (default: 9001)
- `--ns/--namespace`: Target dynamo namespace (default: "dynamo")
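As an illustration of how these two flags could be declared, here is a minimal `argparse` sketch. It is not the server's actual parser; `build_parser` is a hypothetical name, and only the flag names and defaults come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the two documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(description="SGLang HTTP server (sketch)")
    parser.add_argument("--port", type=int, default=9001, help="HTTP server port")
    # --ns and --namespace are aliases that populate the same attribute.
    parser.add_argument(
        "--ns",
        "--namespace",
        dest="namespace",
        default="dynamo",
        help="Target Dynamo namespace",
    )
    return parser
```

With this declaration, `--ns dynamo` and `--namespace dynamo` are interchangeable on the command line.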
## Usage
Start the server:
```bash
python src/dynamo/sglang/utils/sgl_http_server.py --port 9001 --namespace dynamo
```
The server will automatically discover all SGLang components in the specified namespace and provide HTTP endpoints for managing them.
......@@ -19,7 +19,7 @@ python3 -m dynamo.frontend --http-port=8000 &
DYNAMO_PID=$!
# run worker
python3 -m dynamo.sglang.worker \
python3 -m dynamo.sglang \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--page-size 16 \
......
......@@ -19,7 +19,7 @@ python -m dynamo.frontend --router-mode kv --http-port=8000 &
DYNAMO_PID=$!
# run worker
python3 -m dynamo.sglang.worker \
python3 -m dynamo.sglang \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--page-size 16 \
......@@ -29,7 +29,7 @@ python3 -m dynamo.sglang.worker \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:5557"}' &
WORKER_PID=$!
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.worker \
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--page-size 16 \
......
......@@ -19,7 +19,7 @@ python3 -m dynamo.frontend --http-port=8000 &
DYNAMO_PID=$!
# run prefill worker
python3 -m dynamo.sglang.worker \
python3 -m dynamo.sglang \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--page-size 16 \
......@@ -31,7 +31,7 @@ python3 -m dynamo.sglang.worker \
PREFILL_PID=$!
# run decode worker
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--page-size 16 \
......
......@@ -5,8 +5,8 @@
# Setup cleanup trap
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $PREFILL_PID $HTTP_SERVER_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID $HTTP_SERVER_PID 2>/dev/null || true
kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
echo "Cleanup complete."
}
trap cleanup EXIT INT TERM
......@@ -18,16 +18,12 @@ python3 -m dynamo.sglang.utils.clear_namespace --namespace dynamo
python3 -m dynamo.frontend --http-port=8000 &
DYNAMO_PID=$!
# run http server
python3 -m dynamo.sglang.utils.sgl_http_server --namespace dynamo &
HTTP_SERVER_PID=$!
# Set the expert distribution recording directory
mkdir -p /tmp/sglang_expert_distribution_record
export SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/tmp/sglang_expert_distribution_record
# run prefill worker
python3 -m dynamo.sglang.worker \
python3 -m dynamo.sglang \
--model-path silence09/DeepSeek-R1-Small-2layers \
--served-model-name silence09/DeepSeek-R1-Small-2layers \
--tp 2 \
......@@ -42,7 +38,7 @@ python3 -m dynamo.sglang.worker \
PREFILL_PID=$!
# run decode worker
CUDA_VISIBLE_DEVICES=2,3 python3 -m dynamo.sglang.decode_worker \
CUDA_VISIBLE_DEVICES=2,3 python3 -m dynamo.sglang \
--model-path silence09/DeepSeek-R1-Small-2layers \
--served-model-name silence09/DeepSeek-R1-Small-2layers \
--tp 2 \
......
......@@ -86,7 +86,7 @@ if [ "$mode" = "prefill" ]; then
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 -m dynamo.sglang.worker \
python3 -m dynamo.sglang \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
......@@ -187,7 +187,7 @@ elif [ "$mode" = "decode" ]; then
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 -m dynamo.sglang.decode_worker \
python3 -m dynamo.sglang \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
......
......@@ -70,7 +70,7 @@ fi
if [ "$mode" = "prefill" ]; then
if [ "$cmd" = "dynamo" ]; then
# H100 dynamo prefill command
python3 -m dynamo.sglang.worker \
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--skip-tokenizer-init \
......@@ -131,7 +131,7 @@ if [ "$mode" = "prefill" ]; then
elif [ "$mode" = "decode" ]; then
if [ "$cmd" = "dynamo" ]; then
# H100 dynamo decode command
python3 -m dynamo.sglang.decode_worker \
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--skip-tokenizer-init \
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
from dynamo.sglang.main import main

if __name__ == "__main__":
    main()
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import argparse
import contextlib
import logging
import os
import socket
import sys
from argparse import Namespace
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict

from sglang.srt.server_args import ServerArgs

from dynamo.sglang import __version__

DEFAULT_ENDPOINT = "dyn://dynamo.backend.generate"
DYNAMO_ARGS: Dict[str, Dict[str, Any]] = {
    "endpoint": {
        "flags": ["--endpoint"],
        "type": str,
        "help": f"Dynamo endpoint string in 'dyn://namespace.component.endpoint' format. Example: {DEFAULT_ENDPOINT}",
    },
    "migration-limit": {
        "flags": ["--migration-limit"],
        "type": int,
        "default": 0,
        "help": "Maximum number of times a request may be migrated to a different engine worker",
    },
}


@dataclass
class DynamoArgs:
    namespace: str
    component: str
    endpoint: str
    migration_limit: int


class DisaggregationMode(Enum):
    AGGREGATED = "agg"
    PREFILL = "prefill"
    DECODE = "decode"


class Config:
    def __init__(self, server_args: ServerArgs, dynamo_args: DynamoArgs) -> None:
        self.server_args = server_args
        self.dynamo_args = dynamo_args
        self.serving_mode = self._set_serving_strategy()

    def _set_serving_strategy(self):
        if self.server_args.disaggregation_mode == "null":
            return DisaggregationMode.AGGREGATED
        elif self.server_args.disaggregation_mode == "prefill":
            return DisaggregationMode.PREFILL
        elif self.server_args.disaggregation_mode == "decode":
            return DisaggregationMode.DECODE


def parse_args(args: list[str]) -> Config:
    """
    Parse all arguments and return Config with server_args and dynamo_args
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--version", action="version", version=f"Dynamo Backend SGLang {__version__}"
    )

    # Dynamo args
    for info in DYNAMO_ARGS.values():
        parser.add_argument(
            *info["flags"],
            type=info["type"],
            default=info["default"] if "default" in info else None,
            help=info["help"],
        )

    # SGLang args
    bootstrap_port = _reserve_disaggregation_bootstrap_port()
    ServerArgs.add_cli_args(parser)

    parsed_args = parser.parse_args(args)

    # Auto-set bootstrap port if not provided
    if not any(arg.startswith("--disaggregation-bootstrap-port") for arg in args):
        args_dict = vars(parsed_args)
        args_dict["disaggregation_bootstrap_port"] = bootstrap_port
        parsed_args = Namespace(**args_dict)

    # Dynamo argument processing
    # If an endpoint is provided, validate and use it
    # otherwise fall back to default endpoints
    namespace = os.environ.get("DYNAMO_NAMESPACE", "dynamo")
    endpoint = parsed_args.endpoint
    if endpoint is None:
        if (
            hasattr(parsed_args, "disaggregation_mode")
            and parsed_args.disaggregation_mode == "prefill"
        ):
            endpoint = f"dyn://{namespace}.prefill.generate"
        else:
            endpoint = f"dyn://{namespace}.backend.generate"

    # Always parse the endpoint (whether auto-generated or user-provided)
    endpoint_str = endpoint.replace("dyn://", "", 1)
    endpoint_parts = endpoint_str.split(".")
    if len(endpoint_parts) != 3:
        logging.error(
            f"Invalid endpoint format: '{endpoint}'. Expected 'dyn://namespace.component.endpoint' or 'namespace.component.endpoint'."
        )
        sys.exit(1)

    parsed_namespace, parsed_component_name, parsed_endpoint_name = endpoint_parts

    dynamo_args = DynamoArgs(
        namespace=parsed_namespace,
        component=parsed_component_name,
        endpoint=parsed_endpoint_name,
        migration_limit=parsed_args.migration_limit,
    )
    logging.debug(f"Dynamo args: {dynamo_args}")

    server_args = ServerArgs.from_cli_args(parsed_args)
    return Config(server_args, dynamo_args)


@contextlib.contextmanager
def reserve_free_port(host: str = "localhost"):
    """
    Find and reserve a free port until context exits.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, 0))
        _, port = sock.getsockname()
        yield port
    finally:
        sock.close()


def _reserve_disaggregation_bootstrap_port():
    """
    Each worker requires a unique port for disaggregation_bootstrap_port.
    We use an existing utility function that reserves a free port on your
    machine to avoid collisions.
    """
    with reserve_free_port() as port:
        return port
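
The default-endpoint selection in `parse_args` is what lets a single `python3 -m dynamo.sglang` command replace the separate `worker` and `decode_worker` modules. The sketch below isolates that logic so it can be exercised without SGLang installed. `resolve_endpoint` is a hypothetical helper, not part of the commit, and it raises `ValueError` where the original logs an error and calls `sys.exit(1)`.

```python
from typing import Optional, Tuple


def resolve_endpoint(
    endpoint: Optional[str],
    disaggregation_mode: str,
    namespace: str = "dynamo",
) -> Tuple[str, str, str]:
    """Return (namespace, component, endpoint), mirroring parse_args."""
    if endpoint is None:
        # Prefill workers register under 'prefill'; every other mode uses 'backend'.
        component = "prefill" if disaggregation_mode == "prefill" else "backend"
        endpoint = f"dyn://{namespace}.{component}.generate"

    # The 'dyn://' scheme is optional, as in the original.
    parts = endpoint.replace("dyn://", "", 1).split(".")
    if len(parts) != 3:
        raise ValueError(
            f"Invalid endpoint format: '{endpoint}'. "
            "Expected 'dyn://namespace.component.endpoint'."
        )
    parsed_namespace, parsed_component, parsed_endpoint = parts
    return parsed_namespace, parsed_component, parsed_endpoint
```

Under these assumptions, a decode worker and an aggregated worker both resolve to the `backend` component, and only a prefill worker resolves to `prefill`, unless `--endpoint` overrides the default.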