Commit 9cbf8031 authored by ishandhanani, committed by GitHub

feat: add dynamo components for sglang (#1721)

parent 008bb1e6
......@@ -137,6 +137,7 @@ RUN if [ "$ARCH" = "arm64" ]; then \
# This commit references a NIXL fix that was released after the 0.4.8.post1 release https://github.com/sgl-project/sglang/pull/7330
ARG SGLANG_COMMIT="bb9b608c86ebad7d9d01e29fe058bc184dc7285f"
RUN --mount=type=cache,target=/root/.cache/uv \
cd /opt && \
git clone https://github.com/sgl-project/sglang.git && \
cd sglang && \
git checkout ${SGLANG_COMMIT} && \
......
......@@ -91,7 +91,8 @@ WORKDIR /sgl-workspace
# support batch completions for SGL benchmarking
# https://github.com/ai-dynamo/dynamo/pull/1626
ARG DYNAMO_COMMIT="fc16a79bfc5a4c4f58503d3c36f2013340244cac"
RUN git clone https://github.com/ai-dynamo/dynamo.git && cd dynamo && git checkout ${DYNAMO_COMMIT}
ARG DYNAMO_BRANCH="ishan/sgl-v2"
RUN git clone https://github.com/ai-dynamo/dynamo.git && cd dynamo && git checkout ${DYNAMO_BRANCH}
# install dynamo in editable mode
WORKDIR /sgl-workspace/dynamo
......@@ -138,7 +139,7 @@ RUN cp target/release/dynamo-run deploy/sdk/src/dynamo/sdk/cli/bin
RUN cd lib/bindings/python && pip install --break-system-packages -e . && cd ../../..
RUN pip install --break-system-packages -e .
ENV PYTHONPATH=/sgl-workspace/dynamo/components/planner/src
ENV PYTHONPATH=/sgl-workspace/dynamo/components/planner/src:/sgl-workspace/dynamo/examples/sglang:$PYTHONPATH
RUN wget --tries=3 --waitretry=5 https://github.com/nats-io/nats-server/releases/download/v2.10.24/nats-server-v2.10.24-${ARCH}.deb && \
dpkg -i nats-server-v2.10.24-${ARCH}.deb && rm nats-server-v2.10.24-${ARCH}.deb
......@@ -165,8 +166,7 @@ ENV PATH=/sgl-workspace/perf_analyzer/build/perf_analyzer/src/perf-analyzer-buil
RUN pip install --break-system-packages genai-perf
COPY examples/sglang/configs/deepseek-r1-wideep/* /sgl-workspace/dynamo/examples/sglang/configs/
COPY examples/sglang/utils/deepseek-r1-wideep/* /sgl-workspace/dynamo/examples/sglang/utils/
COPY examples/sglang/configs/deepseek_r1/wideep/* /sgl-workspace/dynamo/examples/sglang/configs/
COPY examples/sglang/utils/benchmarking/* /sgl-workspace/dynamo/examples/sglang/utils/
WORKDIR /sgl-workspace/dynamo/examples/sglang
......@@ -62,30 +62,58 @@ docker compose -f deploy/metrics/docker-compose.yml up -d
./container/run.sh -it --framework sglang
```
## Run Deployment
This figure shows an overview of the major components to deploy:
```
+------+ +-----------+ +------------------+ +---------------+
| HTTP |----->| processor |----->| Worker |------------>| Prefill |
| |<-----| |<-----| |<------------| Worker |
+------+ +-----------+ +------------------+ +---------------+
| ^ |
query best | | return | publish kv events
worker | | worker_id v
| | +------------------+
| +---------| kv-router |
+------------->| |
+------------------+
```
Note: The above architecture illustrates all the components. The final components
that get spawned depend upon the chosen graph.
### Example architectures
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script simply runs `dynamo-run` to start the ingress and uses `python3` to start the workers. You can also take each command and run it in a separate terminal.
#### Aggregated
```bash
cd /workspace/examples/sglang
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
./launch/agg.sh
```
#### Aggregated with router
#### Aggregated serving with KV Routing
> [!NOTE]
> The current implementation of `examples/sglang/components/worker.py` publishes _placeholder_ engine metrics to keep the Dynamo KV-router happy. Real-time metrics will be surfaced directly from the SGLang engine once the following pull requests are merged:
> • Upstream: [sgl-project/sglang #6721](https://github.com/sgl-project/sglang/pull/6721) – _Expose runtime KV-cache & request metrics_.
> • Dynamo: [ai-dynamo/dynamo #1465](https://github.com/ai-dynamo/dynamo/pull/1465) – _feat: receive kvmetrics from sglang scheduler_.
>
> After these are in, the TODOs in `worker.py` will be resolved and the placeholder logic removed.
```bash
cd /workspace/examples/sglang
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --Frontend.router=kv
export PYTHONPATH=$PYTHONPATH:/workspace/examples/sglang/utils
./launch/agg_router.sh
```
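To make the note about placeholder metrics concrete, here is a minimal sketch of the kind of values involved. The field names and the startup values (`0`, `1024`, `0.0`) mirror the keyword arguments passed to the metrics publisher in `worker.py`, and the `0.0` to `0.5` range for the prefix cache hit rate also appears there. The `PlaceholderMetrics` dataclass, the `placeholder_update` helper, and every other random range are assumptions made for illustration, not part of Dynamo or SGLang.

```python
import random
from dataclasses import dataclass


@dataclass
class PlaceholderMetrics:
    """Hypothetical container; field names follow the publish() call in worker.py."""

    request_active_slots: int
    request_total_slots: int
    kv_active_blocks: int
    kv_total_blocks: int
    num_requests_waiting: int
    gpu_cache_usage_perc: float
    gpu_prefix_cache_hit_rate: float


def initial_metrics() -> PlaceholderMetrics:
    # Startup values as published by the worker before any request arrives.
    return PlaceholderMetrics(0, 1024, 0, 1024, 0, 0.0, 0.0)


def placeholder_update(rng: random.Random) -> PlaceholderMetrics:
    # Assumed ranges: only the 0.0-0.5 hit-rate range is taken from the source.
    total_slots = 1024
    total_blocks = 1024
    active_blocks = rng.randint(0, total_blocks)
    return PlaceholderMetrics(
        request_active_slots=rng.randint(0, total_slots),
        request_total_slots=total_slots,
        kv_active_blocks=active_blocks,
        kv_total_blocks=total_blocks,
        num_requests_waiting=rng.randint(0, 100),
        gpu_cache_usage_perc=active_blocks / total_blocks,
        gpu_prefix_cache_hit_rate=rng.uniform(0.0, 0.5),
    )


if __name__ == "__main__":
    print(initial_metrics())
    print(placeholder_update(random.Random(0)))
```

Because these numbers are not derived from the engine, routing decisions based on them should not be read as reflecting real cache state.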
#### Disaggregated
#### Disaggregated serving
<details>
<summary>SGLang Load Balancer vs Dynamo Discovery</summary>
......@@ -106,7 +134,7 @@ Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead
```bash
cd /workspace/examples/sglang
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
./launch/disagg.sh
```
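In the disaggregated setup, the decode handler in this commit receives a JSON string and reads the keys `request`, `sampling_params`, `bootstrap_host`, `bootstrap_port`, and `bootstrap_room` from it. The sketch below shows one way such a payload could be assembled and read back. It uses the standard-library `json` module as a stand-in for `msgspec`, and `build_decode_payload` and `select_input_ids` are hypothetical helper names, not functions from the repository.

```python
import json
import random


def generate_bootstrap_room() -> int:
    # Same range as the worker's _generate_bootstrap_room helper.
    return random.randint(0, 2**63 - 1)


def build_decode_payload(
    request: dict,
    sampling_params: dict,
    bootstrap_host: str,
    bootstrap_port: int,
) -> str:
    # Hypothetical helper: packs the fields the decode handler reads.
    payload = {
        "request": request,
        "sampling_params": sampling_params,
        "bootstrap_host": bootstrap_host,
        "bootstrap_port": bootstrap_port,
        "bootstrap_room": generate_bootstrap_room(),
    }
    return json.dumps(payload)


def select_input_ids(decoded: dict):
    # Mirrors the handler's choice between single and batch token ids.
    req = decoded["request"]
    if req["batch_token_ids"] is None:
        return req["token_ids"]
    return req["batch_token_ids"]


if __name__ == "__main__":
    raw = build_decode_payload(
        {"token_ids": [1, 2, 3], "batch_token_ids": None},
        {"max_new_tokens": 16},
        "127.0.0.1",  # example host, an assumption
        30001,
    )
    print(select_input_ids(json.loads(raw)))
```

The bootstrap room acts as a shared identifier: the prefill side and the decode side pass the same value to `async_generate` so the two halves of a request can be paired up.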
##### Disaggregated with MoE models and DP attention
......@@ -116,7 +144,7 @@ SGLang also supports DP attention for MoE models. We provide an example config f
```bash
# note this will require 4 GPUs
cd /workspace/examples/sglang
dynamo serve graphs.disagg:Frontend -f ./configs/disagg-dp-attention.yaml
./launch/disagg_dp_attn.sh
```
In order to scale to the full DeepSeek-R1 model, you can follow the instructions in the [multinode-examples.md](./multinode-examples.md) file.
......
......@@ -15,45 +15,65 @@
from __future__ import annotations
import asyncio
import logging
import sys
import msgspec
import sglang as sgl
from utils.protocol import DisaggPreprocessedRequest
from utils.sgl_utils import parse_sglang_args
from dynamo.sdk import endpoint, service
logger = logging.getLogger(__name__)
@service(
dynamo={
"enabled": True,
"namespace": "dynamo",
},
resources={"gpu": 1},
workers=1,
)
class SGLangDecodeWorker:
def __init__(self):
class_name = self.__class__.__name__
self.engine_args = parse_sglang_args(class_name, "")
self.engine = sgl.Engine(server_args=self.engine_args)
logger.warning("Decode worker initialized")
@endpoint()
async def generate(self, req: DisaggPreprocessedRequest):
g = await self.engine.async_generate(
input_ids=req.request.token_ids
if req.request.batch_token_ids is None
else req.request.batch_token_ids,
sampling_params=req.sampling_params,
import uvloop
from sglang.srt.server_args import ServerArgs
from utils.sgl_utils import parse_sglang_args_inc
from dynamo.runtime import DistributedRuntime, dynamo_worker
from dynamo.runtime.logging import configure_dynamo_logging
configure_dynamo_logging()
class DecodeRequestHandler:
def __init__(self, engine: sgl.Engine):
self.engine = engine
logging.info("Decode request handler initialized")
async def generate(self, request: str):
req = msgspec.json.decode(request, type=dict)
results = await self.engine.async_generate(
input_ids=req["request"]["token_ids"]
if req["request"]["batch_token_ids"] is None
else req["request"]["batch_token_ids"],
sampling_params=req["sampling_params"],
stream=True,
bootstrap_host=req.bootstrap_host,
bootstrap_port=req.bootstrap_port,
bootstrap_room=req.bootstrap_room,
bootstrap_host=req["bootstrap_host"],
bootstrap_port=req["bootstrap_port"],
bootstrap_room=req["bootstrap_room"],
)
async for result in g:
async for result in results:
yield result
@dynamo_worker(static=False)
async def worker(runtime: DistributedRuntime):
server_args = parse_sglang_args_inc(sys.argv[1:])
await init(runtime, server_args)
async def init(runtime: DistributedRuntime, server_args: ServerArgs):
"""Initialize decode worker"""
engine = sgl.Engine(server_args=server_args)
handler = DecodeRequestHandler(engine)
component = runtime.namespace("dynamo").component("decode")
await component.create_service()
endpoint = component.endpoint("generate")
await endpoint.serve_endpoint(handler.generate)
if __name__ == "__main__":
uvloop.install()
asyncio.run(worker())
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import subprocess
from pathlib import Path
from components.worker import SGLangWorker
from fastapi import FastAPI
from pydantic import BaseModel
import dynamo.sdk as sdk
from dynamo.sdk import depends, service
from dynamo.sdk.lib.config import ServiceConfig
from dynamo.sdk.lib.image import DYNAMO_IMAGE
logger = logging.getLogger(__name__)
def get_dynamo_run_binary():
"""Find the dynamo-run binary path in SDK or fallback to 'dynamo-run' command."""
sdk_path = Path(sdk.__file__)
binary_path = sdk_path.parent / "cli/bin/dynamo-run"
if not binary_path.exists():
return "dynamo-run"
else:
return str(binary_path)
class FrontendConfig(BaseModel):
"""Configuration for the Frontend service including model and HTTP server settings."""
served_model_name: str
endpoint: str
port: int = 8080
router: str = "round-robin"
@service(
dynamo={
"namespace": "dynamo",
},
workers=1,
image=DYNAMO_IMAGE,
app=FastAPI(title="LLM Example"),
)
class Frontend:
worker = depends(SGLangWorker)
def __init__(self):
"""Initialize Frontend service with HTTP server and model configuration."""
frontend_config = FrontendConfig(**ServiceConfig.get_parsed_config("Frontend"))
self.frontend_config = frontend_config
self.process = None
self.start_ingress_and_processor()
def start_ingress_and_processor(self):
"""Starting dynamo-run based ingress and processor"""
logger.info(
f"Starting HTTP server and processor on port {self.frontend_config.port}"
)
dynamo_run_binary = get_dynamo_run_binary()
endpoint = f"dyn://{self.frontend_config.endpoint}"
self.process = subprocess.Popen(
[
dynamo_run_binary,
"in=http",
f"out={endpoint}",
"--http-port",
str(self.frontend_config.port),
"--router-mode",
str(self.frontend_config.router),
],
stdout=None,
stderr=None,
)
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
SGLang disaggregated serving flow is
Processor -> PrefillWorker -> DecodeWorker
This is different from how we've implemented the vLLM disaggregated flow.
For now, the SGLangWorker is responsible for aggregated serving and prefill, and we will
have a separate DecodeWorker.
"""
import asyncio
import logging
import random
import socket
from typing import Dict, Union
import sys
from typing import Any, Dict, Optional, Union
import sglang as sgl
from components.decode_worker import SGLangDecodeWorker
import uvloop
from sglang.srt.server_args import ServerArgs
from sglang.srt.utils import get_ip
from utils.protocol import DisaggPreprocessedRequest, PreprocessedRequest
from utils.sgl_utils import parse_sglang_args
from utils.protocol import DisaggPreprocessedRequest
from utils.sgl_utils import parse_sglang_args_inc
from dynamo.llm import (
ModelType,
......@@ -43,35 +22,63 @@ from dynamo.llm import (
ZmqKvEventPublisherConfig,
register_llm,
)
from dynamo.sdk import async_on_start, depends, dynamo_context, endpoint, service
logger = logging.getLogger(__name__)
from dynamo.runtime import DistributedRuntime, dynamo_worker
from dynamo.runtime.logging import configure_dynamo_logging
configure_dynamo_logging()
class RequestHandler:
def __init__(
self,
engine: sgl.Engine,
server_args: ServerArgs,
component,
decode_client: Optional[Any] = None,
):
self.engine = engine
self.server_args = server_args
self.component = component
self.metrics_publisher = WorkerMetricsPublisher()
if server_args.disaggregation_mode != "null":
self.bootstrap_host, self.bootstrap_port = self._get_bootstrap_info()
if decode_client is None:
raise ValueError(
"decode_client must be provided when disaggregation_mode is not 'null'"
)
self.decode_client = decode_client
logging.info(
f"Disaggregation enabled - bootstrap host: {self.bootstrap_host}, bootstrap port: {self.bootstrap_port}"
)
@service(
dynamo={
"namespace": "dynamo",
},
resources={"gpu": 1},
workers=1,
)
class SGLangWorker:
decode_worker = depends(SGLangDecodeWorker)
logging.info("Request handler initialized")
def __init__(self):
class_name = self.__class__.__name__
self.engine_args = parse_sglang_args(class_name, "")
self.engine = sgl.Engine(server_args=self.engine_args)
def setup_metrics(self):
"""Set up metrics publisher - call this after handler creation"""
self.metrics_publisher.publish(
request_active_slots=0,
request_total_slots=1024,
kv_active_blocks=0,
kv_total_blocks=1024,
num_requests_waiting=0,
gpu_cache_usage_perc=0.0,
gpu_prefix_cache_hit_rate=0.0,
)
task = asyncio.create_task(self.create_metrics_publisher_endpoint())
task.add_done_callback(
lambda _: logging.debug("metrics publisher endpoint created")
)
# Initialize metrics publisher
self.metrics_publisher = WorkerMetricsPublisher()
async def create_metrics_publisher_endpoint(self):
logging.debug("Creating metrics publisher endpoint")
await self.metrics_publisher.create_endpoint(self.component)
def _update_metrics(self):
"""Update metrics with current engine state"""
# TODO: remove this once the following upstream changes are merged:
# • ai-dynamo/dynamo#1465 – "feat: receive kvmetrics from sglang scheduler"
# • sgl-project/sglang#6721 – "Expose runtime KV-cache & request metrics"
logger.warning(
logging.warning(
"Publishing placeholder metrics in SGLangWorker; these are NOT real engine metrics yet and will be replaced once upstream support lands."
)
self.metrics_publisher.publish(
......@@ -84,72 +91,11 @@ class SGLangWorker:
gpu_prefix_cache_hit_rate=random.uniform(0.0, 0.5),
)
async def create_metrics_publisher_endpoint(self):
component = dynamo_context["component"]
await self.metrics_publisher.create_endpoint(component)
@async_on_start
async def async_init(self):
runtime = dynamo_context["runtime"]
comp_ns, comp_name = SGLangWorker.dynamo_address() # type: ignore
endpoint = runtime.namespace(comp_ns).component(comp_name).endpoint("generate")
component = runtime.namespace(comp_ns).component(comp_name)
logger.info(
f"Registering LLM for discovery with kv block size {self.engine_args.page_size}, endpoint={endpoint}, model_path={self.engine_args.model_path}, served_model_name={self.engine_args.served_model_name}"
)
await register_llm(
ModelType.Backend,
endpoint,
self.engine_args.model_path,
self.engine_args.served_model_name,
kv_cache_block_size=self.engine_args.page_size,
)
self.metrics_publisher.publish(
request_active_slots=0,
request_total_slots=1024,
kv_active_blocks=0,
kv_total_blocks=1024,
num_requests_waiting=0,
gpu_cache_usage_perc=0.0,
gpu_prefix_cache_hit_rate=0.0,
)
# Create metrics publisher endpoint for KV router discovery
asyncio.create_task(self.create_metrics_publisher_endpoint())
if self.engine_args.disaggregation_mode:
self.bootstrap_host, self.bootstrap_port = self._get_bootstrap_info()
comp_ns, comp_name = SGLangDecodeWorker.dynamo_address() # type: ignore
self.decode_client = (
await runtime.namespace(comp_ns)
.component(comp_name)
.endpoint("generate")
.client()
)
# Configure ZMQ KV Event Publisher to relay KV events from SGLang to NATS
zmq_config = ZmqKvEventPublisherConfig(
worker_id=endpoint.lease_id(),
kv_block_size=self.engine_args.page_size, # Keep in sync with register_llm above
)
# Keep a reference on the instance to avoid the publisher being garbage-collected.
self._kv_event_publisher = ZmqKvEventPublisher(
component=component,
config=zmq_config,
)
def _get_bootstrap_info(self):
"""
Bootstrap info is stored in the worker's tokenizer manager. We use it to
add servers to the bootstrap_room
"""
"""Bootstrap info from tokenizer manager"""
inner_tm = self.engine.tokenizer_manager
bootstrap_port = inner_tm.server_args.disaggregation_bootstrap_port
# multinode check
if inner_tm.server_args.dist_init_addr:
bootstrap_host = socket.gethostbyname(
inner_tm.server_args.dist_init_addr.split(":")[0]
......@@ -159,39 +105,40 @@ class SGLangWorker:
return bootstrap_host, bootstrap_port
def _build_sampling_params(self, request: PreprocessedRequest) -> dict:
def _build_sampling_params(self, request: dict) -> dict:
sampling_params = {}
if request.sampling_options.temperature:
sampling_params["temperature"] = request.sampling_options.temperature
if request.sampling_options.top_p:
sampling_params["top_p"] = request.sampling_options.top_p
if request.sampling_options.top_k:
sampling_params["top_k"] = request.sampling_options.top_k
sampling_params["max_new_tokens"] = request.stop_conditions.max_tokens
if request.stop_conditions.ignore_eos:
sampling_params["ignore_eos"] = request.stop_conditions.ignore_eos
if request["sampling_options"]["temperature"]:
sampling_params["temperature"] = request["sampling_options"]["temperature"]
if request["sampling_options"]["top_p"]:
sampling_params["top_p"] = request["sampling_options"]["top_p"]
if request["sampling_options"]["top_k"]:
sampling_params["top_k"] = request["sampling_options"]["top_k"]
sampling_params["max_new_tokens"] = request["stop_conditions"]["max_tokens"]
if request["stop_conditions"]["ignore_eos"]:
sampling_params["ignore_eos"] = request["stop_conditions"]["ignore_eos"]
return sampling_params
def _get_request_batch_size(self, request: PreprocessedRequest):
def _get_request_batch_size(self, request: dict):
"""Get batch size from request, returns None for single requests"""
if request.batch_token_ids is not None:
return len(request.batch_token_ids)
if request["batch_token_ids"] is not None:
return len(request["batch_token_ids"])
return None
def _is_batch_request(self, request: PreprocessedRequest):
def _is_batch_request(self, request: dict):
"""Check if request is in batch mode"""
return request.batch_token_ids is not None
return request["batch_token_ids"] is not None
def _generate_bootstrap_room(self):
return random.randint(0, 2**63 - 1)
@endpoint()
async def generate(self, request: PreprocessedRequest):
# Check if we're in batch mode at the start
async def generate(self, request: dict):
is_batch = self._is_batch_request(request)
batch_size = self._get_request_batch_size(request)
# TODO: maintain a mapping from SGLang's Output struct to LLMEngineOutput
sampling_params = self._build_sampling_params(request)
if self.engine_args.disaggregation_mode != "null":
if self.server_args.disaggregation_mode != "null":
if is_batch:
bootstrap_room = [
self._generate_bootstrap_room() for _ in range(batch_size)
......@@ -214,9 +161,9 @@ class SGLangWorker:
# prefill response is not used
prefill = await self.engine.async_generate(
input_ids=request.token_ids
input_ids=request["token_ids"]
if not is_batch
else request.batch_token_ids,
else request["batch_token_ids"],
sampling_params=sampling_params,
stream=True,
bootstrap_host=bootstrap_host,
......@@ -235,9 +182,9 @@ class SGLangWorker:
await prefill_task
else:
g = await self.engine.async_generate(
input_ids=request.token_ids
input_ids=request["token_ids"]
if not is_batch
else request.batch_token_ids,
else request["batch_token_ids"],
sampling_params=sampling_params,
stream=True,
)
......@@ -290,9 +237,58 @@ class SGLangWorker:
yield out
def _generate_bootstrap_room(self):
return random.randint(0, 2**63 - 1)
async def _prefill_generator(self, prefill):
async for _ in prefill:
pass
@dynamo_worker(static=False)
async def worker(runtime: DistributedRuntime):
server_args = parse_sglang_args_inc(sys.argv[1:])
await init(runtime, server_args)
async def init(runtime: DistributedRuntime, server_args: ServerArgs):
"""Initialize worker (either prefill or aggregated)"""
engine = sgl.Engine(server_args=server_args)
component = runtime.namespace("dynamo").component("worker")
await component.create_service()
endpoint = component.endpoint("generate")
await register_llm(
ModelType.Backend,
endpoint,
server_args.model_path,
server_args.served_model_name,
kv_cache_block_size=server_args.page_size,
)
if server_args.disaggregation_mode != "null":
decode_client = (
await runtime.namespace("dynamo")
.component("decode")
.endpoint("generate")
.client()
)
handler = RequestHandler(engine, server_args, component, decode_client)
else:
handler = RequestHandler(engine, server_args, component)
# Set up metrics in background
handler.setup_metrics()
# Set up ZMQ kv event publisher
zmq_config = ZmqKvEventPublisherConfig(
worker_id=endpoint.lease_id(),
kv_block_size=server_args.page_size,
)
_ = ZmqKvEventPublisher(component=component, config=zmq_config)
await endpoint.serve_endpoint(handler.generate)
if __name__ == "__main__":
uvloop.install()
asyncio.run(worker())
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Frontend:
served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
endpoint: dynamo.SGLangWorker.generate
port: 8000
SGLangWorker:
model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
served-model-name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
page-size: 16
tp: 1
trust-remote-code: true
skip-tokenizer-init: true
ServiceArgs:
workers: 1
resources:
gpu: 1
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Frontend:
served_model_name: deepseek-ai/DeepSeek-R1
endpoint: dynamo.SGLangWorker.generate
port: 8000
SGLangWorker:
model-path: /model/
served-model-name: deepseek-ai/DeepSeek-R1
skip-tokenizer-init: true
disaggregation-mode: prefill
disaggregation-transfer-backend: nixl
disaggregation-bootstrap-port: 30001
dist-init-addr: HEAD_PREFILL_NODE_IP:29500
nnodes: 4
node-rank: 0
tp-size: 32
dp-size: 32
enable-dp-attention: true
decode-log-interval: 1
# when MoE is enabled ep-size == tp-size
enable-deepep-moe: true
page-size: 1
trust-remote-code: true
moe-dense-tp-size: 1
enable-dp-lm-head: true
disable-radix-cache: true
watchdog-timeout: 1000000
enable-two-batch-overlap: true
deepep-mode: normal
mem-fraction-static: 0.85
# ------------------------------------------------------------------------------------------------
# If you are trying to repro SGLang's blog post benchmarking - you will need to add these flags
# The `init-expert-location` configs can be found in the SGL blog post repro instructions
#max-running-requests: 8192
#max-total-tokens: 131072
#context-length: 8192
#init-expert-location: /configs/prefill_in4096.json
#chunked-prefill-size: 524288
# ------------------------------------------------------------------------------------------------
deepep-config: /configs/deepep.json
ep-num-redundant-experts: 32
ep-dispatch-algorithm: dynamic
eplb-algorithm: deepseek
ServiceArgs:
workers: 1
resources:
gpu: 8
envs:
MC_TE_METRIC: true
SGLANG_TBO_DEBUG: 1
SGLangDecodeWorker:
model-path: /model/
served-model-name: deepseek-ai/DeepSeek-R1
skip-tokenizer-init: true
disaggregation-mode: decode
disaggregation-transfer-backend: nixl
disaggregation-bootstrap-port: 30001
dist-init-addr: HEAD_DECODE_NODE_IP:29500
nnodes: 9
node-rank: 0
tp-size: 72
dp-size: 72
enable-dp-attention: true
decode-log-interval: 1
enable-deepep-moe: true
page-size: 1
trust-remote-code: true
# when MoE is enabled ep-size == tp-size
moe-dense-tp-size: 1
enable-dp-lm-head: true
disable-radix-cache: true
watchdog-timeout: 1000000
enable-two-batch-overlap: true
deepep-mode: low_latency
mem-fraction-static: 0.835
# ------------------------------------------------------------------------------------------------
# If you are trying to repro SGLang's blog post benchmarking - you will need to add these flags
# The `init-expert-location` configs can be found in the SGL blog post repro instructions
#max-running-requests: 18432
#context-length: 4500
#init-expert-location: /configs/decode_in2000out100.json
# ------------------------------------------------------------------------------------------------
ep-num-redundant-experts: 32
cuda-graph-bs: 256
ServiceArgs:
workers: 1
resources:
gpu: 8
envs:
MC_TE_METRIC: true
SGLANG_TBO_DEBUG: 1
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Frontend:
served_model_name: silence09/DeepSeek-R1-Small-2layers
endpoint: dynamo.SGLangWorker.generate
port: 8000
SGLangWorker:
model-path: silence09/DeepSeek-R1-Small-2layers
served-model-name: silence09/DeepSeek-R1-Small-2layers
tp: 2
dp-size: 2
enable-dp-attention: true
trust-remote-code: true
skip-tokenizer-init: true
disaggregation-mode: prefill
disaggregation-transfer-backend: nixl
port: 30000
ServiceArgs:
workers: 1
resources:
gpu: 2
SGLangDecodeWorker:
model-path: silence09/DeepSeek-R1-Small-2layers
served-model-name: silence09/DeepSeek-R1-Small-2layers
tp: 2
dp-size: 2
enable-dp-attention: true
trust-remote-code: true
skip-tokenizer-init: true
disaggregation-mode: decode
disaggregation-transfer-backend: nixl
# SGLang requires a port delta between prefill and decode workers when using enable-dp-attention
port: 31000
ServiceArgs:
workers: 1
resources:
gpu: 2
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Frontend:
served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
endpoint: dynamo.SGLangWorker.generate
port: 8000
# We set disaggregation-bootstrap-port in utils/sglang.py to ensure unique ports for each replica
SGLangWorker:
model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
served-model-name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
tp: 1
trust-remote-code: true
skip-tokenizer-init: true
disaggregation-mode: prefill
disaggregation-transfer-backend: nixl
ServiceArgs:
workers: 1
resources:
gpu: 1
SGLangDecodeWorker:
model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
served-model-name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
tp: 1
trust-remote-code: true
skip-tokenizer-init: true
disaggregation-mode: decode
disaggregation-transfer-backend: nixl
ServiceArgs:
workers: 1
resources:
gpu: 1
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Frontend:
served_model_name: deepseek-ai/DeepSeek-R1
endpoint: dynamo.SGLangWorker.generate
port: 8000
SGLangWorker:
model-path: /model/
served-model-name: deepseek-ai/DeepSeek-R1
tp: 16
dp-size: 16
dist-init-addr: HEAD_PREFILL_NODE_IP:29500
nnodes: 2
node-rank: 0
enable-dp-attention: true
trust-remote-code: true
skip-tokenizer-init: true
disaggregation-mode: prefill
disaggregation-transfer-backend: nixl
mem-fraction-static: 0.82
ServiceArgs:
workers: 1
resources:
gpu: 8
SGLangDecodeWorker:
model-path: /model/
served-model-name: deepseek-ai/DeepSeek-R1
tp: 16
dp-size: 16
dist-init-addr: HEAD_DECODE_NODE_IP:29500
nnodes: 2
node-rank: 0
enable-dp-attention: true
trust-remote-code: true
skip-tokenizer-init: true
disaggregation-mode: decode
disaggregation-transfer-backend: nixl
mem-fraction-static: 0.82
ServiceArgs:
workers: 1
resources:
gpu: 8
\ No newline at end of file
......@@ -61,107 +61,114 @@ docker run \
In each container, you should be in the `/sgl-workspace/dynamo/examples/sglang` directory.
4. On the head prefill node, start `nats-server` and `etcd` using the following commands
4. On the head prefill node, run the provided helper script to generate the commands that start `nats-server` and `etcd`. The script also tells you which environment variables to export on each node to make deployment easier.
```bash
nats-server -js &
etcd --listen-client-urls http://0.0.0.0:2379 \
--advertise-client-urls http://0.0.0.0:2379 \
--listen-peer-urls http://0.0.0.0:2380 \
--initial-cluster default=http://HEAD_PREFILL_NODE_IP:2380 &
./utils/gen_env_vars.sh
```
5. On every other node, go ahead and export the `NATS_SERVER` and `ETCD_ENDPOINTS` environment variables
> [!IMPORTANT]
> You will need the IP address of your head prefill node and head decode node for the configuration files
5. Run the ingress and prefill worker
```bash
# run this on every other node
export NATS_SERVER=nats://HEAD_PREFILL_NODE_IP:4222
export ETCD_ENDPOINTS=http://HEAD_PREFILL_NODE_IP:2379
```
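Both exported variables are derived from the head prefill node's IP address. The following sketch shows that derivation. The ports `4222` and `2379` are taken from the commands above; the `control_plane_env` and `export_lines` helpers and the example IP are assumptions for illustration, and the actual output of `gen_env_vars.sh` is not shown in this commit.

```python
def control_plane_env(head_prefill_ip: str) -> dict:
    # NATS and etcd both run on the head prefill node in this example.
    return {
        "NATS_SERVER": f"nats://{head_prefill_ip}:4222",
        "ETCD_ENDPOINTS": f"http://{head_prefill_ip}:2379",
    }


def export_lines(env: dict) -> list:
    # Render shell export statements that can be pasted on the other nodes.
    return [f"export {name}={value}" for name, value in env.items()]


if __name__ == "__main__":
    # 10.0.0.1 is a placeholder address, not a value from the original guide.
    for line in export_lines(control_plane_env("10.0.0.1")):
        print(line)
```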
6. Configure each configuration file to use the correct `dist-init-addr`, and `node-rank`
Each container contains the configuration file in `configs/dsr1-wideep.yaml`. For our example, we will make the following changes:
On the prefill head node, `vim` into the configs and change the following section of the `SGLangWorker`:
```yaml
SGLangWorker:
...
dist-init-addr: HEAD_PREFILL_NODE_IP
nnodes: 2
node-rank: 0
...
# run ingress
dynamo run in=http out=dyn &
# run prefill worker
python3 components/worker_inc.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--disaggregation-bootstrap-port 30001 \
--dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
--nnodes 4 \
--node-rank 0 \
--tp-size 32 \
--dp-size 32 \
--enable-dp-attention \
--decode-log-interval 1 \
--enable-deepep-moe \
--page-size 1 \
--trust-remote-code \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--disable-radix-cache \
--watchdog-timeout 1000000 \
--enable-two-batch-overlap \
--deepep-mode normal \
--mem-fraction-static 0.85 \
--deepep-config /configs/deepep.json \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm dynamic \
--eplb-algorithm deepseek
```
On the other prefill node (since this example has 2 prefill nodes), change the following section of the `SGLangWorker`:
On the other prefill nodes (this example has 4 prefill nodes in total), run the same command but change `--node-rank` to 1, 2, and 3
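Since only `--node-rank` differs between prefill nodes, the per-node part of the command can be derived from a single value. The sketch below assumes a hypothetical helper named `prefill_dist_args`; the flag names, port, and node count are the ones used in the prefill command above.

```bash
# Hypothetical helper: print the distributed-launch flags for one prefill node.
# $1 = head prefill node IP, $2 = this node's rank (0 on the head node).
prefill_dist_args() {
    local head_ip="$1"
    local rank="$2"
    local nnodes=4   # this example uses 4 prefill nodes
    case "$rank" in
        ''|*[!0-9]*)
            echo "error: rank must be a non-negative integer" >&2
            return 1
            ;;
    esac
    if [ "$rank" -ge "$nnodes" ]; then
        echo "error: rank $rank is out of range (0..$((nnodes - 1)))" >&2
        return 1
    fi
    echo "--dist-init-addr ${head_ip}:29500 --nnodes ${nnodes} --node-rank ${rank}"
}
```

For example, `prefill_dist_args "$HEAD_PREFILL_NODE_IP" 2` prints the three flags for the node with rank 2, which can then replace the hard-coded values in the command.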
```yaml
SGLangWorker:
...
dist-init-addr: HEAD_PREFILL_NODE_IP
nnodes: 2
node-rank: 1
...
```
On the decode head node, `vim` into the configs and change the following section of the `SGLangDecodeWorker`:
7. Run the decode worker on the head decode node
```yaml
SGLangDecodeWorker:
...
dist-init-addr: HEAD_DECODE_NODE_IP
nnodes: 4
node-rank: 0
...
```bash
python3 components/decode_worker_inc.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--disaggregation-bootstrap-port 30001 \
--dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
--nnodes 9 \
--node-rank 0 \
--tp-size 72 \
--dp-size 72 \
--enable-dp-attention \
--decode-log-interval 1 \
--enable-deepep-moe \
--page-size 1 \
--trust-remote-code \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--disable-radix-cache \
--watchdog-timeout 1000000 \
--enable-two-batch-overlap \
--deepep-mode low_latency \
--mem-fraction-static 0.835 \
--ep-num-redundant-experts 32 \
--cuda-graph-bs 256
```
On the other decode nodes (this example has 4 decode nodes), change the following section of the `SGLangDecodeWorker`:
On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8
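In both commands, `--tp-size` equals `--nnodes` multiplied by 8 (4 × 8 = 32 for prefill, 9 × 8 = 72 for decode), which is consistent with 8 GPUs per node. Assuming that relationship holds on your cluster, a small check like the hypothetical `expected_tp_size` below can catch a mismatch before launch.

```bash
# Hypothetical helper: tensor-parallel size implied by the node count,
# assuming every node contributes the same number of GPUs (default 8).
expected_tp_size() {
    local nnodes="$1"
    local gpus_per_node="${2:-8}"
    echo $((nnodes * gpus_per_node))
}

expected_tp_size 4   # prefill group: prints 32
expected_tp_size 9   # decode group: prints 72
```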
```yaml
SGLangDecodeWorker:
...
dist-init-addr: HEAD_DECODE_NODE_IP
nnodes: 4
# depending on which node this will be 1, 2, and 3
node-rank: 1
```
7. Start up the workers using the following commands
8. Run the warmup script to warm up the model
On prefill head node
DeepGEMM kernels can sometimes take a while to warm up. Here we provide a small helper script that should help. You can run this as many times as you want before starting inference/benchmarking. You can exec into the head node and run this script standalone - it does not need a container.
```bash
dynamo serve graphs.agg:Frontend -f configs/dsr1-wideep.yaml
./warmup.sh HEAD_PREFILL_NODE_IP
```
On prefill child node
```bash
dynamo serve graphs.agg:Frontend -f configs/dsr1-wideep.yaml --service-name SGLangWorker
```
## Benchmarking
On all decode nodes
In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands:
prefill:
```bash
dynamo serve graphs.disagg:Frontend -f configs/dsr1-wideep.yaml --service-name SGLangDecodeWorker
```
8. Run the warmup script to warm up the model
...
--max-running-requests 8192 \
--max-total-tokens 131072 \
--context-length 8192 \
--init-expert-location /configs/prefill_in4096.json \
--chunked-prefill-size 524288
DeepGEMM kernels can sometimes take a while to warm up. Here we provide a small helper script that should help. You can run this as many times as you want before starting inference/benchmarking. You can exec into the head node and run this script standalone - it does not need a container.
```
decode:
```bash
./warmup.sh HEAD_PREFILL_NODE_IP
...
--max-running-requests 18432 \
--context-length 4500 \
--init-expert-location /configs/decode_in2000out100.json
```
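To give a sense of scale, the batch sizes and sequence lengths quoted for the two stress tests translate into the token counts below. This is plain shell arithmetic on the numbers from the paragraph; the function name `batch_tokens` is made up for illustration.

```bash
# Hypothetical helper: total tokens for a batch of identical-length requests.
# $1 = number of requests, $2 = tokens per request
batch_tokens() {
    echo $(($1 * $2))
}

echo "prefill test input tokens:  $(batch_tokens 8192 4096)"   # 33554432
echo "prefill test output tokens: $(batch_tokens 8192 5)"      # 40960
echo "decode test input tokens:   $(batch_tokens 40000 2000)"  # 80000000
echo "decode test output tokens:  $(batch_tokens 40000 100)"   # 4000000
```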
## Benchmarking
In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to uncomment the labeled flags in the `configs/dsr1.yaml` file inside of the container.
We currently provide two ways to run an end-to-end benchmark that includes our OpenAI frontend and tokenization. We will continue to add better support for these sorts of large single-batch workloads in the future.
1. **GenAI Perf to benchmark end to end performance with 8k ISL 256 OSL**
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from components.frontend import Frontend
from components.worker import SGLangWorker
Frontend.link(SGLangWorker)
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from components.decode_worker import SGLangDecodeWorker
from components.frontend import Frontend
from components.worker import SGLangWorker
Frontend.link(SGLangWorker).link(SGLangDecodeWorker)
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Setup cleanup trap
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID 2>/dev/null || true
wait $DYNAMO_PID 2>/dev/null || true
echo "Cleanup complete."
}
trap cleanup EXIT INT TERM
# run clear_namespace
python3 utils/clear_namespace.py --namespace dynamo
# run ingress
dynamo run in=http out=dyn &
DYNAMO_PID=$!
# run worker
python3 components/worker.py \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--page-size 16 \
--tp 1 \
--trust-remote-code \
--skip-tokenizer-init
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Setup cleanup trap
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID 2>/dev/null || true
wait $DYNAMO_PID 2>/dev/null || true
echo "Cleanup complete."
}
trap cleanup EXIT INT TERM
# run clear_namespace
python3 utils/clear_namespace.py --namespace dynamo
# run ingress
dynamo run in=http out=dyn --router-mode kv &
DYNAMO_PID=$!
# run worker
python3 components/worker.py \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--page-size 16 \
--tp 1 \
--trust-remote-code \
--skip-tokenizer-init
\ No newline at end of file
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Setup cleanup trap
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
echo "Cleanup complete."
}
trap cleanup EXIT INT TERM
# run clear_namespace
python3 utils/clear_namespace.py --namespace dynamo
# run ingress
dynamo run in=http out=dyn &
DYNAMO_PID=$!
# run prefill worker
python3 components/worker.py \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--page-size 16 \
--tp 1 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl &
PREFILL_PID=$!
# run decode worker
CUDA_VISIBLE_DEVICES=1 python3 components/decode_worker.py \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--page-size 16 \
--tp 1 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl
\ No newline at end of file
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Setup cleanup trap
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
echo "Cleanup complete."
}
trap cleanup EXIT INT TERM
# run clear_namespace
python3 utils/clear_namespace.py --namespace dynamo
# run ingress
dynamo run in=http out=dyn &
DYNAMO_PID=$!
# run prefill worker
python3 components/worker.py \
--model-path silence09/DeepSeek-R1-Small-2layers \
--served-model-name silence09/DeepSeek-R1-Small-2layers \
--tp 2 \
--dp-size 2 \
--enable-dp-attention \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--port 30000 &
PREFILL_PID=$!
# run decode worker
CUDA_VISIBLE_DEVICES=2,3 python3 components/decode_worker.py \
--model-path silence09/DeepSeek-R1-Small-2layers \
--served-model-name silence09/DeepSeek-R1-Small-2layers \
--tp 2 \
--dp-size 2 \
--enable-dp-attention \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--port 31000
\ No newline at end of file
......@@ -2,8 +2,7 @@
## Multi-node sized models
SGLang allows you to deploy multi-node sized models by adding in the `dist-init-addr`, `nnodes`, and `node-rank` arguments. Below we demonstrate an example of deploying DeepSeek R1 for disaggregated serving across 4 nodes. This example requires
4 nodes of 8xH100 GPUs.
SGLang allows you to deploy multi-node sized models by adding in the `dist-init-addr`, `nnodes`, and `node-rank` arguments. Below we demonstrate an example of deploying DeepSeek R1 for disaggregated serving across 4 nodes. This example requires 4 nodes of 8xH100 GPUs.
**Step 1**: Start NATS/ETCD on your head node. Ensure you have the correct firewall rules to allow communication between the nodes as you will need the NATS/ETCD endpoints to be accessible by all other nodes.
```bash
......@@ -14,130 +13,93 @@ docker compose -f lib/runtime/docker-compose.yml up -d
**Step 2**: Ensure that your configuration file has the required arguments. Here's an example configuration that runs prefill and the model in TP16:
Node 1: Run HTTP ingress, processor, and 8 shards of the prefill worker
```yaml
# configs/prefill-1.yaml
Frontend:
served_model_name: deepseek-ai/DeepSeek-R1
endpoint: dynamo.SGLangWorker.generate
port: 8000
SGLangWorker:
model-path: deepseek-ai/DeepSeek-R1
served-model-name: deepseek-ai/DeepSeek-R1
tp: 16
trust-remote-code: true
skip-tokenizer-init: true
dist-init-addr: <node-1-ip>:29500
disaggregation-bootstrap-port: 30001
disaggregation-mode: prefill
disaggregation-transfer-backend: nixl
nnodes: 2
node-rank: 0
mem-fraction-static: 0.82
ServiceArgs:
workers: 1
resources:
gpu: 8
```
Run this with:
```bash
cd examples/sglang
dynamo serve graphs.agg:Frontend -f configs/prefill-1.yaml
```
Node 2: Run the remaining 8 shards of the prefill worker and the decode worker
```yaml
# configs/prefill-2.yaml
SGLangWorker:
model-path: deepseek-ai/DeepSeek-R1
served-model-name: deepseek-ai/DeepSeek-R1
tp: 16
trust-remote-code: true
skip-tokenizer-init: true
mem-fraction-static: 0.82
dist-init-addr: <node-1-ip>:29500
disaggregation-bootstrap-port: 30001
disaggregation-mode: prefill
disaggregation-transfer-backend: nixl
nnodes: 2
node-rank: 1
ServiceArgs:
workers: 1
resources:
gpu: 8
# run ingress
dynamo run in=http out=dyn &
# run prefill worker
python3 components/worker_inc.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
--dp-size 16 \
--dist-init-addr HEAD_PREFILL_NODE_IP:29500 \
--nnodes 2 \
--node-rank 0 \
--enable-dp-attention \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--mem-fraction-static 0.82 \
```
On all other nodes, we need to export the NATS and ETCD endpoints. Run this with:
Node 2: Run the remaining 8 shards of the prefill worker
```bash
# nats and etcd endpoints
export NATS_SERVER="nats://<node-1-ip>"
export ETCD_ENDPOINTS="<node-1-ip>:2379"
cd examples/sglang
dynamo serve graphs.disagg:Frontend -f configs/prefill-2.yaml --service-name SGLangWorker
# worker
python3 components/worker_inc.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
--dp-size 16 \
--dist-init-addr HEAD_PREFILL_NODE_IP:29500 \
--nnodes 2 \
--node-rank 1 \
--enable-dp-attention \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--mem-fraction-static 0.82
```
Node 3: Run the first 8 shards of the decode worker
```yaml
# configs/decode-1.yaml
SGLangDecodeWorker:
model-path: deepseek-ai/DeepSeek-R1
served-model-name: deepseek-ai/DeepSeek-R1
tp: 16
trust-remote-code: true
skip-tokenizer-init: true
mem-fraction-static: 0.80
dist-init-addr: 2:29500
disaggregation-mode: decode
disaggregation-transfer-backend: nixl
disaggregation-bootstrap-port: 30001
nnodes: 2
node-rank: 0
ServiceArgs:
workers: 1
resources:
gpu: 8
```
Run this with:
```bash
# nats and etcd endpoints
export NATS_SERVER="nats://<node-1-ip>"
export ETCD_ENDPOINTS="<node-1-ip>:2379"
cd examples/sglang
dynamo serve graphs.disagg:Frontend -f configs/decode-1.yaml --service-name SGLangDecodeWorker
# worker
python3 components/decode_worker_inc.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
--dp-size 16 \
--dist-init-addr HEAD_DECODE_NODE_IP:29500 \
--nnodes 2 \
--node-rank 0 \
--enable-dp-attention \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--mem-fraction-static 0.82
```
Node 4: Run the remaining 8 shards of the decode worker
```yaml
# configs/decode-2.yaml
SGLangDecodeWorker:
model-path: deepseek-ai/DeepSeek-R1
served-model-name: deepseek-ai/DeepSeek-R1
tp: 16
trust-remote-code: true
skip-tokenizer-init: true
mem-fraction-static: 0.80
dist-init-addr: 2:29500
disaggregation-mode: decode
disaggregation-transfer-backend: nixl
disaggregation-bootstrap-port: 30001
disable-cuda-graph: true
nnodes: 2
node-rank: 1
ServiceArgs:
workers: 1
resources:
gpu: 8
```
Run this with:
```bash
# nats and etcd endpoints
export NATS_SERVER="nats://<node-1-ip>"
export ETCD_ENDPOINTS="<node-1-ip>:2379"
cd examples/sglang
dynamo serve graphs.disagg:Frontend -f configs/decode-2.yaml --service-name SGLangDecodeWorker
# worker
python3 components/decode_worker_inc.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
--dp-size 16 \
--dist-init-addr HEAD_DECODE_NODE_IP:29500 \
--nnodes 2 \
--node-rank 1 \
--enable-dp-attention \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--mem-fraction-static 0.82
```
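The four-node layout above can be summarized as a mapping from node number to worker script and rank. The sketch below assumes nodes are numbered 1 to 4 as in the headings; `node_role` is a hypothetical helper, and only the script names and ranks come from the commands above.

```bash
# Hypothetical helper: print the worker script and rank for a given node number.
node_role() {
    case "$1" in
        1) echo "components/worker_inc.py --node-rank 0" ;;          # prefill, head
        2) echo "components/worker_inc.py --node-rank 1" ;;          # prefill
        3) echo "components/decode_worker_inc.py --node-rank 0" ;;   # decode, head
        4) echo "components/decode_worker_inc.py --node-rank 1" ;;   # decode
        *) echo "error: node must be 1..4" >&2; return 1 ;;
    esac
}
```

Each prefill node points `--dist-init-addr` at the head prefill node and each decode node at the head decode node, so the rank restarts at 0 within each group.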
**Step 3**: Run inference
......