feat: add dynamo components for sglang (#1721)

Unverified commit 9cbf8031, authored Jul 02, 2025 by ishandhanani, committed by GitHub Jul 02, 2025
20 changed files
--- a/container/Dockerfile.sglang
+++ b/container/Dockerfile.sglang
@@ -137,6 +137,7 @@ RUN if [ "$ARCH" = "arm64" ]; then \
 # This commit references a NIXL fix that was released after the 0.4.8.post1 release https://github.com/sgl-project/sglang/pull/7330
 ARG SGLANG_COMMIT="bb9b608c86ebad7d9d01e29fe058bc184dc7285f"
 RUN --mount=type=cache,target=/root/.cache/uv \
+    cd /opt && \
    git clone https://github.com/sgl-project/sglang.git && \
    cd sglang && \
    git checkout ${SGLANG_COMMIT} && \

--- a/container/Dockerfile.sglang-deepep
+++ b/container/Dockerfile.sglang-deepep
@@ -91,7 +91,8 @@ WORKDIR /sgl-workspace
 # support batch completions for SGL benchmarking
 # https://github.com/ai-dynamo/dynamo/pull/1626
 ARG DYNAMO_COMMIT="fc16a79bfc5a4c4f58503d3c36f2013340244cac"
-RUN git clone https://github.com/ai-dynamo/dynamo.git && cd dynamo && git checkout ${DYNAMO_COMMIT}
+ARG DYNAMO_BRANCH="ishan/sgl-v2"
+RUN git clone https://github.com/ai-dynamo/dynamo.git && cd dynamo && git checkout ${DYNAMO_BRANCH}
 # install dynamo in editable mode
 WORKDIR /sgl-workspace/dynamo
@@ -138,7 +139,7 @@ RUN cp target/release/dynamo-run deploy/sdk/src/dynamo/sdk/cli/bin
 RUN cd lib/bindings/python && pip install --break-system-packages -e . && cd ../../..
 RUN pip install --break-system-packages -e .
-ENV PYTHONPATH=/sgl-workspace/dynamo/components/planner/src
+ENV PYTHONPATH=/sgl-workspace/dynamo/components/planner/src:/sgl-workspace/dynamo/examples/sglang:$PYTHONPATH
 RUN wget --tries=3 --waitretry=5 https://github.com/nats-io/nats-server/releases/download/v2.10.24/nats-server-v2.10.24-${ARCH}.deb && \
    dpkg -i nats-server-v2.10.24-${ARCH}.deb && rm nats-server-v2.10.24-${ARCH}.deb
@@ -165,8 +166,7 @@ ENV PATH=/sgl-workspace/perf_analyzer/build/perf_analyzer/src/perf-analyzer-buil
 RUN pip install --break-system-packages genai-perf
-COPY examples/sglang/configs/deepseek-r1-wideep/* /sgl-workspace/dynamo/examples/sglang/configs/
+COPY examples/sglang/configs/deepseek_r1/wideep/* /sgl-workspace/dynamo/examples/sglang/configs/
-COPY examples/sglang/utils/deepseek-r1-wideep/* /sgl-workspace/dynamo/examples/sglang/utils/
+COPY examples/sglang/utils/benchmarking/* /sgl-workspace/dynamo/examples/sglang/utils/
 WORKDIR /sgl-workspace/dynamo/examples/sglang
--- a/examples/sglang/README.md
+++ b/examples/sglang/README.md
@@ -62,30 +62,58 @@ docker compose -f deploy/metrics/docker-compose.yml up -d
 ./container/run.sh -it --framework sglang
 ```
+## Run Deployment
+This figure shows an overview of the major components to deploy:
+```
+------+      +-----------+      +------------------+             +---------------+
+| HTTP |----->| processor |----->|      Worker      |------------>|     Prefill   |
+|      |<-----|           |<-----|                  |<------------|     Worker    |
+------+      +-----------+      +------------------+             +---------------+
+                  |    ^                  |
+       query best |    | return           | publish kv events
+           worker |    | worker_id        v
+                  |    |         +------------------+
+                  |    +---------|     kv-router    |
+                  +------------->|                  |
+                                 +------------------+
+```
+Note: The above architecture illustrates all the components. The final components
+that get spawned depend upon the chosen graph.
 ### Example architectures
+> [!IMPORTANT]
+> Below we provide some simple shell scripts that run the components for each configuration. Each script simply runs `dynamo-run` to start the ingress and uses `python3` to start the workers. You can easily take each command and run it in a separate terminal.
 #### Aggregated
 ```bash
 cd /workspace/examples/sglang
-dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
+./launch/agg.sh
 ```
-#### Aggregated with router
+#### Aggregated serving with KV Routing
 > [!NOTE]
 > The current implementation of `examples/sglang/components/worker.py` publishes _placeholder_ engine metrics to keep the Dynamo KV-router happy. Real-time metrics will be surfaced directly from the SGLang engine once the following pull requests are merged:
-> • Upstream: [sgl-project/sglang #6721](https://github.com/sgl-project/sglang/pull/6721) – _Expose runtime KV-cache & request metrics_.
 > • Dynamo: [ai-dynamo/dynamo #1465](https://github.com/ai-dynamo/dynamo/pull/1465) – _feat: receive kvmetrics from sglang scheduler_.
 >
 > After these are in, the TODOs in `worker.py` will be resolved and the placeholder logic removed.
 ```bash
 cd /workspace/examples/sglang
-dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --Frontend.router=kv
+export PYTHONPATH=$PYTHONPATH:/workspace/examples/sglang/utils
+./launch/agg_router.sh
 ```
-#### Disaggregated
+#### Disaggregated serving
 <details>
 <summary>SGLang Load Balancer vs Dynamo Discovery</summary>
@@ -106,7 +134,7 @@ Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead
 ```bash
 cd /workspace/examples/sglang
-dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
+./launch/disagg.sh
 ```
 ##### Disaggregated with MoE models and DP attention
@@ -116,7 +144,7 @@ SGLang also supports DP attention for MoE models. We provide an example config f
 ```bash
 # note this will require 4 GPUs
 cd /workspace/examples/sglang
-dynamo serve graphs.disagg:Frontend -f ./configs/disagg-dp-attention.yaml
+./launch/disagg_dp_attn.sh
 ```
 In order to scale to the full DeepSeek-R1 model, you can follow the instructions in the [multinode-examples.md](./multinode-examples.md) file.
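
The architecture figure in the README hunk has the processor ask the kv-router for the best worker while workers publish KV events. As a rough illustration of that loop, here is a toy router that tracks which cache blocks each worker holds and picks the worker with the longest cached prefix. Everything in it (`ToyKvRouter`, `on_kv_event`, `best_worker`, the block-hash strings) is hypothetical and is not Dynamo's router implementation or API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set


class ToyKvRouter:
    """Illustrative only: tracks cached block hashes per worker."""

    def __init__(self) -> None:
        # worker_id -> set of block hashes the worker reported as cached
        self.cached: Dict[int, Set[str]] = defaultdict(set)

    def on_kv_event(
        self,
        worker_id: int,
        stored: Iterable[str] = (),
        removed: Iterable[str] = (),
    ) -> None:
        """Apply a (hypothetical) KV event published by a worker."""
        self.cached[worker_id].update(stored)
        self.cached[worker_id].difference_update(removed)

    def best_worker(self, block_hashes: Sequence[str]) -> int:
        """Return the worker whose cache covers the longest request prefix."""
        if not self.cached:
            raise LookupError("no workers have published KV events yet")

        def prefix_hits(worker_id: int) -> int:
            hits = 0
            for block in block_hashes:
                if block not in self.cached[worker_id]:
                    break
                hits += 1
            return hits

        # sorted() makes tie-breaking deterministic (lowest worker_id wins)
        return max(sorted(self.cached), key=prefix_hits)


if __name__ == "__main__":
    router = ToyKvRouter()
    router.on_kv_event(1, stored=["a", "b"])
    router.on_kv_event(2, stored=["a"])
    print(router.best_worker(["a", "b", "c"]))
```

A real router would also weigh load (the worker metrics published in `worker.py`), which this sketch ignores.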

--- a/examples/sglang/components/decode_worker.py
+++ b/examples/sglang/components/decode_worker.py
@@ -15,45 +15,65 @@
 from __future__ import annotations
+import asyncio
 import logging
+import sys
+import msgspec
 import sglang as sgl
-from utils.protocol import DisaggPreprocessedRequest
+import uvloop
-from utils.sgl_utils import parse_sglang_args
+from sglang.srt.server_args import ServerArgs
+from utils.sgl_utils import parse_sglang_args_inc
-from dynamo.sdk import endpoint, service
+from dynamo.runtime import DistributedRuntime, dynamo_worker
-logger = logging.getLogger(__name__)
+from dynamo.runtime.logging import configure_dynamo_logging
+configure_dynamo_logging()
-@service(
-    dynamo={
-        "enabled": True,
+class DecodeRequestHandler:
-        "namespace": "dynamo",
+    def __init__(self, engine: sgl.Engine):
-    },
+        self.engine = engine
-    resources={"gpu": 1},
+        logging.info("Decode request handler initialized")
-    workers=1,
-)
+    async def generate(self, request: str):
-class SGLangDecodeWorker:
+        req = msgspec.json.decode(request, type=dict)
-    def __init__(self):
-        class_name = self.__class__.__name__
+        results = await self.engine.async_generate(
-        self.engine_args = parse_sglang_args(class_name, "")
+            input_ids=req["request"]["token_ids"]
-        self.engine = sgl.Engine(server_args=self.engine_args)
+            if req["request"]["batch_token_ids"] is None
+            else req["request"]["batch_token_ids"],
-        logger.warning("Decode worker initialized")
+            sampling_params=req["sampling_params"],
-    @endpoint()
-    async def generate(self, req: DisaggPreprocessedRequest):
-        g = await self.engine.async_generate(
-            input_ids=req.request.token_ids
-            if req.request.batch_token_ids is None
-            else req.request.batch_token_ids,
-            sampling_params=req.sampling_params,
            stream=True,
-            bootstrap_host=req.bootstrap_host,
+            bootstrap_host=req["bootstrap_host"],
-            bootstrap_port=req.bootstrap_port,
+            bootstrap_port=req["bootstrap_port"],
-            bootstrap_room=req.bootstrap_room,
+            bootstrap_room=req["bootstrap_room"],
        )
-        async for result in g:
+        async for result in results:
            yield result
+@dynamo_worker(static=False)
+async def worker(runtime: DistributedRuntime):
+    server_args = parse_sglang_args_inc(sys.argv[1:])
+    await init(runtime, server_args)
+async def init(runtime: DistributedRuntime, server_args: ServerArgs):
+    """Initialize decode worker"""
+    engine = sgl.Engine(server_args=server_args)
+    handler = DecodeRequestHandler(engine)
+    component = runtime.namespace("dynamo").component("decode")
+    await component.create_service()
+    endpoint = component.endpoint("generate")
+    await endpoint.serve_endpoint(handler.generate)
+if __name__ == "__main__":
+    uvloop.install()
+    asyncio.run(worker())
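
The new `DecodeRequestHandler` takes the request as a JSON string, decodes it to a plain dict, and streams results from the engine. A minimal sketch of that flow, runnable without SGLang or Dynamo: `StubEngine` is a hypothetical stand-in for `sgl.Engine`, and the standard-library `json.loads` replaces the `msgspec.json.decode(request, type=dict)` call used in the diff. The payload field names follow the handler above.

```python
import asyncio
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine: echoes token ids one at a time."""

    async def async_generate(self, input_ids, sampling_params, stream, **bootstrap):
        async def _stream():
            for token in input_ids:
                yield {"token": token, "room": bootstrap.get("bootstrap_room")}

        # Like the call in the diff, awaiting returns an async iterator.
        return _stream()


class DecodeHandlerSketch:
    def __init__(self, engine):
        self.engine = engine

    async def generate(self, request: str):
        req = json.loads(request)  # the diff uses msgspec.json.decode here
        inner = req["request"]
        input_ids = (
            inner["token_ids"]
            if inner["batch_token_ids"] is None
            else inner["batch_token_ids"]
        )
        results = await self.engine.async_generate(
            input_ids=input_ids,
            sampling_params=req["sampling_params"],
            stream=True,
            bootstrap_host=req["bootstrap_host"],
            bootstrap_port=req["bootstrap_port"],
            bootstrap_room=req["bootstrap_room"],
        )
        async for result in results:
            yield result


async def collect(handler, payload):
    """Drain the handler's async generator into a list."""
    return [chunk async for chunk in handler.generate(json.dumps(payload))]


def run_demo():
    payload = {
        "request": {"token_ids": [1, 2, 3], "batch_token_ids": None},
        "sampling_params": {"max_new_tokens": 8},
        "bootstrap_host": "127.0.0.1",
        "bootstrap_port": 30001,
        "bootstrap_room": 7,
    }
    return asyncio.run(collect(DecodeHandlerSketch(StubEngine()), payload))


if __name__ == "__main__":
    print(run_demo())
```

Moving from the typed `DisaggPreprocessedRequest` to a dict means a missing key surfaces as a `KeyError` inside the handler rather than as a validation error at the endpoint boundary.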
--- a/examples/sglang/components/frontend.py
+++ b/examples/sglang/components/frontend.py
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-import logging
-import subprocess
-from pathlib import Path
-from components.worker import SGLangWorker
-from fastapi import FastAPI
-from pydantic import BaseModel
-import dynamo.sdk as sdk
-from dynamo.sdk import depends, service
-from dynamo.sdk.lib.config import ServiceConfig
-from dynamo.sdk.lib.image import DYNAMO_IMAGE
-logger = logging.getLogger(__name__)
-def get_dynamo_run_binary():
-    """Find the dynamo-run binary path in SDK or fallback to 'dynamo-run' command."""
-    sdk_path = Path(sdk.__file__)
-    binary_path = sdk_path.parent / "cli/bin/dynamo-run"
-    if not binary_path.exists():
-        return "dynamo-run"
-    else:
-        return str(binary_path)
-class FrontendConfig(BaseModel):
-    """Configuration for the Frontend service including model and HTTP server settings."""
-    served_model_name: str
-    endpoint: str
-    port: int = 8080
-    router: str = "round-robin"
-@service(
-    dynamo={
-        "namespace": "dynamo",
-    },
-    workers=1,
-    image=DYNAMO_IMAGE,
-    app=FastAPI(title="LLM Example"),
-)
-class Frontend:
-    worker = depends(SGLangWorker)
-    def __init__(self):
-        """Initialize Frontend service with HTTP server and model configuration."""
-        frontend_config = FrontendConfig(**ServiceConfig.get_parsed_config("Frontend"))
-        self.frontend_config = frontend_config
-        self.process = None
-        self.start_ingress_and_processor()
-    def start_ingress_and_processor(self):
-        """Starting dynamo-run based ingress and processor"""
-        logger.info(
-            f"Starting HTTP server and processor on port {self.frontend_config.port}"
-        )
-        dynamo_run_binary = get_dynamo_run_binary()
-        endpoint = f"dyn://{self.frontend_config.endpoint}"
-        self.process = subprocess.Popen(
-            [
-                dynamo_run_binary,
-                "in=http",
-                f"out={endpoint}",
-                "--http-port",
-                str(self.frontend_config.port),
-                "--router-mode",
-                str(self.frontend_config.router),
-            ],
-            stdout=None,
-            stderr=None,
-        )
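
With `frontend.py` deleted, no Python code spawns the ingress any more; per the README hunk, the launch scripts call `dynamo-run` directly. As a rough sketch, the argv that the removed `start_ingress_and_processor` assembled can be rebuilt with a small helper. The helper name `build_ingress_argv` is made up, and the default endpoint `dynamo.worker.generate` is an assumption inferred from the namespace, component, and endpoint names registered in the new `worker.py`; the contents of the `launch/*.sh` scripts are not shown in this diff.

```python
from typing import List


def build_ingress_argv(
    endpoint: str = "dynamo.worker.generate",  # assumed, see note above
    port: int = 8080,             # default from the removed FrontendConfig
    router: str = "round-robin",  # default from the removed FrontendConfig
    binary: str = "dynamo-run",
) -> List[str]:
    """Rebuild the command line the deleted Frontend passed to subprocess.Popen."""
    return [
        binary,
        "in=http",
        f"out=dyn://{endpoint}",
        "--http-port",
        str(port),
        "--router-mode",
        router,
    ]


if __name__ == "__main__":
    print(" ".join(build_ingress_argv(port=8000, router="kv")))
```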
--- a/examples/sglang/components/worker.py
+++ b/examples/sglang/components/worker.py
 # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""
-SGLang disaggregated serving flow is
-Processor -> PrefillWorker -> DecodeWorker
-This is different from how we've implemented the vLLM disaggregated flow.
-For now - the SGLangWorker will be responsible for aggregated and prefill and we will
-have a separate DecodeWorker.
-"""
 import asyncio
 import logging
 import random
 import socket
-from typing import Dict, Union
+import sys
+from typing import Any, Dict, Optional, Union
 import sglang as sgl
-from components.decode_worker import SGLangDecodeWorker
+import uvloop
+from sglang.srt.server_args import ServerArgs
 from sglang.srt.utils import get_ip
-from utils.protocol import DisaggPreprocessedRequest, PreprocessedRequest
+from utils.protocol import DisaggPreprocessedRequest
-from utils.sgl_utils import parse_sglang_args
+from utils.sgl_utils import parse_sglang_args_inc
 from dynamo.llm import (
    ModelType,
@@ -43,69 +22,40 @@ from dynamo.llm import (
    ZmqKvEventPublisherConfig,
    register_llm,
 )
-from dynamo.sdk import async_on_start, depends, dynamo_context, endpoint, service
+from dynamo.runtime import DistributedRuntime, dynamo_worker
+from dynamo.runtime.logging import configure_dynamo_logging
-logger = logging.getLogger(__name__)
+configure_dynamo_logging()
-@service(
+class RequestHandler:
-    dynamo={
+    def __init__(
-        "namespace": "dynamo",
+        self,
-    },
+        engine: sgl.Engine,
-    resources={"gpu": 1},
+        server_args: ServerArgs,
-    workers=1,
+        component,
-)
+        decode_client: Optional[Any] = None,
-class SGLangWorker:
+    ):
-    decode_worker = depends(SGLangDecodeWorker)
+        self.engine = engine
+        self.server_args = server_args
-    def __init__(self):
+        self.component = component
-        class_name = self.__class__.__name__
-        self.engine_args = parse_sglang_args(class_name, "")
-        self.engine = sgl.Engine(server_args=self.engine_args)
-        # Initialize metrics publisher
        self.metrics_publisher = WorkerMetricsPublisher()
-    def _update_metrics(self):
+        if server_args.disaggregation_mode != "null":
-        """Update metrics with current engine state"""
+            self.bootstrap_host, self.bootstrap_port = self._get_bootstrap_info()
-        # TODO: remove this once the following upstream changes are merged:
+            if decode_client is None:
-        #   • ai-dynamo/dynamo#1465 – "feat: receive kvmetrics from sglang scheduler"
+                raise ValueError(
-        #   • sgl-project/sglang#6721 – "Expose runtime KV-cache & request metrics"
+                    "decode_client must be provided when disaggregation_mode is not 'null'"
-        logger.warning(
-            "Publishing placeholder metrics in SGLangWorker; these are NOT real engine metrics yet and will be replaced once upstream support lands."
                )
-        self.metrics_publisher.publish(
+            self.decode_client = decode_client
-            request_active_slots=1,
+            logging.info(
-            request_total_slots=100,
+                f"Disaggregation enabled - bootstrap host: {self.bootstrap_host}, bootstrap port: {self.bootstrap_port}"
-            kv_active_blocks=random.randint(0, 500),
-            kv_total_blocks=1000,
-            num_requests_waiting=0,
-            gpu_cache_usage_perc=random.uniform(0.1, 0.8),
-            gpu_prefix_cache_hit_rate=random.uniform(0.0, 0.5),
            )
-    async def create_metrics_publisher_endpoint(self):
+        logging.info("Request handler initialized")
-        component = dynamo_context["component"]
-        await self.metrics_publisher.create_endpoint(component)
-    @async_on_start
-    async def async_init(self):
-        runtime = dynamo_context["runtime"]
-        comp_ns, comp_name = SGLangWorker.dynamo_address()  # type: ignore
-        endpoint = runtime.namespace(comp_ns).component(comp_name).endpoint("generate")
-        component = runtime.namespace(comp_ns).component(comp_name)
-        logger.info(
-            f"Registering LLM for discovery with kv block size {self.engine_args.page_size}, endpoint={endpoint}, model_path={self.engine_args.model_path}, served_model_name={self.engine_args.served_model_name}"
-        )
-        await register_llm(
-            ModelType.Backend,
-            endpoint,
-            self.engine_args.model_path,
-            self.engine_args.served_model_name,
-            kv_cache_block_size=self.engine_args.page_size,
-        )
+    def setup_metrics(self):
+        """Set up metrics publisher - call this after handler creation"""
        self.metrics_publisher.publish(
            request_active_slots=0,
            request_total_slots=1024,
@@ -115,41 +65,37 @@ class SGLangWorker:
            gpu_cache_usage_perc=0.0,
            gpu_prefix_cache_hit_rate=0.0,
        )
+        task = asyncio.create_task(self.create_metrics_publisher_endpoint())
-        # Create metrics publisher endpoint for KV router discovery
+        task.add_done_callback(
-        asyncio.create_task(self.create_metrics_publisher_endpoint())
+            lambda _: logging.debug("metrics publisher endpoint created")
-        if self.engine_args.disaggregation_mode:
-            self.bootstrap_host, self.bootstrap_port = self._get_bootstrap_info()
-            comp_ns, comp_name = SGLangDecodeWorker.dynamo_address()  # type: ignore
-            self.decode_client = (
-                await runtime.namespace(comp_ns)
-                .component(comp_name)
-                .endpoint("generate")
-                .client()
        )
-        # Configure ZMQ KV Event Publisher to relay KV events from SGLang to NATS
+    async def create_metrics_publisher_endpoint(self):
-        zmq_config = ZmqKvEventPublisherConfig(
+        logging.debug("Creating metrics publisher endpoint")
-            worker_id=endpoint.lease_id(),
+        await self.metrics_publisher.create_endpoint(self.component)
-            kv_block_size=self.engine_args.page_size,  # Keep in sync with register_llm above
-        )
-        # Keep a reference on the instance to avoid the publisher being garbage-collected.
+    def _update_metrics(self):
-        self._kv_event_publisher = ZmqKvEventPublisher(
+        """Update metrics with current engine state"""
-            component=component,
+        # TODO: remove this once the following upstream changes are merged:
-            config=zmq_config,
+        #   • sgl-project/sglang#6721 – "Expose runtime KV-cache & request metrics"
+        logging.warning(
+            "Publishing placeholder metrics in SGLangWorker; these are NOT real engine metrics yet and will be replaced once upstream support lands."
+        )
+        self.metrics_publisher.publish(
+            request_active_slots=1,
+            request_total_slots=100,
+            kv_active_blocks=random.randint(0, 500),
+            kv_total_blocks=1000,
+            num_requests_waiting=0,
+            gpu_cache_usage_perc=random.uniform(0.1, 0.8),
+            gpu_prefix_cache_hit_rate=random.uniform(0.0, 0.5),
        )
    def _get_bootstrap_info(self):
-        """
+        """Bootstrap info from tokenizer manager"""
-        Bootstrap info is stored in the worker's tokenizer manager. We use it to
-        add servers to the bootstrap_room
-        """
        inner_tm = self.engine.tokenizer_manager
        bootstrap_port = inner_tm.server_args.disaggregation_bootstrap_port
-        # multinode check
        if inner_tm.server_args.dist_init_addr:
            bootstrap_host = socket.gethostbyname(
                inner_tm.server_args.dist_init_addr.split(":")[0]
@@ -159,39 +105,40 @@ class SGLangWorker:
        return bootstrap_host, bootstrap_port
-    def _build_sampling_params(self, request: PreprocessedRequest) -> dict:
+    def _build_sampling_params(self, request: dict) -> dict:
        sampling_params = {}
-        if request.sampling_options.temperature:
+        if request["sampling_options"]["temperature"]:
-            sampling_params["temperature"] = request.sampling_options.temperature
+            sampling_params["temperature"] = request["sampling_options"]["temperature"]
-        if request.sampling_options.top_p:
+        if request["sampling_options"]["top_p"]:
-            sampling_params["top_p"] = request.sampling_options.top_p
+            sampling_params["top_p"] = request["sampling_options"]["top_p"]
-        if request.sampling_options.top_k:
+        if request["sampling_options"]["top_k"]:
-            sampling_params["top_k"] = request.sampling_options.top_k
+            sampling_params["top_k"] = request["sampling_options"]["top_k"]
-        sampling_params["max_new_tokens"] = request.stop_conditions.max_tokens
+        sampling_params["max_new_tokens"] = request["stop_conditions"]["max_tokens"]
-        if request.stop_conditions.ignore_eos:
+        if request["stop_conditions"]["ignore_eos"]:
-            sampling_params["ignore_eos"] = request.stop_conditions.ignore_eos
+            sampling_params["ignore_eos"] = request["stop_conditions"]["ignore_eos"]
        return sampling_params
-    def _get_request_batch_size(self, request: PreprocessedRequest):
+    def _get_request_batch_size(self, request: dict):
        """Get batch size from request, returns None for single requests"""
-        if request.batch_token_ids is not None:
+        if request["batch_token_ids"] is not None:
-            return len(request.batch_token_ids)
+            return len(request["batch_token_ids"])
        return None
-    def _is_batch_request(self, request: PreprocessedRequest):
+    def _is_batch_request(self, request: dict):
        """Check if request is in batch mode"""
-        return request.batch_token_ids is not None
+        return request["batch_token_ids"] is not None
-    @endpoint()
+    def _generate_bootstrap_room(self):
-    async def generate(self, request: PreprocessedRequest):
+        return random.randint(0, 2**63 - 1)
-        # Check if we're in batch mode at the start
+    async def generate(self, request: dict):
        is_batch = self._is_batch_request(request)
        batch_size = self._get_request_batch_size(request)
        # TODO: maintain a mapping from SGLang's Output struct to LLMEngineOutput
        sampling_params = self._build_sampling_params(request)
-        if self.engine_args.disaggregation_mode != "null":
+        if self.server_args.disaggregation_mode != "null":
            if is_batch:
                bootstrap_room = [
                    self._generate_bootstrap_room() for _ in range(batch_size)
@@ -214,9 +161,9 @@ class SGLangWorker:
            # prefill response is not used
            prefill = await self.engine.async_generate(
-                input_ids=request.token_ids
+                input_ids=request["token_ids"]
                if not is_batch
-                else request.batch_token_ids,
+                else request["batch_token_ids"],
                sampling_params=sampling_params,
                stream=True,
                bootstrap_host=bootstrap_host,
@@ -235,9 +182,9 @@ class SGLangWorker:
            await prefill_task
        else:
            g = await self.engine.async_generate(
-                input_ids=request.token_ids
+                input_ids=request["token_ids"]
                if not is_batch
-                else request.batch_token_ids,
+                else request["batch_token_ids"],
                sampling_params=sampling_params,
                stream=True,
            )
@@ -290,9 +237,58 @@ class SGLangWorker:
            yield out
-    def _generate_bootstrap_room(self):
-        return random.randint(0, 2**63 - 1)
    async def _prefill_generator(self, prefill):
        async for _ in prefill:
            pass
+@dynamo_worker(static=False)
+async def worker(runtime: DistributedRuntime):
+    server_args = parse_sglang_args_inc(sys.argv[1:])
+    await init(runtime, server_args)
+async def init(runtime: DistributedRuntime, server_args: ServerArgs):
+    """Initialize worker (either prefill or aggregated)"""
+    engine = sgl.Engine(server_args=server_args)
+    component = runtime.namespace("dynamo").component("worker")
+    await component.create_service()
+    endpoint = component.endpoint("generate")
+    await register_llm(
+        ModelType.Backend,
+        endpoint,
+        server_args.model_path,
+        server_args.served_model_name,
+        kv_cache_block_size=server_args.page_size,
+    )
+    if server_args.disaggregation_mode != "null":
+        decode_client = (
+            await runtime.namespace("dynamo")
+            .component("decode")
+            .endpoint("generate")
+            .client()
+        )
+        handler = RequestHandler(engine, server_args, component, decode_client)
+    else:
+        handler = RequestHandler(engine, server_args, component)
+    # Set up metrics in background
+    handler.setup_metrics()
+    # Set up ZMQ kv event publisher
+    zmq_config = ZmqKvEventPublisherConfig(
+        worker_id=endpoint.lease_id(),
+        kv_block_size=server_args.page_size,
+    )
+    _ = ZmqKvEventPublisher(component=component, config=zmq_config)
+    await endpoint.serve_endpoint(handler.generate)
+if __name__ == "__main__":
+    uvloop.install()
+    asyncio.run(worker())
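
One observation on `_build_sampling_params`: each option is gated on truthiness, so a request carrying `temperature: 0.0` or `top_k: 0` is not forwarded and the engine's own default applies instead. Whether that is intended is not clear from the diff. Below is a minimal sketch of a variant that skips only options that are missing or `None`. The function name and the use of `.get()` are illustrative choices, not part of this commit.

```python
from typing import Any, Dict


def build_sampling_params(request: Dict[str, Any]) -> Dict[str, Any]:
    """Map a preprocessed request dict to engine sampling params."""
    opts = request.get("sampling_options") or {}
    stops = request.get("stop_conditions") or {}

    params: Dict[str, Any] = {}
    for key in ("temperature", "top_p", "top_k"):
        # `is not None` keeps legitimate falsy values such as 0.0
        if opts.get(key) is not None:
            params[key] = opts[key]

    # As in the diff, max_new_tokens is always set (possibly to None).
    params["max_new_tokens"] = stops.get("max_tokens")
    if stops.get("ignore_eos") is not None:
        params["ignore_eos"] = stops["ignore_eos"]
    return params


if __name__ == "__main__":
    print(
        build_sampling_params(
            {
                "sampling_options": {"temperature": 0.0, "top_p": None, "top_k": 5},
                "stop_conditions": {"max_tokens": 16, "ignore_eos": None},
            }
        )
    )
```

Unlike the code in the diff, this variant also forwards an explicit `ignore_eos: False`.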
--- a/examples/sglang/configs/agg.yaml
+++ b/examples/sglang/configs/agg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-Frontend:
-  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  endpoint: dynamo.SGLangWorker.generate
-  port: 8000
-SGLangWorker:
-  model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  served-model-name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  page-size: 16
-  tp: 1
-  trust-remote-code: true
-  skip-tokenizer-init: true
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 1
\ No newline at end of file
--- a/examples/sglang/configs/deepseek-r1-wideep/dsr1-wideep.yaml
+++ b/examples/sglang/configs/deepseek-r1-wideep/dsr1-wideep.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-Frontend:
-  served_model_name: deepseek-ai/DeepSeek-R1
-  endpoint: dynamo.SGLangWorker.generate
-  port: 8000
-SGLangWorker:
-  model-path: /model/
-  served-model-name: deepseek-ai/DeepSeek-R1
-  skip-tokenizer-init: true
-  disaggregation-mode: prefill
-  disaggregation-transfer-backend: nixl
-  disaggregation-bootstrap-port: 30001
-  dist-init-addr: HEAD_PREFILL_NODE_IP:29500
-  nnodes: 4
-  node-rank: 0
-  tp-size: 32
-  dp-size: 32
-  enable-dp-attention: true
-  decode-log-interval: 1
-  # when MoE is enabled ep-size == tp-size
-  enable-deepep-moe: true
-  page-size: 1
-  trust-remote-code: true
-  moe-dense-tp-size: 1
-  enable-dp-lm-head: true
-  disable-radix-cache: true
-  watchdog-timeout: 1000000
-  enable-two-batch-overlap: true
-  deepep-mode: normal
-  mem-fraction-static: 0.85
-  # ------------------------------------------------------------------------------------------------
-  # If you are trying to repro SGLang's blog post benchmarking - you will need to add these flags
-  # The `init-expert-location` configs can be found in the SGL blog post repro instructions
-  #max-running-requests: 8192
-  #max-total-tokens: 131072
-  #context-length: 8192
-  #init-expert-location: /configs/prefill_in4096.json
-  #chunked-prefill-size: 524288
-  # ------------------------------------------------------------------------------------------------
-  deepep-config: /configs/deepep.json
-  ep-num-redundant-experts: 32
-  ep-dispatch-algorithm: dynamic
-  eplb-algorithm: deepseek
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 8
-    envs:
-      MC_TE_METRIC: true
-      SGLANG_TBO_DEBUG: 1
-SGLangDecodeWorker:
-  model-path: /model/
-  served-model-name: deepseek-ai/DeepSeek-R1
-  skip-tokenizer-init: true
-  disaggregation-mode: decode
-  disaggregation-transfer-backend: nixl
-  disaggregation-bootstrap-port: 30001
-  dist-init-addr: HEAD_DECODE_NODE_IP:29500
-  nnodes: 9
-  node-rank: 0
-  tp-size: 72
-  dp-size: 72
-  enable-dp-attention: true
-  decode-log-interval: 1
-  enable-deepep-moe: true
-  page-size: 1
-  trust-remote-code: true
-  # when MoE is enabled ep-size == tp-size
-  moe-dense-tp-size: 1
-  enable-dp-lm-head: true
-  disable-radix-cache: true
-  watchdog-timeout: 1000000
-  enable-two-batch-overlap: true
-  deepep-mode: low_latency
-  mem-fraction-static: 0.835
-  # ------------------------------------------------------------------------------------------------
-  # If you are trying to repro SGLang's blog post benchmarking - you will need to add these flags
-  # The `init-expert-location` configs can be found in the SGL blog post repro instructions
-  #max-running-requests: 18432
-  #context-length: 4500
-  #init-expert-location: /configs/decode_in2000out100.json
-  # ------------------------------------------------------------------------------------------------
-  ep-num-redundant-experts: 32
-  cuda-graph-bs: 256
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 8
-    envs:
-      MC_TE_METRIC: true
-      SGLANG_TBO_DEBUG: 1
--- a/examples/sglang/configs/deepseek-r1-wideep/deepep.json
+++ b/examples/sglang/configs/deepseek-r1-wideep/deepep.json
--- a/examples/sglang/configs/disagg-dp-attention.yaml
+++ b/examples/sglang/configs/disagg-dp-attention.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-Frontend:
-  served_model_name: silence09/DeepSeek-R1-Small-2layers
-  endpoint: dynamo.SGLangWorker.generate
-  port: 8000
-SGLangWorker:
-  model-path: silence09/DeepSeek-R1-Small-2layers
-  served-model-name: silence09/DeepSeek-R1-Small-2layers
-  tp: 2
-  dp-size: 2
-  enable-dp-attention: true
-  trust-remote-code: true
-  skip-tokenizer-init: true
-  disaggregation-mode: prefill
-  disaggregation-transfer-backend: nixl
-  port: 30000
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 2
-SGLangDecodeWorker:
-  model-path: silence09/DeepSeek-R1-Small-2layers
-  served-model-name: silence09/DeepSeek-R1-Small-2layers
-  tp: 2
-  dp-size: 2
-  enable-dp-attention: true
-  trust-remote-code: true
-  skip-tokenizer-init: true
-  disaggregation-mode: decode
-  disaggregation-transfer-backend: nixl
-  # SGLang requires a port delta between prefill and decode workers when using enable-dp-attention
-  port: 31000
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 2
\ No newline at end of file
--- a/examples/sglang/configs/disagg.yaml
+++ b/examples/sglang/configs/disagg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-Frontend:
-  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  endpoint: dynamo.SGLangWorker.generate
-  port: 8000
-# We set disaggregation-bootstrap-port in utils/sglang.py to ensure unique ports for each replica
-SGLangWorker:
-  model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  served-model-name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  tp: 1
-  trust-remote-code: true
-  skip-tokenizer-init: true
-  disaggregation-mode: prefill
-  disaggregation-transfer-backend: nixl
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 1
-SGLangDecodeWorker:
-  model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  served-model-name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  tp: 1
-  trust-remote-code: true
-  skip-tokenizer-init: true
-  disaggregation-mode: decode
-  disaggregation-transfer-backend: nixl
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 1
--- a/examples/sglang/configs/dsr1.yaml
+++ b/examples/sglang/configs/dsr1.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-Frontend:
-  served_model_name: deepseek-ai/DeepSeek-R1
-  endpoint: dynamo.SGLangWorker.generate
-  port: 8000
-SGLangWorker:
-  model-path: /model/
-  served-model-name: deepseek-ai/DeepSeek-R1
-  tp: 16
-  dp-size: 16
-  dist-init-addr: HEAD_PREFILL_NODE_IP:29500
-  nnodes: 2
-  node-rank: 0
-  enable-dp-attention: true
-  trust-remote-code: true
-  skip-tokenizer-init: true
-  disaggregation-mode: prefill
-  disaggregation-transfer-backend: nixl
-  mem-fraction-static: 0.82
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 8
-SGLangDecodeWorker:
-  model-path: /model/
-  served-model-name: deepseek-ai/DeepSeek-R1
-  tp: 16
-  dp-size: 16
-  dist-init-addr: HEAD_DECODE_NODE_IP:29500
-  nnodes: 2
-  node-rank: 0
-  enable-dp-attention: true
-  trust-remote-code: true
-  skip-tokenizer-init: true
-  disaggregation-mode: decode
-  disaggregation-transfer-backend: nixl
-  mem-fraction-static: 0.82
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 8
\ No newline at end of file
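The worker sections of the deleted YAML configs correspond closely to the command-line flags used by the launch scripts and docs later in this diff. As a rough illustration of that mapping (not the project's actual loader), the sketch below turns one YAML-style section into a flag list. The helper name `config_to_flags` is hypothetical, and skipping `ServiceArgs` is an assumption based on it having no counterpart among the CLI flags shown in this commit.

```python
def config_to_flags(section, skip=("ServiceArgs",)):
    """Convert a dict of worker settings into a list of CLI arguments.

    Hypothetical helper: boolean True becomes a bare flag, False/None is
    dropped, and anything else becomes "--key value".
    """
    args = []
    for key, value in section.items():
        if key in skip:
            continue  # deployment metadata, assumed to have no CLI equivalent
        flag = f"--{key}"
        if value is True:
            args.append(flag)
        elif value is False or value is None:
            continue
        else:
            args.extend([flag, str(value)])
    return args


if __name__ == "__main__":
    # Settings copied from the removed SGLangDecodeWorker section of disagg.yaml
    decode_worker = {
        "model-path": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "tp": 1,
        "trust-remote-code": True,
        "skip-tokenizer-init": True,
        "disaggregation-mode": "decode",
        "disaggregation-transfer-backend": "nixl",
        "ServiceArgs": {"workers": 1, "resources": {"gpu": 1}},
    }
    print(" ".join(config_to_flags(decode_worker)))
```

The resulting flags line up with the decode worker invocation in `launch/disagg.sh`, apart from options such as `--page-size` that the new scripts add.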
--- a/examples/sglang/dsr1-wideep.md
+++ b/examples/sglang/dsr1-wideep.md
@@ -61,107 +61,114 @@ docker run \
 In each container, you should be in the `/sgl-workspace/dynamo/examples/sglang` directory.
-4. On the head prefill node, start `nats-server` and `etcd` using the following commands
+4. On the head prefill node, run the provided helper script to generate the commands that start `nats-server` and `etcd`. The script also prints the environment variables to export on each node, which simplifies deployment.
 ```bash
-nats-server -js &
+./utils/gen_env_vars.sh
-etcd --listen-client-urls http://0.0.0.0:2379 \
-     --advertise-client-urls http://0.0.0.0:2379 \
-     --listen-peer-urls http://0.0.0.0:2380 \
-     --initial-cluster default=http://HEAD_PREFILL_NODE_IP:2380 &
 ```
-5. On every other node, go ahead and export the `NATS_SERVER` and `ETCD_ENDPOINTS` environment variables
+5. Run the ingress and prefill worker
-> [!IMPORTANT]
-> You will need the IP address of your head prefill node and head decode node for the configuration files
 ```bash
-# run this on every other node
+# run ingress
-export NATS_SERVER=nats://HEAD_PREFILL_NODE_IP:4222
+dynamo run in=http out=dyn &
-export ETCD_ENDPOINTS=http://HEAD_PREFILL_NODE_IP:2379
+# run prefill worker
-```
+python3 components/worker_inc.py \
+  --model-path /model/ \
-6. Configure each configuration file to use the correct `dist-init-addr`, and `node-rank`
+  --served-model-name deepseek-ai/DeepSeek-R1 \
+  --skip-tokenizer-init \
-Each container contains the configuration file in `configs/dsr1-wideep.yaml`. For our example, we will make the following changes:
+  --disaggregation-mode prefill \
+  --disaggregation-transfer-backend nixl \
-On the prefill head node, `vim` into the configs and change the following section of the `SGLangWorker`:
+  --disaggregation-bootstrap-port 30001 \
+  --dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
-```yaml
+  --nnodes 4 \
-SGLangWorker:
+  --node-rank 0 \
-    ...
+  --tp-size 32 \
-    dist-init-addr: HEAD_PREFILL_NODE_IP
+  --dp-size 32 \
-    nnodes: 2
+  --enable-dp-attention \
-    node-rank: 0
+  --decode-log-interval 1 \
-    ...
+  --enable-deepep-moe \
+  --page-size 1 \
+  --trust-remote-code \
+  --moe-dense-tp-size 1 \
+  --enable-dp-lm-head \
+  --disable-radix-cache \
+  --watchdog-timeout 1000000 \
+  --enable-two-batch-overlap \
+  --deepep-mode normal \
+  --mem-fraction-static 0.85 \
+  --deepep-config /configs/deepep.json \
+  --ep-num-redundant-experts 32 \
+  --ep-dispatch-algorithm dynamic \
+  --eplb-algorithm deepseek
 ```
-On the other prefill node (since this example has 2 prefill nodes), change the following section of the `SGLangWorker`:
+On the other prefill nodes (this example has 4 prefill nodes in total), run the same command but change `--node-rank` to 1, 2, and 3
-```yaml
+7. Run the decode worker on the head decode node
-SGLangWorker:
-    ...
-    dist-init-addr: HEAD_PREFILL_NODE_IP
-    nnodes: 2
-    node-rank: 1
-    ...
-```
-On the decode head node, `vim` into the configs and change the following section of the `SGLangDecodeWorker`:
-```yaml
+```bash
-SGLangDecodeWorker:
+python3 components/decode_worker_inc.py \
-    ...
+  --model-path /model/ \
-    dist-init-addr: HEAD_DECODE_NODE_IP
+  --served-model-name deepseek-ai/DeepSeek-R1 \
-    nnodes: 4
+  --skip-tokenizer-init \
-    node-rank: 0
+  --disaggregation-mode decode \
-    ...
+  --disaggregation-transfer-backend nixl \
+  --disaggregation-bootstrap-port 30001 \
+  --dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
+  --nnodes 9 \
+  --node-rank 0 \
+  --tp-size 72 \
+  --dp-size 72 \
+  --enable-dp-attention \
+  --decode-log-interval 1 \
+  --enable-deepep-moe \
+  --page-size 1 \
+  --trust-remote-code \
+  --moe-dense-tp-size 1 \
+  --enable-dp-lm-head \
+  --disable-radix-cache \
+  --watchdog-timeout 1000000 \
+  --enable-two-batch-overlap \
+  --deepep-mode low_latency \
+  --mem-fraction-static 0.835 \
+  --ep-num-redundant-experts 32 \
+  --cuda-graph-bs 256
 ```
-On the other decode nodes (this example has 4 decode nodes), change the following section of the `SGLangDecodeWorker`:
+On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8
-```yaml
+8. Run the warmup script to warm up the model
-SGLangDecodeWorker:
-    ...
-    dist-init-addr: HEAD_DECODE_NODE_IP
-    nnodes: 4
-    # depending on which node this will be 1, 2, and 3
-    node-rank: 1
-```
-7. Start up the workers using the following commands
-On prefill head node
+DeepGEMM kernels can sometimes take a while to warm up. Here we provide a small helper script that should help. You can run this as many times as you want before starting inference/benchmarking. You can exec into the head node and run this script standalone - it does not need a container.
 ```bash
-dynamo serve graphs.agg:Frontend -f configs/dsr1-wideep.yaml
+./warmup.sh HEAD_PREFILL_NODE_IP
 ```
-On prefill child node
+## Benchmarking
-```bash
-dynamo serve graphs.agg:Frontend -f configs/dsr1-wideep.yaml --service-name SGLangWorker
-```
-On all decode nodes
+In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands:
+prefill:
 ```bash
-dynamo serve graphs.disagg:Frontend -f configs/dsr1-wideep.yaml --service-name SGLangDecodeWorker
+...
-```
+--max-running-requests 8192 \
+--max-total-tokens 131072 \
-8. Run the warmup script to warm up the model
+--context-length 8192 \
+--init-expert-location /configs/prefill_in4096.json \
+--chunked-prefill-size 524288
-DeepGEMM kernels can sometimes take a while to warm up. Here we provide a small helper script that should help. You can run this as many times as you want before starting inference/benchmarking. You can exec into the head node and run this script standalone - it does not need a container.
+```
+decode:
 ```bash
-./warmup.sh HEAD_PREFILL_NODE_IP
+...
+--max-running-requests 18432 \
+--context-length 4500 \
+--init-expert-location /configs/decode_in2000out100.json
 ```
-## Benchmarking
-In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to uncomment the labeled flags in the `configs/dsr1.yaml` file inside of the container.
 We currently provide 2 different ways to perform an end to end benchmark which includes using our OpenAI frontend and tokenization. We will continue to add better support for these sorts of large single batch workloads in the future.
 1. **GenAI Perf to benchmark end to end performance with 8k ISL 256 OSL**

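In the wide-EP commands above, only `--node-rank` changes from node to node, and the `--tp-size`/`--dp-size` values equal the node count times the GPUs per node (4 x 8 = 32 for prefill, 9 x 8 = 72 for decode). The sketch below reproduces that arithmetic. The function name `rank_flags` and the default of 8 GPUs per node are assumptions for illustration, and the equality of `tp-size` and `dp-size` simply mirrors these examples rather than a general SGLang rule.

```python
def rank_flags(head_addr, nnodes, gpus_per_node=8):
    """Return one list of distributed-launch flags per node rank.

    Assumes tp-size == dp-size == nnodes * gpus_per_node, as in the
    example commands; other deployments may choose differently.
    """
    world_size = nnodes * gpus_per_node
    plans = []
    for rank in range(nnodes):
        plans.append([
            "--dist-init-addr", head_addr,
            "--nnodes", str(nnodes),
            "--node-rank", str(rank),
            "--tp-size", str(world_size),
            "--dp-size", str(world_size),
        ])
    return plans


if __name__ == "__main__":
    # Print the per-node flags for the 9-node decode deployment
    for flags in rank_flags("HEAD_DECODE_NODE_IP:29500", nnodes=9):
        print(" ".join(flags))
```

The remaining flags (model path, DeepEP options, memory fraction) stay identical across ranks and are omitted here.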
--- a/examples/sglang/graphs/agg.py
+++ b/examples/sglang/graphs/agg.py
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from components.frontend import Frontend
-from components.worker import SGLangWorker
-Frontend.link(SGLangWorker)
--- a/examples/sglang/graphs/disagg.py
+++ b/examples/sglang/graphs/disagg.py
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from components.decode_worker import SGLangDecodeWorker
-from components.frontend import Frontend
-from components.worker import SGLangWorker
-Frontend.link(SGLangWorker).link(SGLangDecodeWorker)
--- a/examples/sglang/launch/agg.sh
+++ b/examples/sglang/launch/agg.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# Setup cleanup trap
+cleanup() {
+    echo "Cleaning up background processes..."
+    kill $DYNAMO_PID 2>/dev/null || true
+    wait $DYNAMO_PID 2>/dev/null || true
+    echo "Cleanup complete."
+}
+trap cleanup EXIT INT TERM
+# run clear_namespace
+python3 utils/clear_namespace.py --namespace dynamo
+# run ingress
+dynamo run in=http out=dyn &
+DYNAMO_PID=$!
+# run worker
+python3 components/worker.py \
+  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
+  --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
+  --page-size 16 \
+  --tp 1 \
+  --trust-remote-code \
+  --skip-tokenizer-init
--- a/examples/sglang/launch/agg_router.sh
+++ b/examples/sglang/launch/agg_router.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# Setup cleanup trap
+cleanup() {
+    echo "Cleaning up background processes..."
+    kill $DYNAMO_PID 2>/dev/null || true
+    wait $DYNAMO_PID 2>/dev/null || true
+    echo "Cleanup complete."
+}
+trap cleanup EXIT INT TERM
+# run clear_namespace
+python3 utils/clear_namespace.py --namespace dynamo
+# run ingress
+dynamo run in=http out=dyn --router-mode kv &
+DYNAMO_PID=$!
+# run worker
+python3 components/worker.py \
+  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
+  --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
+  --page-size 16 \
+  --tp 1 \
+  --trust-remote-code \
+  --skip-tokenizer-init
\ No newline at end of file
--- a/examples/sglang/launch/disagg.sh
+++ b/examples/sglang/launch/disagg.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# Setup cleanup trap
+cleanup() {
+    echo "Cleaning up background processes..."
+    kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
+    wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
+    echo "Cleanup complete."
+}
+trap cleanup EXIT INT TERM
+# run clear_namespace
+python3 utils/clear_namespace.py --namespace dynamo
+# run ingress
+dynamo run in=http out=dyn &
+DYNAMO_PID=$!
+# run prefill worker
+python3 components/worker.py \
+  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
+  --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
+  --page-size 16 \
+  --tp 1 \
+  --trust-remote-code \
+  --skip-tokenizer-init \
+  --disaggregation-mode prefill \
+  --disaggregation-transfer-backend nixl &
+PREFILL_PID=$!
+# run decode worker
+CUDA_VISIBLE_DEVICES=1 python3 components/decode_worker.py \
+  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
+  --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
+  --page-size 16 \
+  --tp 1 \
+  --trust-remote-code \
+  --skip-tokenizer-init \
+  --disaggregation-mode decode \
+  --disaggregation-transfer-backend nixl
\ No newline at end of file
--- a/examples/sglang/launch/disagg_dp_attn.sh
+++ b/examples/sglang/launch/disagg_dp_attn.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# Setup cleanup trap
+cleanup() {
+    echo "Cleaning up background processes..."
+    kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
+    wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
+    echo "Cleanup complete."
+}
+trap cleanup EXIT INT TERM
+# run clear_namespace
+python3 utils/clear_namespace.py --namespace dynamo
+# run ingress
+dynamo run in=http out=dyn &
+DYNAMO_PID=$!
+# run prefill worker
+python3 components/worker.py \
+  --model-path silence09/DeepSeek-R1-Small-2layers \
+  --served-model-name silence09/DeepSeek-R1-Small-2layers \
+  --tp 2 \
+  --dp-size 2 \
+  --enable-dp-attention \
+  --trust-remote-code \
+  --skip-tokenizer-init \
+  --disaggregation-mode prefill \
+  --disaggregation-transfer-backend nixl \
+  --port 30000 &
+PREFILL_PID=$!
+# run decode worker
+CUDA_VISIBLE_DEVICES=2,3 python3 components/decode_worker.py \
+  --model-path silence09/DeepSeek-R1-Small-2layers \
+  --served-model-name silence09/DeepSeek-R1-Small-2layers \
+  --tp 2 \
+  --dp-size 2 \
+  --enable-dp-attention \
+  --trust-remote-code \
+  --skip-tokenizer-init \
+  --disaggregation-mode decode \
+  --disaggregation-transfer-backend nixl \
+  --port 31000
\ No newline at end of file
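The launch scripts above start the ingress and prefill worker in the background and rely on a `trap` to kill them when the script exits. For readers who want to drive the same processes from Python, here is a minimal sketch of that pattern using only the standard library. The helper names `start` and `cleanup` are hypothetical, and the sleeping child process stands in for the real `dynamo run` and worker commands.

```python
import subprocess
import sys


def start(cmd):
    """Start a background process and return its handle."""
    return subprocess.Popen(cmd)


def cleanup(procs, timeout=5.0):
    """Terminate every still-running process, then wait for each to exit."""
    for proc in procs:
        if proc.poll() is None:
            proc.terminate()
    for proc in procs:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Escalate if the process ignores the termination request
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    # Placeholder command; a real launcher would start the ingress and workers
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    procs = [start(sleeper), start(sleeper)]
    try:
        pass  # the foreground worker would run here
    finally:
        cleanup(procs)
    print([p.returncode is not None for p in procs])
```

Running the cleanup in a `finally` block plays the role of `trap cleanup EXIT INT TERM` for normal exits and `KeyboardInterrupt`. Handling `SIGTERM` as well would need an explicit signal handler.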
--- a/examples/sglang/multinode-examples.md
+++ b/examples/sglang/multinode-examples.md
@@ -2,8 +2,7 @@
 ## Multi-node sized models
-SGLang allows you to deploy multi-node sized models by adding in the `dist-init-addr`, `nnodes`, and `node-rank` arguments. Below we demonstrate and example of deploying DeepSeek R1 for disaggregated serving across 4 nodes. This example requires
+SGLang allows you to deploy multi-node sized models by adding the `dist-init-addr`, `nnodes`, and `node-rank` arguments. Below we demonstrate an example of deploying DeepSeek R1 for disaggregated serving across 4 nodes. This example requires 4 nodes of 8xH100 GPUs.
-4 nodes of 8xH100 GPUs.
 **Step 1**: Start NATS/ETCD on your head node. Ensure you have the correct firewall rules to allow communication between the nodes as you will need the NATS/ETCD endpoints to be accessible by all other nodes.
 ```bash
@@ -14,130 +13,93 @@ docker compose -f lib/runtime/docker-compose.yml up -d
 **Step 2**: Ensure that your configuration file has the required arguments. Here's an example configuration that runs prefill and the model in TP16:
 Node 1: Run HTTP ingress, processor, and 8 shards of the prefill worker
-```yaml
-# configs/prefill-1.yaml
-Frontend:
-  served_model_name: deepseek-ai/DeepSeek-R1
-  endpoint: dynamo.SGLangWorker.generate
-  port: 8000
-SGLangWorker:
-  model-path: deepseek-ai/DeepSeek-R1
-  served-model-name: deepseek-ai/DeepSeek-R1
-  tp: 16
-  trust-remote-code: true
-  skip-tokenizer-init: true
-  dist-init-addr: <node-1-ip>:29500
-  disaggregation-bootstrap-port: 30001
-  disaggregation-mode: prefill
-  disaggregation-transfer-backend: nixl
-  nnodes: 2
-  node-rank: 0
-  mem-fraction-static: 0.82
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 8
-```
-Run this with:
 ```bash
-cd examples/sglang
+# run ingress
-dynamo serve graphs.agg:Frontend -f configs/prefill-1.yaml
+dynamo run in=http out=dyn &
-```
+# run prefill worker
+python3 components/worker_inc.py \
-Node 2: Run the remaining 8 shards of the prefill worker and the decode worker
+  --model-path /model/ \
-```yaml
+  --served-model-name deepseek-ai/DeepSeek-R1 \
-# configs/prefill-2.yaml
+  --tp 16 \
-SGLangWorker:
+  --dp-size 16 \
-  model-path: deepseek-ai/DeepSeek-R1
+  --dist-init-addr HEAD_PREFILL_NODE_IP:29500 \
-  served-model-name: deepseek-ai/DeepSeek-R1
+  --nnodes 2 \
-  tp: 16
+  --node-rank 0 \
-  trust-remote-code: true
+  --enable-dp-attention \
-  skip-tokenizer-init: true
+  --trust-remote-code \
-  mem-fraction-static: 0.82
+  --skip-tokenizer-init \
-  dist-init-addr: <node-1-ip>:29500
+  --disaggregation-mode prefill \
-  disaggregation-bootstrap-port: 30001
+  --disaggregation-transfer-backend nixl \
-  disaggregation-mode: prefill
+  --mem-fraction-static 0.82
-  disaggregation-transfer-backend: nixl
-  nnodes: 2
-  node-rank: 1
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 8
 ```
-On all other nodes, we need to export the NATS and ETCD endpoints. Run this with with:
+Node 2: Run the remaining 8 shards of the prefill worker
 ```bash
+# nats and etcd endpoints
 export NATS_SERVER="nats://<node-1-ip>"
 export ETCD_ENDPOINTS="<node-1-ip>:2379"
-cd examples/sglang
+# worker
-dynamo serve graphs.disagg:Frontend -f configs/prefill-2.yaml --service-name SGLangWorker
+python3 components/worker_inc.py \
+  --model-path /model/ \
+  --served-model-name deepseek-ai/DeepSeek-R1 \
+  --tp 16 \
+  --dp-size 16 \
+  --dist-init-addr HEAD_PREFILL_NODE_IP:29500 \
+  --nnodes 2 \
+  --node-rank 1 \
+  --enable-dp-attention \
+  --trust-remote-code \
+  --skip-tokenizer-init \
+  --disaggregation-mode prefill \
+  --disaggregation-transfer-backend nixl \
+  --mem-fraction-static 0.82
 ```
 Node 3: Run the first 8 shards of the decode worker
-```yaml
-# configs/decode-1.yaml
-SGLangDecodeWorker:
-  model-path: deepseek-ai/DeepSeek-R1
-  served-model-name: deepseek-ai/DeepSeek-R1
-  tp: 16
-  trust-remote-code: true
-  skip-tokenizer-init: true
-  mem-fraction-static: 0.80
-  dist-init-addr: 2:29500
-  disaggregation-mode: decode
-  disaggregation-transfer-backend: nixl
-  disaggregation-bootstrap-port: 30001
-  nnodes: 2
-  node-rank: 0
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 8
-```
-Run this with:
 ```bash
+# nats and etcd endpoints
 export NATS_SERVER="nats://<node-1-ip>"
 export ETCD_ENDPOINTS="<node-1-ip>:2379"
-cd examples/sglang
+# worker
-dynamo serve graphs.disagg:Frontend -f configs/decode-1.yaml --service-name SGLangDecodeWorker
+python3 components/decode_worker_inc.py \
+  --model-path /model/ \
+  --served-model-name deepseek-ai/DeepSeek-R1 \
+  --tp 16 \
+  --dp-size 16 \
+  --dist-init-addr HEAD_DECODE_NODE_IP:29500 \
+  --nnodes 2 \
+  --node-rank 0 \
+  --enable-dp-attention \
+  --trust-remote-code \
+  --skip-tokenizer-init \
+  --disaggregation-mode decode \
+  --disaggregation-transfer-backend nixl \
+  --mem-fraction-static 0.82
 ```
 Node 4: Run the remaining 8 shards of the decode worker
-```yaml
-# configs/decode-2.yaml
-SGLangDecodeWorker:
-  model-path: deepseek-ai/DeepSeek-R1
-  served-model-name: deepseek-ai/DeepSeek-R1
-  tp: 16
-  trust-remote-code: true
-  skip-tokenizer-init: true
-  mem-fraction-static: 0.80
-  dist-init-addr: 2:29500
-  disaggregation-mode: decode
-  disaggregation-transfer-backend: nixl
-  disaggregation-bootstrap-port: 30001
-  disable-cuda-graph: true
-  nnodes: 2
-  node-rank: 1
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 8
-```
-Run this with:
 ```bash
+# nats and etcd endpoints
 export NATS_SERVER="nats://<node-1-ip>"
 export ETCD_ENDPOINTS="<node-1-ip>:2379"
-cd examples/sglang
+# worker
-dynamo serve graphs.disagg:Frontend -f configs/decode-2.yaml --service-name SGLangDecodeWorker
+python3 components/decode_worker_inc.py \
+  --model-path /model/ \
+  --served-model-name deepseek-ai/DeepSeek-R1 \
+  --tp 16 \
+  --dp-size 16 \
+  --dist-init-addr HEAD_DECODE_NODE_IP:29500 \
+  --nnodes 2 \
+  --node-rank 1 \
+  --enable-dp-attention \
+  --trust-remote-code \
+  --skip-tokenizer-init \
+  --disaggregation-mode decode \
+  --disaggregation-transfer-backend nixl \
+  --mem-fraction-static 0.82
 ```
 **Step 3**: Run inference