docs: Update docs for new UX (#2070)

Unverified Commit 3c500ae7 authored Jul 23, 2025 by Graham King Committed by GitHub Jul 23, 2025
20 changed files
--- a/README.md
+++ b/README.md
@@ -55,96 +55,68 @@ Built in Rust for performance and in Python for extensibility, Dynamo is fully o
 The following examples require a few system level packages.
 Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [docs/support_matrix.md](docs/support_matrix.md)
-```
+1. Install etcd and nats
-apt-get update
-DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0
-python3 -m venv venv
-source venv/bin/activate
-pip install "ai-dynamo[all]"
-```
-> [!NOTE]
-> To ensure compatibility, please refer to the examples in the release branch or tag that matches the version you installed.
-### Building the Dynamo Base Image
-Although not needed for local development, deploying your Dynamo pipelines to Kubernetes will require you to use a Dynamo base image to your container registry. You can use any container registry of your choice, such as:
+To coordinate across the data center, Dynamo relies on an etcd and a nats cluster. To run locally, both need to be available.
- Docker Hub (docker.io)
- NVIDIA NGC Container Registry (nvcr.io)
- Any private registry
-We publish our images in [nvcr.io](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime)  and you can use them.
+- [etcd](https://etcd.io/) can be run directly as `./etcd`.
-Alternatively you could build and push an image from source:
+- [nats](https://nats.io/) needs jetstream enabled: `nats-server -js`.
-```bash
+The Dynamo team recommends the `uv` Python package manager, although any Python package manager works. Install uv:
-./container/build.sh
+```
-docker tag dynamo:latest-vllm <your-registry>/dynamo-base:latest-vllm
+curl -LsSf https://astral.sh/uv/install.sh | sh
-docker login <your-registry>
-docker push <your-registry>/dynamo-base:latest-vllm
 ```
-Notes about builds for specific frameworks:
+2. Select an engine
- For specific details on the `--framework vllm` build [read about the VLLM backend](components/backends/vllm/README.md)
-.
- For specific details on the `--framework tensorrtllm` build, see [Read about the TensorRT-LLM backend](components/backends/trtllm/README.md)
-.
-Note about AWS environments:
+We publish Python wheels specialized for each of our supported engines: vllm, sglang, llama.cpp and trtllm. The examples that follow use sglang; read on for other engines.
- If deploying Dynamo in AWS, make sure to build the container with EFA support using the `--make-efa` flag.
-After building, you can use this image by setting the `DYNAMO_IMAGE` environment variable to point to your built image:
-```bash
-export DYNAMO_IMAGE=<your-registry>/dynamo-base:latest-vllm
 ```
+uv venv venv
+source venv/bin/activate
+uv pip install pip
-> [!NOTE]
+# Choose one
-> We are working on leaner base images that can be built using the targets in the top-level Earthfile.
+uv pip install "ai-dynamo[sglang]"
+uv pip install "ai-dynamo[vllm]"
+uv pip install "ai-dynamo[llama_cpp]" # CPU, see later for GPU
+```
 ### Running and Interacting with an LLM Locally
 You can run a model and interact with it locally using commands below.
-We support several backends including: `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
 #### Example Commands
 ```
-python -m dynamo.frontend [--http-port 8080]
+python -m dynamo.frontend --interactive
-python -m dynamo.vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+python -m dynamo.sglang.worker Qwen/Qwen3-4B
 ```
 ```
-? User › Hello, how are you?
 ✔ User · Hello, how are you?
 Okay, so I'm trying to figure out how to respond to the user's greeting. They said, "Hello, how are you?" and then followed it with "Hello! I'm just a program, but thanks for asking." Hmm, I need to come up with a suitable reply. ...
 ```
-### LLM Serving
+If the model is not available locally it will be downloaded from HuggingFace and cached.
-Dynamo provides a simple way to spin up a local set of inference
+You can also pass a local path: `python -m dynamo.sglang.worker --model-path ~/llms/Qwen3-0.6B`
-components including:
+### Running an LLM API server
+Dynamo provides a simple way to spin up a local set of inference components including:
 - **OpenAI Compatible Frontend** – High performance OpenAI compatible http api server written in Rust.
 - **Basic and Kv Aware Router** – Route and load balance traffic to a set of workers.
 - **Workers** – Set of pre-configured LLM serving engines.
-To run a minimal configuration you can use a pre-configured
-example.
-#### Start Dynamo Distributed Runtime Services
-First start the Dynamo Distributed Runtime services:
-```bash
-docker compose -f deploy/metrics/docker-compose.yml up -d
 ```
-#### Start Dynamo LLM Serving Components
+# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router:
+python -m dynamo.frontend [--http-port 8080]
-Next serve a minimal configuration with an http server, basic
-round-robin router, and a single worker.
-```bash
+# Start the sglang engine, connecting to nats and etcd to receive requests. You can run several of these,
-cd examples/llm
+# both for the same model and for multiple models. The frontend node will discover them.
-dynamo serve graphs.agg:Frontend -f configs/agg.yaml
+python -m dynamo.sglang.worker deepseek-ai/DeepSeek-R1-Distill-Llama-8B
 ```
 #### Send a Request
@@ -163,43 +135,143 @@ curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"
  }' | jq
 ```
+Rerun with `curl -N` and change `stream` in the request to `true` to get the responses as soon as the engine issues them.
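The streaming behaviour described above can be consumed from Python as well. The following is a minimal sketch, assuming the frontend emits the usual OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`); the helper name `iter_stream_text` and the sample chunks are illustrative and do not come from the Dynamo codebase.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel in the OpenAI streaming format
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # "delta" may be missing or null on some chunks (assumption: tolerate both)
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hypothetical sample of what a streamed response might look like.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

In practice the `lines` argument would be the decoded line iterator of whatever HTTP client you use with streaming enabled.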
+### Engines
+In the introduction we installed the `sglang` engine. There are other options.
+All of these require nats and etcd, as well as a frontend (`python -m dynamo.frontend [--interactive]`).
+#### vllm
+```
+uv pip install ai-dynamo[vllm]
+```
+Run the backend/worker like this:
+```
+python -m dynamo.vllm --help
+```
+vllm attempts to allocate enough KV cache for the full context length at startup. If that does not fit in your available memory, pass `--context-length <value>`.
+To specify which GPUs to use, set the environment variable `CUDA_VISIBLE_DEVICES`.
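As a rough illustration of the GPU selection described above, the sketch below builds the command line and environment for a worker restricted to specific GPUs. The helper `worker_launch_spec` is hypothetical; only the `python -m dynamo.vllm --help` invocation and the `CUDA_VISIBLE_DEVICES` variable come from the text.

```python
import os
import sys


def worker_launch_spec(module, args, gpu_ids):
    """Return (argv, env) for starting a worker limited to the given GPU indices."""
    env = dict(os.environ)  # copy, so the current process environment is untouched
    # CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    argv = [sys.executable, "-m", module, *args]
    return argv, env


argv, env = worker_launch_spec("dynamo.vllm", ["--help"], gpu_ids=[0, 1])
print(argv[1:], env["CUDA_VISIBLE_DEVICES"])
# To actually start the worker: subprocess.Popen(argv, env=env)
```

Building the environment per process, rather than exporting the variable globally, makes it easier to run several workers on disjoint GPU sets from one script.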
+#### sglang
+```
+uv pip install ai-dynamo[sglang]
+```
+Run the backend/worker like this:
+```
+python -m dynamo.sglang.worker --help
+```
+You can pass any sglang flags directly to this worker; see https://docs.sglang.ai/backend/server_arguments.html for the full list, including the flags for using multiple GPUs.
+#### TRT-LLM
+This currently requires a container. TODO: add documentation.
+#### llama.cpp
+To install llama.cpp for CPU inference:
+```
+uv pip install ai-dynamo[llama_cpp]
+```
+To build llama.cpp for CUDA:
+```
+pip install llama-cpp-python -C cmake.args="-DGGML_CUDA=on"
+uv pip install uvloop ai-dynamo
+```
+At the time of writing, `uv pip` does not support that syntax, so use `pip` directly inside the venv.
+To build llama.cpp for other accelerators see https://pypi.org/project/llama-cpp-python/ .
+Download a GGUF and run the engine like this:
+```
+python -m dynamo.llama_cpp --model-path ~/llms/Qwen3-0.6B-Q8_0.gguf
+```
+If you have multiple GPUs, llama.cpp does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.
 ### Local Development
-If you use vscode or cursor, we have a .devcontainer folder built on [Microsofts Extension](https://code.visualstudio.com/docs/devcontainers/containers). For instructions see the [ReadMe](.devcontainer/README.md) for more details.
+1. Install libraries
-Otherwise, to develop locally, we recommend working inside of the container
+**Ubuntu:**
+```
+sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
+```
-```bash
+**macOS:**
-./container/build.sh
+- [Homebrew](https://brew.sh/)
-./container/run.sh -it --mount-workspace
+```
+# if brew is not installed on your system, install it
+/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
+```
+- [Xcode](https://developer.apple.com/xcode/)
-cargo build --release
+```
-mkdir -p /workspace/deploy/sdk/src/dynamo/sdk/cli/bin
+brew install cmake protobuf
-cp /workspace/target/release/dynamo-run /workspace/deploy/sdk/src/dynamo/sdk/cli/bin
-uv pip install -e .
+## Check that Metal is accessible
-export PYTHONPATH=$PYTHONPATH:/workspace/deploy/sdk/src:/workspace/components/planner/src
+xcrun -sdk macosx metal
 ```
+If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
-#### Conda Environment
+2. Install Rust
-Alternately, you can use a conda environment
+```
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+source $HOME/.cargo/env
+```
-```bash
+3. Create a Python virtual env:
-conda activate <ENV_NAME>
-pip install nixl # Or install https://github.com/ai-dynamo/nixl from source
+```
+uv venv dynamo
+source dynamo/bin/activate
+```
-cargo build --release
+4. Install build tools
-# To install ai-dynamo-runtime from source
+```
-cd lib/bindings/python
+uv pip install pip maturin
-pip install .
+```
-cd ../../../
+[Maturin](https://github.com/PyO3/maturin) is the Rust<->Python bindings build tool.
-pip install ".[all]"
+5. Build the Rust bindings
+```
+cd lib/bindings/python
+maturin develop --uv
+```
-Follow the [Quickstart Guide](docs/guides/dynamo_deploy/quickstart.md)
+6. Install the wheel
 ```
+cd $PROJECT_ROOT
+uv pip install .
+```
+Note that an editable install (`-e`) does not work because the `dynamo` package is split over multiple directories, one per backend.
+You should now be able to run `python -m dynamo.frontend`.
+Remember that nats and etcd must be running (see earlier).
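A quick way to confirm that both services are reachable before starting the frontend is a plain TCP check. This is a minimal sketch, assuming the services listen on their usual default client ports (2379 for etcd, 4222 for nats) on localhost; the function names are illustrative and not part of Dynamo.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed default client ports; adjust if your services are configured differently.
DEFAULT_SERVICES = {"etcd": 2379, "nats": 4222}


def check_services(host="127.0.0.1", services=DEFAULT_SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

A successful connection only shows that something is listening on the port; it does not verify that nats was started with JetStream enabled.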
+Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.
+If you use VS Code or Cursor, we have a `.devcontainer` folder built on [Microsoft's extension](https://code.visualstudio.com/docs/devcontainers/containers). See the [README](.devcontainer/README.md) for instructions.
+### Deployment to Kubernetes
+Follow the [Quickstart Guide](docs/guides/dynamo_deploy/quickstart.md) to deploy to Kubernetes.
--- a/components/metrics/README.md
+++ b/components/metrics/README.md
@@ -74,18 +74,14 @@ metrics --component MyComponent --endpoint my_endpoint
 To run a more realistic deployment to gather metrics from,
 see the examples in [examples/llm](../../examples/llm).
-For example, for a VLLM + KV Routing based deployment that
-exposes statistics on an endpoint labeled
-`dynamo/VllmWorker/load_metrics` (note: this does NOT currently work
-with any other example such as examples/vllm_v0, vllm_v1, ...):
 ```bash
-cd deploy/examples/llm
+python -m dynamo.frontend &
-dynamo serve graphs.agg:Frontend -f configs/agg.yaml
+python -m dynamo.vllm --model-path <your-model-checkout>
 ```
 Then, to monitor the metrics of these VllmWorkers, run:
 ```bash
-metrics --component VllmWorker --endpoint load_metrics
+metrics --component backend --endpoint load_metrics
 ```
 **NOTE**: `load_metrics` is currently a

--- a/docs/API/python_bindings.md
+++ b/docs/API/python_bindings.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-https://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-# Dynamo Python Bindings
-Python bindings for the Dynamo runtime system, enabling distributed computing capabilities for machine learning workloads.
-## 🚀 Quick Start
-1. Install `uv`: https://docs.astral.sh/uv/#getting-started
-```
-curl -LsSf https://astral.sh/uv/install.sh | sh
-```
-2. Install `protoc` protobuf compiler: https://grpc.io/docs/protoc-installation/.
-For example on an Ubuntu/Debian system:
-```
-apt install protobuf-compiler
-```
-3. Setup a virtualenv
-```
-uv venv
-source .venv/bin/activate
-uv pip install maturin
-```
-4. Build and install dynamo wheel
-```
-maturin develop --uv
-```
-## Run Examples
-### Prerequisite
-See [README.md](../runtime/README.md#prerequisites).
-### Hello World Example
-1. Start 3 separate shells, and activate the virtual environment in each
-```
-source .venv/bin/activate
-```
-2. In one shell (shell 1), run example server the instance-1
-```
-python3 ./examples/hello_world/server.py
-```
-3. (Optional) In another shell (shell 2), run example the server instance-2
-```
-python3 ./examples/hello_world/server.py
-```
-4. In the last shell (shell 3), run the example client:
-```
-python3 ./examples/hello_world/client.py
-```
-If you run the example client in rapid succession, and you started more than
-one server instance above, you should see the requests from the client being
-distributed across the server instances in each server's output. If only one
-server instance is started, you should see the requests go to that server
-each time.
-## Performance
-The performance impacts of synchronizing the Python and Rust async runtimes
-is a critical consideration when optimizing the performance of a highly
-concurrent and parallel distributed system.
-The Python GIL is a global critical section and is ultimately the death of
-parallelism. To compound that, when Rust async futures become ready,
-accessing the GIL on those async event loop needs to be considered carefully.
-Under high load, accessing the GIL or performing CPU intensive tasks on
-on the event loop threads can starve out other async tasks for CPU resources.
-However, performing a `tokio::task::spawn_blocking` is not without overheads
-as well.
-If bouncing many small message back-and-forth between the Python and Rust
-event loops where Rust requires GIL access, this is pattern where moving the
-code from Python to Rust will give you significant gains.
--- a/docs/API/sdk.md
+++ b/docs/API/sdk.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-# Dynamo SDK
-## Introduction
-Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See [Python Bindings](./python_bindings.md).
-Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example and instructions for deploying it [here](../examples/hello_world.md).
-## Installation
-The SDK can be installed using pip:
-```bash
-pip install ai-dynamo
-```
-## Core Concepts
-As you read about each concept, it is helpful to have the [basic example](../examples/hello_world.md) up as well so you can refer back to it.
-### Defining a Service
-A Service is a core building block for a project. You can think of it as a logical unit of work. For example, you might have a service responsible for preprocessing and tokenizing and another service running the model worker itself. You define a service using the `@service` decorator on a class.
-```python
-@service(
-    dynamo={
-        "namespace": "dynamo",
-    },
-    resources={"gpu": 2, "cpu": "10", "memory": "20Gi"},
-    workers=1,
-)
-```
-Key configuration options:
-1. `dynamo`: Dictionary that defines the Dynamo configuration and enables/disables it. When enabled, a dynamo worker is created under the hood which can register with the [Distributed Runtime](../architecture/architecture.md)
-2. `resources`: Dictionary defining resource requirements. The GPUs field is used for local and remote deployment. The other fields are used to determine resources when deploying to K8s.
-3. `workers`: Number of parallel instances of the service to spin up.
-### Writing a Service
-Let's walk through an example to understand how you write a dynamo service.
-```python
-import ServiceB
-@service(dynamo={"namespace": "dynamo"}, resources={"gpu": 1})
-class ServiceA:
-    # Define service dependencies
-    service_b = depends(ServiceB)
-    def __init__(self, model_name="meta-llama/Llama-3.1-8B-Instruct"):
-        self.model_name = model_name
-        self.engine = None
-    @async_on_start
-    async def async_init(self):
-        # Initialize resources that require async operations
-        self.engine = await initialize_model_engine(self.model_name)
-        print(f"ServiceA initialized with model: {self.model_name}")
-    @on_shutdown
-    def shutdown(self):
-        # Clean up resources
-        if self.engine:
-            self.engine.shutdown()
-        print("ServiceA engine shut down")
-    @endpoint()
-    async def generate(self, request: ChatCompletionRequest):
-        # Call dependent service
-        processed_request = await self.service_b.preprocess(request)
-        # Use the engine to generate a response
-        response = await self.engine.generate(processed_request)
-        return response
-```
-#### Class-Based Architecture
-Dynamo follows a class-based architecture similar to BentoML making it intuitive for users familiar with those frameworks. Each service is defined as a Python class, with the following components:
-1. Class attributes for dependencies using `depends()`
-2. An `__init__` method for standard initialization
-3. Optional lifecycle hooks like `@async_on_start` and `@on_shutdown`
-4. Endpoints defined with `@endpoint()`. Optionally, an endpoint can be given a name
-   via `@endpoint("my_endpoint_name")`, but otherwise defaults to the name of the
-   function being decorated if omitted.
-This approach provides a clean separation of concerns and makes the service structure easy to understand.
-#### Service Dependencies with `depends()`
-The `depends()` function is a powerful feature that lets you create a dependency between services. When you use `depends(ServiceB)`, several things happen:
-1. It ensures that `ServiceB` is deployed when `ServiceA` is deployed by adding it to an internal service dependency graph
-2. It creates a client to the endpoints of `ServiceB` that is being served under the hood.
-3. You are able to access `ServiceB` endpoints as if it were a local function!
-```python
-# What happens internally when you use depends(ServiceB)
-service_b = await runtime.namespace("dynamo").component("ServiceB").endpoint("preprocess").client()
-# But with Dynamo SDK, you simply write:
-service_b = depends(ServiceB)
-# And then call methods directly:
-result = await service_b.preprocess(data)
-```
-```{note}
-Through the SDK, we also provide you with a way to access the underlying bindings if you need. Sometimes you might want to write complicated logic that causes you to directly create a client to another Service without depending on it. To do so:
-```
-```python
-import VllmWorker
-# this runtime object gives you access to the underlying python bindings
-runtime = dynamo_context["runtime"]
-comp_ns, comp_name = VllmWorker.dynamo_address() # dynamo://{namespace}/{name}
-print(f"[Processor] comp_ns: {comp_ns}, comp_name: {comp_name}")
-self.worker_client = (
-    await runtime.namespace(comp_ns)
-    .component(comp_name)
-    .endpoint("generate")
-    .client()
-)
-```
-This is used in some of our prebuilt examples and is a powerful way to leverage the benefits of the SDK while being able to access Dynamo's core primitives.
-#### Lifecycle Hooks
-Dynamo supports key lifecycle hooks to manage service initialization and cleanup.
-##### `@async_on_start`
-The `@async_on_start` hook is called when the service is started and is used to run an async process outside of the main `__init__` function.
-```python
-@async_on_start
-async def async_init(self):
-    # Perfect for operations that need to be awaited
-    self.db = await connect_to_db()
-    self.tokenizer = await load_tokenizer()
-    self.engine = await initialize_engine(self.model)
-```
-This is especially useful for:
- Initializing external connections
- Setting up runtime resources that require async operations
-#### `@on_shutdown`
-The `@on_shutdown` hook is called when the service is shutdown handles cleanup.
-```python
-@on_shutdown
-def shutdown(self):
-    # gracefully Handle shutdown / cleanup
-    logger.info("worker shutting down")
-```
-This ensures resources are properly released, preventing memory leaks and making sure external connections are properly closed. This is helpful to clean up vLLM engines that have been started outside of the main process.
-### Configuring a Service
-Dynamo SDK provides a flexible configuration system that allows you to define service parameters through multiple methods:
-1. Directly in the `@service` decorator
-2. Through YAML configuration files
-3. Via command-line arguments
-4. Using environment variables
-These methods can be used together with clear precedence rules, giving you fine-grained control over service configuration across different environments.
-#### Configuration via Service Decorator
-The most basic method is to specify parameters directly in the service decorator:
-```python
-@service(
-    dynamo={"namespace": "prod"},
-    resources={"gpu": 2, "cpu": "4", "memory": "16Gi"},
-    workers=2,
-)
-class MyService:
-    def __init__(self, model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", temperature=0.7):
-        self.model_name = model_name
-        self.temperature = temperature
-```
-This defines static configuration values in code. Note that the constructor parameters (`model_name` and `temperature`) are also configurable values that can be overridden.
-#### Configuration via YAML
-For more flexible configuration, especially across environments, you can use YAML files:
-```yaml
-# config.yaml
-MyService:
-  # Override service decorator settings
-  ServiceArgs:
-    workers: 4
-    resources:
-      gpu: 4
-  # Service instance parameters
-  model_name: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
-  temperature: 0.8
-```
-The YAML file has a hierarchical structure:
- Top level keys are service class names
- `ServiceArgs` contains parameters for the service decorator
- Other keys are passed as arguments to the service constructor
- Additional keys specific to the service can be accessed via the config system
-#### Loading YAML Configuration
-Use the CLI to load configuration from a YAML file:
-```bash
-dynamo serve service:MyService -f config.yaml
-```
-The configuration is parsed and stored in the `DYNAMO_SERVICE_CONFIG` environment variable, which is then passed to the service workers.
-#### Configuration Precedence
-When multiple configuration sources are used, they follow this precedence order (highest to lowest):
-1. Command-line arguments
-2. YAML configuration
-3. Service decorator defaults
-4. Constructor defaults
-#### Accessing Configuration in Services
-Inside a service, you can access configuration using the `ServiceConfig` class:
-```python
-from dynamo.sdk.lib.config import ServiceConfig
-class MyService:
-    def __init__(self):
-        config = ServiceConfig.get_instance()
-        # Get with default value
-        self.model_name = config.get("MyService", {}).get("model_name", "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
-        self.temperature = config.get("MyService", {}).get("temperature", 0.7)
-        # Require a config value (raises error if missing)
-        self.api_key = config.require("MyService", "api_key")
-        # Get all config for this service
-        all_my_config = config.get("MyService", {})
-```
-#### Parsing Configuration as CLI Arguments
-For services that need to extract their configuration as command-line arguments (common when integrating and validating with external libraries), the SDK provides a helper method:
-```python
-from dynamo.sdk.lib.config import ServiceConfig
-def setup_my_lib():
-    config = ServiceConfig.get_instance()
-    # Get all MyService config with prefix "lib_" as CLI args
-    cli_args = config.as_args("MyService", prefix="lib_")
-    # Returns: ["--option1", "value1", "--flag2", "--option3", "value3"]
-    # Pass to an external library's argument parser
-    lib_parser = MyLibArgumentParser()
-    lib_args = lib_parser.parse_args(cli_args)
-    return lib_args
-```
-This pattern is used in the example vLLM integration:
-```python
-def parse_vllm_args(service_name, prefix) -> AsyncEngineArgs:
-    config = ServiceConfig.get_instance()
-    vllm_args = config.as_args(service_name, prefix=prefix)
-    parser = FlexibleArgumentParser()
-    # Add custom arguments
-    parser.add_argument("--router", type=str, choices=["random", "round-robin", "kv"], default="random")
-    parser.add_argument("--remote-prefill", action="store_true", default=False)
-    # Add VLLM's arguments (ServiceConfig handles True defaults automatically)
-    parser = AsyncEngineArgs.add_cli_args(parser)
-    # Parse both custom and VLLM arguments
-    args = parser.parse_args(vllm_args)
-    # Convert to engine arguments
-    engine_args = AsyncEngineArgs.from_cli_args(args)
-    # Add custom args to the engine args
-    engine_args.router = args.router
-    engine_args.remote_prefill = args.remote_prefill
-    return engine_args
-```
-#### Boolean Argument Handling
-ServiceConfig uses a targeted approach for boolean arguments to maintain compatibility with different argument parsers:
-1. Standard Boolean Handling:
- `true` → outputs just the flag (e.g., `--enable-feature`)
- `false` → omitted entirely (uses parser's default)
-2. vLLM True-Default Arguments (targeted override support):
- Automatically detects vLLM arguments that default to `True` and need explicit `false` handling
- `enable-prefix-caching: false` → `--no-enable-prefix-caching` (negative flag)
- `multi-step-stream-outputs: false` → `--no-multi-step-stream-outputs` (negative flag)
-```yaml
-# Example YAML configuration
-VllmWorker:
-  # Standard boolean flags (action="store_true" style)
-  enforce-eager: true          # → --enforce-eager
-  disable-logging: false       # → (omitted)
-  # vLLM arguments with True defaults (automatically handled)
-  enable-prefix-caching: false  # → --no-enable-prefix-caching
-  # Non-boolean arguments
-  max-model-len: 16384         # → --max-model-len 16384
-```
-#### Overriding Service Decorator with ServiceArgs
-The `ServiceArgs` section in YAML configuration allows you to override any parameter in the `@service` decorator:
-```yaml
-MyService:
-  ServiceArgs:
-    dynamo:
-      namespace: "staging" # Override namespace
-    resources:
-      gpu: 4 # Use more GPUs
-    workers: 8 # Scale up workers
-```
-This is particularly useful for:
- Changing resource allocations between environments
- Modifying worker counts based on expected load
- Switching between namespaces for different deployments
-Under the hood, the `DynamoService` class reads these arguments during initialization:
-```python
-def _get_service_args(self, service_name: str) -> Optional[dict]:
-    """Get ServiceArgs from environment config if specified"""
-    config_str = os.environ.get("DYNAMO_SERVICE_CONFIG")
-    if config_str:
-        config = json.loads(config_str)
-        service_config = config.get(service_name, {})
-        return service_config.get("ServiceArgs")
-    return None
-```
-#### Complete Configuration Example
-Here's a comprehensive example showing how all these pieces fit together:
-1. First, define your service with basic defaults:
-```python
-@service(
-    dynamo={"namespace": "default"},
-    resources={"gpu": 1},
-    workers=1,
-)
-class LLMService:
-    def __init__(self, model_name="gpt-2", temperature=0.7, max_tokens=1024):
-        self.model_name = model_name
-        self.temperature = temperature
-        self.max_tokens = max_tokens
-        # Get additional configuration
-        config = ServiceConfig.get_instance()
-        service_config = config.get("LLMService", {})
-        # Extract service-specific parameters
-        self.cache_size = service_config.get("cache_size", 1000)
-        self.use_kv_cache = service_config.get("use_kv_cache", True)
-```
-2. Create a YAML configuration for production:
-```yaml
-# prod_config.yaml
-LLMService:
-  ServiceArgs:
-    dynamo:
-      namespace: "prod"
-    resources:
-      gpu: 4
-      memory: "64Gi"
-    workers: 8
-  # Constructor parameters
-  model_name: "llama-3-70b-instruct"
-  temperature: 0.8
-  max_tokens: 2048
-  # Service-specific parameters
-  cache_size: 10000
-  use_kv_cache: true
-```
-3. Deploy with mixed configuration:
-```bash
-dynamo serve service:LLMService -f prod_config.yaml --LLMService.temperature=0.9
-```
-The service receives the combined configuration with the command-line value taking precedence, resulting in effective configuration of:
- `dynamo.namespace = "prod"`
- `resources.gpu = 4`
- `workers = 8`
- `model_name = "llama-3-70b-instruct"`
- `temperature = 0.9` (from CLI override)
- `max_tokens = 2048`
- `cache_size = 10000`
- `use_kv_cache = true`
-#### Service Configuration Best Practices
-1. **Use the Service Decorator for Defaults**: Put reasonable defaults in the service decorator
-2. **Use Constructor Parameters for Runtime Options**: Parameters that might change between deployments
-3. **Use YAML for Environment Configuration**: Separate configuration by environment (dev/staging/prod)
-4. **Use CLI for Quick Testing**: Override specific values for experimentation
-5. **Document Configuration Keys**: Make sure to document all available configuration options
-Following these practices help create flexible and maintainable Dynamo services that can be easily configured for different environments and use cases.
-### Deploying a Single Service
-You can deploy a single service for local development even if you have a dependency graph defined using `depends()` using `dynamo serve --service-name <ClassName> <entrypoint> <configuration arguments>`. You can see an example of this in our multinode documentation [here](../examples/multinode.md).
-### Composing Services into an Graph
-There are two main ways to compose services in Dynamo:
-1. Use `depends()` (Recommended)
-   The depends() approach is the recommended way for production deployments:
- Automatically deploys all dependencies
- Creates a static inference graph at deployment time
- Provides type hints and better IDE support
-2. Use `.link()` (Experimental)
-   Our `.link()` syntax is an flexible and experimental way to compose various services. Linking allows you to compose checks at runtime and view behavior. Under the hood - we are editing the dependency graph between various services. This is useful for experimentation and development but we suggest writing a static graph for your final production deployment.
-#### Understanding the `.link()` syntax
-Let's take the example of a `Processor` component. This component can currently do two things:
-1. Process a request and send it to a `Router` to decide what worker to send it to.
-2. Process a request and send it to a `Worker` directly.
-Consider this snippet of the Processor:
-```python
-class Processor(ProcessMixIn):
-    """
-    vLLM pre and post processing
-    """
-    worker = depends(VllmWorker)
-    router = depends(Router)
-    # logic for processing a request based on router or worker
-```
-Think of all the `depends()` statements as the maximal set of edges for the processor. At runtime, you may want to follow only a single path. By default, our processor spins up both the VllmWorker and Router as separate services (because `depends()` is defined for both). However, if you want to spin up only the Router, you can do this by linking the Router to the Processor, which removes the `worker` dependency from the Processor.
-```python
-Processor.link(Router)
-```
-This removes the `worker` dependency from the Processor and spins up only the Router.
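Note on the removed `.link()` explanation: the pruning behavior it describes can be modeled in a few lines. This is a toy sketch and not the Dynamo SDK. `ToyService`, its `depends_on` list, and `active_services` are hypothetical names, and the only assumption is the one stated in the text: `depends()` declares the maximal set of edges, and linking keeps the linked edge while dropping the others.

```python
class ToyService:
    """Hypothetical stand-in for a service class; not the Dynamo SDK."""

    def __init__(self, name, depends_on=()):
        self.name = name
        # Maximal set of edges, as declared by depends()-style statements.
        self.depends_on = list(depends_on)
        self.linked = None  # None means no link() call was made

    def link(self, other):
        # Keep only the linked edge; every other declared edge is dropped.
        if other not in self.depends_on:
            raise ValueError(f"{self.name} does not depend on {other.name}")
        self.linked = [other]
        return other  # returning the target allows chained link() calls

    def edges(self):
        return self.depends_on if self.linked is None else self.linked


def active_services(root):
    """Collect the names of every service reachable from root."""
    seen, stack = [], [root]
    while stack:
        svc = stack.pop()
        if svc.name not in seen:
            seen.append(svc.name)
            stack.extend(svc.edges())
    return sorted(seen)


worker = ToyService("VllmWorker")
router = ToyService("Router")
processor = ToyService("Processor", depends_on=[worker, router])

before = active_services(processor)  # both dependencies are reachable
processor.link(router)
after = active_services(processor)   # only the linked Router remains
print(before, after)
```

In this model `before` contains all three services and `after` contains only `Processor` and `Router`, which mirrors the `Processor.link(Router)` example in the removed text.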
--- a/docs/architecture/disagg_serving.md
+++ b/docs/architecture/disagg_serving.md
@@ -2,18 +2,6 @@
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
 All rights reserved.
 SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
 -->
 # Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
@@ -117,78 +105,3 @@ The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows r
 - Add prefill worker: no explicit action needed.
 - Delete prefill worker: flush engine.
-### How this works under the hood
-#### Auto-Discovery for new workers
-In Dynamo, we use `etcd` (a distributed key-value pair store) as a way to register and discover new components. When a new decode/aggregated worker starts, it adds its endpoint information to `etcd` allowing the router to discover it and route requests to it. For the KV-cache transfer process, newly added decode workers put memory descriptors of their KV cache (used in NIXL transfer) in `etcd`. Newly added prefill workers also register with `etcd` for discovery and simply start pulling prefill requests from the global prefill queue after they spin up. Prefill workers lazy-pull the descriptors when they start serving a remote prefill request for the first time.
-You can watch this happen live by running the following:
-```bash
-# in terminal 1 - run the disaggregated serving example
-dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
-```
-```bash
-# in terminal 2 - watch the namespace in etcd
-watch -cd etcdctl get --prefix <namespace>
-```
-You should see something like this show up as the disaggregated serving example starts up:
-```bash
-# worker information
-dynamo/components/PrefillWorker/mock:694d967da694ea1e
-{
-  "component": "PrefillWorker",
-  "endpoint": "mock",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009310,
-  "transport": {
-    "nats_tcp": "dynamo_prefillworker_0d6df828.mock-694d967da694ea1e"
-  }
-}
-dynamo/components/Processor/chat/completions:694d967da694ea16
-{
-  "component": "Processor",
-  "endpoint": "chat/completions",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009302,
-  "transport": {
-    "nats_tcp": "dynamo_processor_3816642d.chat/completions-694d967da694ea16"
-  }
-}
-dynamo/components/VllmWorker/generate:694d967da694ea1a
-{
-  "component": "VllmWorker",
-  "endpoint": "generate",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009306,
-  "transport": {
-    "nats_tcp": "dynamo_vllmworker_3f6fafd3.generate-694d967da694ea1a"
-  }
-}
-dynamo/components/VllmWorker/load_metrics:694d967da694ea1a
-{
-  "component": "VllmWorker",
-  "endpoint": "load_metrics",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009306,
-  "transport": {
-    "nats_tcp": "dynamo_vllmworker_3f6fafd3.load_metrics-694d967da694ea1a"
-  }
-}
-# nixl metadata
-dynamo/nixl_metadata/e318db87-be55-4c18-9829-8036e1e603e2
-```
-#### Graceful worker shutdown
-Since worker information is stored in etcd, we can shut down workers by simply revoking their etcd leases. After a lease is revoked:
- Decode/aggregated worker endpoints are immediately removed from etcd so that they do not accept new requests. They finish any in-flight requests, shut down their engine, and exit gracefully.
- Prefill workers stop pulling from the prefill queue and exit gracefully after all pending remote KV cache writes finish.
-You can also visualize this by revoking a worker's etcd lease while it has ongoing requests. Refer to this example script that does this: [revoke_lease.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/revoke_lease.py).
\ No newline at end of file
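Note on the removed etcd listing: in the sample output, each key appears to follow the layout `<namespace>/components/<component>/<endpoint>:<lease id in hex>`, and the hex suffix matches the decimal `lease_id` in the JSON value. The sketch below parses that layout. It is inferred from the sample only and is not a documented key format; `EndpointKey` and `parse_endpoint_key` are hypothetical names.

```python
from typing import NamedTuple


class EndpointKey(NamedTuple):
    namespace: str
    component: str
    endpoint: str
    lease_id: int


def parse_endpoint_key(key: str) -> EndpointKey:
    """Split '<namespace>/components/<component>/<endpoint>:<lease hex>'."""
    # The lease suffix follows the last colon; endpoints may contain slashes.
    path, _, lease_hex = key.rpartition(":")
    namespace, marker, component, endpoint = path.split("/", 3)
    if marker != "components":
        raise ValueError(f"not a component key: {key!r}")
    return EndpointKey(namespace, component, endpoint, int(lease_hex, 16))


parsed = parse_endpoint_key(
    "dynamo/components/Processor/chat/completions:694d967da694ea16"
)
print(parsed)
```

Because revoking a lease removes every key attached to it, grouping parsed keys by `lease_id` shows which endpoints would disappear together, such as the `generate` and `load_metrics` endpoints of the same `VllmWorker` in the sample.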
--- a/docs/architecture/kv_cache_routing.md
+++ b/docs/architecture/kv_cache_routing.md
 <!--
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 SPDX-License-Identifier: Apache-2.0
+-->
-Licensed under the Apache License, Version 2.0 (the "License");
+# KV Cache Routing
-you may not use this file except in compliance with the License.
+This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
+To enable KV cache aware routing start the frontend node like this:
+```
+python -m dynamo.frontend --router-mode kv
+```
-Unless required by applicable law or agreed to in writing, software
+The engine announces when a KV block is created or removed. The Dynamo router finds the worker with the best match for those KV blocks and directs the traffic to that node.
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
->[!NOTE]
+For performance testing, compare a typical workload with `--router-mode random|round-robin` to see if it can benefit from KV-aware routing.
->This information is temporary and will change soon.
-# KV Cache Routing
+The KV-aware routing arguments:
-This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.
-## Architecture
+- `--kv-overlap-score-weight`: Sets the amount of weighting on overlaps with prefix caches, which directly contributes to the prefill cost. A large weight is expected to yield a better TTFT (at the expense of worse ITL). When set to 0, prefix caches are not considered at all (falling back to pure load balancing behavior on the active blocks).
-Dynamo's architecture consists of three key concepts:
+- `--router-temperature`: Sets the temperature when randomly selecting workers to route to via softmax sampling on the router cost logits. Setting it to 0 recovers the deterministic behavior where the min logit is picked.
- **Namespace**: Groups related components (similar to directories in a file system). In our examples, we use the label `dynamo`. This avoids collisions between two different dynamo graphs.
+- `--use-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the kv cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the kv cache hit ratio in each engine. Set false if your backend engine does not emit KV events.
- **Component**: The deployable unit in Dynamo. Components are self-contained and typically map to separate Docker containers. In our examples, we use labels like `VllmWorker `, `Router`, `Processor` for the components. Components can be created in Python or Rust.
- **Endpoint**: Functions attached to components that transform inputs into outputs. Endpoints are discoverable and callable by other components. In our examples we use the label `generate` for most of the endpoints.
-A Dynamo graph is a collection of components that are linked together to form a graph. There are two paths through the graphs. The request path and the response path. For LLMs the request path is single-in (a single message) and the response path is many-out (streamed output).
-A common pattern is to spin up multiple of the same components that serve the same endpoints, for example, when you want to duplicate models to serve more requests. Each endpoint will get a unique identifier and you will have to tell Dynamo how to route requests between these endpoints.
+## Architecture
 Colloquially, we refer to a Dynamo component that serves an endpoint for LLM inference as a **worker**.
@@ -150,32 +144,6 @@ The KVIndexer builds and maintains a global view of cached blocks in a prefix tr
 The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks.
-Example:
-```python
-from dynamo.llm import KvIndexer
-from dynamo.sdk import dynamo_context
-runtime = dynamo_context["runtime"]
-kv_listener = runtime.namespace("dynamo").component("VllmWorker")
-await kv_listener.create_service()
-indexer = KvIndexer(kv_listener, block_size=16)
-indexer.find_matches_for_request([INPUT SEQUENCE OF TOKEN IDs])
-```
-Sample Output:
-```
-{
-	123456789: 10,
-	987654321: 3,
-	543219876: 7,
-}
-```
-```{note}
-This example is designed to help you understand KV cache routing; it won't run outside of the context of dynamo serve. See the examples/ directory for runnable examples.
-```
 ### WorkerMetricsPublisher
 We added a KvMetrics Publisher which sends the following metrics to the KvMetricsAggregator:
 - num_requests_waiting
@@ -191,48 +159,3 @@ Currently, the WorkerMetricsPublisher exists as a Python binding.
 ### KvMetricsAggregator
 The KvMetricsAggregator receives these metrics and aggregates them. It has a method `get_metrics` which returns an object of `AggregatedMetrics`.
-Example:
-```python
-from dynamo.llm import KvMetricsAggregator
-from dynamo.sdk import dynamo_context
-runtime = dynamo_context["runtime"]
-kv_listener = runtime.namespace("dynamo").component("VllmWorker")
-await kv_listener.create_service()
-metrics_aggregator = KvMetricsAggregator(kv_listener)
-for endpoint in metrics_aggregator.get_metrics().endpoints:
-    print("Worker ID: ", endpoint.worker_id)
-    print("GPU Cache Usage: ", endpoint.gpu_cache_usage_perc)
-    print("Number of Requests Waiting: ", endpoint.num_requests_waiting)
-    print("GPU Prefix Cache Hit Rate: ", endpoint.gpu_prefix_cache_hit_rate)
-    print("***")
-```
-Sample Output:
-```
-Worker ID: 123456789
-GPU Cache Usage: 0.5
-Number of Requests Waiting: 2
-GPU Prefix Cache Hit Rate: 0.1
-***
-Worker ID: 987654321
-GPU Cache Usage: 0.5
-Number of Requests Waiting: 1
-GPU Prefix Cache Hit Rate: 0.1
-***
-```
-```{note}
-This example is for building understanding, it will not run outside of the context of dynamo serve. See the examples/ folder for runnable examples.
-```
-### [KV Router](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components/kv_router.py)
-The Router component makes intelligent worker selection decisions:
-1. Receives incoming requests as tokens
-2. Queries the KVIndexer to find potential cache hits across workers
-3. Collects performance metrics from workers (via KvMetricsAggregator)
-4. Uses a cost function to determine the optimal worker for each request
-5. Returns chosen worker
-The processor manages tokenizing the request, sending it to the KV Router and then once it receives a response, directs the request to the selected worker using direct() routing.
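Note on the routing arguments added in this file: the interaction between `--kv-overlap-score-weight` and `--router-temperature` can be illustrated with a small sketch. The cost formula here is an assumption made for illustration (weighted blocks still to prefill plus active blocks) and is not taken from the Dynamo router source. `select_worker` and all of its parameters are hypothetical names. The overlap counts reuse the sample `find_matches_for_request` output from the removed example, and the active block counts are made up.

```python
import math
import random


def select_worker(overlap_blocks, active_blocks, request_blocks,
                  overlap_weight=1.0, temperature=0.0, rng=random):
    """Pick a worker id from per-worker cost logits (lower is better)."""
    logits = {}
    for worker_id, active in active_blocks.items():
        # Blocks that still need prefill after reusing cached prefix blocks.
        new_blocks = request_blocks - overlap_blocks.get(worker_id, 0)
        # Assumed cost: weighted prefill work plus current load.
        logits[worker_id] = overlap_weight * new_blocks + active

    if temperature == 0:
        # Deterministic case: the minimum logit wins.
        return min(logits, key=logits.get)

    # Softmax over negated costs, shifted by the minimum for stability.
    low = min(logits.values())
    weights = [math.exp(-(cost - low) / temperature) for cost in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


overlap = {123456789: 10, 987654321: 3, 543219876: 7}  # matched KV blocks
active = {123456789: 20, 987654321: 5, 543219876: 8}   # made-up load figures

balanced = select_worker(overlap, active, request_blocks=12)
load_only = select_worker(overlap, active, request_blocks=12, overlap_weight=0)
cache_heavy = select_worker(overlap, active, request_blocks=12, overlap_weight=10)
print(balanced, load_only, cache_heavy)
```

With a weight of 0 the overlap term vanishes and the least-loaded worker is chosen, while a large weight favors the worker with the most cached blocks even though it is the busiest. A positive temperature turns the same costs into sampling probabilities instead of a hard minimum.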
--- a/docs/examples/README.md
+++ b/docs/examples/README.md
@@ -19,8 +19,6 @@ If you are a **🧑‍💻 Dynamo Contributor** first follow the instructions in
 You would have to rebuild the dynamo platform images as the code evolves. For more details please look at the [Cloud Guide](../guides/dynamo_deploy/dynamo_cloud.md)
-Export the [Dynamo Base Image](../get_started.md#building-the-dynamo-base-image) you want to use (or built during the prerequisites step) as the `DYNAMO_IMAGE` environment variable.
 ```bash
 export DYNAMO_IMAGE=<your-registry>/<your-image-name>:<your-tag>
 ```

--- a/docs/examples/disagg_skeleton.md
+++ b/docs/examples/disagg_skeleton.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-# Hello World: Aggregated and Disaggregated Deployment Examples
-The `example` directory contains a [hello world example](../examples/hello_world.md) that implements a simplified disaggregated serving architecture used for deploying Large Language Models (LLMs). It removes the LLM related inference code and focuses on how Dynamo handles routing, task queue, and metadata communication between prefill and decode workers.
-## Components
- frontend: A simple HTTP server that handles incoming requests
- processor: A pre/post-processing server that invokes the router
- router: Handles API requests and routes them to appropriate workers based on the specified strategy
- worker: A dummy decode worker
- prefill-worker: A dummy prefill worker
-## Deployment Architectures
-This figure shows an overview of the major components to deploy:
-```
-                                                 +----------------+
-                                                 | prefill worker |-------+
-                                                 |                |       |
-                                                 +----------------+       | pull
-                                                                          v
-+------+      +-----------+      +------------------+    push     +---------------+
-| HTTP |----->| processor |----->|  decode/monolith |------------>| prefill queue |
-|      |<-----|           |<-----|      worker      |             |               |
-+------+      +-----------+      +------------------+             +---------------+
-                  |    ^
-       query best |    | return
-           worker |    | worker_id
-                  |    |         +------------------+
-                  |    +---------|      router      |
-                  +------------->|                  |
-                                 +------------------+
-```
-## The Aggregated Deployment
-This example uses 2 nodes to demo the disagg serving.
- Node 1
-  - Runs NATS and etcd services
-  - Deploys Frontend, Processor and Router
-  - Deploys DummyWorker as the monolith worker
- Node 2
-  - Deploys DummyWorker as the monolith worker
-### Prerequisites
-On Node 1, start required services (etcd and NATS) using [Docker Compose](https://github.com/ai-dynamo/dynamo/blob/main/deploy/docker-compose.yml)
-```bash
-docker compose -f deploy/docker-compose.yml up -d
-```
-### Run the Deployment
-1. Set environment variables for NATS and etcd services
-```bash
-export NATS_SERVER="nats://Node_1_IP_ADDRESS:4222"
-export ETCD_ENDPOINTS="http://Node_1_IP_ADDRESS:2379"
-```
-2. Launch Frontend, Processor and Router services:
-```
-cd dynamo/examples/hello_world/disagg_skeleton
-dynamo serve components.graph:Frontend
-```
-3. Open a new terminal on Node 1 and deploy Worker service
-```
-export NATS_SERVER="nats://Node_1_IP_ADDRESS:4222"
-export ETCD_ENDPOINTS="http://Node_1_IP_ADDRESS:2379"
-cd dynamo/examples/hello_world/disagg_skeleton
-dynamo serve components.worker:DummyWorker
-```
-4. Go to Node 2 and start Worker service as in step 3.
-Now you should see both workers are ready in Node 1's terminal.
-5. Query the Frontend with the following two prompts. The router assigns a different worker to each prompt, which you can observe in the responses.
- `Response: {"worker_output":"Tell me a joke_GeneratedBy_NODE1HOSTNAME","request_id":"id_number"}`
- `Response: {"worker_output":"Which team won 2020 World Series_GeneratedBy_NODE2HOSTNAME","request_id":"id_number"}`
-```
-curl -X 'POST' \
-  'http://localhost:8000/generate' \
-  -H 'accept: text/event-stream' \
-  -H 'Content-Type: application/json' \
-  -d '{
-  "prompt": "Tell me a joke",
-  "request_id":"id_number"
-}'
-curl -X 'POST' \
-  'http://localhost:8000/generate' \
-  -H 'accept: text/event-stream' \
-  -H 'Content-Type: application/json' \
-  -d '{
-  "prompt": "Which team won 2020 World Series",
-  "request_id":"id_number"
-}'
-```
-6. Then modify the prompt; prompts with similar prefixes are routed to the same worker due to the routing algorithm used in this demo. For example, the following query is routed to the worker that processed the `Tell me a joke` prompt.
-```
-curl -X 'POST' \
-  'http://localhost:8000/generate' \
-  -H 'accept: text/event-stream' \
-  -H 'Content-Type: application/json' \
-  -d '{
-  "prompt": "Tell me a fact",
-  "request_id":"id_number"
-}'
-```
- `Response: {"worker_output":"Tell me a fact_GeneratedBy_NODE1HOSTNAME","request_id":"id_number"}`
-## The Disaggregated Deployment
-This example uses 3 nodes to demo the disagg serving.
- Node 1
-  - Runs NATS and etcd services
-  - Deploys Frontend and Processor
-  - Deploys DummyWorker as the decode worker
- Node 2
-  - Deploys DummyWorker as the decode worker
- Node 3
-  - Deploys Prefill as the prefill worker
-### Run the Deployment
-1. Repeat steps 1 to 4 to deploy the Frontend, Processor, Router and 2 Workers as decode workers
-2. Go to Node 3 and start the prefill worker.
-```
-export NATS_SERVER="nats://Node_1_IP_ADDRESS:4222"
-export ETCD_ENDPOINTS="http://Node_1_IP_ADDRESS:2379"
-cd dynamo/examples/hello_world/disagg_skeleton
-dynamo serve components.prefill_worker:PrefillWorker
-```
-3. Query the Frontend. This time the decode workers push requests to the prefill queue, and the prefill worker pulls tasks from the queue to simulate the prefill task. The actual prefill is skipped in this demo.
-```
-curl -X 'POST' \
-  'http://localhost:8000/generate' \
-  -H 'accept: text/event-stream' \
-  -H 'Content-Type: application/json' \
-  -d '{
-  "prompt": "This is prefill disagg serving example",
-  "request_id":"12345"
-}'
-```
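Note on the removed skeleton example: the push and pull flow between decode workers and the prefill worker can be sketched with an in-process queue. This is a simplified model under stated assumptions and not the example's code. `prefill_queue`, `decode_worker`, and `prefill_worker` are hypothetical names, and a local `queue.Queue` stands in for the shared prefill queue that the real deployment reaches over NATS.

```python
import queue

# Stand-in for the shared prefill queue.
prefill_queue = queue.Queue()


def decode_worker(request_id, prompt):
    """Push a remote prefill request instead of prefilling locally."""
    prefill_queue.put({"request_id": request_id, "prompt": prompt})


def prefill_worker():
    """Pull every pending task and return the handled request ids."""
    handled = []
    while True:
        try:
            task = prefill_queue.get_nowait()
        except queue.Empty:
            return handled
        # The actual prefill is skipped, as in the demo.
        handled.append(task["request_id"])


decode_worker("12345", "This is prefill disagg serving example")
handled = prefill_worker()
print(handled)
```

The decode side never addresses a specific prefill worker, which is why prefill workers can be added on other nodes without any change to the decode workers.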
--- a/docs/examples/llm_deployment.md
+++ b/docs/examples/llm_deployment.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-# LLM Deployment Examples
-This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.
-## Use the Latest Release
-We recommend using the latest stable release of dynamo to avoid breaking changes:
-[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
-You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
-```bash
-git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
-```
-## Components
- workers: Prefill and decode workers that handle the actual LLM inference
- router: Handles API requests and routes them to appropriate workers based on the specified strategy
- frontend: OpenAI-compatible HTTP server that handles incoming requests
-## Deployment Architectures
-### Aggregated
-Single-instance deployment where both prefill and decode are done by the same worker.
-### Disaggregated
-Distributed deployment where prefill and decode are done by separate workers that can scale independently.
-```mermaid
-sequenceDiagram
-    participant D as VllmWorker
-    participant Q as PrefillQueue
-    participant P as PrefillWorker
-    Note over D: Request is routed to decode
-    D->>D: Decide if prefill should be done locally or remotely
-        D->>D: Allocate KV blocks
-        D->>Q: Put RemotePrefillRequest on the queue
-        P->>Q: Pull request from the queue
-        P-->>D: Read cached KVs from Decode
-        D->>D: Decode other requests
-        P->>P: Run prefill
-        P-->>D: Write prefilled KVs into allocated blocks
-        P->>D: Send completion notification
-        Note over D: Notification received when prefill is done
-        D->>D: Schedule decoding
-```
-## Getting Started
-1. Choose a deployment architecture based on your requirements
-2. Configure the components as needed
-3. Deploy using the provided scripts
-### Prerequisites
-Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml)
-```bash
-docker compose -f deploy/metrics/docker-compose.yml up -d
-```
-### Build the container image for your platform
-```bash
-# On an x86 machine
-./container/build.sh --framework VLLM
-# On an ARM machine (ex: GB200)
-./container/build.sh --framework VLLM --platform linux/arm64
-```
-```{note}
-Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to have performance issues and to require extensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
-You can tune the number of parallel build jobs for building VLLM from source
-on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
-For example, on an ARM machine with low system resources:
-`./container/build.sh --framework VLLM --platform linux/arm64 --build-arg VLLM_MAX_JOBS=2`
-For example, on a GB200 which has very high CPU cores and memory resource:
-`./container/build.sh --framework VLLM --platform linux/arm64 --build-arg VLLM_MAX_JOBS=64`
-When vLLM has pre-built ARM wheels published, this process can be improved.
-```
-### Run the container you have built
-```
-./container/run.sh -it --framework VLLM
-```
-## Run Deployment
-This figure shows an overview of the major components to deploy:
-```
-                                                 +----------------+
-                                          +------| prefill worker |-------+
-                                   notify |      |                |       |
-                                 finished |      +----------------+       | pull
-                                          v                               v
-+------+      +-----------+      +------------------+    push     +---------------+
-| HTTP |----->| processor |----->| decode/monolith  |------------>| prefill queue |
-|      |<-----|           |<-----|      worker      |             |               |
-+------+      +-----------+      +------------------+             +---------------+
-                  |    ^                  |
-       query best |    | return           | publish kv events
-           worker |    | worker_id        v
-                  |    |         +------------------+
-                  |    +---------|     kv-router    |
-                  +------------->|                  |
-                                 +------------------+
-```
-```{note}
-The planner component is enabled by default for all deployment architectures but is set to no-op mode. This means the planner observes metrics but doesn't take scaling actions. To enable active scaling, you can add `--Planner.no-operation=false` to your `dynamo serve` command.
-For more details, see [Planner Architecture Overview](../architecture/planner_intro.rst).
-```
--- a/docs/examples/multinode.md
+++ b/docs/examples/multinode.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-# Multinode Examples
-## Use the Latest Release
-We recommend using the latest stable release of dynamo to avoid breaking changes:
-[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
-You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
-```bash
-git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
-```
-## Single node sized models
-You can deploy dynamo on multiple nodes via NATS/ETCD based discovery and communication. Here's an example of deploying disaggregated serving on 3 nodes using `nvidia/Llama-3.1-405B-Instruct-FP8`. Each node must be properly configured with Infiniband and/or RoCE for communication between decode and prefill workers.
-##### Disaggregated Deployment with KV Routing
- Node 1: Frontend, Processor, Router, Decode Worker
- Node 2: Prefill Worker
- Node 3: Prefill Worker
-Note that this can be easily extended to more nodes. You can also run the Frontend, Processor, and Router on a separate CPU only node if you'd like as long as all nodes have access to the NATS/ETCD endpoints!
-**Step 1**: Start NATS/ETCD on your head node. Ensure you have the correct firewall rules to allow communication between the nodes; the NATS/ETCD endpoints must be accessible by all other nodes.
-```bash
-# node 1
-docker compose -f deploy/metrics/docker-compose.yml up -d
-```
-**Step 2**: Create the inference graph for this node. Here we use the `agg_router.py` (even though we are doing disaggregated serving) graph because we want the `Frontend`, `Processor`, `Router`, and `VllmWorker` to spin up (we spin up the other decode worker and prefill worker separately on different nodes later).
-```python
-# graphs/agg_router.py
-Frontend.link(Processor).link(Router).link(VllmWorker)
-```
-**Step 3**: Create a configuration file for this node. We've provided a sample one for you in `configs/multinode-405b.yaml` for the 405B model. Note that we still include the `PrefillWorker` component in the configuration file even though we are not using it on node 1. This is because we can reuse the same configuration file on all nodes and just spin up individual workers on the other ones.
-**Step 4**: Start the frontend, processor, router, and VllmWorker on node 1.
-```bash
-# node 1
-cd $DYNAMO_HOME/examples/llm
-dynamo serve graphs.agg_router:Frontend -f ./configs/multinode-405b.yaml
-```
-**Step 5**: Start the first prefill worker on node 2.
-Since we only want to start the `PrefillWorker` on node 2, you can simply run just the PrefillWorker component directly with the configuration file from before.
-```bash
-# node 2
-export NATS_SERVER='<your-nats-server-address>' # note this should start with nats://...
-export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
-cd $DYNAMO_HOME/examples/llm
-dynamo serve components.prefill_worker:PrefillWorker -f ./configs/multinode-405b.yaml
-```
-**Step 6**: Start the second prefill worker on node 3.
-```bash
-# node 3
-export NATS_SERVER='<your-nats-server-address>' # note this should start with nats://...
-export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
-cd $DYNAMO_HOME/examples/llm
-dynamo serve components.prefill_worker:PrefillWorker -f ./configs/multinode-405b.yaml
-```
-**Step 7**: [Optional] Start more decode workers on other nodes
-This example can be extended to more nodes as well. For example, if you want to spin up another decode worker, you can use
-```bash
-# node X
-export NATS_SERVER='<your-nats-server-address>' # note this should start with nats://...
-export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
-cd $DYNAMO_HOME/examples/llm
-dynamo serve components.worker:VllmWorker -f ./configs/multinode-405b.yaml --service-name VllmWorker
-```
-Note the use of `--service-name`. This only spins up the worker that you are requesting and ignores any `depends` statements.
-###### Client
-In another terminal:
-```bash
-# this test request has around 200 tokens isl
-curl <node1-ip>:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -H "Accept: text/event-stream" \
-  -d '{
-    "model": "nvidia/Llama-3.1-405B-Instruct-FP8",
-    "messages": [
-      {
-        "role": "user",
-        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
-      }
-    ],
-    "stream": true,
-    "max_tokens": 300
-  }'
-```
-#### Multi-node sized models
-Multinode model support is coming soon. You can track progress [here](https://github.com/ai-dynamo/dynamo/issues/513)!
-##### Aggregated Deployment
-The steps for aggregated deployment of multi-node sized models are similar to those for
-single-node sized models, except that you first need to configure the nodes
-to be interconnected according to the framework's multi-node deployment guide.
-In the example below, vLLM is used as the framework to serve the `DeepSeek-R1` model
-with tensor parallelism of 16 on two H100x8 nodes.
-**Step 1**: On each of the nodes, set up a Ray cluster so that vLLM can access the resources
-collectively:
-```bash
-# head node
-ray start --head --port=6379
-# example output and keep note of the IP address of the head node
-# Local node IP: <head-node-address>
-# set vLLM env arg
-export VLLM_HOST_IP=<head-node-address>
-# other node
-ray start  --address=<head-node-address>:6379
-export VLLM_HOST_IP=<current-node-address>
-# verify the accessibility by checking aggregated GPU count shown in ray status
-ray status
-# Expected/Sample output for 2 nodes:
-# ```bash
-# ======== Autoscaler status: 2025-04-16 15:35:42.751688 ========
-# Node status
-# ---------------------------------------------------------------
-# Active:
-#  1 node_<hash_1>
-#  1 node_<hash_2>
-# Pending:
-#  (no pending nodes)
-# Recent failures:
-#  (no failures)
-# Resources
-# ---------------------------------------------------------------
-# Usage:
-# XXX CPU
-# XXX GPU
-# XXX memory
-# XXX object_store_memory
-# Demands:
-#  (no resource demands)
-```
-**Step 2**: On the head node, follow the [LLM deployment examples](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/README.md) to
-set up the Dynamo deployment for aggregated serving, using the configuration file
-`configs/multinode_agg_r1.yaml` for DeepSeek-R1:
-```bash
-cd $DYNAMO_HOME/examples/llm
-dynamo serve graphs.agg:Frontend -f ./configs/multinode_agg_r1.yaml
-```
-###### Client
-In another terminal, you can send the same curl request as described above but
-with `"model": "deepseek-ai/DeepSeek-R1"`
-```bash
-# this test request has around 200 tokens isl
-curl <node1-ip>:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -H "Accept: text/event-stream" \
-  -d '{
-    "model": "deepseek-ai/DeepSeek-R1",
-    "messages": [
-      {
-        "role": "user",
-        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
-      }
-    ],
-    "stream": true,
-    "max_tokens": 300
-  }'
-```
-##### Disaggregated Deployment
-In this example, we deploy two replicas of the model (one prefill worker
-and one decode worker). We use 4 H100x8 nodes and group every two of them
-into one Ray cluster in the same way as described in aggregated deployment.
-However, the etcd and NATS servers run on only
-one node, which we treat as the head node of the whole deployment.
-Note that if you are starting the etcd server directly instead of using `docker compose`,
-you should pass additional arguments so that it is discoverable from the other nodes.
-```bash
-etcd --advertise-client-urls http://<head-node-ip>:2379 --listen-client-urls http://<head-node-ip>:2379,http://127.0.0.1:2379
-```
-**Step 1**: On every two nodes, set up a Ray cluster as described in
-[aggregated deployment](#aggregated-deployment). After that, you should have
-two independent Ray clusters, each with access to 16 GPUs.
-**Step 2**: Start the deployment by running different flavors of `dynamo serve`
-on one node of each Ray cluster, using the configuration file
-`configs/mutinode_disagg_r1.yaml`.
-For decode, use the command below; this node is the entry point of
-the whole deployment. In other words, requests should be sent to the IP
-address of this node.
-```bash
-# if not head node
-export NATS_SERVER='nats://<nats-server-ip>:4222'
-export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
-cd $DYNAMO_HOME/examples/llm
-dynamo serve graphs.agg:Frontend -f ./configs/mutinode_disagg_r1.yaml
-```
-For prefill:
-```bash
-# if not head node
-export NATS_SERVER='nats://<nats-server-ip>:4222'
-export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
-cd $DYNAMO_HOME/examples/llm
-dynamo serve components.prefill_worker:PrefillWorker -f ./configs/mutinode_disagg_r1.yaml
-```
-###### Client
-In another terminal, you can send the same curl request as described in
-[aggregated deployment](#aggregated-deployment), addressed to the IP of
-the decode node.
--- a/docs/examples/trtllm.md
+++ b/docs/examples/trtllm.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-# LLM Deployment Examples using TensorRT-LLM
-This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
-## Use the Latest Release
-We recommend using the latest stable release of dynamo to avoid breaking changes:
-[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
-You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
-```bash
-git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
-```
-## Deployment Architectures
-See [Deployment Architectures](llm_deployment.md#deployment-architectures) to learn about the general idea of the architecture.
-Note that this TensorRT-LLM version does not support all the options yet.
-```{note}
-TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can only configure the deployment to always use aggregated or always use disaggregated serving.
-```
-## Getting Started
-1. Choose a deployment architecture based on your requirements
-2. Configure the components as needed
-3. Deploy using the provided scripts
-### Prerequisites
-Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml)
-```bash
-docker compose -f deploy/metrics/docker-compose.yml up -d
-```
-### Build docker
-#### Step 1: Build TensorRT-LLM base container image
-Because of the known issue of C++11 ABI compatibility within the NGC pytorch container, we rebuild TensorRT-LLM from source.
-See [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html) for more information.
-Use the helper script to build a TensorRT-LLM container base image. The script uses a specific commit id from TensorRT-LLM main branch.
-```bash
-# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
-apt-get update && apt-get -y install git git-lfs
-# The script uses python packages like docker-squash to squash image
-# layers within trtllm base image
-DEBIAN_FRONTEND=noninteractive TZ=America/Los_Angeles apt-get -y install python3 python3-pip python3-venv
-./container/build_trtllm_base_image.sh
-```
-See [here](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#option-1-build-tensorrt-llm-in-one-step) for more details on building from source.
-If you already have a TensorRT-LLM container image, you can skip this step.
-#### Step 2: Build the Dynamo container
-```
-# On an x86 machine:
-./container/build.sh --framework tensorrtllm
-# On an ARM machine:
-./container/build.sh --framework tensorrtllm --platform linux/arm64
-```
-This build script internally points to the base container image built in step 1. If you skipped the previous step because you already have a container image available, you can run the build script with that image as the base.
-```bash
-# Build dynamo image with other TRTLLM base image.
-./container/build.sh --framework TENSORRTLLM --base-image <trtllm-base-image> --base-image-tag <trtllm-base-image-tag>
-```
-### Run container
-```
-./container/run.sh --framework tensorrtllm -it
-```
-## Run Deployment
-This figure shows an overview of the major components to deploy:
-```
-+------+      +-----------+      +------------------+             +---------------+
-| HTTP |----->| processor |----->|      Worker      |------------>|     Prefill   |
-|      |<-----|           |<-----|                  |<------------|     Worker    |
-+------+      +-----------+      +------------------+             +---------------+
-                  |    ^                  |
-       query best |    | return           | publish kv events
-           worker |    | worker_id        v
-                  |    |         +------------------+
-                  |    +---------|     kv-router    |
-                  +------------->|                  |
-                                 +------------------+
-```
-```{note}
-The above architecture illustrates all the components. The final components that get spawned depend upon the chosen graph.
-```
-### Example architectures
-#### Aggregated serving
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
-```
-#### Aggregated serving with KV Routing
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.agg_router:Frontend -f ./configs/agg_router.yaml
-```
-#### Disaggregated serving
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
-```
-We define `TRTLLM_USE_UCX_KVCACHE` so that TRTLLM uses UCX for transferring the KV
-cache between the context and generation workers.
-#### Disaggregated serving with KV Routing
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.disagg_router:Frontend -f ./configs/disagg_router.yaml
-```
-We define `TRTLLM_USE_UCX_KVCACHE` so that TRTLLM uses UCX for transferring the KV
-cache between the context and generation workers.
-### Client
-See the [client](llm_deployment.md#client) section to learn how to send requests to the deployment.
-### Close deployment
-See the [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) section to learn how to close the deployment.
-Remaining tasks:
- [x] Add support for the disaggregated serving.
- [ ] Add integration test coverage.
- [ ] Add instructions for benchmarking.
- [ ] Add multi-node support.
- [ ] Merge the code base with llm example to reduce the code duplication.
- [ ] Use processor from dynamo-llm framework.
- [ ] Enable NIXL integration with TensorRT-LLM once available. Currently, TensorRT-LLM uses UCX to transfer KV cache.
--- a/docs/get_started.md
+++ b/docs/get_started.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-# Getting Started
-## Development Environment
-This section describes how to set up your development environment.
-### Recommended Setup: Using Dev Container
-We recommend using our pre-configured development container:
-1. Install prerequisites:
-   - [Docker](https://www.docker.com/products/docker-desktop)
-   - [Visual Studio Code](https://code.visualstudio.com/)
-   - [Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
-2. Get the code:
-   ```bash
-   git clone https://github.com/ai-dynamo/dynamo.git
-   cd dynamo
-   ```
-3. Open in Visual Studio Code:
-   1. Launch Visual Studio Code
-   2. Click the button in the bottom-left corner
-   3. Select **Reopen in Container**
-Visual Studio Code builds and starts a container with all necessary dependencies for Dynamo development.
-### Alternative Setup: Manual Installation
-If you don't want to use the dev container, you can set the environment up manually:
-1. Ensure you have:
-   - Ubuntu 24.04 (recommended)
-   - x86_64 CPU
-   - Python 3.x
-   - Git
-   See [Support Matrix](support_matrix.md) for more information.
-2. **If you plan to use vLLM or SGLang**, you must also install:
-   - etcd
-   - NATS.io
-   Before starting dynamo, run both etcd and NATS.io in separate processes.
-3. Install required system packages:
-   ```bash
-   apt-get update
-   DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0
-   ```
-4. Set up the Python environment:
-   ```bash
-   python3 -m venv venv
-   source venv/bin/activate
-   ```
-5. Install Dynamo:
-   ```bash
-   pip install "ai-dynamo[all]"
-   ```
-> [!Important]
-> To ensure compatibility, use the examples in the release branch or tag that matches the version you installed.
-## Building the Dynamo Base Image
-Deploying your Dynamo pipelines to Kubernetes requires you to build and push a Dynamo base image to your container registry.
-You can use any private container registry of your choice, including:
- [Docker Hub](https://hub.docker.com/)
- [NVIDIA NGC Container Registry](https://catalog.ngc.nvidia.com/)
-To build it:
-```bash
-./container/build.sh
-docker tag dynamo:latest-vllm <your-registry>/dynamo-base:latest-vllm
-docker login <your-registry>
-docker push <your-registry>/dynamo-base:latest-vllm
-```
-This documentation describes these frameworks:
- `--framework vllm` build:
-   See [LLM Deployment Examples](examples/llm_deployment.md).
- `--framework tensorrtllm` build:
-   See [TRTLLM Deployment Examples](examples/trtllm.md).
-After building, use this image by setting the `DYNAMO_IMAGE` environment variable to point to your built image:
-```bash
-export DYNAMO_IMAGE=<your-registry>/dynamo-base:latest-vllm
-```
-## Running and Interacting with an LLM Locally
-Dynamo supports several backends, including `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
-Use the example commands below to launch a model.
-### Example Command
-```bash
-python -m dynamo.frontend [--http-port 8080]
-python -m dynamo.vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-```
-```bash
-? User › Hello, how are you?
-✔ User · Hello, how are you?
-Okay, so I'm trying to figure out how to respond to the user's greeting.
-They said, "Hello, how are you?" and then followed it with "Hello! I'm just a program, but thanks for asking."
-Hmm, I need to come up with a suitable reply. ...
-```
-## LLM Serving
-Dynamo provides a simple way to spin up a local set of inference components including:
- **OpenAI-compatible Frontend**:
-   High-performance OpenAI compatible http api server written in Rust.
- **Basic and Kv Aware Router**:
-   Route and load balance traffic to a set of workers.
- **Workers**:
-   Set of pre-configured LLM serving engines.
-To run a minimal configuration, use a pre-configured example.
-### Start Dynamo Distributed Runtime Services
-To start the Dynamo Distributed Runtime services the first time:
-```bash
-docker compose -f deploy/docker-compose.yml up -d
-```
-### Start Dynamo LLM Serving Components
-[Explore the VLLM Example](../components/backends/vllm/README.md)
-## Local Development
-If you use VS Code or Cursor, use the `.devcontainer` folder built on [Microsoft's Extension](https://code.visualstudio.com/docs/devcontainers/containers).
-For instructions, see the Dynamo repository's [devcontainer README](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md).
-Otherwise, to develop locally, we recommend working inside of the container:
-```bash
-./container/build.sh
-./container/run.sh -it --mount-workspace
-cargo build --release
-mkdir -p /workspace/deploy/dynamo/sdk/src/dynamo/sdk/cli/bin
-cp /workspace/target/release/dynamo-run /workspace/deploy/dynamo/sdk/src/dynamo/sdk/cli/bin
-uv pip install -e .
-export PYTHONPATH=$PYTHONPATH:/workspace/deploy/dynamo/sdk/src:/workspace/components/planner/src
-```
-### Conda Environment
-Alternatively, use a Conda environment:
-```bash
-conda activate <ENV_NAME>
-pip install nixl # Or install https://github.com/ai-dynamo/nixl from source
-cargo build --release
-# To install ai-dynamo-runtime from source
-cd lib/bindings/python
-pip install .
-cd ../../../
-pip install .[all]
-# To test
-docker compose -f deploy/docker-compose.yml up -d
-python -m dynamo.frontend [--http-port 8080]
-python -m dynamo.vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-```
--- a/docs/guides/backend.md
+++ b/docs/guides/backend.md
@@ -2,680 +2,108 @@
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
 All rights reserved.
 SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
 -->
 # Writing Python Workers in Dynamo
-This guide explains how to create your own Python worker in Dynamo and deploy
+This guide explains how to create your own Python worker in Dynamo.
-it via `dynamo serve` or `dynamo deploy`, covering basic concepts as well as
-advanced features like enabling KV routing and disaggregated serving.
-For detailed information about `dynamo serve` infrastructure, see the
-[Dynamo SDK Docs](../API/sdk.md).
-For a guide that walks through how to launch a vLLM-based worker with
-implementation of Disaggregated Serving and KV-Aware Routing included,
-see the [Dynamo Serve Guide](../../docs/guides/dynamo_serve.md).
-## Basic Concepts
-When deploying a python-based worker with `dynamo serve` or `dynamo deploy`, it is
-a Python class based definition that requires a few key decorators to get going:
- `@service`: used to define a worker class
- `@endpoint`: marks methods that can be called by other workers or clients
-For more detailed information on these concepts, see the
-[Dynamo SDK Docs](../API/sdk.md).
-### Worker Skeleton
+The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo.
-Here is the rough outline of what a worker may look like in its simplest form:
+The Python file must do three things:
+1. Decorate a function to get the runtime
+2. Register on the network
+3. Attach a request handler
-```python
-from dynamo.sdk import endpoint, service
-@service(
-    dynamo={
-        "namespace": "your_namespace",
-    },
-)
-class YourWorker:
-    # Worker implementation
-    # ...
-    @endpoint()
-    async def your_endpoint(self, request: RequestType) -> AsyncIterator[ResponseType]:
-        # Endpoint Implementation
-        pass
 ```
+import asyncio
+import uvloop
+from dynamo.llm import ModelType, register_llm
+from dynamo.runtime import DistributedRuntime, dynamo_worker
-Workers in Dynamo are identified by a `namespace/component/endpoint` naming schema.
+# 1. Decorate a function to get the runtime
-When addressing this worker's endpoint with the `namespace/component/endpoint` schema
+#
-based on the definitions above, it would be: `your_namespace/YourWorker/your_endpoint`:
+@dynamo_worker(static=False)
+async def worker(runtime: DistributedRuntime):
- `namespace="your_namespace"`: Defined in the `@service` decorator
- `component="YourWorker"`: Defined by the Python Class name
- `endpoint="your_endpoint"`: Defined by the `@endpoint` decorator, or by default the name of the function being decorated.
-For more details about service configuration, resource management, and dynamo endpoints,
-see the [Dynamo SDK Docs](../API/sdk.md).
-### Request/Response Types
-Request/Response types of endpoints can be defined arbitrarily for your use case's needs, as long as
-the client calling your worker matches the expectations.
-Define your request and response types using Pydantic models:
-```python
-from pydantic import BaseModel
-class RequestType(BaseModel):
+    # 2. Register ourselves on the network
-    text: str
+    #
-    # Add other fields as needed
+    component = runtime.namespace("namespace").component("component")
+    await component.create_service()
-class ResponseType(BaseModel):
+    model_path = "Qwen/Qwen3-0.6B" # or "/data/models/Qwen3-0.6B"
-    text: str
+    model_type = ModelType.Backend
-    # Add other fields as needed
+    endpoint = component.endpoint("endpoint")
-```
+    # Optional last param to register_llm is model_name. If not present derives it from model_path
+    await register_llm(model_type, endpoint, model_path)
-```python
-from vllm.entrypoints.openai.protocol import ChatCompletionRequest
-class YourLLMWorker:
-    @endpoint(name="my_chat_completions_endpoint")
-    async def generate(self, request: ChatCompletionRequest):
-        # Endpoint Implementation
-        pass
-```
-## Basic Worker Example
-Here's a simple example of a worker that takes text in and streams text out
-via custom RequestType/ResponseType definitions:
-```python
-# basic_worker.py
-# This can be run standalone with `dynamo serve basic_worker:YourWorker`
-import logging
-from pydantic import BaseModel
-from dynamo.sdk import endpoint, service
-logger = logging.getLogger(__name__)
-class RequestType(BaseModel):
-    text: str
-class ResponseType(BaseModel):
-    text: str
-@service(
-    dynamo={
-        "namespace": "your_namespace",
-    }
-)
-class YourWorker:
-    def __init__(self) -> None:
-        logger.info("Starting worker...")
-    @endpoint()
-    async def generate(self, request: RequestType):
-        """Generate tokens and stream them back"""
-        logger.info(f"Worker endpoint received: {request.text}")
-        for token in request.text.split():
-            yield ResponseType(text=token).model_dump_json()
-```
-To see a minimal worker example like the above used in a larger pipeline of
+    # Initialize your engine here
-components, see the `dynamo serve`
+    # engine = ...
-[Hello World example](../../examples/hello_world).
-### Client Example
+    # 3. Attach request handler
+    #
+    await endpoint.serve_endpoint(RequestHandler(engine).generate)
-Here's a simple example of a client that directly calls the example
+class RequestHandler:
-worker above through Dynamo without any intermediate services:
-```python
+    def __init__(self, engine):
-import asyncio
+        ...
-from pydantic import BaseModel
-from dynamo.runtime import dynamo_worker, DistributedRuntime
-# These could also be imported from a shared file/definition
+    async def generate(self, request):
-class RequestType(BaseModel):
+        # Call the engine
-    text: str
+        # yield result dict
+        ...
-class ResponseType(BaseModel):
-    text: str
-@dynamo_worker()
-async def client_worker(runtime: DistributedRuntime):
-    # Get a client to the worker endpoint from the distributed runtime
-    client = await runtime.namespace("your_namespace").component("YourWorker").endpoint("generate").client()
-    # Create a request
-    request = RequestType(text="Hello, Dynamo!")
-    # Call the dynamo endpoint exposed by the worker
-    responses = await client.generate(request.model_dump_json())
-    async for response in responses:
-        print(response)
 if __name__ == "__main__":
-    asyncio.run(client_worker())
+    uvloop.install()
-```
+    asyncio.run(worker())
-If there is an OpenAI `http` service in front of your worker and the worker
-is defined to accept ChatCompletions-like requests, you could also use an
-OpenAI-based client (or `curl`) that sends requests to the OpenAI HTTP Service,
-and internally these requests would be routed to the attached worker endpoints instead.
-In more advanced scenarios where your worker may operate on some other intermediate format
-that may not directly match an OpenAI-like format, you could setup a separate processor worker
-that does something like the following:
- Take in OpenAI Chat Completions requests from the HTTP service
- Convert requests from Chat Completions format to the RequestType format your worker expects
- Forward requests to the worker(s)
- Convert responses from the worker's ResponseType back into Chat Completions response format
- Forward responses back to client
-This advanced scenario of a separate
-[OpenAI Processor worker](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components/processor.py)
-is demonstrated in this
-[vLLM example](https://github.com/ai-dynamo/dynamo/tree/main/examples/llm).
-For a more minimal example of deploying a pipeline of components with a custom
-API that your client can communicate with, see the
-[Hello World example](../../examples/hello_world).
-## Advanced Features
-### KV Routing for LLMs
-KV-aware routing is a powerful feature of Dynamo that optimizes for routing
-requests to specific workers while minimizing a specific KV-cache based cost function.
-In its simplest form, all a worker needs to do to enable KV-aware routing is to
-publish KV metrics through the `WorkerMetricsPublisher`, which is consumed
-by a Dynamo KV Router through the `KvMetricsAggregator`:
-```python
-# kv_metrics_worker.py
-# This can be run standalone with `dynamo serve kv_metrics_worker:YourWorker`
-import logging
-import random
-from pydantic import BaseModel
-from dynamo.llm import (
-    WorkerMetricsPublisher,
-    ForwardPassMetrics,
-    KvStats,
-    SpecDecodeStats,
-    WorkerStats
-)
-from dynamo.sdk import endpoint, service, dynamo_context
-logger = logging.getLogger(__name__)
-class RequestType(BaseModel):
-    text: str
-class ResponseType(BaseModel):
-    text: str
-@service(
-    dynamo={
-        "namespace": "your_namespace",
-    }
-)
-class YourWorker:
-    def __init__(self):
-        # Initialize metrics publisher from Dynamo
-        self.component = dynamo_context["component"]
-        self.metrics_publisher = WorkerMetricsPublisher()
-        # Register an endpoint for consumers of the KV Metrics
-        # (KvMetricsAggregator) to listen/gather on.
-        self.metrics_publisher.create_endpoint(self.component)
-        # Initialize some metrics for the worker/class to track
-        self.request_active_slots = 0
-        self.request_total_slots = 1024
-        self.kv_active_blocks = 0
-        self.kv_total_blocks = 1024
-        self.num_requests_waiting = 0
-        self.gpu_cache_usage_perc = 0.0
-        self.gpu_prefix_cache_hit_rate = 0.0
-        worker_stats = WorkerStats(
-            data_parallel_rank=None,
-            self.request_active_slots,
-            self.request_total_slots,
-            self.num_requests_waiting
-        )
-        kv_stats = KvStats(
-            self.kv_active_blocks,
-            self.kv_total_blocks,
-            self.gpu_cache_usage_perc,
-            self.gpu_prefix_cache_hit_rate
-        )
-        # Publish some initial metrics to register
-        # this worker as a candidate for KV Routing.
-        metrics = ForwardPassMetrics(
-            worker_stats=worker_stats,
-            kv_stats=kv_stats,
-            spec_decode_stats=None,
-        )
-        self.metrics_publisher.publish(metrics)
-    def publish_kv_metrics(self):
-        # Populate the frequently changing metrics with random data for
-        # demonstration. These values should be tracked by the implementation,
-        # or queried from the underlying inference framework.
-        self.kv_active_blocks = random.randint(0, 1024)
-        self.num_requests_waiting = random.randint(0, 100)
-        self.gpu_cache_usage_perc = random.uniform(0, 1.0)
-        self.gpu_prefix_cache_hit_rate = random.uniform(0, 1.0)
-        # Publish the metrics with the current state
-        worker_stats = WorkerStats(
-            data_parallel_rank=None,
-            self.request_active_slots,
-            self.request_total_slots,
-            self.num_requests_waiting
-        )
-        kv_stats = KvStats(
-            self.kv_active_blocks,
-            self.kv_total_blocks,
-            self.gpu_cache_usage_perc,
-            self.gpu_prefix_cache_hit_rate
-        )
-        metrics = ForwardPassMetrics(
-            worker_stats=worker_stats,
-            kv_stats=kv_stats,
-            spec_decode_stats=None,
-        )
-        self.metrics_publisher.publish(metrics)
-    @endpoint()
-    async def generate(self, request: RequestType):
-        """Generate tokens, update KV Cache metrics, and stream the tokens back"""
-        # Increment the number of active requests on receiving one
-        self.request_active_slots += 1
-        logger.info(f"Worker endpoint received: {request.text}")
-        # Simulate each step of token generation
-        for token in request.text.split():
-            # Update the metrics with the current state
-            self.publish_kv_metrics()
-            print("Returning token:", token)
-            yield ResponseType(text=token).model_dump_json()
-        # Decrement the number of active requests when complete with one
-        self.request_active_slots -= 1
 ```
-The granularity at which metrics are published is up to the backend/worker implementation.
-For example, you may want to update metrics on every single generation step during token
-generation, or you may only want to update once per request, depending on your use case.
-Assuming long generation time or long output token sequence lengths, it would be more
-accurate to publish metrics at every generation step.
-With the worker publishing KV metrics, you should now be able to connect it
-to a KV Router through the `KvMetricsAggregator`.
-These metrics can then be inputs to a cost function to determine which
-of the available worker's the request should be routed to.
-For a [python-based KV Router](../../examples/llm/components/kv_router.py)
-implementation, the router is like any other worker, and it can expose
-an endpoint that can do arbitrary things based on your use case.
-For example, you can initialize the `KvMetricsAggregator` and `KvIndexer`
+The `model_path` can be:
-in your class implementation:
+- A HuggingFace repo ID, optionally prefixed with `hf://`. It is downloaded and cached locally.
+- The path to a checkout of a HuggingFace repo - any folder containing safetensor files as well as `config.json`, `tokenizer.json` and `tokenizer_config.json`.
-```python
+- The path to a GGUF file, if your engine supports that.
-@service(
-    dynamo={
-        "namespace": "your_namespace",
-    },
-)
-class Router:
-    # ...
-    @async_on_start
-    async def async_init(self):
-        self.runtime = dynamo_context["runtime"]
-        # Initialize a listener/aggregator for collecting KV metrics
-        # from the specified component (workers) publishing them
-        kv_listener = self.runtime.namespace("your_namespace").component("YourWorker")
-        await kv_listener.create_service()
-        self.indexer = KvIndexer(kv_listener, self.args.block_size)
-        self.metrics_aggregator = KvMetricsAggregator(kv_listener)
-```
-With this flexibility, you can also define your own cost function that takes the
+The `model_type` can be:
-KV metrics from the KvMetricsAggregator above and the set of available workers
+- ModelType.Backend. Dynamo handles pre-processing. Your `generate` method receives a `request` dict containing a `token_ids` array of integers. It must return a dict that also contains a `token_ids` array and, optionally, a `finish_reason` string.
-as inputs, and returns which worker ID that the request should be routed to.
+- ModelType.Chat. Your `generate` method receives a `request` and must return a response dict of type [OpenAI Chat Completion](https://platform.openai.com/docs/api-reference/chat). Your engine handles pre-processing.
-Since the router is like any other worker in this context, you can also track
+- ModelType.Completion. Your `generate` method receives a `request` and must return a response dict in the older [Completions](https://platform.openai.com/docs/api-reference/completions) format. Your engine handles pre-processing.
-your own custom metrics and use them in your cost function:
-```python
+`register_llm` can also take the following kwargs:
-@service(
+- `model_name`: The name to call the model. The model name in incoming HTTP requests must match this. Defaults to the Hugging Face repo name, the folder name, or the GGUF file name.
-    dynamo={
+- `context_length`: Maximum model length in tokens. Defaults to the model's configured maximum. Only set this if you need to reduce KV cache allocation to fit into VRAM.
-        "namespace": "your_namespace",
+- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
-    },
-)
-class Router:
-    # ...
-    def _cost_function(
+See `components/backends` for full code examples.
-        self,
-        scores: OverlapScores | None,
-        metrics: AggregatedMetrics | None,
-        token_length: int,
-        custom_metrics: dict = {},
-    ):
-        """
-        Args:
-            scores (OverlapScores | None): The number of matching blocks between
-                the request and the prefix cache of each worker.
-            metrics (AggregatedMetrics | None): Several worker metrics polled
-                by the `KvMetricsAggregator`, currently including the
-                GPU cache usage, number of waiting requests, and the
-                GPU prefix cache hit rate.
-            token_length (int): The number of tokens in the request.
-            custom_metrics (dict): Arbitrary (optional) data from your implementation.
-        Returns:
+### Component names
-            worker_id (str): The best worker ID based on the inputs.
-        """
-        # This is a dummy implementation for demonstration purposes, see the
+A worker needs three names to register itself: `namespace.component.endpoint`.
-        # llm/tensorrt_llm/hello_world examples for more realistic implementations.
-        worker_ids = []
-        # KV cache block hit scores
+* *Namespace*: A pipeline. Usually a model, e.g. "llama_8b". Just a name.
-        for worker_id, score in scores.scores.items():
+* *Component*: A load balanced service needed to run that pipeline. "backend", "prefill", "decode", "preprocessor", "draft", etc. This typically has some configuration (which model to use, for example).
-            print(f"{worker_id} # of matching KV Blocks of size {self.indexer.block_size()}: {score}")
+* *Endpoint*: Like a URL. "generate", "load_metrics".
-            worker_ids.append(worker_id)
+* *Instance*: A process. Unique. Dynamo assigns each one a unique instance_id. The thing that is running is always an instance. Namespace/component/endpoint can refer to multiple instances.
-        # Aggregated KVMetrics published by workers
+If you run two models, that is two pipelines. The exception is speculative decoding, where the draft model is part of the pipeline of a bigger model.
-        for endpoint_metrics in metrics.endpoints:
-            print(f"Endpoint metrics: {endpoint_metrics}")
-        # Replace this random choice with your custom criteria to
+If you run two instances of the same model ("data parallel") they are the same namespace+component+endpoint but different instances. The router will spread traffic over all the instances of a namespace+component+endpoint. If you have four prefill workers in a pipeline, they all have the same namespace+component+endpoint and are automatically assigned unique instance_ids.
-        # select the best worker ID.
-        best_worker_id = random.choice(worker_ids)
-        return best_worker_id
+Example 1: Data parallel, load balanced: one model, one pipeline, two instances.
-    @endpoint()
-    async def generate(self, request: Tokens) -> AsyncIterator[WorkerId]:
-        try:
-            # lora_id is a placeholder for lora support, but not used in this example
-            lora_id = 0
-            scores = await self.indexer.find_matches_for_request(
-                request.tokens, lora_id
-            )
-        except Exception as e:
-            scores = {}
-            print(f"Error finding matches: {e}")
-        # Get published/aggregated KV Metrics
-        metrics = await self.metrics_aggregator.get_metrics()
-        # (Optional) Add custom metrics to consider in the cost function
-        # The types and data used here are fully up to your implementation
-        custom_metrics = {"my_custom_metric": 42}
-        # Call custom cost function
-        worker_id = self._cost_function(
-            scores, metrics, len(request.tokens), custom_metrics
-        )
-        # Return worker ID selected from cost function
-        yield f"{worker_id}"
 ```
+Node 1: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 0
-Similarly, for running a Rust-based Router as a standalone binary
+Node 2: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 2
-rather than as a Python Worker, see the
-[WorkerSelector Trait](../../lib/llm/src/kv_router.rs) trait, and the
-[Router Component](../../components/router/src/main.rs).
-For more details on receiving and routing based on the worker's published KV
-metrics, see the [KV Cache Routing Guide](../../docs/architecture/kv_cache_routing.md).
-### Disaggregated Serving
-#### NIXL
-NIXL (NVIDIA Inter-process Link) enables efficient GPU memory sharing between processes. In Prefill/Decode disaggregation, we use NIXL to transfer computed KV cache blocks from prefill workers to decode workers. Here are the core concepts:
-1. **NIXL Agent Setup**
-```python
-from nixl._api import nixl_agent
-class NixlConnector:
-    def __init__(self, engine_id: str, rank: int):
-        # Create unique NIXL agent for this worker
-        self.nixl_wrapper = nixl_agent(str(uuid.uuid4()), None)
-        self.engine_id = engine_id
-        self.rank = rank
-        self.block_len = None  # Will be set during registration
 ```
-2. **Memory Registration and Transfer Preparation**
+Example 2: Two models, two pipelines.
-```python
-def register_kv_caches(self, kv_cache: torch.Tensor):
-    # Get block size from the KV cache tensor
-    # Note: KV cache layout depends on specific attention implementation
-    num_blocks, block_size, num_heads, head_dim = kv_cache.shape
-    self.block_len = block_size * num_heads * head_dim * kv_cache.element_size()
-    self.num_blocks = num_blocks
-    # Register KV cache tensor with NIXL for sharing
-    base_addr = kv_cache.data_ptr()
-    region_len = num_blocks * self.block_len
-    caches_data = [(base_addr, region_len, self.rank, "")]
-    # Register memory regions with NIXL
-    descs = self.nixl_wrapper.get_reg_descs(caches_data, "VRAM")
-    self.nixl_wrapper.register_memory(descs)
-    # Prepare local side of the transfer
-    blocks_data = []
-    for block_id in range(num_blocks):
-        block_offset = block_id * self.block_len
-        blocks_data.append((base_addr + block_offset, self.block_len, self.rank))
-    # Create transfer descriptors and prepare for transfers
-    self.local_blocks_descs = self.nixl_wrapper.get_xfer_descs(blocks_data, "VRAM")
-    # Create transfer handle with block descriptors for future transfers
-    self.local_xfer_side_handle = self.nixl_wrapper.prep_xfer_dlist("", self.local_blocks_descs)
 ```
+Node 1: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B
-3. **Remote Agent Communication**
+Node 2: namespace: llama3-1-8b, component: backend, endpoint: generate, model: /data/Llama-3.1-8B-Instruct/
-```python
-def get_agent_metadata(self):
-    # Get metadata for sharing with other agents
-    return self.nixl_wrapper.get_agent_metadata(), self.local_blocks_descs
-def add_remote_agent(self, engine_id: str, agent_metadata: bytes, remote_blocks_descs: bytes):
-    # Connect to remote NIXL agent
-    agent_name = self.nixl_wrapper.add_remote_agent(agent_metadata)
-    # Prepare remote side transfer handle using provided block descriptors
-    self.remote_xfer_side_handle = self.nixl_wrapper.prep_xfer_dlist(agent_name, remote_blocks_descs)
-    return agent_name
-# Example usage:
-# On decode worker:
-decode_metadata, decode_blocks_descs = nixl_connector.get_agent_metadata()
-# Share metadata with prefill worker through your communication channel
-# On prefill worker:
-nixl_connector.add_remote_agent(decode_engine_id, decode_metadata, decode_blocks_descs)
-```
-4. **KV Cache Transfer**
-```python
-def write_blocks(self, local_block_ids, remote_block_ids, notify_msg):
-    # Initiate asynchronous transfer using block IDs
-    # Block descriptors were specified during transfer preparation
-    handle = self.nixl_wrapper.make_prepped_xfer(
-        "WRITE",
-        self.local_xfer_side_handle,
-        local_block_ids,
-        self.remote_xfer_side_handle,
-        remote_block_ids,
-        notify_msg
-    )
-    status = self.nixl_wrapper.transfer(handle)
-# Example usage:
-# On prefill worker:
-nixl_connector.write_blocks([0, 3], [12, 16], "kv_transfer")
-```
-The NIXL connector provides:
- GPU memory registration for sharing between processes
- Connection establishment between Prefill and Decode workers
- Efficient block-based KV cache transfers
- Asynchronous transfer notifications
-For a complete implementation example using NIXL for disaggregated serving, see the [vLLM example](../examples/llm_deployment.md).
-#### Disaggregation in Dynamo
-Aside from the NIXL specifics above, at its core, disaggregation in Dynamo builds
-on the same concepts used for any Dynamo client<->worker or worker<->worker
-interaction over the DistributedRuntime.
-First you can define a worker for each as usual:
-```python
-class DecodeWorker:
-    # ...
-class PrefillWorker:
-    # ...
 ```
-Second, you decide when/how the (Decode) worker should decide to Prefill remotely
+Example 3: Different endpoints.
-(by calling a separate Prefill worker), or decide to simply do the Prefill itself.
-In some scenarios, it may be more efficient for the Decode worker to just do the
-Prefill itself rather than do the extra communication, such as if the input
-sequence length is below some small threshold. If you wanted to disable
-disaggregation, the DecodeWorker could just always do the Prefill step as well.
-```python
+The KV metrics publisher in vLLM adds a `load_metrics` endpoint to the current component. If the `llama3-1-8b.backend` component above is using patched vLLM, it will also expose `llama3-1-8b.backend.load_metrics`.
-@service(
-    dynamo={
-        "namespace": "your_namespace",
-    },
-)
-class DecodeWorker:
-    def __init__(self):
-        self.runtime = dynamo_context["runtime"]
-        # Whether the decode worker should call a separate Prefill worker or not
+Example 4: Multiple components in a pipeline.
-        self.do_remote_prefill = True
-        # Initialize client to PrefillWorker
-        self.prefill_client = await self.runtime
-                .namespace("your_namespace")
-                .component("PrefillWorker")
-                .endpoint("generate")
-                .client()
-    @endpoint()
-    async def generate(self, request):
-        if self.do_remote_prefill:
-            # Forward the request to the prefill worker
-            prefill_response = await self.prefill_client.generate(...)
-        # ... framework-specific decode logic ...
-@service(
-    dynamo={
-        "namespace": "your_namespace",
-    },
-)
-class PrefillWorker:
-    def __init__(self):
-        # ...
-    @endpoint()
-    async def generate(self, request):
-        # ... framework-specific prefill logic ...
-```
-Depending on the load distribution of requests and number of Prefill/Decode
-worker instances, instead of directly forwarding requests to the Prefill
-worker endpoint, it may be advantageous to send Prefill requests into a queue
-that the Prefill workers can pull from on-demand instead. You can see an example
-of that [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/hello_world/disagg_skeleton/components/prefill_worker.py).
-For an introductory example on doing disaggregation with Dynamo using simple models, see
-[this example](../examples/disagg_skeleton.md).
-For more information on Disaggregated Serving, see the
-[general guide](../architecture/disagg_serving.md) and [performance tuning guide](disagg_perf_tuning.md).
-## Best Practices
-1. **Resource Management**: Configure resource requirements based on your needs:
-   ```python
-   @service(
-       resources={
-           "cpu": "10",
-           "memory": "20Gi",
-           "gpu": "1",
-       }
-   )
-   ```
-2. **Async Operations**: Use async/await for I/O operations:
-   ```python
-   @endpoint()
-   async def generate(self, request):
-       # Use async operations for better performance
-       result = await self.some_async_operation()
-   ```
-## Additional Resources
+In a prefill/decode (P/D) disaggregated setup you would have `deepseek-distill-llama8b.prefill.generate` (possibly multiple instances of this) and `deepseek-distill-llama8b.decode.generate`.
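
The naming scheme described under "Component names" can be illustrated with a small, self-contained sketch. This is not Dynamo's API: `EndpointName`, `group_instances`, and the instance ids below are hypothetical names and values invented for illustration. The sketch only shows how several instances can share one `namespace.component.endpoint` name, which is the set a router would spread traffic over.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointName:
    """Hypothetical helper: the three names a worker registers under."""

    namespace: str
    component: str
    endpoint: str

    @classmethod
    def parse(cls, dotted: str) -> "EndpointName":
        # Expect exactly three non-empty, dot-separated parts.
        parts = dotted.split(".")
        if len(parts) != 3 or not all(parts):
            raise ValueError(
                f"expected namespace.component.endpoint, got {dotted!r}"
            )
        return cls(*parts)

    def __str__(self) -> str:
        return ".".join((self.namespace, self.component, self.endpoint))


def group_instances(registrations):
    """Map each endpoint name to the sorted instance ids registered under it."""
    groups = defaultdict(list)
    for dotted, instance_id in registrations:
        groups[EndpointName.parse(dotted)].append(instance_id)
    return {name: sorted(ids) for name, ids in groups.items()}


# Made-up instance ids: two prefill workers and one decode worker in one pipeline.
registrations = [
    ("deepseek-distill-llama8b.prefill.generate", 102),
    ("deepseek-distill-llama8b.prefill.generate", 101),
    ("deepseek-distill-llama8b.decode.generate", 201),
]

groups = group_instances(registrations)
for name, ids in groups.items():
    print(f"{name}: instances {ids}")
```

Both prefill workers end up under the same name with distinct instance ids, while the decode worker forms its own group because its component differs.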
- Check the [examples](https://github.com/ai-dynamo/dynamo/tree/main/examples) directory for more detailed implementations
- Refer to the [Dynamo SDK Docs](../API/sdk.md) for API details.
- For Disaggregated Serving, see the [general guide](../architecture/disagg_serving.md) and [performance tuning guide](disagg_perf_tuning.md).
--- a/docs/guides/dynamo_deploy/dynamo_operator.md
+++ b/docs/guides/dynamo_deploy/dynamo_operator.md
@@ -137,7 +137,6 @@ This section describes how to use FluxCD for GitOps-based deployment of Dynamo i
 - A Kubernetes cluster with [Dynamo Cloud](dynamo_cloud.md) installed
 - [FluxCD](https://fluxcd.io/flux/installation/) installed in your cluster
 - A Git repository to store your deployment configurations
- [Dynamo CLI](../../get_started.md#alternative-setup-manual-installation) installed locally
 ### Workflow Overview

--- a/docs/guides/planner_benchmark/disagg_1p1d.yml
+++ b/docs/guides/planner_benchmark/disagg_1p1d.yml
--- a/docs/guides/planner_benchmark/disagg_2p2d.yaml
+++ b/docs/guides/planner_benchmark/disagg_2p2d.yaml
--- a/docs/guides/planner_benchmark/README.md
+++ b/docs/guides/planner_benchmark/README.md
 <!--
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
 -->
 # Planner Benchmark Example
@@ -50,8 +38,8 @@ For other models and GPU SKUs, adjust the request rate ranges accordingly to mat
 To measure the performance of dynamo with planner, we start from a 1p1d deployment and set planner to make adjustments every 10 seconds:
 ```bash
-cd examples/llm
+# Start Kubernetes with one frontend node, one prefill and one decode worker
-dynamo serve graphs.disagg_router:Frontend -f disagg_1p1d.yml
+# TODO
 # in terminal 2
 genai-perf profile \
@@ -84,7 +72,8 @@ In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--n
 ```bash
 # in terminal 1
-dynamo serve graphs.disagg_router:Frontend -f disagg_2p2d.yml
+# Start Kubernetes with one frontend node, two prefill and two decode workers
+# TODO
 # in terminal 2
 genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8000 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl

--- a/docs/index.rst
+++ b/docs/index.rst
@@ -75,7 +75,6 @@ The examples below assume you build the latest image yourself from source. If us
   Welcome to Dynamo <self>
   Support Matrix <support_matrix.md>
-   Getting Started <get_started.md>
 .. toctree::
   :hidden:

--- a/docs/support_matrix.md
+++ b/docs/support_matrix.md
@@ -2,18 +2,6 @@
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
 All rights reserved.
 SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
 -->
 # Dynamo Support Matrix
@@ -72,7 +60,6 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 | :----------------- | :------------ | :----------------------------------- | :----------- |
 | ai-dynamo          | 0.3.2         | >=2.28                               |              |
 | ai-dynamo-runtime  | 0.3.2         | >=2.28 (Python 3.12 has known issues)|              |
-| ai-dynamo-vllm     | 0.8.4.post4¹  | >=2.28 (recommended)                 |              |
 | NIXL               | 0.4.0         | >=2.27                               | >=11.8       |
 ### Build Dependency
@@ -80,13 +67,10 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 | **Build Dependency** | **Version**                                                                      |
 | :------------------- | :------------------------------------------------------------------------------- |
 | **Base Container**   | [25.03](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda-dl-base/tags) |
-| **ai-dynamo-vllm**   | 0.8.4.post4¹                                                                     |
 | **TensorRT-LLM**     | 1.0.0rc²                                                                         |
 | **NIXL**             | 0.4.0                                                                            |
 > [!Important]
-> ¹ ai-dynamo-vllm `v0.8.4.post4` is a customized patch of `v0.8.4` from vLLM.
->
 > ² Specific versions of TensorRT-LLM supported by Dynamo are subject to change.
 ## Build Support

--- a/lib/llm/src/discovery/watcher.rs
+++ b/lib/llm/src/discovery/watcher.rs
@@ -178,7 +178,6 @@ impl ModelWatcher {
                Some(card)
            }
            Err(err) => {
-                // `dynamo serve` isn't using MDC yet so can't be an error
                tracing::info!(%err, "load_mdc did not complete");
                None
            }