docs: Update docs for new UX (#2070)

Unverified Commit 3c500ae7 authored Jul 23, 2025 by Graham King Committed by GitHub Jul 23, 2025
20 changed files
--- a/README.md
+++ b/README.md
@@ -55,96 +55,68 @@ Built in Rust for performance and in Python for extensibility, Dynamo is fully o
 The following examples require a few system level packages.
 Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [docs/support_matrix.md](docs/support_matrix.md)

-```
-apt-get update
-DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0
-python3 -m venv venv
-source venv/bin/activate
-
-pip install "ai-dynamo[all]"
-```
-> [!NOTE]
-> To ensure compatibility, please refer to the examples in the release branch or tag that matches the version you installed.
-
-### Building the Dynamo Base Image
+1. Install etcd and nats

-Although not needed for local development, deploying your Dynamo pipelines to Kubernetes will require you to use a Dynamo base image to your container registry. You can use any container registry of your choice, such as:
- Docker Hub (docker.io)
- NVIDIA NGC Container Registry (nvcr.io)
- Any private registry
+To co-ordinate across the data center Dynamo relies on an etcd and nats cluster. To run locally these need to be available.

-We publish our images in [nvcr.io](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime)  and you can use them.
-Alternatively you could build and push an image from source:
+- [etcd](https://etcd.io/) can be run directly as `./etcd`.
+- [nats](https://nats.io/) needs jetstream enabled: `nats-server -js`.
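Before starting any Dynamo process it can help to confirm that both services accept connections. The following is a minimal sketch, assuming the usual default client ports (2379 for etcd, 4222 for nats) on localhost; `port_open` and `check_services` are illustrative helper names, not part of Dynamo.

```python
import socket
from typing import Dict, Optional

# Assumed defaults: etcd client port 2379, nats client port 4222.
DEFAULT_SERVICES = {"etcd": 2379, "nats": 4222}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(
    host: str = "127.0.0.1", services: Optional[Dict[str, int]] = None
) -> Dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    if services is None:
        services = DEFAULT_SERVICES
    return {name: port_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

A successful TCP connection only shows that something is listening on the port; it does not verify that nats was started with jetstream enabled.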

-```bash
-./container/build.sh
-docker tag dynamo:latest-vllm <your-registry>/dynamo-base:latest-vllm
-docker login <your-registry>
-docker push <your-registry>/dynamo-base:latest-vllm
+The Dynamo team recommends the `uv` Python package manager, although any package manager works. Install uv:
+```
+curl -LsSf https://astral.sh/uv/install.sh | sh
 ```

-Notes about builds for specific frameworks:
- For specific details on the `--framework vllm` build [read about the VLLM backend](components/backends/vllm/README.md)
-.
- For specific details on the `--framework tensorrtllm` build, see [Read about the TensorRT-LLM backend](components/backends/trtllm/README.md)
-.
+2. Select an engine

-Note about AWS environments:
- If deploying Dynamo in AWS, make sure to build the container with EFA support using the `--make-efa` flag.
+We publish Python wheels specialized for each of our supported engines: vllm, sglang, llama.cpp and trtllm. The examples that follow use sglang; read on for other engines.

-After building, you can use this image by setting the `DYNAMO_IMAGE` environment variable to point to your built image:
-```bash
-export DYNAMO_IMAGE=<your-registry>/dynamo-base:latest-vllm
 ```
+uv venv venv
+source venv/bin/activate
+uv pip install pip

-> [!NOTE]
-> We are working on leaner base images that can be built using the targets in the top-level Earthfile.
+# Choose one
+uv pip install "ai-dynamo[sglang]"
+uv pip install "ai-dynamo[vllm]"
+uv pip install "ai-dynamo[llama_cpp]" # CPU, see later for GPU
+```

 ### Running and Interacting with an LLM Locally

 You can run a model and interact with it locally using commands below.
-We support several backends including: `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.

 #### Example Commands

 ```
-python -m dynamo.frontend [--http-port 8080]
-python -m dynamo.vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+python -m dynamo.frontend --interactive
+python -m dynamo.sglang.worker Qwen/Qwen3-4B
 ```

 ```
-? User › Hello, how are you?
 ✔ User · Hello, how are you?
 Okay, so I'm trying to figure out how to respond to the user's greeting. They said, "Hello, how are you?" and then followed it with "Hello! I'm just a program, but thanks for asking." Hmm, I need to come up with a suitable reply. ...
 ```

-### LLM Serving
+If the model is not available locally it will be downloaded from HuggingFace and cached.

-Dynamo provides a simple way to spin up a local set of inference
-components including:
+You can also pass a local path: `python -m dynamo.sglang.worker --model-path ~/llms/Qwen3-0.6B`
+
+### Running an LLM API server
+
+Dynamo provides a simple way to spin up a local set of inference components including:

 - **OpenAI Compatible Frontend** – High performance OpenAI compatible http api server written in Rust.
 - **Basic and Kv Aware Router** – Route and load balance traffic to a set of workers.
 - **Workers** – Set of pre-configured LLM serving engines.

-To run a minimal configuration you can use a pre-configured
-example.
-
-#### Start Dynamo Distributed Runtime Services
-
-First start the Dynamo Distributed Runtime services:
-
-```bash
-docker compose -f deploy/metrics/docker-compose.yml up -d
 ```
-#### Start Dynamo LLM Serving Components
-
-Next serve a minimal configuration with an http server, basic
-round-robin router, and a single worker.
+# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router:
+python -m dynamo.frontend [--http-port 8080]

-```bash
-cd examples/llm
-dynamo serve graphs.agg:Frontend -f configs/agg.yaml
+# Start the sglang engine, connecting to nats and etcd to receive requests. You can run several of these,
+# both for the same model and for multiple models. The frontend node will discover them.
+python -m dynamo.sglang.worker deepseek-ai/DeepSeek-R1-Distill-Llama-8B
 ```

 #### Send a Request
@@ -163,43 +135,143 @@ curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"
  }' | jq
 ```

+Rerun with `curl -N` and change `stream` in the request to `true` to get the responses as soon as the engine issues them.
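With `stream` set to `true`, an OpenAI-compatible server typically answers with server-sent events: one `data: {json}` line per chunk, ending with `data: [DONE]`. The sketch below assumes that format and that each chunk carries its text under `choices[0].delta.content`; `iter_stream_text` is a hypothetical helper, not a Dynamo API, and the sample lines are illustrative rather than captured output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming chat completion lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Input shaped like the lines printed by `curl -N` (illustrative only).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints: Hello
```

Because the function is a generator, a caller can print each fragment as it arrives instead of waiting for the full response.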
+
+### Engines
+
+In the introduction we installed the `sglang` engine. There are other options.
+
+All of these require nats and etcd, as well as a frontend (`python -m dynamo.frontend [--interactive]`).
+
+#### vllm
+
+```
+uv pip install ai-dynamo[vllm]
+```
+
+Run the backend/worker like this:
+```
+python -m dynamo.vllm --help
+```
+
+vllm attempts to allocate enough KV cache for the full context length at startup. If that does not fit in your available memory pass `--context-length <value>`.
+
+To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.
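When workers are launched from a script, the variable can be set per process rather than for the whole shell. This is a minimal sketch: `worker_env` and `launch_vllm_worker` are hypothetical helpers, and the worker arguments are passed through unchanged, so no particular flag is assumed.

```python
import os
import subprocess
import sys
from typing import Dict, Iterable, List, Mapping, Optional


def worker_env(
    gpu_ids: Iterable[int], base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Copy the environment and restrict CUDA to the given GPU indices."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def launch_vllm_worker(worker_args: List[str], gpu_ids: Iterable[int]) -> subprocess.Popen:
    """Start `python -m dynamo.vllm <worker_args>` pinned to the given GPUs."""
    cmd = [sys.executable, "-m", "dynamo.vllm", *worker_args]
    return subprocess.Popen(cmd, env=worker_env(gpu_ids))


# Show the value a worker pinned to GPUs 0 and 1 would see.
print(worker_env([0, 1])["CUDA_VISIBLE_DEVICES"])  # prints: 0,1
```

Calling `launch_vllm_worker(["--help"], [0])` would run the same help command shown above with only GPU 0 visible to the process.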
+
+#### sglang
+
+```
+uv pip install ai-dynamo[sglang]
+```
+
+Run the backend/worker like this:
+```
+python -m dynamo.sglang.worker --help
+```
+
+You can pass any sglang flag directly to this worker; see https://docs.sglang.ai/backend/server_arguments.html for the full list, including the flags for using multiple GPUs.
+
+#### TRT-LLM
+
+This currently requires a container. TODO: add the documentation.
+
+#### llama.cpp
+
+To install llama.cpp for CPU inference:
+```
+uv pip install ai-dynamo[llama_cpp]
+```
+
+To build llama.cpp for CUDA:
+```
+pip install llama-cpp-python -C cmake.args="-DGGML_CUDA=on"
+uv pip install uvloop ai-dynamo
+```
+
+At time of writing the `uv pip` version does not support that syntax, so use `pip` directly inside the venv.
+
+To build llama.cpp for other accelerators see https://pypi.org/project/llama-cpp-python/ .
+
+Download a GGUF and run the engine like this:
+```
+python -m dynamo.llama_cpp --model-path ~/llms/Qwen3-0.6B-Q8_0.gguf
+```
+
+If you have multiple GPUs, llama.cpp does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.
+
 ### Local Development

-If you use vscode or cursor, we have a .devcontainer folder built on [Microsofts Extension](https://code.visualstudio.com/docs/devcontainers/containers). For instructions see the [ReadMe](.devcontainer/README.md) for more details.
+1. Install libraries
+
+**Ubuntu:**
+```
+sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
+```

-Otherwise, to develop locally, we recommend working inside of the container
+**macOS:**
+- [Homebrew](https://brew.sh/)
+```
+# if brew is not installed on your system, install it
+/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
+```
+- [Xcode](https://developer.apple.com/xcode/)

-```bash
-./container/build.sh
-./container/run.sh -it --mount-workspace
+```
+brew install cmake protobuf

-cargo build --release
-mkdir -p /workspace/deploy/sdk/src/dynamo/sdk/cli/bin
-cp /workspace/target/release/dynamo-run /workspace/deploy/sdk/src/dynamo/sdk/cli/bin
+## Check that Metal is accessible
+xcrun -sdk macosx metal
+```
+If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
+
+
+2. Install Rust

-uv pip install -e .
-export PYTHONPATH=$PYTHONPATH:/workspace/deploy/sdk/src:/workspace/components/planner/src
+```
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+source $HOME/.cargo/env
 ```

+3. Create a Python virtual env:

-#### Conda Environment
+```
+uv venv dynamo
+source dynamo/bin/activate
+```

-Alternately, you can use a conda environment
+4. Install build tools

-```bash
-conda activate <ENV_NAME>
+```
+uv pip install pip maturin
+```

-pip install nixl # Or install https://github.com/ai-dynamo/nixl from source
+[Maturin](https://github.com/PyO3/maturin) is the Rust<->Python bindings build tool.

-cargo build --release
+5. Build the Rust bindings

-# To install ai-dynamo-runtime from source
+```
 cd lib/bindings/python
-pip install .
+maturin develop --uv
+```
+
+6. Install the wheel
+
+```
+cd $PROJECT_ROOT
+uv pip install .
+```
+
+Note that an editable install (`-e`) does not work because the `dynamo` package is split over multiple directories, one per backend.
+
+You should now be able to run `python -m dynamo.frontend`.
+
+Remember that nats and etcd must be running (see earlier).
+
+Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.
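A `RUST_LOG`-style value is a comma-separated list of directives, each either a bare level or `target=level`. The sketch below is a simplified parser that illustrates only that basic shape; it ignores the more advanced span and field filters some Rust logging crates accept, and the target name `my_crate::module` is hypothetical.

```python
from typing import Dict, Optional, Tuple

LEVELS = {"error", "warn", "info", "debug", "trace", "off"}


def parse_log_filter(spec: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Split a RUST_LOG-style string into (default_level, per_target_levels)."""
    default = None
    targets: Dict[str, str] = {}
    for directive in spec.split(","):
        directive = directive.strip()
        if not directive:
            continue
        if "=" in directive:
            # target=level sets the level for one module path
            target, level = directive.split("=", 1)
            targets[target.strip()] = level.strip().lower()
        elif directive.lower() in LEVELS:
            # a bare level sets the default for everything else
            default = directive.lower()
        else:
            # a bare target name enables every level for that target
            targets[directive] = "trace"
    return default, targets


print(parse_log_filter("info,my_crate::module=debug"))
```

Under these assumptions, `export DYN_LOG=info,my_crate::module=debug` would keep most output at `info` while raising one module to `debug`.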
+
+If you use VS Code or Cursor, we have a `.devcontainer` folder built on [Microsoft's Dev Containers extension](https://code.visualstudio.com/docs/devcontainers/containers). For instructions see the [ReadMe](.devcontainer/README.md).

-cd ../../../
-pip install ".[all]"
+### Deployment to Kubernetes

-Follow the [Quickstart Guide](docs/guides/dynamo_deploy/quickstart.md)
+Follow the [Quickstart Guide](docs/guides/dynamo_deploy/quickstart.md) to deploy to Kubernetes.

-```
\ No newline at end of file
--- a/components/metrics/README.md
+++ b/components/metrics/README.md
@@ -74,18 +74,14 @@ metrics --component MyComponent --endpoint my_endpoint
 To run a more realistic deployment to gathering metrics from,
 see the examples in [examples/llm](../../examples/llm).

-For example, for a VLLM + KV Routing based deployment that
-exposes statistics on an endpoint labeled
-`dynamo/VllmWorker/load_metrics` (note: this does NOT currently work
-with any other example such as examples/vllm_v0, vllm_v1, ...):
 ```bash
-cd deploy/examples/llm
-dynamo serve graphs.agg:Frontend -f configs/agg.yaml
+python -m dynamo.frontend &
+python -m dynamo.vllm --model-path <your-model-checkout>
 ```

 Then, to monitor the metrics of these VllmWorkers, run:
 ```bash
-metrics --component VllmWorker --endpoint load_metrics
+metrics --component backend --endpoint load_metrics
 ```

 **NOTE**: `load_metrics` is currently a

--- a/docs/API/python_bindings.md
+++ b/docs/API/python_bindings.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-https://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Dynamo Python Bindings
-
-Python bindings for the Dynamo runtime system, enabling distributed computing capabilities for machine learning workloads.
-
-## 🚀 Quick Start
-
-1. Install `uv`: https://docs.astral.sh/uv/#getting-started
-```
-curl -LsSf https://astral.sh/uv/install.sh | sh
-```
-
-2. Install `protoc` protobuf compiler: https://grpc.io/docs/protoc-installation/.
-
-For example on an Ubuntu/Debian system:
-```
-apt install protobuf-compiler
-```
-
-3. Setup a virtualenv
-
-```
-uv venv
-source .venv/bin/activate
-uv pip install maturin
-```
-
-4. Build and install dynamo wheel
-```
-maturin develop --uv
-```
-
-## Run Examples
-
-### Prerequisite
-
-See [README.md](../runtime/README.md#prerequisites).
-
-### Hello World Example
-
-1. Start 3 separate shells, and activate the virtual environment in each
-```
-source .venv/bin/activate
-```
-
-2. In one shell (shell 1), run example server the instance-1
-```
-python3 ./examples/hello_world/server.py
-```
-
-3. (Optional) In another shell (shell 2), run example the server instance-2
-```
-python3 ./examples/hello_world/server.py
-```
-
-4. In the last shell (shell 3), run the example client:
-```
-python3 ./examples/hello_world/client.py
-```
-
-If you run the example client in rapid succession, and you started more than
-one server instance above, you should see the requests from the client being
-distributed across the server instances in each server's output. If only one
-server instance is started, you should see the requests go to that server
-each time.
-
-## Performance
-
-The performance impacts of synchronizing the Python and Rust async runtimes
-is a critical consideration when optimizing the performance of a highly
-concurrent and parallel distributed system.
-
-The Python GIL is a global critical section and is ultimately the death of
-parallelism. To compound that, when Rust async futures become ready,
-accessing the GIL on those async event loop needs to be considered carefully.
-Under high load, accessing the GIL or performing CPU intensive tasks on
-on the event loop threads can starve out other async tasks for CPU resources.
-However, performing a `tokio::task::spawn_blocking` is not without overheads
-as well.
-
-If bouncing many small message back-and-forth between the Python and Rust
-event loops where Rust requires GIL access, this is pattern where moving the
-code from Python to Rust will give you significant gains.
--- a/docs/API/sdk.md
+++ b/docs/API/sdk.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Dynamo SDK
-
-## Introduction
-
-Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See [Python Bindings](./python_bindings.md).
-
-Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example and instructions for deploying it [here](../examples/hello_world.md).
-
-## Installation
-
-The SDK can be installed using pip:
-
-```bash
-pip install ai-dynamo
-```
-
-## Core Concepts
-
-As you read about each concept, it is helpful to have the [basic example](../examples/hello_world.md) up as well so you can refer back to it.
-
-### Defining a Service
-
-A Service is a core building block for a project. You can think of it as a logical unit of work. For example, you might have a service responsible for preprocessing and tokenizing and another service running the model worker itself. You define a service using the `@service` decorator on a class.
-
-```python
-@service(
-    dynamo={
-        "namespace": "dynamo",
-    },
-    resources={"gpu": 2, "cpu": "10", "memory": "20Gi"},
-    workers=1,
-)
-```
-
-Key configuration options:
-
-1. `dynamo`: Dictionary that defines the Dynamo configuration and enables/disables it. When enabled, a dynamo worker is created under the hood which can register with the [Distributed Runtime](../architecture/architecture.md)
-2. `resources`: Dictionary defining resource requirements. The GPUs field is used for local and remote deployment. The other fields are used to determine resources when deploying to K8s.
-3. `workers`: Number of parallel instances of the service to spin up.
-
-### Writing a Service
-
-Let's walk through an example to understand how you write a dynamo service.
-
-```python
-import ServiceB
-
-@service(dynamo={"namespace": "dynamo"}, resources={"gpu": 1})
-class ServiceA:
-    # Define service dependencies
-    service_b = depends(ServiceB)
-
-    def __init__(self, model_name="meta-llama/Llama-3.1-8B-Instruct"):
-        self.model_name = model_name
-        self.engine = None
-
-    @async_on_start
-    async def async_init(self):
-        # Initialize resources that require async operations
-        self.engine = await initialize_model_engine(self.model_name)
-        print(f"ServiceA initialized with model: {self.model_name}")
-
-    @on_shutdown
-    def shutdown(self):
-        # Clean up resources
-        if self.engine:
-            self.engine.shutdown()
-        print("ServiceA engine shut down")
-
-    @endpoint()
-    async def generate(self, request: ChatCompletionRequest):
-        # Call dependent service
-        processed_request = await self.service_b.preprocess(request)
-
-        # Use the engine to generate a response
-        response = await self.engine.generate(processed_request)
-        return response
-```
-
-#### Class-Based Architecture
-
-Dynamo follows a class-based architecture similar to BentoML making it intuitive for users familiar with those frameworks. Each service is defined as a Python class, with the following components:
-
-1. Class attributes for dependencies using `depends()`
-2. An `__init__` method for standard initialization
-3. Optional lifecycle hooks like `@async_on_start` and `@on_shutdown`
-4. Endpoints defined with `@endpoint()`. Optionally, an endpoint can be given a name
-   via `@endpoint("my_endpoint_name")`, but otherwise defaults to the name of the
-   function being decorated if omitted.
-
-This approach provides a clean separation of concerns and makes the service structure easy to understand.
-
-#### Service Dependencies with `depends()`
-
-The `depends()` function is a powerful feature that lets you create a dependency between services. When you use `depends(ServiceB)`, several things happen:
-
-1. It ensures that `ServiceB` is deployed when `ServiceA` is deployed by adding it to an internal service dependency graph
-2. It creates a client to the endpoints of `ServiceB` that is being served under the hood.
-3. You are able to access `ServiceB` endpoints as if it were a local function!
-
-```python
-# What happens internally when you use depends(ServiceB)
-service_b = await runtime.namespace("dynamo").component("ServiceB").endpoint("preprocess").client()
-
-# But with Dynamo SDK, you simply write:
-service_b = depends(ServiceB)
-
-# And then call methods directly:
-result = await service_b.preprocess(data)
-```
-
-```{note}
-Through the SDK, we also provide you with a way to access the underlying bindings if you need. Sometimes you might want to write complicated logic that causes you to directly create a client to another Service without depending on it. To do so:
-```
-
-```python
-import VllmWorker
-
-# this runtime object gives you access to the underlying python bindings
-runtime = dynamo_context["runtime"]
-comp_ns, comp_name = VllmWorker.dynamo_address() # dynamo://{namespace}/{name}
-print(f"[Processor] comp_ns: {comp_ns}, comp_name: {comp_name}")
-self.worker_client = (
-    await runtime.namespace(comp_ns)
-    .component(comp_name)
-    .endpoint("generate")
-    .client()
-)
-```
-
-This is used in some of our prebuilt examples and is a powerful way to leverage the benefits of the SDK while being able to access Dynamo's core primitives.
-
-#### Lifecycle Hooks
-
-Dynamo supports key lifecycle hooks to manage service initialization and cleanup.
-
-##### `@async_on_start`
-
-The `@async_on_start` hook is called when the service is started and is used to run an async process outside of the main `__init__` function.
-
-```python
-@async_on_start
-async def async_init(self):
-    # Perfect for operations that need to be awaited
-    self.db = await connect_to_db()
-    self.tokenizer = await load_tokenizer()
-    self.engine = await initialize_engine(self.model)
-```
-
-This is especially useful for:
-
- Initializing external connections
- Setting up runtime resources that require async operations
-
-#### `@on_shutdown`
-
-The `@on_shutdown` hook is called when the service is shutdown handles cleanup.
-
-```python
-@on_shutdown
-def shutdown(self):
-    # gracefully Handle shutdown / cleanup
-    logger.info("worker shutting down")
-```
-
-This ensures resources are properly released, preventing memory leaks and making sure external connections are properly closed. This is helpful to clean up vLLM engines that have been started outside of the main process.
-
-### Configuring a Service
-
-Dynamo SDK provides a flexible configuration system that allows you to define service parameters through multiple methods:
-
-1. Directly in the `@service` decorator
-2. Through YAML configuration files
-3. Via command-line arguments
-4. Using environment variables
-
-These methods can be used together with clear precedence rules, giving you fine-grained control over service configuration across different environments.
-
-#### Configuration via Service Decorator
-
-The most basic method is to specify parameters directly in the service decorator:
-
-```python
-@service(
-    dynamo={"namespace": "prod"},
-    resources={"gpu": 2, "cpu": "4", "memory": "16Gi"},
-    workers=2,
-)
-class MyService:
-    def __init__(self, model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", temperature=0.7):
-        self.model_name = model_name
-        self.temperature = temperature
-```
-
-This defines static configuration values in code. Note that the constructor parameters (`model_name` and `temperature`) are also configurable values that can be overridden.
-
-#### Configuration via YAML
-
-For more flexible configuration, especially across environments, you can use YAML files:
-
-```yaml
-# config.yaml
-MyService:
-  # Override service decorator settings
-  ServiceArgs:
-    workers: 4
-    resources:
-      gpu: 4
-
-  # Service instance parameters
-  model_name: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
-  temperature: 0.8
-```
-
-The YAML file has a hierarchical structure:
-
- Top level keys are service class names
- `ServiceArgs` contains parameters for the service decorator
- Other keys are passed as arguments to the service constructor
- Additional keys specific to the service can be accessed via the config system
-
-#### Loading YAML Configuration
-
-Use the CLI to load configuration from a YAML file:
-
-```bash
-dynamo serve service:MyService -f config.yaml
-```
-
-The configuration is parsed and stored in the `DYNAMO_SERVICE_CONFIG` environment variable, which is then passed to the service workers.
-
-#### Configuration Precedence
-
-When multiple configuration sources are used, they follow this precedence order (highest to lowest):
-
-1. Command-line arguments
-2. YAML configuration
-3. Service decorator defaults
-4. Constructor defaults
-
-#### Accessing Configuration in Services
-
-Inside a service, you can access configuration using the `ServiceConfig` class:
-
-```python
-from dynamo.sdk.lib.config import ServiceConfig
-
-class MyService:
-    def __init__(self):
-        config = ServiceConfig.get_instance()
-
-        # Get with default value
-        self.model_name = config.get("MyService", {}).get("model_name", "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
-        self.temperature = config.get("MyService", {}).get("temperature", 0.7)
-
-        # Require a config value (raises error if missing)
-        self.api_key = config.require("MyService", "api_key")
-
-        # Get all config for this service
-        all_my_config = config.get("MyService", {})
-```
-
-#### Parsing Configuration as CLI Arguments
-
-For services that need to extract their configuration as command-line arguments (common when integrating and validating with external libraries), the SDK provides a helper method:
-
-```python
-from dynamo.sdk.lib.config import ServiceConfig
-
-def setup_my_lib():
-    config = ServiceConfig.get_instance()
-
-    # Get all MyService config with prefix "lib_" as CLI args
-    cli_args = config.as_args("MyService", prefix="lib_")
-    # Returns: ["--option1", "value1", "--flag2", "--option3", "value3"]
-
-    # Pass to an external library's argument parser
-    lib_parser = MyLibArgumentParser()
-    lib_args = lib_parser.parse_args(cli_args)
-    return lib_args
-```
-
-This pattern is used in the example vLLM integration:
-
-```python
-def parse_vllm_args(service_name, prefix) -> AsyncEngineArgs:
-    config = ServiceConfig.get_instance()
-    vllm_args = config.as_args(service_name, prefix=prefix)
-    parser = FlexibleArgumentParser()
-
-    # Add custom arguments
-    parser.add_argument("--router", type=str, choices=["random", "round-robin", "kv"], default="random")
-    parser.add_argument("--remote-prefill", action="store_true", default=False)
-
-    # Add VLLM's arguments (ServiceConfig handles True defaults automatically)
-    parser = AsyncEngineArgs.add_cli_args(parser)
-
-    # Parse both custom and VLLM arguments
-    args = parser.parse_args(vllm_args)
-
-    # Convert to engine arguments
-    engine_args = AsyncEngineArgs.from_cli_args(args)
-
-    # Add custom args to the engine args
-    engine_args.router = args.router
-    engine_args.remote_prefill = args.remote_prefill
-
-    return engine_args
-```
-
-#### Boolean Argument Handling
-
-ServiceConfig uses a targeted approach for boolean arguments to maintain compatibility with different argument parsers:
-
-1. Standard Boolean Handling:
- `true` → outputs just the flag (e.g., `--enable-feature`)
- `false` → omitted entirely (uses parser's default)
-
-2. vLLM True-Default Arguments (targeted override support):
- Automatically detects vLLM arguments that default to `True` and need explicit `false` handling
- `enable-prefix-caching: false` → `--no-enable-prefix-caching` (negative flag)
- `multi-step-stream-outputs: false` → `--no-multi-step-stream-outputs` (negative flag)
-
-```yaml
-# Example YAML configuration
-VllmWorker:
-  # Standard boolean flags (action="store_true" style)
-  enforce-eager: true          # → --enforce-eager
-  disable-logging: false       # → (omitted)
-
-  # vLLM arguments with True defaults (automatically handled)
-  enable-prefix-caching: false  # → --no-enable-prefix-caching
-
-  # Non-boolean arguments
-  max-model-len: 16384         # → --max-model-len 16384
-```
-
-#### Overriding Service Decorator with ServiceArgs
-
-The `ServiceArgs` section in YAML configuration allows you to override any parameter in the `@service` decorator:
-
-```yaml
-MyService:
-  ServiceArgs:
-    dynamo:
-      namespace: "staging" # Override namespace
-    resources:
-      gpu: 4 # Use more GPUs
-    workers: 8 # Scale up workers
-```
-
-This is particularly useful for:
-
- Changing resource allocations between environments
- Modifying worker counts based on expected load
- Switching between namespaces for different deployments
-
-Under the hood, the `DynamoService` class reads these arguments during initialization:
-
-```python
-def _get_service_args(self, service_name: str) -> Optional[dict]:
-    """Get ServiceArgs from environment config if specified"""
-    config_str = os.environ.get("DYNAMO_SERVICE_CONFIG")
-    if config_str:
-        config = json.loads(config_str)
-        service_config = config.get(service_name, {})
-        return service_config.get("ServiceArgs")
-    return None
-```
-
-#### Complete Configuration Example
-
-Here's a comprehensive example showing how all these pieces fit together:
-
-1. First, define your service with basic defaults:
-
-```python
-@service(
-    dynamo={"namespace": "default"},
-    resources={"gpu": 1},
-    workers=1,
-)
-class LLMService:
-    def __init__(self, model_name="gpt-2", temperature=0.7, max_tokens=1024):
-        self.model_name = model_name
-        self.temperature = temperature
-        self.max_tokens = max_tokens
-
-        # Get additional configuration
-        config = ServiceConfig.get_instance()
-        service_config = config.get("LLMService", {})
-
-        # Extract service-specific parameters
-        self.cache_size = service_config.get("cache_size", 1000)
-        self.use_kv_cache = service_config.get("use_kv_cache", True)
-```
-
-2. Create a YAML configuration for production:
-
-```yaml
-# prod_config.yaml
-LLMService:
-  ServiceArgs:
-    dynamo:
-      namespace: "prod"
-    resources:
-      gpu: 4
-      memory: "64Gi"
-    workers: 8
-
-  # Constructor parameters
-  model_name: "llama-3-70b-instruct"
-  temperature: 0.8
-  max_tokens: 2048
-
-  # Service-specific parameters
-  cache_size: 10000
-  use_kv_cache: true
-```
-
-3. Deploy with mixed configuration:
-
-```bash
-dynamo serve service:LLMService -f prod_config.yaml --LLMService.temperature=0.9
-```
-
-The service receives the combined configuration with the command-line value taking precedence, resulting in effective configuration of:
-
- `dynamo.namespace = "prod"`
- `resources.gpu = 4`
- `workers = 8`
- `model_name = "llama-3-70b-instruct"`
- `temperature = 0.9` (from CLI override)
- `max_tokens = 2048`
- `cache_size = 10000`
- `use_kv_cache = true`
-
-#### Service Configuration Best Practices
-
-1. **Use the Service Decorator for Defaults**: Put reasonable defaults in the service decorator
-2. **Use Constructor Parameters for Runtime Options**: Parameters that might change between deployments
-3. **Use YAML for Environment Configuration**: Separate configuration by environment (dev/staging/prod)
-4. **Use CLI for Quick Testing**: Override specific values for experimentation
-5. **Document Configuration Keys**: Make sure to document all available configuration options
-
-Following these practices help create flexible and maintainable Dynamo services that can be easily configured for different environments and use cases.
-
-### Deploying a Single Service
-
-You can deploy a single service for local development even if you have a dependency graph defined using `depends()` using `dynamo serve --service-name <ClassName> <entrypoint> <configuration arguments>`. You can see an example of this in our multinode documentation [here](../examples/multinode.md).
-
-### Composing Services into an Graph
-
-There are two main ways to compose services in Dynamo:
-
-1. Use `depends()` (Recommended)
-   The depends() approach is the recommended way for production deployments:
-
- Automatically deploys all dependencies
- Creates a static inference graph at deployment time
- Provides type hints and better IDE support
-
-2. Use `.link()` (Experimental)
-   Our `.link()` syntax is an flexible and experimental way to compose various services. Linking allows you to compose checks at runtime and view behavior. Under the hood - we are editing the dependency graph between various services. This is useful for experimentation and development but we suggest writing a static graph for your final production deployment.
-
-#### Understanding the `.link()` syntax
-
-Lets take the example of a `Processor` component. This component can currently do 2 things:
-
-1. Process a request and send it to a `Router` to decide what worker to send it to.
-2. Process a request and send it to a `Worker` directly.
-
-Consider this snippet of the Processor:
-
-```python
-class Processor(ProcessMixIn):
-    """
-    vLLM pre and post processing
-    """
-
-    worker = depends(VllmWorker)
-    router = depends(Router)
-
-    # logic for processing a request based on router or worker
-```
-
-Think of all the depends statements as the maximal set of edges for the processor. At runtime, you may want to follow only a single path. By default, our processor spins up both the VllmWorker and Router as separate services (because `depends()` is defined for both). However, if you want to only spin up the Router, you can do this by linking the Router to the Processor that removes the `worker` dependency from the Processor.
-
-```python
-Processor.link(Router)
-```
-
-This removes the `worker` dependency from the Processor and only spin up the Router.
--- a/docs/architecture/disagg_serving.md
+++ b/docs/architecture/disagg_serving.md
@@ -2,18 +2,6 @@
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
 All rights reserved.
 SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
 -->

 # Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
@@ -117,78 +105,3 @@ The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows r
 - Add prefill worker: no explicit action needed.
 - Delete prefill worker: flush engine.

-### How this works under the hood
-
-#### Auto-Discovery for new workers
-
-In Dynamo, we use `etcd` (a distributed key-value pair store) as a way to register and discover new components. When a new decode/aggregated worker starts, it adds its endpoint information to `etcd` allowing the router to discover it and route requests to it. For the KV-cache transfer process, newly added decode workers put memory descriptors of their KV cache (used in NIXL transfer) in `etcd`. Newly added prefill workers also register with `etcd` for discovery and simply start pulling prefill requests from the global prefill queue after they spin up. Prefill workers lazy-pull the descriptors when they start serving a remote prefill request for the first time.
-
-You can watch this happen live by running the following:
-
-```bash
-# in terminal 1 - run the disaggregated serving example
-dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
-```
-
-```bash
-# in terminal 2 - watch the namespace in etcd
-watch -cd etcdctl get --prefix <namespace>
-```
-
-You should see something like this show up as the disaggregated serving example starts up:
-
-```bash
-# worker information
-dynamo/components/PrefillWorker/mock:694d967da694ea1e
-{
-  "component": "PrefillWorker",
-  "endpoint": "mock",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009310,
-  "transport": {
-    "nats_tcp": "dynamo_prefillworker_0d6df828.mock-694d967da694ea1e"
-  }
-}
-dynamo/components/Processor/chat/completions:694d967da694ea16
-{
-  "component": "Processor",
-  "endpoint": "chat/completions",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009302,
-  "transport": {
-    "nats_tcp": "dynamo_processor_3816642d.chat/completions-694d967da694ea16"
-  }
-}
-dynamo/components/VllmWorker/generate:694d967da694ea1a
-{
-  "component": "VllmWorker",
-  "endpoint": "generate",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009306,
-  "transport": {
-    "nats_tcp": "dynamo_vllmworker_3f6fafd3.generate-694d967da694ea1a"
-  }
-}
-dynamo/components/VllmWorker/load_metrics:694d967da694ea1a
-{
-  "component": "VllmWorker",
-  "endpoint": "load_metrics",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009306,
-  "transport": {
-    "nats_tcp": "dynamo_vllmworker_3f6fafd3.load_metrics-694d967da694ea1a"
-  }
-}
-
-# nixl metadata
-dynamo/nixl_metadata/e318db87-be55-4c18-9829-8036e1e603e2
-```
-
-#### Graceful worker shutdown
-
-Since worker information is stored in etcd, we can shutdown workers by simply revoking their etcd leases. After a lease is revoked:
-
- Decode/aggregated worker endpoints are immediately removed from etcd so that they would not accept new requests. They finish any in-flight requests, shut down their engine, and exit gracefully
- Prefill workers stop pulling from the prefill queue and exit gracefully after all pending remote kv cache writes finish
-
-You can also visualize this by revoking a workers etcd lease while it has ongoing requests. Refer to this example script that does this: [revoke_lease.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/revoke_lease.py).
\ No newline at end of file
--- a/docs/architecture/kv_cache_routing.md
+++ b/docs/architecture/kv_cache_routing.md
 <!--
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 SPDX-License-Identifier: Apache-2.0
+-->

-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
+# KV Cache Routing
+This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.

-http://www.apache.org/licenses/LICENSE-2.0
+To enable KV cache aware routing, start the frontend node like this:
+```
+python -m dynamo.frontend --router-mode kv
+```

-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
+The engine announces when a KV block is created or removed. The Dynamo router finds the worker with the best match for those KV blocks and directs the traffic to that node.

->[!NOTE]
->This information is temporary and will change soon.
+For performance testing, compare a typical workload with `--router-mode random|round-robin` to see if it can benefit from KV-aware routing.

-# KV Cache Routing
-This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.
+The KV-aware routing arguments:

-## Architecture
-Dynamo's architecture consists of three key concepts:
+- `--kv-overlap-score-weight`: Sets the weight given to overlaps with prefix caches, which directly contributes to the prefill cost. A larger weight is expected to yield a better TTFT (at the expense of worse ITL). When set to 0, prefix caches are not considered at all (falling back to pure load balancing behavior on the active blocks).
+
+- `--router-temperature`: Sets the temperature when randomly selecting workers to route to via softmax sampling on the router cost logits. Setting it to 0 recovers the deterministic behavior where the min logit is picked.

- **Namespace**: Groups related components (similar to directories in a file system). In our examples, we use the label `dynamo`. This avoids collisions between two different dynamo graphs.
- **Component**: The deployable unit in Dynamo. Components are self-contained and typically map to separate Docker containers. In our examples, we use labels like `VllmWorker `, `Router`, `Processor` for the components. Components can be created in Python or Rust.
- **Endpoint**: Functions attached to components that transform inputs into outputs. Endpoints are discoverable and callable by other components. In our examples we use the label `generate` for most of the endpoints.
+- `--use-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, the router uses the `KvIndexer` to listen to block creation and deletion events. If false, it uses the `ApproxKvIndexer`, which assumes the KV cache of historical prompts exists for a fixed duration (hard-coded to 120s), to predict the KV cache hit ratio in each engine. Set this to false if your backend engine does not emit KV events.
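
The `--router-temperature` behavior described above can be illustrated with a minimal sketch. This is not Dynamo's router code: the function name `pick_worker`, the shape of the `costs` mapping, and the cost values are all hypothetical. It only assumes what the text states, namely that a lower cost is better, that sampling uses a softmax over the cost logits, and that a temperature of 0 deterministically picks the minimum.

```python
import math
import random


def pick_worker(costs, temperature, rng=random):
    """Pick a worker id from a {worker_id: cost} mapping (hypothetical shape).

    Lower cost is better. A temperature of 0 (or below) returns the
    minimum-cost worker; higher temperatures spread the choice out.
    """
    if not costs:
        raise ValueError("no workers to choose from")
    if temperature <= 0:
        # Deterministic case: the min logit is picked.
        return min(costs, key=costs.get)

    worker_ids = list(costs)
    # Negate the costs so that a lower cost gets a higher softmax probability.
    logits = [-costs[w] / temperature for w in worker_ids]
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    # random.choices normalizes the weights, so they need not sum to 1.
    return rng.choices(worker_ids, weights=weights, k=1)[0]


# Hypothetical per-worker costs, for illustration only.
example_costs = {"worker-a": 3.0, "worker-b": 1.0, "worker-c": 2.0}
chosen = pick_worker(example_costs, temperature=0.0)
```

With `temperature=0.0` the call always returns the lowest-cost worker. Raising the temperature makes higher-cost workers progressively more likely to be picked, which trades cache affinity for a more even spread of load.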

-A Dynamo graph is a collection of components that are linked together to form a graph. There are two paths through the graphs. The request path and the response path. For LLMs the request path is single-in (a single message) and the response path is many-out (streamed output).

-A common pattern is to spin up multiple of the same components that serve the same endpoints, for example, when you want to duplicate models to serve more requests. Each endpoint will get a unique identifier and you will have to tell Dynamo how to route requests between these endpoints.
+## Architecture

 Colloquially, we refer to a Dynamo component that serves an endpoint for LLM inference as a **worker**.

@@ -150,32 +144,6 @@ The KVIndexer builds and maintains a global view of cached blocks in a prefix tr

 The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks.

-Example:
-```python
-from dynamo.llm import KvIndexer
-from dynamo.sdk import dynamo_context
-
-runtime = dynamo_context["runtime"]
-kv_listener = runtime.namespace("dynamo").component("VllmWorker")
-await kv_listener.create_service()
-
-indexer = KvIndexer(kv_listener, block_size=16)
-indexer.find_matches_for_request([INPUT SEQUENCE OF TOKEN IDs])
-```
-
-Sample Output:
-```
-{
-	123456789: 10,
-	987654321: 3,
-	543219876: 7,
-}
-```
-
-```{note}
-This example is designed to help you understand KV cache routing; it won't run outside of the context of dynamo serve. See the examples/ directory for runnable examples.
-```
-
 ### WorkerMetricsPublisher
 We added a KvMetrics Publisher which sends the following metrics to the KvMetricsAggregator:
 - num_requests_waiting
@@ -191,48 +159,3 @@ Currently, the WorkerMetricsPublisher exists as a Python binding.
 ### KvMetricsAggregator
 The KvMetricsAggregator receives these metrics and aggregates them. It has a method `get_metrics` which returns an object of `AggregatedMetrics`.

-Example:
-```python
-from dynamo.llm import KvMetricsAggregator
-from dynamo.sdk import dynamo_context
-
-runtime = dynamo_context["runtime"]
-kv_listener = runtime.namespace("dynamo").component("VllmWorker")
-await kv_listener.create_service()
-metrics_aggregator = KvMetricsAggregator(kv_listener)
-
-for endpoint in metrics_aggregator.get_metrics().endpoints:
-    print("Worker ID: ", endpoint.worker_id)
-    print("GPU Cache Usage: ", endpoint.gpu_cache_usage_perc)
-    print("Number of Requests Waiting: ", endpoint.num_requests_waiting)
-    print("GPU Prefix Cache Hit Rate: ", endpoint.gpu_prefix_cache_hit_rate)
-    print("***")
-```
-
-Sample Output:
-```
-Worker ID: 123456789
-GPU Cache Usage: 0.5
-Number of Requests Waiting: 2
-GPU Prefix Cache Hit Rate: 0.1
-***
-Worker ID: 987654321
-GPU Cache Usage: 0.5
-Number of Requests Waiting: 1
-GPU Prefix Cache Hit Rate: 0.1
-***
-```
-
-```{note}
-This example is for building understanding, it will not run outside of the context of dynamo serve. See the examples/ folder for runnable examples.
-```
-
-### [KV Router](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components/kv_router.py)
-The Router component makes intelligent worker selection decisions
-1. Receives incoming requests as tokens
-2. Queries the KVIndexer to find potential cache hits across workers
-3. Collects performance metrics from workers (via KvMetricsAggregator)
-4. Uses a cost function to determine the optimal worker for each request
-5. Returns chosen worker
-
-The processor manages tokenizing the request, sending it to the KV Router and then once it receives a response, directs the request to the selected worker using direct() routing.
--- a/docs/examples/README.md
+++ b/docs/examples/README.md
@@ -19,8 +19,6 @@ If you are a **🧑‍💻 Dynamo Contributor** first follow the instructions in

 You would have to rebuild the dynamo platform images as the code evolves. For more details please look at the [Cloud Guide](../guides/dynamo_deploy/dynamo_cloud.md)

-Export the [Dynamo Base Image](../get_started.md#building-the-dynamo-base-image) you want to use (or built during the prerequisites step) as the `DYNAMO_IMAGE` environment variable.
-
 ```bash
 export DYNAMO_IMAGE=<your-registry>/<your-image-name>:<your-tag>
 ```
@@ -74,4 +72,4 @@ kubectl port-forward svc/${SERVICE_NAME}-frontend 8000:8000 -n ${NAMESPACE}

 Consult the [Port Forward Documentation](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/)

-More on [LLM examples](llm_deployment.md)
\ No newline at end of file
+More on [LLM examples](llm_deployment.md)
--- a/docs/examples/disagg_skeleton.md
+++ b/docs/examples/disagg_skeleton.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Hello World: Aggregated and Disaggregated Deployment Examples
-
-The `example` directory contains a [hello world example](../examples/hello_world.md) that implements a simplified disaggregated serving architecture used for deploying Large Language Models (LLMs). It removes the LLM related inference code and focuses on how Dynamo handles routing, task queue, and metadata communication between prefill and decode workers.
-
-## Components
-
- frontend: A simple http server handles incoming requests
- processor: A pre/post processing server and invokes router server
- router: Handles API requests and routes them to appropriate workers based on specified strategy
- worker: A dummy decode worker
- prefill-worker: A dummy prefill worker
-
-## Deployment Architectures
-
-This figure shows an overview of the major components to deploy:
-
-```
-                                                 +----------------+
-                                                 | prefill worker |-------+
-                                                 |                |       |
-                                                 +----------------+       | pull
-                                                                          v
-+------+      +-----------+      +------------------+    push     +---------------+
-| HTTP |----->| processor |----->|  decode/monolith |------------>| prefill queue |
-|      |<-----|           |<-----|      worker      |             |               |
-+------+      +-----------+      +------------------+             +---------------+
-                  |    ^
-       query best |    | return
-           worker |    | worker_id
-                  |    |         +------------------+
-                  |    +---------|      router      |
-                  +------------->|                  |
-                                 +------------------+
-
-```
-
-## The Aggregated Deployment
-
-This example uses 2 nodes to demo the disagg serving.
- Node 1
-  - Runs NATS and etcd services
-  - Deploys Frontend, Processor and Router
-  - Deploys DummyWorker as the monolith worker
- Node 2
-  - Deploys DummyWorker as the monolith worker
-
-### Prerequisites
-On Node 1, start required services (etcd and NATS) using [Docker Compose](https://github.com/ai-dynamo/dynamo/blob/main/deploy/docker-compose.yml)
-```bash
-docker compose -f deploy/docker-compose.yml up -d
-```
-
-### Run the Deployment
-
-1. Set environment variables for NATS and etcd services
-
-```bash
-export NATS_SERVER="nats://Node_1_IP_ADDRESS:4222"
-export ETCD_ENDPOINTS="http://Node_1_IP_ADDRESS:2379"
-```
-
-2. Launch Frontend, Processor and Router services:
-```
-cd dynamo/examples/hello_world/disagg_skeleton
-dynamo serve components.graph:Frontend
-```
-
-3. Open a new terminal on Node 1 and deploy Worker service
-```
-export NATS_SERVER="nats://Node_1_IP_ADDRESS:4222"
-export ETCD_ENDPOINTS="http://Node_1_IP_ADDRESS:2379"
-cd dynamo/examples/hello_world/disagg_skeleton
-dynamo serve components.worker:DummyWorker
-```
-
-4. Go to Node 2 and start Worker service as in step 3.
-Now you should see both workers are ready in Node 1's terminal.
-
-5. Query the Frontend with following two prompts. The router would assign different workers for each prompt and you can observe it from the responses.
- `Response: {"worker_output":"Tell me a joke_GeneratedBy_NODE1HOSTNAME","request_id":"id_number"}`
- `Response: {"worker_output":"Which team won 2020 World Series_GeneratedBy_NODE2HOSTNAME","request_id":"id_number"}`
-```
-curl -X 'POST' \
-  'http://localhost:8000/generate' \
-  -H 'accept: text/event-stream' \
-  -H 'Content-Type: application/json' \
-  -d '{
-  "prompt": "Tell me a joke",
-  "request_id":"id_number"
-}'
-curl -X 'POST' \
-  'http://localhost:8000/generate' \
-  -H 'accept: text/event-stream' \
-  -H 'Content-Type: application/json' \
-  -d '{
-  "prompt": "Which team won 2020 World Series",
-  "request_id":"id_number"
-}'
-```
-6. Then modify the prompt; prompts with similar prefixes are routed to the same worker due to the routing algorithm used in this demo. For example, following query is routed to the worker that proceesed `Tell me a joke` prompt.
-```
-curl -X 'POST' \
-  'http://localhost:8000/generate' \
-  -H 'accept: text/event-stream' \
-  -H 'Content-Type: application/json' \
-  -d '{
-  "prompt": "Tell me a fact",
-  "request_id":"id_number"
-}'
-```
-
- `Response: {"worker_output":"Tell me a fact_GeneratedBy_NODE1HOSTNAME","request_id":"id_number"}`
-
-## The Disaggregated Deployment
-
-This example uses 3 nodes to demo the disagg serving.
- Node 1
-  - Runs NATS and etcd services
-  - Deploys Frontend and Processor
-  - Deploys DummyWorker as the decode worker
- Node 2
-  - Deploys DummyWorker as the decode worker
- Node 3
-  - Deploys Prefill as the prefill worker
-
-### Run the Deployment
-1. Repeat step 1 to 4 to deploy Frontend, Processor, Router and 2 Workers as decode worker
-2. Go to Node 3 and start the prefill worker.
-```
-export NATS_SERVER="nats://Node_1_IP_ADDRESS:4222"
-export ETCD_ENDPOINTS="http://Node_1_IP_ADDRESS:2379"
-cd dynamo/examples/hello_world/disagg_skeleton
-dynamo serve components.prefill_worker:PrefillWorker
-```
-3. Query the Frontend. This time decode workers push requests to the prefill queue, and prefill worker pulles task from the queue to simulate the prefill task. The actual prefill is skipped in this demo.
-```
-curl -X 'POST' \
-  'http://localhost:8000/generate' \
-  -H 'accept: text/event-stream' \
-  -H 'Content-Type: application/json' \
-  -d '{
-  "prompt": "This is prefill disagg serving example",
-  "request_id":"12345"
-}'
-```
--- a/docs/examples/llm_deployment.md
+++ b/docs/examples/llm_deployment.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# LLM Deployment Examples
-
-This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.
-
-## Use the Latest Release
-
-We recommend using the latest stable release of dynamo to avoid breaking changes:
-
-[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
-
-You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
-
-```bash
-git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
-```
-
-## Components
-
- workers: Prefill and decode worker handles actual LLM inference
- router: Handles API requests and routes them to appropriate workers based on specified strategy
- frontend: OpenAI compatible http server handles incoming requests
-
-## Deployment Architectures
-
-### Aggregated
-Single-instance deployment where both prefill and decode are done by the same worker.
-
-### Disaggregated
-Distributed deployment where prefill and decode are done by separate workers that can scale independently.
-
-```mermaid
-sequenceDiagram
-    participant D as VllmWorker
-    participant Q as PrefillQueue
-    participant P as PrefillWorker
-
-    Note over D: Request is routed to decode
-    D->>D: Decide if prefill should be done locally or remotely
-
-        D->>D: Allocate KV blocks
-        D->>Q: Put RemotePrefillRequest on the queue
-
-        P->>Q: Pull request from the queue
-        P-->>D: Read cached KVs from Decode
-
-        D->>D: Decode other requests
-        P->>P: Run prefill
-        P-->>D: Write prefilled KVs into allocated blocks
-        P->>D: Send completion notification
-        Note over D: Notification received when prefill is done
-        D->>D: Schedule decoding
-```
-
-## Getting Started
-
-1. Choose a deployment architecture based on your requirements
-2. Configure the components as needed
-3. Deploy using the provided scripts
-
-### Prerequisites
-
-Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml)
-```bash
-docker compose -f deploy/metrics/docker-compose.yml up -d
-```
-
-### Build the container image for your platform
-
-```bash
-# On an x86 machine
-./container/build.sh --framework VLLM
-
-# On an ARM machine (ex: GB200)
-./container/build.sh --framework VLLM --platform linux/arm64
-```
-
-```{note}
-Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to have performance issues to require extensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
-
-You can tune the number of parallel build jobs for building VLLM from source
-on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
-
-For example, on an ARM machine with low system resources:
-`./container/build.sh --framework VLLM --platform linux/arm64 --build-arg VLLM_MAX_JOBS=2`
-
-For example, on a GB200 which has very high CPU cores and memory resource:
-`./container/build.sh --framework VLLM --platform linux/arm64 --build-arg VLLM_MAX_JOBS=64`
-
-When vLLM has pre-built ARM wheels published, this process can be improved.
-
-You can tune the number of parallel build jobs for building VLLM from source
-on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
-
-For example, on an ARM machine with low system resources:
-`./container/build.sh --framework VLLM --platform linux/arm64 --build-arg VLLM_MAX_JOBS=2`
-
-For example, on a GB200 which has very high CPU cores and memory resource:
-`./container/build.sh --framework VLLM --platform linux/arm64 --build-arg VLLM_MAX_JOBS=64`
-
-When vLLM has pre-built ARM wheels published, this process can be improved.
-```
-### Run the container you have built
-
-```
-./container/run.sh -it --framework VLLM
-```
-
-## Run Deployment
-
-This figure shows an overview of the major components to deploy:
-
-```
-                                                 +----------------+
-                                          +------| prefill worker |-------+
-                                   notify |      |                |       |
-                                 finished |      +----------------+       | pull
-                                          v                               v
-+------+      +-----------+      +------------------+    push     +---------------+
-| HTTP |----->| processor |----->| decode/monolith  |------------>| prefill queue |
-|      |<-----|           |<-----|      worker      |             |               |
-+------+      +-----------+      +------------------+             +---------------+
-                  |    ^                  |
-       query best |    | return           | publish kv events
-           worker |    | worker_id        v
-                  |    |         +------------------+
-                  |    +---------|     kv-router    |
-                  +------------->|                  |
-                                 +------------------+
-
-```
-
-```{note}
-The planner component is enabled by default for all deployment architectures but is set to no-op mode. This means the planner observes metrics but doesn't take scaling actions. To enable active scaling, you can add `--Planner.no-operation=false` to your `dynamo serve` command.
-For more details, see [Planner Architecture Overview](../architecture/planner_intro.rst).
-```
--- a/docs/examples/multinode.md
+++ b/docs/examples/multinode.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Multinode Examples
-
-## Use the Latest Release
-
-We recommend using the latest stable release of dynamo to avoid breaking changes:
-
-[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
-
-You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
-
-```bash
-git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
-```
-
-## Single node sized models
-You can deploy dynamo on multiple nodes via NATS/ETCD based discovery and communication. Here's an example of deploying disaggregated serving on 3 nodes using `nvidia/Llama-3.1-405B-Instruct-FP8`. Each node must be properly configured with Infiniband and/or RoCE for communication between decode and prefill workers.
-
-##### Disaggregated Deployment with KV Routing
- Node 1: Frontend, Processor, Router, Decode Worker
- Node 2: Prefill Worker
- Node 3: Prefill Worker
-
-Note that this can be easily extended to more nodes. You can also run the Frontend, Processor, and Router on a separate CPU only node if you'd like as long as all nodes have access to the NATS/ETCD endpoints!
-
-**Step 1**: Start NATS/ETCD on your head node. Ensure you have the correct firewall rules to allow communication between the nodes the NATS/ETCD endpoints must be accessible by all other nodes.
-```bash
-# node 1
-docker compose -f deploy/metrics/docker-compose.yml up -d
-```
-
-**Step 2**: Create the inference graph for this node. Here we use the `agg_router.py` (even though we are doing disaggregated serving) graph because we want the `Frontend`, `Processor`, `Router`, and `VllmWorker` to spin up (we spin up the other decode worker and prefill worker separately on different nodes later).
-
-```python
-# graphs/agg_router.py
-Frontend.link(Processor).link(Router).link(VllmWorker)
-```
-
-**Step 3**: Create a configuration file for this node. We've provided a sample one for you in `configs/multinode-405b.yaml` for the 405B model. Note that we still include the `PrefillWorker` component in the configuration file even though we are not using it on node 1. This is because we can reuse the same configuration file on all nodes and just spin up individual workers on the other ones.
-
-**Step 4**: Start the frontend, processor, router, and VllmWorker on node 1.
-```bash
-# node 1
-cd $DYNAMO_HOME/examples/llm
-dynamo serve graphs.agg_router:Frontend -f ./configs/multinode-405b.yaml
-```
-
-**Step 5**: Start the first prefill worker on node 2.
-Since we only want to start the `PrefillWorker` on node 2, you can simply run just the PrefillWorker component directly with the configuration file from before.
-
-```bash
-# node 2
-export NATS_SERVER = '<your-nats-server-address>' # note this should start with nats://...
-export ETCD_ENDPOINTS = '<your-etcd-endpoints-address>'
-
-cd $DYNAMO_HOME/examples/llm
-dynamo serve components.prefill_worker:PrefillWorker -f ./configs/multinode-405b.yaml
-```
-
-**Step 6**: Start the second prefill worker on node 3.
-```bash
-# node 3
-export NATS_SERVER = '<your-nats-server-address>' # note this should start with nats://...
-export ETCD_ENDPOINTS = '<your-etcd-endpoints-address>'
-
-cd $DYNAMO_HOME/examples/llm
-dynamo serve components.prefill_worker:PrefillWorker -f ./configs/multinode-405b.yaml
-```
-
-**Step 7**: [Optional] Start more decode workers on other nodes
-This example can be extended to more nodes as well. For example, if you want to spin up another decode worker, you can use
-```bash
-# node X
-export NATS_SERVER = '<your-nats-server-address>' # note this should start with nats://...
-export ETCD_ENDPOINTS = '<your-etcd-endpoints-address>'
-
-cd $DYNAMO_HOME/examples/llm
-dynamo serve components.worker:VllmWorker -f ./configs/multinode-405b.yaml --service-name VllmWorker
-```
-
-Note the use of `--service-name`. This only spins up the worker that you are requesting and ignore any `depends` statements.
-
-###### Client
-
-In another terminal:
-```bash
-# this test request has around 200 tokens isl
-
-curl <node1-ip>:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -H "Accept: text/event-stream" \
-  -d '{
-    "model": "nvidia/Llama-3.1-405B-Instruct-FP8",
-    "messages": [
-      {
-        "role": "user",
-        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
-      }
-    ],
-    "stream": true,
-    "max_tokens": 300
-  }'
-```
-
-#### Multi-node sized models
-
-Multinode model support is coming soon. You can track progress [here](https://github.com/ai-dynamo/dynamo/issues/513)!
-
-##### Aggregated Deployment
-
-The steps for aggregated deployment of multi-node sized models are similar to
-those for single-node sized models, except that you first need to configure the nodes
-to be interconnected according to the framework's multi-node deployment guide.
-In the example below, vLLM is used as the framework to serve the `DeepSeek-R1` model
-with tensor parallelism of 16 on two H100x8 nodes.
-
-**Step 1**: On each node, set up a Ray cluster so that vLLM can access the resources
-collectively:
-```bash
-# head node
-ray start --head --port=6379
-
-# example output and keep note of the IP address of the head node
-# Local node IP: <head-node-address>
-
-# set vLLM env arg
-export VLLM_HOST_IP=<head-node-address>
-
-# other node
-ray start  --address=<head-node-address>:6379
-export VLLM_HOST_IP=<current-node-address>
-
-# verify the accessibility by checking aggregated GPU count shown in ray status
-ray status
-
-# Expected/Sample output for 2 nodes:
-# ======== Autoscaler status: 2025-04-16 15:35:42.751688 ========
-# Node status
-# ---------------------------------------------------------------
-# Active:
-#  1 node_<hash_1>
-#  1 node_<hash_2>
-# Pending:
-#  (no pending nodes)
-# Recent failures:
-#  (no failures)
-# Resources
-# ---------------------------------------------------------------
-# Usage:
-# XXX CPU
-# XXX GPU
-# XXX memory
-# XXX object_store_memory
-# Demands:
-#  (no resource demands)
-```
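The verification step above relies on reading the aggregated GPU count from `ray status` by eye. As a sketch, the count could be extracted with a small filter. The helper name `ray_total_gpus` is hypothetical, and the sketch assumes the resource line has the form `<used>/<total> GPU` (for example `0.0/16.0 GPU`), which may differ between Ray versions:

```bash
# Hypothetical helper: reads `ray status` text on stdin and prints the
# total GPU count. Assumes resource lines look like "0.0/16.0 GPU".
ray_total_gpus() {
  awk '$2 == "GPU" { split($1, parts, "/"); print int(parts[2]); exit }'
}

# Example use on a node that has joined the cluster:
#   [ "$(ray status | ray_total_gpus)" -eq 16 ] || echo "expected 16 GPUs" >&2
```

For the two H100x8 nodes in this example, the expected total would be 16.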
-
-**Step 2**: On the head node, follow the [LLM deployment examples](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/README.md) to
-set up the Dynamo deployment for aggregated serving, using the configuration file
-`configs/multinode_agg_r1.yaml` for DeepSeek-R1:
-```bash
-cd $DYNAMO_HOME/examples/llm
-dynamo serve graphs.agg:Frontend -f ./configs/multinode_agg_r1.yaml
-```
-
-###### Client
-
-In another terminal, you can send the same curl request as described above but
-with `"model": "deepseek-ai/DeepSeek-R1"`
-```bash
-# this test request has around 200 tokens isl
-
-curl <node1-ip>:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -H "Accept: text/event-stream" \
-  -d '{
-    "model": "deepseek-ai/DeepSeek-R1",
-    "messages": [
-      {
-        "role": "user",
-        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
-      }
-    ],
-    "stream": true,
-    "max_tokens": 300
-  }'
-```
-
-##### Disaggregated Deployment
-
-In this example, we deploy two replicas of the model (one prefill worker
-and one decode worker). We use four H100x8 nodes and group every two of them
-into one Ray cluster in the same way as described in aggregated deployment.
-However, we run the etcd and NATS servers on only one node,
-which we treat as the head node of the whole deployment.
-
-Note that if you start the etcd server directly instead of using `docker compose`,
-you should pass additional arguments so that it is discoverable from other nodes.
-```bash
-etcd --advertise-client-urls http://<head-node-ip>:2379 --listen-client-urls http://<head-node-ip>:2379,http://127.0.0.1:2379
-```
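The etcd command above repeats the head node IP in three places. A small sketch that builds the same two flags from a single IP argument follows; the helper name `etcd_client_flags` is hypothetical, and port 2379 is taken from the command above:

```bash
# Hypothetical helper: prints the etcd client flags from the command above
# for the head node IP given as the first argument (client port 2379).
etcd_client_flags() {
  printf '%s %s %s %s\n' \
    --advertise-client-urls "http://$1:2379" \
    --listen-client-urls "http://$1:2379,http://127.0.0.1:2379"
}

# Example use on the head node:
#   etcd $(etcd_client_flags 10.0.0.5)
```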
-
-**Step 1**: On each pair of nodes, set up a Ray cluster as described in
-[aggregated deployment](#aggregated-deployment). After that, you should have
-two independent Ray clusters, each with access to 16 GPUs.
-
-**Step 2**: Start the deployment by running different flavors of `dynamo serve`
-on one node of each Ray cluster, using the configuration file
-`configs/mutinode_disagg_r1.yaml`.
-
-For decode, use the command below; this node is the entry point of
-the whole deployment. In other words, requests should be sent to the IP
-of this node.
-```bash
-# if not head node
-export NATS_SERVER='nats://<nats-server-ip>:4222'
-export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
-
-cd $DYNAMO_HOME/examples/llm
-dynamo serve graphs.agg:Frontend -f ./configs/mutinode_disagg_r1.yaml
-```
-
-For prefill:
-```bash
-# if not head node
-export NATS_SERVER='nats://<nats-server-ip>:4222'
-export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
-
-cd $DYNAMO_HOME/examples/llm
-dynamo serve components.prefill_worker:PrefillWorker -f ./configs/mutinode_disagg_r1.yaml
-```
-
-###### Client
-
-In another terminal, you can send the same curl request as described in
-[aggregated deployment](#aggregated-deployment), addressing it to the IP of
-the decode node.
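Because the same request body is reused across deployments with only the model name changed, the payload could be generated rather than copied. The following is a sketch only: `chat_payload` is a hypothetical helper, and it assumes the model name and prompt contain no double quotes or backslashes, since it performs no JSON escaping:

```bash
# Hypothetical helper: prints a chat completions request body for the given
# model ($1) and prompt ($2). No JSON escaping is done, so both arguments
# must be free of double quotes and backslashes.
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "stream": true, "max_tokens": 300}\n' "$1" "$2"
}

# Example use, with the decode node IP left as a placeholder:
#   curl <decode-node-ip>:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$(chat_payload deepseek-ai/DeepSeek-R1 "Hello")"
```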
--- a/docs/examples/trtllm.md
+++ b/docs/examples/trtllm.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# LLM Deployment Examples using TensorRT-LLM
-
-This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
-
-## Use the Latest Release
-
-We recommend using the latest stable release of dynamo to avoid breaking changes:
-
-[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
-
-You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
-
-```bash
-git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
-```
-
-## Deployment Architectures
-
-See [Deployment Architectures](llm_deployment.md#deployment-architectures) to learn about the general idea of the architecture.
-Note that this TensorRT-LLM version does not support all the options yet.
-
-```{note}
-TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can only configure the deployment to always use aggregated or disaggregated serving.
-```
-
-## Getting Started
-
-1. Choose a deployment architecture based on your requirements
-2. Configure the components as needed
-3. Deploy using the provided scripts
-
-### Prerequisites
-
-Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml)
-```bash
-docker compose -f deploy/metrics/docker-compose.yml up -d
-```
-
-### Build docker
-
-#### Step 1: Build TensorRT-LLM base container image
-
-Because of the known issue of C++11 ABI compatibility within the NGC pytorch container, we rebuild TensorRT-LLM from source.
-See [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html) for more information.
-
-Use the helper script to build a TensorRT-LLM container base image. The script uses a specific commit id from TensorRT-LLM main branch.
-
-```bash
-# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
-apt-get update && apt-get -y install git git-lfs
-
-# The script uses python packages like docker-squash to squash image
-# layers within trtllm base image
-DEBIAN_FRONTEND=noninteractive TZ=America/Los_Angeles apt-get -y install python3 python3-pip python3-venv
-
-./container/build_trtllm_base_image.sh
-```
-
-See [here](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#option-1-build-tensorrt-llm-in-one-step) for more details on building from source.
-If you already have a TensorRT-LLM container image, you can skip this step.
-
-#### Step 2: Build the Dynamo container
-
-```
-# On an x86 machine:
-./container/build.sh --framework tensorrtllm
-
-# On an ARM machine:
-./container/build.sh --framework tensorrtllm --platform linux/arm64
-```
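The two commands above differ only in the `--platform` argument. As a sketch, that argument could be derived from the machine name printed by `uname -m`; the helper name `platform_args` is hypothetical, and treating both `aarch64` and `arm64` as ARM is an assumption about what `uname -m` reports on the target machines:

```bash
# Hypothetical helper: maps a machine name (as printed by `uname -m`) to the
# extra build.sh arguments shown above. Prints nothing for x86_64.
platform_args() {
  case "$1" in
    aarch64|arm64) echo "--platform linux/arm64" ;;
    x86_64) ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Example use:
#   ./container/build.sh --framework tensorrtllm $(platform_args "$(uname -m)")
```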
-
-This build script internally points to the base container image built in step 1. If you skipped the previous step because you already have the container image available, you can run the build script with that image as a base.
-
-
-```bash
-# Build dynamo image with other TRTLLM base image.
-./container/build.sh --framework TENSORRTLLM --base-image <trtllm-base-image> --base-image-tag <trtllm-base-image-tag>
-```
-
-### Run container
-
-```
-./container/run.sh --framework tensorrtllm -it
-```
-## Run Deployment
-
-This figure shows an overview of the major components to deploy:
-
-```
-
-+------+      +-----------+      +------------------+             +---------------+
-| HTTP |----->| processor |----->|      Worker      |------------>|     Prefill   |
-|      |<-----|           |<-----|                  |<------------|     Worker    |
-+------+      +-----------+      +------------------+             +---------------+
-                  |    ^                  |
-       query best |    | return           | publish kv events
-           worker |    | worker_id        v
-                  |    |         +------------------+
-                  |    +---------|     kv-router    |
-                  +------------->|                  |
-                                 +------------------+
-
-```
-
-```{note}
-The above architecture illustrates all the components. The final components that get spawned depend upon the chosen graph.
-```
-
-### Example architectures
-
-#### Aggregated serving
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
-```
-
-#### Aggregated serving with KV Routing
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.agg_router:Frontend -f ./configs/agg_router.yaml
-```
-
-#### Disaggregated serving
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
-```
-
-We are defining TRTLLM_USE_UCX_KVCACHE so that TRTLLM uses UCX for transferring the KV
-cache between the context and generation workers.
-
-#### Disaggregated serving with KV Routing
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.disagg_router:Frontend -f ./configs/disagg_router.yaml
-```
-
-We are defining TRTLLM_USE_UCX_KVCACHE so that TRTLLM uses UCX for transferring the KV
-cache between the context and generation workers.
-
-### Client
-
-See the [client](llm_deployment.md#client) section to learn how to send requests to the deployment.
-
-### Close deployment
-
-See [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) section to learn about how to close the deployment.
-
-Remaining tasks:
-
- [x] Add support for the disaggregated serving.
- [ ] Add integration test coverage.
- [ ] Add instructions for benchmarking.
- [ ] Add multi-node support.
- [ ] Merge the code base with llm example to reduce the code duplication.
- [ ] Use processor from dynamo-llm framework.
- [ ] Enable NIXL integration with TensorRT-LLM once available. Currently, TensorRT-LLM uses UCX to transfer KV cache.
--- a/docs/get_started.md
+++ b/docs/get_started.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Getting Started
-
-
-## Development Environment
-
-This section describes how to set up your development environment.
-
-### Recommended Setup: Using Dev Container
-
-We recommend using our pre-configured development container:
-
-1. Install prerequisites:
-
-   - [Docker](https://www.docker.com/products/docker-desktop)
-   - [Visual Studio Code](https://code.visualstudio.com/)
-   - [Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
-
-2. Get the code:
-
-   ```bash
-   git clone https://github.com/ai-dynamo/dynamo.git
-   cd dynamo
-   ```
-
-3. Open in Visual Studio Code:
-
-   1. Launch Visual Studio Code
-   2. Click the button in the bottom-left corner
-   3. Select **Reopen in Container**
-
-Visual Studio Code builds and starts a container with all necessary dependencies for Dynamo development.
-
-### Alternative Setup: Manual Installation
-
-If you don't want to use the dev container, you can set the environment up manually:
-
-1. Ensure you have:
-
-   - Ubuntu 24.04 (recommended)
-   - x86_64 CPU
-   - Python 3.x
-   - Git
-
-   See [Support Matrix](support_matrix.md) for more information.
-
-2. **If you plan to use vLLM or SGLang**, you must also install:
-   - etcd
-   - NATS.io
-
-   Before starting dynamo, run both etcd and NATS.io in separate processes.
-
-3. Install required system packages:
-   ```bash
-   apt-get update
-   DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0
-   ```
-
-4. Set up the Python environment:
-   ```bash
-   python3 -m venv venv
-   source venv/bin/activate
-   ```
-
-5. Install Dynamo:
-   ```bash
-   pip install "ai-dynamo[all]"
-   ```
-
-> [!Important]
-> To ensure compatibility, use the examples in the release branch or tag that matches the version you installed.
-
-
-## Building the Dynamo Base Image
-
-Deploying your Dynamo pipelines to Kubernetes requires you to build and push a Dynamo base image to your container registry.
-You can use any private container registry of your choice, including:
-
- [Docker Hub](https://hub.docker.com/)
- [NVIDIA NGC Container Registry](https://catalog.ngc.nvidia.com/)
-
-
-To build it:
-
-```bash
-./container/build.sh
-docker tag dynamo:latest-vllm <your-registry>/dynamo-base:latest-vllm
-docker login <your-registry>
-docker push <your-registry>/dynamo-base:latest-vllm
-```
-
-This documentation describes these frameworks:
-
- `--framework vllm` build:
-   See [LLM Deployment Examples](examples/llm_deployment.md).
-
- `--framework tensorrtllm` build:
-   See [TRTLLM Deployment Examples](examples/trtllm.md).
-
-After building, use this image by setting the `DYNAMO_IMAGE` environment variable to point to your built image:
-
-```bash
-export DYNAMO_IMAGE=<your-registry>/dynamo-base:latest-vllm
-```
-
-
-## Running and Interacting with an LLM Locally
-
-Dynamo supports several backends, including `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
-Use the example commands below to launch a model.
-
-### Example Command
-
-```bash
-python -m dynamo.frontend [--http-port 8080]
-python -m dynamo.vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-```
-
-```bash
-? User › Hello, how are you?
-✔ User · Hello, how are you?
-Okay, so I'm trying to figure out how to respond to the user's greeting.
-They said, "Hello, how are you?" and then followed it with "Hello! I'm just a program, but thanks for asking."
-Hmm, I need to come up with a suitable reply. ...
-```
-
-
-## LLM Serving
-
-Dynamo provides a simple way to spin up a local set of inference components including:
-
- **OpenAI-compatible Frontend**:
-   High-performance OpenAI-compatible HTTP API server written in Rust.
-
- **Basic and Kv Aware Router**:
-   Route and load balance traffic to a set of workers.
-
- **Workers**:
-   Set of pre-configured LLM serving engines.
-
-To run a minimal configuration, use a pre-configured example.
-
-### Start Dynamo Distributed Runtime Services
-
-To start the Dynamo Distributed Runtime services the first time:
-
-```bash
-docker compose -f deploy/docker-compose.yml up -d
-```
-
-### Start Dynamo LLM Serving Components
-
-[Explore the VLLM Example](../components/backends/vllm/README.md)
-
-
-## Local Development
-
-If you use VS Code or Cursor, use the `.devcontainer` folder built on [Microsoft's Extension](https://code.visualstudio.com/docs/devcontainers/containers).
-For instructions, see the Dynamo repository's [devcontainer README](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md).
-
-Otherwise, to develop locally, we recommend working inside of the container:
-
-```bash
-./container/build.sh
-./container/run.sh -it --mount-workspace
-
-cargo build --release
-mkdir -p /workspace/deploy/dynamo/sdk/src/dynamo/sdk/cli/bin
-cp /workspace/target/release/dynamo-run /workspace/deploy/dynamo/sdk/src/dynamo/sdk/cli/bin
-
-uv pip install -e .
-export PYTHONPATH=$PYTHONPATH:/workspace/deploy/dynamo/sdk/src:/workspace/components/planner/src
-```
-
-### Conda Environment
-
-Alternatively, use a Conda environment:
-
-```bash
-conda activate <ENV_NAME>
-
-pip install nixl # Or install https://github.com/ai-dynamo/nixl from source
-
-cargo build --release
-
-# To install ai-dynamo-runtime from source
-cd lib/bindings/python
-pip install .
-
-cd ../../../
-pip install .[all]
-
-# To test
-docker compose -f deploy/docker-compose.yml up -d
-python -m dynamo.frontend [--http-port 8080]
-python -m dynamo.vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-```
--- a/docs/guides/backend.md
+++ b/docs/guides/backend.md
--- a/docs/guides/dynamo_deploy/dynamo_operator.md
+++ b/docs/guides/dynamo_deploy/dynamo_operator.md
@@ -137,7 +137,6 @@ This section describes how to use FluxCD for GitOps-based deployment of Dynamo i
 - A Kubernetes cluster with [Dynamo Cloud](dynamo_cloud.md) installed
 - [FluxCD](https://fluxcd.io/flux/installation/) installed in your cluster
 - A Git repository to store your deployment configurations
- [Dynamo CLI](../../get_started.md#alternative-setup-manual-installation) installed locally

 ### Workflow Overview


--- a/docs/guides/planner_benchmark/disagg_1p1d.yml
+++ b/docs/guides/planner_benchmark/disagg_1p1d.yml
--- a/docs/guides/planner_benchmark/disagg_2p2d.yaml
+++ b/docs/guides/planner_benchmark/disagg_2p2d.yaml
--- a/docs/guides/planner_benchmark/README.md
+++ b/docs/guides/planner_benchmark/README.md
 <!--
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
 -->

 # Planner Benchmark Example
@@ -50,8 +38,8 @@ For other models and GPU SKUs, adjust the request rate ranges accordingly to mat
 To measure the performance of dynamo with planner, we start from a 1p1d deployment and set planner to make adjustments every 10 seconds:

 ```bash
-cd examples/llm
-dynamo serve graphs.disagg_router:Frontend -f disagg_1p1d.yml
+# Start Kubernetes with one frontend node, one prefill and one decode worker
+# TODO

 # in terminal 2
 genai-perf profile \
@@ -84,7 +72,8 @@ In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--n

 ```bash
 # in terminal 1
-dynamo serve graphs.disagg_router:Frontend -f disagg_2p2d.yml
+# Start Kubernetes with one frontend node, two prefill and two decode workers
+# TODO

 # in terminal 2
 genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8000 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl

--- a/docs/index.rst
+++ b/docs/index.rst
@@ -75,7 +75,6 @@ The examples below assume you build the latest image yourself from source. If us

   Welcome to Dynamo <self>
   Support Matrix <support_matrix.md>
-   Getting Started <get_started.md>

 .. toctree::
   :hidden:

--- a/docs/support_matrix.md
+++ b/docs/support_matrix.md
@@ -2,18 +2,6 @@
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
 All rights reserved.
 SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
 -->

 # Dynamo Support Matrix
@@ -72,7 +60,6 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 | :----------------- | :------------ | :----------------------------------- | :----------- |
 | ai-dynamo          | 0.3.2         | >=2.28                               |              |
 | ai-dynamo-runtime  | 0.3.2         | >=2.28 (Python 3.12 has known issues)|              |
-| ai-dynamo-vllm     | 0.8.4.post4¹  | >=2.28 (recommended)                 |              |
 | NIXL               | 0.4.0         | >=2.27                               | >=11.8       |

 ### Build Dependency
@@ -80,13 +67,10 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 | **Build Dependency** | **Version**                                                                      |
 | :------------------- | :------------------------------------------------------------------------------- |
 | **Base Container**   | [25.03](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda-dl-base/tags) |
-| **ai-dynamo-vllm**   | 0.8.4.post4¹                                                                     |
 | **TensorRT-LLM**     | 1.0.0rc²                                                                         |
 | **NIXL**             | 0.4.0                                                                            |

 > [!Important]
-> ¹ ai-dynamo-vllm `v0.8.4.post4` is a customized patch of `v0.8.4` from vLLM.
->
 > ² Specific versions of TensorRT-LLM supported by Dynamo are subject to change.

 ## Build Support

--- a/lib/llm/src/discovery/watcher.rs
+++ b/lib/llm/src/discovery/watcher.rs
@@ -178,7 +178,6 @@ impl ModelWatcher {
                Some(card)
            }
            Err(err) => {
-                // `dynamo serve` isn't using MDC yet so can't be an error
                tracing::info!(%err, "load_mdc did not complete");
                None
            }