docs: Add sphinx-theme based userguides (#528)

Signed-off-by: Suman Tatiraju <167138127+statiraju@users.noreply.github.com>
Signed-off-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Kristen Kelleher <kkelleher@nvidia.com>
Co-authored-by: Suman Tatiraju <statiraju@statiraju-mlt.client.nvidia.com>
Co-authored-by: Hannah Zhang <hannahz@nvidia.com>

Unverified Commit 8d636ebd authored May 21, 2025 by Suman Tatiraju Committed by GitHub May 21, 2025
20 changed files
--- a/.github/workflows/generate-docs.yml
+++ b/.github/workflows/generate-docs.yml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: Generate Documentation
+
+on:
+  push:
+    branches:
+      - main
+      - release/*
+  pull_request:
+    paths:
+      - 'docs/**'
+      - 'container/Dockerfile.docs'
+      - '.github/workflows/generate-docs.yml'
+
+jobs:
+  build-docs:
+    name: Build Documentation
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Generate documentation
+        run: |
+          docker build -t docs-builder -f container/Dockerfile.docs .
+
+      - name: Copy documentation out of container
+        run: |
+          docker create --name docs-container docs-builder
+          docker cp docs-container:/workspace/dynamo/docs/build/html dynamo-docs/
+
+      - name: Remove documentation container
+        if: always()
+        run: |
+          docker rm docs-container || true
+
+      - name: Upload documentation artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: dynamo-docs-${{ github.run_id }}
+          path: dynamo-docs
+          retention-days: 15
\ No newline at end of file
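
The workflow's build can be reproduced locally by running the same four Docker commands. The following is a minimal sketch and not part of this commit: the function names `docs_build_commands` and `run` are hypothetical, while the image name, container name, and paths are taken from the workflow above. As written it only prints the commands; executing them requires Docker and a checkout of the repository.

```python
import shlex
import subprocess


def docs_build_commands(image="docs-builder", container="docs-container",
                        out_dir="dynamo-docs/"):
    """Return the docker invocations used by the workflow, in order."""
    return [
        ["docker", "build", "-t", image, "-f", "container/Dockerfile.docs", "."],
        ["docker", "create", "--name", container, image],
        ["docker", "cp", f"{container}:/workspace/dynamo/docs/build/html", out_dir],
        ["docker", "rm", container],
    ]


def run(commands):
    """Run build, create, and copy; always attempt the cleanup step last."""
    *build_steps, cleanup = commands
    try:
        for argv in build_steps:
            subprocess.run(argv, check=True)
    finally:
        # Mirrors `docker rm docs-container || true` with `if: always()`.
        subprocess.run(cleanup, check=False)


commands = docs_build_commands()
for argv in commands:
    print(shlex.join(argv))
```

Calling `run(commands)` from the repository root would execute the steps; the cleanup runs even when an earlier step fails, matching the workflow's `if: always()` condition.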
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ limitations under the License.
 [![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
 [![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/nvidia-dynamo)

-| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Support Matrix](support_matrix.md)** | **[Guides](docs/guides)** | **[Architecture and Features](docs/architecture.md)** | **[APIs](lib/bindings/python/README.md)** | **[SDK](deploy/sdk/README.md)** |
+| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Support Matrix](docs/support_matrix.md)** | **[Guides](docs/guides)** | **[Architecture and Features](docs/architecture/architecture.md)** | **[APIs](lib/bindings/python/README.md)** | **[SDK](deploy/dynamo/sdk/README.md)** |

 ### 📢 **Please join us for our** [ **first Dynamo in-person meetup with vLLM and SGLang leads**](https://events.nvidia.com/nvidiadynamousermeetups) **on 6/5 (Thu) in SF!** ###

@@ -38,7 +38,7 @@ Built in Rust for performance and in Python for extensibility, Dynamo is fully o
 ### Installation

 The following examples require a few system level packages.
-Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [support_matrix.md](support_matrix.md)
+Recommended to use Ubuntu 24.04 with an x86_64 CPU. See [docs/support_matrix.md](docs/support_matrix.md)

 ```
 apt-get update

--- a/container/Dockerfile.docs
+++ b/container/Dockerfile.docs
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM ubuntu:24.04
+
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        build-essential \
+        curl \
+        doxygen \
+        pandoc \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace/dynamo
+
+ENV VIRTUAL_ENV=/workspace/dynamo/.venv
+RUN uv venv $VIRTUAL_ENV --python 3.12 && \
+    uv pip install ablog \
+      attrs  \
+      breathe \
+      docutils \
+      exhale \
+      httplib2 \
+      ipython \
+      myst-nb \
+      nbclient \
+      nbsphinx \
+      nvidia-sphinx-theme \
+      sphinx \
+      sphinx-book-theme \
+      sphinx-copybutton \
+      sphinx-design \
+      sphinx-prompt \
+      sphinx-sitemap \
+      sphinx-tabs \
+      sphinxcontrib-bibtex \
+      sphinxcontrib-mermaid
+
+# Set visitor script to be included on every HTML page
+ENV VISITS_COUNTING_SCRIPT="//assets.adobedtm.com/b92787824f2e0e9b68dc2e993f9bd995339fe417/satelliteLib-7ba51e58dc61bcb0e9311aadd02a0108ab24cc6c.js"
+
+COPY . /workspace/dynamo
+
+RUN . .venv/bin/activate && \
+    python3 docs/generate_docs.py
\ No newline at end of file
--- a/deploy/sdk/docs/cli/README.md
+++ b/deploy/sdk/docs/cli/README.md
-# Dynamo CLI Documentation
-The Dynamo CLI is a powerful tool for serving, containerizing, and deploying Dynamo applications. It leverages core pieces of the BentoML deployment stack and provides a range of commands to manage your Dynamo services.
-
-Overview
-At a high level, the Dynamo CLI allows you to:
- `run` - quickly chat with a model
- `serve` - run a set of services locally (via `depends()` or `.link()`)
- `build` - create an archive of your services (called a `bento`)
-
-# Commands
-
-## `run`
-
-The `run` command allows you to quickly chat with a model. Under the hood - it is running the `dynamo-run` Rust binary. You can find the arguments that it takes here: [dynamo-run docs](../../../../../launch/README.md)
-
-**Example**
-```bash
-dynamo run deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
-```
-
-## `serve`
-
-The `serve` command lets you run a defined inference graph locally. You must point toward your file and intended class using file:Class syntax
-
-**Usage**
-```bash
-dynamo serve [SERVICE]
-```
-
-**Arguments**
- `SERVICE` - The service to start. You use file:Class syntax to specify the service.
-
-**Flags**
- `--file`/`-f` - Path to optional YAML configuration file. An example of the YAML file can be found in the configuration section of the [SDK docs](../sdk/README.md)
- `--dry-run` - Print out the dependency graph and values without starting any services.
- `--service-name` - Only serve the specified service name. The rest of the discoverable components in the graph are not started.
- `--working-dir` - Specify the directory to find the Service instance
- Any additional flags that follow Class.key=value will be passed to the service constructor for the target service and parsed. Please see the configuration section of the [SDK docs](../sdk/README.md) for more details.
-
-**Example**
-```bash
-cd examples
-# Spin up Frontend, Middle, and Backend components
-dynamo serve hello_world:Frontend
-
-# Spin up only the Middle component in the graph that is discoverable from the Frontend service
-dynamo serve  --service-name Middle hello_world:Frontend
-```
-
-## `build`
-
-The `build` commmand allows you to package up your inference graph and its dependancies and create an archive of it. This is commonly paired with the `--containerize` flag to create a single docker container that runs your inference graph. As with `serve`, you point toward the first service in your dependency graph.
-
-**Usage**
-```bash
-dynamo build [SERVICE]
-```
-
-**Arguments**
- `SERVICE` - The service to build. You use file:Class syntax to specify the service.
-
-**Flags**
- `--working-dir` - Specify the directory to find the Service instance
- `--containerize` - Whether to containerize the Bento after building
-
-**Example**
-```bash
-cd examples/hello_world
-dynamo build hello_world:Frontend
-```
--- a/deploy/sdk/docs/cli/README.md
+++ b/deploy/sdk/docs/cli/README.md
+../../../../docs/guides/cli_overview.md
\ No newline at end of file
--- a/deploy/sdk/docs/sdk/README.md
+++ b/deploy/sdk/docs/sdk/README.md
-# Documentation for the Dynamo SDK
-
-# Table of Contents
-
- [Introduction](#introduction)
- [Installation](#installation)
- [Core Concepts](#core-concepts)
- [Writing a Service](#writing-a-service)
- [Configuring a Service](#configuring-a-service)
- [Composing Services into an Graph](#composing-services-into-an-graph)
-
-# Introduction
-
-Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. Docs and examples for those can be found [here](../../../../../README.md).
-
-Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns and leverages many of its core primitives. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example [here](../../README.md).
-
-# Installation
-
-The SDK can be installed using pip:
-
-```bash
-pip install ai-dynamo
-```
-
-# Core Concepts
-As you read about each concept, it is helpful to have the [basic example](../../README.md) up as well so you can refer back to it.
-
-## Defining a Service
-
-A Service is a core building block for a project. You can think of it as a logical unit of work. For example, you might have a service responsible for preprocessing and tokenizing and another service running the model worker itself. You define a service using the `@service` decorator on a class.
-
-```python
-@service(
-    dynamo={
-        "namespace": "dynamo",
-    },
-    resources={"gpu": 2, "cpu": "10", "memory": "20Gi"},
-    workers=1,
-)
-```
-
-Key configuration options:
-1. `dynamo`: Dictionary that defines the Dynamo configuration and enables/disables it. When enabled, a dynamo worker is created under the hood which can register with the [Distributed Runtime](../../../../../docs/architecture.md)
-2. `resources`: Dictionary defining resource requirements. Used primarily when deploying to K8s, but gpu is also used for local execution.
-3. `workers`: Number of parallel instances of the service to spin up.
-
-## Writing a Service
-
-Let's walk through an example to understand how you write a dynamo service.
-
-```python
-import ServiceB
-
-@service(dynamo={"namespace": "dynamo"}, resources={"gpu": 1})
-class ServiceA:
-    # Define service dependencies
-    service_b = depends(ServiceB)
-
-    def __init__(self, model_name="meta-llama/Llama-3.1-8B-Instruct"):
-        self.model_name = model_name
-        self.engine = None
-
-    @async_on_start
-    async def async_init(self):
-        # Initialize resources that require async operations
-        self.engine = await initialize_model_engine(self.model_name)
-        print(f"ServiceA initialized with model: {self.model_name}")
-
-    @async_on_shutdown
-    async def async_shutdown(self):
-        # Clean up resources
-        if self.engine:
-            await self.engine.shutdown()
-            print("ServiceA engine shut down")
-
-    @endpoint()
-    async def generate(self, request: ChatCompletionRequest):
-        # Call dependent service
-        processed_request = await self.service_b.preprocess(request)
-
-        # Use the engine to generate a response
-        response = await self.engine.generate(processed_request)
-        return response
-```
-
-### Class-Based Architecture
-Dynamo follows a class-based architecture similar to BentoML making it intuitive for users familiar with those frameworks. Each service is defined as a Python class, with the following components:
-1. Class attributes for dependencies using `depends()`
-2. An `__init__` method for standard initialization
-3. Optional lifecycle hooks like `@async_on_start` and `@async_on_shutdown`
-4. Endpoints defined with `@endpoint()`. Optionally, an endpoint can be given a name
-   via `@endpoint("my_endpoint_name")`, but otherwise will default to the name of the
-   function being decorated if omitted.
-
-This approach provides a clean separation of concerns and makes the service structure easy to understand.
-
-### Service Dependencies with `depends()`
-The `depends()` function is a powerful BentoML feature that lets you create a dependency between services. When you use `depends(ServiceB)`, several things happen:
-1. It ensures that `ServiceB` is deployed when `ServiceA` is deployed by adding it to an internal service dependency graph
-2. It creates a client to the endpoints of `ServiceB` that is being served under the hood.
-3. You are able to access `ServiceB` endpoints as if it were a local function!
-
-```python
-# What happens internally when you use depends(ServiceB)
-service_b = await runtime.namespace("dynamo").component("ServiceB").endpoint("preprocess").client()
-
-# But with Dynamo SDK, you simply write:
-service_b = depends(ServiceB)
-
-# And then call methods directly:
-result = await service_b.preprocess(data)
-```
-
-**NOTE** - through the SDK, we also provide you with a way to access the underlying bindings if you need. Sometimes you might want to write complicated logic that causes you to directly create a client to another Service without depending on it. You can do this via:
-
-```python
-import VllmWorker
-
-runtime = dynamo_context["runtime"]
-comp_ns, comp_name = VllmWorker.dynamo_address() # dynamo://{namespace}/{name}
-print(f"[Processor] comp_ns: {comp_ns}, comp_name: {comp_name}")
-self.worker_client = (
-    await runtime.namespace(comp_ns)
-    .component(comp_name)
-    .endpoint("generate")
-    .client()
-)
-```
-
-This is used in some of our prebuilt examples and is a powerful way to leverage the benefits of the SDK while being able to access Dynamo's core primitives.
-
-You can find more docs on depends [here](https://docs.bentoml.com/en/latest/build-with-bentoml/distributed-services.html#interservice-communication)
-
-### Lifecycle Hooks
-Dynamo supports key lifecycle hooks to manage service initialization and cleanup. We currently only support a subset of BentoML's lifecycle hooks but are working on adding support for the rest.
-
-#### `@async_on_start`
-
-The `@async_on_start` hook is called when the service is started and is used to run an async process outside of the main `__init__` function.
-
-```python
-@async_on_start
-async def async_init(self):
-    # Perfect for operations that need to be awaited
-    self.db = await connect_to_db()
-    self.tokenizer = await load_tokenizer()
-    self.engine = await initialize_engine(self.model)
-```
-This is especially useful for:
- Initializing external connections
- Setting up runtime resources that require async operations
-
-#### `@async_on_shutdown`
-The `@async_on_shutdown` hook is called when the service is shutdown handles cleanup.
-
-```python
-@async_on_shutdown
-async def async_shutdown(self):
-    if self._engine_context is not None:
-        await self._engine_context.__aexit__(None, None, None)
-    print("VllmWorkerRouterLess shutting down")
-```
-
-This ensures resources are properly released, preventing memory leaks and making sure external connections are properly closed. This is helpful to clean up vLLM engines that have been started outside of the main process.
-
-## Configuring a Service
-
-Dynamo SDK provides a flexible configuration system that allows you to define service parameters through multiple methods:
-
-1. Directly in the `@service` decorator
-2. Through YAML configuration files
-3. Via command-line arguments
-4. Using environment variables
-
-These methods can be used together with clear precedence rules, giving you fine-grained control over service configuration across different environments.
-
-### Configuration via Service Decorator
-
-The most basic method is to specify parameters directly in the service decorator:
-
-```python
-@service(
-    dynamo={"namespace": "prod"},
-    resources={"gpu": 2, "cpu": "4", "memory": "16Gi"},
-    workers=2,
-)
-class MyService:
-    def __init__(self, model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", temperature=0.7):
-        self.model_name = model_name
-        self.temperature = temperature
-```
-
-This defines static configuration values in code. Note that the constructor parameters (`model_name` and `temperature`) are also configurable values that can be overridden.
-
-### Configuration via YAML
-
-For more flexible configuration, especially across environments, you can use YAML files:
-
-```yaml
-# config.yaml
-MyService:
-  # Override service decorator settings
-  ServiceArgs:
-    workers: 4
-    resources:
-      gpu: 4
-
-  # Service instance parameters
-  model_name: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
-  temperature: 0.8
-```
-
-The YAML file has a hierarchical structure:
- Top level keys are service class names
- `ServiceArgs` contains parameters for the service decorator
- Other keys are passed as arguments to the service constructor
- Additional keys specific to the service can be accessed via the config system
-
-### Loading YAML Configuration
-
-Use the CLI to load configuration from a YAML file:
-
-```bash
-dynamo serve service:MyService -f config.yaml
-```
-
-The configuration is parsed and stored in the `DYNAMO_SERVICE_CONFIG` environment variable, which is then passed to the service workers.
-
-### Configuration Precedence
-
-When multiple configuration sources are used, they follow this precedence order (highest to lowest):
-
-1. Command-line arguments
-2. YAML configuration
-3. Service decorator defaults
-4. Constructor defaults
-
-### Accessing Configuration in Services
-
-Inside a service, you can access configuration using the `ServiceConfig` class:
-
-```python
-from dynamo.sdk.lib.config import ServiceConfig
-
-class MyService:
-    def __init__(self):
-        config = ServiceConfig.get_instance()
-
-        # Get with default value
-        self.model_name = config.get("MyService", {}).get("model_name", "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
-        self.temperature = config.get("MyService", {}).get("temperature", 0.7)
-
-        # Require a config value (raises error if missing)
-        self.api_key = config.require("MyService", "api_key")
-
-        # Get all config for this service
-        all_my_config = config.get("MyService", {})
-```
-
-### Parsing Configuration as CLI Arguments
-
-For services that need to extract their configuration as command-line arguments (common when integrating and validating with external libraries), the SDK provides a helper method:
-
-```python
-from dynamo.sdk.lib.config import ServiceConfig
-
-def setup_my_lib():
-    config = ServiceConfig.get_instance()
-
-    # Get all MyService config with prefix "lib_" as CLI args
-    cli_args = config.as_args("MyService", prefix="lib_")
-    # Returns: ["--option1", "value1", "--flag2", "--option3", "value3"]
-
-    # Pass to an external library's argument parser
-    lib_parser = MyLibArgumentParser()
-    lib_args = lib_parser.parse_args(cli_args)
-    return lib_args
-```
-
-This pattern is used in the example vLLM integration:
-
-```python
-def parse_vllm_args(service_name, prefix) -> AsyncEngineArgs:
-    config = ServiceConfig.get_instance()
-    vllm_args = config.as_args(service_name, prefix=prefix)
-    parser = FlexibleArgumentParser()
-
-    # Add custom arguments
-    parser.add_argument("--router", type=str, choices=["random", "round-robin", "kv"], default="random")
-    parser.add_argument("--remote-prefill", action="store_true")
-
-    # Add VLLM's arguments
-    parser = AsyncEngineArgs.add_cli_args(parser)
-
-    # Parse both custom and VLLM arguments
-    args = parser.parse_args(vllm_args)
-
-    # Convert to engine arguments
-    engine_args = AsyncEngineArgs.from_cli_args(args)
-
-    # Add custom args to the engine args
-    engine_args.router = args.router
-    engine_args.remote_prefill = args.remote_prefill
-
-    return engine_args
-```
-
-### Overriding Service Decorator with ServiceArgs
-
-The `ServiceArgs` section in YAML configuration allows you to override any parameter in the `@service` decorator:
-
-```yaml
-MyService:
-  ServiceArgs:
-    dynamo:
-      namespace: "staging"  # Override namespace
-    resources:
-      gpu: 4  # Use more GPUs
-    workers: 8  # Scale up workers
-```
-
-This is particularly useful for:
- Changing resource allocations between environments
- Modifying worker counts based on expected load
- Switching between namespaces for different deployments
-
-Under the hood, the `DynamoService` class reads these arguments during initialization:
-
-```python
-def _get_service_args(self, service_name: str) -> Optional[dict]:
-    """Get ServiceArgs from environment config if specified"""
-    config_str = os.environ.get("DYNAMO_SERVICE_CONFIG")
-    if config_str:
-        config = json.loads(config_str)
-        service_config = config.get(service_name, {})
-        return service_config.get("ServiceArgs")
-    return None
-```
-### Complete Configuration Example
-
-Here's a comprehensive example showing how all these pieces fit together:
-
-1. First, define your service with basic defaults:
-
-```python
-@service(
-    dynamo={"namespace": "default"},
-    resources={"gpu": 1},
-    workers=1,
-)
-class LLMService:
-    def __init__(self, model_name="gpt-2", temperature=0.7, max_tokens=1024):
-        self.model_name = model_name
-        self.temperature = temperature
-        self.max_tokens = max_tokens
-
-        # Get additional configuration
-        config = ServiceConfig.get_instance()
-        service_config = config.get("LLMService", {})
-
-        # Extract service-specific parameters
-        self.cache_size = service_config.get("cache_size", 1000)
-        self.use_kv_cache = service_config.get("use_kv_cache", True)
-```
-
-2. Create a YAML configuration for production:
-
-```yaml
-# prod_config.yaml
-LLMService:
-  ServiceArgs:
-    dynamo:
-      namespace: "prod"
-    resources:
-      gpu: 4
-      memory: "64Gi"
-    workers: 8
-
-  # Constructor parameters
-  model_name: "llama-3-70b-instruct"
-  temperature: 0.8
-  max_tokens: 2048
-
-  # Service-specific parameters
-  cache_size: 10000
-  use_kv_cache: true
-```
-
-3. Deploy with mixed configuration:
-
-```bash
-dynamo serve service:LLMService -f prod_config.yaml --LLMService.temperature=0.9
-```
-
-The service will receive the combined configuration with the command-line value taking precedence, resulting in effective configuration of:
- `dynamo.namespace = "prod"`
- `resources.gpu = 4`
- `workers = 8`
- `model_name = "llama-3-70b-instruct"`
- `temperature = 0.9` (from CLI override)
- `max_tokens = 2048`
- `cache_size = 10000`
- `use_kv_cache = true`
-
-### Service Configuration Best Practices
-
-1. **Use the Service Decorator for Defaults**: Put reasonable defaults in the service decorator
-2. **Use Constructor Parameters for Runtime Options**: Parameters that might change between deployments
-3. **Use YAML for Environment Configuration**: Separate configuration by environment (dev/staging/prod)
-4. **Use CLI for Quick Testing**: Override specific values for experimentation
-5. **Document Configuration Keys**: Make sure to document all available configuration options
-
-Following these practices will help you create flexible and maintainable Dynamo services that can be easily configured for different environments and use cases.
-
-### Composing Services into an Graph
-There are two main ways to compose services in Dynamo:
-1. Use `depends()` (Recommended)
-The depends() approach is the recommended way for production deployments:
- Automatically deploys all dependencies
- Creates a static inference graph at deployment time
- Provides type hints and better IDE support
-
-2. Use `.link()` (Experimental)
-Our `.link()` syntax is an flexible and experimental way to compose various services. Linking allows you to compose checks at runtime and view behavior. Under the hood - we are editing the dependency graph between various services. This is useful for experimentation and development but we suggest writing a static graph for your final production deployment.
-
-### Understanding the `.link()` syntax
-Lets take the example of a `Processor` component. This component can currently do 2 things:
-1. Process a request and send it to a `Router` to decide what worker to send it to.
-2. Process a request and send it to a `Worker` directly.
-
-A snippet of the Processor is shown below:
-
-```python
-class Processor(ProcessMixIn):
-    """
-    vLLM pre and post processing
-    """
-
-    worker = depends(VllmWorker)
-    router = depends(Router)
-
-    # logic for processing a request based on router or worker
-```
-
-You can think of all the depends statements as the maximal set of edges for the processor. At runtime, you may want to follow only a single path. By default, our processor will spin up both the VllmWorker and Router as separate services (because `depends()` is defined for both). However, if you want to only spin up the Router, you can do this by linking the Router to the Processor which will remove the `worker` dependency from the Processor.
-
-```python
-Processor.link(Router)
-```
-
-This will remove the `worker` dependency from the Processor and only spin up the Router.
\ No newline at end of file
--- a/deploy/sdk/docs/sdk/README.md
+++ b/deploy/sdk/docs/sdk/README.md
+../../../../docs/API/sdk.md
\ No newline at end of file
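
The configuration precedence described in the relocated SDK guide (command-line arguments over YAML, over service decorator defaults, over constructor defaults) can be illustrated with a layered lookup. This is a sketch under stated assumptions, not the SDK's implementation: the values are copied from the guide's "Complete Configuration Example", and the flattening of decorator and constructor settings into a single key space is a simplification made here for illustration.

```python
from collections import ChainMap

# Lowest precedence: defaults from the constructor signature.
constructor_defaults = {"model_name": "gpt-2", "temperature": 0.7, "max_tokens": 1024}

# Defaults given in the @service decorator (flattened for illustration).
decorator_defaults = {"namespace": "default", "gpu": 1, "workers": 1}

# Values from prod_config.yaml (ServiceArgs and constructor keys, flattened).
yaml_config = {
    "namespace": "prod",
    "gpu": 4,
    "workers": 8,
    "model_name": "llama-3-70b-instruct",
    "temperature": 0.8,
    "max_tokens": 2048,
    "cache_size": 10000,
    "use_kv_cache": True,
}

# Highest precedence: --LLMService.temperature=0.9 on the command line.
cli_overrides = {"temperature": 0.9}

# ChainMap searches its maps left to right, so earlier maps win.
effective = dict(
    ChainMap(cli_overrides, yaml_config, decorator_defaults, constructor_defaults)
)

for key in sorted(effective):
    print(f"{key} = {effective[key]!r}")
```

The resulting `effective` mapping matches the guide's list of effective values, with `temperature` taken from the command line and every other key taken from the YAML file.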
--- a/docs/API/python_bindings.md
+++ b/docs/API/python_bindings.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+https://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Dynamo Python Bindings
+
+Python bindings for the Dynamo runtime system, enabling distributed computing capabilities for machine learning workloads.
+
+## 🚀 Quick Start
+
+1. Install `uv`: https://docs.astral.sh/uv/#getting-started
+```
+curl -LsSf https://astral.sh/uv/install.sh | sh
+```
+
+2. Install `protoc` protobuf compiler: https://grpc.io/docs/protoc-installation/.
+
+For example on an Ubuntu/Debian system:
+```
+apt install protobuf-compiler
+```
+
+3. Set up a virtualenv
+
+```
+uv venv
+source .venv/bin/activate
+uv pip install maturin
+```
+
+4. Build and install dynamo wheel
+```
+maturin develop --uv
+```
+
+## Run Examples
+
+### Prerequisite
+
+See [README.md](../runtime/README.md#prerequisites).
+
+### Hello World Example
+
+1. Start 3 separate shells, and activate the virtual environment in each
+```
+source .venv/bin/activate
+```
+
+2. In one shell (shell 1), run example server instance-1:
+```
+python3 ./examples/hello_world/server.py
+```
+
+3. (Optional) In another shell (shell 2), run example server instance-2:
+```
+python3 ./examples/hello_world/server.py
+```
+
+4. In the last shell (shell 3), run the example client:
+```
+python3 ./examples/hello_world/client.py
+```
+
+If you run the example client in rapid succession, and you started more than
+one server instance above, you should see the requests from the client being
+distributed across the server instances in each server's output. If only one
+server instance is started, you should see the requests go to that server
+each time.
+
+## Performance
+
+The performance impact of synchronizing the Python and Rust async runtimes
+is a critical consideration when optimizing a highly concurrent and parallel
+distributed system.
+
+The Python GIL is a global critical section and is ultimately the death of
+parallelism. To compound that, when Rust async futures become ready,
+accessing the GIL on the async event loop threads needs to be considered
+carefully. Under high load, accessing the GIL or performing CPU-intensive tasks
+on the event loop threads can starve other async tasks of CPU resources.
+However, `tokio::task::spawn_blocking` is not without overhead
+either.
+
+If you bounce many small messages back and forth between the Python and Rust
+event loops, and Rust requires GIL access each time, this is a pattern where
+moving the code from Python to Rust will give you significant gains.
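
The starvation effect described in the Performance section has a close analogue in pure Python `asyncio`, which may help build intuition. The sketch below is an analogy only and is not Dynamo's binding code: `heartbeat`, `blocking_work`, and `measure` are hypothetical names, and `time.sleep` stands in for any call that occupies the event loop thread. Moving the call to a worker thread with `asyncio.to_thread` plays a role loosely comparable to `tokio::task::spawn_blocking`. Note that real CPU-bound Python code in a worker thread would still contend for the GIL, which `time.sleep` does not.

```python
import asyncio
import time


async def heartbeat(counter: list, interval: float) -> None:
    """Count how often the event loop manages to wake this task."""
    while True:
        await asyncio.sleep(interval)
        counter[0] += 1


def blocking_work(duration: float) -> None:
    """Stand-in for work that occupies the calling thread."""
    time.sleep(duration)


async def measure(offload: bool, duration: float = 0.3, interval: float = 0.02) -> int:
    counter = [0]
    task = asyncio.create_task(heartbeat(counter, interval))
    await asyncio.sleep(0)  # let the heartbeat task start
    if offload:
        # The event loop stays free while a worker thread does the work.
        await asyncio.to_thread(blocking_work, duration)
    else:
        # The event loop thread itself is blocked; no other task can run.
        blocking_work(duration)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return counter[0]


ticks_inline = asyncio.run(measure(offload=False))
ticks_offloaded = asyncio.run(measure(offload=True))
print(f"heartbeat ticks while blocking inline: {ticks_inline}")
print(f"heartbeat ticks while offloaded:       {ticks_offloaded}")
```

The heartbeat makes no progress while the event loop thread is blocked, and ticks normally once the work is moved off it. The thread handoff has its own cost, which is the trade-off the section points out for `spawn_blocking`.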
--- a/docs/API/sdk.md
+++ b/docs/API/sdk.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Dynamo SDK
+
+# Table of Contents
+
+- [Introduction](#introduction)
+- [Installation](#installation)
+- [Core Concepts](#core-concepts)
+- [Writing a Service](#writing-a-service)
+- [Configuring a Service](#configuring-a-service)
+- [Composing Services into an Graph](#composing-services-into-an-graph)
+
+## Introduction
+
+Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See [Python Bindings](./python_bindings.md).
+
+Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns and leverages many of its core primitives. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example and instructions for deploying it [here](../examples/hello_world.md).
+
+## Installation
+
+The SDK can be installed using pip:
+
+```bash
+pip install ai-dynamo
+```
+
+## Core Concepts
+As you read about each concept, it is helpful to have the [basic example](../examples/hello_world.md) up as well so you can refer back to it.
+
+### Defining a Service
+
+A Service is a core building block for a project. You can think of it as a logical unit of work. For example, you might have a service responsible for preprocessing and tokenizing and another service running the model worker itself. You define a service using the `@service` decorator on a class.
+
+```python
+@service(
+    dynamo={
+        "namespace": "dynamo",
+    },
+    resources={"gpu": 2, "cpu": "10", "memory": "20Gi"},
+    workers=1,
+)
+```
+
+Key configuration options:
+1. `dynamo`: Dictionary that defines the Dynamo configuration and enables/disables it. When enabled, a Dynamo worker is created under the hood, which can register with the [Distributed Runtime](../architecture/architecture.md).
+2. `resources`: Dictionary defining resource requirements. The `gpu` field is used for both local and remote deployment. The other fields are used to determine resources when deploying to K8s.
+3. `workers`: Number of parallel instances of the service to spin up.
+
+### Writing a Service
+
+Let's walk through an example to understand how you write a dynamo service.
+
+```python
+import ServiceB
+
+@service(dynamo={"namespace": "dynamo"}, resources={"gpu": 1})
+class ServiceA:
+    # Define service dependencies
+    service_b = depends(ServiceB)
+
+    def __init__(self, model_name="meta-llama/Llama-3.1-8B-Instruct"):
+        self.model_name = model_name
+        self.engine = None
+
+    @async_on_start
+    async def async_init(self):
+        # Initialize resources that require async operations
+        self.engine = await initialize_model_engine(self.model_name)
+        print(f"ServiceA initialized with model: {self.model_name}")
+
+    @async_on_shutdown
+    async def async_shutdown(self):
+        # Clean up resources
+        if self.engine:
+            await self.engine.shutdown()
+            print("ServiceA engine shut down")
+
+    @endpoint()
+    async def generate(self, request: ChatCompletionRequest):
+        # Call dependent service
+        processed_request = await self.service_b.preprocess(request)
+
+        # Use the engine to generate a response
+        response = await self.engine.generate(processed_request)
+        return response
+```
+
+#### Class-Based Architecture
+Dynamo follows a class-based architecture similar to BentoML, making it intuitive for users familiar with that framework. Each service is defined as a Python class with the following components:
+1. Class attributes for dependencies using `depends()`
+2. An `__init__` method for standard initialization
+3. Optional lifecycle hooks like `@async_on_start` and `@async_on_shutdown`
+4. Endpoints defined with `@endpoint()`. Optionally, an endpoint can be given a name
+   via `@endpoint("my_endpoint_name")`, but otherwise defaults to the name of the
+   function being decorated if omitted.
+
+This approach provides a clean separation of concerns and makes the service structure easy to understand.
+
+#### Service Dependencies with `depends()`
+The `depends()` function is a powerful BentoML feature that lets you create a dependency between services. When you use `depends(ServiceB)`, several things happen:
+1. It ensures that `ServiceB` is deployed when `ServiceA` is deployed by adding it to an internal service dependency graph
+2. Under the hood, it creates a client for the endpoints that `ServiceB` serves.
+3. You can call `ServiceB` endpoints as if they were local functions.
+
+```python
+# What happens internally when you use depends(ServiceB)
+service_b = await runtime.namespace("dynamo").component("ServiceB").endpoint("preprocess").client()
+
+# But with Dynamo SDK, you simply write:
+service_b = depends(ServiceB)
+
+# And then call methods directly:
+result = await service_b.preprocess(data)
+```
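
The first point above, building a dependency graph from `depends()` attributes, can be pictured with a small framework-free sketch. The `Depends` class and `collect_services` helper below are hypothetical stand-ins written for illustration; they are not the SDK's or BentoML's implementation.

```python
class Depends:
    """Hypothetical marker recording that one service needs another."""

    def __init__(self, target):
        self.target = target


def depends(target):
    # Illustrative stand-in for the SDK's depends(); it only records the target.
    return Depends(target)


def collect_services(root):
    """Return root plus every service reachable through Depends attributes."""
    ordered, seen = [], set()

    def visit(cls):
        if cls in seen:
            return
        seen.add(cls)
        for value in vars(cls).values():
            if isinstance(value, Depends):
                visit(value.target)
        # A service is appended only after all of its dependencies.
        ordered.append(cls)

    visit(root)
    return ordered


class ServiceC:
    pass


class ServiceB:
    service_c = depends(ServiceC)


class ServiceA:
    service_b = depends(ServiceB)


deploy_order = [cls.__name__ for cls in collect_services(ServiceA)]
print(deploy_order)
```

Walking class attributes like this is one simple way to make sure that deploying `ServiceA` also brings up everything it transitively needs.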
+
+```{note}
+Through the SDK, you can also access the underlying bindings if you need them. Sometimes you might want to write more complicated logic that creates a client to another Service directly, without depending on it. To do so:
+```
+
+```python
+import VllmWorker
+
+# this runtime object gives you access to the underlying python bindings
+runtime = dynamo_context["runtime"]
+comp_ns, comp_name = VllmWorker.dynamo_address() # dynamo://{namespace}/{name}
+print(f"[Processor] comp_ns: {comp_ns}, comp_name: {comp_name}")
+self.worker_client = (
+    await runtime.namespace(comp_ns)
+    .component(comp_name)
+    .endpoint("generate")
+    .client()
+)
+```
+
+This is used in some of our prebuilt examples and is a powerful way to leverage the benefits of the SDK while being able to access Dynamo's core primitives.
+
+You can find more documentation on `depends()` [here](https://docs.bentoml.com/en/latest/build-with-bentoml/distributed-services.html#interservice-communication).
+
+#### Lifecycle Hooks
+Dynamo supports key lifecycle hooks to manage service initialization and cleanup. We currently only support a subset of BentoML's lifecycle hooks but are working on adding support for the rest.
+
+##### `@async_on_start`
+
+The `@async_on_start` hook is called when the service is started and is used to run an async process outside of the main `__init__` function.
+
+```python
+@async_on_start
+async def async_init(self):
+    # Perfect for operations that need to be awaited
+    self.db = await connect_to_db()
+    self.tokenizer = await load_tokenizer()
+    self.engine = await initialize_engine(self.model)
+```
+This is especially useful for:
+- Initializing external connections
+- Setting up runtime resources that require async operations
+
+##### `@async_on_shutdown`
+The `@async_on_shutdown` hook is called when the service is shut down and handles cleanup.
+
+```python
+@async_on_shutdown
+async def async_shutdown(self):
+    if self._engine_context is not None:
+        await self._engine_context.__aexit__(None, None, None)
+    print("VllmWorkerRouterLess shutting down")
+```
+
+This ensures resources are properly released, preventing memory leaks and making sure external connections are properly closed. This is helpful to clean up vLLM engines that have been started outside of the main process.
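
To see how the two hooks relate to `__init__`, here is a minimal sketch of one way a runner could discover and await decorated methods. The decorators and `run_hooks` below are simplified, hypothetical re-implementations for illustration only; the real SDK decorators may work differently.

```python
import asyncio


def async_on_start(func):
    # Hypothetical: tag the coroutine function so a runner can find it later.
    func._hook = "start"
    return func


def async_on_shutdown(func):
    func._hook = "shutdown"
    return func


async def run_hooks(instance, kind):
    """Await every method of the instance's class tagged with the given kind."""
    for name, attr in vars(type(instance)).items():
        if getattr(attr, "_hook", None) == kind:
            await getattr(instance, name)()


class Demo:
    def __init__(self):
        self.events = ["init"]

    @async_on_start
    async def async_init(self):
        self.events.append("start")

    @async_on_shutdown
    async def async_shutdown(self):
        self.events.append("shutdown")


async def lifecycle():
    svc = Demo()                      # synchronous __init__ runs first
    await run_hooks(svc, "start")     # then the async start hook
    await run_hooks(svc, "shutdown")  # and finally the async shutdown hook
    return svc.events


events = asyncio.run(lifecycle())
print(events)
```

The point of the split is that `__init__` cannot `await`, so anything that needs the event loop moves into the start hook and is undone in the shutdown hook.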
+
+### Configuring a Service
+
+Dynamo SDK provides a flexible configuration system that allows you to define service parameters through multiple methods:
+
+1. Directly in the `@service` decorator
+2. Through YAML configuration files
+3. Via command-line arguments
+4. Using environment variables
+
+These methods can be used together with clear precedence rules, giving you fine-grained control over service configuration across different environments.
+
+#### Configuration via Service Decorator
+
+The most basic method is to specify parameters directly in the service decorator:
+
+```python
+@service(
+    dynamo={"namespace": "prod"},
+    resources={"gpu": 2, "cpu": "4", "memory": "16Gi"},
+    workers=2,
+)
+class MyService:
+    def __init__(self, model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", temperature=0.7):
+        self.model_name = model_name
+        self.temperature = temperature
+```
+
+This defines static configuration values in code. Note that the constructor parameters (`model_name` and `temperature`) are also configurable values that can be overridden.
+
+#### Configuration via YAML
+
+For more flexible configuration, especially across environments, you can use YAML files:
+
+```yaml
+# config.yaml
+MyService:
+  # Override service decorator settings
+  ServiceArgs:
+    workers: 4
+    resources:
+      gpu: 4
+
+  # Service instance parameters
+  model_name: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
+  temperature: 0.8
+```
+
+The YAML file has a hierarchical structure:
+- Top level keys are service class names
+- `ServiceArgs` contains parameters for the service decorator
+- Other keys are passed as arguments to the service constructor
+- Additional keys specific to the service can be accessed via the config system
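
The three roles a key can play in a service's YAML section can be sketched as follows. The `split_section` helper is hypothetical and assumes that constructor arguments are matched by parameter name; it is not taken from the SDK.

```python
import inspect


def split_section(service_cls, section):
    """Split one service's section into decorator args, constructor kwargs, and extras."""
    remaining = dict(section)
    service_args = remaining.pop("ServiceArgs", {})
    params = inspect.signature(service_cls.__init__).parameters
    ctor_kwargs = {k: v for k, v in remaining.items() if k in params}
    extra = {k: v for k, v in remaining.items() if k not in params}
    return service_args, ctor_kwargs, extra


class MyService:
    def __init__(self, model_name="default-model", temperature=0.7):
        self.model_name = model_name
        self.temperature = temperature


# The section as it might look after the YAML file has been parsed.
section = {
    "ServiceArgs": {"workers": 4, "resources": {"gpu": 4}},
    "model_name": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "temperature": 0.8,
    "cache_size": 1000,  # not a constructor parameter
}

service_args, ctor_kwargs, extra = split_section(MyService, section)
service = MyService(**ctor_kwargs)
print(service_args, ctor_kwargs, extra)
```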
+
+#### Loading YAML Configuration
+
+Use the CLI to load configuration from a YAML file:
+
+```bash
+dynamo serve service:MyService -f config.yaml
+```
+
+The configuration is parsed and stored in the `DYNAMO_SERVICE_CONFIG` environment variable, which is then passed to the service workers.
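
The hand-off through the environment variable can be sketched with the standard library alone. JSON encoding is assumed here because the `_get_service_args` snippet later in this document decodes the variable with `json.loads`; the dictionary contents are illustrative.

```python
import json
import os

# Values a YAML file might contain after parsing (illustrative only).
parsed_yaml = {
    "MyService": {
        "ServiceArgs": {"workers": 4, "resources": {"gpu": 4}},
        "model_name": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "temperature": 0.8,
    }
}

# Parent process: serialize the whole configuration into one variable.
os.environ["DYNAMO_SERVICE_CONFIG"] = json.dumps(parsed_yaml)

# Worker process: decode it and pick out this service's section.
config = json.loads(os.environ["DYNAMO_SERVICE_CONFIG"])
my_service = config.get("MyService", {})
service_args = my_service.get("ServiceArgs", {})

print(service_args["workers"], my_service["temperature"])
```

Because child processes inherit the environment, a single variable is enough to carry the configuration to every worker.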
+
+#### Configuration Precedence
+
+When multiple configuration sources are used, they follow this precedence order (highest to lowest):
+
+1. Command-line arguments
+2. YAML configuration
+3. Service decorator defaults
+4. Constructor defaults
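
As a rough illustration of this ordering, the sketch below layers four dictionaries from lowest to highest precedence. The `deep_merge` and `resolve` helpers are hypothetical, and merging nested dictionaries key by key (rather than replacing them wholesale) is an assumption, not documented SDK behavior.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a copy of base with override applied; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve(*sources_low_to_high):
    """Apply sources in order so that later ones win."""
    result = {}
    for source in sources_low_to_high:
        result = deep_merge(result, source)
    return result


constructor_defaults = {"temperature": 0.7, "model_name": "model-a"}
decorator_defaults = {"ServiceArgs": {"workers": 2, "resources": {"gpu": 2, "cpu": "4"}}}
yaml_config = {"ServiceArgs": {"workers": 4, "resources": {"gpu": 4}}, "temperature": 0.8}
cli_args = {"temperature": 0.9}

effective = resolve(constructor_defaults, decorator_defaults, yaml_config, cli_args)
print(effective)
```

In this example the YAML value for `workers` replaces the decorator default, while the `cpu` setting that YAML does not mention survives from the decorator.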
+
+#### Accessing Configuration in Services
+
+Inside a service, you can access configuration using the `ServiceConfig` class:
+
+```python
+from dynamo.sdk.lib.config import ServiceConfig
+
+class MyService:
+    def __init__(self):
+        config = ServiceConfig.get_instance()
+
+        # Get with default value
+        self.model_name = config.get("MyService", {}).get("model_name", "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
+        self.temperature = config.get("MyService", {}).get("temperature", 0.7)
+
+        # Require a config value (raises error if missing)
+        self.api_key = config.require("MyService", "api_key")
+
+        # Get all config for this service
+        all_my_config = config.get("MyService", {})
+```
+
+#### Parsing Configuration as CLI Arguments
+
+For services that need to extract their configuration as command-line arguments (common when integrating and validating with external libraries), the SDK provides a helper method:
+
+```python
+from dynamo.sdk.lib.config import ServiceConfig
+
+def setup_my_lib():
+    config = ServiceConfig.get_instance()
+
+    # Get all MyService config with prefix "lib_" as CLI args
+    cli_args = config.as_args("MyService", prefix="lib_")
+    # Returns: ["--option1", "value1", "--flag2", "--option3", "value3"]
+
+    # Pass to an external library's argument parser
+    lib_parser = MyLibArgumentParser()
+    lib_args = lib_parser.parse_args(cli_args)
+    return lib_args
+```
+
+This pattern is used in the example vLLM integration:
+
+```python
+def parse_vllm_args(service_name, prefix) -> AsyncEngineArgs:
+    config = ServiceConfig.get_instance()
+    vllm_args = config.as_args(service_name, prefix=prefix)
+    parser = FlexibleArgumentParser()
+
+    # Add custom arguments
+    parser.add_argument("--router", type=str, choices=["random", "round-robin", "kv"], default="random")
+    parser.add_argument("--remote-prefill", action="store_true")
+
+    # Add VLLM's arguments
+    parser = AsyncEngineArgs.add_cli_args(parser)
+
+    # Parse both custom and VLLM arguments
+    args = parser.parse_args(vllm_args)
+
+    # Convert to engine arguments
+    engine_args = AsyncEngineArgs.from_cli_args(args)
+
+    # Add custom args to the engine args
+    engine_args.router = args.router
+    engine_args.remote_prefill = args.remote_prefill
+
+    return engine_args
+```
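
The flattening step can be pictured with a stand-alone sketch. The `to_cli_args` function below is hypothetical: it assumes that only keys starting with the prefix are selected, that the prefix is stripped, that underscores become dashes, and that a `True` value becomes a bare flag. The actual rules used by `as_args` may differ.

```python
import argparse


def to_cli_args(section, prefix=""):
    """Flatten a config section into argparse-style tokens (hypothetical rules)."""
    args = []
    for key, value in section.items():
        if key == "ServiceArgs" or not key.startswith(prefix):
            continue
        flag = "--" + key[len(prefix):].replace("_", "-")
        if value is True:
            args.append(flag)  # boolean true becomes a bare flag
        elif value is False or value is None:
            continue           # false and missing values are omitted
        else:
            args.extend([flag, str(value)])
    return args


section = {
    "ServiceArgs": {"workers": 1},
    "lib_option1": "value1",
    "lib_flag2": True,
    "lib_option3": 3,
    "temperature": 0.7,  # no prefix, so it is skipped
}

cli_args = to_cli_args(section, prefix="lib_")

# Any argparse-compatible parser can then validate the tokens.
parser = argparse.ArgumentParser()
parser.add_argument("--option1")
parser.add_argument("--flag2", action="store_true")
parser.add_argument("--option3", type=int)
parsed = parser.parse_args(cli_args)
print(cli_args, parsed)
```

Handing the tokens to the library's own parser means the library, not your service, stays responsible for validating types and choices.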
+
+#### Overriding Service Decorator with ServiceArgs
+
+The `ServiceArgs` section in YAML configuration allows you to override any parameter in the `@service` decorator:
+
+```yaml
+MyService:
+  ServiceArgs:
+    dynamo:
+      namespace: "staging"  # Override namespace
+    resources:
+      gpu: 4  # Use more GPUs
+    workers: 8  # Scale up workers
+```
+
+This is particularly useful for:
+- Changing resource allocations between environments
+- Modifying worker counts based on expected load
+- Switching between namespaces for different deployments
+
+Under the hood, the `DynamoService` class reads these arguments during initialization:
+
+```python
+def _get_service_args(self, service_name: str) -> Optional[dict]:
+    """Get ServiceArgs from environment config if specified"""
+    config_str = os.environ.get("DYNAMO_SERVICE_CONFIG")
+    if config_str:
+        config = json.loads(config_str)
+        service_config = config.get(service_name, {})
+        return service_config.get("ServiceArgs")
+    return None
+```
+#### Complete Configuration Example
+
+Here's a comprehensive example showing how all these pieces fit together:
+
+1. First, define your service with basic defaults:
+
+```python
+@service(
+    dynamo={"namespace": "default"},
+    resources={"gpu": 1},
+    workers=1,
+)
+class LLMService:
+    def __init__(self, model_name="gpt-2", temperature=0.7, max_tokens=1024):
+        self.model_name = model_name
+        self.temperature = temperature
+        self.max_tokens = max_tokens
+
+        # Get additional configuration
+        config = ServiceConfig.get_instance()
+        service_config = config.get("LLMService", {})
+
+        # Extract service-specific parameters
+        self.cache_size = service_config.get("cache_size", 1000)
+        self.use_kv_cache = service_config.get("use_kv_cache", True)
+```
+
+2. Create a YAML configuration for production:
+
+```yaml
+# prod_config.yaml
+LLMService:
+  ServiceArgs:
+    dynamo:
+      namespace: "prod"
+    resources:
+      gpu: 4
+      memory: "64Gi"
+    workers: 8
+
+  # Constructor parameters
+  model_name: "llama-3-70b-instruct"
+  temperature: 0.8
+  max_tokens: 2048
+
+  # Service-specific parameters
+  cache_size: 10000
+  use_kv_cache: true
+```
+
+3. Deploy with mixed configuration:
+
+```bash
+dynamo serve service:LLMService -f prod_config.yaml --LLMService.temperature=0.9
+```
+
+The service receives the combined configuration, with the command-line value taking precedence. The resulting effective configuration is:
+- `dynamo.namespace = "prod"`
+- `resources.gpu = 4`
+- `workers = 8`
+- `model_name = "llama-3-70b-instruct"`
+- `temperature = 0.9` (from CLI override)
+- `max_tokens = 2048`
+- `cache_size = 10000`
+- `use_kv_cache = true`
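
How a `--LLMService.temperature=0.9` style flag could be folded into the file-based values is sketched below. The `parse_overrides` function is a hypothetical illustration of the dotted `Service.key=value` form used in the command above, not the CLI's real parser.

```python
import ast


def parse_overrides(argv):
    """Turn ['--Service.key=value', ...] into {'Service': {'key': value}}."""
    overrides = {}
    for token in argv:
        if not token.startswith("--") or "=" not in token:
            continue
        path, raw = token[2:].split("=", 1)
        service, _, key = path.partition(".")
        if not key:
            continue
        try:
            value = ast.literal_eval(raw)  # numbers and quoted literals
        except (ValueError, SyntaxError):
            value = raw                    # otherwise keep the plain string
        overrides.setdefault(service, {})[key] = value
    return overrides


yaml_config = {"LLMService": {"temperature": 0.8, "max_tokens": 2048}}
overrides = parse_overrides(["--LLMService.temperature=0.9"])

# Command-line values are applied last, so they win over the YAML values.
effective = {**yaml_config["LLMService"], **overrides.get("LLMService", {})}
print(effective)
```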
+
+#### Service Configuration Best Practices
+
+1. **Use the Service Decorator for Defaults**: Put reasonable defaults in the service decorator
+2. **Use Constructor Parameters for Runtime Options**: Parameters that might change between deployments
+3. **Use YAML for Environment Configuration**: Separate configuration by environment (dev/staging/prod)
+4. **Use CLI for Quick Testing**: Override specific values for experimentation
+5. **Document Configuration Keys**: Make sure to document all available configuration options
+
+Following these practices helps create flexible and maintainable Dynamo services that can be easily configured for different environments and use cases.
+
+### Deploying a Single Service
+You can deploy a single service for local development, even if you have a dependency graph defined with `depends()`, by using `dynamo serve --service-name <ClassName> <entrypoint> <configuration arguments>`. You can see an example of this in our multinode documentation [here](../examples/multinode.md).
+
+### Composing Services into an Graph
+There are two main ways to compose services in Dynamo:
+1. Use `depends()` (Recommended)
+The depends() approach is the recommended way for production deployments:
+- Automatically deploys all dependencies
+- Creates a static inference graph at deployment time
+- Provides type hints and better IDE support
+
+2. Use `.link()` (Experimental)
+Our `.link()` syntax is a flexible, experimental way to compose services. Linking lets you compose services at runtime and observe their behavior. Under the hood, it edits the dependency graph between services. This is useful for experimentation and development, but we suggest writing a static graph for your final production deployment.
+
+#### Understanding the `.link()` syntax
+Let's take the example of a `Processor` component. This component can currently do two things:
+1. Process a request and send it to a `Router` to decide what worker to send it to.
+2. Process a request and send it to a `Worker` directly.
+
+Consider this snippet of the Processor:
+
+```python
+class Processor(ProcessMixIn):
+    """
+    vLLM pre and post processing
+    """
+
+    worker = depends(VllmWorker)
+    router = depends(Router)
+
+    # logic for processing a request based on router or worker
+```
+
+Think of all the `depends()` statements as the maximal set of edges for the Processor. At runtime, you may want to follow only a single path. By default, the Processor spins up both the VllmWorker and the Router as separate services (because `depends()` is defined for both). However, if you want to spin up only the Router, you can link the Router to the Processor:
+
+```python
+Processor.link(Router)
+```
+
+This removes the `worker` dependency from the Processor and spins up only the Router.
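
The effect of linking on the dependency graph can be modeled with plain dictionaries. This is a conceptual sketch only: the `link` and `reachable` functions and the graph representation are hypothetical and do not mirror the SDK's internals.

```python
def link(graph, source, target):
    """Return a new graph in which source keeps only its edge to target."""
    if target not in graph.get(source, []):
        raise ValueError(f"{source} does not declare a dependency on {target}")
    pruned = {node: list(edges) for node, edges in graph.items()}
    pruned[source] = [target]
    return pruned


def reachable(graph, root):
    """List the services that would be started when deploying from root."""
    started, stack = [], [root]
    while stack:
        node = stack.pop()
        if node in started:
            continue
        started.append(node)
        stack.extend(reversed(graph.get(node, [])))
    return started


# The maximal set of edges, as declared with depends().
declared = {"Processor": ["VllmWorker", "Router"], "Router": [], "VllmWorker": []}
before = reachable(declared, "Processor")

# Linking keeps only the chosen edge out of Processor.
linked = link(declared, "Processor", "Router")
after = reachable(linked, "Processor")
print(before, after)
```

The declared graph is left untouched, which matches the idea that `depends()` describes what is possible while `.link()` selects what actually runs.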
\ No newline at end of file
--- a/docs/Makefile
+++ b/docs/Makefile
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS        ?=
+SPHINXBUILD       ?= sphinx-build
+SOURCEDIR          = .
+BUILDDIR           = build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+clean:
+	@rm -fr ${BUILDDIR}
+.PHONY: help Makefile clean
+
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%:
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
--- a/docs/_static/custom.js
+++ b/docs/_static/custom.js
+// Add RunLLM widget
+document.addEventListener("DOMContentLoaded", function () {
+    var script = document.createElement("script");
+    script.type = "module";
+    script.id = "runllm-widget-script";
+
+    script.src = "https://widget.runllm.com";
+
+    script.setAttribute("version", "stable");
+    script.setAttribute("runllm-keyboard-shortcut", "Mod+j"); // cmd-j or ctrl-j to open the widget.
+    script.setAttribute("runllm-name", "dynamo");
+    script.setAttribute("runllm-position", "BOTTOM_RIGHT");
+    script.setAttribute("runllm-position-y", "120px");
+    script.setAttribute("runllm-position-x", "20px");
+    script.setAttribute("runllm-assistant-id", "758");
+
+    script.async = true;
+    document.head.appendChild(script);
+  });
--- a/docs/_static/switcher.json
+++ b/docs/_static/switcher.json
+[
+    {
+        "name": "0.1.0 (current release)",
+        "version": "0.1.0",
+        "url": "https://docs.nvidia.com/dynamo/latest/index.html"
+    },
+    {
+        "name": "older releases",
+        "version": "archives",
+        "url": "https://docs.nvidia.com/dynamo/archives/"
+    }
+]
\ No newline at end of file
--- a/docs/architecture.md
+++ b/docs/architecture.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
+All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0

-# Dynamo architecture and key features
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# High Level Architecture

-Dynamo is high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLang or others) and captures LLM-specific capabilities such as
+Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. Designed to be inference-engine agnostic (it supports TRT-LLM, vLLM, SGLang, and others), it captures LLM-specific capabilities such as:

 - **Disaggregated prefill & decode inference** – Maximizes GPU throughput and facilitates trade off between throughput and latency.
 - **Dynamic GPU scheduling** – Optimizes performance based on fluctuating demand
@@ -11,97 +28,83 @@ Dynamo is high-throughput low-latency inference framework designed for serving g

 Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach

-## Motivation
+## Motivation behind Dynamo

-Scaling inference for generative AI and reasoning models are fundamentally hard problems—not just in terms of performance, but also in correctness and efficiency. Today, most inference serving frameworks struggle to handle the sheer complexity of large-scale distributed execution.
+Scaling inference for generative AI and reasoning models is a fundamentally hard problem, not just in terms of performance but also in correctness and efficiency. Most inference serving frameworks struggle to handle the sheer complexity of large-scale distributed execution.

 There are multi-faceted challenges:

- *Extremely hard UX*: User experience is critical for distributed inference runtimes because managing large-scale inference systems is already complex, and poor usability further amplifies inefficiencies. Developers need a clear, intuitive way to define, optimize, and modify inference execution without wrestling with low-level infrastructure details. Without simple UX, inference runtimes remain inaccessible, prone to errors, and inefficient, slowing down model deployment and innovation. A modern distributed inference stack must be designed with usability at its core—empowering developers to scale AI effortlessly for agentic workflows while ensuring correctness and performance.
+- *Difficult UX*: User experience is critical for distributed inference runtimes because managing large-scale inference systems is already complex, and poor usability further complicates matters. Developers need a clear, intuitive way to define, optimize, and update inference execution without wrestling with low-level infrastructure details. Without simple UX, inference runtimes remain inaccessible, prone to errors, and inefficient, hindering model deployment and innovation. A modern distributed inference stack must consider usability at its core—empowering developers to scale AI effortlessly for agentic workflows while ensuring correctness and performance.

- *GPU underutilization*: Traditional monolithic inference pipelines often leave GPUs idle due to the imbalance between prefill and decode stages. Prefill (where large prompt embeddings are generated) is highly compute-intensive, while decode (where tokens are generated) is latency-sensitive. A disaggregated approach is needed to separate prefill and decode, ensuring optimal GPU utilization and increasing overall throughput ([DistServe](https://arxiv.org/abs/2401.09670)).
+- *GPU underutilization*: Traditional monolithic inference pipelines often leave GPUs idle due to the imbalance between prefill and decode stages. Prefill (which generates large prompt embeddings) is highly compute-intensive, while decode (which generates tokens) is latency-sensitive. A disaggregated approach that separates prefill and decode ensures optimal GPU utilization and increases overall throughput ([DistServe](https://arxiv.org/abs/2401.09670)).

- *Expensive KV cache re-computation*: When requests are not efficiently routed, KV caches (intermediate states of transformer model) often get flushed and recomputed, leading to wasted computation cycles and increased latency. KV-aware request routing eliminates redundant KV cache regeneration, significantly boosting efficiency.([DeepSeek](https://arxiv.org/abs/2501.12948))
+- *Expensive KV cache re-computation*: When requests aren't efficiently routed, KV caches (intermediate states of the transformer model) often get flushed and recomputed, leading to wasted computation cycles and increased latency. KV-aware request routing eliminates redundant KV cache regeneration, significantly boosting efficiency ([DeepSeek](https://arxiv.org/abs/2501.12948)).

 - *Memory bottlenecks*: Large-scale inference workloads demand extensive KV cache storage, which can quickly overwhelm GPU memory capacity. KV cache offloading across memory hierarchies (HBM, DDR, NVMe or remote storage) enables models to scale beyond GPU memory limits and speeds up latency. ([Mooncake](https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/mooncake-store-preview.md), [AIBrix](https://blog.vllm.ai/2025/02/21/aibrix-release.html), [LMCache](https://lmcache.ai/))

- *Fluctuating demand and inefficient GPU allocation*: Inference workloads are use case specific and are inherently dynamic—demand surges unpredictably, yet traditional serving stacks allocate GPUs statically. Dynamic GPU scheduling ensures that resources are allocated based on real-time demand, preventing over-provisioning and improving utilization ([AzureTrace](https://github.com/Azure/AzurePublicDataset))
+- *Fluctuating demand and inefficient GPU allocation*: Inference workloads are use-case specific and inherently dynamic. Demand surges unpredictably, yet traditional serving stacks allocate GPUs statically. Dynamic GPU scheduling ensures that resources are allocated based on real-time demand, preventing over-provisioning and improving utilization ([AzureTrace](https://github.com/Azure/AzurePublicDataset)).

- *Inefficient data transfer*: Distributed inference workloads introduce unique and highly dynamic communication patterns that differ fundamentally from training. Unlike training, where worker roles remain largely static, inference requires real-time worker scaling, dynamic load balancing, and adaptive memory management—necessitating a communication layer that can efficiently handle these evolving requirements. Existing contemporary libraries are built for static, synchronous operations and lack the dynamicity needed for inference serving. While UCX provides high-performance networking, it demands deep networking expertise to configure correctly, making it impractical for broad inference use cases. What developers really need is a library, optimized for inference workloads that can abstract heterogeneous memory (remote memory, or storage) and dynamically selects the best transport backend via a unified API.
+- *Inefficient data transfer*: Distributed inference workloads introduce unique and highly dynamic communication patterns that differ fundamentally from training. Unlike training, where worker roles remain largely static, inference requires real-time worker scaling, dynamic load balancing, and adaptive memory management—necessitating a communication layer that can efficiently handle these evolving requirements. Contemporary libraries are built for static, synchronous operations and lack the dynamicity needed for inference serving. While UCX provides high-performance networking, it requires deep networking expertise to configure correctly, making it impractical for broad inference use cases. Developers need a library optimized for inference workloads that can abstract heterogeneous memory (remote memory or storage) and dynamically select the best transport mechanism via a unified API.

-To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Furthermore, Dynamo features NIXL (Nvidia Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.
+To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.

 ## High level architecture and key benefits

-The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes four key features.
+The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:

 - [Dynamo Disaggregated Serving](disagg_serving.md)
 - [Dynamo Smart Router](kv_cache_routing.md)
- [Dynamo Distributed KV Cache Manager](kv_cache_manager.md)
+- [Dynamo KV Cache Block Manager](kvbm_intro.rst)
+- [Planner](../guides/planner.md)
 - [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)

 Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
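
The trade-off between cache hits and load balance described above can be illustrated with a toy scoring function. This is a simplified sketch under stated assumptions: it compares flat lists of block IDs instead of using a radix tree, and the `pick_worker` function, its `load_weight` parameter, and the data layout are all hypothetical rather than Dynamo's router.

```python
def shared_prefix_len(a, b):
    """Number of leading blocks that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(request_blocks, workers, load_weight=1.0):
    """Choose the worker with the best balance of cache overlap and load.

    workers maps a name to {"cached": [block sequences], "active": int}.
    """
    def score(name):
        info = workers[name]
        overlap = max(
            (shared_prefix_len(request_blocks, seq) for seq in info["cached"]),
            default=0,
        )
        return overlap - load_weight * info["active"]

    # Sorting the names makes tie-breaking deterministic.
    return max(sorted(workers), key=score)


request = [1, 2, 3, 9]
idle = {
    "w0": {"cached": [[1, 2, 3, 4]], "active": 0},
    "w1": {"cached": [[1, 2]], "active": 0},
}
busy = {
    "w0": {"cached": [[1, 2, 3, 4]], "active": 3},
    "w1": {"cached": [[1, 2]], "active": 0},
}

choice_idle = pick_worker(request, idle)
choice_busy = pick_worker(request, busy)
print(choice_idle, choice_busy)
```

With both workers idle, the one holding the longer cached prefix wins; once that worker is heavily loaded, the request goes to the other worker even though less of its prefix is cached.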

-![](images/architecture.png "Dynamo Architecture")
+![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](../images/architecture.png "Dynamo Architecture")

-Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if an increase in requests with long input sequences is detected, the Planner automatically scales up prefill workers to meet the heightened demand.
+Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if Dynamo detects an increase in requests with long input sequences, the Planner automatically scales up prefill workers to meet the heightened demand.

 Beyond efficient event communication, data transfer across multi-node deployments is crucial at scale. To address this, Dynamo utilizes NIXL, a technology designed to expedite transfers through reduced synchronization and intelligent batching. This acceleration is particularly vital for disaggregated serving, ensuring minimal latency when prefill workers pass KV cache data to decode workers.

-Dynamo prioritizes seamless integration. Its modular design allows it to work harmoniously with your existing infrastructure and preferred open-source components. To achieve optimal performance and extensibility, Dynamo leverages the strengths of both Rust and Python. Critical performance-sensitive modules are built with Rust for speed, memory safety, and robust concurrency. Meanwhile, Python is employed for its flexibility, enabling rapid prototyping and effortless customization.
+Dynamo prioritizes seamless integration. Its modular design enables it to work harmoniously with your existing infrastructure and preferred open-source components. To achieve optimal performance and extensibility, Dynamo leverages the strengths of both Rust and Python. We built critical performance-sensitive modules with Rust for speed, memory safety, and robust concurrency. Meanwhile, we used Python for its flexibility, enabling rapid prototyping and effortless customization.

 ## Performance benefits of key features

 ### Disaggregated serving

-Disaggregating prefill and decode significantly boosts performance, gaining efficiency the more GPUs that are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization.
-
-<figure>
-    <img src='images/disagg_perf_benefit.png' alt='missing' width="1200" height="400" />
-    <p>Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL</p>
-</figure>
+Disaggregating prefill and decode boosts performance, gaining efficiency when more GPUs are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization.

-<!--
-![](images/disagg_perf_benefit.png)[1]
+![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../images/disagg_perf_benefit.png)

-[1]: Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL
-->
+* Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL


-The disaggregation of prefill and decode phases offers valuable flexibility. Since these phases directly correlate with time-to-first-token (TTFT) and inter-token latency (ITL) respectively, adjusting worker allocation allows for tailored performance. This enables optimization for specific service level agreements (SLAs), whether prioritizing faster TTFT, lower ITL, or higher throughput.
+The disaggregation of prefill and decode phases offers valuable flexibility. Since these phases directly correlate with time-to-first-token (TTFT) and inter-token latency (ITL) respectively, adjusting worker allocation can provide tailored performance. This enables optimization for specific service level agreements (SLAs), whether prioritizing faster TTFT, lower ITL, or higher throughput.

 ### KV aware routing

+![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../images/kv_routing.png)

-<figure>
-    <img src='images/kv_routing.png' alt='missing' />
-    <p>Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL</p>
-</figure>
-
-<!--
-![](images/kv_routing.png)[2]
+* Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL

-[2]: Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL
-->

-Existing routing methods, including load-based routing, overlook the specific properties of LLMs that could improve performance. Addressing this, routing user queries to workers with the highest KV cache hit rate (rather than simply the least busy node) allows for immediate processing, even under heavy load. The figures above illustrate the effectiveness of KV aware routing on 100,000 real R1 user queries, achieving a 3x improvement in TTFT and a 2x reduction in average request latency. Depending on traffic, this approach can also enhance throughput.
+Existing routing methods, including load-based routing, overlook the specific properties of LLMs that could improve performance. Addressing this, routing user queries to workers with the highest KV cache hit rate (rather than simply the least busy node) allows for immediate processing, even under heavy load. The preceding figures illustrate the effectiveness of KV aware routing on 100,000 real R1 user queries, achieving a 3x improvement in TTFT and a 2x reduction in average request latency. Depending on traffic, this approach can also enhance throughput.

 ### KV cache manager

-Dynamo's design enables KV cache offloading to system CPU memory, and will be extended to support SSDs and networked object storage in subsequent releases. In many accelerated servers, the CPU (system) memory is much larger than the GPU memory and fast enough to store and serve KV cache data. The following plot highlights the performance gains achieved through system memory offloading, even with prefix caching enabled via inference engine. In a scenario involving 10 multi-turn conversations with 80 users, system memory offloading resulted in a 40% improvement in TTFT, demonstrating additional benefits beyond basic prefix caching.
+Dynamo's design enables KV cache offloading to system CPU memory. In accelerated servers, the CPU (system) memory is often larger than the GPU memory and fast enough to store and serve KV cache data. The following plot highlights the performance gains achieved through system memory offloading, even with prefix caching enabled via inference engine. In a scenario involving 10 multi-turn conversations with 80 users, system memory offloading resulted in a 40% improvement in TTFT, demonstrating benefits beyond basic prefix caching.
+
+![Line graph comparing Pure GPU prefix caching and Dynamo KV manager host offloading for TTFT (Time To First Token) across rounds with 80 users](../images/kv_manager.png)

-<figure>
-    <img src='images/kv_manager.png' alt='missing' />
-    <p>Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL</p>
-</figure>
+* Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL

-### NIXL
+### NVIDIA Inference Transfer Library (NIXL)

-NIXL streamlines data transfer through simplified synchronization and batching and simplified source and destination abstractions. NIXL is able to abstract data movement across different types of memory and fast storage, whereas other data transfer libraries typically support only one tier of memory. These enhancements yield significant performance gains, accelerating both time-to-first-token (TTFT) and overall throughput.
+NIXL streamlines data transfer through simplified synchronization and batching and simplified source and destination abstractions. NIXL can abstract data movement across different types of memory and fast storage, whereas other data transfer libraries typically support a single tier of memory. These enhancements yield significant performance gains, accelerating both time-to-first-token (TTFT) and throughput.

-## Acknowledgement
+## Acknowledgements

-We would like to acknowledge several open source software stacks for motivating us to create Dynamo.
+We'd like to acknowledge several open source software stacks that motivated our creation of Dynamo.

 - vLLM and vLLM-project
 - SGLang

--- a/docs/disagg_serving.md
+++ b/docs/disagg_serving.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
+All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
 # Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance

 The prefill and decode phases of LLM requests have different computation characteristics and memory footprints. Disaggregating these phases into specialized llm engines allows for better hardware allocation, improved scalability, and overall enhanced performance. For example, using a larger TP for the memory-bound decoding phase while a smaller TP for the computation-bound prefill phase allows both phases to be computed efficiently. In addition, for requests with long context, separating their prefill phase into dedicated prefill engines allows the ongoing decoding requests to be efficiently processed without being blocked by these long prefills.
@@ -103,7 +121,7 @@ The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows r

 #### Auto-Discovery for new workers

-In Dynamo, we use etcd (a distributed key-value pair store) as a way to register and discover new components. When a new decode/aggregated worker starts, it will add it's endpoint information to etcd allowing the router to discover it and route requests to it. For the KV-cache transfer process, newly added decode workers will put memory descriptors of their kv cache (used in NIXL transfer) in etcd. Newly added prefill workers also register with etcd for discovery and simply start pulling prefill requests from the global prefill queue after they spin up. Prefill workers will lazy-pull the descriptors when they start serving a remote prefill request for the first time.
+In Dynamo, we use `etcd` (a distributed key-value pair store) as a way to register and discover new components. When a new decode/aggregated worker starts, it adds its endpoint information to `etcd` allowing the router to discover it and route requests to it. For the KV-cache transfer process, newly added decode workers put memory descriptors of their KV cache (used in NIXL transfer) in `etcd`. Newly added prefill workers also register with `etcd` for discovery and simply start pulling prefill requests from the global prefill queue after they spin up. Prefill workers lazy-pull the descriptors when they start serving a remote prefill request for the first time.

 You can watch this happen live by running the following:

@@ -173,4 +191,4 @@ Since worker information is stored in etcd, we can shutdown workers by simply re
 - Decode/aggregated worker endpoints are immediately removed from etcd so that they would not accept new requests. They finish any in-flight requests, shut down their engine, and exit gracefully
 - Prefill workers stop pulling from the prefill queue and exit gracefully after all pending remote kv cache writes finish

-You can also visualize this by revoking a workers etcd lease while it has ongoing requests. We have an example script in the repo that does this [here](../lib/bindings/python/examples/hello_world/revoke_lease.py)
\ No newline at end of file
+You can also visualize this by revoking a worker's etcd lease while it has ongoing requests. Refer to this example script: [revoke_lease.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/revoke_lease.py).
\ No newline at end of file
--- a/docs/distributed_runtime.md
+++ b/docs/distributed_runtime.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
 # Dynamo Distributed Runtime

 ## Overview

 Dynamo `DistributedRuntime` is the core infrastructure in dynamo that enables distributed communication and coordination between different dynamo components. It is implemented in rust (`/lib/runtime`) and exposed to other programming languages via binding (i.e., python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure:

- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It maintains connection to external services (e.g., ETCD for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens.
+- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It maintains connection to external services (e.g., etcd for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens.
 - `Namespace`: A `Namespace` is a logical grouping of components that isolate between different model deployments.
 - `Component`: A `Component` is a discoverable object within a `Namespace` that represents a logical unit of workers.
 - `Endpoint`: An `Endpoint` is a network-accessible service that provides a specific service or function.

-While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice, each dynamo components typically are deployed with its own process and thus has its own `DistributedRuntime` object. However, they share the same namespace to discover each other, which will be covered later.
+While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice, each Dynamo component is typically deployed in its own process and thus has its own `DistributedRuntime` object. However, they share the same namespace to discover each other.

 For example, the deployment configuration `examples/llm/configs/disagg.yaml` has four workers:

@@ -21,27 +38,28 @@ Since the four workers are deployed in different processes, each of them have th

 ## Initialization

-In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are two modes for `DistributedRuntime` initialization: dynamic and static. In static mode, components and endpoints are defined using known addresses and do not change during runtime. In dynamic modes, components and endpoints are discovered through the network and can change during runtime. We focus on the dynamic mode in the rest of this document. Static mode is basically dynamic mode without registration and discovery and hence does not rely on ETCD.
+In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are two modes for `DistributedRuntime` initialization: dynamic and static. In static mode, components and endpoints are defined using known addresses and do not change during runtime. In dynamic mode, components and endpoints are discovered through the network and can change during runtime. We focus on the dynamic mode in the rest of this document. Static mode is basically dynamic mode without registration and discovery and hence does not rely on etcd.

-> [!CAUTION]
-> The hierarchy and naming in ETCD and NATS might be changed and this document might not reflect the latest changes. However, the main idea would remain the same.
+```{caution}
+The hierarchy and naming in etcd and NATS may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same.
+```

- `DistributedRuntime`: When a `DistributedRuntime` object is created, it will establish connections to the following two services:
-    - ETCD (dynamic mode only): for service discovery. In static mode, `DistributedRuntime` can operate without ETCD.
+- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections to the following two services:
+    - etcd (dynamic mode only): for service discovery. In static mode, `DistributedRuntime` can operate without etcd.
    - NATS (both static and dynamic mode): for messaging.

-  where ETCD and NATS are two global services (there could be multiple ETCD and NATS services for high availability).
+  where etcd and NATS are two global services (there could be multiple etcd and NATS services for high availability).

-  For ETCD, it also creates a primary lease and spin up a background task to keep the lease alive. All objects registered under this `DistributedRuntime` will use this lease_id to maintain their life cycle. There is also a cancellation token that is tied to the primary lease. When the cancellation token is triggered or the background task failed, the primary lease will be revoked or expired and the kv pairs stored with this lease_id will be removed.
- `Namespace`: `Namespace`s are primarily a logical grouping mechanism and is not registered in ETCD. It provides the root path for all components under this `Namespace`.
- `Component`: When a `Component` object is created, similar to `Namespace`, it will not be registered in ETCD. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`.
+  For etcd, it also creates a primary lease and spins up a background task to keep the lease alive. All objects registered under this `DistributedRuntime` use this `lease_id` to maintain their life cycle. There is also a cancellation token that is tied to the primary lease. When the cancellation token is triggered or the background task fails, the primary lease is revoked or expires, and the KV pairs stored with this `lease_id` are removed.
+- `Namespace`: A `Namespace` is primarily a logical grouping mechanism and is not registered in etcd. It provides the root path for all components under this `Namespace`.
+- `Component`: When a `Component` object is created, similar to `Namespace`, it isn't registered in etcd. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`.
 - `Endpoint`: When an Endpoint object is created and started, it performs two key registrations:
  - NATS Registration: The endpoint is registered with the NATS service group created during service creation. The endpoint is assigned a unique subject following the naming: `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`.
-  - ETCD Registration: The endpoint information is stored in ETCD at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (i.e., two `PrefillWorker`s in one deployment) will share the same `Namespace`, `Componenet`, and `Endpoint` name. They are distinguished by their different primary `lease_id` of their `DistributedRuntime`.
+  - etcd Registration: The endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (e.g., two `PrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by the different primary `lease_id` of their `DistributedRuntime`.
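
As a rough illustration of the two naming schemes above, the following Python sketch formats a NATS subject and an etcd key from their parts. The helper names (`nats_subject`, `etcd_key`), the sample lease IDs, and the choice to render the lease ID as lowercase hex in both strings are assumptions made for this example; they are not Dynamo APIs.

```python
def nats_subject(namespace: str, service: str, endpoint: str, lease_id: int) -> str:
    """Format `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`."""
    return f"{namespace}.{service}.{endpoint}-{lease_id:x}"


def etcd_key(namespace: str, component: str, endpoint: str, lease_id: int) -> str:
    """Format `/services/{namespace}/{component}/{endpoint}-{lease_id}`.

    Assumption: the lease ID is rendered as hex here as well; the text
    above does not state the encoding used in the etcd key.
    """
    return f"/services/{namespace}/{component}/{endpoint}-{lease_id:x}"


# Two workers of the same type share namespace, component, and endpoint
# names; only the (hypothetical) lease IDs differ.
key_a = etcd_key("dynamo", "PrefillWorker", "generate", 0x1A2B)
key_b = etcd_key("dynamo", "PrefillWorker", "generate", 0x3C4D)
subject_a = nats_subject("dynamo", "PrefillWorker", "generate", 0x1A2B)
```

Because the lease ID is the only varying part, both keys fall under the same `/services/{namespace}/{component}/{endpoint}` prefix while remaining distinct.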

 ## Calling Endpoints

-Dynamo uses `Client` object to call an endpoint. When a `Client` objected is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then sets up an ETCD watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The ETCD watcher will continuously updates the `Client` with the information, including `lease_id` and NATS subject of the available `Endpoint`s.
+Dynamo uses a `Client` object to call an endpoint. When a `Client` object is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with the information, including `lease_id` and NATS subject of the available `Endpoint`s.
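
To make the watcher behavior concrete, here is a minimal in-memory sketch of how a client-side table of available endpoints could be kept current from put and delete events on a key prefix. It does not talk to etcd; the `EndpointTable` class, the event names, and the sample keys and subjects are hypothetical and only mirror the description above.

```python
from dataclasses import dataclass, field


@dataclass
class EndpointTable:
    """Tracks live endpoint instances under one watched prefix (illustrative only)."""

    prefix: str
    instances: dict = field(default_factory=dict)

    def apply(self, event_type: str, key: str, value=None) -> None:
        # Ignore keys outside the watched prefix.
        if not key.startswith(self.prefix):
            return
        if event_type == "put":
            # A worker registered (or refreshed) its endpoint information.
            self.instances[key] = value
        elif event_type == "delete":
            # The lease was revoked or expired, so the key disappeared.
            self.instances.pop(key, None)


table = EndpointTable(prefix="/services/dynamo/VllmWorker/generate")
table.apply("put", "/services/dynamo/VllmWorker/generate-1a", {"subject": "dynamo.VllmWorker.generate-1a"})
table.apply("put", "/services/dynamo/VllmWorker/generate-2b", {"subject": "dynamo.VllmWorker.generate-2b"})
table.apply("delete", "/services/dynamo/VllmWorker/generate-1a")
```

After these three events, only the second instance remains in `table.instances`, which is the set a load balancing strategy would choose from.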

 The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [PushRouter](/lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:


--- a/docs/kv_cache_routing.md
+++ b/docs/kv_cache_routing.md
@@ -16,10 +16,10 @@ limitations under the License.
 -->


-# KV Cache Routing in Dynamo
+# KV Cache Routing
 This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.

-## Dynamo Architecture
+## Architecture
 Dynamo's architecture consists of three key concepts:

 - **Namespace**: Groups related components (similar to directories in a file system). In our examples, we use the label `dynamo`. This avoids collisions between two different dynamo graphs.
@@ -28,11 +28,11 @@ Dynamo's architecture consists of three key concepts:

 A Dynamo graph is a collection of components that are linked together to form a graph. There are two paths through the graphs. The request path and the response path. For LLMs the request path is single-in (a single message) and the response path is many-out (streamed output).

-A common pattern is to spin up multiple of the same components which serve the same endpoints, for example, when you want to duplicate models to serve more requests. Each endpoint will get a unique identifier and you will have to tell Dynamo how to route requests between these endpoints.
+A common pattern is to spin up multiple of the same components that serve the same endpoints, for example, when you want to duplicate models to serve more requests. Each endpoint will get a unique identifier and you will have to tell Dynamo how to route requests between these endpoints.

-Colloquially, we will refer to a dynamo component that serves an endpoint for LLM inference as a **worker**.
+Colloquially, we refer to a Dynamo component that serves an endpoint for LLM inference as a **worker**.

-## Basic Routing in Dynamo
+## Basic Routing
 Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.

 First, we must create a client tied to a component's endpoint. We can do this using the labels defined above. Here we are getting a client tied to the `generate` endpoint of the `VllmWorker` component.
@@ -76,7 +76,7 @@ To get a feel for how KV Cache management works on a single worker with KV Cache
    - The system applies an eviction policy (e.g., LRU, LFU) to identify blocks for removal
    - Selected blocks are evicted from the cache
    - **KVPublisher emits a KV removed event notifying KVIndexer about the removed block.**
-    - Alternatively, some systems may offload less-frequently used blocks to CPU memory. See [KV Offloading in Dynamo](kv_cache_manager.md).
+    - Alternatively, some systems may offload less-frequently used blocks to CPU memory.
 7. KV computation:
    - For new blocks, the model computes key and value tensors
    - These tensors are stored in the newly allocated cache blocks
@@ -111,7 +111,7 @@ In the above image, our cost function is (KV match - Load) so we select Worker 2
 - **Worker 2 = (0.50 - 0.50) = 0**
 - Worker 3 = (0.75 - 0.80) = -0.05
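
The selection rule can be sketched in a few lines of Python. This is only an illustration of the (KV match - Load) cost function using the Worker 2 and Worker 3 values listed above; the `select_worker` helper and the dictionary layout are assumptions for the example, and the actual router may weigh these terms differently.

```python
def select_worker(workers: dict) -> str:
    """Return the worker name with the highest (kv_match - load) score."""
    return max(workers, key=lambda name: workers[name][0] - workers[name][1])


# (kv_match, load) pairs taken from the listing above.
workers = {
    "Worker 2": (0.50, 0.50),
    "Worker 3": (0.75, 0.80),
}
best = select_worker(workers)
```

Here `best` is `"Worker 2"`: its score of 0 beats Worker 3's negative score, even though Worker 3 has the higher KV match.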

-## Dynamo Events
+## Events

 In Dynamo, we want to support KV Cache Routing and load balancing for many backends that have different implementations of KV Cache and record different metrics. To that end, we built a KVPublisher that can be plugged into any framework to publish KV Events and a KvMetricsPublisher that can publish Metric Events.

@@ -169,7 +169,10 @@ Sample Output:
 	543219876: 7,
 }
 ```
-> **Note**: This example is for building understanding, it will not run outside of the context of dynamo serve. See the examples/ folder for runnable examples.
+
+```{note}
+This example is designed to help you understand KV cache routing; it won't run outside of the context of dynamo serve. See the examples/ directory for runnable examples.
+```

 ### KvMetricsPublisher
 We added a KvMetrics Publisher which sends the following metrics to the KvMetricsAggregator:
@@ -217,10 +220,12 @@ Number of Requests Waiting: 1
 GPU Prefix Cache Hit Rate: 0.1
 ***
 ```
-> **Note**: This example is for building understanding, it will not run outside of the context of dynamo serve. See the examples/ folder for runnable examples.

+```{note}
+This example is for building understanding; it will not run outside of the context of `dynamo serve`. See the examples/ folder for runnable examples.
+```

-### [KV Router](../examples/llm/components/kv_router.py)
+### [KV Router](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components/kv_router.py)
 The Router component makes intelligent worker selection decisions
 1. Receives incoming requests as tokens
 2. Queries the KVIndexer to find potential cache hits across workers

--- a/docs/architecture/kvbm_architecture.md
+++ b/docs/architecture/kvbm_architecture.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
+All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# KVBM Architecture
+
+The KVBM serves as a critical infrastructure component for scaling LLM inference workloads efficiently. By cleanly separating runtime logic from memory management, and by enabling distributed block sharing, KVBM lays the foundation for high-throughput, multi-node, and memory-disaggregated AI systems.
+
+![A block diagram showing a layered architecture view of Dynamo KV Block manager.](../images/kvbm-arch.png)
+**High level layered architecture view of Dynamo KV Block manager and how it interfaces with different components of LLM inference ecosystem**
+
+The KVBM has three primary logical layers. The top layer, the LLM inference runtimes (TRTLLM, vLLM, and SGLang), integrates through a dedicated connector module to the Dynamo KVBM module. These connectors act as translation layers, mapping runtime-specific operations and events into the KVBM’s block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and providing memory tiering.
+
+The middle layer, the KVBM layer, encapsulates the core logic of the KV block manager and serves as the runtime substrate for managing block memory. The KVBM adapter layer normalizes the representations and data layout for the incoming requests across runtimes and forwards them to the core memory manager. The KVBM and the core modules implement required internal functionality, such as table lookups, memory allocation, block layout management, lifecycle and state transitions, and block reuse or eviction based on policies. The KVBM layer also has the required abstractions for external components to override or augment its behavior.
+
+The last layer, the NIXL layer, provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, and dynamic block registration and metadata exchange, and it provides a plugin interface for storage backends.
+
+NIXL integrates with several backends:
+
+- Block memory (for example, GPU HBM, host DRAM, remote DRAM, and local SSD when exposed as a block device)
+- Local file system (for example, POSIX)
+- Remote file system (for example, NFS)
+- Object stores (for example, S3-compatible)
+- Cloud storage (for example, blob storage APIs)
+
+**[NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** abstracts away the registration and integration complexity for each backend via a custom, optimizable plugin architecture and enables memory blocks to be published, serialized, and accessed remotely, allowing the disaggregation of compute and memory across nodes. Combined with the Dynamo KV Block Manager (KVBM), storage providers no longer need to retrofit or optimize individual LLM inference engines. Instead, they can focus on tuning their own stack and providing optimized endpoints, knowing that integration is smooth, standardized, and efficient. And for those who *do* want to go further, Dynamo KVBM offers a clean separation of concerns, making custom optimization not only possible, but simple.
\ No newline at end of file
--- a/docs/architecture/kvbm_components.md
+++ b/docs/architecture/kvbm_components.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
+All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Understanding KVBM components
+
+The design of the KVBM is inspired by the vLLM and SGLang KV block managers, with a twist drawn from the memory tiering designs long used in general GPU programming \[1,2\]. Figure 2 shows the internal architecture of KVBM and how it works across workers using NIXL.
+
+![Internal architecture and key modules in the Dynamo KVBM. ](../images/kvbm-internal-arch.png)
+**Internal architecture and key modules in the Dynamo KVBM**
+
+#### KvBlockManager as Orchestration Layer
+
+The \`KvBlockManager\<H, D\>\` coordinates the host (CPU), device (GPU), and remote memory tiers by managing per-backend block pools and exposing consistent block lifecycle APIs. It tracks KV block locations across device memory (G1), CPU memory within and across nodes (G2), local/pooled SSDs (G3), and remote storage (G4). G1 through G4 are the key tiers that KVBM enables. Note that KVBM treats G4 storage as an opaque blob store and is unaware of its internal layout optimizations.
+
+\`KvBlockManager\<H, D\>\` owns:
+
+* A device-side \`BlockPool\<Device\>\`
+* A host-side \`BlockPool\<Host\>\`
+* A remote NIXL agent that allows communication and memory sharing across nodes
+* A block set registry for remote lookup and import/export of block metadata
+
+Implementation-wise, \`KvBlockManagerState\` holds the actual logic. It is initialized from \`KvBlockManagerConfig\`, which merges the runtime, model, and layout configs. Remote awareness is injected through \`NixlOptions\`.
+
+#### Block Layout and Memory Mapping
+
+Each block is a 2D array \`\[num\_layers\]\[page\_size × inner\_dim\]\`. The memory layout is abstracted by the \`BlockLayout\` trait. The default implementation is \`FullyContiguous\`, which stores all layers for all blocks in one region with alignment-aware stride computation:
+
+```
+block_stride_in_bytes = align_up(num_layers × layer_stride, alignment);
+```
+
+This memory layout is shared by both CPU and GPU pools but uses storage-specific backends:
+
+* \`DeviceStorage\` → CUDA device buffer
+* \`PinnedStorage\` → page-locked host memory
+* \`SystemStorage\` → CPU heap memory (fallback/test)
+* \`NixlStorage\` → remote memory via NIXL RDMA handles (includes storage)
+
+Each layout is constructed using a \`LayoutConfig\`, and storage is either passed directly or allocated via a StorageAllocator.
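
The stride computation for a contiguous layout can be sketched as follows. This is a minimal illustration, not the actual `FullyContiguous` code: the `ContiguousLayout` struct, its field names, and the `dtype_bytes` parameter are assumptions chosen to mirror the formula above.

```rust
/// Rounds `value` up to the next multiple of `alignment` (which must be non-zero).
fn align_up(value: usize, alignment: usize) -> usize {
    assert!(alignment > 0, "alignment must be non-zero");
    (value + alignment - 1) / alignment * alignment
}

/// Hypothetical layout parameters; names follow the text, not the real `LayoutConfig`.
struct ContiguousLayout {
    num_layers: usize,
    page_size: usize,
    inner_dim: usize,
    dtype_bytes: usize,
    alignment: usize,
}

impl ContiguousLayout {
    /// Bytes occupied by one layer of one block.
    fn layer_stride(&self) -> usize {
        self.page_size * self.inner_dim * self.dtype_bytes
    }

    /// Bytes from the start of one block to the start of the next.
    fn block_stride(&self) -> usize {
        align_up(self.num_layers * self.layer_stride(), self.alignment)
    }

    /// Byte offset of `layer` inside `block`, relative to the start of the region.
    fn offset(&self, block: usize, layer: usize) -> usize {
        block * self.block_stride() + layer * self.layer_stride()
    }
}

fn main() {
    let layout = ContiguousLayout {
        num_layers: 3,
        page_size: 16,
        inner_dim: 100,
        dtype_bytes: 2,
        alignment: 256,
    };
    println!("layer stride: {} bytes", layout.layer_stride());
    println!("block stride: {} bytes", layout.block_stride());
    println!("offset of block 1, layer 2: {} bytes", layout.offset(1, 2));
}
```

With these example numbers the unaligned block size is 9600 bytes, so the aligned stride leaves a small amount of padding between consecutive blocks.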
+
+#### BlockPool and Memory Pools (Active \+ Inactive)
+
+Each \`BlockPool\<T\>\` (where \`T\` is \`DeviceStorage\`, \`PinnedStorage\`, etc.) tracks two sub-pools:
+
+* \`ActivePool\`: Contains blocks currently in use by sequences
+* \`InactivePool\`: Holds recycled blocks that are ready for allocation, similar to a free list.
+
+When a token block is requested (for example, \`get\_mutable\_block()\`), the allocator pops from \`InactivePool\`, transitions its state, and returns a writable handle. On sequence commit or eviction, blocks are reset and returned to the inactive pool.
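
The interplay between the two sub-pools can be sketched with a free list and a map of in-use blocks. The `ToyBlockPool` type, its method names, and the use of plain integer block IDs are assumptions made for illustration; the real `BlockPool<T>` hands out typed handles rather than IDs.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical block identifier used only in this sketch.
type BlockId = usize;

/// Toy pool: `inactive` plays the role of the free list, `active` tracks blocks in use.
struct ToyBlockPool {
    inactive: VecDeque<BlockId>,
    active: HashMap<BlockId, u64>, // block -> owning sequence id
}

impl ToyBlockPool {
    fn new(num_blocks: usize) -> Self {
        ToyBlockPool {
            inactive: (0..num_blocks).collect(),
            active: HashMap::new(),
        }
    }

    /// Pops a block from the inactive pool and marks it as owned by `sequence_id`.
    /// Returns `None` when no free block is left.
    fn allocate(&mut self, sequence_id: u64) -> Option<BlockId> {
        let id = self.inactive.pop_front()?;
        self.active.insert(id, sequence_id);
        Some(id)
    }

    /// Returns a block to the inactive pool; `false` if the block was not active.
    fn release(&mut self, id: BlockId) -> bool {
        if self.active.remove(&id).is_some() {
            self.inactive.push_back(id);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut pool = ToyBlockPool::new(2);
    let first = pool.allocate(7);
    let second = pool.allocate(7);
    println!("allocated {:?} and {:?}", first, second);
    println!("pool exhausted: {}", pool.allocate(8).is_none());
    if let Some(id) = first {
        pool.release(id);
    }
    println!("after release: {:?}", pool.allocate(8));
}
```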
+
+The \`BlockState\` state machine tracks the block lifecycle through the following states:
+
+| State | Description | Ownership | Valid Actions / Transitions |
+| ----- | ----- | ----- | ----- |
+| Reset | Block is uninitialized or has been reset. No sequence is associated. | Held in InactivePool, reusable | init\_sequence(salt\_hash) → Partial |
+| Partial | Block is being filled with tokens for a new sequence. In-progress. | Owned by the sequence creator | add\_token() / add\_tokens() (accumulate); commit() → Complete; reset() → Reset |
+| Complete | Block is fully filled with token data but not yet visible to others. | Still owned by creator thread | register() → Registered; reset() → Reset |
+| Registered | Block is finalized and visible for reuse. Available in the deduplication cache and usable for lookups. | Shared ownership (global registry) | Automatic drop() → triggers Remove event and transitions to Reset |
+
+The valid KVBM block manager transitions are:
+
+| From → To | Trigger | Validation |
+| ----- | ----- | ----- |
+| Reset → Partial | init\_sequence(salt\_hash) | Must not be in use |
+| Partial → Complete | commit() | Must be full |
+| Complete → Registered | register() | Must be finalized |
+| Registered → Reset | Drop of RegistrationHandle | Automatic |
+| Partial → Reset | Aborted sequence | Explicit or drop |
+| Complete → Reset | Invalidated | Explicit or drop |
+
+As an example of a block lifecycle in the KVBM block manager, suppose a sequence requests a new KV block:
+
+1. Allocator pops from InactivePool → Block is in Reset
+2. init\_sequence() → Transitions to Partial
+3. Tokens are appended → State remains Partial
+4. On full → commit() → State becomes Complete
+5. register() → Block is hashed and moved to Registered. The block can now be used for lookups.
+6. On eviction or end-of-life → drop() of RAII handle returns block to Reset
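
The lifecycle above can be sketched as a small state machine. This is an illustrative model under stated assumptions, not the KVBM implementation: the `ToyBlock` type, its `capacity` field, and the hash built from `DefaultHasher` are hypothetical, and `reset()` stands in for every path back to Reset, including the automatic drop of the registration handle.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BlockState {
    Reset,
    Partial,
    Complete,
    Registered,
}

/// Toy block that holds `capacity` tokens; not the real KVBM block type.
struct ToyBlock {
    state: BlockState,
    salt_hash: u64,
    tokens: Vec<u32>,
    capacity: usize,
}

impl ToyBlock {
    fn new(capacity: usize) -> Self {
        ToyBlock { state: BlockState::Reset, salt_hash: 0, tokens: Vec::new(), capacity }
    }

    /// Reset -> Partial. Fails if the block is already in use.
    fn init_sequence(&mut self, salt_hash: u64) -> Result<(), &'static str> {
        if self.state != BlockState::Reset {
            return Err("block is already in use");
        }
        self.salt_hash = salt_hash;
        self.state = BlockState::Partial;
        Ok(())
    }

    /// Appends a token; the state remains Partial.
    fn add_token(&mut self, token: u32) -> Result<(), &'static str> {
        if self.state != BlockState::Partial {
            return Err("block is not Partial");
        }
        if self.tokens.len() == self.capacity {
            return Err("block is full");
        }
        self.tokens.push(token);
        Ok(())
    }

    /// Partial -> Complete. The block must be full.
    fn commit(&mut self) -> Result<(), &'static str> {
        if self.state != BlockState::Partial || self.tokens.len() != self.capacity {
            return Err("only a full Partial block can be committed");
        }
        self.state = BlockState::Complete;
        Ok(())
    }

    /// Complete -> Registered. Returns a toy hash over the salt and the tokens.
    fn register(&mut self) -> Result<u64, &'static str> {
        if self.state != BlockState::Complete {
            return Err("only a Complete block can be registered");
        }
        let mut hasher = DefaultHasher::new();
        self.salt_hash.hash(&mut hasher);
        self.tokens.hash(&mut hasher);
        self.state = BlockState::Registered;
        Ok(hasher.finish())
    }

    /// Any state -> Reset (abort, invalidation, or handle drop in this sketch).
    fn reset(&mut self) {
        self.tokens.clear();
        self.state = BlockState::Reset;
    }
}

fn main() {
    let mut block = ToyBlock::new(2);
    block.init_sequence(42).unwrap();
    block.add_token(1).unwrap();
    block.add_token(2).unwrap();
    block.commit().unwrap();
    let hash = block.register().unwrap();
    println!("registered with hash {:#x}, state = {:?}", hash, block.state);
    block.reset();
    println!("after reset: {:?}", block.state);
}
```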
+
+#### Lifecycle Management via RAII and Event Plane
+
+The system uses RAII for memory lifecycle management. Every block holds metadata and registration state, and registration is coupled with an \`EventManager\`. On registration and drop:
+
+* \`PublishHandle\` triggers Register events
+* Dropping it triggers Remove events
+
+This pattern ensures consistency for shared memory tracking across workers without requiring explicit deallocation logic. The events are propagated on the Dynamo Event Plane, and any Dynamo component subscribed to it can listen for these changes. Even a storage provider can subscribe to the Event Plane and build an internal prefix tree representation tailored and optimized for its platform.
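
The RAII pattern described here can be sketched with a handle whose `Drop` implementation publishes the Remove event. The `ToyPublishHandle` type and the `Event` enum are assumptions for illustration, and a standard-library channel stands in for the event plane.

```rust
use std::sync::mpsc::{channel, Sender};

#[derive(Debug, PartialEq, Eq)]
enum Event {
    Register(u64),
    Remove(u64),
}

/// Hypothetical handle: creating it publishes Register, dropping it publishes Remove.
struct ToyPublishHandle {
    sequence_hash: u64,
    events: Sender<Event>,
}

impl ToyPublishHandle {
    fn new(sequence_hash: u64, events: Sender<Event>) -> Self {
        // A send error only means nobody is listening, which this sketch ignores.
        let _ = events.send(Event::Register(sequence_hash));
        ToyPublishHandle { sequence_hash, events }
    }
}

impl Drop for ToyPublishHandle {
    fn drop(&mut self) {
        // No explicit deregistration call is needed: leaving scope is enough.
        let _ = self.events.send(Event::Remove(self.sequence_hash));
    }
}

fn main() {
    let (tx, rx) = channel();
    {
        let _handle = ToyPublishHandle::new(0xabc, tx);
        // The block is considered registered while `_handle` is alive.
    }
    // The only sender was owned by the handle, so the loop ends once both events are read.
    for event in rx {
        println!("{:?}", event);
    }
}
```

Tying the Remove event to `Drop` means a forgotten cleanup call cannot leave a stale registration behind, which is the property the text relies on.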
+
+#### Remote Memory Integration via NIXL
+
+The NIXL agent exposes remote memory buffers using \`NixlBlockSet\`, \`RemoteBlocks\`, and layout descriptors. Key operations include:
+
+* \`nixl\_register()\`: Registers memory region with NIXL runtime
+* \`serialize() / deserialize()\`: Converts layout and memory into transferable descriptors
+* \`import\_remote\_blockset()\`: Loads remote node’s block layouts into the manager
+* \`get\_remote\_blocks\_mutable()\`: Fetches transferable memory views from another node
+
+\`RemoteBlocks\` is a lightweight abstraction over shared memory for cross-node block usage (via UCX or other backends).
+
+The left side of Figure 2 illustrates a bidirectional remote memory registration and layout synchronization protocol between workers (e.g., Worker 1 and Worker 2) using NIXL.
+
+1. *Agent Creation & Memory Registration:*
+
+   Each worker independently sets up a NixlAgent:
+   * Registers its memory regions (e.g., device memory) via nixl\_register().
+   * These regions correspond to blocks managed in the local BlockPool.
+
+   Once memory is registered, NIXL creates remote-accessible descriptors, which are bound to the memory layout.
+
+2. *Metadata exchange:*
+
+   After memory registration, workers exchange serialized layout metadata, encapsulated in a \`SerializedNixlBlockLayout\`.
+
+   Why is this step critical?
+   * LLM inference workloads often differ in *tensor parallel (TP)* configurations.
+     * Worker 1 might have TP=4, while Worker 2 has TP=8.
+     * Hence, even if both systems use similar \`FullyContiguous\` layouts, their internal slicing and alignment assumptions differ.
+   * The metadata exchange bridges this semantic mismatch by sharing:
+     * LayoutConfig (num\_layers, page\_size, inner\_dim, dtype)
+     * BlockSetID
+     * Base address \+ stride information (including alignment)
+     * Device ID \+ memory type (host/device)
+   * Once shared, each worker can reconstruct the layout on its side using deserialize().
+
+   This enables NIXL to:
+   * Understand where each layer/block lives
+   * Perform correct gather-scatter operations during RDMA-like transfers
+
+   Without this step, remote fetches would result in data corruption or misaligned tokens.
+
+
+3. *Serialization & Deserialization: Making Layouts Portable*
+
+   In the serialization stage, KVBM exports its layout. \`FullyContiguous::serialize()\` encodes:
+   * FullyContiguousConfig
+   * base\_offset
+   * Physical memory descriptors (NixlStorage) including:
+     * Memory type (VRAM, DRAM)
+     * Address & size
+     * Device ID
+
+   This is sent over a NIXL transfer and then injected into the KVBM scheduler state. In the deserialization stage, \`SerializedNixlBlockLayout::deserialize()\` rehydrates this into:
+
+   * A fully reconstructed memory layout view
+   * A local representation of a remote memory slice with correct offsets and size semantics
+
+   The result enables direct access to remote memory with consistent logical semantics. This guarantees that even across different system configurations (hardware or LLM shape), both parties agree on the memory view for each KV block.
+
+4. *Ownership handles and lifetime tracking*
+
+   Memory ownership in NIXL is tightly coupled with RAII-based handles:
+   * When a block is registered, it returns a \`PublishHandle\` which wraps a \`RegistrationHandle\`.
+   * On drop of this handle, an automatic Remove event is published, which:
+     * Deregisters the block from the NIXL layer
+     * Removes it from the remote block registry
+   * This ensures that once the block is evicted from the cache or no longer used in inference, all references are invalidated cleanly across nodes.
+
+   This mechanism avoids:
+   * Stale memory access
+   * Dangling pointers on GPU or host
+   * Manual deregistration bugs
+
+   The system can batch and publish registration events via a Publisher, optimizing performance under high concurrency.
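
The metadata exchange and the serialization round trip in steps 2 and 3 can be sketched as follows. The `LayoutDescriptor` struct and its fixed little-endian wire format are assumptions made for this example; the real `SerializedNixlBlockLayout` carries more information (dtype, memory type, BlockSetID) and its encoding is not described in this document.

```rust
/// Hypothetical subset of the metadata a worker would share with its peer.
#[derive(Debug, Clone, PartialEq, Eq)]
struct LayoutDescriptor {
    num_layers: u64,
    page_size: u64,
    inner_dim: u64,
    base_address: u64,
    block_stride: u64,
    device_id: u64,
}

impl LayoutDescriptor {
    const FIELD_COUNT: usize = 6;

    /// Encodes every field as 8 little-endian bytes, in declaration order.
    fn serialize(&self) -> Vec<u8> {
        let fields = [
            self.num_layers,
            self.page_size,
            self.inner_dim,
            self.base_address,
            self.block_stride,
            self.device_id,
        ];
        let mut bytes = Vec::with_capacity(Self::FIELD_COUNT * 8);
        for field in fields.iter() {
            bytes.extend_from_slice(&field.to_le_bytes());
        }
        bytes
    }

    /// Rebuilds the descriptor on the receiving worker; `None` if the length is wrong.
    fn deserialize(bytes: &[u8]) -> Option<Self> {
        if bytes.len() != Self::FIELD_COUNT * 8 {
            return None;
        }
        let mut fields = bytes.chunks_exact(8).map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        });
        Some(LayoutDescriptor {
            num_layers: fields.next()?,
            page_size: fields.next()?,
            inner_dim: fields.next()?,
            base_address: fields.next()?,
            block_stride: fields.next()?,
            device_id: fields.next()?,
        })
    }

    /// Address of a remote block as seen through the reconstructed layout.
    fn block_address(&self, block: u64) -> u64 {
        self.base_address + block * self.block_stride
    }
}

fn main() {
    let local = LayoutDescriptor {
        num_layers: 32,
        page_size: 16,
        inner_dim: 1024,
        base_address: 0x1000,
        block_stride: 0x10_0000,
        device_id: 0,
    };
    let wire = local.serialize();
    let remote_view = LayoutDescriptor::deserialize(&wire).expect("valid descriptor");
    println!("round trip ok: {}", remote_view == local);
    println!("block 3 lives at {:#x}", remote_view.block_address(3));
}
```

Because both sides derive block addresses from the same base address and stride, a peer with a different local configuration still resolves the same bytes for a given block.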
+
+
+#### Storage backends and pluggability
+
+Integrating KVBM with a storage backend is straightforward: extend or wrap \`NixlEnabledStorage\` to support cross-node RDMA registration. All layouts and block pools are generic over these backends, allowing for fine-grained control over memory tiers. We are deferring detailed integration guidance as we are actively collaborating with storage partners to simplify and standardize these integration paths.
+
+```
+An example system architecture
+                        +------------------------------+
+                        |Distributed Inference engine  |
+                        +------------------------------+
+                                  |
+                                  v
+                        +------------------------------+
+                        |  Dynamo KV Block Manager      |
+                        +------------------------------+
+                                  |
+                 +----------------+----------------+
+                 |                                 |
+                 v                                 v
+   +------------------------------+    +----------------------------+
+   |        NIXL Storage Agent     |    |        Event Plane          |
+   |  - Volume registration        |    |  - NATS-based Pub/Sub       |
+   |  - get()/put() abstraction    |    |  - StoreEvent / RemoveEvent |
+   +------------------------------+    +----------------------------+
+                 |                                 |
+                 v                                 v
+     +-----------------------------+   +-----------------------------+
+     |   G4 Storage Infrastructure  |   | Storage Provider Subscriber |
+     |  (SSD, Object store, etc.)   |   |  - Parse Events             |
+     |  - Store KV blocks           |   |  - Build fast tree/index    |
+     +-----------------------------+    |  - Optimize G4 tiering      |
+                                        +-----------------------------+
+```
+
+For now, the following breakdown provides a high-level understanding of how KVBM interacts with external storage using the NIXL storage interface and the Dynamo Event Plane:
+
+##### NIXL Storage Interface (for Backend Integration)
+
+The NIXL interface abstracts volume interaction and decouples it from mounting, metadata tracking, or direct system I/O. It provides:
+
+* registerVolume(descriptor): Register a logical volume for KV cache data.
+* unregisterVolume(): Cleanly deregister and release volume mappings.
+* get() / put(): Block-level APIs used by KVBM to fetch and store token blocks.
+
+These abstractions allow backends to be integrated without tying into the host’s file system stack, enabling safe interaction with block devices, local filesystems, and RDMA-capable volumes. Please note that these APIs are still being finalized.
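
Since these APIs are not final, the following is only a sketch of what a backend might look like from the KVBM side. The `VolumeBackend` trait, its Rust-style method names, the signatures, and the in-memory implementation are all assumptions; they mirror the operations listed above and are not the NIXL interface itself.

```rust
use std::collections::HashMap;

/// Hypothetical trait mirroring registerVolume / unregisterVolume / get / put.
trait VolumeBackend {
    fn register_volume(&mut self, descriptor: &str) -> Result<(), String>;
    fn unregister_volume(&mut self);
    fn put(&mut self, sequence_hash: u64, block: &[u8]) -> Result<(), String>;
    fn get(&self, sequence_hash: u64) -> Option<Vec<u8>>;
}

/// Stand-in backend that keeps blocks in memory instead of on a real volume.
#[derive(Default)]
struct InMemoryVolume {
    descriptor: Option<String>,
    blocks: HashMap<u64, Vec<u8>>,
}

impl VolumeBackend for InMemoryVolume {
    fn register_volume(&mut self, descriptor: &str) -> Result<(), String> {
        if self.descriptor.is_some() {
            return Err("a volume is already registered".to_string());
        }
        self.descriptor = Some(descriptor.to_string());
        Ok(())
    }

    fn unregister_volume(&mut self) {
        // Releasing the volume also drops every block mapping in this sketch.
        self.descriptor = None;
        self.blocks.clear();
    }

    fn put(&mut self, sequence_hash: u64, block: &[u8]) -> Result<(), String> {
        if self.descriptor.is_none() {
            return Err("no volume registered".to_string());
        }
        self.blocks.insert(sequence_hash, block.to_vec());
        Ok(())
    }

    fn get(&self, sequence_hash: u64) -> Option<Vec<u8>> {
        self.blocks.get(&sequence_hash).cloned()
    }
}

fn main() {
    let mut volume = InMemoryVolume::default();
    volume.register_volume("example-volume").unwrap();
    volume.put(0x2a, &[1, 2, 3]).unwrap();
    println!("fetched: {:?}", volume.get(0x2a));
    volume.unregister_volume();
    println!("after unregister: {:?}", volume.get(0x2a));
}
```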
+
+##### Dynamo Event Plane (Pub/Sub Coordination Layer)
+
+To support external storage optimizations without modifying KVBM logic, we provide an **event plane** built on NATS.io that emits lifecycle events for all block operations. Two events are emitted:
+
+* StoreEvent: Emitted when a KV block is registered.
+* RemoveEvent: Emitted when a KV block is released or evicted.
+
+Each KVEvent (\~100 bytes) contains:
+
+* sequence\_hash: Unique identifier of the KV block
+* prefix\_hash: Prefix grouping for query-level aggregation
+* block\_size: Size in bytes
+* storage\_location: Logical volume identifier
+* event\_type: Store or Remove
+* extra\_metadata: Reserved fields for partner-specific optimization
+
+These events are batched and published periodically (e.g., every \~10s or dynamically based on system load) for scalability.
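
The event payload and the batching behavior can be sketched as follows. The field names come from the list above, but their Rust types, the `EventBatcher` type, and the flush policy (size threshold or elapsed interval, whichever comes first) are assumptions for illustration. Publishing to NATS is out of scope here, so `push` simply returns the batch that would be sent.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EventType {
    Store,
    Remove,
}

/// Field names follow the text; the types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
struct KvEvent {
    sequence_hash: u64,
    prefix_hash: u64,
    block_size: u64,
    storage_location: String,
    event_type: EventType,
    extra_metadata: Vec<u8>,
}

/// Hypothetical batcher: flushes when `max_batch` events are pending
/// or when `interval` has elapsed since the last flush.
struct EventBatcher {
    pending: Vec<KvEvent>,
    max_batch: usize,
    interval: Duration,
    last_flush: Instant,
}

impl EventBatcher {
    fn new(max_batch: usize, interval: Duration, now: Instant) -> Self {
        EventBatcher { pending: Vec::new(), max_batch, interval, last_flush: now }
    }

    /// Queues an event and returns a batch if one is due at time `now`.
    fn push(&mut self, event: KvEvent, now: Instant) -> Option<Vec<KvEvent>> {
        self.pending.push(event);
        let full = self.pending.len() >= self.max_batch;
        let stale = now.duration_since(self.last_flush) >= self.interval;
        if full || stale {
            self.last_flush = now;
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }
}

/// Convenience constructor for the example below.
fn store_event(sequence_hash: u64) -> KvEvent {
    KvEvent {
        sequence_hash,
        prefix_hash: sequence_hash / 10,
        block_size: 4096,
        storage_location: "example-volume".to_string(),
        event_type: EventType::Store,
        extra_metadata: Vec::new(),
    }
}

fn main() {
    let start = Instant::now();
    let mut batcher = EventBatcher::new(3, Duration::from_secs(10), start);
    for hash in 1..=3 {
        if let Some(batch) = batcher.push(store_event(hash), start) {
            println!("flushing {} events", batch.len());
        }
    }
}
```

Passing `now` explicitly keeps the flush decision deterministic, which makes the policy easy to test without sleeping.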
+
+##### A conceptual design of a storage advisor
+
+This section gives an overview for storage providers who are interested in integrating with KVBM as a custom backend and offering optimized performance. ***Please note, this is optional and not required for KVBM to integrate with a backend.***
+
+External storage systems are not tightly coupled with Dynamo’s execution pipeline. Instead, they passively observe KV block lifecycle events through a subscription model:
+
+* Storage volumes are pre-provisioned and mounted by the storage provider.
+* These volumes are then registered with Dynamo via the NIXL Storage Agent using registerVolume() APIs. Dynamo itself does not manage mounts or provisioning.
+* The Dynamo KV Block Manager interacts only with logical block-level APIs (i.e., get() and put()).
+* In parallel, the Event Plane asynchronously broadcasts KV lifecycle events using a NATS-based pub/sub channel.
+* Storage vendors implement a lightweight subscriber process that listens to these events without interfering with the KV Manager’s runtime behavior.
+* This decoupling ensures that external storage systems can optimize block placement and lifecycle tracking without modifying or instrumenting the core Dynamo codebase.
+
+To enable fast lookup and dynamic tiering, storage vendors may build internal data structures from the received event stream. Here is a high-level conceptual design:
+
+* On receiving a StoreEvent, the storage system:
+  * Inserts a record into an internal prefix tree, hash map, or LRU index.
+  * This record includes the prefix\_hash and sequence\_hash, which logically identify the token block and its grouping.
+  * Associated metadata (e.g., block\_size, storage\_location) is also captured.
+* On receiving a RemoveEvent, the system:
+  * Deletes or prunes the corresponding record from its index.
+  * Optionally triggers cleanup or tier migration workflows.
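
The event handling above can be sketched as a small index that a subscriber process might maintain. This is a conceptual illustration only: the `StorageIndex` type, the simplified `KvEvent` enum, and the choice of two hash maps instead of a prefix tree are assumptions, and the sketch assumes a given `sequence_hash` always arrives with the same `prefix_hash`.

```rust
use std::collections::HashMap;

/// Simplified event with only the fields the index needs.
enum KvEvent {
    Store { sequence_hash: u64, prefix_hash: u64, block_size: u64, storage_location: String },
    Remove { sequence_hash: u64 },
}

struct BlockRecord {
    prefix_hash: u64,
    block_size: u64,
    storage_location: String,
}

/// Hypothetical subscriber-side index built purely from the event stream.
#[derive(Default)]
struct StorageIndex {
    by_sequence: HashMap<u64, BlockRecord>,
    by_prefix: HashMap<u64, Vec<u64>>,
}

impl StorageIndex {
    fn apply(&mut self, event: KvEvent) {
        match event {
            KvEvent::Store { sequence_hash, prefix_hash, block_size, storage_location } => {
                let record = BlockRecord { prefix_hash, block_size, storage_location };
                // A repeated Store only refreshes the record.
                if self.by_sequence.insert(sequence_hash, record).is_none() {
                    self.by_prefix.entry(prefix_hash).or_default().push(sequence_hash);
                }
            }
            KvEvent::Remove { sequence_hash } => {
                if let Some(record) = self.by_sequence.remove(&sequence_hash) {
                    let now_empty = match self.by_prefix.get_mut(&record.prefix_hash) {
                        Some(group) => {
                            group.retain(|hash| *hash != sequence_hash);
                            group.is_empty()
                        }
                        None => false,
                    };
                    if now_empty {
                        self.by_prefix.remove(&record.prefix_hash);
                    }
                }
            }
        }
    }

    /// Where a live block is stored, if the index knows about it.
    fn location(&self, sequence_hash: u64) -> Option<&str> {
        self.by_sequence.get(&sequence_hash).map(|r| r.storage_location.as_str())
    }

    /// Number of live blocks that share a prefix group.
    fn blocks_with_prefix(&self, prefix_hash: u64) -> usize {
        self.by_prefix.get(&prefix_hash).map_or(0, |group| group.len())
    }

    /// Total bytes of all live blocks, useful for space accounting.
    fn live_bytes(&self) -> u64 {
        self.by_sequence.values().map(|r| r.block_size).sum()
    }
}

fn main() {
    let mut index = StorageIndex::default();
    index.apply(KvEvent::Store {
        sequence_hash: 1,
        prefix_hash: 100,
        block_size: 4096,
        storage_location: "volume-a".to_string(),
    });
    index.apply(KvEvent::Store {
        sequence_hash: 2,
        prefix_hash: 100,
        block_size: 4096,
        storage_location: "volume-b".to_string(),
    });
    println!("block 1 is on {:?}", index.location(1));
    index.apply(KvEvent::Remove { sequence_hash: 1 });
    println!("live bytes: {}, prefix group size: {}", index.live_bytes(), index.blocks_with_prefix(100));
}
```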
+
+This event-driven indexing allows the storage system to track which KV blocks are live and where they belong—enabling low-latency lookup, efficient space reclamation, and multi-tier coordination. With real-time visibility into KV block usage patterns, the storage system can implement smart tiering policies, such as:
+
+* Hot block promotion: Frequently accessed KV blocks can be migrated to fast SSD volumes.
+* Cold block demotion: Infrequently used blocks can be demoted to slower storage (e.g., HDDs, cloud object storage).
+* Proactive compaction: If block sizes or prefix patterns indicate fragmentation, the storage backend can coalesce or rewrite blocks.
+
+These optimizations are performed entirely outside of Dynamo, with the assumption that storage providers adhere to SLA guarantees and volume availability.
+
+Critically, this entire system is designed to be non-intrusive:
+
+* The Dynamo KV Block Manager remains agnostic to how data is stored or optimized.
+* The Event Plane does not block or intercept any critical path of inference.
+* Storage vendors are given the freedom to innovate and optimize without requiring changes to the inference runtime.
+
+This design ensures that performance, resilience, and extensibility scale independently across the KV layer and the storage backend layer.
--- a/docs/architecture/kvbm_intro.rst
+++ b/docs/architecture/kvbm_intro.rst
+..
+    SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    SPDX-License-Identifier: Apache-2.0
+
+    Licensed under the Apache License, Version 2.0 (the "License");
+    you may not use this file except in compliance with the License.
+    You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+
+KV Block Manager
+================
+The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM, SGLang, and TRT-LLM.
+
+It offers:
+
+* A **unified memory API** that spans GPU memory, pinned host memory, remote RDMA-accessible memory, local or distributed pools of SSDs, and remote file/object/cloud storage systems.
+* Support for evolving **block lifecycles** (allocate → register → match) with event-based state transitions that storage can subscribe to.
+* Integration with **NIXL**, a dynamic memory exchange layer used for remote registration, sharing, and access of memory blocks over RDMA/NVLink.
+
+The Dynamo KV Block Manager serves as a reference implementation that emphasizes modularity and extensibility. Its pluggable design enables developers to customize components and optimize for specific performance, memory, and deployment needs.
+
+.. toctree::
+   :hidden:
+
+   Motivation <kvbm_motivation.md>
+   KVBM Architecture <kvbm_architecture.md>
+   Understanding KVBM components <kvbm_components.md>
+   KVBM Further Reading <kvbm_reading>
--- a/docs/architecture/kvbm_motivation.md
+++ b/docs/architecture/kvbm_motivation.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
+All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Motivation behind KVBM
+
+Large language models (LLMs) and other AI workloads increasingly rely on KV caches that extend beyond GPU and local CPU memory into remote storage tiers. However, efficiently managing the lifecycle of KV blocks in remote storage presents challenges:
+
+* Need for a solution tailored to GenAI use cases.
+* Lack of visibility into real-time block usage patterns.
+* Need for lightweight, ownership-driven memory management over complex object stores with unneeded overheads.
+* Need for a modular design with a simplified UX and memory safety.
+* Inability to differentiate between hot (frequently accessed) and cold (infrequently accessed) blocks across the stack without intrusive application-level changes.
+* Difficulty in optimizing storage placement across heterogeneous storage tiers (for example, SSDs, object storage, and cloud storage).
+
+Conventional systems either lack dynamic feedback mechanisms or require deep integration into core storage paths, which both increases complexity and reduces portability.