To address the growing demands of distributed inference serving, NVIDIA introduced Dynamo.
The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes four key features.
Dynamo is a Python-based SDK for building and deploying distributed inference applications. It leverages concepts from open-source projects like [BentoML](https://github.com/bentoml/bentoml) to provide a developer-friendly experience for going from local development to K8s deployment.
## Installation
```bash
pip install ai-dynamo
```
## Quickstart
Let's build a simple distributed pipeline with three components: `Frontend`, `Middle`, and `Backend`. The structure of the pipeline looks like this:
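```
Client ──HTTP──▶ Frontend ──▶ Middle ──▶ Backend
```

Below is a minimal sketch of what a `pipeline.py` implementing these three components could look like. The decorator and helper names (`service`, `dynamo_endpoint`, `api`, `depends`) and the request-passing details are assumptions about the SDK surface rather than a verbatim copy of its API; treat this as an illustration and consult the SDK documentation linked at the end of this section for the exact interface.

```python
# pipeline.py -- sketch of a three-stage Dynamo pipeline.
# NOTE: the decorator/helper names below are assumptions; see the SDK docs for the exact API.
from pydantic import BaseModel

from dynamo.sdk import api, depends, dynamo_endpoint, service


class Request(BaseModel):
    text: str


@service(dynamo={"enabled": True, "namespace": "inference"})
class Backend:
    """Final stage: appends '-back' to the incoming text."""

    @dynamo_endpoint()
    async def generate(self, req: Request):
        yield f"{req.text}-back"


@service(dynamo={"enabled": True, "namespace": "inference"})
class Middle:
    """Middle stage: appends '-mid' and forwards the request to Backend."""

    backend = depends(Backend)

    @dynamo_endpoint()
    async def generate(self, req: Request):
        next_request = Request(text=f"{req.text}-mid").model_dump_json()
        async for text in self.backend.generate(next_request):
            yield text


@service()
class Frontend:
    """Entry point: serves an HTTP POST /generate endpoint and streams results."""

    middle = depends(Middle)

    @api
    async def generate(self, text: str):
        async for result in self.middle.generate(Request(text=text).model_dump_json()):
            yield result
```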
You can run this pipeline locally by first spinning up ETCD and NATS:
```bash
# Spin up ETCD and NATS
docker compose -f deploy/docker-compose.yml up -d
```
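Before starting the pipeline, you can check that both containers are running, for example:
```bash
# List the services started by the compose file and their status
docker compose -f deploy/docker-compose.yml ps
```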
Then serve the pipeline:
```bash
# Run the pipeline
dynamo serve pipeline:Frontend
```
Once it's up and running, you can make a request to the pipeline:
```bash
curl -X POST http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "federer"}'
```
You should see the following output, with the `Middle` and `Backend` components each appending a suffix to the request text:
```bash
federer-mid-back
```
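The same request can also be made from Python, for example with the `requests` library (a minimal sketch; it assumes the pipeline is still serving on `localhost:3000` as above):

```python
import requests

# POST the same payload the curl example sends and print the returned text.
response = requests.post(
    "http://localhost:3000/generate",
    json={"text": "federer"},
)
print(response.text)  # expected to contain: federer-mid-back
```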
You can find in-depth documentation for the Dynamo SDK [here](../../deploy/dynamo/sdk/docs/sdk/README.md) and for the Dynamo CLI [here](../../deploy/dynamo/sdk/docs/cli/README.md).