Commit c2a6b368 authored by Suman Tatiraju's avatar Suman Tatiraju Committed by GitHub

docs: update guides and filenames (#252)

parent 5161250a
...@@ -35,8 +35,8 @@ To address the growing demands of distributed inference serving, NVIDIA introduc
The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes four key features:
- [Dynamo Disaggregated Serving](disagg_serving.md)
- [Dynamo Smart Router](kv_cache_routing.md)
- [Dynamo Distributed KV Cache Manager](kv_cache_manager.md)
- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
...
...@@ -27,34 +27,34 @@ This directory contains examples and reference implementations for deploying Lar
## Deployment Architectures
This figure shows an overview of the major components to deploy:

```
                                           +----------------+
                                    +------| prefill worker |-------+
                             notify |      |                |       |
                           finished |      +----------------+       | pull
                                    v                               v
+------+      +-----------+      +------------------+    push    +---------------+
| HTTP |----->| processor |----->| decode/monolith  |----------->| prefill queue |
|      |<-----|           |<-----|      worker      |            |               |
+------+      +-----------+      +------------------+            +---------------+
                  |    ^                  |
       query best |    | return           | publish kv events
           worker |    | worker_id        v
                  |    |         +------------------+
                  |    +---------|     kv-router    |
                  +------------->|                  |
                                 +------------------+
```

### Aggregated
Single-instance deployment where both prefill and decode are done by the same worker.

### Disaggregated
Distributed deployment where prefill and decode are done by separate workers that can scale independently.
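The decode-to-prefill handoff in the disaggregated path (push a request, pull it on the prefill side, notify the decode worker when done) can be sketched as a toy asyncio simulation. Every name here is a hypothetical stand-in, not Dynamo's actual implementation:

```python
import asyncio

async def decode_worker(prefill_queue: asyncio.Queue, request: str) -> str:
    done = asyncio.Event()                     # completion notification channel
    await prefill_queue.put((request, done))   # push work onto the prefill queue
    await done.wait()                          # wait for "prefill finished"
    return f"decoded({request})"               # decoding proceeds with KVs ready

async def prefill_worker(prefill_queue: asyncio.Queue) -> None:
    request, done = await prefill_queue.get()  # pull the next request
    await asyncio.sleep(0)                     # stand-in for running prefill
    done.set()                                 # notify decode: prefill finished

async def main() -> str:
    queue: asyncio.Queue = asyncio.Queue()
    decoded, _ = await asyncio.gather(
        decode_worker(queue, "req-1"),
        prefill_worker(queue),
    )
    return decoded

print(asyncio.run(main()))  # decoded(req-1)
```

In the real system the queue and the completion notification are distributed components rather than in-process primitives; the sketch only mirrors the control flow shown above.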
## Getting Started
...@@ -69,7 +69,7 @@ Start required services (etcd and NATS) using [Docker Compose](/deploy/docker-co
docker compose -f deploy/docker-compose.yml up -d
```
### Build container
```
./container/build.sh
...@@ -82,27 +82,7 @@ docker compose -f deploy/docker-compose.yml up -d
```
## Run Deployment
### Example architectures
...
# Dynamo SDK
Dynamo is a Python-based SDK for building and deploying distributed inference applications. It leverages concepts from open-source projects like [BentoML](https://github.com/bentoml/bentoml) to provide a developer-friendly path from local development to K8s deployment.
## Installation
```bash
pip install ai-dynamo
```
## Quickstart
Let's build a simple distributed pipeline with three components: `Frontend`, `Middle`, and `Backend`. The structure of the pipeline looks like this:
```
Users/Clients (HTTP)
        │
┌─────────────┐
│  Frontend   │  HTTP API endpoint (/generate)
└─────────────┘
        │
┌─────────────┐
│   Middle    │
└─────────────┘
        │
┌─────────────┐
│   Backend   │
└─────────────┘
```
The code for the pipeline looks like this:
```python
# filename: pipeline.py
from pydantic import BaseModel

from dynamo.sdk import service, dynamo_endpoint, depends, api


class RequestType(BaseModel):
    text: str


# Services are defined dependency-first so each depends(...) below can
# reference an already-defined class.
@service(
    resources={"cpu": "1"},
    dynamo={"enabled": True, "namespace": "inference"},
)
class Backend:
    @dynamo_endpoint()
    async def generate(self, req: RequestType):
        text = f"{req.text}-back"
        for token in text.split():
            yield f"Backend: {token}"


@service(
    resources={"cpu": "1"},
    dynamo={"enabled": True, "namespace": "inference"},
)
class Middle:
    backend = depends(Backend)

    @dynamo_endpoint()
    async def generate(self, req: RequestType):
        # Transform the request, then stream the Backend's response through.
        next_request = RequestType(text=f"{req.text}-mid").model_dump_json()
        async for response in self.backend.generate(next_request):
            yield f"Mid: {response}"


@service(resources={"cpu": "1"})
class Frontend:
    middle = depends(Middle)

    @api
    async def generate(self, text: str):
        request = RequestType(text=text)
        async for response in self.middle.generate(request.model_dump_json()):
            yield f"Frontend: {response}"
```
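Underneath the decorators, the pipeline is composition of async generators: text flows from `Frontend` through `Middle` to `Backend`, and tokens stream back up. A plain-Python sketch of that pattern, with hypothetical stand-ins and none of the `@service` wiring, HTTP, or Dynamo runtime:

```python
import asyncio

# Hypothetical stand-ins for the three services as bare async generators.
async def backend(text: str):
    for token in f"{text}-back".split():
        yield f"Backend: {token}"

async def middle(text: str):
    async for response in backend(f"{text}-mid"):
        yield f"Mid: {response}"

async def frontend(text: str):
    async for response in middle(text):
        yield f"Frontend: {response}"

async def main():
    # Collect the streamed chunks a client would receive.
    return [chunk async for chunk in frontend("federer")]

print(asyncio.run(main()))  # ['Frontend: Mid: Backend: federer-mid-back']
```

Each stage can transform or re-chunk the stream before yielding it upward, which is what makes the streaming pipeline composable.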
You can run this pipeline locally by spinning up ETCD and NATS and then running the pipeline:
```bash
# Spin up ETCD and NATS
docker compose -f deploy/docker-compose.yml up -d
```
Then run the pipeline:
```bash
# Run the pipeline
dynamo serve pipeline:Frontend
```
Once it's up and running, you can make a request to the pipeline using `curl`:
```bash
curl -X POST http://localhost:3000/generate \
-H "Content-Type: application/json" \
-d '{"text": "federer"}'
```
You should see the following output:
```bash
Frontend: Mid: Backend: federer-mid-back
```
You can find in-depth documentation for the Dynamo SDK [here](../../deploy/dynamo/sdk/docs/sdk/README.md) and for the Dynamo CLI [here](../../deploy/dynamo/sdk/docs/cli/README.md).
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
## Overview
Pipeline Architecture:
```
Users/Clients (HTTP)
┌─────────────┐
│ Frontend │ HTTP API endpoint (/generate)
└─────────────┘
│ dynamo/runtime
┌─────────────┐
│ Middle │
└─────────────┘
│ dynamo/runtime
┌─────────────┐
│ Backend │
└─────────────┘
```
## Unified serve
1. Launch all three services using a single command:
```bash
cd /workspace/examples/hello_world
dynamo serve hello_world:Frontend
```
2. Send a request to the frontend using curl:
```bash
curl -X 'POST' \
'http://localhost:3000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"text": "test"
}'
```
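The `accept: text/event-stream` header asks for a Server-Sent Events response. If the server streams SSE frames, a client has to unframe them; a minimal parsing sketch (simplified: it extracts only `data:` fields and ignores `event:`, `id:`, and `retry:`):

```python
def parse_sse(body: str):
    """Yield the data payload of each event in an SSE-framed body."""
    for block in body.split("\n\n"):           # events are blank-line separated
        data_lines = [
            line[len("data:"):].lstrip()       # strip the field name and space
            for line in block.splitlines()
            if line.startswith("data:")
        ]
        if data_lines:                         # multi-line data joins with \n
            yield "\n".join(data_lines)

stream = "data: Frontend: tok-1\n\ndata: Frontend: tok-2\n\n"
print(list(parse_sse(stream)))  # ['Frontend: tok-1', 'Frontend: tok-2']
```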