Commit 5e9370d3 authored by Kristen Kelleher, committed by GitHub

docs: fix sphinx errors admonitions adobe config (#1179)


Signed-off-by: Kristen Kelleher <kkelleher@nvidia.com>
- Content, format, and structural changes to the Dynamo docs for 0.3.0. 
- Includes copyediting and the first batch of changes from the DMO review.
parent 25c711f8
......@@ -18,19 +18,19 @@ limitations under the License.
# High Level Architecture
Dynamo is high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. Designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLang or others), it captures LLM-specific capabilities such as:
Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities:
- **Disaggregated prefill & decode inference** – Maximizes GPU throughput and facilitates trade off between throughput and latency.
- **Dynamic GPU scheduling** – Optimizes performance based on fluctuating demand
- **LLM-aware request routing** – Eliminates unnecessary KV cache re-computation
- **Accelerated data transfer** – Reduces inference response time using NIXL.
- **KV cache offloading** – Leverages multiple memory hierarchies for higher system throughput
- **Disaggregated prefill & decode inference** – Maximizes GPU throughput and helps you balance throughput and latency
- **Dynamic GPU scheduling** – Optimizes performance based on real-time demand
- **LLM-aware request routing** – Eliminates unnecessary KV cache recomputation
- **Accelerated data transfer** – Reduces inference response time using NIXL
- **KV cache offloading** – Uses multiple memory hierarchies for higher system throughput
Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.
## Motivation behind Dynamo
Scaling inference for generative AI and reasoning models are fundamentally hard problems—not just in terms of performance, but also in correctness and efficiency. Most inference serving frameworks struggle to handle the sheer complexity of large-scale distributed execution.
Scaling inference for generative AI and reasoning models presents complex challenges in three key areas: performance, correctness, and efficiency. Here's what we're solving:
There are multi-faceted challenges:
......@@ -55,7 +55,7 @@ The following diagram outlines Dynamo's high-level architecture. To enable large
- [Dynamo Disaggregated Serving](disagg_serving.md)
- [Dynamo Smart Router](kv_cache_routing.md)
- [Dynamo KV Cache Block Manager](kvbm_intro.rst)
- [Planner](../guides/planner.md)
- [Planner](planner.md)
- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
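As a rough illustration of KV cache-aware routing, the sketch below scores each worker by how many leading token blocks of a request it already holds, then subtracts a load penalty. The block size, the `load_weight` parameter, and the per-worker set of cached prefixes are assumptions made for this example. The set is a simple stand-in for the global radix tree, and Dynamo's actual router may weigh cache hits and load differently.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    # Cached prefixes stored as tuples of token blocks (stand-in for a radix tree).
    cached_prefixes: set = field(default_factory=set)


def to_blocks(tokens, block_size=4):
    """Split a token list into full fixed-size blocks; a trailing partial block is dropped."""
    n_full = len(tokens) // block_size
    return [tuple(tokens[i * block_size:(i + 1) * block_size]) for i in range(n_full)]


def cache_prefix(worker, tokens, block_size=4):
    """Record every block-aligned prefix of `tokens` as cached on `worker`."""
    blocks = to_blocks(tokens, block_size)
    for end in range(1, len(blocks) + 1):
        worker.cached_prefixes.add(tuple(blocks[:end]))


def matched_blocks(worker, tokens, block_size=4):
    """Count the leading blocks of `tokens` that `worker` already holds."""
    blocks = to_blocks(tokens, block_size)
    hits = 0
    for end in range(1, len(blocks) + 1):
        if tuple(blocks[:end]) not in worker.cached_prefixes:
            break
        hits = end
    return hits


def pick_worker(workers, tokens, load_weight=0.5, block_size=4):
    """Return the worker with the best (cache hits - load penalty) score."""
    def score(w):
        return matched_blocks(w, tokens, block_size) - load_weight * w.active_requests
    return max(workers, key=score)


workers = [Worker("worker-0", active_requests=1), Worker("worker-1", active_requests=1)]
cache_prefix(workers[1], list(range(8)))        # worker-1 has already seen tokens 0..7
chosen = pick_worker(workers, list(range(12)))  # new request shares that 8-token prefix
```

With equal load on both workers, the request goes to the worker that already holds the shared prefix. Raising `load_weight` shifts the balance toward idle workers.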
......
......@@ -61,7 +61,7 @@ The hierarchy and naming in etcd and NATS may change over time, and this documen
Dynamo uses a `Client` object to call an endpoint. When a `Client` object is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with information about the available `Endpoint`s, including each one's `lease_id` and NATS subject.
The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [PushRouter](/lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](../../lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
- `random`: randomly select an endpoint to hit,
- `round_robin`: select endpoints in round-robin order,
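To make the discovery prefix and the first two strategies concrete, here is a minimal Python sketch. It is not the `PushRouter` implementation: the `EndpointPicker` class, the `generate` endpoint name, and the lease IDs are hypothetical, and the plain list stands in for the instance set that the etcd watcher keeps current. Only `random` and `round_robin` are covered.

```python
import itertools
import random


def endpoint_prefix(namespace, component, endpoint):
    """Build the etcd prefix a Client watches, following the pattern in the text."""
    return f"/services/{namespace}/{component}/{endpoint}"


class EndpointPicker:
    """Toy stand-in for the instance list that the etcd watcher keeps current."""

    def __init__(self, instances, seed=None):
        self.instances = list(instances)
        self._rng = random.Random(seed)
        self._cycle = itertools.cycle(self.instances)

    def pick_random(self):
        # `random`: any available instance, chosen uniformly.
        return self._rng.choice(self.instances)

    def pick_round_robin(self):
        # `round_robin`: walk the instance list in order, wrapping around.
        return next(self._cycle)


prefix = endpoint_prefix("dynamo", "VllmWorker", "generate")
# Hypothetical lease IDs; real ones are assigned by etcd.
picker = EndpointPicker(["lease-a", "lease-b", "lease-c"], seed=0)
order = [picker.pick_round_robin() for _ in range(4)]
```

A real client would also have to rebuild its rotation whenever the watcher reports that an instance joined or left, which this sketch leaves out.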
......@@ -75,3 +75,18 @@ We provide native rust and python (through binding) examples for basic usage of
- Rust: `/lib/runtime/examples/`
- Python: `/lib/bindings/python/examples/`. We also provide a complete example of using `DistributedRuntime` for communication and Dynamo's LLM library for prompt templates and (de)tokenization to deploy a vllm-based service. Please refer to `lib/bindings/python/examples/hello_world/server_vllm.py` for details.
```{note}
Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to be slow and to require extensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
You can tune the number of parallel build jobs for building vLLM from source
on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
For example, on an ARM machine with low system resources:
`./container/build.sh --framework vllm --platform linux/arm64 --build-arg VLLM_MAX_JOBS=2`
For example, on a GB200, which has many CPU cores and ample memory:
`./container/build.sh --framework vllm --platform linux/arm64 --build-arg VLLM_MAX_JOBS=64`
When vLLM publishes pre-built ARM wheels, this process can be improved.
```
......@@ -52,9 +52,11 @@ There are two additional rules set by planner to prevent over-compensation:
To ensure dynamo serve complies with the SLA, we provide a pre-deployment script to profile the model performance with different parallelization mappings and recommend the parallelization mapping for prefill and decode workers and planner configurations. To use this script, the user needs to provide the target ISL, OSL, TTFT SLA, and ITL SLA.
> [!NOTE]
> Currently, the script considers a fixed ISL/OSL without KV cache reuse. If the real ISL/OSL has a large variance or a significant amount of KV cache can be reused, the result might be inaccurate.
> Currently, we assume there is no piggy-backed prefill requests in the decode engine. Even if there are some short piggy-backed prefill requests in the decode engine, it should not affect the ITL too much in most conditions. However, if the piggy-backed prefill requests are too much, the ITL might be inaccurate.
```{note}
The script considers a fixed ISL/OSL without KV cache reuse. If the real ISL/OSL has a large variance or a significant amount of KV cache can be reused, the result might be inaccurate.
We assume there are no piggy-backed prefill requests in the decode engine. A few short piggy-backed prefill requests should not affect the ITL much in most conditions. However, if there are too many piggy-backed prefill requests, the ITL estimate might be inaccurate.
```
```bash
python -m utils.profile_sla \
......@@ -78,7 +80,6 @@ For the prefill performance, the script will plot the TTFT for different TP size
For the decode performance, the script will plot the ITL for different TP sizes and different in-flight requests. Similarly, it will select the best point that satisfies the ITL SLA and delivers the best throughput per GPU and recommend the upper and lower bounds of the kv cache utilization rate to be used in planner.
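The selection rule described here (keep only the points that meet the latency SLA, then take the one with the best throughput per GPU) can be sketched in a few lines of Python. This is an illustration, not the profiling script itself: the `pick_best` helper and the dictionary keys are hypothetical, and the measurements are invented except for the TP=4 row, which mirrors the sample log line in this document.

```python
def pick_best(measurements, latency_key, sla_ms):
    """Among points meeting the latency SLA, return the one with the highest per-GPU throughput."""
    feasible = [m for m in measurements if m[latency_key] <= sla_ms]
    if not feasible:
        return None  # no configuration satisfies the SLA
    return max(feasible, key=lambda m: m["throughput_per_gpu"])


# Hypothetical prefill measurements; only the TP=4 row mirrors the sample log output.
prefill_points = [
    {"tp": 1, "ttft_ms": 160.0, "throughput_per_gpu": 18000.0},
    {"tp": 2, "ttft_ms": 90.0, "throughput_per_gpu": 17000.0},
    {"tp": 4, "ttft_ms": 48.37, "throughput_per_gpu": 15505.23},
    {"tp": 8, "ttft_ms": 30.0, "throughput_per_gpu": 9000.0},
]

best = pick_best(prefill_points, "ttft_ms", sla_ms=50.0)
```

The same helper applies to decode points by passing an ITL key and the ITL SLA instead.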
The following information will be printed out in the terminal:
```
2025-05-16 15:20:24 - __main__ - INFO - Analyzing results and generate recommendations...
2025-05-16 15:20:24 - __main__ - INFO - Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for prefill queue size: 0.24/0.10
......@@ -89,113 +90,97 @@ The following information will be printed out in the terminal:
After finding the best TP size for prefill and decode, the script interpolates the TTFT against ISL, and the ITL against active KV cache and decode context length. This provides a more accurate performance estimate when ISL and OSL change. The results are saved to `<output_dir>/<decode/prefill>_tp<best_tp>_interpolation`.
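As a simplified picture of the interpolation step, the sketch below estimates TTFT at an unprofiled ISL by piecewise-linear interpolation between profiled points. The text does not say which interpolation method the script uses, so linear interpolation is an assumption here, as are the `interpolate` helper and the sample numbers. The ITL case depends on two variables and is not covered.

```python
from bisect import bisect_right


def interpolate(x, xs, ys):
    """Piecewise-linear interpolation over sorted xs, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    hi = bisect_right(xs, x)
    lo = hi - 1
    frac = (x - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + frac * (ys[hi] - ys[lo])


# Hypothetical profiled points: input sequence length -> TTFT in ms.
isl_points = [1000, 2000, 4000]
ttft_points = [20.0, 40.0, 100.0]

estimate = interpolate(3000, isl_points, ttft_points)
```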
## Usage
The planner is started automatically as part of Dynamo pipelines when running `dynamo serve`. You can configure the planner just as you would any other component in your pipeline either via YAML configuration or through CLI arguments.
Usage:
`dynamo serve` automatically starts the planner. Configure it through YAML files or command-line arguments:
```bash
# Configure the planner through YAML configuration
# YAML configuration
dynamo serve graphs.disagg:Frontend -f disagg.yaml
# disagg.yaml
# ...
# Planner:
# environment: local
# no-operation: false
# log-dir: log/planner
Planner:
environment: local
no-operation: false
log-dir: log/planner
# Configure the planner through CLI arguments
# Command-line configuration
dynamo serve graphs.disagg:Frontend -f disagg.yaml --Planner.environment=local --Planner.no-operation=false --Planner.log-dir=log/planner
```
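The two configuration styles carry the same information: each `--Planner.key=value` flag corresponds to one key under the `Planner:` section of the YAML file. The following sketch shows that mapping with a hypothetical `flags_to_config` helper. It is not how `dynamo serve` parses its arguments, and it keeps every value as a string rather than converting types.

```python
def flags_to_config(argv):
    """Turn `--Class.key=value` flags into a nested {Class: {key: value}} dict."""
    config = {}
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            continue  # not a Class.key=value flag
        name, value = arg[2:].split("=", 1)
        if "." not in name:
            continue  # an ordinary flag with no service prefix
        service, key = name.split(".", 1)
        config.setdefault(service, {})[key] = value
    return config


argv = [
    "--Planner.environment=local",
    "--Planner.no-operation=false",
    "--Planner.log-dir=log/planner",
]
config = flags_to_config(argv)
```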
The planner accepts the following configuration options:
* `namespace` (str, default: "dynamo"): Namespace planner will look at
* `environment` (str, default: "local"): Environment to run the planner in (local, kubernetes)
* `served-model-name` (str, default: "vllm"): Model name that is being served
* `no-operation` (bool, default: false): Do not make any adjustments, just observe the metrics and log to tensorboard
* `log-dir` (str, default: None): Tensorboard logging directory
* `adjustment-interval` (int, default: 30): Interval in seconds between scaling adjustments
* `metric-pulling-interval` (int, default: 1): Interval in seconds between metric pulls
* `max-gpu-budget` (int, default: 8): Maximum number of GPUs to use, planner will not scale up more than this number of GPUs for prefill plus decode workers
* `min-gpu-budget` (int, default: 1): Minimum number of GPUs to use, planner will not scale down below this number of GPUs for prefill or decode workers
* `decode-kv-scale-up-threshold` (float, default: 0.9): KV cache utilization threshold to scale up decode workers
* `decode-kv-scale-down-threshold` (float, default: 0.5): KV cache utilization threshold to scale down decode workers
* `prefill-queue-scale-up-threshold` (float, default: 0.5): Queue utilization threshold to scale up prefill workers
* `prefill-queue-scale-down-threshold` (float, default: 0.2): Queue utilization threshold to scale down prefill workers
* `decode-engine-num-gpu` (int, default: 1): Number of GPUs per decode engine
* `prefill-engine-num-gpu` (int, default: 1): Number of GPUs per prefill engine
Alternatively, you can run the planner as a standalone python process. The configuration options above can be directly passed in as CLI arguments.
Configuration options:
* `namespace` (str, default: "dynamo"): Target namespace for planner operations
* `environment` (str, default: "local"): Target environment (local, kubernetes)
* `served-model-name` (str, default: "vllm"): Target model name
* `no-operation` (bool, default: false): Run in observation mode only
* `log-dir` (str, default: None): Tensorboard log directory
* `adjustment-interval` (int, default: 30): Seconds between adjustments
* `metric-pulling-interval` (int, default: 1): Seconds between metric pulls
* `max-gpu-budget` (int, default: 8): Maximum GPUs for all workers
* `min-gpu-budget` (int, default: 1): Minimum GPUs per worker type
* `decode-kv-scale-up-threshold` (float, default: 0.9): KV cache threshold for scale-up
* `decode-kv-scale-down-threshold` (float, default: 0.5): KV cache threshold for scale-down
* `prefill-queue-scale-up-threshold` (float, default: 0.5): Queue threshold for scale-up
* `prefill-queue-scale-down-threshold` (float, default: 0.2): Queue threshold for scale-down
* `decode-engine-num-gpu` (int, default: 1): GPUs per decode engine
* `prefill-engine-num-gpu` (int, default: 1): GPUs per prefill engine
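To show how the thresholds and GPU budgets interact, here is a simplified decision function using the default values listed above. It is a sketch under stated assumptions, not the planner's code: the `PlannerThresholds` class and `decide` function are hypothetical, one worker is added or removed per decision, and the planner's extra rules for preventing over-compensation are left out.

```python
from dataclasses import dataclass


@dataclass
class PlannerThresholds:
    # Defaults copied from the configuration options listed above.
    decode_kv_scale_up: float = 0.9
    decode_kv_scale_down: float = 0.5
    prefill_queue_scale_up: float = 0.5
    prefill_queue_scale_down: float = 0.2
    max_gpu_budget: int = 8
    min_gpu_budget: int = 1
    decode_engine_num_gpu: int = 1
    prefill_engine_num_gpu: int = 1


def decide(metric, up, down, workers, gpus_per_engine, other_gpus, cfg):
    """Return +1, -1, or 0 workers for one worker type."""
    gpus = workers * gpus_per_engine
    # Scale up only if the combined prefill + decode GPU count stays within budget.
    if metric > up and gpus + gpus_per_engine + other_gpus <= cfg.max_gpu_budget:
        return 1
    # Scale down only if this worker type keeps at least the minimum GPU count.
    if metric < down and gpus - gpus_per_engine >= cfg.min_gpu_budget:
        return -1
    return 0


cfg = PlannerThresholds()
# Decode KV cache utilization of 0.95 with 2 decode workers and 1 prefill GPU in use.
decode_delta = decide(0.95, cfg.decode_kv_scale_up, cfg.decode_kv_scale_down,
                      workers=2, gpus_per_engine=cfg.decode_engine_num_gpu,
                      other_gpus=1, cfg=cfg)
# Prefill queue utilization of 0.1 with a single prefill worker: already at the minimum.
prefill_delta = decide(0.1, cfg.prefill_queue_scale_up, cfg.prefill_queue_scale_down,
                       workers=1, gpus_per_engine=cfg.prefill_engine_num_gpu,
                       other_gpus=2, cfg=cfg)
```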
Run as standalone process:
```bash
PYTHONPATH=/workspace/examples/llm python components/planner.py <arguments>
# Example
# PYTHONPATH=/workspace/examples/llm python components/planner.py --namespace=dynamo --served-model-name=vllm --no-operation --log-dir=log/planner
PYTHONPATH=/workspace/examples/llm python components/planner.py --namespace=dynamo --served-model-name=vllm --no-operation --log-dir=log/planner
```
### Tensorboard
Planner logs to tensorboard to visualize the metrics and the scaling actions. You can start tensorboard with the following command:
Monitor metrics with Tensorboard:
```bash
tensorboard --logdir=<path-to-tensorboard-log-dir>
```
## Backends
We currently support two backends:
1. `local` - uses circus to start/stop worker subprocesses
2. `kubernetes` - uses kubernetes to scale up/down the number of worker pods by updating the replicas count of the DynamoGraphDeployment resource
### Local Backend
The planner supports local and kubernetes backends for worker management.
Circus is a Python program that can be used to monitor and control processes and sockets. Dynamo serve uses circus to start each node in a graph and monitors each subprocesses. We leverage a core feature to do this called `Watcher`. A `Watcher` is the target program that you would like to run (which in our case is `serve_dynamo.py`). When planner decides to scale up or down, it either adds or removes a watcher from the existing `circus`.
### Local Backend
``` {note}
Although circus allows you to `increment` an existing watcher, it was not designed to allow variables to be passed in which does not allow us to schedule on a GPU. So instead we start a new watcher per process. When planner decides to add or remove a worker, we have logic to handle this adding/removing and incrementing/decrementing the workers.
```
The local backend uses Circus to control worker processes. A Watcher tracks each `serve_dynamo.py` process. The planner adds or removes watchers to scale workers.
#### Statefile
Note: Circus's `increment` feature doesn't support GPU scheduling variables, so we create separate watchers per process.
The statefile is a json file created when initially running `dynamo serve` and is filled in with custom leases in `serve_dynamo`. Each worker is named `{namespace}_{component_name}` when it is initially created. The `resources` come from the allocator and allows us to keep track of which GPUs are available. This statefile is read in by the LocalConnector and after each planner update we make the relevant change to the statefile. Currently, this statefile is locally saved in `~/.dynamo/state/{namespace}.json` (or in `DYN_LOCAL_STATE_DIR `) and is automatically cleaned up when the arbiter dies.
#### State Management
When one Decode worker is spun up, the statefile looks like:
The planner maintains state in a JSON file at `~/.dynamo/state/{namespace}.json`. This file:
* Tracks worker names as `{namespace}_{component_name}`
* Records GPU allocations from the allocator
* Updates after each planner action
* Cleans up automatically when the arbiter exits
Example state file evolution:
```none
# Initial decode worker
{
"dynamo_VllmWorker": {..., resources={...}},
"dynamo_VllmWorker": {..., resources={...}}
}
```
Now another decode worker is added:
```none
# After adding worker
{
"dynamo_VllmWorker": {..., resources={...}},
"dynamo_VllmWorker_1": {..., resources={...}},
"dynamo_VllmWorker_1": {..., resources={...}}
}
```
Then one decode worker is removed:
```none
# After removing worker
{
"dynamo_VllmWorker": {..., resources={...}},
"dynamo_VllmWorker": {..., resources={...}}
}
```
If the last decode worker is removed, the statefile looks like:
```none
# After removing last worker
{
"dynamo_VllmWorker": {...},
"dynamo_VllmWorker": {...}
}
```
We keep the initial non-suffix entry in order to know what cmd we'll need to spin up another worker. This is the same for prefill workers as well.
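The state file transitions shown above can be modeled with two small functions. This is an illustrative sketch, not the `LocalConnector` code: the `cmd` and `gpu` fields are hypothetical placeholders, and the choices to number additional workers `_1`, `_2`, and so on, and to remove the highest suffix first, are assumptions that are consistent with the examples but not confirmed by them.

```python
def add_worker(state, base, resources):
    """Add a worker entry; the base name is reused first, then `_1`, `_2`, ... suffixes."""
    if "resources" not in state.get(base, {}):
        state.setdefault(base, {})["resources"] = resources
        return base
    suffix = 1
    while f"{base}_{suffix}" in state:
        suffix += 1
    name = f"{base}_{suffix}"
    state[name] = {"resources": resources}
    return name


def remove_worker(state, base):
    """Remove the highest-suffixed worker; the base entry only loses its resources."""
    suffixed = sorted(
        (k for k in state if k.startswith(base + "_") and k[len(base) + 1:].isdigit()),
        key=lambda k: int(k[len(base) + 1:]),
    )
    if suffixed:
        del state[suffixed[-1]]
    else:
        # Keep the base entry so its launch command is still available later.
        state.get(base, {}).pop("resources", None)


state = {"dynamo_VllmWorker": {"cmd": "<launch command>"}}
add_worker(state, "dynamo_VllmWorker", {"gpu": [0]})
add_worker(state, "dynamo_VllmWorker", {"gpu": [1]})
```

Calling `remove_worker` twice from this point first drops `dynamo_VllmWorker_1` and then strips `resources` from the base entry, matching the last two states in the example.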
``` {note}
At the moment - planner work best if your initial replicas per worker are 1. This is because if you specify replicas > 1 when you initially start `dynamo serve`, the current implementation in `serving.py` starts each process in the same watcher.
```
Note: Start with one replica per worker. Multiple initial replicas currently share a single watcher.
### Kubernetes Backend
The Kubernetes backend works by updating the replicas count of the DynamoGraphDeployment custom resource. When the planner detects the need to scale up or down a specific worker type, it uses the Kubernetes API to patch the DynamoGraphDeployment resource, modifying the replicas count for the appropriate component. The Kubernetes operator then reconciles this change by creating or removing the necessary pods. This provides a seamless scaling experience in Kubernetes environments without requiring manual intervention.
The Kubernetes backend scales workers by updating DynamoGraphDeployment replica counts. When scaling needs change, the planner:
1. Updates the deployment's replica count
2. Lets the Kubernetes operator create/remove pods
3. Maintains seamless scaling without manual intervention
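The first step amounts to sending a small patch to the Kubernetes API. The sketch below only builds such a patch body as a JSON merge patch. The `spec.services.<name>.replicas` layout and the `replicas_patch` helper are assumptions for illustration and are not taken from the DynamoGraphDeployment schema, so check the custom resource definition before relying on the field path.

```python
import json


def replicas_patch(service_name, replicas):
    """Build a JSON merge patch that sets the replica count for one service.

    The `spec.services.<name>.replicas` path is an assumed layout, not taken
    from the DynamoGraphDeployment schema.
    """
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return {"spec": {"services": {service_name: {"replicas": replicas}}}}


patch = replicas_patch("VllmWorker", 3)
body = json.dumps(patch)  # request body to send with the patch call
```

Because a merge patch only names the fields it changes, the rest of the resource is left untouched and the operator reconciles the new replica count.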
......@@ -16,9 +16,9 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
# Deployment Examples
# Hello World: Aggregated and Disaggregated Deployment Examples
This directory contains a hello world example which implements a simplified disaggregated serving architecture used for deploying Large Language Models (LLMs). It removes the LLM related inference code and focuses on how Dynamo handles routing, task queue, and metadata communication between prefill and decode workers.
The `example` directory contains a [hello world example](../examples/hello_world.md) that implements a simplified disaggregated serving architecture used for deploying Large Language Models (LLMs). It removes the LLM related inference code and focuses on how Dynamo handles routing, task queue, and metadata communication between prefill and decode workers.
## Components
......
......@@ -15,7 +15,7 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
# Hello World Example
# Hello World Example: Basic Pipeline
## Overview
......
......@@ -79,22 +79,31 @@ docker compose -f deploy/metrics/docker-compose.yml up -d
./container/build.sh --framework vllm --platform linux/arm64
```
> [!NOTE]
> Building a vLLM docker image for ARM machines currently involves building vLLM from source,
> which has known issues with being slow and requiring a lot of system RAM:
> https://github.com/vllm-project/vllm/issues/8878
>
> You can tune the number of parallel build jobs for building VLLM from source
> on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
>
> For example, on an ARM machine with low system resources:
> `./container/build.sh --framework vllm --platform linux/arm64 --build-arg VLLM_MAX_JOBS=2`
>
> For example, on a GB200 which has very high CPU cores and memory resource:
> `./container/build.sh --framework vllm --platform linux/arm64 --build-arg VLLM_MAX_JOBS=64`
>
> When vLLM has pre-built ARM wheels published, this process can be improved.
```{note}
Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to be slow and to require extensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
You can tune the number of parallel build jobs for building vLLM from source
on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
For example, on an ARM machine with low system resources:
`./container/build.sh --framework vllm --platform linux/arm64 --build-arg VLLM_MAX_JOBS=2`
For example, on a GB200, which has many CPU cores and ample memory:
`./container/build.sh --framework vllm --platform linux/arm64 --build-arg VLLM_MAX_JOBS=64`
When vLLM publishes pre-built ARM wheels, this process can be improved.
```
### Run container
```
......@@ -125,11 +134,15 @@ This figure shows an overview of the major components to deploy:
```
> [!NOTE]
> The planner component is enabled by default for all deployment architectures but is set to no-op mode. This means the planner observes metrics but doesn't take scaling actions. To enable active scaling, you can add `--Planner.no-operation=false` to your `dynamo serve` command. For more details, see the [Planner documentation](../../components/planner/README.md).
```{note}
The planner component is enabled by default for all deployment architectures but is set to no-op mode. This means the planner observes metrics but doesn't take scaling actions. To enable active scaling, add `--Planner.no-operation=false` to your `dynamo serve` command. For more details, see [Planner](../architecture/planner.md).
```
### Example architectures
_Note_: For a non-dockerized deployment, first export `DYNAMO_HOME` to point to the dynamo repository root, e.g. `export DYNAMO_HOME=$(pwd)`
```{note}
For a non-dockerized deployment, first export `DYNAMO_HOME` to point to the dynamo repository root, e.g. `export DYNAMO_HOME=$(pwd)`
```
#### Aggregated serving
```bash
......@@ -175,27 +188,29 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json"
```
### Multi-node deployment
### Multinode deployment
See [multinode-examples.md](multinode-examples.md) for more details.
See [Multinode Examples](../examples/multinode.md) for more details.
### Close deployment
See [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) section to learn about how to close the deployment.
See [Close deployment](../guides/dynamo_serve.md#close-deployment) in the *Dynamo Run* topic to learn about how to close the deployment.
## Deploy to Kubernetes
These examples can be deployed to a Kubernetes cluster using [Dynamo Cloud](../../docs/guides/dynamo_deploy/dynamo_cloud.md) and the Dynamo CLI.
These examples can be deployed to a Kubernetes cluster using [Dynamo Cloud](../guides/dynamo_deploy/dynamo_cloud.md) and the Dynamo CLI.
### Prerequisites
You must have first followed the instructions in [deploy/cloud/helm/README.md](../../deploy/cloud/helm/README.md) to install Dynamo Cloud on your Kubernetes cluster.
You must first follow the instructions in [deploy/cloud/helm/README.md](../../deploy/cloud/helm/README.md) to install Dynamo Cloud on your Kubernetes cluster.
**Note**: The `KUBE_NS` variable in the following steps must match the Kubernetes namespace where you installed Dynamo Cloud. You must also expose the `dynamo-store` service externally. This will be the endpoint the CLI uses to interface with Dynamo Cloud.
```{note}
The `KUBE_NS` variable in the following steps must match the Kubernetes namespace where you installed Dynamo Cloud. You must also expose the `dynamo-store` service externally. This will be the endpoint the CLI uses to interface with Dynamo Cloud.
```
### Deployment Steps
For detailed deployment instructions, please refer to the [Operator Deployment Guide](../../docs/guides/dynamo_deploy/operator_deployment.md). The following are the specific commands for the LLM examples:
For detailed deployment instructions, please refer to the [Operator Deployment Guide](../guides/dynamo_deploy/operator_deployment.md). The following are the specific commands for the LLM examples:
```bash
# Set your project root directory
......
......@@ -22,7 +22,7 @@ This directory contains examples and reference implementations for deploying Lar
## Deployment Architectures
See [deployment architectures](llm_deployment.md#Deployment Architectures) to learn about the general idea of the architecture.
See [Deployment Architectures](llm_deployment.md#deployment-architectures) to learn about the general idea of the architecture.
Note that this TensorRT-LLM version does not support all the options yet.
```{note}
......@@ -37,7 +37,7 @@ TensorRT-LLM disaggregation does not support conditional disaggregation yet. You
### Prerequisites
Start required services (etcd and NATS) using [Docker Compose](../../deploy/docker-compose.yml)
Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml)
```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
```
......@@ -110,9 +110,8 @@ This figure shows an overview of the major components to deploy:
```
```{note}
The above architecture illustrates all the components. The final components that get spawned depend upon the chosen graph.
```
Note: The above architecture illustrates all the components. The final components
that get spawned depend upon the chosen graph.
### Example architectures
......
......@@ -20,33 +20,58 @@ limitations under the License.
## Development Environment
For a consistent development environment, use the provided devcontainer configuration. This requires:
- [Docker](https://www.docker.com/products/docker-desktop)
- [VS Code](https://code.visualstudio.com/) with the [Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
This section describes how to set up your development environment.
To use the devcontainer:
1. Open the project in VS Code.
2. Click the button in the bottom-left corner.
3. Select **Reopen in Container**.
### Recommended Setup: Using Dev Container
This builds and starts a container with all the necessary dependencies for Dynamo development.
We recommend using our pre-configured development container:
1. Install prerequisites:
* [Docker](https://www.docker.com/products/docker-desktop)
* [Visual Studio Code](https://code.visualstudio.com/)
* [Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
## Installation
2. Get the code:
```bash
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
```
```{note}
- The following examples require system level packages.
- We recommend Ubuntu 24.04 with a x86_64 CPU. See the [Support Matrix](support_matrix.md).
```
3. Open in Visual Studio Code:
* Launch Visual Studio Code
* Click the button in the bottom-left corner
* Select **Reopen in Container**
```
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0
python3 -m venv venv
source venv/bin/activate
Visual Studio Code builds and starts a container with all necessary dependencies for Dynamo development.
pip install ai-dynamo[all]
```
### Alternative Setup: Manual Installation
If you don't want to use the dev container, you can set the environment up manually:
1. Ensure you have:
* Ubuntu 24.04 (recommended)
* x86_64 CPU
* Python 3.x
* Git
See [Support Matrix](support_matrix.md) for more information.
2. Install required system packages:
```bash
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0
```
3. Set up Python environment:
```bash
python3 -m venv venv
source venv/bin/activate
```
4. Install Dynamo:
```bash
pip install "ai-dynamo[all]"
```
```{note}
To ensure compatibility, use the examples in the release branch or tag that matches the version you installed.
......@@ -79,7 +104,7 @@ export DYNAMO_IMAGE=<your-registry>/dynamo-base:latest-vllm
## Running and Interacting with an LLM Locally
To run a model and interact with it locally, call `dynamo run` with a hugging face model. `dynamo run` supports several backends including: `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
To run a model and interact with it locally, call `dynamo run` with a Hugging Face model. `dynamo run` supports several backends, including: `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
### Example Command
......@@ -103,7 +128,7 @@ Dynamo provides a simple way to spin up a local set of inference components incl
- **Basic and KV Aware Router**–Route and load balance traffic to a set of workers.
- **Workers**–Set of pre-configured LLM serving engines.
To run a minimal configuration you can use a pre-configured example.
To run a minimal configuration, use a pre-configured example.
### Start Dynamo Distributed Runtime Services
......@@ -158,10 +183,9 @@ uv pip install -e .
export PYTHONPATH=$PYTHONPATH:/workspace/deploy/dynamo/sdk/src:/workspace/components/planner/src
```
### Conda Environment
Alternately, you can use a Conda environment:
Alternatively, use a Conda environment:
```bash
conda activate <ENV_NAME>
......
......@@ -17,19 +17,23 @@ limitations under the License.
-->
# About the Dynamo Command Line Interface
The Dynamo CLI is a powerful tool for serving, containerizing, and deploying Dynamo applications. It leverages core pieces of the BentoML deployment stack and provides a range of commands to manage your Dynamo services.
The Dynamo CLI lets you:
- [`run`](#run) - quickly chat with a model
- [`serve`](#serve) - run a set of services locally (via `depends()` or `.link()`)
- [`build`](#build) - create an archive of your services (called a `bento`)
- [`deploy`](#deploy) - create a pipeline on Dynamo Cloud
The Dynamo CLI serves, containerizes, and deploys Dynamo applications efficiently. It leverages core pieces of the BentoML deployment stack and provides intuitive commands to manage your Dynamo services.
## CLI Capabilities
With the Dynamo CLI, you can:
* Chat with models quickly using `run`
* Serve multiple services locally using `serve`
* Package your services into archives (called `bentos`) using `build`
* Deploy pipelines to Dynamo Cloud using `deploy`
## Commands
### `run`
The `run` command allows you to quickly chat with a model. Under the hood - it is running the `dynamo-run` Rust binary. For details, see [Running Dynamo](dynamo_run.md).
Use `run` to start an interactive chat session with a model. This command executes the `dynamo-run` Rust binary under the hood. For more details, see [Running Dynamo](dynamo_run.md).
**Example**
```bash
......@@ -38,7 +42,7 @@ dynamo run deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
### `serve`
The `serve` command lets you run a defined inference graph locally. You must point toward your file and intended class using file:Class syntax. For details, see [Serving Inference Graphs](dynamo_serve.md).
Use `serve` to run your defined inference graph locally. You'll need to specify your file and intended class using the file:Class syntax. For more details, see [Serving Inference Graphs](dynamo_serve.md).
**Usage**
```bash
......@@ -46,28 +50,28 @@ dynamo serve [SERVICE]
```
**Arguments**
- `SERVICE` - The service to start. You use file:Class syntax to specify the service.
* `SERVICE`: Specify the service to start using file:Class syntax
**Flags**
- `--file`/`-f` - Path to optional YAML configuration file. An example of the YAML file can be found in the configuration section of the [SDK docs](../API/sdk.md)
- `--dry-run` - Print out the dependency graph and values without starting any services.
- `--service-name` - Only serve the specified service name. The rest of the discoverable components in the graph are not started.
- `--working-dir` - Specify the directory to find the Service instance
- Any additional flags that follow Class.key=value are passed to the service constructor for the target service and parsed. See the configuration section of the [SDK docs](../API/sdk.md) for more details.
* `--file`/`-f`: Path to optional YAML configuration file. For configuration examples, see the [SDK docs](../API/sdk.md)
* `--dry-run`: Print the dependency graph and values without starting services
* `--service-name`: Start only the specified service name
* `--working-dir`: Set the directory for finding the Service instance
* Additional flags following Class.key=value pattern are passed to the service constructor. For details, see the configuration section of the [SDK docs](../API/sdk.md)
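
As an illustration of the two syntaxes described above, the following minimal Python sketch splits a `file:Class` service target and groups `Class.key=value` overrides by class. The helper names `parse_service_target` and `parse_overrides`, and the tolerance for a leading `--` on overrides, are assumptions made for this example. It is not the CLI's actual parser.

```python
from __future__ import annotations

from collections import defaultdict


def parse_service_target(target: str) -> tuple[str, str]:
    """Split a `file:Class` target into (file, class_name)."""
    file_part, sep, class_name = target.partition(":")
    if not sep or not file_part or not class_name:
        raise ValueError(f"expected file:Class, got {target!r}")
    return file_part, class_name


def parse_overrides(flags: list[str]) -> dict[str, dict[str, str]]:
    """Group `Class.key=value` flags by class name (values stay strings)."""
    grouped: dict[str, dict[str, str]] = defaultdict(dict)
    for flag in flags:
        # Assumption: overrides may or may not carry leading dashes.
        name, sep, value = flag.lstrip("-").partition("=")
        class_name, dot, key = name.partition(".")
        if not sep or not dot or not class_name or not key:
            raise ValueError(f"expected Class.key=value, got {flag!r}")
        grouped[class_name][key] = value
    return dict(grouped)
```

For example, `parse_service_target("hello_world:Frontend")` yields the file part `hello_world` and the class name `Frontend`, matching the target used in the example that follows.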
**Example**
```bash
cd examples
# Spin up Frontend, Middle, and Backend components
# Start the Frontend, Middle, and Backend components
dynamo serve hello_world:Frontend
# Spin up only the Middle component in the graph that is discoverable from the Frontend service
# Start only the Middle component in the graph that is discoverable from the Frontend service
dynamo serve --service-name Middle hello_world:Frontend
```
### `build`
The `build` command allows you to package up your inference graph and its dependencies and create an archive of it. This is commonly paired with the `--containerize` flag to create a single docker container that runs your inference graph. As with `serve`, you point toward the first service in your dependency graph. For details about `dynamo build`, see [Serving Inference Graphs](dynamo_serve.md).
Use `build` to package your inference graph and its dependencies into an archive. Combine this with the `--containerize` flag to create a single Docker container for your inference graph. As with `serve`, you point toward the first service in your dependency graph. For more details, see [Serving Inference Graphs](dynamo_serve.md).
**Usage**
```bash
......@@ -75,11 +79,11 @@ dynamo build [SERVICE]
```
**Arguments**
- `SERVICE` - The service to build. You use file:Class syntax to specify the service.
* `SERVICE`: Specify the service to build using file:Class syntax
**Flags**
- `--working-dir` - Specify the directory to find the Service instance
- `--containerize` - Whether to containerize the Bento after building
* `--working-dir`: Specify the directory for finding the Service instance
* `--containerize`: Choose whether to create a container from the Bento after building
**Example**
```bash
......@@ -89,7 +93,7 @@ dynamo build hello_world:Frontend
### `deploy`
The `deploy` command creates a pipeline on Dynamo Cloud using parameters at the prompt or using a YAML configuration file. For details, see [Deploying Inference Graphs to Kubernetes](dynamo_deploy/README.md).
Use `deploy` to create a pipeline on Dynamo Cloud using either interactive prompts or a YAML configuration file. For more details, see [Deploying Inference Graphs to Kubernetes](dynamo_deploy/README.md).
**Usage**
```bash
......@@ -97,18 +101,14 @@ dynamo deploy [PIPELINE]
```
**Arguments**
- `pipeline` - The pipeline to deploy. Defaults to *None*; required.
* `PIPELINE`: The pipeline to deploy. Required; no default
**Flags**
- `--name` or `-n` - Deployment name. Defaults to *None*; required.
- `--config-file` or `-f` - Configuration file path. Defaults to *None*; required.
- `--wait` - Whether or not to wait for deployment to be ready. Defaults to wait.
`--no-wait`
- `--timeout` - The number of seconds that can elapse before deployment times out; measured in seconds. Defaults to 3600.
- `--endpoint` or `-e` - The Dynamo Cloud endpoint where the pipeline should be deployed. Defaults to *None*; required.
- `--help` or `-h` - Display in-line help for `dynamo deploy`.
**Example**
For a detailed example, see [Operator Deployment](dynamo_deploy/operator_deployment.md).
* `--name`/`-n`: Set the deployment name. Required; no default
* `--config-file`/`-f`: Specify the configuration file path. Required; no default
* `--wait`/`--no-wait`: Choose whether to wait for deployment readiness. Defaults to `--wait`
* `--timeout`: Set the maximum deployment time in seconds. Defaults to 3600
* `--endpoint`/`-e`: Specify the Dynamo Cloud deployment endpoint. Required; no default
* `--help`/`-h`: Display command help
For a detailed deployment example, see [Operator Deployment](dynamo_deploy/operator_deployment.md).
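
The interaction between `--wait` and `--timeout` can be pictured as a polling loop with a deadline. The sketch below is a generic illustration, not the CLI's implementation: `wait_until_ready`, the `is_ready` callback, and the `poll_interval` parameter are hypothetical names, and the single readiness check in the no-wait case is an assumption. Only the 3600-second default comes from the flag description above.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout: float = 3600.0,     # mirrors the documented --timeout default
    poll_interval: float = 5.0,  # hypothetical; not a documented flag
    wait: bool = True,
) -> bool:
    """Return True once is_ready() succeeds, False if the deadline passes.

    With wait=False (the --no-wait case) the check runs once and returns.
    """
    if not wait:
        return is_ready()
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
```

Using a monotonic clock for the deadline keeps the timeout unaffected by wall-clock adjustments during a long deployment.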
......@@ -23,7 +23,7 @@ The Dynamo Cloud platform is a comprehensive solution for deploying and managing
The Dynamo cloud platform consists of several key components:
- **Dynamo Operator**: A Kubernetes operator that manages the lifecycle of Dynamo inference graphs from build ➡️ deploy. For more information on the operator, see the [Dynamo Operator Page](dynamo_operator.md).
- **Dynamo Operator**: A Kubernetes operator that manages the lifecycle of Dynamo inference graphs from build ➡️ deploy. For more information on the operator, see [Dynamo Kubernetes Operator Documentation](../dynamo_deploy/dynamo_operator.md).
- **API Store**: Stores and manages service configurations and metadata related to Dynamo deployments. Needs to be exposed externally.
- **Custom Resources**: Kubernetes custom resources for defining and managing Dynamo services
......@@ -154,7 +154,7 @@ kubectl config set-context --current --namespace=$NAMESPACE
./deploy.sh --crds
```
If you wish to be guided through the deployment process, you can run the deploy script with the `--interactive` flag:
If you want guidance during the process, run the deployment script with the `--interactive` flag:
```bash
./deploy.sh --crds --interactive
......@@ -165,7 +165,7 @@ omitting `--crds` will skip the CRDs installation/upgrade. This is useful when i
4. **Expose Dynamo Cloud Externally**
``` {note}
The script automatically displays information about the endpoint you can use to access Dynamo Cloud. In our docs, we refer to this externally available endpoint as `DYNAMO_CLOUD`.
The script automatically displays information about the endpoint that you can use to access Dynamo Cloud. We refer to this externally available endpoint as `DYNAMO_CLOUD`.
```
The simplest way to expose the `dynamo-store` service within the namespace externally is to use a port-forward:
......@@ -180,7 +180,7 @@ export DYNAMO_CLOUD=http://localhost:<local-port>
After deploying the Dynamo cloud platform, you can:
1. Deploy your first inference graph using the [Dynamo CLI](operator_deployment.md)
2. Deploy Dynamo LLM pipelines to Kubernetes using the [Dynamo CLI](../../examples/llm_deployment.md)!
2. Deploy Dynamo LLM pipelines to Kubernetes using the [Dynamo CLI](../../examples/llm_deployment.md)
3. Manage your deployments using the Dynamo CLI
For more detailed information about deploying inference graphs, see the [Dynamo Deploy Guide](README.md).
# Dynamo Kubernetes Operator Documentation
# Working with Dynamo Kubernetes Operator
## Overview
Dynamo operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It automates the reconciliation of custom resources to ensure your desired state is always achieved. This operator is ideal for users who want to manage complex deployments using declarative YAML definitions and Kubernetes-native tooling.
---
## Table of Contents
- [Overview](#overview)
- [Architecture](#architecture)
- [Custom Resource Definitions (CRDs)](#custom-resource-definitions-crds)
- [Installation](#installation)
- [GitOps Deployment with FluxCD](#gitops-deployment-with-fluxcd)
- [Reconciliation Logic](#reconciliation-logic)
- [Configuration](#configuration)
- [Troubleshooting](#troubleshooting)
- [Development](#development)
- [References](#references)
---
## Architecture
- **Operator Deployment:**
......@@ -37,7 +20,7 @@ Dynamo operator is a Kubernetes operator that simplifies the deployment, configu
3. Kubernetes resources (Deployments, Services, etc.) are created or updated to match the CR spec.
4. Status fields are updated to reflect the current state.
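
The create-or-update and status steps above can be sketched in a few lines. This is a conceptual illustration in Python only: the function `reconcile` and the dictionary-based resource model are hypothetical and do not reflect the operator's actual code or the Kubernetes API.

```python
from __future__ import annotations


def reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> dict:
    """Bring `actual` in line with `desired` and return a status summary."""
    created: list[str] = []
    updated: list[str] = []
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = dict(spec)   # resource missing: create it
            created.append(name)
        elif actual[name] != spec:
            actual[name] = dict(spec)   # resource drifted: update it
            updated.append(name)
    ready = all(actual.get(name) == spec for name, spec in desired.items())
    return {"created": created, "updated": updated, "ready": ready}
```

Calling `reconcile` a second time with the same inputs makes no further changes, which is the idempotence property a reconciliation loop relies on.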
---
## Custom Resource Definitions (CRDs)
......@@ -92,8 +75,6 @@ spec:
value: some_specific_value
```
---
### CRD: `DynamoComponentDeployment`
| Field | Type | Description | Required | Default |
......@@ -149,8 +130,6 @@ spec:
serviceName: Frontend
```
---
### CRD: `DynamoComponent`
| Field | Type | Description | Required | Default |
......@@ -180,25 +159,21 @@ spec:
dynamoComponent: frontend:jh2o6dqzpsgfued4
```
---
## Installation
[See installation steps](dynamo_cloud.md#deployment-steps)
[See installation steps](dynamo_cloud.md#overview)
---
## GitOps Deployment with FluxCD
This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../../examples/llm/README.md) to demonstrate the workflow.
This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../../../examples/llm/README.md) to demonstrate the workflow.
### Prerequisites
- A Kubernetes cluster with [Dynamo Cloud](dynamo_cloud.md) installed
- [FluxCD](https://fluxcd.io/flux/installation/) installed in your cluster
- A Git repository to store your deployment configurations
- [Dynamo CLI](../../get_started.md#installation) installed locally
- [Dynamo CLI](../../get_started.md#alternative-setup-manual-installation) installed locally
### Workflow Overview
......@@ -308,7 +283,6 @@ kubectl get dynamocomponentdeployment -n $KUBE_NS
[See deployment steps](operator_deployment.md)
---
## Reconciliation Logic
......@@ -335,8 +309,6 @@ kubectl get dynamocomponentdeployment -n $KUBE_NS
- **Status Management:**
- `.status.conditions`: Reflects readiness, failure, progress states
---
## Configuration
......@@ -365,7 +337,6 @@ kubectl get dynamocomponentdeployment -n $KUBE_NS
| `--etcdAddr` | Address of etcd server | "" |
---
## Troubleshooting
......@@ -375,7 +346,6 @@ kubectl get dynamocomponentdeployment -n $KUBE_NS
| Status not updated | CRD schema mismatch | Regenerate CRDs with kubebuilder |
| Image build hangs | Misconfigured DynamoComponent | Check image build logs |
---
## Development
......@@ -387,7 +357,6 @@ The operator is built using Kubebuilder and the operator-sdk, with the following
- `api/v1alpha1/` – CRD types
- `config/` – Manifests and Helm charts
---
## References
......
# Fluid: Cloud-Native Data Orchestration and Acceleration
# Model Caching with Fluid: Cloud-Native Data Orchestration and Acceleration
Fluid is an open-source, cloud-native data orchestration and acceleration platform for Kubernetes. It virtualizes and accelerates data access from various sources (object storage, distributed file systems, cloud storage), making it ideal for AI, machine learning, and big data workloads.
---
## Table of Contents
1. [Key Features](#key-features)
2. [Installation](#installation)
3. [Quick Start](#quick-start)
4. [Mounting Data Sources](#mounting-data-sources)
- [WebUFS Example](#webufs-example)
- [S3 Example](#s3-example)
5. [Using HuggingFace Models with Fluid](#using-huggingface-models-with-fluid)
6. [Usage with Dynamo](#usage-with-dynamo)
7. [Troubleshooting & FAQ](#troubleshooting--faq)
8. [Resources](#resources)
---
## Key Features
- **Data Caching and Acceleration:** Cache remote data close to compute workloads for faster access.
......@@ -26,11 +9,9 @@ Fluid is an open-source, cloud-native data orchestration and acceleration platfo
- **Kubernetes Native:** Integrates with Kubernetes using CRDs for data management.
- **Scalability:** Supports large-scale data and compute clusters.
---
## Installation
Fluid can be installed on any Kubernetes cluster using Helm.
You can install Fluid on any Kubernetes cluster using Helm.
**Prerequisites:**
- Kubernetes >= 1.18
......@@ -46,15 +27,12 @@ helm install fluid fluid/fluid -n fluid-system
```
For advanced configuration, see the [Fluid Installation Guide](https://fluid-cloudnative.github.io/docs/get-started/installation).
---
## Quick Start
1. **Install Fluid (see above).**
2. **Create a Dataset and Runtime (see examples below).**
3. **Mount the resulting PVC in your workload.**
1. Install Fluid (see [Installation](#installation)).
2. Create a Dataset and Runtime (see [the following example](#webufs-example)).
3. Mount the resulting PVC in your workload.
---
## Mounting Data Sources
......@@ -87,9 +65,7 @@ spec:
high: "0.95"
low: "0.7"
```
> After applying, Fluid creates a PersistentVolumeClaim (PVC) named `webufs-model` containing the files.
---
After applying, Fluid creates a PersistentVolumeClaim (PVC) named `webufs-model` containing the files.
### S3 Example
......@@ -137,9 +113,8 @@ spec:
- path: "/"
replicas: 1
```
> The resulting PVC is named `s3-model`.
---
The resulting PVC is named `s3-model`.
## Using HuggingFace Models with Fluid
......@@ -190,9 +165,8 @@ spec:
- name: tmp-volume
emptyDir: {}
```
> You can then use `s3://hf-models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B` as your Dataset mount.
---
You can then use `s3://hf-models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B` as your Dataset mount.
## Usage with Dynamo
......@@ -221,20 +195,17 @@ spec:
mountPoint: /model
```
---
## Troubleshooting & FAQ
- **PVC not created?** Check Fluid and AlluxioRuntime pod logs.
- **Model not found?** Ensure the model was uploaded to the correct bucket/path.
- **Permission errors?** Verify S3/MinIO credentials and bucket policies.
---
## Resources
- [Fluid Documentation](https://fluid-cloudnative.github.io/)
- [Alluxio Documentation](https://docs.alluxio.io/)
- [MinIO Documentation](https://min.io/docs/)
- [HuggingFace Hub](https://huggingface.co/docs/hub/index)
- [Dynamo Documentation](README.md)
\ No newline at end of file
- [Dynamo README](../../../README.md)
- [Dynamo Documentation](https://docs.nvidia.com/dynamo/)
\ No newline at end of file
......@@ -23,7 +23,7 @@ This guide walks you through deploying an inference graph created with the Dynam
Before proceeding with deployment, ensure you have:
- [Dynamo Python package](../../get_started.md#installation) installed
- [Dynamo Python package](../../get_started.md#alternative-setup-manual-installation) installed
- A Kubernetes cluster with the [Dynamo cloud platform](dynamo_cloud.md) installed
- Ubuntu 24.04 as the base image for your services
- Required dependencies:
......@@ -224,4 +224,3 @@ dynamo deploy $DYNAMO_TAG -n $DEPLOYMENT_NAME -f ./configs/agg.yaml \
--env-from-secret ANOTHER_SECRET=another_secret.key \
--target kubernetes
```
# Running Dynamo (`dynamo run`)
- [Running Dynamo (`dynamo run`)](#running-dynamo-dynamo-run)
- [Quickstart with pip and vllm](#quickstart-with-pip-and-vllm)
- [Use model from Hugging Face](#use-model-from-hugging-face)
- [Run a model from local file](#run-a-model-from-local-file)
- [Download model from Hugging Face](#download-model-from-hugging-face)
- [Run model from local file](#run-model-from-local-file)
- [Distributed System](#distributed-system)
- [Network names](#network-names)
- [KV-aware routing](#kv-aware-routing)
- [Full usage details](#full-usage-details)
- [Getting Started](#getting-started)
- [Setup](#setup)
- [Step 1: Install libraries](#step-1-install-libraries)
- [Step 2: Install Rust](#step-2-install-rust)
- [Step 3: Build](#step-3-build)
- [Defaults](#defaults)
- [Running Inference with Pre-built Engines](#running-inference-with-pre-built-engines)
- [mistralrs](#mistralrs)
- [llamacpp](#llamacpp)
- [sglang](#sglang)
- [vllm](#vllm)
- [trtllm](#trtllm)
- [Step 1: Build the environment](#step-1-build-the-environment)
- [Step 2: Run the environment](#step-2-run-the-environment)
- [Step 3: Execute `dynamo run` command](#step-3-execute-dynamo-run-command)
- [Echo Engines](#echo-engines)
- [echo\_core](#echo_core)
- [echo\_full](#echo_full)
- [Configuration](#configuration)
- [Batch mode](#batch-mode)
- [Extra engine arguments](#extra-engine-arguments)
- [Writing your own engine in Python](#writing-your-own-engine-in-python)
This guide explains the `dynamo run` command.
`dynamo-run` is a CLI tool for exploring the Dynamo components. It's also an example of how to use components from Rust. If you use the Python wheel, it's available as `dynamo run`.
......@@ -421,7 +388,9 @@ uv pip install pip
uv pip install vllm==0.8.4 setuptools
```
**Note: If you're on Ubuntu 22.04 or earlier, you must add `--python=python3.10` to your `uv venv` command**
```{note}
If you're on Ubuntu 22.04 or earlier, you must add `--python=python3.10` to your `uv venv` command.
```
2. Build:
```
......@@ -438,7 +407,7 @@ Inside that virtualenv:
```
To pass extra arguments to the vllm engine see [Extra engine arguments](#extra-engine-arguments) below.
To pass extra arguments to the vllm engine, see [Extra engine arguments](#extra-engine-arguments).
vllm attempts to allocate enough KV cache for the full context length at startup. If that does not fit in your available memory, pass `--context-length <value>`.
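
To get a feel for why a shorter context length helps, here is a back-of-the-envelope estimate of per-sequence KV cache size for a standard attention layout (one key and one value tensor per layer). The model dimensions below are hypothetical, and the engine's real allocation differs in detail, so treat this as a rough sizing aid rather than a description of what vllm does internally.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_length,
                   bytes_per_element=2):
    """Approximate KV cache size in bytes for one sequence.

    The factor 2 accounts for the key and the value tensor in each layer;
    bytes_per_element=2 corresponds to 16-bit floats.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_length * bytes_per_element)


GIB = 1024 ** 3

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
full = kv_cache_bytes(32, 8, 128, context_length=131072)
reduced = kv_cache_bytes(32, 8, 128, context_length=8192)
print(f"{full / GIB:.1f} GiB vs {reduced / GIB:.1f} GiB")  # 16.0 GiB vs 1.0 GiB
```

Because the estimate is linear in the context length, reducing the value passed to `--context-length` by a factor of 16 reduces the estimate by the same factor.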
......@@ -574,8 +543,6 @@ dynamo-run in=http out=trtllm TinyLlama/TinyLlama-1.1B-Chat-v1.0 --extra-engine-
### Writing your own engine in Python
Note: This section replaces "bring-your-own-engine".
The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo.
The Python file must do three things:
......@@ -587,10 +554,10 @@ The Python file must do three things:
from dynamo.llm import ModelType, register_llm
from dynamo.runtime import DistributedRuntime, dynamo_worker
# 1. Decorate a function to get the runtime
#
@dynamo_worker(static=False)
async def worker(runtime: DistributedRuntime):
# 1. Decorate a function to get the runtime
#
@dynamo_worker(static=False)
async def worker(runtime: DistributedRuntime):
# 2. Register ourselves on the network
#
......
......@@ -23,4 +23,3 @@
guides/README.md
runtime/README.md
\ No newline at end of file
examples/disagg_skeleton.md
\ No newline at end of file
......@@ -17,10 +17,10 @@
Welcome to NVIDIA Dynamo
========================
NVIDIA Dynamo is a high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
The NVIDIA Dynamo Platform is a high-performance, low-latency inference framework designed to serve all AI models—across any framework, architecture, or deployment scale.
Dive in: Examples
-----------------------
-----------------
.. grid:: 1 2 2 2
:gutter: 3
......@@ -54,10 +54,7 @@ Dive in: Examples
Overview
--------
The NVIDIA Dynamo Platform is a high-performance, low-latency inference platform
designed to serve all AI models—across any framework, architecture, or deployment scale.
Dynamo is inference engine agnostic, supporting TRT-LLM, vLLM, SGLang, and others, and captures
LLM-specific capabilities such as:
Dynamo is inference engine agnostic, supporting TRT-LLM, vLLM, SGLang, and others, and captures LLM-specific capabilities such as:
* **Disaggregated prefill & decode inference** - Maximizes GPU throughput and facilitates trade off between throughput and latency.
* **Dynamic GPU scheduling** - Optimizes performance based on fluctuating demand.
......@@ -66,7 +63,7 @@ LLM-specific capabilities such as:
* **KV cache offloading** - Leverages several memory hierarchies for higher system throughput.
Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source
and driven by a transparent, OSS (Open Source Software) first development approach.
and is driven by a transparent development approach. Check out our repo at https://github.com/ai-dynamo/.
.. toctree::
:hidden:
......@@ -84,7 +81,7 @@ and driven by a transparent, OSS (Open Source Software) first development approa
Disaggregated Serving <architecture/disagg_serving.md>
KV Block Manager <architecture/kvbm_intro.rst>
KV Cache Routing <architecture/kv_cache_routing.md>
Planner <guides/planner.md>
Planner <architecture/planner.md>
.. toctree::
:hidden:
......@@ -104,6 +101,7 @@ and driven by a transparent, OSS (Open Source Software) first development approa
Disaggregation and Performance Tuning <guides/disagg_perf_tuning.md>
KV Cache Router Performance Tuning <guides/kv_router_perf_tuning.md>
Planner Benchmark Example <guides/planner_benchmark/benchmark_planner.md>
Working with Dynamo Kubernetes Operator <guides/dynamo_deploy/dynamo_operator.md>
.. toctree::
:hidden:
......@@ -113,6 +111,7 @@ and driven by a transparent, OSS (Open Source Software) first development approa
Deploying Dynamo Inference Graphs to Kubernetes using the Dynamo Cloud Platform <guides/dynamo_deploy/operator_deployment.md>
Manual Helm Deployment <guides/dynamo_deploy/manual_helm_deployment.md>
Minikube Setup Guide <guides/dynamo_deploy/minikube.md>
Model Caching with Fluid <guides/dynamo_deploy/model_caching_with_fluid.md>
.. toctree::
:hidden:
......@@ -125,7 +124,8 @@ and driven by a transparent, OSS (Open Source Software) first development approa
:hidden:
:caption: Examples
Hello World Example <examples/hello_world.md>
Hello World Example: Basic <examples/hello_world.md>
Hello World Example: Aggregated and Disaggregated Deployment <examples/disagg_skeleton.md>
LLM Deployment Examples <examples/llm_deployment.md>
Multinode Examples <examples/multinode.md>
LLM Deployment Examples using TensorRT-LLM <examples/trtllm.md>
......
......@@ -50,7 +50,7 @@ maturin develop --uv
### Prerequisite
See [README.md](../../../docs/runtime/README.md#prerequisites).
See [README.md](../../../docs/../../docs/runtime/README.md).
### Hello World Example
......
......@@ -195,18 +195,19 @@ module = ["vllm.*", "bentoml.*", "fs.*", "_bentoml_sdk.*"]
follow_imports = "skip"
ignore_missing_imports = true
[tool.sphinx]
# extra-content-head
#extra_content_head = [
# '''
# <script src="https://assets.adobedtm.com/5d4962a43b79/c1061d2c5e7b/launch-191c2462b890.min.js" ></script>
# ''',
#]
#[tool.sphinx]
# extra-content-footer
#extra_content_footer = [
# '''
# <script type="text/javascript">if (typeof _satellite !== "undefined") {_satellite.pageBottom();}</script>
# ''',
#]
extra_content_head = [
'''
<script src="https://assets.adobedtm.com/5d4962a43b79/c1061d2c5e7b/launch-191c2462b890.min.js" ></script>
''',
]
#extra-content-footer
extra_content_footer = [
'''
<script type="text/javascript">if (typeof _satellite !== "undefined") {_satellite.pageBottom();}</script>
''',
]