Commit c6b59045 authored by Anish, committed by GitHub

docs: address Harry/VDR feedback + fixing broken links across repository (#3802)


Signed-off-by: Harry Kim <harry_kim@live.com>
Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Signed-off-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>
Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Co-authored-by: Harry Kim <harry_kim@live.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>
Co-authored-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>
parent d712ce8d
......@@ -93,8 +93,8 @@ For KAI Scheduler, see the [KAI Scheduler Deployment Guide](https://github.com/N
For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md).
For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](multinode-deployment.md), which demonstrates multi-node disaggregated serving scenarios.
For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](./deployment/multinode-deployment.md), which demonstrates multi-node disaggregated serving scenarios.
For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove).
Dynamo Cloud also allows you to install Grove and KAI Scheduler as part of the platform installation. See the [Dynamo Cloud Deployment Installation Guide](installation_guide.md) for more details.
\ No newline at end of file
Dynamo Cloud also allows you to install Grove and KAI Scheduler as part of the platform installation. See the [Dynamo Cloud Deployment Installation Guide](./installation_guide.md) for more details.
\ No newline at end of file
......@@ -19,18 +19,42 @@ limitations under the License.
Deploy and manage Dynamo inference graphs on Kubernetes with automated orchestration and scaling, using the Dynamo Kubernetes Platform.
## Quick Start Paths
## Before You Start
Platform is installed using Dynamo Kubernetes Platform [helm chart](/deploy/cloud/helm/platform/README.md).
Determine your cluster environment:
**Path A: Production Install**
Install from published artifacts on your existing cluster → [Jump to Path A](#path-a-production-install)
**Shared/Multi-Tenant Cluster** (K8s cluster with existing Dynamo artifacts):
- CRDs already installed cluster-wide - skip CRD installation step
- Must use namespace-restricted installation (see note in installation steps)
**Path B: Local Development**
Set up Minikube first → [Minikube Setup](minikube.md) → Then follow Path A
**Dedicated Cluster** (full cluster admin access):
- You install CRDs yourself
- Can use cluster-wide operator (default)
**Path C: Custom Development**
Build from source for customization → [Jump to Path C](#path-c-custom-development)
**Local Development** (Minikube, testing):
- See [Minikube Setup](deployment/minikube.md) first, then follow installation steps below
To check if CRDs already exist:
```bash
kubectl get crd | grep dynamo
# If you see dynamographdeployments, dynamocomponentdeployments, etc., CRDs are already installed
```
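The CRD check above can be turned into a small decision helper. This is a minimal sketch: the function names `dynamo_crds_present` and `crd_install_hint` are hypothetical, and the only assumption about the cluster is that Dynamo CRD names contain the string `dynamo`, as the `grep` above implies.

```bash
# Hypothetical helper: reads CRD names on stdin (for example the output of
# `kubectl get crd`) and succeeds if any line mentions "dynamo".
dynamo_crds_present() {
  grep -q 'dynamo'
}

# Hypothetical helper: prints which CRD step applies for the list on stdin.
crd_install_hint() {
  if dynamo_crds_present; then
    echo "CRDs found: skip the CRD installation step"
  else
    echo "No Dynamo CRDs found: install CRDs before the platform"
  fi
}
```

Usage: `kubectl get crd | crd_install_hint`. Keeping the cluster call outside the function makes the logic easy to try against a saved CRD list.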
## Installation Paths
Platform is installed using Dynamo Kubernetes Platform [helm chart](../../deploy/cloud/helm/platform/README.md).
**Path A: Pre-built Artifacts**
- Use case: Production deployment, shared or dedicated clusters
- Source: NGC published Helm charts
- Time: ~10 minutes
- Jump to: [Path A](#path-a-production-install)
**Path B: Custom Build from Source**
- Use case: Contributing to Dynamo, using latest features from main branch, customization
- Requirements: Docker build environment
- Time: ~30 minutes
- Jump to: [Path B](#path-b-custom-build-from-source)
All helm install commands can be customized either by editing the values.yaml file or by passing in your own values.yaml:
......@@ -48,31 +72,39 @@ helm install ...
## Prerequisites
Verify before proceeding:
- Kubernetes cluster v1.24+ access
- kubectl v1.24+ installed and configured
- Helm v3.0+ installed
- Cluster type determined (shared vs dedicated)
- CRD status checked if on shared cluster
- NGC credentials if using NVIDIA images (optional for public images)
Estimated time: 5-30 minutes depending on path
```bash
# Required tools
# Check required tools
kubectl version --client # v1.24+
helm version # v3.0+
docker version # Running daemon
docker version # Running daemon (for Path B only)
# Set your inference runtime image
# Set your release version
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:${RELEASE_VERSION}
# Also available: sglang-runtime, tensorrtllm-runtime
```
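The minimum versions listed above (v1.24+ for kubectl, v3.0+ for Helm) can be checked mechanically. The sketch below is one way to compare version strings; `version_ge` is a hypothetical helper, and it assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```bash
# Hypothetical helper: succeed when version $1 is greater than or equal to $2.
# Relies on `sort -V`, so the smaller of the two versions sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}
```

For example, `version_ge 1.28.3 1.24.0 && echo "kubectl OK"` prints `kubectl OK`, while `version_ge 1.23.9 1.24.0` fails. Strip any leading `v` from the reported version before comparing.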
> [!TIP]
> No cluster? See [Minikube Setup](minikube.md) for local development.
> No cluster? See [Minikube Setup](deployment/minikube.md) for local development.
## Path A: Production Install
Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts) in 3 steps.
Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts).
```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# 2. Install CRDs
# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
......@@ -81,10 +113,27 @@ helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
```
**For Shared/Multi-Tenant Clusters:**
If your cluster has namespace-restricted Dynamo operators, you MUST add namespace restriction to your installation:
```bash
# Add this flag to the helm install command above
--set dynamo-operator.namespaceRestriction.enabled=true
```
Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).
If you see this validation error, you need namespace restriction:
```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```
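To make the placement of the namespace-restriction flag concrete, the sketch below assembles the platform install command from the steps above and appends the flag only for shared clusters. `platform_install_cmd` is a hypothetical helper that prints the command rather than running it; the chart name, namespace flags, and `--set` path are taken from this guide.

```bash
# Hypothetical helper: print (do not run) the platform install command.
# Usage: platform_install_cmd <release_version> <namespace> [shared: true|false]
platform_install_cmd() {
  local version="$1"
  local namespace="$2"
  local shared="${3:-false}"
  local cmd="helm install dynamo-platform dynamo-platform-${version}.tgz --namespace ${namespace} --create-namespace"
  if [ "${shared}" = "true" ]; then
    # Use the full value path, including the dynamo-operator prefix.
    cmd="${cmd} --set dynamo-operator.namespaceRestriction.enabled=true"
  fi
  printf '%s\n' "${cmd}"
}
```

Running `platform_install_cmd "${RELEASE_VERSION}" "${NAMESPACE}" true` lets you review the full command before executing it on a shared cluster.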
> [!TIP]
> For multinode deployments, you need to enable Grove and Kai Scheduler.
> For multinode deployments, you need to enable Grove and KAI Scheduler.
> You might choose to install them manually or through the dynamo-platform helm install command.
> When using the dynamo-platform helm install command, Grove and Kai Scheduler are NOT installed by default. You can enable their installation by setting the following flags in the helm install command:
> When using the dynamo-platform helm install command, Grove and KAI Scheduler are NOT installed by default. You can enable their installation by setting the following flags in the helm install command:
```bash
--set "grove.enabled=true"
......@@ -111,9 +160,11 @@ helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace
[Verify Installation](#verify-installation)
## Path B: Custom Development
## Path B: Custom Build from Source
Build and deploy from source for customization, contributing to Dynamo, or using the latest features from the main branch.
Build and deploy from source for customization.
Note: This gives you access to the latest unreleased features and fixes on the main branch.
```bash
# 1. Set environment
......@@ -190,16 +241,43 @@ kubectl get pods -n ${NAMESPACE}
```
2. **Explore Backend Guides**
- [vLLM Deployments](/components/backends/vllm/deploy/README.md)
- [SGLang Deployments](/components/backends/sglang/deploy/README.md)
- [TensorRT-LLM Deployments](/components/backends/trtllm/deploy/README.md)
- [vLLM Deployments](../../components/backends/vllm/deploy/README.md)
- [SGLang Deployments](../../components/backends/sglang/deploy/README.md)
- [TensorRT-LLM Deployments](../../components/backends/trtllm/deploy/README.md)
3. **Optional:**
- [Set up Prometheus & Grafana](metrics.md)
- [Set up Prometheus & Grafana](./observability/metrics.md)
- [SLA Planner Quickstart Guide](../planner/sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)
## Troubleshooting
**"VALIDATION ERROR: Cannot install cluster-wide Dynamo operator"**
```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```
Cause: Attempting cluster-wide install on a shared cluster with existing namespace-restricted operators.
Solution: Add namespace restriction to your installation:
```bash
--set dynamo-operator.namespaceRestriction.enabled=true
```
Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).
**CRDs already exist**
Cause: Installing CRDs on a cluster where they're already present (common on shared clusters).
Solution: Skip step 2 (CRD installation), proceed directly to platform installation.
To check if CRDs exist:
```bash
kubectl get crd | grep dynamo
```
**Pods not starting?**
```bash
kubectl describe pod <pod-name> -n ${NAMESPACE}
......@@ -232,8 +310,7 @@ just add the following to the helm install command:
## Advanced Options
- [Helm Chart Configuration](/deploy/cloud/helm/platform/README.md)
- [GKE-specific setup](gke_setup.md)
- [Create custom deployments](create_deployment.md)
- [Dynamo Operator details](dynamo_operator.md)
- [Helm Chart Configuration](../../deploy/cloud/helm/platform/README.md)
- [Create custom deployments](./deployment/create_deployment.md)
- [Dynamo Operator details](./dynamo_operator.md)
- [Model Express Server details](https://github.com/ai-dynamo/modelexpress)
......@@ -17,7 +17,7 @@ This guide demonstrates how to set up logging for Dynamo in Kubernetes using Gra
### 1. Dynamo Cloud Kubernetes Operator
This guide assumes you have installed Dynamo Cloud Kubernetes Operator. For more information, see [Dynamo Cloud Operator](./README.md).
This guide assumes you have installed Dynamo Cloud Kubernetes Operator. For more information, see [Dynamo Cloud Operator](../README.md).
### 2. Kube-prometheus
......
......@@ -28,7 +28,7 @@ helm install prometheus -n monitoring --create-namespace prometheus-community/ku
> The commands below assume you installed the kube-prometheus-stack using the method listed above. Depending on how your monitoring stack is configured, you may need to adjust the `kubectl` commands in this document (e.g., the Namespace or Service names).
### Install Dynamo Operator
Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Installation Guide](/docs/kubernetes/installation_guide.md) for detailed instructions on deploying the Dynamo operator.
Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Installation Guide](../installation_guide.md) for detailed instructions on deploying the Dynamo operator.
Make sure to set the `prometheusEndpoint` to the Prometheus endpoint you installed in the previous step.
```bash
......@@ -53,7 +53,7 @@ If the output is empty, you need to install the dcgm-exporter. For more informat
Let's start by deploying a simple vLLM aggregated deployment:
```bash
export NAMESPACE=dynamo # namespace where dynamo operator is installed
export NAMESPACE=dynamo-system # namespace where dynamo operator is installed
pushd components/backends/vllm/deploy
kubectl apply -f agg.yaml -n $NAMESPACE
popd
......@@ -64,8 +64,8 @@ This will create two components:
- A Worker component exposing metrics on its system port
Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
- Deployment configuration: See the [vLLM README](/docs/backends/vllm/README.md)
- Available metrics: See the [metrics guide](/docs/observability/metrics.md)
- Deployment configuration: See the [vLLM README](../../backends/vllm/README.md)
- Available metrics: See the [metrics guide](../../observability/metrics.md)
### Validate the Deployment
......@@ -87,7 +87,7 @@ curl localhost:8000/v1/chat/completions \
}'
```
For more information about validating the deployment, see the [vLLM README](../backends/vllm/README.md).
For more information about validating the deployment, see the [vLLM README](../../backends/vllm/README.md).
## Set Up Metrics Collection
......@@ -137,7 +137,7 @@ Visit http://localhost:9090 and try these example queries:
- `dynamo_frontend_requests_total`
- `dynamo_frontend_time_to_first_token_seconds_bucket`
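The same metrics can also be fetched without the UI through the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. A minimal sketch, assuming Prometheus is reachable at `http://localhost:9090` as in the step above; `prom_query_url` is a hypothetical helper name.

```bash
# Hypothetical helper: print the instant-query URL for a plain metric name.
# Usage: prom_query_url <prometheus_base_url> <metric_name>
prom_query_url() {
  local base="$1"
  local metric="$2"
  printf '%s/api/v1/query?query=%s\n' "${base}" "${metric}"
}
```

For example, `curl -s "$(prom_query_url http://localhost:9090 dynamo_frontend_requests_total)"` returns the current samples as JSON. For PromQL expressions containing brackets or spaces, let curl do the encoding instead: `curl -sG --data-urlencode 'query=<expression>' http://localhost:9090/api/v1/query`.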
![Prometheus UI showing Dynamo metrics](../images/prometheus-k8s.png)
![Prometheus UI showing Dynamo metrics](../../images/prometheus-k8s.png)
### In Grafana
```bash
......@@ -155,4 +155,4 @@ Visit http://localhost:3000 and log in with the credentials captured above.
Once logged in, find the Dynamo dashboard under General.
![Grafana dashboard showing Dynamo metrics](../images/grafana-k8s.png)
![Grafana dashboard showing Dynamo metrics](../../images/grafana-k8s.png)
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Multimodal Inference in Dynamo
You can find example workflows and reference implementations for deploying a multimodal model using Dynamo in [multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal).
## EPD vs. PD Disaggregation
Dynamo supports two primary approaches for processing multimodal inputs, which differ in how the initial media encoding step is handled relative to the main LLM inference engine.
### 1. EPD (Encode-Prefill-Decode) Disaggregation
The EPD approach introduces an explicit separation of the media encoding step, maximizing the utilization of specialized hardware and increasing overall system efficiency for large multimodal models.
* **Media Input:** Image, video, audio, or an embedding URL is provided.
* **Process Flow:**
1. A dedicated **Encode Worker** is launched separately to handle the embedding extraction from the media input.
2. The extracted embeddings are transferred to the main engine via the **NVIDIA Inference Xfer Library (NIXL)**.
3. The main **Engine** performs the remaining **Prefill Decode Disaggregation** steps to generate the output.
* **Benefit:** This disaggregation allows for the decoupling of media encoding hardware/resources from the main LLM serving engine, making the serving of large multimodal models more efficient.
### 2. PD (Prefill-Decode) Disaggregation
The PD approach is a more traditional, aggregated method where the inference engine handles the entire process.
* **Media Input:** Image, video, or audio is loaded.
* **Process Flow:**
1. The main **Engine** receives the media input.
2. The Engine executes the full sequence: **Encode + Prefill + Decode**.
* **Note:** In this approach, the encoding step is executed within the same pipeline as the prefill and decode phases.
## Inference Framework Support Matrix
Dynamo supports multimodal capabilities across leading LLM inference backends, including **vLLM**, **TensorRT-LLM (TRT-LLM)**, and **SGLang**. The table below details the current support level for EPD/PD and various media types for each stack.
| Stack | EPD Support | PD Support | Image | Video | Audio |
| --------- | --------- | --------- | --------- |---------| --------- |
| **vLLM** | ✅ | ✅ | ✅ | ✅ | 🚧 |
| **TRT-LLM** | ✅ (Currently via precomputed Embeddings URL) | ✅ | ✅ | ❌ | ❌ |
| **SGLang** | ✅ | ❌ | ✅ | ❌ | ❌ |
......@@ -195,6 +195,6 @@ date: Wed, 03 Sep 2025 13:42:45 GMT
## Related Documentation
- [Distributed Runtime Architecture](../architecture/distributed_runtime.md)
- [Dynamo Architecture Overview](../architecture/architecture.md)
- [Distributed Runtime Architecture](../design_docs/distributed_runtime.md)
- [Dynamo Architecture Overview](../design_docs/architecture.md)
- [Backend Guide](../development/backend-guide.md)
......@@ -185,7 +185,7 @@ curl -d '{"model": "Qwen/Qwen3-0.6B", "max_completion_tokens": 2049, "messages":
## Related Documentation
- [Distributed Runtime Architecture](../architecture/distributed_runtime.md)
- [Dynamo Architecture Overview](../architecture/architecture.md)
- [Distributed Runtime Architecture](../design_docs/distributed_runtime.md)
- [Dynamo Architecture Overview](../design_docs/architecture.md)
- [Backend Guide](../development/backend-guide.md)
- [Log Aggregation in Kubernetes](../kubernetes/logging.md)
- [Log Aggregation in Kubernetes](../kubernetes/observability/logging.md)
......@@ -31,7 +31,7 @@ Dynamo automatically exposes metrics with the `dynamo_` name prefixes. It also a
**Specialized Component Metrics**: Components can also expose additional metrics specific to their functionality. For example, a `preprocessor` component exposes metrics with the `dynamo_preprocessor_*` prefix. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for details on specialized component metrics.
**Kubernetes Integration**: For comprehensive Kubernetes deployment and monitoring setup, see the [Kubernetes Metrics Guide](../kubernetes/metrics.md). This includes Prometheus Operator setup, metrics collection configuration, and visualization in Grafana.
**Kubernetes Integration**: For comprehensive Kubernetes deployment and monitoring setup, see the [Kubernetes Metrics Guide](../kubernetes/observability/metrics.md). This includes Prometheus Operator setup, metrics collection configuration, and visualization in Grafana.
## Metrics Hierarchy
......@@ -94,8 +94,8 @@ The metrics system includes a pre-configured Grafana dashboard for visualizing s
## Related Documentation
- [Distributed Runtime Architecture](../architecture/distributed_runtime.md)
- [Dynamo Architecture Overview](../architecture/architecture.md)
- [Distributed Runtime Architecture](../design_docs/distributed_runtime.md)
- [Dynamo Architecture Overview](../design_docs/architecture.md)
- [Backend Guide](../development/backend-guide.md)
- [Metrics Implementation Examples](../../deploy/metrics/README.md#implementation-examples)
- [Complete Metrics Setup Guide](../../deploy/metrics/README.md)
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Finding Best Initial Configs using AIConfigurator
[AIConfigurator](https://github.com/ai-dynamo/aiconfigurator/tree/main) is a performance optimization tool that helps you find the optimal configuration for deploying LLMs with Dynamo. It automatically determines the best number of prefill and decode workers, parallelism settings, and deployment parameters to meet your SLA targets while maximizing throughput.
## Why Use AIConfigurator?
When deploying LLMs with Dynamo, you need to make several critical decisions:
- **Aggregated vs Disaggregated**: Which architecture gives better performance for your workload?
- **Worker Configuration**: How many prefill and decode workers to deploy?
- **Parallelism Settings**: What tensor/pipeline parallel configuration to use?
- **SLA Compliance**: How to meet your TTFT and TPOT targets?
AIConfigurator answers these questions in seconds, providing:
- Optimal configurations that meet your SLA requirements
- Ready-to-deploy Dynamo configuration files
- Performance comparisons between different deployment strategies
- Up to 1.7x better throughput compared to manual configuration
## Quick Start
```bash
# Install
pip3 install aiconfigurator
# Find optimal configuration
#   --model       Model name (QWEN3_32B, LLAMA3.1_70B, etc.)
#   --total_gpus  Number of available GPUs
#   --system      GPU type (h100_sxm, h200_sxm, a100_sxm)
#   --isl / --osl Input / output sequence length (tokens)
#   --ttft        Target Time To First Token (ms)
#   --tpot        Target Time Per Output Token (ms)
aiconfigurator cli default \
  --model QWEN3_32B \
  --total_gpus 32 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 300 \
  --tpot 10 \
  --save_dir ./dynamo-configs
# Deploy
kubectl apply -f ./dynamo-configs/disagg/top1/disagg/k8s_deploy.yaml
```
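The `--ttft` and `--tpot` flags set latency targets in milliseconds, and a configuration counts as SLA-compliant when both reported latencies fall at or below them. The sketch below shows that comparison; `meets_sla` is a hypothetical helper, not part of AIConfigurator, and it assumes `awk` is available for floating-point comparison.

```bash
# Hypothetical helper: succeed when measured latencies are within the targets.
# Usage: meets_sla <ttft_ms> <tpot_ms> <target_ttft_ms> <target_tpot_ms>
meets_sla() {
  awk -v ttft="$1" -v tpot="$2" -v t_ttft="$3" -v t_tpot="$4" \
    'BEGIN { exit !(ttft <= t_ttft && tpot <= t_tpot) }'
}
```

For instance, `meets_sla 276.76 8.32 300 10` succeeds, because a TTFT of 276.76 ms and a TPOT of 8.32 ms sit under the 300 ms and 10 ms targets, while `meets_sla 276.76 8.32 100 5` fails.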
## Example Output
```text
********************************************************************************
* Dynamo aiconfigurator Final Results *
********************************************************************************
----------------------------------------------------------------------------
Input Configuration & SLA Target:
Model: QWEN3_32B (is_moe: False)
Total GPUs: 32
Best Experiment Chosen: disagg at 812.92 tokens/s/gpu (1.70x better)
----------------------------------------------------------------------------
Overall Best Configuration:
- Best Throughput: 812.92 tokens/s/gpu
- User Throughput: 120.23 tokens/s/user
- TTFT: 276.76ms
- TPOT: 8.32ms
----------------------------------------------------------------------------
Pareto Frontier:
QWEN3_32B Pareto Frontier: tokens/s/gpu vs tokens/s/user
┌────────────────────────────────────────────────────────────────────────┐
1600.0┤ •• disagg │
│ ff agg │
│ xx disagg best │
│ │
1333.3┤ f │
│ ff │
│ ff • │
│ f •••••••• │
1066.7┤ f •• │
│ fff •••••••• │
│ f •• │
│ f •••• │
800.0┤ fffff •••x │
│ fff •• │
│ fff • │
│ fffff •• │
533.3┤ ffff •• │
│ ffff •• │
│ fffffff ••••• │
│ ffffff •• │
266.7┤ fffff ••••••••• │
│ ffffffffff │
│ f │
│ │
0.0┤ │
└┬─────────────────┬─────────────────┬────────────────┬─────────────────┬┘
0 60 120 180 240
tokens/s/gpu tokens/s/user
```
1. **Performance Comparison**: Shows disaggregated vs aggregated serving performance
2. **Optimal Configuration**: The best configuration that meets your SLA targets
3. **Deployment Files**: Ready-to-use Dynamo configuration files
## Key Features
### Fast Profiling Integration
```bash
# Use with Dynamo's SLA planner (20-30 seconds vs hours)
python3 -m benchmarks.profiler.profile_sla \
--config ./components/backends/trtllm/deploy/disagg.yaml \
--backend trtllm \
--use-ai-configurator \
--aic-system h200_sxm \
--aic-model-name QWEN3_32B
```
### Custom Configuration
```bash
# For advanced users: define custom search space
aiconfigurator cli exp --yaml_path custom_config.yaml
```
## Common Use Cases
```bash
# Strict SLAs (low latency)
aiconfigurator cli default --model QWEN2.5_7B --total_gpus 8 --system h200_sxm --ttft 100 --tpot 5
# High throughput (relaxed latency)
aiconfigurator cli default --model QWEN3_32B --total_gpus 32 --system h200_sxm --ttft 1000 --tpot 50
```
## Supported Configurations
**Models**: GPT, LLAMA2/3, QWEN2.5/3, Mixtral, DEEPSEEK_V3
**GPUs**: H100, H200, A100, B200 (preview), GB200 (preview)
**Backend**: TensorRT-LLM (vLLM and SGLang coming soon)
## Additional Options
```bash
# Web interface
aiconfigurator webapp # Visit http://127.0.0.1:7860
# Docker
docker run -it --rm nvcr.io/nvidia/aiconfigurator:latest \
aiconfigurator cli default --model LLAMA3.1_70B --total_gpus 16 --system h100_sxm
```
## Troubleshooting
**Model name mismatch**: Use exact model name that matches your deployment
**GPU allocation**: Verify available GPUs match `--total_gpus`
**Performance variance**: Results are estimates - benchmark actual deployment
## Learn More
- [Dynamo Installation Guide](/docs/kubernetes/installation_guide.md)
- [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md)
- [Benchmarking Guide](/docs/benchmarks/benchmarking.md)
\ No newline at end of file
......@@ -21,7 +21,7 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy
- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval
- **Workers**: prefill and backend workers handle inference
The adjustment interval can be defined in the planner manifest as an argument. The default interval value can be found in this [file](/components/planner/src/dynamo/planner/defaults.py).
The adjustment interval can be defined in the planner manifest as an argument. The default interval value can be found in this [file](/components/src/dynamo/planner/defaults.py).
```mermaid
flowchart LR
......
......@@ -15,7 +15,7 @@ The deployment process consists of two mandatory phases:
2. **SLA Planner Deployment** (5-10 minutes) - Enables autoscaling
> [!TIP]
> **Fast Profiling with AI Configurator**: For TensorRT-LLM users, we provide AI Configurator (AIC) that can complete profiling in 20-30 seconds using performance simulation instead of real deployments. Support for vLLM and SGLang coming soon. See [AI Configurator section](/docs/benchmarks/pre_deployment_profiling.md#running-the-profiling-script-with-aiconfigurator) in the Profiling Guide.
> **Fast Profiling with AI Configurator**: For TensorRT-LLM users, we provide AI Configurator (AIC) that can complete profiling in 20-30 seconds using performance simulation instead of real deployments. Support for vLLM and SGLang coming soon. See [AI Configurator section](/docs/benchmarks/pre_deployment_profiling.md#running-the-profiling-script-with-ai-configurator) in the Profiling Guide.
```mermaid
flowchart TD
......@@ -38,7 +38,7 @@ flowchart TD
Before deploying the SLA planner, ensure:
- **Dynamo platform installed** (see [Installation Guide](/docs/kubernetes/installation_guide.md))
- **[kube-prometheus-stack](/docs/kubernetes/metrics.md) installed and running.** By default, the prometheus server is deployed in the `monitoring` namespace. If it is deployed to a different namespace, set `dynamo-operator.dynamo.metrics.prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.<namespace>.svc.cluster.local:9090"`.
- **[kube-prometheus-stack](/docs/kubernetes/observability/metrics.md) installed and running.** By default, the prometheus server is deployed in the `monitoring` namespace. If it is deployed to a different namespace, set `dynamo-operator.dynamo.metrics.prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.<namespace>.svc.cluster.local:9090"`.
- **Benchmarking resources setup** (see [Kubernetes utilities for Dynamo Benchmarking and Profiling](../../deploy/utils/README.md)). The script creates a `dynamo-pvc` with `ReadWriteMany` access. If your cluster's default storageClassName does not allow `ReadWriteMany`, specify a storageClassName in `deploy/utils/manifests/pvc.yaml` that does.
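The Prometheus endpoint in the prerequisite above differs only in the namespace segment, so it can be generated from the namespace name. A minimal sketch: `prometheus_endpoint_flag` is a hypothetical helper, and the service name `prometheus-kube-prometheus-prometheus` is the one given in this guide, which may differ if your monitoring stack was installed under another release name.

```bash
# Hypothetical helper: print the --set flag for a Prometheus server that runs
# in the given namespace (defaults to "monitoring").
prometheus_endpoint_flag() {
  local namespace="${1:-monitoring}"
  printf '%s\n' "--set dynamo-operator.dynamo.metrics.prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.${namespace}.svc.cluster.local:9090"
}
```

Append the printed flag to the platform `helm install` command, for example `prometheus_endpoint_flag observability` when Prometheus runs in a namespace named `observability`.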
......
......@@ -152,19 +152,19 @@ The KV-aware routing arguments:
### Request Migration
In a [Distributed System](#distributed-system), you can enable [request migration](../architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
In a [Distributed System](#distributed-system), you can enable [request migration](../fault_tolerance/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
```bash
dynamo-run in=dyn://... out=<engine> ... --migration-limit=3
```
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../architecture/request_migration.md) documentation for details on how this works.
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../fault_tolerance/request_migration.md) documentation for details on how this works.
### Request Cancellation
When using the HTTP interface (`in=http`), if the HTTP request connection is dropped by the client, Dynamo automatically cancels the downstream request to the worker. This ensures that computational resources are not wasted on generating responses that are no longer needed.
For detailed information about how request cancellation works across the system, see the [Request Cancellation Architecture](../architecture/request_cancellation.md) documentation.
For detailed information about how request cancellation works across the system, see the [Request Cancellation Architecture](../fault_tolerance/request_cancellation.md) documentation.
## Development
......
......@@ -255,7 +255,7 @@ python -m dynamo.frontend --router-mode kv --port 8002 --router-replica-sync
>[!Note]
> If you need to start with a fresh state, you have two options:
> 1. **Recommended**: Use a different namespace/component (see [Distributed Runtime](distributed_runtime.md)) which will start a new stream and NATS object store path
> 1. **Recommended**: Use a different namespace/component (see [Distributed Runtime](/docs/design_docs/distributed_runtime.md)) which will start a new stream and NATS object store path
> 2. **Use with caution**: Launch a router with the `--router-reset-states` flag, which will purge the entire stream and radix snapshot. This should only be done when launching the first router replica in a component, as it can bring existing router replicas into an inconsistent state.
## Understanding KV Cache
......
......@@ -30,17 +30,24 @@ Learn fundamental Dynamo concepts through these introductory examples:
- **[Disaggregated Serving](basics/disaggregated_serving/README.md)** - Prefill/decode separation for enhanced performance and scalability
- **[Multi-node](basics/multinode/README.md)** - Distributed inference across multiple nodes and GPUs
## Framework Support
These examples show how Dynamo broadly works using major inference engines.
If you want to see advanced, framework-specific deployment patterns and best practices, check out the [Components Workflows](../components/backends/) directory:
- **[vLLM](../components/backends/vllm/)** – vLLM-specific deployment and configuration
- **[SGLang](../components/backends/sglang/)** – SGLang integration examples and workflows
- **[TensorRT-LLM](../components/backends/trtllm/)** – TensorRT-LLM workflows and optimizations
## Deployment Examples
Platform-specific deployment guides for production environments:
- **[Amazon EKS](deployments/EKS/)** - Deploy Dynamo on Amazon Elastic Kubernetes Service
- **[Azure AKS](deployments/AKS/)** - Deploy Dynamo on Azure Kubernetes Service
- **[Amazon ECS](deployments/ECS/)** - Deploy Dynamo on Amazon Elastic Container Service
- **[Router Standalone](deployments/router_standalone/)** - Standalone router deployment patterns
- **Amazon ECS** - _Coming soon_
- **Google GKE** - _Coming soon_
- **Ray** - _Coming soon_
- **NVIDIA Cloud Functions (NVCF)** - _Coming soon_
## Runtime Examples
......@@ -68,11 +75,4 @@ Before running any examples, ensure you have:
- **Python 3.9+** - For client scripts and utilities
- **Kubernetes cluster** - For any cloud deployment/K8s examples
## Framework Support
These examples show how Dynamo broadly works using major inference engines.
If you want to see advanced, framework-specific deployment patterns and best practices, check out the [Components Workflows](../components/backends/) directory:
- **[vLLM](../components/backends/vllm/)** – vLLM-specific deployment and configuration
- **[SGLang](../components/backends/sglang/)** – SGLang integration examples and workflows
- **[TensorRT-LLM](../components/backends/trtllm/)** – TensorRT-LLM workflows and optimizations
......@@ -4,8 +4,8 @@ This example demonstrates running Dynamo across multiple nodes with **KV-aware r
For more information about the core concepts, see:
- [Dynamo Disaggregated Serving](../../../docs/architecture/disagg_serving.md)
- [KV Cache Routing Architecture](../../../docs/architecture/kv_cache_routing.md)
- [Dynamo Disaggregated Serving](../../../docs/design_docs/disagg_serving.md)
- [KV Cache Routing Architecture](../../../docs/router/kv_cache_routing.md)
## Architecture Overview
......@@ -65,7 +65,7 @@ This is particularly beneficial for:
- **Similar queries**: Common prefixes are computed once and reused
- **Batch processing**: Related requests can be routed to workers with shared context
For detailed technical information about how KV routing works, see the [KV Cache Routing Architecture documentation](../../../docs/architecture/kv_cache_routing.md).
For detailed technical information about how KV routing works, see the [KV Cache Routing Architecture documentation](../../../docs/router/kv_cache_routing.md).
## Prerequisites
......@@ -461,7 +461,7 @@ python -m dynamo.frontend \
--router-temperature 0.0 # Temperature for probabilistic routing (0 = deterministic)
```
For more advanced configuration options including custom worker selection, block size tuning, and alternative indexing strategies, see the [KV Cache Routing documentation](../../../docs/architecture/kv_cache_routing.md).
For more advanced configuration options including custom worker selection, block size tuning, and alternative indexing strategies, see the [KV Cache Routing documentation](../../../docs/router/kv_cache_routing.md).
## Cleanup
......
......@@ -88,4 +88,4 @@ python3 client.py --middle
- Both modes demonstrate the same cancellation behavior
- The middle server shows how to properly forward context in proxy scenarios
For more details on the request cancellation architecture, refer to the [architecture documentation](../../../docs/architecture/request_cancellation.md).
For more details on the request cancellation architecture, refer to the [architecture documentation](../../../docs/fault_tolerance/request_cancellation.md).
......@@ -165,7 +165,7 @@ Test complete scaling behavior including Kubernetes deployment and load generati
**Prerequisites:**
- **[kube-prometheus-stack](../../docs/kubernetes/metrics.md) installed and running.** The SLA planner requires Prometheus to observe metrics and make scaling decisions.
- **[kube-prometheus-stack](../../docs/kubernetes/observability/metrics.md) installed and running.** The SLA planner requires Prometheus to observe metrics and make scaling decisions.
- Ensure the Dynamo operator was installed with the Prometheus endpoint configured (see [SLA Planner Quickstart Guide](../../docs/planner/sla_planner_quickstart.md#prerequisites) for details).
**Prepare the test deployment manifest:**
......