Commit c6b59045 (unverified), authored by Anish, committed by GitHub

docs: address Harry/VDR feedback + fixing broken links across repository (#3802)


Signed-off-by: Harry Kim <harry_kim@live.com>
Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Signed-off-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>
Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Co-authored-by: Harry Kim <harry_kim@live.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>
Co-authored-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>
parent d712ce8d
...@@ -52,9 +52,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -52,9 +52,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| Feature | TensorRT-LLM | Notes | | Feature | TensorRT-LLM | Notes |
|---------|--------------|-------| |---------|--------------|-------|
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | | | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet | | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | | | [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | | | [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned | | [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | | | [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | |
...@@ -220,13 +220,13 @@ Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disag ...@@ -220,13 +220,13 @@ Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disag
## Request Migration ## Request Migration
You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker: You can enable [request migration](../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
```bash ```bash
python3 -m dynamo.trtllm ... --migration-limit=3 python3 -m dynamo.trtllm ... --migration-limit=3
``` ```
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works. This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/fault_tolerance/request_migration.md) documentation for details on how this works.
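The semantics of `--migration-limit` can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Dynamo's implementation: `run_with_migration`, `WorkerFailure`, and the worker callables are hypothetical names, and the sketch assumes the first attempt does not count against the limit.

```python
from typing import Callable, Sequence


class WorkerFailure(Exception):
    """Hypothetical error raised when a worker dies mid-request."""


def run_with_migration(
    workers: Sequence[Callable[[str], str]],
    request: str,
    migration_limit: int,
) -> str:
    """Try the request on successive workers, migrating at most `migration_limit` times."""
    migrations = 0
    index = 0
    while True:
        worker = workers[index % len(workers)]
        try:
            return worker(request)
        except WorkerFailure:
            if migrations >= migration_limit:
                raise  # limit exhausted: surface the failure to the caller
            migrations += 1
            index += 1  # move on to the next worker
```

With a limit of 3, a request gets its initial attempt plus up to three migrations; a fourth consecutive failure is returned to the caller.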
## Request Cancellation ## Request Cancellation
...@@ -240,7 +240,7 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re ...@@ -240,7 +240,7 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
| **Disaggregated (Decode-First)** | ✅ | ✅ | | **Disaggregated (Decode-First)** | ✅ | ✅ |
| **Disaggregated (Prefill-First)** | ✅ | ✅ | | **Disaggregated (Prefill-First)** | ✅ | ✅ |
For more details, see the [Request Cancellation Architecture](../../../docs/architecture/request_cancellation.md) documentation. For more details, see the [Request Cancellation Architecture](../../fault_tolerance/request_cancellation.md) documentation.
## Client ## Client
......
...@@ -35,9 +35,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -35,9 +35,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| Feature | vLLM | Notes | | Feature | vLLM | Notes |
|---------|------|-------| |---------|------|-------|
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | | | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP | | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | | | [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | | | [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP | | [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | | | [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | |
...@@ -153,7 +153,7 @@ Below we provide a selected list of advanced deployments. Please open up an issu ...@@ -153,7 +153,7 @@ Below we provide a selected list of advanced deployments. Please open up an issu
### Kubernetes Deployment ### Kubernetes Deployment
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](/components/backends/vllm/deploy/README.md) For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](../../../components/backends/vllm/deploy/README.md)
## Configuration ## Configuration
...@@ -178,17 +178,17 @@ When using KV-aware routing, ensure deterministic hashing across processes to av ...@@ -178,17 +178,17 @@ When using KV-aware routing, ensure deterministic hashing across processes to av
```bash ```bash
vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256 vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
``` ```
See the high-level notes in [KV Cache Routing](../../../docs/architecture/kv_cache_routing.md) on deterministic event IDs. See the high-level notes in [KV Cache Routing](../../../docs/router/kv_cache_routing.md) on deterministic event IDs.
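To see why a content-based hash matters here: Python's built-in `hash()` is salted per process for `str` and `bytes` unless `PYTHONHASHSEED` is fixed, so two processes can disagree on the ID of the same block, whereas SHA-256 depends only on its input. The sketch below is illustrative and is not vLLM's actual block-hashing scheme; `block_hash` and its byte encoding are assumptions.

```python
import hashlib
from typing import Optional, Sequence


def block_hash(parent: Optional[bytes], token_ids: Sequence[int]) -> bytes:
    """Hash one block of token IDs, chained on the parent block's hash."""
    h = hashlib.sha256()
    h.update(parent or b"")
    for tok in token_ids:
        # Fixed-width little-endian encoding keeps the byte stream unambiguous.
        h.update(tok.to_bytes(8, "little", signed=False))
    return h.digest()


# Two consecutive blocks of a prompt; the second hash depends on the first.
first = block_hash(None, [1, 2, 3, 4])
second = block_hash(first, [5, 6, 7, 8])
```

Any process that sees the same token prefix computes the same chain of digests, which is the property a KV-aware router needs to match events from different workers.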
## Request Migration ## Request Migration
You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker: You can enable [request migration](../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
```bash ```bash
python3 -m dynamo.vllm ... --migration-limit=3 python3 -m dynamo.vllm ... --migration-limit=3
``` ```
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works. This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/fault_tolerance/request_migration.md) documentation for details on how this works.
## Request Cancellation ## Request Cancellation
...@@ -201,4 +201,4 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re ...@@ -201,4 +201,4 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
| **Aggregated** | ✅ | ✅ | | **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ | | **Disaggregated** | ✅ | ✅ |
For more details, see the [Request Cancellation Architecture](../../../docs/architecture/request_cancellation.md) documentation. For more details, see the [Request Cancellation Architecture](../../../docs/fault_tolerance/request_cancellation.md) documentation.
...@@ -521,3 +521,18 @@ The built-in Python workflow connects to endpoints, benchmarks with aiperf, and ...@@ -521,3 +521,18 @@ The built-in Python workflow connects to endpoints, benchmarks with aiperf, and
3. **Direct module usage**: Use individual Python modules (`benchmarks.utils.benchmark`, `benchmarks.utils.plot`) for granular control over each step of the benchmarking process. 3. **Direct module usage**: Use individual Python modules (`benchmarks.utils.benchmark`, `benchmarks.utils.plot`) for granular control over each step of the benchmarking process.
The Python benchmarking module provides a complete end-to-end benchmarking experience with full control over the workflow. The Python benchmarking module provides a complete end-to-end benchmarking experience with full control over the workflow.
---
## Testing with Mocker Backend
For development and testing purposes, Dynamo provides a [mocker backend](../../components/src/dynamo/mocker/) that simulates LLM inference without requiring actual GPU resources. This is useful for:
- **Testing deployments** without expensive GPU infrastructure
- **Developing and debugging** router, planner, or frontend logic
- **CI/CD pipelines** that need to validate infrastructure without model execution
- **Benchmarking framework validation** to ensure your setup works before using real backends
The mocker backend mimics the API and behavior of real backends (vLLM, SGLang, TensorRT-LLM) but generates mock responses instead of running actual inference.
See the [mocker directory](../../components/src/dynamo/mocker/) for usage examples and configuration options.
...@@ -70,7 +70,7 @@ html_theme_options = { ...@@ -70,7 +70,7 @@ html_theme_options = {
} }
], ],
"switcher": { "switcher": {
"json_url": "../versions1.json", "json_url": "versions1.json",
"version_match": release, "version_match": release,
}, },
"extra_head": { "extra_head": {
......
...@@ -53,7 +53,7 @@ To address the growing demands of distributed inference serving, NVIDIA introduc ...@@ -53,7 +53,7 @@ To address the growing demands of distributed inference serving, NVIDIA introduc
The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features: The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
- [Dynamo Disaggregated Serving](disagg_serving.md) - [Dynamo Disaggregated Serving](disagg_serving.md)
- [Dynamo Smart Router](kv_cache_routing.md) - [Dynamo Smart Router](../router/kv_cache_routing.md)
- [Dynamo KV Cache Block Manager](../kvbm/kvbm_intro.rst) - [Dynamo KV Cache Block Manager](../kvbm/kvbm_intro.rst)
- [Planner](../planner/planner_intro.rst) - [Planner](../planner/planner_intro.rst)
- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md) - [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
......
...@@ -74,7 +74,7 @@ The `model_type` can be: ...@@ -74,7 +74,7 @@ The `model_type` can be:
- `model_name`: The name to call the model. Your incoming HTTP requests model name must match this. Defaults to the hugging face repo name or the folder name. - `model_name`: The name to call the model. Your incoming HTTP requests model name must match this. Defaults to the hugging face repo name or the folder name.
- `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM. - `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16. - `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
- `migration_limit`: Maximum number of times a request may be [migrated to another Instance](../architecture/request_migration.md). Defaults to 0. - `migration_limit`: Maximum number of times a request may be [migrated to another Instance](../fault_tolerance/request_migration.md). Defaults to 0.
- `user_data`: Optional dictionary containing custom metadata for worker behavior (e.g., LoRA configuration). Defaults to None. - `user_data`: Optional dictionary containing custom metadata for worker behavior (e.g., LoRA configuration). Defaults to None.
See `components/backends` for full code examples. See `components/backends` for full code examples.
...@@ -116,7 +116,7 @@ In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill. ...@@ -116,7 +116,7 @@ In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.
A Python worker may need to be shut down promptly, for example when the node running the worker is to be reclaimed and there isn't enough time to complete all ongoing requests before the shutdown deadline. A Python worker may need to be shut down promptly, for example when the node running the worker is to be reclaimed and there isn't enough time to complete all ongoing requests before the shutdown deadline.
In such cases, you can signal incomplete responses by raising a `GeneratorExit` exception in your generate loop. This will immediately close the response stream, signaling to the frontend that the stream is incomplete. With request migration enabled (see the [`migration_limit`](../architecture/request_migration.md) parameter), the frontend will automatically migrate the partially completed request to another worker instance, if available, to be completed. In such cases, you can signal incomplete responses by raising a `GeneratorExit` exception in your generate loop. This will immediately close the response stream, signaling to the frontend that the stream is incomplete. With request migration enabled (see the [`migration_limit`](../fault_tolerance/request_migration.md) parameter), the frontend will automatically migrate the partially completed request to another worker instance, if available, to be completed.
> [!WARNING] > [!WARNING]
> We will update the `GeneratorExit` exception to a new Dynamo exception. Please expect minor code breaking change in the near future. > We will update the `GeneratorExit` exception to a new Dynamo exception. Please expect minor code breaking change in the near future.
...@@ -139,7 +139,7 @@ class RequestHandler: ...@@ -139,7 +139,7 @@ class RequestHandler:
When `GeneratorExit` is raised, the frontend receives the incomplete response and can seamlessly continue generation on another available worker instance, preserving the user experience even during worker shutdowns. When `GeneratorExit` is raised, the frontend receives the incomplete response and can seamlessly continue generation on another available worker instance, preserving the user experience even during worker shutdowns.
For more information about how request migration works, see the [Request Migration Architecture](../architecture/request_migration.md) documentation. For more information about how request migration works, see the [Request Migration Architecture](../fault_tolerance/request_migration.md) documentation.
## Request Cancellation ## Request Cancellation
...@@ -161,4 +161,4 @@ class RequestHandler: ...@@ -161,4 +161,4 @@ class RequestHandler:
The context parameter is optional - if your generate method doesn't include it in its signature, Dynamo will call your method without the context argument. The context parameter is optional - if your generate method doesn't include it in its signature, Dynamo will call your method without the context argument.
For detailed information about request cancellation, including async cancellation monitoring and context propagation patterns, see the [Request Cancellation Architecture](../architecture/request_cancellation.md) documentation. For detailed information about request cancellation, including async cancellation monitoring and context propagation patterns, see the [Request Cancellation Architecture](../fault_tolerance/request_cancellation.md) documentation.
...@@ -25,10 +25,9 @@ ...@@ -25,10 +25,9 @@
api/nixl_connect/README.md api/nixl_connect/README.md
kubernetes/api_reference.md kubernetes/api_reference.md
kubernetes/create_deployment.md kubernetes/deployment/create_deployment.md
kubernetes/fluxcd.md kubernetes/fluxcd.md
kubernetes/gke_setup.md
kubernetes/grove.md kubernetes/grove.md
kubernetes/model_caching_with_fluid.md kubernetes/model_caching_with_fluid.md
kubernetes/README.md kubernetes/README.md
...@@ -36,12 +35,12 @@ ...@@ -36,12 +35,12 @@
observability/metrics.md observability/metrics.md
kvbm/vllm-setup.md kvbm/vllm-setup.md
kvbm/trtllm-setup.md kvbm/trtllm-setup.md
guides/tool-calling.md agents/tool-calling.md
architecture/kv_cache_routing.md router/kv_cache_routing.md
planner/load_planner.md planner/load_planner.md
architecture/request_migration.md fault_tolerance/request_migration.md
architecture/request_cancellation.md fault_tolerance/request_cancellation.md
backends/trtllm/multinode/multinode-examples.md backends/trtllm/multinode/multinode-examples.md
backends/trtllm/multinode/multinode-multimodal-example.md backends/trtllm/multinode/multinode-multimodal-example.md
...@@ -66,14 +65,16 @@ ...@@ -66,14 +65,16 @@
examples/README.md examples/README.md
examples/runtime/hello_world/README.md examples/runtime/hello_world/README.md
architecture/distributed_runtime.md design_docs/distributed_runtime.md
architecture/dynamo_flow.md design_docs/dynamo_flow.md
backends/vllm/deepseek-r1.md backends/vllm/deepseek-r1.md
backends/vllm/gpt-oss.md backends/vllm/gpt-oss.md
backends/vllm/multi-node.md backends/vllm/multi-node.md
backends/vllm/prometheus.md backends/vllm/prometheus.md
benchmarks/kv-router-ab-testing.md
.. TODO: architecture/distributed_runtime.md and architecture/dynamo_flow.md .. TODO: architecture/distributed_runtime.md and architecture/dynamo_flow.md
have some outdated names/references and need a refresh. have some outdated names/references and need a refresh.
...@@ -43,22 +43,28 @@ Quickstart ...@@ -43,22 +43,28 @@ Quickstart
Quickstart <self> Quickstart <self>
Installation <_sections/installation> Installation <_sections/installation>
Support Matrix <reference/support-matrix.md> Support Matrix <reference/support-matrix.md>
Architecture <_sections/architecture>
Examples <_sections/examples> Examples <_sections/examples>
.. toctree:: .. toctree::
:hidden: :hidden:
:caption: Kubernetes Deployment :caption: Kubernetes Deployment
Quickstart (K8s) <../kubernetes/README.md> Deployment Guide <_sections/k8s_deployment>
Detailed Installation Guide <../kubernetes/installation_guide.md> Observability (K8s) <_sections/k8s_observability>
Creating Deployments <../kubernetes/create_deployment.md> Multinode <_sections/k8s_multinode>
API Reference <../kubernetes/api_reference.md>
Dynamo Operator <../kubernetes/dynamo_operator.md> .. toctree::
Metrics <../kubernetes/metrics.md> :hidden:
Logging <../kubernetes/logging.md> :caption: User Guides
Multinode <../kubernetes/multinode-deployment.md>
Minikube Setup <../kubernetes/minikube.md> Tool Calling <agents/tool-calling.md>
Multimodality Support <multimodal/multimodal_intro.md>
Finding Best Initial Configs <performance/aiconfigurator.md>
Benchmarking <benchmarks/benchmarking.md>
Tuning Disaggregated Performance <performance/tuning.md>
Writing Python Workers in Dynamo <development/backend-guide.md>
Observability (Local) <_sections/observability>
Glossary <reference/glossary.md>
.. toctree:: .. toctree::
:hidden: :hidden:
...@@ -71,13 +77,9 @@ Quickstart ...@@ -71,13 +77,9 @@ Quickstart
.. toctree:: .. toctree::
:hidden: :hidden:
:caption: Developer Guide :caption: Design Docs
Benchmarking Guide <benchmarks/benchmarking.md> Overall Architecture <design_docs/architecture.md>
KV Router A/B Testing <benchmarks/kv-router-ab-testing.md> Architecture Flow <design_docs/dynamo_flow.md>
SLA Planner (Autoscaling) Quickstart <planner/sla_planner_quickstart> Disaggregated Serving <design_docs/disagg_serving.md>
Logging <observability/logging.md> Distributed Runtime <design_docs/distributed_runtime.md>
Health Checks <observability/health-checks.md>
Tuning Disaggregated Serving Performance <performance/tuning.md>
Writing Python Workers in Dynamo <development/backend-guide.md>
Glossary <reference/glossary.md>
...@@ -15,14 +15,26 @@ See the License for the specific language governing permissions and ...@@ -15,14 +15,26 @@ See the License for the specific language governing permissions and
limitations under the License. limitations under the License.
--> -->
# Deploying Inference Graphs to Kubernetes # Deploying Dynamo on Kubernetes
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides. High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
## Pre-deployment Checks ## Important Terminology
**Kubernetes Namespace**: The K8s namespace where your DynamoGraphDeployment resource is created.
- Used for: Resource isolation, RBAC, organizing deployments
- Example: `dynamo-system`, `dynamo-cloud`, `team-a-namespace`
**Dynamo Namespace**: The logical namespace used by Dynamo components for service discovery via etcd.
- Used for: Runtime component communication, service discovery
- Specified in: `.spec.services.<ServiceName>.dynamoNamespace` field
- Example: `my-llm`, `production-model`, `dynamo-dev`
Before deploying the platform, it is recommended to run the pre-deployment checks to ensure the cluster is ready for deployment. Please refer to the [pre-deployment checks](/deploy/cloud/pre-deployment/README.md) for more details. These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
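The distinction between the two namespaces can be made concrete with a small sketch that reads both values out of a manifest. The dictionary below stands in for a parsed DynamoGraphDeployment; the resource name, the service names, and the helper `namespaces_of` are hypothetical, and only the `metadata.namespace` and `.spec.services.<ServiceName>.dynamoNamespace` paths come from the definitions above.

```python
from typing import Any, Dict, Set, Tuple


def namespaces_of(manifest: Dict[str, Any]) -> Tuple[str, Set[str]]:
    """Return (Kubernetes namespace, set of Dynamo namespaces) for one manifest."""
    k8s_ns = manifest["metadata"]["namespace"]
    services = manifest["spec"]["services"]
    dynamo_ns = {
        svc["dynamoNamespace"]
        for svc in services.values()
        if "dynamoNamespace" in svc
    }
    return k8s_ns, dynamo_ns


# Hypothetical parsed manifest: one Kubernetes namespace, one Dynamo namespace.
manifest = {
    "metadata": {"name": "example", "namespace": "team-a-namespace"},
    "spec": {
        "services": {
            "Frontend": {"dynamoNamespace": "my-llm"},
            "Worker": {"dynamoNamespace": "my-llm"},
        }
    },
}

k8s_ns, dynamo_ns = namespaces_of(manifest)
```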
## Pre-deployment Checks
Before deploying the platform, it is recommended to run the pre-deployment checks to ensure the cluster is ready for deployment. Please refer to the [pre-deployment checks](../../deploy/cloud/pre-deployment/README.md) for more details.
## 1. Install Platform First ## 1. Install Platform First
...@@ -31,7 +43,7 @@ Before deploying the platform, it is recommended to run the pre-deployment check ...@@ -31,7 +43,7 @@ Before deploying the platform, it is recommended to run the pre-deployment check
export NAMESPACE=dynamo-system export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# 2. Install CRDs # 2. Install CRDs (skip if on shared cluster where CRDs already exist)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
...@@ -40,22 +52,29 @@ helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$ ...@@ -40,22 +52,29 @@ helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
``` ```
For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](/docs/kubernetes/installation_guide.md)**. **For Shared/Multi-Tenant Clusters:**
If your cluster has namespace-restricted Dynamo operators, add this flag to step 3:
```bash
--set dynamo-operator.namespaceRestriction.enabled=true
```
For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](./installation_guide.md)**.
## 2. Choose Your Backend ## 2. Choose Your Backend
Each backend has deployment examples and configuration options: Each backend has deployment examples and configuration options:
| Backend | Available Configurations | | Backend | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node |
|---------|--------------------------| |--------------|:----------:|:-------------------:|:-------------:|:----------------------:|:-----------------------:|:------------------------:|
| **[vLLM](/components/backends/vllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner, Disaggregated Multi-node | | **[SGLang](../../components/backends/sglang/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **[SGLang](/components/backends/sglang/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node | | **[TensorRT-LLM](../../components/backends/trtllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ |
| **[TensorRT-LLM](/components/backends/trtllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated Multi-node | | **[vLLM](../../components/backends/vllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
## 3. Deploy Your First Model ## 3. Deploy Your First Model
```bash ```bash
export NAMESPACE=dynamo-cloud export NAMESPACE=dynamo-system
kubectl create namespace ${NAMESPACE} kubectl create namespace ${NAMESPACE}
# to pull model from HF # to pull model from HF
...@@ -75,6 +94,8 @@ kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE} ...@@ -75,6 +94,8 @@ kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models curl http://localhost:8000/v1/models
``` ```
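The same check can be scripted. The sketch below assumes the frontend returns an OpenAI-style model list (a JSON object with a `data` array whose entries carry an `id`); that response shape and the helper names are assumptions, so adjust them to what your deployment actually returns.

```python
import json
from typing import List
from urllib.request import urlopen


def parse_model_ids(body: bytes) -> List[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:8000") -> List[str]:
    """Query the port-forwarded frontend and return the served model IDs."""
    with urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return parse_model_ids(resp.read())
```

Calling `list_models()` while the `kubectl port-forward` command is running should return a non-empty list once the worker has registered its model.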
For SLA-based autoscaling, see [SLA Planner Quick Start Guide](../planner/sla_planner_quickstart.md).
## Understanding Dynamo's Custom Resources ## Understanding Dynamo's Custom Resources
Dynamo provides two main Kubernetes Custom Resources for deploying models: Dynamo provides two main Kubernetes Custom Resources for deploying models:
...@@ -103,15 +124,15 @@ A lower-level interface that defines your complete inference pipeline: ...@@ -103,15 +124,15 @@ A lower-level interface that defines your complete inference pipeline:
Use this when you need fine-grained control or have already completed profiling. Use this when you need fine-grained control or have already completed profiling.
Refer to the [API Reference and Documentation](/docs/kubernetes/api_reference.md) for more details. Refer to the [API Reference and Documentation](./api_reference.md) for more details.
## 📖 API Reference & Documentation ## 📖 API Reference & Documentation
For detailed technical specifications of Dynamo's Kubernetes resources: For detailed technical specifications of Dynamo's Kubernetes resources:
- **[API Reference](/docs/kubernetes/api_reference.md)** - Complete CRD field specifications for all Dynamo resources - **[API Reference](./api_reference.md)** - Complete CRD field specifications for all Dynamo resources
- **[Create Deployment](/docs/kubernetes/create_deployment.md)** - Step-by-step deployment creation with DynamoGraphDeployment - **[Create Deployment](./deployment/create_deployment.md)** - Step-by-step deployment creation with DynamoGraphDeployment
- **[Operator Guide](/docs/kubernetes/dynamo_operator.md)** - Dynamo operator configuration and management - **[Operator Guide](./dynamo_operator.md)** - Dynamo operator configuration and management
### Choosing Your Architecture Pattern ### Choosing Your Architecture Pattern
...@@ -194,13 +215,13 @@ Key customization points include: ...@@ -194,13 +215,13 @@ Key customization points include:
## Additional Resources ## Additional Resources
- **[Examples](/examples/README.md)** - Complete working examples - **[Examples](../examples/README.md)** - Complete working examples
- **[Create Custom Deployments](/docs/kubernetes/create_deployment.md)** - Build your own CRDs - **[Create Custom Deployments](./deployment/create_deployment.md)** - Build your own CRDs
- **[Operator Documentation](/docs/kubernetes/dynamo_operator.md)** - How the platform works - **[Operator Documentation](./dynamo_operator.md)** - How the platform works
- **[Helm Charts](/deploy/helm/README.md)** - For advanced users - **[Helm Charts](../../deploy/helm/README.md)** - For advanced users
- **[GitOps Deployment with FluxCD](/docs/kubernetes/fluxcd.md)** - For advanced users - **[GitOps Deployment with FluxCD](./fluxcd.md)** - For advanced users
- **[Logging](/docs/kubernetes/logging.md)** - For logging setup - **[Logging](./observability/logging.md)** - For logging setup
- **[Multinode Deployment](/docs/kubernetes/multinode-deployment.md)** - For multinode deployment - **[Multinode Deployment](./deployment/multinode-deployment.md)** - For multinode deployment
- **[Grove](/docs/kubernetes/grove.md)** - For grove details and custom installation - **[Grove](./grove.md)** - For grove details and custom installation
- **[Monitoring](/docs/kubernetes/metrics.md)** - For monitoring setup - **[Monitoring](./observability/metrics.md)** - For monitoring setup
- **[Model Caching with Fluid](/docs/kubernetes/model_caching_with_fluid.md)** - For model caching with Fluid - **[Model Caching with Fluid](./model_caching_with_fluid.md)** - For model caching with Fluid
# Creating Kubernetes Deployments # Creating Kubernetes Deployments
The scripts in the `components/<backend>/launch` folder like [agg.sh](../../../components/backends/vllm/launch/agg.sh) demonstrate how you can serve your models locally. The scripts in the `components/<backend>/launch` folder like [agg.sh](../../../components/backends/vllm/launch/agg.sh) demonstrate how you can serve your models locally.
The corresponding YAML files like [agg.yaml](../../../components/backends/vllm/deploy/agg.yaml) show you how you could create a kubernetes deployment for your inference graph. The corresponding YAML files like [agg.yaml](../../../components/backends/vllm/deploy/agg.yaml) show you how you could create a Kubernetes deployment for your inference graph.
This guide explains how to create your own deployment files. This guide explains how to create your own deployment files.
## Step 1: Choose Your Architecture Pattern ## Step 1: Choose Your Architecture Pattern
Before choosing a template, understand the different architecture patterns:
### Aggregated Serving (agg.yaml)
**Pattern**: Prefill and decode on the same GPU in a single process.
**Suggested to use for**:
- Small to medium models (under 70B parameters)
- Development and testing
- Low to moderate traffic
- Simplicity is prioritized over maximum throughput
**Tradeoffs**:
- Simpler setup and debugging
- Lower operational complexity
- GPU utilization may not be optimal (prefill and decode compete for resources)
- Lower throughput ceiling compared to disaggregated
**Example**: [`agg.yaml`](../../../components/backends/vllm/deploy/agg.yaml)
### Aggregated + Router (agg_router.yaml)
**Pattern**: Load balancer routing across multiple aggregated worker instances.
**Suggested to use for**:
- Medium traffic requiring high availability
- Need horizontal scaling
- Want some load balancing without disaggregation complexity
**Tradeoffs**:
- Better scalability than plain aggregated
- High availability through multiple replicas
- Still has GPU underutilization issues of aggregated serving
- More complex than plain aggregated but simpler than disaggregated
**Example**: [`agg_router.yaml`](../../../components/backends/vllm/deploy/agg_router.yaml)
### Disaggregated Serving (disagg_router.yaml)
**Pattern**: Separate prefill and decode workers with specialized optimization.
**Recommended for**:
- Production-style deployments
- High throughput requirements
- Large models (70B+ parameters)
- Maximum GPU utilization needed
**Tradeoffs**:
- Maximum performance and throughput
- Better GPU utilization (prefill and decode specialized)
- Independent scaling of prefill and decode
- More complex setup and debugging
- Requires understanding of prefill/decode separation
**Example**: [`disagg_router.yaml`](../../../components/backends/vllm/deploy/disagg_router.yaml)
### Quick Selection Guide
Select the architecture pattern that best fits your use case and use it as your template.
For example, when using the `vLLM` backend:
- **Development / Testing**: Use [`agg.yaml`](../../../components/backends/vllm/deploy/agg.yaml) as the base configuration.
- **Production with Load Balancing**: Use [`agg_router.yaml`](../../../components/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.
- **High Performance / Disaggregated Deployment**: Use [`disagg_router.yaml`](../../../components/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.
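The selection guide above can be captured in a small helper. This is a minimal sketch: the function name `select_template` and the use-case keywords (`dev`, `test`, `load-balanced`, `high-performance`) are hypothetical, and only the three file names and the deploy folder path come from this guide.

```bash
# Hypothetical helper: map a use case to one of the vLLM deployment templates.
# The keywords are illustrative; only the YAML file names come from this guide.
select_template() {
  case "$1" in
    dev|test)         echo "agg.yaml" ;;
    load-balanced)    echo "agg_router.yaml" ;;
    high-performance) echo "disagg_router.yaml" ;;
    *)
      echo "unknown use case: $1" >&2
      return 1
      ;;
  esac
}

# Usage: resolve the template path relative to the repository root.
TEMPLATE="components/backends/vllm/deploy/$(select_template dev)"
echo "Base template: ${TEMPLATE}"
```

Copy the resolved file and edit the copy rather than the original, so the reference templates stay intact.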
## Step 2: Customize the Template
...@@ -90,7 +144,7 @@ Consult the corresponding sh file. Each of the python commands to launch a compo
The front end is launched with `python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]`.
Each worker is launched with a `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags` command.
If you are a Dynamo contributor, see the [dynamo run guide](../../reference/cli.md) for details on how to run this command.
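To make the shape of these launch commands concrete, the sketch below assembles them as strings. The helper names `frontend_cmd` and `worker_cmd`, the model name `my-org/my-model`, and the flag `--example-flag` are hypothetical placeholders; the module names and the `--http-port`, `--router-mode`, and `--model` flags are the ones quoted above.

```bash
# Hypothetical helpers that print the launch commands described above.
frontend_cmd() {
  local port="${1:-8000}" router_mode="${2:-}"
  local cmd="python3 -m dynamo.frontend --http-port ${port}"
  # --router-mode is optional; add it only when a mode is given.
  if [ -n "${router_mode}" ]; then
    cmd="${cmd} --router-mode ${router_mode}"
  fi
  echo "${cmd}"
}

worker_cmd() {
  local backend="$1" model="$2"
  shift 2
  local cmd="python -m dynamo.${backend} --model ${model}"
  # Any remaining arguments are passed through as backend-specific flags.
  if [ "$#" -gt 0 ]; then
    cmd="${cmd} $*"
  fi
  echo "${cmd}"
}

# Usage: print a frontend with KV routing and one vLLM worker.
frontend_cmd 8000 kv
worker_cmd vllm my-org/my-model --example-flag value
```

Comparing the printed strings with the commands in the corresponding `.sh` launch script is a quick way to confirm that a customized template still starts each component the same way.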
## Step 3: Key Customization Points
......
...@@ -58,5 +58,5 @@ kubectl get storageclass
## Next Steps
Once your local environment is set up, you can proceed with the [Dynamo Kubernetes Platform installation guide](../installation_guide.md) to deploy the platform to your local cluster.
...@@ -23,11 +23,57 @@ Dynamo operator is a Kubernetes operator that simplifies the deployment, configu
For the complete technical API reference for Dynamo Custom Resource Definitions, see:
**📖 [Dynamo CRD API Reference](./api_reference.md)**
## Installation
### Quick Install with Helm
```bash
# Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# Install Platform (includes operator)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
```
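Because `RELEASE_VERSION=0.x.x` is only a placeholder, it can help to validate the value before fetching the chart. The sketch below is one way to do that; the function name `chart_url` is hypothetical, the version `0.5.0` is just an example value, and the URL pattern is the one used in the command above.

```bash
# Hypothetical guard: refuse to build a chart URL from an empty value or
# from the 0.x.x placeholder (any value that still contains an "x").
chart_url() {
  local version="$1"
  case "${version}" in
    ""|*x*)
      echo "set RELEASE_VERSION to a real release from the releases page" >&2
      return 1
      ;;
  esac
  echo "https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${version}.tgz"
}

# Usage: print the URL for an example version. To fetch for real, run:
#   helm fetch "$(chart_url "${RELEASE_VERSION}")"
chart_url "0.5.0"
```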
For namespace-restricted installations (shared clusters):
```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace ${NAMESPACE} \
--create-namespace \
--set dynamo-operator.namespaceRestriction.enabled=true
```
### Building from Source
```bash
# Set environment
export NAMESPACE=dynamo-system
export DOCKER_SERVER=your-registry.com # your container registry (no trailing slash; one is added below)
export IMAGE_TAG=latest
# Build operator image
cd deploy/cloud/operator
docker build -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG .
docker push $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG
cd -
# Install CRDs
cd deploy/cloud/helm
helm install dynamo-crds ./crds/ --namespace default
# Install platform with custom operator image
helm install dynamo-platform ./platform/ \
--namespace ${NAMESPACE} \
--create-namespace \
--set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
--set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}"
```
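The build and install commands above join `DOCKER_SERVER` and the image name with a `/`, so a registry value that already ends in a slash would produce a doubled separator. A minimal sketch of a normalizing helper follows; the function name `image_ref` is hypothetical, and the image name `dynamo-operator` is the one used above.

```bash
# Hypothetical helper: build the operator image reference, tolerating a
# trailing slash on the registry value.
image_ref() {
  local server="${1%/}" tag="$2"   # ${1%/} strips one trailing "/" if present
  echo "${server}/dynamo-operator:${tag}"
}

# Usage: both spellings of the registry yield the same reference.
image_ref "your-registry.com/" "latest"
image_ref "your-registry.com" "latest"
```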
For detailed installation options, see the [Installation Guide](./installation_guide.md).
## Development
......
# GitOps Deployment with FluxCD
This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../backends/vllm/README.md) to demonstrate the workflow.
## Prerequisites
- A Kubernetes cluster with [Dynamo Cloud](./installation_guide.md) installed
- [FluxCD](https://fluxcd.io/flux/installation/) installed in your cluster
- A Git repository to store your deployment configurations
...@@ -18,7 +18,7 @@ The GitOps workflow for Dynamo deployments consists of three main steps:
## Step 1: Build and Push Dynamo Cloud Operator
First, follow the [Install Dynamo Cloud](./installation_guide.md) guide.
## Step 2: Create Initial Deployment
......
# GKE Workload Identity and Artifact Registry Setup Guide
This guide explains how to set up Workload Identity in GKE and configure access to Google Artifact Registry.
## Prerequisites
- Google Cloud SDK installed
- Access to a GKE cluster
- Required permissions to create and manage service accounts
## Project Setup
Set your project:
```bash
export NAMESPACE=your-k8s-namespace
export RELEASE=your-helm-release-name
export PROJECT=$(gcloud config get-value project)
# set the cluster related info (you can list cluster using gcloud container clusters list)
export CLUSTER_NAME=your-cluster-name
export CLUSTER_REGION=$(gcloud container clusters list --filter="name=${CLUSTER_NAME}" --format="get(location)")
gcloud config set project ${PROJECT}
# Retrieve the Workload Identity pool associated with your cluster:
export CLUSTER_WIN=$(gcloud container clusters describe ${CLUSTER_NAME} \
--region=${CLUSTER_REGION} \
--format="value(workloadIdentityConfig.workloadPool)")
```
```{important}
Make sure Workload Identity is enabled in your cluster!
```
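One way to confirm that Workload Identity is enabled is to inspect the value captured in `CLUSTER_WIN`: a cluster with the feature enabled reports a workload pool of the form `PROJECT_ID.svc.id.goog`, while an empty value means it is off. The sketch below assumes that convention; the function name `check_workload_pool` and the sample value `my-project.svc.id.goog` are hypothetical.

```bash
# Hypothetical check on the workload pool value retrieved above.
check_workload_pool() {
  local pool="$1"
  case "${pool}" in
    *.svc.id.goog)
      echo "Workload Identity pool: ${pool}"
      ;;
    "")
      echo "Workload Identity is not enabled on this cluster" >&2
      return 1
      ;;
    *)
      echo "unexpected workload pool value: ${pool}" >&2
      return 1
      ;;
  esac
}

# Usage: check the variable set above (falls back to a sample value when unset).
check_workload_pool "${CLUSTER_WIN:-my-project.svc.id.goog}"
```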
## Service Account Creation and Configuration
1. Create a service account for Workload Identity:
Create a new service account with `gcloud` (or reuse an existing one, in which case set `SA` to its email address):
```bash
gcloud iam service-accounts create workload-identity-sa \
    --display-name="workload identity service account" \
    --description="Service account to use for Workload Identity in GKE"
export SA=workload-identity-sa@${PROJECT}.iam.gserviceaccount.com
```
2. Configure Workload Identity bindings for Kubernetes service accounts:
```bash
gcloud iam service-accounts add-iam-policy-binding \
${SA} \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:${CLUSTER_WIN}[${NAMESPACE}/${RELEASE}-dynamo-operator-controller-manager]"
gcloud iam service-accounts add-iam-policy-binding \
${SA} \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:${CLUSTER_WIN}[${NAMESPACE}/${RELEASE}-dynamo-operator-image-builder]"
gcloud iam service-accounts add-iam-policy-binding \
${SA} \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:${CLUSTER_WIN}[${NAMESPACE}/${RELEASE}-dynamo-operator-component]"
```
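The three bindings above differ only in the Kubernetes service account suffix, so they can also be expressed as a loop. The sketch below is a dry run that prints the commands instead of executing them; the helper name `wi_member` and the fallback values (`SA_EMAIL`, `POOL`, `NS`, `REL`) are hypothetical placeholders used only when the variables from the earlier steps are unset.

```bash
# The three Kubernetes service account suffixes bound above.
KSA_SUFFIXES="controller-manager image-builder component"

# Hypothetical helper: build the IAM member string for one service account.
wi_member() {
  local pool="$1" namespace="$2" release="$3" suffix="$4"
  echo "serviceAccount:${pool}[${namespace}/${release}-dynamo-operator-${suffix}]"
}

# Dry run: print the equivalent gcloud commands. Remove the leading "echo"
# to actually apply the bindings.
for suffix in ${KSA_SUFFIXES}; do
  echo gcloud iam service-accounts add-iam-policy-binding "${SA:-SA_EMAIL}" \
    --role roles/iam.workloadIdentityUser \
    --member "$(wi_member "${CLUSTER_WIN:-POOL}" "${NAMESPACE:-NS}" "${RELEASE:-REL}" "${suffix}")"
done
```

Printing first makes it easy to check that the namespace and release name match the service accounts that the Helm release actually creates.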
## Artifact Registry Access
### Option 1: Project-Level Access
Grant read and write access at the project level:
```bash
# Grant reader role
gcloud projects add-iam-policy-binding ${PROJECT} \
--member="serviceAccount:${SA}" \
--role="roles/artifactregistry.reader"
# Grant writer role
gcloud projects add-iam-policy-binding ${PROJECT} \
--member="serviceAccount:${SA}" \
--role="roles/artifactregistry.writer"
```
### Option 2: Repository-Level Access
Grant access to a specific repository:
```bash
gcloud artifacts repositories add-iam-policy-binding your-artifact-repository \
--location=${CLUSTER_REGION} \
--project=${PROJECT} \
--member="serviceAccount:${SA}" \
--role="roles/artifactregistry.reader"
```
## GKE Node Access to Artifact Registry
This step lets pods pull images from Artifact Registry without specifying an `imagePullSecret`.
### For GKE Autopilot
```bash
# Get project number
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT} --format='value(projectNumber)')
# Grant access to the default compute service account
gcloud projects add-iam-policy-binding ${PROJECT} \
--member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
--role="roles/artifactregistry.reader"
```
### For Standard GKE
```bash
# Get node service account
export NODE_SERVICE_ACCOUNT=$(gcloud container clusters describe ${CLUSTER_NAME} \
--region ${CLUSTER_REGION} \
--format="get(nodeConfig.serviceAccount)")
# Grant access to node service account
gcloud projects add-iam-policy-binding ${PROJECT} \
--member="serviceAccount:${NODE_SERVICE_ACCOUNT}" \
--role="roles/artifactregistry.reader"
```
## Adding annotations to enable Workload Identity
The following example `values.yaml` deploys Dynamo Cloud with custom GCP annotations that enable Workload Identity.
```yaml
dynamo-operator:
  ...
  controllerManager:
    serviceAccount:
      create: true
      annotations:
        iam.gke.io/gcp-service-account: your-sa@your-gcp-project.iam.gserviceaccount.com
  ...
  dynamo:
    components:
      serviceAccount:
        annotations:
          iam.gke.io/gcp-service-account: your-sa@your-gcp-project.iam.gserviceaccount.com
  ...
....
```
Pass this file to Helm during installation:
```bash
helm upgrade --install ${RELEASE} platform/ -f values.yaml --namespace ${NAMESPACE}
```
## Important Notes
1. **Prerequisites for Image Pulling**:
- Workload Identity must be enabled on your GKE cluster
- GKE nodes' service account must have the `artifactregistry.reader` role
2. **Troubleshooting**:
- If pods can't pull images, verify both Workload Identity and node service account configurations
- Check service account annotations on Kubernetes service accounts
- Verify IAM bindings are correctly set up
## References
- [GKE Workload Identity Documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity)
- [Artifact Registry Authentication](https://cloud.google.com/artifact-registry/docs/docker/authentication)
- [IAM Roles for Artifact Registry](https://cloud.google.com/artifact-registry/docs/access-control)
\ No newline at end of file