For detailed documentation on collecting and visualizing metrics on Kubernetes, see [docs/guides/deploy/k8s_metrics.md](../../../docs/guides/deploy/k8s_metrics.md).
For detailed documentation on collecting and visualizing metrics on Kubernetes, see [docs/guides/dynamo_deploy/k8s_metrics.md](../../../docs/guides/dynamo_deploy/k8s_metrics.md).
@@ -48,7 +48,7 @@ There are multi-faceted challenges:
To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.
## High level architecture and key benefits
## Key benefits
The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
@@ -17,7 +17,7 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy
* **Performance interpolation**: Uses pre-deployment profiling results to make accurate scaling decisions
* **Correction factors**: Adapts to real-world performance deviations from profiled data
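
The two ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the planner's actual code: the `PROFILE` table, its numbers, and the function names `interpolate_ttft`, `correction_factor`, and `corrected_estimate` are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical pre-deployment profile: (input sequence length, TTFT in ms).
PROFILE = [(512, 40.0), (1024, 85.0), (2048, 190.0), (4096, 420.0)]


def interpolate_ttft(isl: float) -> float:
    """Linearly interpolate TTFT between the two nearest profiled points."""
    xs = [x for x, _ in PROFILE]
    if isl <= xs[0]:
        return PROFILE[0][1]   # clamp below the profiled range
    if isl >= xs[-1]:
        return PROFILE[-1][1]  # clamp above the profiled range
    i = bisect_left(xs, isl)
    (x0, y0), (x1, y1) = PROFILE[i - 1], PROFILE[i]
    return y0 + (y1 - y0) * (isl - x0) / (x1 - x0)


def correction_factor(observed_ms: float, isl: float) -> float:
    """Ratio of observed to profiled latency; above 1.0 means slower than profiled."""
    return observed_ms / interpolate_ttft(isl)


def corrected_estimate(isl: float, factor: float) -> float:
    """Scale the interpolated estimate by the most recent correction factor."""
    return interpolate_ttft(isl) * factor
```

A planner built this way would compare `corrected_estimate(...)` against the SLA target when deciding how many replicas to request, so that a deployment running slower than its profile scales up earlier.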
## Architecture
## Design
The SLA planner consists of several key components:
...
@@ -108,7 +108,7 @@ Finally, SLA planner applies the change by scaling up/down the number of prefill
For detailed deployment instructions including setup, configuration, troubleshooting, and architecture overview, see the [SLA Planner Deployment Guide](../guides/dynamo_deploy/sla_planner_deployment.md).
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
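
As a rough illustration of that cost model, the sketch below scores each worker by the blocks it would have to prefill plus the blocks it is already decoding, and picks the cheapest. The `WorkerState` fields, the `prefill_weight` knob, and the exact formula are assumptions made for this example; the real router's scoring may differ.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    name: str
    active_blocks: int  # blocks held by in-flight decodes (decode cost)
    cached_blocks: int  # request prefix blocks already in this worker's KV cache


def route_cost(worker: WorkerState, request_blocks: int, prefill_weight: float = 1.0) -> float:
    """Estimated cost = weighted newly computed blocks + current decode load."""
    new_blocks = max(request_blocks - worker.cached_blocks, 0)
    return prefill_weight * new_blocks + worker.active_blocks


def pick_worker(workers: list[WorkerState], request_blocks: int) -> WorkerState:
    """Choose the worker with the lowest estimated cost for this request."""
    return min(workers, key=lambda w: route_cost(w, request_blocks))
```

With this scoring, a busier worker can still win when it already holds most of the request's prefix, which is the trade-off KV-aware routing is meant to capture.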
@@ -17,85 +17,130 @@ limitations under the License.
# Deploying Inference Graphs to Kubernetes
We expect users to deploy their inference graphs using CRDs or Helm charts.
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
# 1. Install Dynamo Cloud.
## 1. Install Platform First
**[Dynamo Kubernetes Platform](dynamo_cloud.md)** - Main installation guide with 3 paths
Before deploying an inference graph, deploy the Dynamo Cloud Platform. See the [Quickstart Guide](quickstart.md) for the steps to install Dynamo Cloud with Helm.
## 2. Choose Your Backend
Dynamo Cloud acts as an orchestration layer between the end user and Kubernetes, handling the complexity of deploying your graphs for you. This is a one-time action, only necessary the first time you deploy a DynamoGraph.
Each backend has deployment examples and configuration options:
We provide a Custom Resource YAML file for many examples under the components/backends/{engine}/deploy folders. Consult the examples below for the CRs for a specific inference backend.
You can use `kubectl get dynamoGraphDeployment -n ${NAMESPACE}` to view your deployment.
The scripts in the `components/<backend>/launch` folder, such as `agg.sh`, demonstrate how to serve your models locally. The corresponding YAML files, such as `agg.yaml`, show how to create a Kubernetes deployment for your inference graph.
You can use `kubectl delete dynamoGraphDeployment <your-dep-name> -n ${NAMESPACE}` to delete the deployment.
We provide a Custom Resource YAML file for many examples under the `deploy/` folder.
### Choosing Your Architecture Pattern
Use [VLLM YAML](../../../components/backends/vllm/deploy/agg.yaml) for an example.
**Note 1** Example Image
When creating a deployment, select the architecture pattern that best fits your use case:
The examples use a prebuilt image from the `nvcr.io` registry.
- **Development / Testing** - Use `agg.yaml` as the base configuration
You can utilize public images from [Dynamo NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) or build your own image and update the image location in your CR file prior to applying. Either way, you will need to overwrite the image in the example YAML.
- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference
- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability
To build your own image:
### Frontend and Worker Components
```bash
You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:
@@ -15,102 +15,167 @@ See the License for the specific language governing permissions and
limitations under the License.
limitations under the License.
-->
-->
# Dynamo Cloud Kubernetes Platform
# Dynamo Kubernetes Platform
The Dynamo Cloud platform is a comprehensive solution for deploying and managing Dynamo inference graphs (also referred to as pipelines) in Kubernetes environments. It provides a streamlined experience for deploying, scaling, and monitoring your inference services.
Deploy and manage Dynamo inference graphs on Kubernetes with automated orchestration and scaling, using the Dynamo Kubernetes Platform.
## Overview
## Quick Start Paths
The Dynamo cloud platform consists of several key components:
**Path A: Production Install**
Install from published artifacts on your existing cluster → [Jump to Path A](#path-a-production-install)
- **Dynamo Operator**: A Kubernetes operator that manages the lifecycle of Dynamo inference graphs from build ➡️ deploy. For more information on the operator, see [Dynamo Kubernetes Operator Documentation](../dynamo_deploy/dynamo_operator.md)
**Path B: Local Development**
- **Custom Resources**: Kubernetes custom resources for defining and managing Dynamo services
Set up Minikube first → [Minikube Setup](minikube.md) → Then follow Path A
**Path C: Custom Development**
Build from source for customization → [Jump to Path C](#path-c-custom-development)
## Deployment Prerequisites
## Prerequisites
Before getting started with the Dynamo cloud platform, ensure you have:
- A Kubernetes cluster (version 1.24 or later)
- [Earthly](https://earthly.dev/) installed for building components
- Docker installed and running
- Access to a container registry (e.g., Docker Hub, NVIDIA NGC, etc.)
# Also available: sglang-runtime, tensorrtllm-runtime
```
> [!TIP]
> Don't have a Kubernetes cluster? Check out our [Minikube setup guide](../../../docs/guides/dynamo_deploy/minikube.md) to set up a local environment! 🏠
> No cluster? See [Minikube Setup](minikube.md) for local development.
#### 🏗️ Build Dynamo inference runtime.
## Path A: Production Install
[One-time Action]
Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts) in 3 steps.
Before you can use Dynamo, make sure you have set up the Inference Runtime image.
For basic cases, you can use the prebuilt image for the Dynamo Inference Runtime.
Export the environment variable; this is the image your individual components will use. Pick any Dynamo version you want, or use the latest (the default).
⚠️ **Note:** Omitting `--crds` skips CRD installation and upgrade. This is useful when installing on a shared cluster, because CRDs are cluster-scoped resources.
⚠️ **Note:** To generate only the `generated-values.yaml` file without deploying to Kubernetes (e.g., for inspection, CI workflows, or dry-run testing), use:
Grove is a Kubernetes API specifically designed to address the orchestration challenges of modern AI workloads, particularly disaggregated inference systems. Grove provides seamless integration with NVIDIA Dynamo for comprehensive AI infrastructure management.
## Overview
Grove was originally motivated by the challenges of orchestrating multinode, disaggregated inference systems. It provides a consistent and unified API that allows users to define, configure, and scale prefill, decode, and any other components like routing within a single custom resource.
### How Grove Works for Disaggregated Serving
Grove enables disaggregated serving by breaking down large language model inference into separate, specialized components that can be independently scaled and managed. This architecture provides several advantages:
- **Component Specialization**: Separate prefill, decode, and routing components optimized for their specific tasks
- **Independent Scaling**: Each component can scale based on its individual resource requirements and workload patterns
- **Resource Optimization**: Better utilization of hardware resources through specialized workload placement
- **Fault Isolation**: Issues in one component don't necessarily affect others
## Core Components and API Resources
Grove implements disaggregated serving through several custom Kubernetes resources that provide declarative composition of role-based pod groups:
### PodGangSet
The top-level Grove object that defines a group of components managed and colocated together. Key features include:
- Support for autoscaling
- Topology-aware spread of replicas for availability
- Unified management of multiple disaggregated components
### PodClique
Represents a group of pods with a specific role (e.g., leader, worker, frontend). Each clique features:
- Independent configuration options
- Custom scaling logic support
- Role-specific resource allocation
### PodCliqueScalingGroup
A set of PodCliques that scale and are scheduled together, ideal for tightly coupled roles like prefill leader and worker components that need coordinated scaling behavior.
## Key Capabilities for Disaggregated Serving
Grove provides several specialized features that make it particularly well-suited for disaggregated serving:
### Flexible Gang Scheduling
PodCliques and PodCliqueScalingGroups allow users to specify flexible gang-scheduling requirements at multiple levels within a PodGangSet to prevent resource deadlocks and ensure all components of a disaggregated system start together.
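
The all-or-nothing idea behind gang scheduling can be sketched as follows. This is a conceptual example only: `admit_gang`, its parameters, and the use of GPUs as the single resource are assumptions for illustration and do not reflect Grove's scheduler or API.

```python
from typing import Optional


def admit_gang(
    min_pods: dict[str, int],
    gpus_per_pod: dict[str, int],
    free_gpus: int,
) -> Optional[dict[str, int]]:
    """Return a per-role GPU allocation only if the whole gang fits, else None.

    Admitting nothing instead of a partial gang avoids the deadlock where
    two gangs each hold part of the cluster and neither can finish starting.
    """
    allocation = {role: count * gpus_per_pod[role] for role, count in min_pods.items()}
    if sum(allocation.values()) > free_gpus:
        return None  # not enough room for every role: admit no pods at all
    return allocation
```

For example, a gang that needs two 1-GPU prefill pods and four 2-GPU decode pods is admitted with 10 free GPUs but rejected outright with 9, rather than starting prefill and leaving decode pending.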
### Multi-level Horizontal Auto-Scaling
Supports pluggable horizontal auto-scaling solutions to scale PodGangSet, PodClique, and PodCliqueScalingGroup custom resources independently based on their specific metrics and requirements.
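
Because the autoscaler is pluggable, the scaling rule depends on what you plug in. As one concrete possibility, the sketch below uses the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, applied to one resource at a time. The function name and the floor of one replica are assumptions for this example.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Proportional scaling: grow or shrink replicas by the metric-to-target ratio."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Keep at least one replica (an assumption for this sketch).
    return max(1, math.ceil(current_replicas * current_metric / target_metric))
```

Each level (a PodGangSet, a PodClique, or a PodCliqueScalingGroup) would call this with its own metric and target, which is what lets the levels scale independently.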
### Network Topology-Aware Scheduling
Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability, crucial for disaggregated systems where components need efficient inter-node communication.
### Custom Startup Dependencies
Prescribes the order in which PodCliques must start in a declarative specification, with pod startup decoupled from pod creation or scheduling. This ensures proper initialization order for disaggregated components.
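
A declarative startup order is essentially a dependency graph, and a valid order is a topological sort of it. The sketch below uses Python's standard `graphlib` to show the idea; the clique names and the `starts_after` mapping are hypothetical and are not Grove's specification format.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: clique -> cliques that must be started first.
starts_after = {
    "frontend": {"decode-leader", "prefill-leader"},
    "decode-worker": {"decode-leader"},
    "prefill-worker": {"prefill-leader"},
    "decode-leader": set(),
    "prefill-leader": set(),
}


def startup_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid start order; raises graphlib.CycleError on circular dependencies."""
    return list(TopologicalSorter(deps).static_order())
```

Any order returned this way starts each leader before its workers and starts the frontend only after both leaders, regardless of when the pods were created or scheduled.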
## Use Cases and Examples
Grove specifically supports:
- **Multi-node disaggregated inference** for large models such as DeepSeek-R1 and Llama-4-Maverick
- **Single-node disaggregated inference** for optimized resource utilization
- **Agentic pipelines of models** for complex AI workflows
- **Standard aggregated serving** patterns for single node or single GPU inference
## Integration with NVIDIA Dynamo
Grove is strategically aligned with NVIDIA Dynamo for seamless integration within the AI infrastructure stack:
### Complementary Roles
- **Grove**: Handles the Kubernetes orchestration layer for disaggregated AI workloads
- **Dynamo**: Provides comprehensive AI infrastructure capabilities including serving backends, routing, and resource management
### Release Coordination
Grove is aligning its release schedule with NVIDIA Dynamo to ensure seamless integration, with the finalized release cadence reflected in the project roadmap.
### Unified AI Platform
The integration creates a comprehensive platform where:
- Grove manages complex orchestration of disaggregated components
- Dynamo provides the serving infrastructure, routing capabilities, and backend integrations
- Together they enable sophisticated AI serving architectures with simplified management
## Architecture Benefits
Grove represents a significant advancement in Kubernetes-based orchestration for AI workloads by:
1. **Simplifying Complex Deployments**: Provides a unified API that can manage multiple components (prefill, decode, routing) within a single resource definition
2. **Enabling Sophisticated Architectures**: Supports advanced disaggregated inference patterns that were previously difficult to orchestrate
3. **Reducing Operational Complexity**: Abstracts away the complexity of coordinating multiple interdependent AI components
4. **Optimizing Resource Utilization**: Enables fine-grained control over component placement and scaling
## Getting Started
> **Note**: Grove is currently in development and aligning with NVIDIA Dynamo's release schedule.
For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md).
For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](multinode-deployment.md), which demonstrates multi-node disaggregated serving scenarios.
For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove).
@@ -7,7 +7,7 @@ This guide provides a walkthrough for collecting and visualizing metrics from Dy
## Prerequisites
### Install Dynamo Operator
Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Quickstart Guide](../dynamo_deploy/quickstart.md) for detailed instructions on deploying the Dynamo operator.
Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Installation Guide](../dynamo_deploy/dynamo_cloud.md) for detailed instructions on deploying the Dynamo operator.
### Install Prometheus Operator
If you don't have an existing Prometheus setup, you'll need to install the Prometheus Operator. The Prometheus Operator introduces custom resources that make it easy to deploy and manage Prometheus monitoring in Kubernetes:
...
@@ -39,7 +39,7 @@ This will create two components:
- A Worker component exposing metrics on its system port
Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
- Deployment configuration: See the [vLLM README](../../../components/backends/vllm/README.md)
- Deployment configuration: See the [vLLM README](../../components/backends/vllm/README.md)
- Available metrics: See the [metrics guide](../metrics.md)
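
To make the `/metrics` output concrete, here is a small sketch that filters Dynamo samples out of scraped text. The sample payload and the metric name `dynamo_component_requests_total` are invented for illustration, and the parser assumes the simple `name{labels} value` line shape with no trailing timestamp.

```python
# Hypothetical scrape output; real metric names and labels may differ.
SAMPLE = """\
# HELP dynamo_component_requests_total Total requests
# TYPE dynamo_component_requests_total counter
dynamo_component_requests_total{component="worker"} 42
process_cpu_seconds_total 1.5
"""


def dynamo_samples(text: str) -> dict[str, float]:
    """Return samples whose metric name starts with the `dynamo_` prefix."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.startswith("dynamo_"):
            samples[series] = float(value)
    return samples
```

In a real deployment, the text would come from an HTTP GET against each component's `/metrics` endpoint, and Prometheus would do this parsing for you; the sketch only shows what the scraped data looks like.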
### Validate the Deployment
...
@@ -47,7 +47,7 @@ Both components expose a `/metrics` endpoint following the OpenMetrics format, b
Let's send some test requests to populate metrics:
@@ -17,21 +17,19 @@ limitations under the License.
# Minikube Setup Guide
Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through the setup of everything you need to run Dynamo Cloud locally.
Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through the setup of everything you need to run Dynamo Kubernetes Platform locally.
## Setting Up Minikube
## 1. Install Minikube
### 1. Install Minikube
First things first! Start by installing Minikube. Follow the official [Minikube installation guide](https://minikube.sigs.k8s.io/docs/start/) for your operating system.
### 2. Configure GPU Support (Optional)
## 2. Configure GPU Support (Optional)
Planning to use GPU-accelerated workloads? You'll need to configure GPU support in Minikube. Follow the [Minikube GPU guide](https://minikube.sigs.k8s.io/docs/tutorials/nvidia/) to set up NVIDIA GPU support before proceeding.
```{tip}
Make sure to configure GPU support before starting Minikube if you plan to use GPU workloads!
```
Once your local environment is set up, you can proceed with the [Dynamo Cloud deployment guide](./dynamo_cloud.md) to deploy the platform to your local cluster.
Once your local environment is set up, you can proceed with the [Dynamo Kubernetes Platform deployment guide](./dynamo_cloud.md) to deploy the platform to your local cluster.
@@ -31,7 +31,7 @@ Dynamo automatically exposes metrics with the `dynamo_` name prefixes. It also a
**Specialized Component Metrics**: Components can also expose additional metrics specific to their functionality. For example, a `preprocessor` component exposes metrics with the `dynamo_preprocessor_*` prefix. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for details on specialized component metrics.
**Kubernetes Integration**: For comprehensive Kubernetes deployment and monitoring setup, see the [Kubernetes Metrics Guide](deploy/k8s_metrics.md). This includes Prometheus Operator setup, metrics collection configuration, and visualization in Grafana.
**Kubernetes Integration**: For comprehensive Kubernetes deployment and monitoring setup, see the [Kubernetes Metrics Guide](dynamo_deploy/k8s_metrics.md). This includes Prometheus Operator setup, metrics collection configuration, and visualization in Grafana.
Note that this is a very simple, degenerate example that does not demonstrate the standard Dynamo FrontEnd-Backend deployment. The hello-world client is not a web server; it is a one-off function that sends the predefined text "world,sun,moon,star" to the backend. The example is meant to show the HelloWorldWorker, so you will only see the HelloWorldWorker pod in the deployment. The client will run and exit, and the pod will not be operational.
Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to install Dynamo Cloud.
Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to install Dynamo Kubernetes Platform.