Unverified commit 129a2444, authored by Anish, committed by GitHub

docs: Consolidate documentation and fix redundant headings (#2518)

parent d9aef67e
...@@ -50,7 +50,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -50,7 +50,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| **GB200 Support** | ✅ | | | **GB200 Support** | ✅ | |
## Quick Start ## SGLang Quick Start
Below we provide a guide that lets you run all of our common deployment patterns on a single node. Below we provide a guide that lets you run all of our common deployment patterns on a single node.
......
...@@ -66,7 +66,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -66,7 +66,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| **DP Rank Routing**| ✅ | | | **DP Rank Routing**| ✅ | |
| **GB200 Support** | ✅ | | | **GB200 Support** | ✅ | |
## Quick Start ## TensorRT-LLM Quick Start
Below we provide a guide that lets you run all of our common deployment patterns on a single node. Below we provide a guide that lets you run all of our common deployment patterns on a single node.
......
...@@ -51,7 +51,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -51,7 +51,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| **DP Rank Routing**| ✅ | Supported via external control of DP ranks | | **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Container functional on main | | **GB200 Support** | 🚧 | Container functional on main |
## Quick Start ## vLLM Quick Start
Below we provide a guide that lets you run all of our common deployment patterns on a single node. Below we provide a guide that lets you run all of our common deployment patterns on a single node.
......
# Dynamo Metrics Collection on Kubernetes # Dynamo Metrics Collection on Kubernetes
For detailed documentation on collecting and visualizing metrics on Kubernetes, see [docs/guides/deploy/k8s_metrics.md](../../../docs/guides/deploy/k8s_metrics.md). For detailed documentation on collecting and visualizing metrics on Kubernetes, see [docs/guides/dynamo_deploy/k8s_metrics.md](../../../docs/guides/dynamo_deploy/k8s_metrics.md).
...@@ -48,7 +48,7 @@ There are multi-faceted challenges: ...@@ -48,7 +48,7 @@ There are multi-faceted challenges:
To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access. To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.
## High level architecture and key benefits ## Key benefits
The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features: The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
......
...@@ -17,7 +17,7 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy ...@@ -17,7 +17,7 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy
* **Performance interpolation**: Leverages profiling results data from pre-deployment profiling for accurate scaling decisions * **Performance interpolation**: Leverages profiling results data from pre-deployment profiling for accurate scaling decisions
* **Correction factors**: Adapts to real-world performance deviations from profiled data * **Correction factors**: Adapts to real-world performance deviations from profiled data
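The interplay between interpolated (profiled) performance and correction factors can be illustrated with a minimal sketch. The formulas, function names, and units below are assumptions made for illustration; they are not taken from the SLA planner's code.

```bash
# Sketch only: hypothetical formulas, not the SLA planner's implementation.

# Correction factor = observed latency / latency predicted from profiling data.
# A value above 1 means the deployment is slower than the profile suggested.
correction_factor() {
  awk -v observed="$1" -v predicted="$2" 'BEGIN {
    if (predicted <= 0) exit 1
    printf "%.2f\n", observed / predicted
  }'
}

# Replicas needed = ceil(load * factor / profiled per-replica capacity), at least 1.
required_replicas() {
  awk -v load="$1" -v capacity="$2" -v factor="$3" 'BEGIN {
    if (capacity <= 0) exit 1
    n = load * factor / capacity
    r = int(n); if (r < n) r++
    if (r < 1) r = 1
    print r
  }'
}
```

For example, `required_replicas 1000 400 "$(correction_factor 150 100)"` scales the predicted load by the measured slowdown before dividing by the profiled capacity.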
## Architecture ## Design
The SLA planner consists of several key components: The SLA planner consists of several key components:
...@@ -108,7 +108,7 @@ Finally, SLA planner applies the change by scaling up/down the number of prefill ...@@ -108,7 +108,7 @@ Finally, SLA planner applies the change by scaling up/down the number of prefill
For detailed deployment instructions including setup, configuration, troubleshooting, and architecture overview, see the [SLA Planner Deployment Guide](../guides/dynamo_deploy/sla_planner_deployment.md). For detailed deployment instructions including setup, configuration, troubleshooting, and architecture overview, see the [SLA Planner Deployment Guide](../guides/dynamo_deploy/sla_planner_deployment.md).
**Quick Start:** **To deploy SLA Planner:**
```bash ```bash
cd components/backends/vllm/deploy cd components/backends/vllm/deploy
kubectl apply -f disagg_planner.yaml -n ${NAMESPACE} kubectl apply -f disagg_planner.yaml -n ${NAMESPACE}
......
...@@ -9,7 +9,7 @@ SPDX-License-Identifier: Apache-2.0 ...@@ -9,7 +9,7 @@ SPDX-License-Identifier: Apache-2.0
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups. The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
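To make the cost evaluation concrete, here is a minimal sketch of picking a worker by combined decode and prefill cost. The linear cost formula, the `prefill_weight` parameter, and the input format are illustrative assumptions, not the router's actual implementation.

```bash
# Sketch only: hypothetical cost model, not the KV Router's real scoring code.
# Reads one worker per line on stdin:
#   "<name> <active_decode_blocks> <new_prefill_blocks>"
# and prints the name of the worker with the lowest combined cost.
pick_worker() {
  local prefill_weight="${1:-1}"
  awk -v w="$prefill_weight" '
    NF == 3 {
      cost = w * $3 + $2   # weighted prefill cost plus decode cost
      if (best == "" || cost < best_cost) { best = $1; best_cost = cost }
    }
    END { if (best != "") print best }
  '
}
```

Raising the weight makes the choice favor workers that already hold most of the request's blocks, since they need fewer newly computed prefill blocks.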
## Quick Start ## KV Router Quick Start
To launch the Dynamo frontend with the KV Router: To launch the Dynamo frontend with the KV Router:
......
...@@ -17,85 +17,130 @@ limitations under the License. ...@@ -17,85 +17,130 @@ limitations under the License.
# Deploying Inference Graphs to Kubernetes # Deploying Inference Graphs to Kubernetes
We expect users to deploy their inference graphs using CRDs or helm charts. High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
# 1. Install Dynamo Cloud. ## 1. Install Platform First
**[Dynamo Kubernetes Platform](dynamo_cloud.md)** - Main installation guide with 3 paths
Prior to deploying an inference graph the user should deploy the Dynamo Cloud Platform. Reference the [Quickstart Guide](quickstart.md) for steps to install Dynamo Cloud with Helm. ## 2. Choose Your Backend
Dynamo Cloud acts as an orchestration layer between the end user and Kubernetes, handling the complexity of deploying your graphs for you. This is a one-time action, only necessary the first time you deploy a DynamoGraph. Each backend has deployment examples and configuration options:
# 2. Deploy your inference graph. | Backend | Available Configurations |
|---------|--------------------------|
| **[vLLM](../../../components/backends/vllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner |
| **[SGLang](../../../components/backends/sglang/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node |
| **[TensorRT-LLM](../../../components/backends/trtllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router |
We provide a Custom Resource YAML file for many examples under the components/backends/{engine}/deploy folders. Consult the examples below for the CRs for a specific inference backend. ## 3. Deploy Your First Model
[View SGLang K8s](../../../components/backends/sglang/deploy/README.md) ```bash
# Set same namespace from platform install
[View vLLM K8s](../../../components/backends/vllm/deploy/README.md) export NAMESPACE=dynamo-cloud
[View TRT-LLM K8s](../../../components/backends/trtllm/deploy/README.md) # Deploy any example (this uses vLLM with Qwen model using aggregated serving)
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
### Deploying a particular example # Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
```bash # Test it
# Set your dynamo root directory kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
cd <root-dynamo-folder> curl http://localhost:8000/v1/models
export PROJECT_ROOT=$(pwd)
export NAMESPACE=<your-namespace> # the namespace you used to deploy Dynamo cloud to.
``` ```
Deploying an example consists of the simple `kubectl apply -f ... -n ${NAMESPACE}` command. For example: ## What's a DynamoGraphDeployment?
```bash It's a Kubernetes Custom Resource that defines your inference pipeline:
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE} - Model configuration
``` - Resource allocation (GPUs, memory)
- Scaling policies
- Frontend/backend connections
You can use `kubectl get dynamoGraphDeployment -n ${NAMESPACE}` to view your deployment. The scripts in the `components/<backend>/launch` folder, such as `agg.sh`, demonstrate how you can serve your models locally. The corresponding YAML files, such as `agg.yaml`, show how you can create a Kubernetes deployment for your inference graph.
You can use `kubectl delete dynamoGraphDeployment <your-dep-name> -n ${NAMESPACE}` to delete the deployment.
We provide a Custom Resource YAML file for many examples under the `deploy/` folder. ### Choosing Your Architecture Pattern
Use [VLLM YAML](../../../components/backends/vllm/deploy/agg.yaml) for an example.
**Note 1** Example Image When creating a deployment, select the architecture pattern that best fits your use case:
The examples use a prebuilt image from the `nvcr.io` registry. - **Development / Testing** - Use `agg.yaml` as the base configuration
You can utilize public images from [Dynamo NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) or build your own image and update the image location in your CR file prior to applying. Either way, you will need to overwrite the image in the example YAML. - **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference
- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability
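The pattern choices listed above can be captured in a small helper that maps a use case to its manifest. The function name and the use-case keywords are hypothetical; only the three file names come from the list.

```bash
# Sketch only: manifest_for_pattern and its keywords are illustrative names.
# Prints the example manifest file name for a given use case.
manifest_for_pattern() {
  case "$1" in
    dev|test)          echo "agg.yaml" ;;            # development / testing
    production-lb)     echo "agg_router.yaml" ;;     # load-balanced production
    high-performance)  echo "disagg_router.yaml" ;;  # disaggregated serving
    *) echo "unknown pattern: $1" >&2; return 1 ;;
  esac
}
```

Assuming the vLLM examples directory contains all three files, a deployment could then be applied with `kubectl apply -f "components/backends/vllm/deploy/$(manifest_for_pattern dev)" -n "${NAMESPACE}"`.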
To build your own image: ### Frontend and Worker Components
```bash You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:
./container/build.sh --framework <your-inference-framework>
```
For example for the `sglang` run - Provides OpenAI-compatible `/v1/chat/completions` endpoint
```bash - Auto-discovers backend workers via etcd
./container/build.sh --framework sglang - Routes requests and handles load balancing
``` - Validates and preprocesses requests
To overwrite the image in the example: ### Customizing Your Deployment
```bash Example structure:
extraPodSpec: ```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: my-llm
spec:
services:
Frontend:
dynamoNamespace: my-llm
componentType: frontend
replicas: 1
extraPodSpec:
mainContainer: mainContainer:
image: <image-in-your-$DYNAMO_IMAGE> image: your-image
VllmDecodeWorker: # or SGLangDecodeWorker, TrtllmDecodeWorker
dynamoNamespace: dynamo-dev
componentType: worker
replicas: 1
envFromSecret: hf-token-secret # for HuggingFace models
resources:
limits:
gpu: "1"
extraPodSpec:
mainContainer:
image: your-image
command: ["/bin/sh", "-c"]
args:
- python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
``` ```
**Note 2** Worker command examples per backend:
Setup port forward if needed when deploying to Kubernetes. ```yaml
# vLLM worker
List the services in your namespace: args:
- python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
```bash
kubectl get svc -n ${NAMESPACE} # SGLang worker
args:
- >-
python3 -m dynamo.sglang
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--tp 1
--trust-remote-code
# TensorRT-LLM worker
args:
- python3 -m dynamo.trtllm
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--extra-engine-args engine_configs/agg.yaml
``` ```
Look for one that ends in `-frontend` and use it for port forward.
```bash Key customization points include:
SERVICE_NAME=$(kubectl get svc -n ${NAMESPACE} -o name | grep frontend | sed 's|.*/||' | sed 's|-frontend||' | head -n1) - **Model Configuration**: Specify model in the args command
kubectl port-forward svc/${SERVICE_NAME}-frontend 8080:8080 -n ${NAMESPACE} - **Resource Allocation**: Configure GPU requirements under `resources.limits`
``` - **Scaling**: Set `replicas` for number of worker instances
- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs
- **Worker Specialization**: Add `--is-prefill-worker` flag for disaggregated prefill workers
Additional Resources: ## Additional Resources
- [Port Forward Documentation](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/)
- [Examples Deployment Guide](../../examples/README.md#deploying-a-particular-example)
- **[Examples](../../examples/README.md)** - Complete working examples
- **[Create Custom Deployments](create_deployment.md)** - Build your own CRDs
- **[Operator Documentation](dynamo_operator.md)** - How the platform works
- **[Helm Charts](../../../deploy/helm/README.md)** - For advanced users
\ No newline at end of file
...@@ -15,102 +15,167 @@ See the License for the specific language governing permissions and ...@@ -15,102 +15,167 @@ See the License for the specific language governing permissions and
limitations under the License. limitations under the License.
--> -->
# Dynamo Cloud Kubernetes Platform # Dynamo Kubernetes Platform
The Dynamo Cloud platform is a comprehensive solution for deploying and managing Dynamo inference graphs (also referred to as pipelines) in Kubernetes environments. It provides a streamlined experience for deploying, scaling, and monitoring your inference services. Deploy and manage Dynamo inference graphs on Kubernetes with automated orchestration and scaling, using the Dynamo Kubernetes Platform.
## Overview ## Quick Start Paths
The Dynamo cloud platform consists of several key components: **Path A: Production Install**
Install from published artifacts on your existing cluster → [Jump to Path A](#path-a-production-install)
- **Dynamo Operator**: A Kubernetes operator that manages the lifecycle of Dynamo inference graphs from build ➡️ deploy. For more information on the operator, see [Dynamo Kubernetes Operator Documentation](../dynamo_deploy/dynamo_operator.md) **Path B: Local Development**
- **Custom Resources**: Kubernetes custom resources for defining and managing Dynamo services Set up Minikube first → [Minikube Setup](minikube.md) → Then follow Path A
**Path C: Custom Development**
Build from source for customization → [Jump to Path C](#path-c-custom-development)
## Deployment Prerequisites ## Prerequisites
Before getting started with the Dynamo cloud platform, ensure you have:
- A Kubernetes cluster (version 1.24 or later)
- [Earthly](https://earthly.dev/) installed for building components
- Docker installed and running
- Access to a container registry (e.g., Docker Hub, NVIDIA NGC, etc.)
- `kubectl` configured to access your cluster
- Helm installed (version 3.0 or later)
```bash
# Required tools
kubectl version --client # v1.24+
helm version # v3.0+
docker version # Running daemon
# Set your inference runtime image
export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0
# Also available: sglang-runtime, tensorrtllm-runtime
```
> [!TIP] > [!TIP]
> Don't have a Kubernetes cluster? Check out our [Minikube setup guide](../../../docs/guides/dynamo_deploy/minikube.md) to set up a local environment! 🏠 > No cluster? See [Minikube Setup](minikube.md) for local development.
#### 🏗️ Build Dynamo inference runtime. ## Path A: Production Install
[One-time Action] Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts) in 3 steps.
Before you can use Dynamo, make sure you have set up the inference runtime image.
For basic cases, you can use the prebuilt image for the Dynamo inference runtime.
Just export the environment variable; this is the image your individual components will use. Pick whichever Dynamo version you want, or use the latest (default).
```bash ```bash
export DYNAMO_IMAGE=nvcr.io/nvidia/dynamo:latest-vllm # 1. Set environment
export NAMESPACE=dynamo-kubernetes
export RELEASE_VERSION=0.4.0 # any version of Dynamo 0.3.2+
# 2. Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
# 3. Install Platform
kubectl create namespace ${NAMESPACE}
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE}
``` ```
For a custom setup, build the Dynamo base image for the inference runtime and push it to your registry. This is a one-time operation. [Verify Installation](#verify-installation)
```bash ## Path C: Custom Development
# Run the script to build the default dynamo:latest-vllm image.
./container/build.sh
export IMAGE_TAG=<TAG>
# Tag the image
docker tag dynamo:latest-vllm <your-registry>/dynamo:${IMAGE_TAG}
docker push <your-registry>/dynamo:${IMAGE_TAG}
```
## 🚀 Deploying the Dynamo Cloud Platform Build and deploy from source for customization.
## Prerequisites ### Quick Deploy Script
```bash
# 1. Set environment
export NAMESPACE=dynamo-cloud
export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo # or your registry; no trailing slash
export DOCKER_USERNAME='$oauthtoken'
export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY>
export IMAGE_TAG=0.4.0
# 2. Build operator
cd deploy/cloud/operator
earthly --push +docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG
cd -
# 3. Create namespace and secrets
kubectl create namespace ${NAMESPACE}
kubectl create secret docker-registry docker-imagepullsecret \
--docker-server=${DOCKER_SERVER} \
--docker-username=${DOCKER_USERNAME} \
--docker-password=${DOCKER_PASSWORD} \
--namespace=${NAMESPACE}
# 4. Deploy
helm repo add bitnami https://charts.bitnami.com/bitnami
./deploy.sh --crds
```
Before deploying Dynamo Cloud, ensure your Kubernetes cluster meets the following requirements: ### Manual Steps (Alternative)
#### 1. 🛡️ Istio Installation <details>
Dynamo Cloud requires Istio for service mesh capabilities. Verify Istio is installed and running: <summary>Click to expand manual installation steps</summary>
**Step 1: Install CRDs**
```bash ```bash
# Check if Istio is installed helm install dynamo-crds ./crds/ --namespace default
kubectl get pods -n istio-system ```
# Expected output should show running Istio pods **Step 2: Install Platform**
# istiod-* pods should be in Running state ```bash
helm dep build ./platform/
helm install dynamo-platform ./platform/ \
--namespace ${NAMESPACE} \
--set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
--set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
--set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret"
``` ```
</details>
[Verify Installation](#verify-installation)
#### 2. 💾 PVC Support with Default Storage Class ## Verify Installation
Dynamo Cloud requires Persistent Volume Claim (PVC) support with a default storage class. Verify your cluster configuration:
```bash ```bash
# Check if default storage class exists # Check CRDs
kubectl get storageclass kubectl get crd | grep dynamo
# Expected output should show at least one storage class marked as (default) # Check operator and platform pods
# Example: kubectl get pods -n ${NAMESPACE}
# NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE # Expected: dynamo-operator-* and etcd-* pods Running
# standard (default) kubernetes.io/gce-pd Delete Immediate true 1d
``` ```
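The checks above can be wrapped in a function that fails fast when something is missing. This is a minimal sketch: `verify_platform` is a hypothetical helper, and it assumes that every pod in the namespace is expected to reach the Ready condition.

```bash
# Sketch only: verify_platform is an illustrative helper, not part of Dynamo.
# Assumes the Dynamo CRD names contain "dynamo" and that all pods in the
# namespace should become Ready (completed Job pods would break this).
verify_platform() {
  local namespace="$1" timeout="${2:-300s}"
  if ! kubectl get crd | grep -q dynamo; then
    echo "Dynamo CRDs not found" >&2
    return 1
  fi
  # Block until every pod reports Ready, or give up after the timeout.
  kubectl wait --for=condition=Ready pods --all -n "$namespace" --timeout="$timeout"
}
```

Calling `verify_platform "${NAMESPACE}"` after the install returns non-zero if the CRDs are missing or if the pods do not become Ready in time.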
## Installation ## Next Steps
Follow [Quickstart Guide](./quickstart.md) to install the Dynamo Cloud 1. **Deploy Model/Workflow**
```bash
# Example: Deploy a vLLM workflow with Qwen3-0.6B using aggregated serving
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
⚠️ **Note:** Omitting `--crds` skips the CRD installation/upgrade. This is useful when installing on a shared cluster, as CRDs are cluster-scoped resources. # Port forward and test
kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models
```
⚠️ **Note:** If you'd like to only generate the generated-values.yaml file without deploying to Kubernetes (e.g., for inspection, CI workflows, or dry-run testing), use: 2. **Explore Backend Guides**
- [vLLM Deployments](../../../components/backends/vllm/deploy/README.md)
- [SGLang Deployments](../../../components/backends/sglang/deploy/README.md)
- [TensorRT-LLM Deployments](../../../components/backends/trtllm/deploy/README.md)
```bash 3. **Optional:**
./deploy_dynamo_cloud.py --yaml-only - [Set up Prometheus & Grafana](k8s_metrics.md)
``` - [SLA Planner Deployment Guide](sla_planner_deployment.md) (for advanced SLA-aware scheduling and autoscaling)
## Troubleshooting
**Pods not starting?**
```bash
kubectl describe pod <pod-name> -n ${NAMESPACE}
kubectl logs <pod-name> -n ${NAMESPACE}
```
### Cloud Provider-Specific deployment **HuggingFace model access?**
```bash
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${NAMESPACE}
```
#### Google Kubernetes Engine (GKE) deployment **Clean uninstall?**
```bash
./uninstall.sh # Removes all CRDs and platform
```
You can find detailed instructions for deployment in GKE [here](../dynamo_deploy/gke_setup.md) ## Advanced Options
- [GKE-specific setup](gke_setup.md)
- [Create custom deployments](create_deployment.md)
- [Dynamo Operator details](dynamo_operator.md)
\ No newline at end of file
# Grove Deployment Guide
Grove is a Kubernetes API specifically designed to address the orchestration challenges of modern AI workloads, particularly disaggregated inference systems. Grove provides seamless integration with NVIDIA Dynamo for comprehensive AI infrastructure management.
## Overview
Grove was originally motivated by the challenges of orchestrating multinode, disaggregated inference systems. It provides a consistent and unified API that allows users to define, configure, and scale prefill, decode, and any other components like routing within a single custom resource.
### How Grove Works for Disaggregated Serving
Grove enables disaggregated serving by breaking down large language model inference into separate, specialized components that can be independently scaled and managed. This architecture provides several advantages:
- **Component Specialization**: Separate prefill, decode, and routing components optimized for their specific tasks
- **Independent Scaling**: Each component can scale based on its individual resource requirements and workload patterns
- **Resource Optimization**: Better utilization of hardware resources through specialized workload placement
- **Fault Isolation**: Issues in one component don't necessarily affect others
## Core Components and API Resources
Grove implements disaggregated serving through several custom Kubernetes resources that provide declarative composition of role-based pod groups:
### PodGangSet
The top-level Grove object that defines a group of components managed and colocated together. Key features include:
- Support for autoscaling
- Topology-aware spread of replicas for availability
- Unified management of multiple disaggregated components
### PodClique
Represents a group of pods with a specific role (e.g., leader, worker, frontend). Each clique features:
- Independent configuration options
- Custom scaling logic support
- Role-specific resource allocation
### PodCliqueScalingGroup
A set of PodCliques that scale and are scheduled together, ideal for tightly coupled roles like prefill leader and worker components that need coordinated scaling behavior.
## Key Capabilities for Disaggregated Serving
Grove provides several specialized features that make it particularly well-suited for disaggregated serving:
### Flexible Gang Scheduling
PodCliques and PodCliqueScalingGroups allow users to specify flexible gang-scheduling requirements at multiple levels within a PodGangSet to prevent resource deadlocks and ensure all components of a disaggregated system start together.
### Multi-level Horizontal Auto-Scaling
Supports pluggable horizontal auto-scaling solutions to scale PodGangSet, PodClique, and PodCliqueScalingGroup custom resources independently based on their specific metrics and requirements.
### Network Topology-Aware Scheduling
Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability, crucial for disaggregated systems where components need efficient inter-node communication.
### Custom Startup Dependencies
Prescribes the order in which PodCliques must start in a declarative specification, with pod startup decoupled from pod creation or scheduling. This ensures proper initialization order for disaggregated components.
## Use Cases and Examples
Grove specifically supports:
- **Multi-node disaggregated inference** for large models such as DeepSeek-R1 and Llama-4-Maverick
- **Single-node disaggregated inference** for optimized resource utilization
- **Agentic pipelines of models** for complex AI workflows
- **Standard aggregated serving** patterns for single node or single GPU inference
## Integration with NVIDIA Dynamo
Grove is strategically aligned with NVIDIA Dynamo for seamless integration within the AI infrastructure stack:
### Complementary Roles
- **Grove**: Handles the Kubernetes orchestration layer for disaggregated AI workloads
- **Dynamo**: Provides comprehensive AI infrastructure capabilities including serving backends, routing, and resource management
### Release Coordination
Grove is aligning its release schedule with NVIDIA Dynamo to ensure seamless integration, with the finalized release cadence reflected in the project roadmap.
### Unified AI Platform
The integration creates a comprehensive platform where:
- Grove manages complex orchestration of disaggregated components
- Dynamo provides the serving infrastructure, routing capabilities, and backend integrations
- Together they enable sophisticated AI serving architectures with simplified management
## Architecture Benefits
Grove represents a significant advancement in Kubernetes-based orchestration for AI workloads by:
1. **Simplifying Complex Deployments**: Provides a unified API that can manage multiple components (prefill, decode, routing) within a single resource definition
2. **Enabling Sophisticated Architectures**: Supports advanced disaggregated inference patterns that were previously difficult to orchestrate
3. **Reducing Operational Complexity**: Abstracts away the complexity of coordinating multiple interdependent AI components
4. **Optimizing Resource Utilization**: Enables fine-grained control over component placement and scaling
## Getting Started
> **Note**: Grove is currently in development and aligning with NVIDIA Dynamo's release schedule.
For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md).
For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](multinode-deployment.md), which demonstrates multi-node disaggregated serving scenarios.
For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove).
\ No newline at end of file
...@@ -7,7 +7,7 @@ This guide provides a walkthrough for collecting and visualizing metrics from Dy ...@@ -7,7 +7,7 @@ This guide provides a walkthrough for collecting and visualizing metrics from Dy
## Prerequisites ## Prerequisites
### Install Dynamo Operator ### Install Dynamo Operator
Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Quickstart Guide](../dynamo_deploy/quickstart.md) for detailed instructions on deploying the Dynamo operator. Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Installation Guide](../dynamo_deploy/dynamo_cloud.md) for detailed instructions on deploying the Dynamo operator.
### Install Prometheus Operator ### Install Prometheus Operator
If you don't have an existing Prometheus setup, you'll need to install the Prometheus Operator. The Prometheus Operator introduces custom resources that make it easy to deploy and manage Prometheus monitoring in Kubernetes: If you don't have an existing Prometheus setup, you'll need to install the Prometheus Operator. The Prometheus Operator introduces custom resources that make it easy to deploy and manage Prometheus monitoring in Kubernetes:
...@@ -39,7 +39,7 @@ This will create two components: ...@@ -39,7 +39,7 @@ This will create two components:
- A Worker component exposing metrics on its system port - A Worker component exposing metrics on its system port
Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about: Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
- Deployment configuration: See the [vLLM README](../../../components/backends/vllm/README.md) - Deployment configuration: See the [vLLM README](../../components/backends/vllm/README.md)
- Available metrics: See the [metrics guide](../metrics.md) - Available metrics: See the [metrics guide](../metrics.md)
### Validate the Deployment ### Validate the Deployment
...@@ -47,7 +47,7 @@ Both components expose a `/metrics` endpoint following the OpenMetrics format, b ...@@ -47,7 +47,7 @@ Both components expose a `/metrics` endpoint following the OpenMetrics format, b
Let's send some test requests to populate metrics: Let's send some test requests to populate metrics:
```bash ```bash
curl localhost:8080/v1/chat/completions \ curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "Qwen/Qwen3-0.6B", "model": "Qwen/Qwen3-0.6B",
......
...@@ -17,21 +17,19 @@ limitations under the License. ...@@ -17,21 +17,19 @@ limitations under the License.
# Minikube Setup Guide # Minikube Setup Guide
Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through the setup of everything you need to run Dynamo Cloud locally. Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through the setup of everything you need to run Dynamo Kubernetes Platform locally.
## Setting Up Minikube ## 1. Install Minikube
### 1. Install Minikube
First things first! Start by installing Minikube. Follow the official [Minikube installation guide](https://minikube.sigs.k8s.io/docs/start/) for your operating system. First things first! Start by installing Minikube. Follow the official [Minikube installation guide](https://minikube.sigs.k8s.io/docs/start/) for your operating system.
### 2. Configure GPU Support (Optional) ## 2. Configure GPU Support (Optional)
Planning to use GPU-accelerated workloads? You'll need to configure GPU support in Minikube. Follow the [Minikube GPU guide](https://minikube.sigs.k8s.io/docs/tutorials/nvidia/) to set up NVIDIA GPU support before proceeding. Planning to use GPU-accelerated workloads? You'll need to configure GPU support in Minikube. Follow the [Minikube GPU guide](https://minikube.sigs.k8s.io/docs/tutorials/nvidia/) to set up NVIDIA GPU support before proceeding.
```{tip} ```{tip}
Make sure to configure GPU support before starting Minikube if you plan to use GPU workloads! Make sure to configure GPU support before starting Minikube if you plan to use GPU workloads!
``` ```
### 3. Start Minikube ## 3. Start Minikube
Time to launch your local cluster! Time to launch your local cluster!
```bash ```bash
...@@ -44,7 +42,7 @@ minikube addons enable istio ...@@ -44,7 +42,7 @@ minikube addons enable istio
minikube addons enable storage-provisioner-rancher minikube addons enable storage-provisioner-rancher
``` ```
### 4. Verify Installation ## 4. Verify Installation
Let's make sure everything is working correctly! Let's make sure everything is working correctly!
```bash ```bash
...@@ -60,5 +58,5 @@ kubectl get storageclass ...@@ -60,5 +58,5 @@ kubectl get storageclass
## Next Steps ## Next Steps
Once your local environment is set up, you can proceed with the [Dynamo Cloud deployment guide](./dynamo_cloud.md) to deploy the platform to your local cluster. Once your local environment is set up, you can proceed with the [Dynamo Kubernetes Platform deployment guide](./dynamo_cloud.md) to deploy the platform to your local cluster.
...@@ -27,7 +27,7 @@ helm install fluid fluid/fluid -n fluid-system ...@@ -27,7 +27,7 @@ helm install fluid fluid/fluid -n fluid-system
``` ```
For advanced configuration, see the [Fluid Installation Guide](https://fluid-cloudnative.github.io/docs/get-started/installation). For advanced configuration, see the [Fluid Installation Guide](https://fluid-cloudnative.github.io/docs/get-started/installation).
## Quick Start ## Pre-deployment Steps
1. Install Fluid (see [Installation](#installation)). 1. Install Fluid (see [Installation](#installation)).
2. Create a Dataset and Runtime (see [the following example](#webufs-example)). 2. Create a Dataset and Runtime (see [the following example](#webufs-example)).
......
...@@ -31,7 +31,7 @@ Dynamo automatically exposes metrics with the `dynamo_` name prefixes. It also a ...@@ -31,7 +31,7 @@ Dynamo automatically exposes metrics with the `dynamo_` name prefixes. It also a
**Specialized Component Metrics**: Components can also expose additional metrics specific to their functionality. For example, a `preprocessor` component exposes metrics with the `dynamo_preprocessor_*` prefix. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for details on specialized component metrics. **Specialized Component Metrics**: Components can also expose additional metrics specific to their functionality. For example, a `preprocessor` component exposes metrics with the `dynamo_preprocessor_*` prefix. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for details on specialized component metrics.
**Kubernetes Integration**: For comprehensive Kubernetes deployment and monitoring setup, see the [Kubernetes Metrics Guide](deploy/k8s_metrics.md). This includes Prometheus Operator setup, metrics collection configuration, and visualization in Grafana. **Kubernetes Integration**: For comprehensive Kubernetes deployment and monitoring setup, see the [Kubernetes Metrics Guide](dynamo_deploy/k8s_metrics.md). This includes Prometheus Operator setup, metrics collection configuration, and visualization in Grafana.
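As a rough illustration of the `dynamo_` prefix convention described above, the sketch below filters a metrics dump down to Dynamo sample lines. The helper name `filter_metrics` and the sample metric names are hypothetical, invented only for this example; they are not taken from Dynamo's actual metric list.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read metrics text on stdin and keep only the sample
# lines whose metric name starts with the given prefix.
filter_metrics() {
  local prefix="$1"
  # Drop comment lines (# HELP / # TYPE), then keep lines starting with the prefix.
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -v '^#' | grep "^${prefix}" || true
}

# Made-up sample of /metrics output, for illustration only.
sample_metrics='# HELP dynamo_example_requests_total Hypothetical counter
# TYPE dynamo_example_requests_total counter
dynamo_example_requests_total 3
dynamo_preprocessor_example_total 1
process_cpu_seconds_total 0.5'

printf '%s\n' "$sample_metrics" | filter_metrics "dynamo_"
```

Against a running deployment, the same function could be fed from `curl -s` pointed at whichever host and port expose the component's `/metrics` endpoint; passing `dynamo_preprocessor_` instead narrows the output to that component's specialized metrics.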
## Metrics Hierarchy ## Metrics Hierarchy
......
...@@ -106,7 +106,7 @@ Hello star! ...@@ -106,7 +106,7 @@ Hello star!
Note that this is a very simple degenerate example which does not demonstrate the standard Dynamo FrontEnd-Backend deployment. The hello-world client is not a web server; it is a one-off function which sends the predefined text "world,sun,moon,star" to the backend. The example is meant to show the HelloWorldWorker. As such, you will only see the HelloWorldWorker pod in deployment. The client will run and exit and the pod will not be operational. Note that this is a very simple degenerate example which does not demonstrate the standard Dynamo FrontEnd-Backend deployment. The hello-world client is not a web server; it is a one-off function which sends the predefined text "world,sun,moon,star" to the backend. The example is meant to show the HelloWorldWorker. As such, you will only see the HelloWorldWorker pod in deployment. The client will run and exit and the pod will not be operational.
Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to install Dynamo Cloud. Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to install Dynamo Kubernetes Platform.
Then deploy to Kubernetes using Then deploy to Kubernetes using
```bash ```bash
......