Unverified commit 2c3066bd, authored by dagil-nvidia, committed by GitHub

docs: full migration of docs/ to fern format in fern/ (#6050)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent d59b9d72
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Speculative Decoding
Speculative decoding is an optimization technique that uses a smaller "draft" model to predict multiple tokens, which are then verified by the main model in parallel. This can significantly reduce latency for autoregressive generation.
## Backend Support
| Backend | Status | Notes |
|---------|--------|-------|
| vLLM | ✅ | Eagle3 draft model support |
| SGLang | 🚧 | Not yet documented |
| TensorRT-LLM | 🚧 | Not yet documented |
## Overview
Speculative decoding works by:
1. **Draft phase**: A smaller, faster model generates candidate tokens
2. **Verify phase**: The main model verifies these candidates in a single forward pass
3. **Accept/reject**: Tokens are accepted if they match what the main model would have generated
This approach trades additional compute (the draft model plus verification of candidates that may be rejected) for lower latency: whenever draft tokens are accepted, the main model emits multiple tokens per forward pass.
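The three phases above can be sketched with toy models. This is a minimal illustration, not the vLLM or Eagle3 implementation: `draft_model` and `target_model` are hypothetical stand-ins that map a token prefix to the next token, and acceptance uses exact greedy matching (production engines typically verify against sampled distributions and batch the verification into one forward pass).

```python
from typing import Callable, List

# Hypothetical stand-ins: each "model" maps a token prefix to its next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Run one draft/verify/accept round and return the tokens to append."""
    # Draft phase: the small model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Verify phase: a real engine scores all k positions in one batched forward
    # pass of the target model; this toy version calls it once per position.
    accepted: List[int] = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            # Reject: keep the target's own token and discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # Every draft token matched, so the target contributes one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def target_model(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10  # toy rule: count upward modulo 10


def draft_model(tokens: List[int]) -> int:
    # Agrees with the target except after a 4, where it guesses wrong.
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


first = speculative_step([1], draft_model, target_model, k=4)
second = speculative_step([1] + first, draft_model, target_model, k=4)
```

In the first round the draft is wrong at its fourth token, so three draft tokens are kept and the target's correction is appended. In the second round all four draft tokens are accepted and the target adds one more, so the output always matches what the target alone would have produced.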
## Quick Start (vLLM + Eagle3)
This guide walks through deploying **Meta-Llama-3.1-8B-Instruct** with **Eagle3** speculative decoding on a single GPU with at least 16GB VRAM.
### Prerequisites
1. Start infrastructure services:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
2. Build and run the vLLM container:
```bash
./container/build.sh --framework VLLM
./container/run.sh -it --framework VLLM --mount-workspace
```
3. Set up Hugging Face access (Meta-Llama-3.1-8B-Instruct is gated):
```bash
export HUGGING_FACE_HUB_TOKEN="your_token_here"
export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
```
### Run Speculative Decoding
```bash
cd examples/backends/vllm
bash launch/agg_spec_decoding.sh
```
### Test the Deployment
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
],
"max_tokens": 250
}'
```
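The same request can be issued from Python. The sketch below uses only the standard library and assumes the frontend is reachable at `localhost:8000` and returns an OpenAI-style body with the reply under `choices[0].message.content`; the helper names `build_request` and `first_reply` are illustrative, not part of Dynamo.

```python
import json
import urllib.request

# Assumption: the frontend listens on localhost:8000, as in the curl command above.
URL = "http://localhost:8000/v1/chat/completions"


def build_request(prompt: str, max_tokens: int = 250) -> urllib.request.Request:
    """Build (but do not send) the same chat completion request as the curl call."""
    payload = {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body: bytes) -> str:
    """Extract the assistant text from an OpenAI-style chat completion body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


request = build_request("Write a poem about why Sakura trees are beautiful.")
# With the server running, send it with:
#     with urllib.request.urlopen(request) as resp:
#         print(first_reply(resp.read()))
```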
## Backend-Specific Guides
| Backend | Guide |
|---------|-------|
| vLLM | [speculative-decoding-vllm.md](./speculative-decoding-vllm.md) |
## See Also
- [vLLM Backend](../../backends/vllm/README.md) - Full vLLM deployment guide
- [Disaggregated Serving](../../design-docs/disagg-serving.md) - Alternative optimization approach
- [Meta-Llama-3.1-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
...@@ -3,56 +3,52 @@ ...@@ -3,56 +3,52 @@
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
--- ---
# Running **Meta-Llama-3.1-8B-Instruct** with Speculative Decoding (Eagle3) # Speculative Decoding with vLLM
This guide walks through how to deploy **Meta-Llama-3.1-8B-Instruct** using **aggregated speculative decoding** with **Eagle3** on a single node. Using Speculative Decoding with the vLLM backend.
Since the model is only **8B parameters**, you can run it on **any GPU with at least 16GB VRAM**.
> **See also**: [Speculative Decoding Overview](./README.md) for cross-backend documentation.
## Prerequisites
## Step 1: Set Up Your Docker Environment - vLLM container with Eagle3 support
- GPU with at least 16GB VRAM
- Hugging Face access token (for gated models)
First, we’ll initialize a Docker container using the VLLM backend. ## Quick Start: Meta-Llama-3.1-8B-Instruct + Eagle3
You can refer to the [VLLM Quickstart Guide](README.md#vllm-quick-start) — or follow the full steps below.
### 1. Launch Docker Compose This guide walks through deploying **Meta-Llama-3.1-8B-Instruct** with **Eagle3** speculative decoding on a single node.
```bash ### Step 1: Set Up Your Docker Environment
docker compose -f deploy/docker-compose.yml up -d
```
### 2. Build the Container First, initialize a Docker container using the vLLM backend. See the [vLLM Quickstart Guide](../../backends/vllm/README.md#vllm-quick-start) for details.
```bash ```bash
./container/build.sh --framework VLLM # Launch infrastructure services
``` docker compose -f deploy/docker-compose.yml up -d
### 3. Run the Container # Build the container
./container/build.sh --framework VLLM
```bash # Run the container
./container/run.sh -it --framework VLLM --mount-workspace ./container/run.sh -it --framework VLLM --mount-workspace
``` ```
### Step 2: Get Access to the Llama-3 Model
The **Meta-Llama-3.1-8B-Instruct** model is gated. Request access on Hugging Face:
[Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
## Step 2: Get Access to the Llama-3 Model Approval time varies depending on Hugging Face review traffic.
The **Meta-Llama-3.1-8B-Instruct** model is gated, so you’ll need to request access on Hugging Face. Once approved, set your access token inside the container:
Go to the official [Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and fill out the access form.
Approval usually takes around **5 minutes**.
Once you have access, generate a **Hugging Face access token** with permission for gated repositories, then set it inside your container:
```bash ```bash
export HUGGING_FACE_HUB_TOKEN="insert_your_token_here" export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
``` ```
### Step 3: Run Aggregated Speculative Decoding
## Step 3: Run Aggregated Speculative Decoding
Now that your environment is ready, start the aggregated server with **speculative decoding**.
```bash ```bash
# Requires only one GPU # Requires only one GPU
...@@ -60,14 +56,9 @@ cd examples/backends/vllm ...@@ -60,14 +56,9 @@ cd examples/backends/vllm
bash launch/agg_spec_decoding.sh bash launch/agg_spec_decoding.sh
``` ```
Once the weights finish downloading and serving begins, you’ll be ready to send inference requests to your model. Once the weights finish downloading, the server will be ready for inference requests.
## Step 4: Example Request ### Step 4: Test the Deployment
To verify your setup, try sending a simple prompt to your model:
```bash ```bash
curl http://localhost:8000/v1/chat/completions \ curl http://localhost:8000/v1/chat/completions \
...@@ -88,7 +79,10 @@ curl http://localhost:8000/v1/chat/completions \ ...@@ -88,7 +79,10 @@ curl http://localhost:8000/v1/chat/completions \
"id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8", "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
"choices": [ "choices": [
{ {
"text": "In cherry blossom’s gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes.", "message": {
"role": "assistant",
"content": "In cherry blossom's gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes."
},
"index": 0, "index": 0,
"finish_reason": "stop" "finish_reason": "stop"
} }
...@@ -102,9 +96,25 @@ curl http://localhost:8000/v1/chat/completions \ ...@@ -102,9 +96,25 @@ curl http://localhost:8000/v1/chat/completions \
} }
``` ```
## Configuration
Speculative decoding in vLLM uses Eagle3 as the draft model. The launch script configures:
- Target model: `meta-llama/Meta-Llama-3.1-8B-Instruct`
- Draft model: Eagle3 variant
- Aggregated serving mode
See `examples/backends/vllm/launch/agg_spec_decoding.sh` for the full configuration.
## Limitations
- Currently only supports Eagle3 as the draft model
- Requires compatible model architectures between target and draft
## Additional Resources ## See Also
* [VLLM Quickstart](README.md#vllm-quick-start) | Document | Path |
* [Meta-Llama-3.1-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |----------|------|
\ No newline at end of file | Speculative Decoding Overview | [README.md](./README.md) |
| vLLM Backend Guide | [vLLM README](../../backends/vllm/README.md) |
| Meta-Llama-3.1-8B-Instruct | [Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |
...@@ -3,71 +3,28 @@ ...@@ -3,71 +3,28 @@
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
--- ---
# Dynamo Examples The examples below assume you build the latest image yourself from source. If using a prebuilt image, follow the examples from the corresponding branch.
This directory contains practical examples demonstrating how to deploy and use Dynamo for distributed LLM inference. Each example includes setup instructions, configuration files, and explanations to help you understand different deployment patterns and use cases. ## Hello World
> **Want to see a specific example?** Demonstrates the basic concepts of Dynamo by creating a simple GPU-unaware graph.
> Open a [GitHub issue](https://github.com/ai-dynamo/dynamo/issues) to request an example you'd like to see, or [open a pull request](https://github.com/ai-dynamo/dynamo/pulls) if you'd like to contribute your own!
## Basics & Tutorials [View Hello World Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/runtime/hello_world)
Learn fundamental Dynamo concepts through these introductory examples: ## vLLM
- **[Quickstart](https://github.com/ai-dynamo/dynamo/blob/main/examples/basics/quickstart/README.md)** - Simple aggregated serving example with vLLM backend Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with vLLM.
- **[Disaggregated Serving](https://github.com/ai-dynamo/dynamo/blob/main/examples/basics/disaggregated_serving/README.md)** - Prefill/decode separation for enhanced performance and scalability
- **[Multi-node](https://github.com/ai-dynamo/dynamo/blob/main/examples/basics/multinode/README.md)** - Distributed inference across multiple nodes and GPUs
## Framework Support [View vLLM Backend Guide](../backends/vllm/README.md)
These examples show how Dynamo broadly works using major inference engines. ## SGLang
If you want to see advanced, framework-specific deployment patterns and best practices, check out the [Examples Backends](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/) directory: Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with SGLang.
- **[vLLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/)** – vLLM-specific deployment and configuration
- **[SGLang](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/sglang/)** – SGLang integration examples and workflows
- **[TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/trtllm/)** – TensorRT-LLM workflows and optimizations
## Deployment Examples [View SGLang Backend Guide](../backends/sglang/README.md)
Platform-specific deployment guides for production environments: ## TensorRT-LLM
- **[Amazon EKS](https://github.com/ai-dynamo/dynamo/blob/main/examples/deployments/EKS/)** - Deploy Dynamo on Amazon Elastic Kubernetes Service Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with TensorRT-LLM.
- **[Azure AKS](https://github.com/ai-dynamo/dynamo/blob/main/examples/deployments/AKS/)** - Deploy Dynamo on Azure Kubernetes Service
- **[Amazon ECS](https://github.com/ai-dynamo/dynamo/blob/main/examples/deployments/ECS/)** - Deploy Dynamo on Amazon Elastic Container Service
- **Google GKE** - _Coming soon_
## Runtime Examples [View TensorRT-LLM Backend Guide](../backends/trtllm/README.md)
Low-level runtime examples for developers using Python/Rust bindings:
- **[Hello World](https://github.com/ai-dynamo/dynamo/blob/main/examples/custom_backend/hello_world/README.md)** - Minimal Dynamo runtime service demonstrating basic concepts
## Getting Started
1. **Choose your deployment pattern**: Start with the [Quickstart](https://github.com/ai-dynamo/dynamo/blob/main/examples/basics/quickstart/README.md) for a simple local deployment, or explore [Disaggregated Serving](https://github.com/ai-dynamo/dynamo/blob/main/examples/basics/disaggregated_serving/README.md) for advanced architectures.
2. **Set up prerequisites**: Most examples require etcd and NATS services. You can start them using:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
3. **Follow the example**: Each directory contains detailed setup instructions and configuration files specific to that deployment pattern.
## Prerequisites
Before running any examples, ensure you have:
- **Docker & Docker Compose** - For containerized services
- **CUDA-compatible GPU** - For LLM inference (except hello_world, which is non-GPU aware)
- **Python 3.9+** - For client scripts and utilities
### For Kubernetes Deployments
If you're running Kubernetes/cloud deployment examples (EKS, AKS, GKE), you'll also need:
| Tool | Minimum Version | Installation |
|------|-----------------|--------------|
| **kubectl** | v1.24+ | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
| **Helm** | v3.0+ | [Install Helm](https://helm.sh/docs/intro/install/) |
See the [Kubernetes Installation Guide](../kubernetes/installation-guide.md#prerequisites) for detailed setup instructions and pre-deployment checks.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Installation
## Pip (PyPI)
Install a pre-built wheel from PyPI.
```bash
# Create a virtual environment and activate it
uv venv venv
source venv/bin/activate
# Install Dynamo from PyPI (choose one backend extra)
uv pip install "ai-dynamo[sglang]" # or [vllm], [trtllm]
```
## Pip from source
Install directly from a local checkout for development.
```bash
# Clone the repository
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
# Create a virtual environment and activate it
uv venv venv
source venv/bin/activate
uv pip install ".[sglang]" # or [vllm], [trtllm]
```
## Docker
Pull and run prebuilt images from NVIDIA NGC (`nvcr.io`).
```bash
# Run a container (mount your workspace if needed)
docker run --rm -it \
--gpus all \
--network host \
nvcr.io/nvidia/ai-dynamo/sglang-runtime:latest # or vllm, tensorrtllm
```
...@@ -3,95 +3,148 @@ ...@@ -3,95 +3,148 @@
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
--- ---
# Welcome to NVIDIA Dynamo This guide covers running Dynamo **using the CLI on your local machine or VM**.
The NVIDIA Dynamo Platform is a high-performance, low-latency inference framework designed to serve all AI models—across any framework, architecture, or deployment scale. <Info>
**Looking to deploy on Kubernetes instead?**
See the [Kubernetes Installation Guide](../kubernetes/installation-guide.md)
and [Kubernetes Quickstart](../kubernetes/README.md) for cluster deployments.
</Info>
> [!TIP] ## Install Dynamo
> **Discover the Latest Developments!**
>
> This guide is a snapshot of a specific point in time. For the latest information, examples, and Release Assets, see the [Dynamo GitHub repository](https://github.com/ai-dynamo/dynamo/releases/latest).
## Quickstart **Option A: Containers (Recommended)**
Get started with Dynamo locally in just a few commands: Containers have all dependencies pre-installed. No setup required.
### 1. Install Dynamo ```bash
# SGLang
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1
# TensorRT-LLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1
# vLLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1
```
<Tip>
To run frontend and worker in the same container, either:
- Run processes in background with `&` (see Run Dynamo section below), or
- Open a second terminal and use `docker exec -it <container_id> bash`
</Tip>
See [Release Artifacts](../reference/release-artifacts.md#container-images) for available
versions and backend guides for run instructions: [SGLang](../backends/sglang/README.md) |
[TensorRT-LLM](../backends/trtllm/README.md) | [vLLM](../backends/vllm/README.md)
**Option B: Install from PyPI**
```bash ```bash
# Install uv (recommended Python package manager) # Install uv (recommended Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install Dynamo # Create virtual environment
uv venv venv uv venv venv
source venv/bin/activate source venv/bin/activate
# Use prerelease flag to install RC versions of flashinfer and/or other dependencies uv pip install pip
uv pip install --prerelease=allow "ai-dynamo[sglang]" # or [vllm], [trtllm]
``` ```
### 2. Start etcd/NATS Install system dependencies and the Dynamo wheel for your chosen backend:
**SGLang**
```bash ```bash
# Fetch and start etcd and NATS using Docker Compose sudo apt install python3-dev
VERSION=$(uv pip show ai-dynamo | grep Version | cut -d' ' -f2) uv pip install --prerelease=allow "ai-dynamo[sglang]"
curl -fsSL -o docker-compose.yml https://raw.githubusercontent.com/ai-dynamo/dynamo/refs/tags/v${VERSION}/deploy/docker-compose.yml
docker compose -f docker-compose.yml up -d
``` ```
### 3. Run Dynamo <Note>
For CUDA 13 (B300/GB300), the container is recommended. See
[SGLang install docs](https://docs.sglang.io/get_started/install.html) for details.
</Note>
**TensorRT-LLM**
```bash ```bash
# Start the OpenAI compatible frontend (default port is 8000) sudo apt install python3-dev
python -m dynamo.frontend pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
```
<Note>
TensorRT-LLM requires `pip` due to a transitive Git URL dependency that
`uv` doesn't resolve. We recommend using the TensorRT-LLM container for
broader compatibility. See the [TRT-LLM backend guide](../backends/trtllm/README.md)
for details.
</Note>
# In another terminal, start an SGLang worker **vLLM**
python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B
```bash
sudo apt install python3-dev libxcb1
uv pip install --prerelease=allow "ai-dynamo[vllm]"
``` ```
### 4. Test Your Deployment ## Run Dynamo
<Tip>
**(Optional)** Before running Dynamo, verify your system configuration:
`python3 deploy/sanity_check.py`
</Tip>
Start the frontend, then start a worker for your chosen backend.
<Tip>
To run in a single terminal (useful in containers), append `> logfile.log 2>&1 &`
to run processes in background. Example: `python3 -m dynamo.frontend --store-kv file > dynamo.frontend.log 2>&1 &`
</Tip>
```bash ```bash
curl localhost:8000/v1/chat/completions \ # Start the OpenAI compatible frontend (default port is 8000)
-H "Content-Type: application/json" \ # --store-kv file avoids needing etcd (frontend and workers must share a disk)
-d '{"model": "Qwen/Qwen3-0.6B", python3 -m dynamo.frontend --store-kv file
"messages": [{"role": "user", "content": "Hello!"}], ```
"max_tokens": 50}'
In another terminal (or same terminal if using background mode), start a worker:
**SGLang**
```bash
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --store-kv file
``` ```
## Key Features **TensorRT-LLM**
| Feature | Description | ```bash
|---------|-------------| python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --store-kv file
| **Multi-Backend Support** | vLLM, SGLang, and TensorRT-LLM backends | ```
| **Disaggregated Serving** | Separate prefill and decode for optimal performance |
| **KV Cache Routing** | Intelligent request routing based on KV cache state |
| **Kubernetes Native** | Full operator and Helm chart support |
| **Observability** | Prometheus metrics, Grafana dashboards, and tracing |
## Documentation Overview **vLLM**
### Backends ```bash
- [vLLM Backend](../backends/vllm/README.md) - High-throughput serving with vLLM python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --store-kv file \
- [SGLang Backend](../backends/sglang/README.md) - Fast inference with SGLang --kv-events-config '{"enable_kv_cache_events": false}'
- [TensorRT-LLM Backend](../backends/trtllm/README.md) - Optimized inference with TensorRT-LLM ```
### Kubernetes Deployment <Note>
- [Installation Guide updated](../kubernetes/installation-guide.md) - Deploy Dynamo on Kubernetes For dependency-free local development, disable KV event publishing (avoids NATS):
- [Operator Guide](../kubernetes/dynamo-operator.md) - Using the Dynamo Operator
- [Autoscaling](../kubernetes/autoscaling.md) - Automatic scaling configuration
### Architecture - **vLLM:** Add `--kv-events-config '{"enable_kv_cache_events": false}'`
- [System Architecture](../design-docs/architecture.md) - Overall system design - **SGLang:** No flag needed (KV events disabled by default)
- [Disaggregated Serving](../design-docs/disagg-serving.md) - P/D separation architecture - **TensorRT-LLM:** No flag needed (KV events disabled by default)
- [Distributed Runtime](../design-docs/distributed-runtime.md) - Runtime internals
### Performance & Tuning **TensorRT-LLM only:** The warning `Cannot connect to ModelExpress server/transport error. Using direct download.`
- [Performance Tuning](../performance/tuning.md) - Optimize your deployment is expected and can be safely ignored.
- [Benchmarking](../benchmarks/benchmarking.md) - Measure and compare performance </Note>
- [AI Configurator](../performance/aiconfigurator.md) - Automated configuration
## Getting Help ## Test Your Deployment
- **GitHub Issues**: [Report bugs or request features](https://github.com/ai-dynamo/dynamo/issues) ```bash
- **Discussions**: [Ask questions and share ideas](https://github.com/ai-dynamo/dynamo/discussions) curl localhost:8000/v1/chat/completions \
- **Reference**: [CLI Reference](../reference/cli.md) | [Glossary](../reference/glossary.md) | [Support Matrix](../reference/support-matrix.md) -H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50}'
```
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# FlexKV Integration in Dynamo
## Introduction
[FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed runtime for KV cache offloading developed by Tencent Cloud's TACO team in collaboration with the community. It acts as a unified KV caching layer for inference engines like vLLM, TensorRT-LLM, and SGLang.
### Key Features
- **Multi-level caching**: CPU memory, local SSD, and scalable storage (cloud storage) for KV cache offloading
- **Distributed KV cache reuse**: Share KV cache across multiple nodes using distributed RadixTree
- **High-performance I/O**: Supports io_uring and GPU Direct Storage (GDS) for accelerated data transfer
- **Asynchronous operations**: Get and put operations can overlap with computation through prefetching
## Prerequisites
1. **Dynamo installed** with vLLM support
2. **Infrastructure services running**:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
3. **FlexKV dependencies** (for SSD offloading):
```bash
apt install liburing-dev libxxhash-dev
```
## Quick Start
### Enable FlexKV
Set the `DYNAMO_USE_FLEXKV` environment variable and use the `--connector flexkv` flag:
```bash
export DYNAMO_USE_FLEXKV=1
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector flexkv
```
## Aggregated Serving
### Basic Setup
```bash
# Terminal 1: Start frontend
python -m dynamo.frontend &
# Terminal 2: Start vLLM worker with FlexKV
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector flexkv
```
### With KV-Aware Routing
For multi-worker deployments with KV-aware routing to maximize cache reuse:
```bash
# Terminal 1: Start frontend with KV router
python -m dynamo.frontend \
--router-mode kv \
--router-reset-states &
# Terminal 2: Worker 1
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_0" \
CUDA_VISIBLE_DEVICES=0 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--connector flexkv \
--gpu-memory-utilization 0.2 \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080","enable_kv_cache_events":true}' &
# Terminal 3: Worker 2
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_1" \
CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--connector flexkv \
--gpu-memory-utilization 0.2 \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
```
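The two worker commands differ only in three per-worker values: the FlexKV IPC endpoint, the GPU index, and the ZMQ port for KV events. A launcher script could derive them from a worker index, as in the sketch below. The `worker_launch` helper is hypothetical; the environment variables and flags are the ones used in the commands above.

```python
import json
from typing import Dict, List, Tuple


def worker_launch(index: int, model: str = "Qwen/Qwen3-0.6B",
                  base_zmq_port: int = 20080) -> Tuple[Dict[str, str], List[str]]:
    """Return (extra environment, argv) for the worker with the given index."""
    env = {
        "DYNAMO_USE_FLEXKV": "1",
        "FLEXKV_CPU_CACHE_GB": "32",
        # Each worker needs its own FlexKV IPC endpoint and its own GPU.
        "FLEXKV_SERVER_RECV_PORT": f"ipc:///tmp/flexkv_server_{index}",
        "CUDA_VISIBLE_DEVICES": str(index),
    }
    kv_events = {
        "publisher": "zmq",
        "topic": "kv-events",
        # Each worker publishes KV events on a distinct TCP port.
        "endpoint": f"tcp://*:{base_zmq_port + index}",
        "enable_kv_cache_events": True,
    }
    argv = [
        "python", "-m", "dynamo.vllm",
        "--model", model,
        "--connector", "flexkv",
        "--gpu-memory-utilization", "0.2",
        "--kv-events-config", json.dumps(kv_events, separators=(",", ":")),
    ]
    return env, argv


launches = [worker_launch(i) for i in range(2)]
# To start a worker: subprocess.Popen(argv, env={**os.environ, **env})
```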
## Disaggregated Serving
FlexKV can be used with disaggregated prefill/decode serving. The prefill worker uses FlexKV for KV cache offloading, while NIXL handles KV transfer between prefill and decode workers.
```bash
# Terminal 1: Start frontend
python -m dynamo.frontend &
# Terminal 2: Decode worker (without FlexKV)
CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector nixl &
# Terminal 3: Prefill worker (with FlexKV)
DYN_VLLM_KV_EVENT_PORT=20081 \
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--is-prefill-worker \
--connector nixl flexkv
```
## Configuration
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `DYNAMO_USE_FLEXKV` | Enable FlexKV integration | `0` (disabled) |
| `FLEXKV_CPU_CACHE_GB` | CPU memory cache size in GB | Required |
| `FLEXKV_CONFIG_PATH` | Path to FlexKV YAML config file | Not set |
| `FLEXKV_SERVER_RECV_PORT` | IPC endpoint for the FlexKV server (for example `ipc:///tmp/flexkv_server_0`); must be unique per worker on the same host | Auto |
### CPU-Only Offloading
For simple CPU memory offloading:
```bash
unset FLEXKV_CONFIG_PATH
export FLEXKV_CPU_CACHE_GB=32
```
### CPU + SSD Tiered Offloading
For multi-tier offloading with SSD storage, create a configuration file:
```bash
cat > ./flexkv_config.yml <<EOF
cpu_cache_gb: 32
ssd_cache_gb: 1024
ssd_cache_dir: /data0/flexkv_ssd/;/data1/flexkv_ssd/
enable_gds: false
EOF
export FLEXKV_CONFIG_PATH="./flexkv_config.yml"
```
### Configuration Options
| Option | Description |
|--------|-------------|
| `cpu_cache_gb` | CPU memory cache size in GB |
| `ssd_cache_gb` | SSD cache size in GB |
| `ssd_cache_dir` | SSD cache directories (semicolon-separated for multiple SSDs) |
| `enable_gds` | Enable GPU Direct Storage for SSD I/O |
> **Note:** For full configuration options, see the [FlexKV Configuration Reference](https://github.com/taco-project/FlexKV/blob/main/docs/flexkv_config_reference/README_en.md).
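If the configuration file is generated by a deployment script rather than a heredoc, the four options can be rendered from a small data class. This is a sketch under the assumption that only these four keys are needed; `FlexKVConfig` is a hypothetical helper, not a FlexKV API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FlexKVConfig:
    """Hypothetical helper mirroring the four options in the table above."""
    cpu_cache_gb: int = 32
    ssd_cache_gb: int = 1024
    ssd_cache_dirs: List[str] = field(default_factory=list)
    enable_gds: bool = False

    def to_yaml(self) -> str:
        lines = [
            f"cpu_cache_gb: {self.cpu_cache_gb}",
            f"ssd_cache_gb: {self.ssd_cache_gb}",
            # Multiple SSD directories are joined with semicolons.
            f"ssd_cache_dir: {';'.join(self.ssd_cache_dirs)}",
            f"enable_gds: {str(self.enable_gds).lower()}",
        ]
        return "\n".join(lines) + "\n"


config = FlexKVConfig(ssd_cache_dirs=["/data0/flexkv_ssd/", "/data1/flexkv_ssd/"])
yaml_text = config.to_yaml()
# Write yaml_text to a file and point FLEXKV_CONFIG_PATH at it.
```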
## Distributed KV Cache Reuse
FlexKV supports distributed KV cache reuse, which shares cached blocks across multiple nodes. It is built on:
- **Distributed RadixTree**: Each node maintains a local snapshot of the global index
- **Lease Mechanism**: Ensures data validity during cross-node transfers
- **RDMA-based Transfer**: Uses Mooncake Transfer Engine for high-performance KV cache transfer
For setup instructions, see the [FlexKV Distributed Reuse Guide](https://github.com/taco-project/FlexKV/blob/main/docs/dist_reuse/README_en.md).
## Architecture
FlexKV consists of three core modules:
### StorageEngine
Initializes the three-level cache (GPU → CPU → SSD/Cloud). It groups multiple tokens into blocks and stores KV cache at the block level, maintaining the same KV shape as in GPU memory.
### GlobalCacheEngine
The control plane that determines data transfer direction and identifies source/destination block IDs. Includes:
- RadixTree for prefix matching
- Memory pool to track space usage and trigger eviction
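To make the memory pool's role concrete, the sketch below tracks block usage and evicts when capacity is reached. It assumes a least-recently-used policy purely for illustration; this document does not state which eviction policy FlexKV uses, and `BlockPool` is a hypothetical name.

```python
from collections import OrderedDict
from typing import List


class BlockPool:
    """Toy memory pool: tracks used blocks and evicts the least recently used."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity = capacity_blocks
        self._blocks: "OrderedDict[int, None]" = OrderedDict()

    def touch(self, block_hash: int) -> List[int]:
        """Record a use of block_hash; return hashes evicted to make room."""
        if block_hash in self._blocks:
            self._blocks.move_to_end(block_hash)  # mark as most recently used
            return []
        evicted: List[int] = []
        while len(self._blocks) >= self.capacity:
            oldest, _ = self._blocks.popitem(last=False)
            evicted.append(oldest)
        self._blocks[block_hash] = None
        return evicted


pool = BlockPool(capacity_blocks=2)
pool.touch(101)
pool.touch(102)
pool.touch(101)            # 101 becomes most recently used
evicted = pool.touch(103)  # pool is full, so the oldest block (102) is evicted
```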
### TransferEngine
The data plane that executes data transfers:
- Multi-threading for parallel transfers
- High-performance I/O (io_uring, GDS)
- Asynchronous operations overlapping with computation
## Verify Deployment
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false,
"max_tokens": 30
}'
```
## See Also
- [FlexKV GitHub Repository](https://github.com/taco-project/FlexKV)
- [FlexKV vLLM Adapter Documentation](https://github.com/taco-project/FlexKV/blob/main/docs/vllm_adapter/README_en.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# KV Event Publishing for Custom Engines
This document explains how to implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing.
## Overview
The KV Router relies on real-time events from backend workers to track which KV cache blocks are stored on each worker. When your custom engine allocates or evicts KV cache blocks, it should publish these events so the router can make optimal routing decisions.
There are two main publishing pathways:
1. **Direct NATS publishing** (`KvEventPublisher`) - Publishes events directly to NATS. Simplest approach for custom engines.
2. **ZMQ-based publishing** - For engines with ZMQ event output (like vLLM). Uses a ZMQ publisher in the engine and `ZmqKvEventPublisher` to forward events to NATS.
## Event Types
The KV cache supports three event types:
| Event Type | Description | When to Publish |
|------------|-------------|-----------------|
| `BlockStored` | New blocks added to cache | After KV cache allocation succeeds |
| `BlockRemoved` | Blocks evicted from cache | When blocks are evicted or freed |
| `AllBlocksCleared` | All blocks removed | On cache reset or worker restart |
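The sketch below shows how a router-side index could consume these three event types. It is a simplified illustration, not Dynamo's `KvIndexer`: `ToyKvIndex` is a hypothetical class that keeps a set of block hashes per worker and prefers the worker with the longest cached prefix, ignoring load and other factors a real router would weigh.

```python
from typing import Dict, Sequence, Set


class ToyKvIndex:
    """Hypothetical router-side view: worker id -> set of cached block hashes."""

    def __init__(self) -> None:
        self.blocks: Dict[int, Set[int]] = {}

    def apply(self, worker_id: int, event_type: str, block_hashes: Sequence[int] = ()) -> None:
        cached = self.blocks.setdefault(worker_id, set())
        if event_type == "BlockStored":
            cached.update(block_hashes)
        elif event_type == "BlockRemoved":
            cached.difference_update(block_hashes)
        elif event_type == "AllBlocksCleared":
            cached.clear()
        else:
            raise ValueError(f"unknown event type: {event_type}")

    def overlap(self, worker_id: int, request_hashes: Sequence[int]) -> int:
        """Count leading request blocks already cached on the worker."""
        cached = self.blocks.get(worker_id, set())
        count = 0
        for block_hash in request_hashes:
            if block_hash not in cached:
                break
            count += 1
        return count

    def best_worker(self, request_hashes: Sequence[int]) -> int:
        # Assumes at least one worker has published an event.
        return max(self.blocks, key=lambda w: self.overlap(w, request_hashes))


index = ToyKvIndex()
index.apply(0, "BlockStored", [11, 22, 33])
index.apply(1, "BlockStored", [11, 22])
index.apply(0, "BlockRemoved", [33])
index.apply(1, "BlockStored", [33])
best = index.best_worker([11, 22, 33, 44])
```

If a worker fails to publish `BlockRemoved` or `AllBlocksCleared`, an index like this keeps stale entries and routes requests to a worker that no longer holds the blocks, which is why all three events matter.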
### Event Structure
Each event contains:
- **`event_id`**: Monotonically increasing identifier per worker
- **`dp_rank`**: Data parallel rank (0 if DP not enabled)
- **`data`**: One of `Stored`, `Removed`, or `Cleared`
For `BlockStored` events:
- **`token_ids`**: List of token IDs for the stored blocks
- **`block_hashes`**: List of **sequence block hashes** from the engine's block manager. These are cumulative hashes that incorporate all tokens from the start of the sequence up to and including the current block (not just the tokens within that block). This enables prefix matching across requests.
- **`num_block_tokens`**: List with the number of tokens in each block, one entry per block hash (every entry should equal `kv_block_size`)
- **`parent_hash`**: Hash of the parent block. Required for all blocks except the first block in a sequence (which has no parent).
- **`lora_id`**: LoRA adapter ID (0 if not using LoRA)
For `BlockRemoved` events:
- **`block_hashes`**: List of sequence block hashes being evicted
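The following sketch illustrates what "cumulative" means for sequence block hashes. It assumes SHA-256 truncated to 64 bits and a parent value of 0 for the first block, both chosen only for illustration; in practice the hashes you publish should come from your engine's own block manager, and `sequence_block_hashes` is a hypothetical helper.

```python
import hashlib
import struct
from typing import List, Optional


def sequence_block_hashes(token_ids: List[int], block_size: int) -> List[int]:
    """Return one cumulative 64-bit hash per full block of token_ids."""
    hashes: List[int] = []
    parent: Optional[int] = None
    full_blocks = len(token_ids) // block_size
    for i in range(full_blocks):
        block = token_ids[i * block_size:(i + 1) * block_size]
        digest = hashlib.sha256()
        # Chaining the parent hash makes each hash depend on the whole prefix.
        digest.update(struct.pack("<Q", parent if parent is not None else 0))
        digest.update(struct.pack(f"<{len(block)}I", *block))
        parent = int.from_bytes(digest.digest()[:8], "little")
        hashes.append(parent)
    return hashes


a = sequence_block_hashes([1, 2, 3, 4, 5, 6, 7, 8], block_size=4)
b = sequence_block_hashes([1, 2, 3, 4, 9, 9, 9, 9], block_size=4)
c = sequence_block_hashes([0, 0, 0, 0, 5, 6, 7, 8], block_size=4)
```

Sequences `a` and `b` share their first block, so their first hashes are equal and the router can match that prefix. Sequences `a` and `c` contain the same tokens in their second block but differ in the first, so their second hashes differ.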
## Option 1: Direct NATS Publishing (Recommended)
The `KvEventPublisher` class publishes events directly to NATS. This is the simplest approach for custom engines.
```mermaid
flowchart LR
subgraph Engine["Custom Engine"]
cache["KV Cache Manager"]
end
subgraph Worker["Dynamo Worker Process"]
pub["KvEventPublisher"]
end
subgraph NATS["NATS"]
subject["kv-events subject"]
end
subgraph Router["KV Router"]
indexer["KvIndexer"]
end
cache -->|"on_blocks_stored()<br/>on_blocks_removed()"| pub
pub -->|"publish to NATS"| subject
subject --> indexer
```
**When to use:**
- Building a custom inference engine from scratch
- Your engine doesn't have a ZMQ-based event system
- You want the simplest integration path
### Basic Setup
```python
from dynamo.llm import KvEventPublisher


class CustomEnginePublisher:
    def __init__(self, component, worker_id: int, block_size: int, dp_rank: int = 0):
        self.block_size = block_size
        self.event_id = 0
        self.kv_publisher = KvEventPublisher(
            component=component,
            worker_id=worker_id,
            kv_block_size=block_size,
            dp_rank=dp_rank,
            enable_local_indexer=False,
        )

    def on_blocks_stored(self, token_ids: list[int], block_hashes: list[int],
                         lora_id: int = 0, parent_hash: int | None = None):
        """Call after KV cache blocks are allocated."""
        self.event_id += 1
        num_block_tokens = [self.block_size] * len(block_hashes)
        self.kv_publisher.publish_stored(
            event_id=self.event_id,
            token_ids=token_ids,
            num_block_tokens=num_block_tokens,
            block_hashes=block_hashes,
            lora_id=lora_id,
            parent_hash=parent_hash,
        )

    def on_blocks_removed(self, block_hashes: list[int]):
        """Call when KV cache blocks are evicted."""
        self.event_id += 1
        self.kv_publisher.publish_removed(event_id=self.event_id, block_hashes=block_hashes)
```
### Integration with Your Engine
```python
from dynamo.llm import register_llm


async def main():
    # Register your engine with Dynamo
    component, endpoint = await register_llm(
        model="my-model",
        generator=my_generate_fn,
    )

    # Initialize publisher
    publisher = CustomEnginePublisher(
        component=component,
        worker_id=endpoint.connection_id(),
        block_size=16,  # Match your engine's block size
    )

    # Hook into your engine's cache events
    def on_prefill_complete(request_id, token_ids, blocks):
        block_hashes = [block.hash for block in blocks]
        publisher.on_blocks_stored(token_ids=token_ids, block_hashes=block_hashes)

    def on_cache_eviction(evicted_blocks):
        block_hashes = [block.hash for block in evicted_blocks]
        publisher.on_blocks_removed(block_hashes=block_hashes)
```
## Option 2: ZMQ-based Publishing
For engines that publish events via ZMQ (like vLLM), this option uses two components that work together:
1. **ZMQ Publisher** (in your engine) - Publishes events to a ZMQ socket
2. **ZmqKvEventPublisher** (Dynamo binding) - Subscribes to ZMQ and forwards to NATS
```mermaid
flowchart LR
subgraph Engine["Custom Engine / vLLM"]
cache["KV Cache Manager"]
zmq_pub["ZMQ Publisher<br/>(Pure Python)"]
end
subgraph ZMQ["ZMQ Socket"]
socket["tcp://127.0.0.1:5557"]
end
subgraph Worker["Dynamo Worker Process"]
zmq_sub["ZmqKvEventPublisher<br/>(Rust bindings)"]
end
subgraph NATS["NATS"]
subject["kv-events subject"]
end
subgraph Router["KV Router"]
indexer["KvIndexer"]
end
cache --> zmq_pub
zmq_pub -->|"PUB"| socket
socket -->|"SUB"| zmq_sub
zmq_sub --> subject
subject --> indexer
```
**When to use:**
- Your engine already has a ZMQ-based event system (like vLLM)
- You're integrating with a consolidator (like KVBM)
- You want to decouple event publishing from your engine's main loop
### Part 1: ZMQ Subscriber (Dynamo Bindings)
If your engine already publishes to ZMQ, use `KvEventPublisher` with a `ZmqKvEventPublisherConfig` to subscribe and forward to NATS:
```python
from dynamo.llm import KvEventPublisher, ZmqKvEventPublisherConfig

# Configure the ZMQ subscriber
config = ZmqKvEventPublisherConfig(
    worker_id=endpoint.connection_id(),
    kv_block_size=block_size,
    zmq_endpoint="tcp://127.0.0.1:5557",  # Where your engine publishes
    zmq_topic="",  # Subscribe to all topics
    enable_local_indexer=False,
)

# Create publisher - it automatically subscribes to ZMQ and forwards to NATS
kv_publisher = KvEventPublisher(
    component=component,
    zmq_config=config,
)
```
### Part 2: ZMQ Publisher (Pure Python)
If your engine needs to publish to ZMQ (e.g., for consolidator integration), implement the ZMQ protocol:
```python
import time

import msgpack
import zmq


class ZmqKvEventPublisher:
    """Pure Python ZMQ publisher for KV events (vLLM-compatible format)."""

    def __init__(self, zmq_endpoint: str, kv_block_size: int, topic: str = ""):
        self.kv_block_size = kv_block_size
        self.topic = topic
        self.ctx = zmq.Context()
        self.socket = self.ctx.socket(zmq.PUB)
        self.socket.bind(zmq_endpoint)
        self.sequence = 0
        self.data_parallel_rank = 0

    def _to_signed_i64(self, value: int | None) -> int | None:
        # Reinterpret an unsigned 64-bit hash as a signed 64-bit integer.
        if value is None:
            return None
        return value - 0x10000000000000000 if value > 0x7FFFFFFFFFFFFFFF else value

    def publish_stored(self, event_id: int, token_ids: list[int], num_block_tokens: list[int],
                       block_hashes: list[int], lora_id: int = 0, parent_hash: int | None = None):
        # event_id and num_block_tokens are accepted but not included in this payload.
        event = {
            "type": "BlockStored",
            "block_hashes": [self._to_signed_i64(h) for h in block_hashes],
            "parent_block_hash": self._to_signed_i64(parent_hash),
            "token_ids": token_ids,
            "block_size": self.kv_block_size,
            "lora_id": lora_id if lora_id != 0 else None,
        }
        self._publish_event(event)

    def publish_removed(self, event_id: int, block_hashes: list[int]):
        event = {"type": "BlockRemoved", "block_hashes": [self._to_signed_i64(h) for h in block_hashes]}
        self._publish_event(event)

    def publish_all_cleared(self):
        self._publish_event({"type": "AllBlocksCleared"})

    def _publish_event(self, event: dict):
        # Payload layout: [timestamp, [events], dp_rank], sent as the third frame.
        batch = [time.time(), [event], self.data_parallel_rank]
        payload = msgpack.packb(batch, use_bin_type=True)
        sequence_bytes = self.sequence.to_bytes(8, byteorder="big")
        self.sequence += 1
        self.socket.send_multipart([self.topic.encode(), sequence_bytes, payload])

    def shutdown(self):
        self.socket.close()
        self.ctx.term()
```
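
The `_to_signed_i64` helper in the publisher reinterprets an unsigned 64-bit hash as a signed 64-bit integer. The source does not say why the conversion is needed, so the sketch below only demonstrates that the mapping is lossless. `to_unsigned_u64` is a hypothetical inverse added for illustration and is not part of the publisher or of any Dynamo API.

```python
def to_signed_i64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return value - (1 << 64) if value > 0x7FFFFFFFFFFFFFFF else value


def to_unsigned_u64(value: int) -> int:
    """Hypothetical inverse: recover the unsigned 64-bit value."""
    return value & 0xFFFFFFFFFFFFFFFF


block_hash = 0xFFFFFFFFFFFFFFFF           # largest unsigned 64-bit value
signed = to_signed_i64(block_hash)        # wraps around to a negative number
restored = to_unsigned_u64(signed)        # same bit pattern, unsigned again
print(signed, restored == block_hash)
```

Values at or below `0x7FFFFFFFFFFFFFFF` pass through unchanged, so only hashes with the top bit set are affected.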
### ZMQ Wire Format
The ZMQ message format (compatible with vLLM):
| Frame | Description |
|-------|-------------|
| 1 | Topic (empty string for all topics) |
| 2 | Sequence number (8 bytes, big-endian) |
| 3 | Msgpack payload: `[timestamp, [events], dp_rank]` |
Each event in the payload is a dictionary with a `type` field (`BlockStored`, `BlockRemoved`, or `AllBlocksCleared`).
## Best Practices
1. **Event IDs must be monotonically increasing** per worker (use a thread-safe counter)
2. **Block size must match** your engine's actual `kv_block_size`
3. **`parent_hash` is required** for all blocks except the first in a sequence - it links blocks to enable prefix matching
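
For the first practice, a thread-safe counter can be as simple as an integer guarded by a lock. The sketch below is one possible approach; `EventIdCounter` is a hypothetical helper and not part of `dynamo.llm`.

```python
import threading


class EventIdCounter:
    """Hands out strictly increasing event IDs across threads."""

    def __init__(self, start: int = 0):
        self._lock = threading.Lock()
        self._value = start

    def next(self) -> int:
        # The lock makes the increment and the read a single atomic step.
        with self._lock:
            self._value += 1
            return self._value


counter = EventIdCounter()
first = counter.next()
second = counter.next()
print(first, second)
```

A counter like this guarantees unique, increasing IDs, but not that events are published in ID order. If several threads publish, allocating the ID and calling the publisher likely need to happen under the same lock.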
## See Also
- **[Router README](../components/router/README.md)**: Quick start guide for the KV Router
- **[Router Guide](../components/router/router-guide.md)**: Configuration, tuning, and production setup
- **[Router Design](../design-docs/router-design.md)**: Architecture details and event transport modes
...@@ -11,18 +11,12 @@ LMCache is a high-performance KV cache layer that supercharges LLM serving by en ...@@ -11,18 +11,12 @@ LMCache is a high-performance KV cache layer that supercharges LLM serving by en
This document describes how LMCache is integrated into Dynamo's vLLM backend to provide enhanced performance and memory efficiency. This document describes how LMCache is integrated into Dynamo's vLLM backend to provide enhanced performance and memory efficiency.
### Key Benefits
- **Reduced Time to First Token (TTFT)**: Eliminates redundant prefill computations
- **Memory Offloading**: Intelligent KV cache placement across CPU/GPU/storage tiers
- **Improved Throughput**: Reduced GPU memory pressure enables higher batch sizes
## Platform Support ## Platform Support
**Important Note**: LMCache integration currently only supports x86 architecture. ARM64 is not supported at this time. **Important Note**: LMCache integration currently only supports x86 architecture. ARM64 is not supported at this time.
## Aggregated Serving ## Aggregated Serving
### Configuration ### Configuration
LMCache is enabled using the `--connector lmcache` flag: LMCache is enabled using the `--connector lmcache` flag:
...@@ -36,6 +30,7 @@ python -m dynamo.vllm --model <model_name> --connector lmcache ...@@ -36,6 +30,7 @@ python -m dynamo.vllm --model <model_name> --connector lmcache
LMCache configuration can be customized via environment variables listed [here](https://docs.lmcache.ai/api_reference/configurations.html). LMCache configuration can be customized via environment variables listed [here](https://docs.lmcache.ai/api_reference/configurations.html).
For advanced configurations, LMCache supports multiple [storage backends](https://docs.lmcache.ai/index.html): For advanced configurations, LMCache supports multiple [storage backends](https://docs.lmcache.ai/index.html):
- **CPU RAM**: Fast local memory offloading - **CPU RAM**: Fast local memory offloading
- **Local Storage**: Disk-based persistence - **Local Storage**: Disk-based persistence
- **Redis**: Distributed cache sharing - **Redis**: Distributed cache sharing
...@@ -51,12 +46,13 @@ Use the provided launch script for quick setup: ...@@ -51,12 +46,13 @@ Use the provided launch script for quick setup:
``` ```
This will: This will:
1. Start the dynamo frontend 1. Start the Dynamo frontend
2. Launch a single vLLM worker with LMCache enabled 2. Launch a single vLLM worker with LMCache enabled
### Architecture for Aggregated Mode ### Architecture for Aggregated Mode
In aggregated mode, the system uses: In aggregated mode, the system uses:
- **KV Connector**: `LMCacheConnectorV1` - **KV Connector**: `LMCacheConnectorV1`
- **KV Role**: `kv_both` (handles both reading and writing) - **KV Role**: `kv_both` (handles both reading and writing)
...@@ -66,14 +62,14 @@ Disaggregated serving separates prefill and decode operations into dedicated wor ...@@ -66,14 +62,14 @@ Disaggregated serving separates prefill and decode operations into dedicated wor
### Deployment ### Deployment
Use the provided disaggregated launch script(the script requires at least 2 GPUs): Use the provided disaggregated launch script (requires at least 2 GPUs):
```bash ```bash
./examples/backends/vllm/launch/disagg_lmcache.sh ./examples/backends/vllm/launch/disagg_lmcache.sh
``` ```
This will: This will:
1. Start the dynamo frontend 1. Start the Dynamo frontend
2. Launch a decode worker on GPU 0 2. Launch a decode worker on GPU 0
3. Wait for initialization 3. Wait for initialization
4. Launch a prefill worker on GPU 1 with LMCache enabled 4. Launch a prefill worker on GPU 1 with LMCache enabled
...@@ -81,14 +77,16 @@ This will: ...@@ -81,14 +77,16 @@ This will:
### Worker Roles ### Worker Roles
#### Decode Worker #### Decode Worker
- **Purpose**: Handles token generation (decode phase) - **Purpose**: Handles token generation (decode phase)
- **GPU Assignment**: CUDA_VISIBLE_DEVICES=0 - **GPU Assignment**: CUDA_VISIBLE_DEVICES=0
- **LMCache Config**: Uses `NixlConnector` only for kv transfer between prefill and decode workers - **LMCache Config**: Uses `NixlConnector` only for KV transfer between prefill and decode workers
#### Prefill Worker #### Prefill Worker
- **Purpose**: Handles prompt processing (prefill phase) - **Purpose**: Handles prompt processing (prefill phase)
- **GPU Assignment**: CUDA_VISIBLE_DEVICES=1 - **GPU Assignment**: CUDA_VISIBLE_DEVICES=1
- **LMCache Config**: Uses `MultiConnector` with both LMCache and NIXL connectors. This enables prefill worker to use LMCache for kv offloading and use NIXL for kv transfer between prefill and decode workers. - **LMCache Config**: Uses `MultiConnector` with both LMCache and NIXL connectors. This enables prefill worker to use LMCache for KV offloading and use NIXL for KV transfer between prefill and decode workers.
- **Flag**: `--is-prefill-worker` - **Flag**: `--is-prefill-worker`
## Architecture ## Architecture
...@@ -98,6 +96,7 @@ This will: ...@@ -98,6 +96,7 @@ This will:
The system automatically configures KV transfer based on the deployment mode and worker type: The system automatically configures KV transfer based on the deployment mode and worker type:
#### Prefill Worker (Disaggregated Mode) #### Prefill Worker (Disaggregated Mode)
```python ```python
kv_transfer_config = KVTransferConfig( kv_transfer_config = KVTransferConfig(
kv_connector="PdConnector", kv_connector="PdConnector",
...@@ -112,6 +111,7 @@ kv_transfer_config = KVTransferConfig( ...@@ -112,6 +111,7 @@ kv_transfer_config = KVTransferConfig(
``` ```
#### Decode Worker or Aggregated Mode #### Decode Worker or Aggregated Mode
```python ```python
kv_transfer_config = KVTransferConfig( kv_transfer_config = KVTransferConfig(
kv_connector="LMCacheConnectorV1", kv_connector="LMCacheConnectorV1",
...@@ -120,6 +120,7 @@ kv_transfer_config = KVTransferConfig( ...@@ -120,6 +120,7 @@ kv_transfer_config = KVTransferConfig(
``` ```
#### Fallback (No LMCache) #### Fallback (No LMCache)
```python ```python
kv_transfer_config = KVTransferConfig( kv_transfer_config = KVTransferConfig(
kv_connector="NixlConnector", kv_connector="NixlConnector",
...@@ -138,7 +139,6 @@ kv_transfer_config = KVTransferConfig( ...@@ -138,7 +139,6 @@ kv_transfer_config = KVTransferConfig(
- Creates vLLM engine with proper KV transfer config - Creates vLLM engine with proper KV transfer config
- Handles both aggregated and disaggregated modes - Handles both aggregated and disaggregated modes
### Best Practices ### Best Practices
1. **Chunk Size Tuning**: Adjust `LMCACHE_CHUNK_SIZE` based on your use case: 1. **Chunk Size Tuning**: Adjust `LMCACHE_CHUNK_SIZE` based on your use case:
...@@ -159,15 +159,16 @@ kv_transfer_config = KVTransferConfig( ...@@ -159,15 +159,16 @@ kv_transfer_config = KVTransferConfig(
When LMCache is enabled with `--connector lmcache` and `DYN_SYSTEM_PORT` is set, LMCache metrics are automatically exposed via Dynamo's `/metrics` endpoint alongside vLLM and Dynamo metrics. When LMCache is enabled with `--connector lmcache` and `DYN_SYSTEM_PORT` is set, LMCache metrics are automatically exposed via Dynamo's `/metrics` endpoint alongside vLLM and Dynamo metrics.
**Requirements to access LMCache metrics:** **Requirements to access LMCache metrics:**
- `--connector lmcache` - Enables LMCache - `--connector lmcache` - Enables LMCache
- `DYN_SYSTEM_PORT=8081` - Enables metrics HTTP endpoint - `DYN_SYSTEM_PORT=8081` - Enables metrics HTTP endpoint
- `PROMETHEUS_MULTIPROC_DIR` (optional) - If not set, Dynamo manages it internally. Only set explicitly if you need control over the metrics directory. - `PROMETHEUS_MULTIPROC_DIR` (optional) - If not set, Dynamo manages it internally
For detailed information on LMCache metrics, including the complete list of available metrics and how to access them, see the **[LMCache Metrics section](prometheus.md#lmcache-metrics)** in the vLLM Prometheus Metrics Guide. For detailed information on LMCache metrics, including the complete list of available metrics and how to access them, see the **[LMCache Metrics section](../backends/vllm/prometheus.md#lmcache-metrics)** in the vLLM Prometheus Metrics Guide.
### Troubleshooting ## Troubleshooting
#### LMCache log: `PrometheusLogger instance already created with different metadata` ### LMCache log: `PrometheusLogger instance already created with different metadata`
You may see an error like: You may see an error like:
...@@ -197,7 +198,7 @@ vllm serve Qwen/Qwen3-0.6B \ ...@@ -197,7 +198,7 @@ vllm serve Qwen/Qwen3-0.6B \
- **Mitigation (silence)**: set `LMCACHE_LOG_LEVEL=CRITICAL`. - **Mitigation (silence)**: set `LMCACHE_LOG_LEVEL=CRITICAL`.
- **Upstream issue**: [vLLM issue #30996](https://github.com/vllm-project/vllm/issues/30996). - **Upstream issue**: [vLLM issue #30996](https://github.com/vllm-project/vllm/issues/30996).
#### vLLM log: `Found PROMETHEUS_MULTIPROC_DIR was set by user` ### vLLM log: `Found PROMETHEUS_MULTIPROC_DIR was set by user`
vLLM v1 uses `prometheus_client.multiprocess` and stores intermediate metric values in `PROMETHEUS_MULTIPROC_DIR`. vLLM v1 uses `prometheus_client.multiprocess` and stores intermediate metric values in `PROMETHEUS_MULTIPROC_DIR`.
......
...@@ -5,8 +5,6 @@ ...@@ -5,8 +5,6 @@
# Deploying Dynamo on Kubernetes # Deploying Dynamo on Kubernetes
[Link to installation](../getting-started/installation.md)
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides. High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
## Important Terminology ## Important Terminology
...@@ -47,7 +45,7 @@ Before deploying the platform, run the pre-deployment checks to ensure the clust ...@@ -47,7 +45,7 @@ Before deploying the platform, run the pre-deployment checks to ensure the clust
./deploy/pre-deployment/pre-deployment-check.sh ./deploy/pre-deployment/pre-deployment-check.sh
``` ```
This validates kubectl connectivity, StorageClass configuration, and GPU availability. See [pre-deployment checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README.md) for more details. This validates kubectl connectivity, StorageClass configuration, and GPU availability. See [pre-deployment checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README) for more details.
## 1. Install Platform First ## 1. Install Platform First
...@@ -80,9 +78,9 @@ Each backend has deployment examples and configuration options: ...@@ -80,9 +78,9 @@ Each backend has deployment examples and configuration options:
| Backend | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node | | Backend | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node |
|--------------|:----------:|:-------------------:|:-------------:|:----------------------:|:-----------------------:|:------------------------:| |--------------|:----------:|:-------------------:|:-------------:|:----------------------:|:-----------------------:|:------------------------:|
| **[SGLang](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | **[SGLang](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **[TensorRT-LLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | | **[TensorRT-LLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README)** | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ |
| **[vLLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | **[vLLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
## 3. Deploy Your First Model ## 3. Deploy Your First Model
...@@ -107,7 +105,7 @@ kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE} ...@@ -107,7 +105,7 @@ kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models curl http://localhost:8000/v1/models
``` ```
For SLA-based autoscaling, see [SLA Planner Quick Start Guide](../planner/sla-planner-quickstart.md). For SLA-based autoscaling, see [SLA Planner Guide](../components/planner/planner-guide.md).
## Understanding Dynamo's Custom Resources ## Understanding Dynamo's Custom Resources
...@@ -228,13 +226,13 @@ Key customization points include: ...@@ -228,13 +226,13 @@ Key customization points include:
## Additional Resources ## Additional Resources
- **[Examples](../getting-started/examples.md)** - Complete working examples - **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/README.md)** - Complete working examples
- **[Create Custom Deployments](deployment/create-deployment.md)** - Build your own CRDs - **[Create Custom Deployments](deployment/create-deployment.md)** - Build your own CRDs
- **[Managing Models with DynamoModel](deployment/dynamomodel-guide.md)** - Deploy LoRA adapters and manage models - **[Managing Models with DynamoModel](deployment/dynamomodel-guide.md)** - Deploy LoRA adapters and manage models
- **[Operator Documentation](dynamo-operator.md)** - How the platform works - **[Operator Documentation](dynamo-operator.md)** - How the platform works
- **[Service Discovery](service-discovery.md)** - Discovery backends and configuration - **[Service Discovery](service-discovery.md)** - Discovery backends and configuration
- **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README.md)** - For advanced users - **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README)** - For advanced users
- **[Checkpointing](chrek/dynamo.md)** - Fast pod startup with checkpoint/restore - **[Checkpointing](chrek/README.md)** - Fast pod startup with checkpoint/restore
- **[GitOps Deployment with FluxCD](fluxcd.md)** - For advanced users - **[GitOps Deployment with FluxCD](fluxcd.md)** - For advanced users
- **[Logging](observability/logging.md)** - For logging setup - **[Logging](observability/logging.md)** - For logging setup
- **[Multinode Deployment](deployment/multinode-deployment.md)** - For multinode deployment - **[Multinode Deployment](deployment/multinode-deployment.md)** - For multinode deployment
......
...@@ -3,12 +3,11 @@ ...@@ -3,12 +3,11 @@
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
--- ---
# API Reference > **⚠️ Important**: This documentation is automatically generated from source code.
> [!IMPORTANT]
> This documentation is automatically generated from source code.
> Do not edit this file directly. > Do not edit this file directly.
# API Reference
## Packages ## Packages
- [nvidia.com/v1alpha1](#nvidiacomv1alpha1) - [nvidia.com/v1alpha1](#nvidiacomv1alpha1)
...@@ -23,6 +22,7 @@ a high-level, SLA-driven interface for deploying machine learning models on Dyna ...@@ -23,6 +22,7 @@ a high-level, SLA-driven interface for deploying machine learning models on Dyna
Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group. Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group.
### Resource Types ### Resource Types
- [DynamoCheckpoint](#dynamocheckpoint)
- [DynamoComponentDeployment](#dynamocomponentdeployment) - [DynamoComponentDeployment](#dynamocomponentdeployment)
- [DynamoGraphDeployment](#dynamographdeployment) - [DynamoGraphDeployment](#dynamographdeployment)
- [DynamoGraphDeploymentRequest](#dynamographdeploymentrequest) - [DynamoGraphDeploymentRequest](#dynamographdeploymentrequest)
...@@ -56,6 +56,24 @@ _Appears in:_ ...@@ -56,6 +56,24 @@ _Appears in:_
#### CheckpointMode
_Underlying type:_ _string_
CheckpointMode defines how checkpoint creation is handled
_Validation:_
- Enum: [Auto Manual]
_Appears in:_
- [ServiceCheckpointConfig](#servicecheckpointconfig)
| Field | Description |
| --- | --- |
| `Auto` | CheckpointModeAuto means the DGD controller will automatically create a Checkpoint CR<br /> |
| `Manual` | CheckpointModeManual means the user must create the Checkpoint CR themselves<br /> |
#### ComponentKind #### ComponentKind
_Underlying type:_ _string_ _Underlying type:_ _string_
...@@ -137,6 +155,146 @@ _Appears in:_ ...@@ -137,6 +155,146 @@ _Appears in:_
#### DynamoCheckpoint
DynamoCheckpoint is the Schema for the dynamocheckpoints API
It represents a container checkpoint that can be used to restore pods to a warm state
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
| `kind` _string_ | `DynamoCheckpoint` | | |
| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | |
| `spec` _[DynamoCheckpointSpec](#dynamocheckpointspec)_ | | | |
| `status` _[DynamoCheckpointStatus](#dynamocheckpointstatus)_ | | | |
#### DynamoCheckpointIdentity
DynamoCheckpointIdentity defines the inputs that determine checkpoint equivalence
Two checkpoints with the same identity hash are considered equivalent
_Appears in:_
- [DynamoCheckpointSpec](#dynamocheckpointspec)
- [ServiceCheckpointConfig](#servicecheckpointconfig)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `model` _string_ | Model is the model identifier (e.g., "meta-llama/Llama-3-70B") | | Required: \{\} <br /> |
| `backendFramework` _string_ | BackendFramework is the runtime framework (vllm, sglang, trtllm) | | Enum: [vllm sglang trtllm] <br />Required: \{\} <br /> |
| `dynamoVersion` _string_ | DynamoVersion is the Dynamo platform version (optional)<br />If not specified, version is not included in identity hash<br />This ensures checkpoint compatibility across Dynamo releases | | Optional: \{\} <br /> |
| `tensorParallelSize` _integer_ | TensorParallelSize is the tensor parallel configuration | 1 | Minimum: 1 <br />Optional: \{\} <br /> |
| `pipelineParallelSize` _integer_ | PipelineParallelSize is the pipeline parallel configuration | 1 | Minimum: 1 <br />Optional: \{\} <br /> |
| `dtype` _string_ | Dtype is the data type (fp16, bf16, fp8, etc.) | | Optional: \{\} <br /> |
| `maxModelLen` _integer_ | MaxModelLen is the maximum sequence length | | Minimum: 1 <br />Optional: \{\} <br /> |
| `extraParameters` _object (keys:string, values:string)_ | ExtraParameters are additional parameters that affect the checkpoint hash<br />Use for any framework-specific or custom parameters not covered above | | Optional: \{\} <br /> |
#### DynamoCheckpointJobConfig
DynamoCheckpointJobConfig defines the configuration for the checkpoint creation Job
_Appears in:_
- [DynamoCheckpointSpec](#dynamocheckpointspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `podTemplateSpec` _[PodTemplateSpec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#podtemplatespec-v1-core)_ | PodTemplateSpec allows customizing the checkpoint Job pod<br />This should include the container that runs the workload to be checkpointed | | Required: \{\} <br /> |
| `activeDeadlineSeconds` _integer_ | ActiveDeadlineSeconds specifies the maximum time the Job can run | 3600 | Optional: \{\} <br /> |
| `backoffLimit` _integer_ | BackoffLimit specifies the number of retries before marking the Job failed | 3 | Optional: \{\} <br /> |
| `ttlSecondsAfterFinished` _integer_ | TTLSecondsAfterFinished specifies how long to keep the Job after completion | 300 | Optional: \{\} <br /> |
#### DynamoCheckpointPhase
_Underlying type:_ _string_
DynamoCheckpointPhase represents the current phase of the checkpoint lifecycle
_Validation:_
- Enum: [Pending Creating Ready Failed]
_Appears in:_
- [DynamoCheckpointStatus](#dynamocheckpointstatus)
| Field | Description |
| --- | --- |
| `Pending` | DynamoCheckpointPhasePending indicates the checkpoint CR has been created but the Job has not started<br /> |
| `Creating` | DynamoCheckpointPhaseCreating indicates the checkpoint Job is running<br /> |
| `Ready` | DynamoCheckpointPhaseReady indicates the checkpoint tar file is available on the PVC<br /> |
| `Failed` | DynamoCheckpointPhaseFailed indicates the checkpoint creation failed<br /> |
#### DynamoCheckpointSpec
DynamoCheckpointSpec defines the desired state of DynamoCheckpoint
_Appears in:_
- [DynamoCheckpoint](#dynamocheckpoint)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `identity` _[DynamoCheckpointIdentity](#dynamocheckpointidentity)_ | Identity defines the inputs that determine checkpoint equivalence | | Required: \{\} <br /> |
| `job` _[DynamoCheckpointJobConfig](#dynamocheckpointjobconfig)_ | Job defines the configuration for the checkpoint creation Job | | Required: \{\} <br /> |
#### DynamoCheckpointStatus
DynamoCheckpointStatus defines the observed state of DynamoCheckpoint
_Appears in:_
- [DynamoCheckpoint](#dynamocheckpoint)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `phase` _[DynamoCheckpointPhase](#dynamocheckpointphase)_ | Phase represents the current phase of the checkpoint lifecycle | | Enum: [Pending Creating Ready Failed] <br />Optional: \{\} <br /> |
| `identityHash` _string_ | IdentityHash is the computed hash of the checkpoint identity<br />This hash is used to identify equivalent checkpoints | | Optional: \{\} <br /> |
| `location` _string_ | Location is the full URI/path to the checkpoint in the storage backend<br />For PVC: same as TarPath (e.g., /checkpoints/\{hash\}.tar)<br />For S3: s3://bucket/prefix/\{hash\}.tar<br />For OCI: oci://registry/repo:\{hash\} | | Optional: \{\} <br /> |
| `storageType` _[DynamoCheckpointStorageType](#dynamocheckpointstoragetype)_ | StorageType indicates the storage backend type used for this checkpoint | | Enum: [pvc s3 oci] <br />Optional: \{\} <br /> |
| `jobName` _string_ | JobName is the name of the checkpoint creation Job | | Optional: \{\} <br /> |
| `createdAt` _[Time](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#time-v1-meta)_ | CreatedAt is the timestamp when the checkpoint tar was created | | Optional: \{\} <br /> |
| `message` _string_ | Message provides additional information about the current state | | Optional: \{\} <br /> |
| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions represent the latest available observations of the checkpoint's state | | Optional: \{\} <br /> |
#### DynamoCheckpointStorageType
_Underlying type:_ _string_
DynamoCheckpointStorageType defines the supported storage backends for checkpoints
_Validation:_
- Enum: [pvc s3 oci]
_Appears in:_
- [DynamoCheckpointStatus](#dynamocheckpointstatus)
#### DynamoComponentDeployment #### DynamoComponentDeployment
...@@ -182,15 +340,17 @@ _Appears in:_ ...@@ -182,15 +340,17 @@ _Appears in:_
| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br />environment variables in the component containers. | | | | `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br />environment variables in the component containers. | | |
| `volumeMounts` _[VolumeMount](#volumemount) array_ | VolumeMounts references PVCs defined at the top level for volumes to be mounted by the component. | | | | `volumeMounts` _[VolumeMount](#volumemount) array_ | VolumeMounts references PVCs defined at the top level for volumes to be mounted by the component. | | |
| `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). | | | | `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). | | |
| `modelRef` _[ModelReference](#modelreference)_ | ModelRef references a model that this component serves<br />When specified, a headless service will be created for endpoint discovery | | | | `modelRef` _[ModelReference](#modelreference)_ | ModelRef references a model that this component serves<br />When specified, a headless service will be created for endpoint discovery | | Optional: \{\} <br /> |
| `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). | | | | `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). | | |
| `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. | | | | `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. | | Optional: \{\} <br /> |
| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows to override the main pod spec configuration.<br />It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field<br />that allows overriding the main container configuration. | | | | `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows to override the main pod spec configuration.<br />It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field<br />that allows overriding the main container configuration. | | Optional: \{\} <br /> |
| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. | | | | `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. | | |
| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. | | | | `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. | | |
| `replicas` _integer_ | Replicas is the desired number of Pods for this component.<br />When scalingAdapter is enabled, this field is managed by the<br />DynamoGraphDeploymentScalingAdapter and should not be modified directly. | | Minimum: 0 <br /> | | `replicas` _integer_ | Replicas is the desired number of Pods for this component.<br />When scalingAdapter is enabled, this field is managed by the<br />DynamoGraphDeploymentScalingAdapter and should not be modified directly. | | Minimum: 0 <br /> |
| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. | | | | `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. | | |
| `scalingAdapter` _[ScalingAdapter](#scalingadapter)_ | ScalingAdapter configures whether this service uses the DynamoGraphDeploymentScalingAdapter.<br />When enabled, replicas are managed via DGDSA and external autoscalers can scale<br />the service using the Scale subresource. When disabled, replicas can be modified directly. | | | | `scalingAdapter` _[ScalingAdapter](#scalingadapter)_ | ScalingAdapter configures whether this service uses the DynamoGraphDeploymentScalingAdapter.<br />When enabled, replicas are managed via DGDSA and external autoscalers can scale<br />the service using the Scale subresource. When disabled, replicas can be modified directly. | | Optional: \{\} <br /> |
| `eppConfig` _[EPPConfig](#eppconfig)_ | EPPConfig defines EPP-specific configuration options for Endpoint Picker Plugin components.<br />Only applicable when ComponentType is "epp". | | Optional: \{\} <br /> |
| `checkpoint` _[ServiceCheckpointConfig](#servicecheckpointconfig)_ | Checkpoint configures container checkpointing for this service.<br />When enabled, pods can be restored from checkpoint files for a faster cold start. | | Optional: \{\} <br /> |
#### DynamoComponentDeploymentSpec #### DynamoComponentDeploymentSpec
...@@ -220,15 +380,17 @@ _Appears in:_ ...@@ -220,15 +380,17 @@ _Appears in:_
| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br />environment variables in the component containers. | | | | `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br />environment variables in the component containers. | | |
| `volumeMounts` _[VolumeMount](#volumemount) array_ | VolumeMounts references PVCs defined at the top level for volumes to be mounted by the component. | | | | `volumeMounts` _[VolumeMount](#volumemount) array_ | VolumeMounts references PVCs defined at the top level for volumes to be mounted by the component. | | |
| `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). | | | | `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). | | |
| `modelRef` _[ModelReference](#modelreference)_ | ModelRef references a model that this component serves<br />When specified, a headless service will be created for endpoint discovery | | | | `modelRef` _[ModelReference](#modelreference)_ | ModelRef references a model that this component serves<br />When specified, a headless service will be created for endpoint discovery | | Optional: \{\} <br /> |
| `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). | | | | `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). | | |
| `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. | | | | `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. | | Optional: \{\} <br /> |
| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows to override the main pod spec configuration.<br />It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field<br />that allows overriding the main container configuration. | | | | `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows to override the main pod spec configuration.<br />It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field<br />that allows overriding the main container configuration. | | Optional: \{\} <br /> |
| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. | | | | `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. | | |
| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. | | | | `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. | | |
| `replicas` _integer_ | Replicas is the desired number of Pods for this component.<br />When scalingAdapter is enabled, this field is managed by the<br />DynamoGraphDeploymentScalingAdapter and should not be modified directly. | | Minimum: 0 <br /> | | `replicas` _integer_ | Replicas is the desired number of Pods for this component.<br />When scalingAdapter is enabled, this field is managed by the<br />DynamoGraphDeploymentScalingAdapter and should not be modified directly. | | Minimum: 0 <br /> |
| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. | | | | `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. | | |
| `scalingAdapter` _[ScalingAdapter](#scalingadapter)_ | ScalingAdapter configures whether this service uses the DynamoGraphDeploymentScalingAdapter.<br />When enabled, replicas are managed via DGDSA and external autoscalers can scale<br />the service using the Scale subresource. When disabled, replicas can be modified directly. | | | | `scalingAdapter` _[ScalingAdapter](#scalingadapter)_ | ScalingAdapter configures whether this service uses the DynamoGraphDeploymentScalingAdapter.<br />When enabled, replicas are managed via DGDSA and external autoscalers can scale<br />the service using the Scale subresource. When disabled, replicas can be modified directly. | | Optional: \{\} <br /> |
| `eppConfig` _[EPPConfig](#eppconfig)_ | EPPConfig defines EPP-specific configuration options for Endpoint Picker Plugin components.<br />Only applicable when ComponentType is "epp". | | Optional: \{\} <br /> |
| `checkpoint` _[ServiceCheckpointConfig](#servicecheckpointconfig)_ | Checkpoint configures container checkpointing for this service.<br />When enabled, pods can be restored from checkpoint files for a faster cold start. | | Optional: \{\} <br /> |
#### DynamoGraphDeployment #### DynamoGraphDeployment
...@@ -300,7 +462,7 @@ _Appears in:_ ...@@ -300,7 +462,7 @@ _Appears in:_
| `model` _string_ | Model specifies the model to deploy (e.g., "Qwen/Qwen3-0.6B", "meta-llama/Llama-3-70b").<br />This is a high-level identifier for easy reference in kubectl output and logs.<br />The controller automatically sets this value in profilingConfig.config.deployment.model. | | Required: \{\} <br /> | | `model` _string_ | Model specifies the model to deploy (e.g., "Qwen/Qwen3-0.6B", "meta-llama/Llama-3-70b").<br />This is a high-level identifier for easy reference in kubectl output and logs.<br />The controller automatically sets this value in profilingConfig.config.deployment.model. | | Required: \{\} <br /> |
| `backend` _string_ | Backend specifies the inference backend for profiling.<br />The controller automatically sets this value in profilingConfig.config.engine.backend.<br />Profiling runs on real GPUs or via AIC simulation to collect performance data. | | Enum: [vllm sglang trtllm] <br />Required: \{\} <br /> | | `backend` _string_ | Backend specifies the inference backend for profiling.<br />The controller automatically sets this value in profilingConfig.config.engine.backend.<br />Profiling runs on real GPUs or via AIC simulation to collect performance data. | | Enum: [vllm sglang trtllm] <br />Required: \{\} <br /> |
| `useMocker` _boolean_ | UseMocker indicates whether to deploy a mocker DynamoGraphDeployment instead of<br />a real backend deployment. When true, the deployment uses simulated engines that<br />don't require GPUs, using the profiling data to simulate realistic timing behavior.<br />Mocker is available in all backend images and useful for large-scale experiments.<br />Profiling still runs against the real backend (specified above) to collect performance data. | false | | | `useMocker` _boolean_ | UseMocker indicates whether to deploy a mocker DynamoGraphDeployment instead of<br />a real backend deployment. When true, the deployment uses simulated engines that<br />don't require GPUs, using the profiling data to simulate realistic timing behavior.<br />Mocker is available in all backend images and useful for large-scale experiments.<br />Profiling still runs against the real backend (specified above) to collect performance data. | false | |
| `enableGpuDiscovery` _boolean_ | EnableGpuDiscovery controls whether the profiler should automatically discover GPU<br />resources from the Kubernetes cluster nodes. When enabled, the profiler will override<br />any manually specified hardware configuration (min_num_gpus_per_engine, max_num_gpus_per_engine,<br />num_gpus_per_node) with values detected from the cluster.<br />Requires cluster-wide node access permissions - only available with cluster-scoped operators. | false | Optional: \{\} <br /> | | `enableGpuDiscovery` _boolean_ | EnableGpuDiscovery controls whether the profiler should automatically discover GPU<br />resources from the Kubernetes cluster nodes. When enabled, the profiler will override<br />any manually specified hardware configuration (minNumGpusPerEngine, maxNumGpusPerEngine,<br />numGpusPerNode) with values detected from the cluster.<br />Requires cluster-wide node access permissions - only available with cluster-scoped operators. | false | Optional: \{\} <br /> |
| `profilingConfig` _[ProfilingConfigSpec](#profilingconfigspec)_ | ProfilingConfig provides the complete configuration for the profiling job.<br />This configuration is passed directly to the profiler.<br />The structure matches the profile_sla config format exactly (see ProfilingConfigSpec for schema).<br />Note: deployment.model and engine.backend are automatically set from the high-level<br />modelName and backend fields and should not be specified in this config. | | Required: \{\} <br /> | | `profilingConfig` _[ProfilingConfigSpec](#profilingconfigspec)_ | ProfilingConfig provides the complete configuration for the profiling job.<br />This configuration is passed directly to the profiler.<br />The structure matches the profile_sla config format exactly (see ProfilingConfigSpec for schema).<br />Note: deployment.model and engine.backend are automatically set from the high-level<br />modelName and backend fields and should not be specified in this config. | | Required: \{\} <br /> |
| `autoApply` _boolean_ | AutoApply indicates whether to automatically create a DynamoGraphDeployment<br />after profiling completes. If false, only the spec is generated and stored in status.<br />Users can then manually create a DGD using the generated spec. | false | | | `autoApply` _boolean_ | AutoApply indicates whether to automatically create a DynamoGraphDeployment<br />after profiling completes. If false, only the spec is generated and stored in status.<br />Users can then manually create a DGD using the generated spec. | false | |
| `deploymentOverrides` _[DeploymentOverridesSpec](#deploymentoverridesspec)_ | DeploymentOverrides allows customizing metadata for the auto-created DGD.<br />Only applicable when AutoApply is true. | | Optional: \{\} <br /> | | `deploymentOverrides` _[DeploymentOverridesSpec](#deploymentoverridesspec)_ | DeploymentOverrides allows customizing metadata for the auto-created DGD.<br />Only applicable when AutoApply is true. | | Optional: \{\} <br /> |
...@@ -324,7 +486,7 @@ _Appears in:_ ...@@ -324,7 +486,7 @@ _Appears in:_
| `backend` _string_ | Backend is extracted from profilingConfig.config.engine.backend for display purposes.<br />This field is populated by the controller and shown in kubectl output. | | Optional: \{\} <br /> | | `backend` _string_ | Backend is extracted from profilingConfig.config.engine.backend for display purposes.<br />This field is populated by the controller and shown in kubectl output. | | Optional: \{\} <br /> |
| `observedGeneration` _integer_ | ObservedGeneration reflects the generation of the most recently observed spec.<br />Used to detect spec changes and enforce immutability after profiling starts. | | | | `observedGeneration` _integer_ | ObservedGeneration reflects the generation of the most recently observed spec.<br />Used to detect spec changes and enforce immutability after profiling starts. | | |
| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions contains the latest observed conditions of the deployment request.<br />Standard condition types include: Validation, Profiling, SpecGenerated, DeploymentReady.<br />Conditions are merged by type on patch updates. | | | | `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions contains the latest observed conditions of the deployment request.<br />Standard condition types include: Validation, Profiling, SpecGenerated, DeploymentReady.<br />Conditions are merged by type on patch updates. | | |
| `profilingResults` _string_ | ProfilingResults contains a reference to the ConfigMap holding profiling data.<br />Format: "configmap/\<name\>" | | Optional: \{\} <br /> | | `profilingResults` _string_ | ProfilingResults contains a reference to the ConfigMap holding profiling data.<br />Format: "configmap/`<name>`" | | Optional: \{\} <br /> |
| `generatedDeployment` _[RawExtension](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#rawextension-runtime-pkg)_ | GeneratedDeployment contains the full generated DynamoGraphDeployment specification<br />including metadata, based on profiling results. Users can extract this to create<br />a DGD manually, or it's used automatically when autoApply is true.<br />Stored as RawExtension to preserve all fields including metadata.<br />For mocker backends, this contains the mocker DGD spec. | | EmbeddedResource: \{\} <br />Optional: \{\} <br /> | | `generatedDeployment` _[RawExtension](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#rawextension-runtime-pkg)_ | GeneratedDeployment contains the full generated DynamoGraphDeployment specification<br />including metadata, based on profiling results. Users can extract this to create<br />a DGD manually, or it's used automatically when autoApply is true.<br />Stored as RawExtension to preserve all fields including metadata.<br />For mocker backends, this contains the mocker DGD spec. | | EmbeddedResource: \{\} <br />Optional: \{\} <br /> |
| `deployment` _[DeploymentStatus](#deploymentstatus)_ | Deployment tracks the auto-created DGD when AutoApply is true.<br />Contains name, namespace, state, and creation status of the managed DGD. | | Optional: \{\} <br /> | | `deployment` _[DeploymentStatus](#deploymentstatus)_ | Deployment tracks the auto-created DGD when AutoApply is true.<br />Contains name, namespace, state, and creation status of the managed DGD. | | Optional: \{\} <br /> |
...@@ -384,9 +546,9 @@ _Appears in:_ ...@@ -384,9 +546,9 @@ _Appears in:_
| Field | Description | Default | Validation | | Field | Description | Default | Validation |
| --- | --- | --- | --- | | --- | --- | --- | --- |
| `replicas` _integer_ | Replicas is the current number of replicas for the target service.<br />This is synced from the DGD's service replicas and is required for the scale subresource. | | | | `replicas` _integer_ | Replicas is the current number of replicas for the target service.<br />This is synced from the DGD's service replicas and is required for the scale subresource. | | Optional: \{\} <br /> |
| `selector` _string_ | Selector is a label selector string for the pods managed by this adapter.<br />Required for HPA compatibility via the scale subresource. | | | | `selector` _string_ | Selector is a label selector string for the pods managed by this adapter.<br />Required for HPA compatibility via the scale subresource. | | Optional: \{\} <br /> |
| `lastScaleTime` _[Time](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#time-v1-meta)_ | LastScaleTime is the last time the adapter scaled the target service. | | | | `lastScaleTime` _[Time](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#time-v1-meta)_ | LastScaleTime is the last time the adapter scaled the target service. | | Optional: \{\} <br /> |
#### DynamoGraphDeploymentServiceRef #### DynamoGraphDeploymentServiceRef
...@@ -441,8 +603,9 @@ _Appears in:_ ...@@ -441,8 +603,9 @@ _Appears in:_
| --- | --- | --- | --- | | --- | --- | --- | --- |
| `state` _string_ | State is a high-level textual status of the graph deployment lifecycle. | | | | `state` _string_ | State is a high-level textual status of the graph deployment lifecycle. | | |
| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions contains the latest observed conditions of the graph deployment.<br />The slice is merged by type on patch updates. | | | | `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions contains the latest observed conditions of the graph deployment.<br />The slice is merged by type on patch updates. | | |
| `services` _object (keys:string, values:[ServiceReplicaStatus](#servicereplicastatus))_ | Services contains per-service replica status information.<br />The map key is the service name from spec.services. | | | | `services` _object (keys:string, values:[ServiceReplicaStatus](#servicereplicastatus))_ | Services contains per-service replica status information.<br />The map key is the service name from spec.services. | | Optional: \{\} <br /> |
| `restart` _[RestartStatus](#restartstatus)_ | Restart contains the status of the restart of the graph deployment. | | | | `restart` _[RestartStatus](#restartstatus)_ | Restart contains the status of the restart of the graph deployment. | | Optional: \{\} <br /> |
| `checkpoints` _object (keys:string, values:[ServiceCheckpointStatus](#servicecheckpointstatus))_ | Checkpoints contains per-service checkpoint status information.<br />The map key is the service name from spec.services. | | Optional: \{\} <br /> |
#### DynamoModel #### DynamoModel
...@@ -479,8 +642,8 @@ _Appears in:_ ...@@ -479,8 +642,8 @@ _Appears in:_
| --- | --- | --- | --- | | --- | --- | --- | --- |
| `modelName` _string_ | ModelName is the full model identifier (e.g., "meta-llama/Llama-3.3-70B-Instruct-lora") | | Required: \{\} <br /> | | `modelName` _string_ | ModelName is the full model identifier (e.g., "meta-llama/Llama-3.3-70B-Instruct-lora") | | Required: \{\} <br /> |
| `baseModelName` _string_ | BaseModelName is the base model identifier that matches the service label<br />This is used to discover endpoints via headless services | | Required: \{\} <br /> | | `baseModelName` _string_ | BaseModelName is the base model identifier that matches the service label<br />This is used to discover endpoints via headless services | | Required: \{\} <br /> |
| `modelType` _string_ | ModelType specifies the type of model (e.g., "base", "lora", "adapter") | base | Enum: [base lora adapter] <br /> | | `modelType` _string_ | ModelType specifies the type of model (e.g., "base", "lora", "adapter") | base | Enum: [base lora adapter] <br />Optional: \{\} <br /> |
| `source` _[ModelSource](#modelsource)_ | Source specifies the model source location (only applicable for lora model type) | | | | `source` _[ModelSource](#modelsource)_ | Source specifies the model source location (only applicable for lora model type) | | Optional: \{\} <br /> |
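
As a rough illustration of the DynamoModel spec fields in the table above, a base-model resource might look like the sketch below. The `apiVersion`, the resource name, and the model identifier are assumptions made for the example and are not confirmed by this reference. `source` is omitted because it only applies to the `lora` model type.

```yaml
# Minimal sketch, not a verified manifest.
apiVersion: nvidia.com/v1alpha1        # assumed API group/version
kind: DynamoModel
metadata:
  name: example-base-model             # hypothetical name
spec:
  modelName: Qwen/Qwen3-0.6B           # full model identifier (required)
  baseModelName: Qwen/Qwen3-0.6B       # must match the service label used for endpoint discovery (required)
  modelType: base                      # one of: base, lora, adapter (defaults to base)
```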
#### DynamoModelStatus #### DynamoModelStatus
...@@ -496,10 +659,29 @@ _Appears in:_ ...@@ -496,10 +659,29 @@ _Appears in:_
| Field | Description | Default | Validation | | Field | Description | Default | Validation |
| --- | --- | --- | --- | | --- | --- | --- | --- |
| `endpoints` _[EndpointInfo](#endpointinfo) array_ | Endpoints is the current list of all endpoints for this model | | | | `endpoints` _[EndpointInfo](#endpointinfo) array_ | Endpoints is the current list of all endpoints for this model | | Optional: \{\} <br /> |
| `readyEndpoints` _integer_ | ReadyEndpoints is the count of endpoints that are ready | | | | `readyEndpoints` _integer_ | ReadyEndpoints is the count of endpoints that are ready | | |
| `totalEndpoints` _integer_ | TotalEndpoints is the total count of endpoints | | | | `totalEndpoints` _integer_ | TotalEndpoints is the total count of endpoints | | |
| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions represents the latest available observations of the model's state | | | | `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions represents the latest available observations of the model's state | | Optional: \{\} <br /> |
#### EPPConfig
EPPConfig contains configuration for EPP (Endpoint Picker Plugin) components.
EPP is responsible for intelligent endpoint selection and KV-aware routing.
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `configMapRef` _[ConfigMapKeySelector](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#configmapkeyselector-v1-core)_ | ConfigMapRef references a user-provided ConfigMap containing EPP configuration.<br />The ConfigMap should contain EndpointPickerConfig YAML.<br />Mutually exclusive with Config. | | Optional: \{\} <br /> |
| `config` _[EndpointPickerConfig](#endpointpickerconfig)_ | Config allows specifying EPP EndpointPickerConfig directly as a structured object.<br />The operator will marshal this to YAML and create a ConfigMap automatically.<br />Mutually exclusive with ConfigMapRef.<br />One of ConfigMapRef or Config must be specified (no default configuration).<br />Uses the upstream type from github.com/kubernetes-sigs/gateway-api-inference-extension | | Type: object <br />Optional: \{\} <br /> |
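
To make the mutual exclusivity of `configMapRef` and `config` concrete, here is a hedged sketch of a service entry that points at a user-provided ConfigMap. The service name, the `componentType` field spelling, the ConfigMap name, and the key are all assumptions for illustration. Only the `name` and `key` fields of the standard Kubernetes ConfigMapKeySelector are relied on.

```yaml
# Illustrative fragment of a DynamoGraphDeployment spec; names are hypothetical.
services:
  Epp:
    componentType: epp                 # eppConfig only applies to EPP components (field spelling assumed)
    eppConfig:
      configMapRef:
        name: epp-config               # assumed ConfigMap containing EndpointPickerConfig YAML
        key: epp-config.yaml           # assumed key within that ConfigMap
      # config: {...}                  # alternative to configMapRef; do not set both
```

Because there is no default configuration, exactly one of the two fields has to be present.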
#### EndpointInfo #### EndpointInfo
...@@ -516,7 +698,7 @@ _Appears in:_ ...@@ -516,7 +698,7 @@ _Appears in:_
| Field | Description | Default | Validation | | Field | Description | Default | Validation |
| --- | --- | --- | --- | | --- | --- | --- | --- |
| `address` _string_ | Address is the full address of the endpoint (e.g., "http://10.0.1.5:9090") | | | | `address` _string_ | Address is the full address of the endpoint (e.g., "http://10.0.1.5:9090") | | |
| `podName` _string_ | PodName is the name of the pod serving this endpoint | | | | `podName` _string_ | PodName is the name of the pod serving this endpoint | | Optional: \{\} <br /> |
| `ready` _boolean_ | Ready indicates whether the endpoint is ready to serve traffic<br />For LoRA models: true if the POST /loras request succeeded with a 2xx status code<br />For base models: always false (no probing performed) | | | | `ready` _boolean_ | Ready indicates whether the endpoint is ready to serve traffic<br />For LoRA models: true if the POST /loras request succeeded with a 2xx status code<br />For base models: always false (no probing performed) | | |
...@@ -614,7 +796,7 @@ _Appears in:_ ...@@ -614,7 +796,7 @@ _Appears in:_
| Field | Description | Default | Validation | | Field | Description | Default | Validation |
| --- | --- | --- | --- | | --- | --- | --- | --- |
| `name` _string_ | Name is the base model identifier (e.g., "llama-3-70b-instruct-v1") | | Required: \{\} <br /> | | `name` _string_ | Name is the base model identifier (e.g., "llama-3-70b-instruct-v1") | | Required: \{\} <br /> |
| `revision` _string_ | Revision is the model revision/version (optional) | | | | `revision` _string_ | Revision is the model revision/version (optional) | | Optional: \{\} <br /> |
#### ModelSource #### ModelSource
...@@ -733,30 +915,47 @@ _Appears in:_ ...@@ -733,30 +915,47 @@ _Appears in:_
| `claims` _[ResourceClaim](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#resourceclaim-v1-core) array_ | Claims specifies resource claims for dynamic resource allocation | | | | `claims` _[ResourceClaim](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#resourceclaim-v1-core) array_ | Claims specifies resource claims for dynamic resource allocation | | |
#### ScalingAdapter #### Restart
ScalingAdapter configures whether a service uses the DynamoGraphDeploymentScalingAdapter
for replica management. When enabled, the DGDSA owns the replicas field and
external autoscalers (HPA, KEDA, Planner) can control scaling via the Scale subresource.
_Appears in:_ _Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec) - [DynamoGraphDeploymentSpec](#dynamographdeploymentspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation | | Field | Description | Default | Validation |
| --- | --- | --- | --- | | --- | --- | --- | --- |
| `enabled` _boolean_ | Enabled indicates whether the ScalingAdapter should be enabled for this service.<br />When true, a DGDSA is created and owns the replicas field.<br />When false (default), no DGDSA is created and replicas can be modified directly in the DGD. | false | | | `id` _string_ | ID is an arbitrary string that triggers a restart when changed.<br />Any modification to this value will initiate a restart of the graph deployment according to the strategy. | | MinLength: 1 <br />Required: \{\} <br /> |
| `strategy` _[RestartStrategy](#restartstrategy)_ | Strategy specifies the restart strategy for the graph deployment. | | Optional: \{\} <br /> |
#### ServiceReplicaStatus #### RestartPhase
_Underlying type:_ _string_
ServiceReplicaStatus contains replica information for a single service.
_Appears in:_
- [RestartStatus](#restartstatus)
| Field | Description |
| --- | --- |
| `Pending` | |
| `Restarting` | |
| `Completed` | |
| `Failed` | |
#### RestartStatus
RestartStatus contains the status of the restart of the graph deployment.
...@@ -765,15 +964,12 @@ _Appears in:_ ...@@ -765,15 +964,12 @@ _Appears in:_
| Field | Description | Default | Validation | | Field | Description | Default | Validation |
| --- | --- | --- | --- | | --- | --- | --- | --- |
| `componentKind` _[ComponentKind](#componentkind)_ | ComponentKind is the underlying resource kind (e.g., "PodClique", "PodCliqueScalingGroup", "Deployment", "LeaderWorkerSet"). | | Enum: [PodClique PodCliqueScalingGroup Deployment LeaderWorkerSet] <br /> | | `observedID` _string_ | ObservedID is the restart ID that has been observed and is being processed.<br />Matches the Restart.ID field in the spec. | | |
| `componentName` _string_ | ComponentName is the name of the underlying resource. | | | | `phase` _[RestartPhase](#restartphase)_ | Phase is the phase of the restart. | | |
| `replicas` _integer_ | Replicas is the total number of non-terminated replicas.<br />Required for all component kinds. | | Minimum: 0 <br /> | | `inProgress` _string array_ | InProgress contains the names of the services that are currently being restarted. | | Optional: \{\} <br /> |
| `updatedReplicas` _integer_ | UpdatedReplicas is the number of replicas at the current/desired revision.<br />Required for all component kinds. | | Minimum: 0 <br /> |
| `readyReplicas` _integer_ | ReadyReplicas is the number of ready replicas.<br />Populated for PodClique, Deployment, and LeaderWorkerSet.<br />Not available for PodCliqueScalingGroup.<br />When nil, the field is omitted from the API response. | | Minimum: 0 <br /> |
| `availableReplicas` _integer_ | AvailableReplicas is the number of available replicas.<br />For Deployment: replicas ready for >= minReadySeconds.<br />For PodCliqueScalingGroup: replicas where all constituent PodCliques have >= MinAvailable ready pods.<br />Not available for PodClique or LeaderWorkerSet.<br />When nil, the field is omitted from the API response. | | Minimum: 0 <br /> |
#### SharedMemorySpec #### RestartStrategy
...@@ -782,20 +978,38 @@ _Appears in:_ ...@@ -782,20 +978,38 @@ _Appears in:_
_Appears in:_ _Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec) - [Restart](#restart)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation | | Field | Description | Default | Validation |
| --- | --- | --- | --- | | --- | --- | --- | --- |
| `disabled` _boolean_ | | | | | `type` _[RestartStrategyType](#restartstrategytype)_ | Type specifies the restart strategy type. | Sequential | Enum: [Sequential Parallel] <br /> |
| `size` _[Quantity](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#quantity-resource-api)_ | | | | | `order` _string array_ | Order specifies the order in which the services should be restarted. | | Optional: \{\} <br /> |
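
The Restart and RestartStrategy fields can be combined as in the sketch below. It assumes the spec field is named `restart` (matching the `restart` status field) and uses hypothetical service names. Changing `id` to any new non-empty value is what triggers the restart.

```yaml
# Illustrative fragment of a DynamoGraphDeployment spec; service names are hypothetical.
spec:
  restart:
    id: "restart-001"                  # arbitrary string; any change initiates a restart
    strategy:
      type: Sequential                 # Sequential (default) or Parallel
      order:                           # optional restart order
        - Frontend
        - Worker
```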
#### VolumeMount #### RestartStrategyType
_Underlying type:_ _string_
VolumeMount references a PVC defined at the top level for volumes to be mounted by the component
_Appears in:_
- [RestartStrategy](#restartstrategy)
| Field | Description |
| --- | --- |
| `Sequential` | |
| `Parallel` | |
#### ScalingAdapter
ScalingAdapter configures whether a service uses the DynamoGraphDeploymentScalingAdapter
for replica management. When enabled, the DGDSA owns the replicas field and
external autoscalers (HPA, KEDA, Planner) can control scaling via the Scale subresource.
...@@ -805,49 +1019,52 @@ _Appears in:_ ...@@ -805,49 +1019,52 @@ _Appears in:_
| Field | Description | Default | Validation | | Field | Description | Default | Validation |
| --- | --- | --- | --- | | --- | --- | --- | --- |
| `name` _string_ | Name references a PVC name defined in the top-level PVCs map | | Required: \{\} <br /> | | `enabled` _boolean_ | Enabled indicates whether the ScalingAdapter should be enabled for this service.<br />When true, a DGDSA is created and owns the replicas field.<br />When false (default), no DGDSA is created and replicas can be modified directly in the DGD. | false | Optional: \{\} <br /> |
| `mountPoint` _string_ | MountPoint specifies where to mount the volume.<br />If useAsCompilationCache is true and mountPoint is not specified,<br />a backend-specific default will be used. | | |
| `useAsCompilationCache` _boolean_ | UseAsCompilationCache indicates this volume should be used as a compilation cache.<br />When true, backend-specific environment variables will be set and default mount points may be used. | false | |
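
As a small sketch of the ScalingAdapter `enabled` field, the fragment below turns the adapter on for one service. The service name is hypothetical. Once enabled, the DGDSA owns `replicas`, so the value shown here should be read as an initial setting rather than something to edit afterwards.

```yaml
# Illustrative fragment of a DynamoGraphDeployment spec; the service name is hypothetical.
services:
  Worker:
    replicas: 2                        # managed by the DGDSA while scalingAdapter is enabled
    scalingAdapter:
      enabled: true                    # default is false (replicas edited directly in the DGD)
```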
#### Restart #### ServiceCheckpointConfig
ServiceCheckpointConfig configures checkpointing for a DGD service
_Appears in:_ _Appears in:_
- [DynamoGraphDeploymentSpec](#dynamographdeploymentspec) - [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation | | Field | Description | Default | Validation |
| --- | --- | --- | --- | | --- | --- | --- | --- |
| `id` _string_ | ID is an arbitrary string that triggers a restart when changed.<br />Any modification to this value will initiate a restart of the graph deployment according to the strategy. | | MinLength: 1 <br />Required: \{\} <br /> | | `enabled` _boolean_ | Enabled indicates whether checkpointing is enabled for this service | false | Optional: \{\} <br /> |
| `strategy` _[RestartStrategy](#restartstrategy)_ | Strategy specifies the restart strategy for the graph deployment. | | Optional: \{\} <br /> | | `mode` _[CheckpointMode](#checkpointmode)_ | Mode defines how checkpoint creation is handled<br />- Auto: DGD controller creates Checkpoint CR automatically<br />- Manual: User must create Checkpoint CR | Auto | Enum: [Auto Manual] <br />Optional: \{\} <br /> |
| `checkpointRef` _string_ | CheckpointRef references an existing Checkpoint CR to use<br />If specified, Identity is ignored and this checkpoint is used directly | | Optional: \{\} <br /> |
| `identity` _[DynamoCheckpointIdentity](#dynamocheckpointidentity)_ | Identity defines the checkpoint identity for hash computation<br />Used when Mode is Auto or when looking up existing checkpoints<br />Required when checkpointRef is not specified | | Optional: \{\} <br /> |
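
The precedence between `checkpointRef` and `identity` described in the new table can be sketched as follows. This is only an illustration of the documented field rules, not the operator's actual code; the `ServiceCheckpointConfig` dataclass and the `checkpoint_source` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceCheckpointConfig:
    """Hypothetical Python mirror of the documented fields."""
    enabled: bool = False                 # documented default: false
    mode: str = "Auto"                    # documented enum: Auto, Manual
    checkpoint_ref: Optional[str] = None  # checkpointRef
    identity: Optional[dict] = None       # identity (shape not modelled here)


def checkpoint_source(cfg: ServiceCheckpointConfig) -> str:
    """Return which field determines the checkpoint, per the table above."""
    if not cfg.enabled:
        return "disabled"
    if cfg.mode not in ("Auto", "Manual"):
        raise ValueError(f"mode must be Auto or Manual, got {cfg.mode!r}")
    if cfg.checkpoint_ref:
        # Documented: when checkpointRef is set, identity is ignored.
        return "checkpointRef"
    if cfg.identity is None:
        # Documented: identity is required when checkpointRef is not specified.
        raise ValueError("identity is required when checkpointRef is not specified")
    return "identity"
```

The sketch treats a disabled config as a separate case before any validation, which is an assumption; the table does not say how the other fields are handled when `enabled` is false.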
#### RestartPhase #### ServiceCheckpointStatus
_Underlying type:_ _string_
ServiceCheckpointStatus contains checkpoint information for a single service.
_Appears in:_ _Appears in:_
- [RestartStatus](#restartstatus) - [DynamoGraphDeploymentStatus](#dynamographdeploymentstatus)
| Field | Description | | Field | Description | Default | Validation |
| --- | --- | | --- | --- | --- | --- |
| `Pending` | | | `checkpointName` _string_ | CheckpointName is the name of the associated Checkpoint CR | | Optional: \{\} <br /> |
| `Restarting` | | | `identityHash` _string_ | IdentityHash is the computed hash of the checkpoint identity | | Optional: \{\} <br /> |
| `Completed` | | | `ready` _boolean_ | Ready indicates if the checkpoint is ready for use | | Optional: \{\} <br /> |
| `Failed` | |
#### RestartStatus #### ServiceReplicaStatus
RestartStatus contains the status of the restart of the graph deployment. ServiceReplicaStatus contains replica information for a single service.
...@@ -856,40 +1073,49 @@ _Appears in:_ ...@@ -856,40 +1073,49 @@ _Appears in:_
| Field | Description | Default | Validation | | Field | Description | Default | Validation |
| --- | --- | --- | --- | | --- | --- | --- | --- |
| `observedID` _string_ | ObservedID is the restart ID that has been observed and is being processed.<br />Matches the Restart.ID field in the spec. | | | | `componentKind` _[ComponentKind](#componentkind)_ | ComponentKind is the underlying resource kind (e.g., "PodClique", "PodCliqueScalingGroup", "Deployment", "LeaderWorkerSet"). | | Enum: [PodClique PodCliqueScalingGroup Deployment LeaderWorkerSet] <br /> |
| `phase` _[RestartPhase](#restartphase)_ | Phase is the phase of the restart. | | | | `componentName` _string_ | ComponentName is the name of the underlying resource. | | |
| `inProgress` _string array_ | InProgress contains the names of the services that are currently being restarted. | | | | `replicas` _integer_ | Replicas is the total number of non-terminated replicas.<br />Required for all component kinds. | | Minimum: 0 <br /> |
| `updatedReplicas` _integer_ | UpdatedReplicas is the number of replicas at the current/desired revision.<br />Required for all component kinds. | | Minimum: 0 <br /> |
| `readyReplicas` _integer_ | ReadyReplicas is the number of ready replicas.<br />Populated for PodClique, Deployment, and LeaderWorkerSet.<br />Not available for PodCliqueScalingGroup.<br />When nil, the field is omitted from the API response. | | Minimum: 0 <br />Optional: \{\} <br /> |
| `availableReplicas` _integer_ | AvailableReplicas is the number of available replicas.<br />For Deployment: replicas ready for >= minReadySeconds.<br />For PodCliqueScalingGroup: replicas where all constituent PodCliques have >= MinAvailable ready pods.<br />Not available for PodClique or LeaderWorkerSet.<br />When nil, the field is omitted from the API response. | | Minimum: 0 <br />Optional: \{\} <br /> |
#### SharedMemorySpec
#### RestartStrategy
_Appears in:_ _Appears in:_
- [Restart](#restart) - [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation | | Field | Description | Default | Validation |
| --- | --- | --- | --- | | --- | --- | --- | --- |
| `type` _[RestartStrategyType](#restartstrategytype)_ | Type specifies the restart strategy type. | Sequential | Enum: [Sequential Parallel] <br /> | | `disabled` _boolean_ | | | |
| `order` _string array_ | Order specifies the order in which the services should be restarted. | | Optional: \{\} <br /> | | `size` _[Quantity](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#quantity-resource-api)_ | | | |
#### RestartStrategyType #### VolumeMount
_Underlying type:_ _string_
VolumeMount references a PVC defined at the top level for volumes to be mounted by the component
_Appears in:_ _Appears in:_
- [RestartStrategy](#restartstrategy) - [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | | Field | Description | Default | Validation |
| --- | --- | | --- | --- | --- | --- |
| `Sequential` | | | `name` _string_ | Name references a PVC name defined in the top-level PVCs map | | Required: \{\} <br /> |
| `Parallel` | | | `mountPoint` _string_ | MountPoint specifies where to mount the volume.<br />If useAsCompilationCache is true and mountPoint is not specified,<br />a backend-specific default will be used. | | |
| `useAsCompilationCache` _boolean_ | UseAsCompilationCache indicates this volume should be used as a compilation cache.<br />When true, backend-specific environment variables will be set and default mount points may be used. | false | |
# Operator Default Values Injection # Operator Default Values Injection
...@@ -1025,8 +1251,9 @@ Worker components receive the following probe configurations: ...@@ -1025,8 +1251,9 @@ Worker components receive the following probe configurations:
- **Timeout**: 5 seconds - **Timeout**: 5 seconds
- **Failure Threshold**: 720 (allows up to 2 hours for startup: 10s × 720 = 7200s) - **Failure Threshold**: 720 (allows up to 2 hours for startup: 10s × 720 = 7200s)
> [!NOTE] :::{note}
> **For larger models (typically >70B parameters) or slower storage systems, you may need to increase the `failureThreshold` to allow more time for model loading. Calculate the required threshold based on your expected startup time: `failureThreshold = (expected_startup_seconds / period)`. Override the startup probe in your component specification if the default 2-hour window is insufficient.** For larger models (typically >70B parameters) or slower storage systems, you may need to increase the `failureThreshold` to allow more time for model loading. Calculate the required threshold based on your expected startup time: `failureThreshold = (expected_startup_seconds / period)`. Override the startup probe in your component specification if the default 2-hour window is insufficient.
:::
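
As a rough illustration of the formula in the note, the threshold can be computed from an expected startup time and the 10-second period implied by the defaults above. The function name and the choice to round up are assumptions made for this sketch, not part of the operator.

```python
import math


def failure_threshold(expected_startup_seconds: float, period_seconds: int = 10) -> int:
    """Smallest failureThreshold whose probe window covers the expected startup time.

    Rounding up is an assumption: the note only gives the plain division.
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return math.ceil(expected_startup_seconds / period_seconds)


# The default 2-hour window corresponds to the documented threshold of 720.
default_threshold = failure_threshold(2 * 60 * 60)

# A model expected to need 3 hours to load would need a larger threshold.
three_hour_threshold = failure_threshold(3 * 60 * 60)
```

With a 10-second period, a 3-hour startup works out to a threshold of 1080, so the default of 720 would need to be overridden in the component specification.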
### Multinode Deployment Probe Modifications ### Multinode Deployment Probe Modifications
......
...@@ -3,8 +3,6 @@ ...@@ -3,8 +3,6 @@
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
--- ---
# Autoscaling
This guide explains how to configure autoscaling for DynamoGraphDeployment (DGD) services using the `sglang-agg` example from `examples/backends/sglang/deploy/agg.yaml`. This guide explains how to configure autoscaling for DynamoGraphDeployment (DGD) services using the `sglang-agg` example from `examples/backends/sglang/deploy/agg.yaml`.
## Example DGD ## Example DGD
...@@ -51,8 +49,7 @@ Dynamo provides flexible autoscaling through the `DynamoGraphDeploymentScalingAd ...@@ -51,8 +49,7 @@ Dynamo provides flexible autoscaling through the `DynamoGraphDeploymentScalingAd
| **Dynamo Planner** | LLM-aware autoscaling with SLA optimization | Production LLM workloads | | **Dynamo Planner** | LLM-aware autoscaling with SLA optimization | Production LLM workloads |
| **Custom Controllers** | Any scale-subresource-compatible controller | Custom requirements | | **Custom Controllers** | Any scale-subresource-compatible controller | Custom requirements |
> [!WARNING] > **⚠️ Deprecation Notice**: The `spec.services[X].autoscaling` field in DGD is **deprecated and ignored**. Use DGDSA with HPA, KEDA, or Planner instead. If you have existing DGDs with `autoscaling` configured, you'll see a warning. Remove the field to silence the warning.
> **Deprecation Notice:** The `spec.services[X].autoscaling` field in DGD is **deprecated and ignored**. Use DGDSA with HPA, KEDA, or Planner instead. If you have existing DGDs with `autoscaling` configured, you'll see a warning. Remove the field to silence the warning.
## Architecture ## Architecture
...@@ -158,7 +155,7 @@ The Dynamo Planner is an LLM-aware autoscaler that optimizes scaling decisions b ...@@ -158,7 +155,7 @@ The Dynamo Planner is an LLM-aware autoscaler that optimizes scaling decisions b
**When to use Planner:** **When to use Planner:**
- You want LLM-optimized autoscaling out of the box - You want LLM-optimized autoscaling out of the box
- You need coordinated scaling across prefill/decode services - You need coordinated scaling across prefill/decode services
- You want SLA-driven scaling (e.g., target TTFT < 500ms) - You want SLA-driven scaling (e.g., target TTFT \< 500ms)
**How Planner works:** **How Planner works:**
...@@ -169,14 +166,14 @@ Planner is deployed as a service component within your DGD. It: ...@@ -169,14 +166,14 @@ Planner is deployed as a service component within your DGD. It:
**Deployment:** **Deployment:**
The recommended way to deploy Planner is via `DynamoGraphDeploymentRequest` (DGDR). See the [SLA Planner Quick Start](../planner/sla-planner-quickstart.md) for complete instructions. The recommended way to deploy Planner is via `DynamoGraphDeploymentRequest` (DGDR). See the [SLA Planner Quick Start](../components/planner/planner-guide.md) for complete instructions.
Example configurations with Planner: Example configurations with Planner:
- `examples/backends/vllm/deploy/disagg_planner.yaml` - `examples/backends/vllm/deploy/disagg_planner.yaml`
- `examples/backends/sglang/deploy/disagg_planner.yaml` - `examples/backends/sglang/deploy/disagg_planner.yaml`
- `examples/backends/trtllm/deploy/disagg_planner.yaml` - `examples/backends/trtllm/deploy/disagg_planner.yaml`
For more details, see the [SLA Planner documentation](../planner/sla-planner.md). For more details, see the [SLA Planner documentation](../components/planner/planner-guide.md).
## Autoscaling with Kubernetes HPA ## Autoscaling with Kubernetes HPA
...@@ -187,7 +184,9 @@ The Horizontal Pod Autoscaler (HPA) is Kubernetes' native autoscaling solution. ...@@ -187,7 +184,9 @@ The Horizontal Pod Autoscaler (HPA) is Kubernetes' native autoscaling solution.
- You want to use standard Kubernetes tooling - You want to use standard Kubernetes tooling
- You need CPU or memory-based scaling - You need CPU or memory-based scaling
> **Note**: For custom metrics (like TTFT or queue depth), consider using [KEDA](#autoscaling-with-keda-recommended) instead - it's simpler to configure. <Note>
For custom metrics (like TTFT or queue depth), consider using [KEDA](#autoscaling-with-keda-recommended) instead - it's simpler to configure.
</Note>
### Basic HPA (CPU-based) ### Basic HPA (CPU-based)
...@@ -243,7 +242,9 @@ Dynamo metrics include these labels for filtering: ...@@ -243,7 +242,9 @@ Dynamo metrics include these labels for filtering:
| `dynamo_namespace` | Unique DGD identifier (`{k8s-namespace}-{dynamoNamespace}`) | `default-sglang-agg` | | `dynamo_namespace` | Unique DGD identifier (`{k8s-namespace}-{dynamoNamespace}`) | `default-sglang-agg` |
| `model` | Model being served | `Qwen/Qwen3-0.6B` | | `model` | Model being served | `Qwen/Qwen3-0.6B` |
> **Note**: When you have multiple DGDs in the same namespace, use `dynamo_namespace` to filter metrics for a specific DGD. <Note>
When you have multiple DGDs in the same namespace, use `dynamo_namespace` to filter metrics for a specific DGD.
</Note>
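
A minimal sketch of how the `dynamo_namespace` label value is composed, following the `{k8s-namespace}-{dynamoNamespace}` pattern from the table. Both helper names are hypothetical, and the selector is a generic Prometheus label matcher rather than a query taken from this guide.

```python
def dynamo_namespace_label(k8s_namespace: str, dynamo_namespace: str) -> str:
    """Compose the label value as {k8s-namespace}-{dynamoNamespace}."""
    return f"{k8s_namespace}-{dynamo_namespace}"


def label_selector(k8s_namespace: str, dynamo_namespace: str) -> str:
    """Build a Prometheus label matcher that filters metrics to one DGD."""
    value = dynamo_namespace_label(k8s_namespace, dynamo_namespace)
    return f'{{dynamo_namespace="{value}"}}'


# Matches the example value shown in the table for the sglang-agg DGD.
selector = label_selector("default", "sglang-agg")
```

The resulting matcher can be appended to whichever Dynamo metric name is being queried.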
#### Example: Scale Decode Service Based on TTFT #### Example: Scale Decode Service Based on TTFT
...@@ -416,7 +417,9 @@ helm install keda kedacore/keda \ ...@@ -416,7 +417,9 @@ helm install keda kedacore/keda \
kubectl get pods -n keda kubectl get pods -n keda
``` ```
> **Note**: If you have Prometheus Adapter installed, either uninstall it first (`helm uninstall prometheus-adapter -n monitoring`) or install KEDA with `--set metricsServer.enabled=false` to avoid API conflicts. <Note>
If you have Prometheus Adapter installed, either uninstall it first (`helm uninstall prometheus-adapter -n monitoring`) or install KEDA with `--set metricsServer.enabled=false` to avoid API conflicts.
</Note>
### Example: Scale Decode Based on TTFT ### Example: Scale Decode Based on TTFT
...@@ -607,7 +610,9 @@ kubectl get dgdsa sglang-agg-decode -n default ...@@ -607,7 +610,9 @@ kubectl get dgdsa sglang-agg-decode -n default
# sglang-agg-decode sglang-agg decode 3 10m # sglang-agg-decode sglang-agg decode 3 10m
``` ```
> **Note**: If an autoscaler (KEDA, HPA, Planner) is managing the adapter, your change will be overwritten on the next evaluation cycle. <Note>
If an autoscaler (KEDA, HPA, Planner) is managing the adapter, your change will be overwritten on the next evaluation cycle.
</Note>
### With DGDSA Disabled ### With DGDSA Disabled
...@@ -731,7 +736,7 @@ If you see unstable scaling: ...@@ -731,7 +736,7 @@ If you see unstable scaling:
- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) - [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
- [KEDA Documentation](https://keda.sh/) - [KEDA Documentation](https://keda.sh/)
- [Prometheus Adapter](https://github.com/kubernetes-sigs/prometheus-adapter) - [Prometheus Adapter](https://github.com/kubernetes-sigs/prometheus-adapter)
- [Planner Documentation](../planner/sla-planner.md) - [Planner Documentation](../components/planner/planner-guide.md)
- [Dynamo Metrics Reference](../observability/metrics.md) - [Dynamo Metrics Reference](../observability/metrics.md)
- [Prometheus and Grafana Setup](../observability/prometheus-grafana.md) - [Prometheus and Grafana Setup](../observability/prometheus-grafana.md)
# ChReK: Checkpoint/Restore in Kubernetes ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. See [Limitations](#limitations) for details. > ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. See [Limitations](#limitations) for details.
...@@ -115,7 +118,7 @@ ChReK is best suited for: ...@@ -115,7 +118,7 @@ ChReK is best suited for:
### Getting Started ### Getting Started
- [Dynamo Integration Guide](dynamo.md) - Using ChReK with Dynamo Platform - [Dynamo Integration Guide](dynamo.md) - Using ChReK with Dynamo Platform
- [Standalone Usage Guide](standalone.md) - Using ChReK independently - [Standalone Usage Guide](standalone.md) - Using ChReK independently
- ChReK Helm Chart README - See `deploy/helm/charts/chrek/README.md` in the repository for Helm chart configuration - [ChReK Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/chrek/README.md) - Helm chart configuration
### Related Documentation ### Related Documentation
- [CRIU Documentation](https://criu.org/Main_Page) - Upstream CRIU docs - [CRIU Documentation](https://criu.org/Main_Page) - Upstream CRIU docs
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
--- ---
# Checkpoint/Restore for Fast Pod Startup # Checkpoint/Restore for Fast Pod Startup
...@@ -517,7 +505,7 @@ spec: ...@@ -517,7 +505,7 @@ spec:
- [ChReK Overview](README.md) - ChReK architecture and use cases - [ChReK Overview](README.md) - ChReK architecture and use cases
- [ChReK Standalone Usage Guide](standalone.md) - Use ChReK without Dynamo Platform - [ChReK Standalone Usage Guide](standalone.md) - Use ChReK without Dynamo Platform
- ChReK Helm Chart README - See `deploy/helm/charts/chrek/README.md` in the repository for chart configuration - [ChReK Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/chrek/README.md) - Chart configuration
- [Installation Guide](../installation-guide.md) - Platform installation - [Installation Guide](../installation-guide.md) - Platform installation
- [API Reference](../api-reference.md) - Complete CRD specifications - [API Reference](../api-reference.md) - Complete CRD specifications
# ChReK Standalone Usage Guide ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the [security implications](#security-considerations) before deploying. > ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the [security implications](#security-considerations) before deploying.
......
...@@ -3,8 +3,6 @@ ...@@ -3,8 +3,6 @@
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
--- ---
# Creating Kubernetes Deployments
The scripts in the `examples/<backend>/launch` folder like [agg.sh](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch/agg.sh) demonstrate how you can serve your models locally. The scripts in the `examples/<backend>/launch` folder like [agg.sh](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch/agg.sh) demonstrate how you can serve your models locally.
The corresponding YAML files like [agg.yaml](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg.yaml) show you how you could create a Kubernetes deployment for your inference graph. The corresponding YAML files like [agg.yaml](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg.yaml) show you how you could create a Kubernetes deployment for your inference graph.
...@@ -261,4 +259,4 @@ spec: ...@@ -261,4 +259,4 @@ spec:
``` ```
**For complete details on managing models and LoRA adapters, see:** **For complete details on managing models and LoRA adapters, see:**
📖 **[Managing Models with DynamoModel Guide](dynamomodel-guide.md)** 📖 **[Managing Models with DynamoModel Guide](./dynamomodel-guide.md)**
...@@ -3,8 +3,6 @@ ...@@ -3,8 +3,6 @@
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
--- ---
# Managing Models with DynamoModel
## Overview ## Overview
`DynamoModel` is a Kubernetes Custom Resource that represents a machine learning model deployed on Dynamo. It enables you to: `DynamoModel` is a Kubernetes Custom Resource that represents a machine learning model deployed on Dynamo. It enables you to:
......
...@@ -3,8 +3,6 @@ ...@@ -3,8 +3,6 @@
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
--- ---
# Multinode Deployment Guide
This guide explains how to deploy Dynamo workloads across multiple nodes. Multinode deployments enable you to scale compute-intensive LLM workloads across multiple physical machines, maximizing GPU utilization and supporting larger models. This guide explains how to deploy Dynamo workloads across multiple nodes. Multinode deployments enable you to scale compute-intensive LLM workloads across multiple physical machines, maximizing GPU utilization and supporting larger models.
## Overview ## Overview
...@@ -211,7 +209,9 @@ The operator uses Ray for multi-node tensor/pipeline parallel deployments. Ray p ...@@ -211,7 +209,9 @@ The operator uses Ray for multi-node tensor/pipeline parallel deployments. Ray p
- **Behavior**: Joins Ray cluster and blocks; vLLM on leader spawns Ray actors to these workers - **Behavior**: Joins Ray cluster and blocks; vLLM on leader spawns Ray actors to these workers
- **Probes**: All probes (liveness, readiness, startup) are automatically removed - **Probes**: All probes (liveness, readiness, startup) are automatically removed
> **Note**: vLLM's Ray executor automatically creates a placement group and spawns workers across the cluster. The `--nnodes` flag is NOT used with Ray - it's only compatible with the `mp` backend. <Note>
vLLM's Ray executor automatically creates a placement group and spawns workers across the cluster. The `--nnodes` flag is NOT used with Ray - it's only compatible with the `mp` backend.
</Note>
**2. Data Parallel Mode (Multiple model instances across nodes)** **2. Data Parallel Mode (Multiple model instances across nodes)**
- **When used**: When `world_size × data_parallel_size > GPUs_per_node` - **When used**: When `world_size × data_parallel_size > GPUs_per_node`
...@@ -306,8 +306,8 @@ To enable compilation cache, add a volume mount with `useAsCompilationCache: tru ...@@ -306,8 +306,8 @@ To enable compilation cache, add a volume mount with `useAsCompilationCache: tru
For additional support and examples, see the working multinode configurations in: For additional support and examples, see the working multinode configurations in:
- **SGLang**: [examples/backends/sglang/deploy/](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/) - **SGLang**: [examples/backends/sglang/deploy/](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)
- **TensorRT-LLM**: [examples/backends/trtllm/deploy/](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/) - **TensorRT-LLM**: [examples/backends/trtllm/deploy/](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)
- **vLLM**: [examples/backends/vllm/deploy/](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/) - **vLLM**: [examples/backends/vllm/deploy/](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)
These examples demonstrate proper usage of the `multinode` section with corresponding `gpu` limits and correct `tp-size` configuration. These examples demonstrate proper usage of the `multinode` section with corresponding `gpu` limits and correct `tp-size` configuration.
...@@ -3,8 +3,6 @@ ...@@ -3,8 +3,6 @@
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
--- ---
# Working with Dynamo Kubernetes Operator
## Overview ## Overview
Dynamo operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It automates the reconciliation of custom resources to ensure your desired state is always achieved. This operator is ideal for users who want to manage complex deployments using declarative YAML definitions and Kubernetes-native tooling. Dynamo operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It automates the reconciliation of custom resources to ensure your desired state is always achieved. This operator is ideal for users who want to manage complex deployments using declarative YAML definitions and Kubernetes-native tooling.
...@@ -91,6 +89,7 @@ helm install dynamo-test dynamo-platform-${RELEASE_VERSION}.tgz \ ...@@ -91,6 +89,7 @@ helm install dynamo-test dynamo-platform-${RELEASE_VERSION}.tgz \
--create-namespace \ --create-namespace \
--set dynamo-operator.namespaceRestriction.enabled=true \ --set dynamo-operator.namespaceRestriction.enabled=true \
--set dynamo-operator.controllerManager.manager.image.tag=v2.0.0-beta --set dynamo-operator.controllerManager.manager.image.tag=v2.0.0-beta
```
**Observability:** **Observability:**
...@@ -114,11 +113,11 @@ Dynamo provides the following Custom Resources: ...@@ -114,11 +113,11 @@ Dynamo provides the following Custom Resources:
For the complete technical API reference for Dynamo Custom Resource Definitions, see: For the complete technical API reference for Dynamo Custom Resource Definitions, see:
**📖 [Dynamo CRD API Reference](api-reference.md)** **📖 [Dynamo CRD API Reference](./api-reference.md)**
For a user-focused guide on deploying and managing models with DynamoModel, see: For a user-focused guide on deploying and managing models with DynamoModel, see:
**📖 [Managing Models with DynamoModel Guide](deployment/dynamomodel-guide.md)** **📖 [Managing Models with DynamoModel Guide](./deployment/dynamomodel-guide.md)**
## Webhooks ## Webhooks
...@@ -133,7 +132,7 @@ The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validat ...@@ -133,7 +132,7 @@ The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validat
For complete documentation on webhooks, certificate management, and troubleshooting, see: For complete documentation on webhooks, certificate management, and troubleshooting, see:
**[Webhooks Guide](webhooks.md)** **📖 [Webhooks Guide](./webhooks.md)**
## Observability ## Observability
...@@ -159,7 +158,7 @@ A pre-built Grafana dashboard is available for visualizing operator metrics. The ...@@ -159,7 +158,7 @@ A pre-built Grafana dashboard is available for visualizing operator metrics. The
For complete setup instructions and metrics reference, see: For complete setup instructions and metrics reference, see:
**[Operator Metrics Guide](observability/operator-metrics.md)** **📖 [Operator Metrics Guide](./observability/operator-metrics.md)**
## Installation ## Installation
...@@ -200,10 +199,12 @@ helm install dynamo-platform ./platform/ \ ...@@ -200,10 +199,12 @@ helm install dynamo-platform ./platform/ \
--namespace ${NAMESPACE} \ --namespace ${NAMESPACE} \
--create-namespace \ --create-namespace \
--set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \ --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
--set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
--set etcd.enabled=false \
--set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret
``` ```
For detailed installation options, see the [Installation Guide](installation-guide.md) For detailed installation options, see the [Installation Guide](./installation-guide.md)
## Development ## Development
......