Unverified commit 8bd37c96, authored by Anant Sharma, committed by GitHub

refactor: move backend deploy, launch and slurm files from components to examples (#3849)


Signed-off-by: Anant Sharma <anants@nvidia.com>
parent 78359046
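
A reviewer who wants to confirm that no references to the old location remain after this move could use a small scan like the one below. It is only a sketch: the `components/backends/` prefix is taken from this commit's diff, while the function name `find_stale_refs`, the `docs` root, and the list of file suffixes are assumptions made for illustration and are not part of the repository.

```python
import re
from pathlib import Path

# Old path prefix that this commit replaces with examples/backends/.
STALE = re.compile(r"components/backends/")


def find_stale_refs(root, suffixes=(".md", ".yaml", ".sh")):
    """Return (path, line_number, line) for each line still using the old prefix.

    `root` and `suffixes` are illustrative defaults, not project conventions.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(errors="ignore")
        for number, line in enumerate(text.splitlines(), 1):
            if STALE.search(line):
                hits.append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical invocation from the repository root.
    for path, number, line in find_stale_refs("docs"):
        print(f"{path}:{number}: {line}")
```

Paths such as `components/src/dynamo`, which this commit introduces deliberately, do not match the pattern, so the scan flags only references to the old `components/backends/` prefix.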
......@@ -21,7 +21,7 @@ To enable it build the dynamo container with the `--tensorrtllm-commit` flag, fo
## How to use
```bash
cd $DYNAMO_HOME/components/backends/trtllm
cd $DYNAMO_HOME/examples/backends/trtllm
# Launch 3-worker EPD flow with NIXL
./launch/epd_disagg.sh
......
......@@ -48,7 +48,7 @@ For simplicity of the example, we will make some assumptions about your slurm cl
If your cluster supports similar container based plugins, you may be able to
modify the script to use that instead.
3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as
described [here](https://github.com/ai-dynamo/dynamo/tree/main/components/backends/trtllm#build-docker).
described [here](https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container).
This is the image that can be set to the `IMAGE` environment variable in later steps.
4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
will allocate 8 nodes below as a reference command to have enough capacity
......@@ -87,7 +87,7 @@ following environment variables based:
```bash
# NOTE: IMAGE must be set manually for now
# To build an iamge, see the steps here:
# https://github.com/ai-dynamo/dynamo/tree/main/components/backends/trtllm#build-docker
# https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container
export IMAGE="<dynamo_trtllm_image>"
# MOUNTS are the host:container path pairs that are mounted into the containers
......
......@@ -52,7 +52,7 @@ following environment variables based:
```bash
# NOTE: IMAGE must be set manually for now
# To build an iamge, see the steps here:
# https://github.com/ai-dynamo/dynamo/tree/main/components/backends/trtllm#build-docker
# https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container
export IMAGE="<dynamo_trtllm_image>"
# MOUNTS are the host:container path pairs that are mounted into the containers
......
......@@ -43,7 +43,7 @@ For advanced configurations, LMCache supports multiple [storage backends](https:
Use the provided launch script for quick setup:
```bash
./components/backends/vllm/launch/agg_lmcache.sh
./examples/backends/vllm/launch/agg_lmcache.sh
```
This will:
......@@ -69,7 +69,7 @@ The same `ENABLE_LMCACHE=1` environment variable enables LMCache, but the system
Use the provided disaggregated launch script(the script requires at least 2 GPUs):
```bash
./components/backends/vllm/launch/disagg_lmcache.sh
./examples/backends/vllm/launch/disagg_lmcache.sh
```
This will:
......
......@@ -106,7 +106,7 @@ Note: The above architecture illustrates all the components. The final component
```bash
# requires one gpu
cd components/backends/vllm
cd examples/backends/vllm
bash launch/agg.sh
```
......@@ -114,7 +114,7 @@ bash launch/agg.sh
```bash
# requires two gpus
cd components/backends/vllm
cd examples/backends/vllm
bash launch/agg_router.sh
```
......@@ -122,7 +122,7 @@ bash launch/agg_router.sh
```bash
# requires two gpus
cd components/backends/vllm
cd examples/backends/vllm
bash launch/disagg.sh
```
......@@ -130,7 +130,7 @@ bash launch/disagg.sh
```bash
# requires three gpus
cd components/backends/vllm
cd examples/backends/vllm
bash launch/disagg_router.sh
```
......@@ -140,7 +140,7 @@ This example is not meant to be performant but showcases Dynamo routing to data
```bash
# requires four gpus
cd components/backends/vllm
cd examples/backends/vllm
bash launch/dep.sh
```
......@@ -153,7 +153,7 @@ Below we provide a selected list of advanced deployments. Please open up an issu
### Kubernetes Deployment
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](../../../components/backends/vllm/deploy/README.md)
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](../../../examples/backends/vllm/deploy/README.md)
## Configuration
......
......@@ -100,7 +100,7 @@ Follow these steps to benchmark Dynamo deployments using client-side benchmarkin
Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Cloud platform. First follow the [installation guide](/docs/kubernetes/installation_guide.md) to install Dynamo Cloud, then use [deploy/utils/README](../../deploy/utils/README.md) to set up benchmarking resources.
### Step 2: Deploy DynamoGraphDeployments
Deploy your DynamoGraphDeployments separately using the [deployment documentation](../../components/backends/). Each deployment should have a frontend service exposed.
Deploy your DynamoGraphDeployments separately using the [deployment documentation](../../examples/backends/). Each deployment should have a frontend service exposed.
### Step 3: Port-Forward and Benchmark Deployment A
```bash
......@@ -332,7 +332,7 @@ The server-side benchmarking solution:
## Quick Start
### Step 1: Deploy Your DynamoGraphDeployment
Deploy your DynamoGraphDeployment using the [deployment documentation](../../components/backends/). Ensure it has a frontend service exposed.
Deploy your DynamoGraphDeployment using the [deployment documentation](../../examples/backends/). Ensure it has a frontend service exposed.
### Step 2: Deploy and Run Benchmark Job
......
......@@ -163,7 +163,7 @@ spec:
- gpu-h200-sxm # Adjust to your GPU node type
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
workingDir: /workspace/components/backends/vllm
workingDir: /workspace/examples/backends/vllm
command:
- /bin/sh
- -c
......@@ -234,7 +234,7 @@ spec:
- gpu-h200-sxm # Adjust to your GPU node type
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
workingDir: /workspace/components/backends/vllm
workingDir: /workspace/examples/backends/vllm
command:
- /bin/sh
- -c
......
......@@ -28,7 +28,7 @@ Dynamo's `DistributedRuntime` is the core infrastructure in the framework that e
While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice, each dynamo components typically are deployed with its own process and thus has its own `DistributedRuntime` object. However, they share the same namespace to discover each other.
For example, a typical deployment configuration (like `components/backends/vllm/deploy/agg.yaml` or `components/backends/sglang/deploy/agg.yaml`) has multiple workers:
For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple workers:
- `Frontend`: Starts an HTTP server and handles incoming requests. The HTTP server routes all requests to the `Processor`.
- `Processor`: When a new request arrives, `Processor` applies the chat template and performs the tokenization.
......@@ -75,6 +75,6 @@ After selecting which endpoint to hit, the `Client` sends the serialized request
We provide native rust and python (through binding) examples for basic usage of `DistributedRuntime`:
- Rust: `/lib/runtime/examples/`
- Python: We also provide complete examples of using `DistributedRuntime`. Please refer to the engines in `/components/backends` for full implementation details.
- Python: We also provide complete examples of using `DistributedRuntime`. Please refer to the engines in `components/src/dynamo` for full implementation details.
......@@ -17,7 +17,7 @@ limitations under the License.
# Dynamo Architecture Flow
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [components/backends/vllm](../../components/backends/vllm). Color-coded flows indicate different types of operations:
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](../../examples/backends/vllm). Color-coded flows indicate different types of operations:
## 🔵 Main Request Flow (Blue)
The primary user journey through the system:
......
......@@ -77,7 +77,7 @@ The `model_type` can be:
- `migration_limit`: Maximum number of times a request may be [migrated to another Instance](../fault_tolerance/request_migration.md). Defaults to 0.
- `user_data`: Optional dictionary containing custom metadata for worker behavior (e.g., LoRA configuration). Defaults to None.
See `components/backends` for full code examples.
See `examples/backends` for full code examples.
## Component names
......
......@@ -67,9 +67,9 @@ Each backend has deployment examples and configuration options:
| Backend | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node |
|--------------|:----------:|:-------------------:|:-------------:|:----------------------:|:-----------------------:|:------------------------:|
| **[SGLang](../../components/backends/sglang/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **[TensorRT-LLM](../../components/backends/trtllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ |
| **[vLLM](../../components/backends/vllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **[SGLang](../../examples/backends/sglang/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **[TensorRT-LLM](../../examples/backends/trtllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ |
| **[vLLM](../../examples/backends/vllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
## 3. Deploy Your First Model
......@@ -84,7 +84,7 @@ kubectl create secret generic hf-token-secret \
-n ${NAMESPACE};
# Deploy any example (this uses vLLM with Qwen model using aggregated serving)
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
......
# Creating Kubernetes Deployments
The scripts in the `components/<backend>/launch` folder like [agg.sh](../../../components/backends/vllm/launch/agg.sh) demonstrate how you can serve your models locally.
The corresponding YAML files like [agg.yaml](../../../components/backends/vllm/deploy/agg.yaml) show you how you could create a Kubernetes deployment for your inference graph.
The scripts in the `examples/<backend>/launch` folder like [agg.sh](../../../examples/backends/vllm/launch/agg.sh) demonstrate how you can serve your models locally.
The corresponding YAML files like [agg.yaml](../../../examples/backends/vllm/deploy/agg.yaml) show you how you could create a Kubernetes deployment for your inference graph.
This guide explains how to create your own deployment files.
......@@ -25,7 +25,7 @@ Before choosing a template, understand the different architecture patterns:
- GPU utilization may not be optimal (prefill and decode compete for resources)
- Lower throughput ceiling compared to disaggregated
**Example**: [`agg.yaml`](../../../components/backends/vllm/deploy/agg.yaml)
**Example**: [`agg.yaml`](../../../examples/backends/vllm/deploy/agg.yaml)
### Aggregated + Router (agg_router.yaml)
......@@ -42,7 +42,7 @@ Before choosing a template, understand the different architecture patterns:
- Still has GPU underutilization issues of aggregated serving
- More complex than plain aggregated but simpler than disaggregated
**Example**: [`agg_router.yaml`](../../../components/backends/vllm/deploy/agg_router.yaml)
**Example**: [`agg_router.yaml`](../../../examples/backends/vllm/deploy/agg_router.yaml)
### Disaggregated Serving (disagg_router.yaml)
......@@ -61,7 +61,7 @@ Before choosing a template, understand the different architecture patterns:
- More complex setup and debugging
- Requires understanding of prefill/decode separation
**Example**: [`disagg_router.yaml`](../../../components/backends/vllm/deploy/disagg_router.yaml)
**Example**: [`disagg_router.yaml`](../../../examples/backends/vllm/deploy/disagg_router.yaml)
### Quick Selection Guide
......@@ -69,11 +69,11 @@ Select the architecture pattern as your template that best fits your use case.
For example, when using the `vLLM` backend:
- **Development / Testing**: Use [`agg.yaml`](../../../components/backends/vllm/deploy/agg.yaml) as the base configuration.
- **Development / Testing**: Use [`agg.yaml`](../../../examples/backends/vllm/deploy/agg.yaml) as the base configuration.
- **Production with Load Balancing**: Use [`agg_router.yaml`](../../../components/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.
- **Production with Load Balancing**: Use [`agg_router.yaml`](../../../examples/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.
- **High Performance / Disaggregated Deployment**: Use [`disagg_router.yaml`](../../../components/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.
- **High Performance / Disaggregated Deployment**: Use [`disagg_router.yaml`](../../../examples/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.
## Step 2: Customize the Template
......
......@@ -281,8 +281,8 @@ To enable compilation cache, add a volume mount with `useAsCompilationCache: tru
For additional support and examples, see the working multinode configurations in:
- **SGLang**: [components/backends/sglang/deploy/](../../../components/backends/sglang/deploy/)
- **TensorRT-LLM**: [components/backends/trtllm/deploy/](../../../components/backends/trtllm/deploy/)
- **vLLM**: [components/backends/vllm/deploy/](../../../components/backends/vllm/deploy/)
- **SGLang**: [examples/backends/sglang/deploy/](../../../examples/backends/sglang/deploy/)
- **TensorRT-LLM**: [examples/backends/trtllm/deploy/](../../../examples/backends/trtllm/deploy/)
- **vLLM**: [examples/backends/vllm/deploy/](../../../examples/backends/vllm/deploy/)
These examples demonstrate proper usage of the `multinode` section with corresponding `gpu` limits and correct `tp-size` configuration.
......@@ -243,7 +243,7 @@ kubectl get pods -n ${NAMESPACE}
1. **Deploy Model/Workflow**
```bash
# Example: Deploy a vLLM workflow with Qwen3-0.6B using aggregated serving
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
# Port forward and test
kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
......@@ -251,9 +251,9 @@ kubectl get pods -n ${NAMESPACE}
```
2. **Explore Backend Guides**
- [vLLM Deployments](../../components/backends/vllm/deploy/README.md)
- [SGLang Deployments](../../components/backends/sglang/deploy/README.md)
- [TensorRT-LLM Deployments](../../components/backends/trtllm/deploy/README.md)
- [vLLM Deployments](../../examples/backends/vllm/deploy/README.md)
- [SGLang Deployments](../../examples/backends/sglang/deploy/README.md)
- [TensorRT-LLM Deployments](../../examples/backends/trtllm/deploy/README.md)
3. **Optional:**
- [Set up Prometheus & Grafana](./observability/metrics.md)
......
......@@ -126,7 +126,7 @@ At this point, we should have everything in place to collect and view logs in ou
To enable structured logs in a DynamoGraphDeployment, we need to set the `DYN_LOGGING_JSONL` environment variable to `1`. This is done for us in the `agg_logging.yaml` setup for the Sglang backend. We can now deploy the DynamoGraphDeployment with:
```bash
kubectl apply -n $DYN_NAMESPACE -f components/backends/sglang/deploy/agg_logging.yaml
kubectl apply -n $DYN_NAMESPACE -f examples/backends/sglang/deploy/agg_logging.yaml
```
Send a few chat completions requests to generate structured logs across the frontend and worker pods across the DynamoGraphDeployment. We are now all set to view the logs in Grafana.
......
......@@ -69,7 +69,7 @@ Let's start by deploying a simple vLLM aggregated deployment:
```bash
export NAMESPACE=dynamo-system # namespace where dynamo operator is installed
pushd components/backends/vllm/deploy
pushd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n $NAMESPACE
popd
```
......
......@@ -39,7 +39,7 @@ docker compose -f deploy/docker-compose.yml up -d
### Aggregated Serving with KVBM
```bash
cd $DYNAMO_HOME/components/backends/vllm
cd $DYNAMO_HOME/examples/backends/vllm
./launch/agg_kvbm.sh
```
......@@ -47,12 +47,12 @@ cd $DYNAMO_HOME/components/backends/vllm
```bash
# 1P1D - one prefill worker and one decode worker
# NOTE: need at least 2 GPUs
cd $DYNAMO_HOME/components/backends/vllm
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm.sh
# 2P2D - two prefill workers and two decode workers
# NOTE: need at least 4 GPUs
cd $DYNAMO_HOME/components/backends/vllm
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm_2p2d.sh
```
......
......@@ -102,7 +102,7 @@ tokens/s/gpu tokens/s/user
```bash
# Use with Dynamo's SLA planner (20-30 seconds vs hours)
python3 -m benchmarks.profiler.profile_sla \
--config ./components/backends/trtllm/deploy/disagg.yaml \
--config ./examples/backends/trtllm/deploy/disagg.yaml \
--backend trtllm \
--use-ai-configurator \
--aic-system h200_sxm \
......
......@@ -245,7 +245,7 @@ For details on hardware configuration and GPU discovery options, see [Hardware C
#### Using Existing DGD Configs (Recommended for Custom Setups)
If you have an existing DynamoGraphDeployment config (e.g., from `components/backends/*/deploy/disagg.yaml` or custom recipes), you can reference it via ConfigMap:
If you have an existing DynamoGraphDeployment config (e.g., from `examples/backends/*/deploy/disagg.yaml` or custom recipes), you can reference it via ConfigMap:
**Step 1: Create ConfigMap from your DGD config file:**
......
......@@ -293,7 +293,7 @@ The default delay is 10ms, which produces approximately 100 tokens per second.
### Other engines, multi-node, production
`vllm`, `sglang` and `trtllm` production grade engines are available in `components/backends`. They run as Python components, using the Rust bindings. See the main README.
`vllm`, `sglang` and `trtllm` production grade engines are available in `examples/backends`. They run as Python components, using the Rust bindings. See the main README.
`dynamo-run` is an exploration, development and prototyping tool, as well as an example of using the Rust API. Multi-node and production setups should be using the main engine components.
......@@ -320,7 +320,7 @@ The output looks like this:
## Writing your own engine in Python
The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo. All of the main backend components in `components/backends/` work like this.
The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo. All of the main backend components in `examples/backends/` work like this.
The Python file must do three things:
1. Decorate a function to get the runtime
......@@ -396,7 +396,7 @@ Here are some example engines:
- Chat:
* [sglang](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_sglang_tok.py)
More fully-featured Python engines are in `components/backends`.
More fully-featured Python engines are in `examples/backends`.
## Debugging
......