@@ -115,7 +115,7 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director
#### Prerequisites
-**Dynamo Cloud**: Follow the [Quickstart Guide](../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.
-**Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.
-**Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/vllm-runtime`. If you don't have access, build and push your own image:
If you are a **🧑💻 Dynamo Contributor** you would have to rebuild the dynamo platform images as the code evolves. To do so please look at the [Cloud Guide](../../../docs/guides/dynamo_deploy/dynamo_cloud.md).
The scripts in the `components/<backend>/launch` folder like [agg.sh](../../../components/backends/vllm/launch/agg.sh) demonstrate how you can serve your models locally.
The corresponding YAML files like [agg.yaml](../../../components/backends/vllm/deploy/agg.yaml) show you how you could create a kubernetes deployment for your inference graph.
This guide explains how to create your own deployment files.
## Step 1: Choose Your Architecture Pattern
Select the architecture pattern as your template that best fits your use case.
For example, when using the `VLLM` inference backend:
-**Development / Testing**
Use [`agg.yaml`](../../../components/backends/vllm/deploy/agg.yaml) as the base configuration.
-**Production with Load Balancing**
Use [`agg_router.yaml`](../../../components/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.
-**High Performance / Disaggregated Deployment**
Use [`disagg_router.yaml`](../../../components/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.
## Step 2: Customize the Template
You can run the Frontend on one machine, for example a CPU node, and the worker on a different machine (a GPU node).
The Frontend serves as a framework-agnostic HTTP entry point and is likely not to need many changes.
It serves the following roles:
1. OpenAI-Compatible HTTP Server
* Provides `/v1/chat/completions` endpoint
* Handles HTTP request/response formatting
* Supports streaming responses
* Validates incoming requests
2. Service Discovery and Routing
* Auto-discovers backend workers via etcd
* Routes requests to the appropriate Processor/Worker components
* Handles load balancing between multiple workers
3. Request Preprocessing
* Initial request validation
* Model name verification
* Request format standardization
You should then pick a worker and specialize the config. For example,
```yaml
VllmWorker:# vLLM-specific config
enforce-eager:true
enable-prefix-caching:true
SglangWorker:# SGLang-specific config
router-mode:kv
disagg-mode:true
TrtllmWorker:# TensorRT-LLM-specific config
engine-config:./engine.yaml
kv-cache-transfer:ucx
```
Here's a template structure based on the examples: