docs: Create a guide for writing dynamo deployments CR (#1999)

Unverified Commit 30942780 authored Jul 24, 2025 by atchernych Committed by GitHub Jul 24, 2025
5 changed files
--- a/components/backends/vllm/README.md
+++ b/components/backends/vllm/README.md
@@ -115,7 +115,7 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director

 #### Prerequisites

- **Dynamo Cloud**: Follow the [Quickstart Guide](../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.
+- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.

 - **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/vllm-runtime`. If you don't have access, build and push your own image:
  ```bash

--- a/deploy/cloud/helm/README.md
+++ b/deploy/cloud/helm/README.md
@@ -36,13 +36,14 @@ docker login <CONTAINER_REGISTRY>
 #### 🛠️ Build and push images for the Dynamo Cloud platform components

 [One-time Action]
-You should build the images for the Dynamo Cloud Platform.
+You should build the image(s) for the Dynamo Cloud Platform.
 If you are a **👤 Dynamo User** you would do this step once.

 ```bash
 export DOCKER_SERVER=<your-docker-server>
 export IMAGE_TAG=<TAG>
-earthly --push +all-docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG
+cd deploy/cloud/operator
+earthly --push +docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG
 ```

 If you are a **🧑‍💻 Dynamo Contributor** you would have to rebuild the dynamo platform images as the code evolves. To do so please look at the [Cloud Guide](../../../docs/guides/dynamo_deploy/dynamo_cloud.md).

--- a/docs/examples/README.md
+++ b/docs/examples/README.md
@@ -36,7 +36,7 @@ export NAMESPACE=<your-namespace> # the namespace you used to deploy Dynamo clou
 Deploying an example consists of the simple `kubectl apply -f ... -n ${NAMESPACE}` command. For example:

 ```bash
-kubectl apply -f  components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
+kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
 ```

 You can use `kubectl get dynamoGraphDeployment -n ${NAMESPACE}` to view your deployment.

--- a/docs/guides/dynamo_deploy/create_deployment.md
+++ b/docs/guides/dynamo_deploy/create_deployment.md
+# Creating Kubernetes Deployments
+
+The scripts in the `components/backends/<backend>/launch` folder, such as [agg.sh](../../../components/backends/vllm/launch/agg.sh), demonstrate how to serve your models locally.
+The corresponding YAML files, such as [agg.yaml](../../../components/backends/vllm/deploy/agg.yaml), show how to create a Kubernetes deployment for your inference graph.
+
+
+This guide explains how to create your own deployment files.
+
+## Step 1: Choose Your Architecture Pattern
+
+Select the architecture pattern that best fits your use case and use it as your template.
+
+For example, when using the vLLM inference backend:
+
+- **Development / Testing**
+  Use [`agg.yaml`](../../../components/backends/vllm/deploy/agg.yaml) as the base configuration.
+
+- **Production with Load Balancing**
+  Use [`agg_router.yaml`](../../../components/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.
+
+- **High Performance / Disaggregated Deployment**
+  Use [`disagg_router.yaml`](../../../components/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.
+
+
+## Step 2: Customize the Template
+
+You can run the Frontend on one machine (for example, a CPU node) and the worker on a different machine (for example, a GPU node).
+The Frontend serves as a framework-agnostic HTTP entry point and is unlikely to need many changes.
+
+It serves the following roles:
+1. OpenAI-Compatible HTTP Server
+  * Provides `/v1/chat/completions` endpoint
+  * Handles HTTP request/response formatting
+  * Supports streaming responses
+  * Validates incoming requests
+
+2. Service Discovery and Routing
+  * Auto-discovers backend workers via etcd
+  * Routes requests to the appropriate Processor/Worker components
+  * Handles load balancing between multiple workers
+
+3. Request Preprocessing
+  * Initial request validation
+  * Model name verification
+  * Request format standardization
+
+Next, pick a worker and specialize its configuration. For example:
+
+```yaml
+VllmWorker:         # vLLM-specific config
+  enforce-eager: true
+  enable-prefix-caching: true
+
+SglangWorker:       # SGLang-specific config
+  router-mode: kv
+  disagg-mode: true
+
+TrtllmWorker:       # TensorRT-LLM-specific config
+  engine-config: ./engine.yaml
+  kv-cache-transfer: ucx
+```
+
+Here's a template structure based on the examples:
+
+```yaml
+    YourWorker:
+      dynamoNamespace: your-namespace
+      componentType: worker
+      replicas: N
+      envFromSecret: your-secrets  # e.g., hf-token-secret
+      # Health checks for worker initialization
+      readinessProbe:
+        exec:
+          command: ["/bin/sh", "-c", 'grep "Worker.*initialized" /tmp/worker.log']
+      resources:
+        requests:
+          gpu: "1"  # GPU allocation
+      extraPodSpec:
+        mainContainer:
+          image: your-image
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags
+```
+
+Consult the corresponding launch script (the `.sh` file). Each Python command that launches a component goes into your YAML spec under
+`extraPodSpec: -> mainContainer: -> args:`
+
+The frontend is launched with `python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]`.
+Each worker is launched with a `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags` command.
+If you are a Dynamo contributor, see the [dynamo run guide](../dynamo_run.md) for details on how to run this command.
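+
+As a rough sketch of where the frontend command lands in the spec, the entry below reuses the fields from the worker template above. It assumes the Frontend accepts the same fields as a worker; `your-namespace` and `your-image` are placeholders, and `componentType` is left out because this guide only shows the worker value. Treat [agg.yaml](../../../components/backends/vllm/deploy/agg.yaml) as the authoritative reference.
+
+```yaml
+    Frontend:
+      dynamoNamespace: your-namespace   # placeholder, same namespace as the workers
+      replicas: 1                       # assumption: a single frontend instance
+      extraPodSpec:
+        mainContainer:
+          image: your-image             # placeholder image
+          command:
+            - /bin/sh
+            - -c
+          args:
+            # The whole launch command is one string passed to /bin/sh -c
+            - python3 -m dynamo.frontend --http-port 8000 --router-mode kv
+```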
+
+
+## Step 3: Key Customization Points
+
+### Model Configuration
+
+```yaml
+   args:
+     - "python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flag"
+```
+
+### Resource Allocation
+
+```yaml
+   resources:
+     requests:
+       cpu: "N"
+       memory: "NGi"
+       gpu: "N"
+```
+
+### Scaling
+
+```yaml
+   replicas: N  # Number of worker instances
+```
+
+### Routing Mode
+
+```yaml
+   args:
+     - --router-mode
+     - kv  # Enable KV-cache routing
+```
+
+### Worker Specialization
+
+```yaml
+   args:
+     - --is-prefill-worker  # For disaggregated prefill workers
+```
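+
+Putting these points together, a disaggregated pair of workers might look like the sketch below. The service names `DecodeWorker` and `PrefillWorker`, the replica counts, and the image are illustrative assumptions, and the flag is written inside the single `/bin/sh -c` command string used by the template in Step 2. Check [`disagg_router.yaml`](../../../components/backends/vllm/deploy/disagg_router.yaml) for the exact layout.
+
+```yaml
+    DecodeWorker:                       # hypothetical service name
+      dynamoNamespace: your-namespace
+      componentType: worker
+      replicas: 1
+      resources:
+        requests:
+          gpu: "1"
+      extraPodSpec:
+        mainContainer:
+          image: your-image
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL
+    PrefillWorker:                      # hypothetical service name
+      dynamoNamespace: your-namespace
+      componentType: worker
+      replicas: 1
+      resources:
+        requests:
+          gpu: "1"
+      extraPodSpec:
+        mainContainer:
+          image: your-image
+          command:
+            - /bin/sh
+            - -c
+          args:
+            # Same launch command, marked as a prefill worker
+            - python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --is-prefill-worker
+```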
\ No newline at end of file
--- a/docs/guides/dynamo_deploy/quickstart.md
+++ b/docs/guides/dynamo_deploy/quickstart.md
@@ -64,13 +64,10 @@ Use this approach when developing or customizing Dynamo as a contributor, or usi

 Ensure you have the source code checked out and are in the `dynamo` directory:

-```bash
-cd deploy/cloud/helm/
-```

 ### Set Environment Variables

-Our examples use the `nvcr.io` but you can setup your own values if you use another docker registry.
+Our examples use [`nvcr.io`](nvcr.io/nvidia/ai-dynamo/), but you can set your own values if you use another Docker registry.

 ```bash
 export NAMESPACE=dynamo-cloud # or whatever you prefer.
@@ -98,15 +95,13 @@ docker login <your-registry>
 docker push <your-registry>/dynamo-base:latest-vllm
 ```

-[More on image building](../../../../README.md)
-
-
 ### Install Dynamo Cloud

 You need to build and push the Dynamo Cloud Operator Image by running

 ```bash
-earthly --push +all-docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG
+cd deploy/cloud/operator
+earthly --push +docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG
 ```

 The  Nvidia Cloud Operator image will be pulled from the `$DOCKER_SERVER/dynamo-operator:$IMAGE_TAG`.
@@ -196,4 +191,5 @@ kubectl create secret generic hf-token-secret \
  -n ${NAMESPACE}
 ```

-Follow the [Examples](../../examples/README.md)
\ No newline at end of file
+Follow the [Examples](../../examples/README.md).
+For more details on how to create your own deployments, see the [Create Deployment Guide](create_deployment.md).