Commit 8c665c1b authored by kYLe, committed by GitHub

docs: Add AWS ECS deployment example for Dynamo vLLM (#2415)


Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Co-authored-by: Neal Vaidya <nealv@nvidia.com>
parent fed53137
# Dynamo Deployment of vLLM Example on AWS ECS
## 1. ECS Cluster Setup
1. In the AWS ECS console, open the **Clusters** tab and click **Create cluster**.
2. Enter `dynamo-GPU` as the cluster name and choose **AWS EC2 instances** as the infrastructure. This option creates a cluster backed by EC2 instances on which the containers are deployed.
3. Choose the Amazon ECS-optimized GPU AMI `Amazon Linux 2 (GPU)`, which includes the NVIDIA drivers and the Docker GPU runtime out of the box.
4. Choose `g6e.2xlarge` as the **EC2 instance type** and add an **SSH key pair** so you can log in to the instance for debugging.
5. Set **Root EBS volume size** to `200`.
6. For networking, use the default settings. Make sure the **security group** has
- an inbound rule that allows "All traffic" from this security group.
- inbound rules for ports 22 and 8000, so that you can SSH into the instance for debugging and reach the frontend endpoint.
7. Select `Turn on` for the **Auto-assign public IP** option.
8. Click **Create**. The cluster is deployed through CloudFormation.
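If you prefer to script the security group rules from step 6 instead of using the console, the following is a minimal sketch with the AWS CLI. The `run` helper, the `allow_cluster_traffic` function, and the security group ID and CIDR arguments are illustrative assumptions, not part of the original example. Set `DRY_RUN=1` to print the commands without executing them.

```sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Add the inbound rules described in step 6.
#   $1: security group ID of the cluster instances (placeholder, e.g. sg-0123)
#   $2: CIDR allowed to reach ports 22 and 8000 (placeholder, e.g. your own IP/32)
allow_cluster_traffic() {
  sg_id=$1
  client_cidr=$2

  # "All traffic" between members of the same security group.
  run aws ec2 authorize-security-group-ingress --group-id "$sg_id" \
    --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$sg_id}]"

  # SSH (22) for debugging and the frontend HTTP endpoint (8000).
  for port in 22 8000; do
    run aws ec2 authorize-security-group-ingress --group-id "$sg_id" \
      --protocol tcp --port "$port" --cidr "$client_cidr"
  done
}

# Example usage (not executed here):
#   DRY_RUN=1; allow_cluster_traffic sg-0123 203.0.113.10/32
```

Restricting ports 22 and 8000 to a narrow CIDR rather than `0.0.0.0/0` is a design choice of this sketch; the original instructions do not specify a source range.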
## 2. ETCD/NATS Task Definitions Setup
Add a task definition for the ETCD and NATS services. A sample task definition JSON is attached.
1. ETCD container
- Use `etcd` as the container name
- Use `bitnami/etcd` as the image URL and select **Yes** for Essential container
- Container port
|Container port|Protocol|Port name| App protocol|
|-|-|-|-|
|2379|TCP|2379|HTTP|
|2380|TCP|2380|HTTP|
- Add an environment variable with key `ALLOW_NONE_AUTHENTICATION` and value `YES`
2. NATS container
- Use `nats` as the container name
- Use `nats` as the image URL and select **Yes** for Essential container
- Container port
|Container port|Protocol|Port name| App protocol|
|-|-|-|-|
|4222|TCP|4222|HTTP|
|6222|TCP|6222|HTTP|
|8222|TCP|8222|HTTP|
- Under Docker configuration, add `-js, --trace` in **Command**
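As an alternative to entering these values in the console, the attached sample JSON can be registered with the AWS CLI. The sketch below assumes the JSON is saved locally; the file names and helper function names are hypothetical. The sample JSON uses `AWS_ID` as a placeholder for the account ID in the role ARNs, so it has to be filled in first.

```sh
# Replace the AWS_ID placeholder from the sample JSON with a real account ID.
# Reads the task definition on stdin and writes the result to stdout.
#   $1: 12-digit AWS account ID
fill_account_id() {
  sed "s/AWS_ID/$1/g"
}

# Register a task definition from a local JSON file.
#   $1: path to the JSON file (hypothetical name, chosen by you)
register_task_definition() {
  aws ecs register-task-definition --cli-input-json "file://$1"
}

# Example usage (not executed here):
#   account_id=$(aws sts get-caller-identity --query Account --output text)
#   fill_account_id "$account_id" < etcd_nats_task.json > /tmp/etcd_nats_task.json
#   register_task_definition /tmp/etcd_nats_task.json
```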
## 3. vLLM Task Definitions Setup
1. Dynamo vLLM Frontend Task
This task creates the vLLM frontend, processors, routers and a decode worker.
Follow the steps below to create this task:
- Set the task definition family name to `Dynamo-frontend`, as in the attached sample JSON.
- Choose `Amazon EC2 instances` as the **Launch type** with **Task size** `2 vCPU` and `40 GB` memory
- Choose `host` as the Network mode.
- Use `dynamo-vllm-frontend` as the container name
- Add your image URL (you can use the prebuilt [Dynamo container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime)) and select **Yes** for Essential container. The URL can be an AWS ECR URL or an NVIDIA NGC URL. If you use an NGC URL, also choose **Private registry authentication** and add your Secrets Manager ARN or name.
- Container port
|Container port|Protocol|Port name| App protocol|
|-|-|-|-|
|8000|TCP|8000|HTTP|
- Set **Resource allocation limits** to `1` GPU
- Set the environment variables as below. The `IP_ADDRESS` placeholder is overridden later, at deployment time.
|Key|Value type|Value|
|-|-|-|
|ETCD_ENDPOINTS|Value|http://IP_ADDRESS:2379|
|NATS_SERVER|Value|nats://IP_ADDRESS:4222|
- Docker configuration
Add `sh,-c` in **Entry point** and `cd components/backends/vllm && python -m dynamo.frontend --router-mode kv & python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager` in **Command**
2. Dynamo vLLM PrefillWorker Task
Create the PrefillWorker task the same way as the frontend task, except for the following changes:
- Use `dynamo-prefill` as the container name
- No container port mapping
- Docker configuration with command `cd components/backends/vllm && python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --is-prefill-worker`
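The frontend command above combines `&&` and `&` in a single `sh -c` string, and the way the shell groups them is easy to misread. In POSIX shell, `cd DIR && A & B` is parsed as `(cd DIR && A) & B`: the `cd` and `A` run together as one background job in a subshell, and `B` runs in the foreground from the original working directory. The sketch below demonstrates this with stand-in commands (`true` in place of the frontend, `pwd` in place of the worker); the temporary directory is only for illustration.

```sh
# Create a scratch directory that plays the role of components/backends/vllm.
demo_dir=$(mktemp -d)

# Same shape as the frontend command: `cd DIR && A & B`.
# `true` stands in for the frontend process, the final `pwd` for the worker.
foreground_dir=$(sh -c "cd '$demo_dir' && true & pwd")

echo "background job changed into: $demo_dir"
echo "foreground command ran in:   $foreground_dir"

rmdir "$demo_dir"
```

In the task definition this means the decode worker starts from the image's default working directory rather than from `components/backends/vllm`. Whether that matters depends on how the image is laid out, which this sketch does not verify.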
## 4. Task Deployment
You can create a service or run the task directly from the task definition.
1. ETCD/NATS Task
- For **Existing cluster**, choose the Fargate cluster created in the hello world example.
- Wait for the deployment to finish, then note the **Private IP** of this task.
2. Dynamo Frontend Task
- For **Existing cluster**, choose the EC2 cluster created in step 1.
- In **Container Overrides**, replace `IP_ADDRESS` in the `ETCD_ENDPOINTS` and `NATS_SERVER` values with the private IP of the ETCD/NATS task.
3. Dynamo PrefillWorker Task
- For **Existing cluster**, choose the EC2 cluster created in step 1.
- In **Container Overrides**, replace `IP_ADDRESS` in the `ETCD_ENDPOINTS` and `NATS_SERVER` values with the private IP of the ETCD/NATS task.
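The container overrides can also be passed on the command line. The following is a minimal sketch, assuming the task definitions are already registered under the family names from the sample JSON; the function names and the task ARN argument are hypothetical. The container name in the override must match the name in the task definition.

```sh
# Look up the private IPv4 address of a running awsvpc (Fargate) task.
#   $1: cluster name   $2: task ARN or ID
task_private_ip() {
  aws ecs describe-tasks --cluster "$1" --tasks "$2" \
    --query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value' \
    --output text
}

# Build the --overrides JSON that points a container at the ETCD/NATS task.
#   $1: container name from the task definition   $2: ETCD/NATS private IP
build_overrides() {
  printf '{"containerOverrides":[{"name":"%s","environment":[' "$1"
  printf '{"name":"ETCD_ENDPOINTS","value":"http://%s:2379"},' "$2"
  printf '{"name":"NATS_SERVER","value":"nats://%s:4222"}]}]}' "$2"
}

# Example usage (not executed here):
#   ip=$(task_private_ip <fargate-cluster> <etcd-nats-task-arn>)
#   aws ecs run-task --cluster dynamo-GPU --launch-type EC2 \
#     --task-definition Dynamo-frontend \
#     --overrides "$(build_overrides dynamo-vllm-frontend "$ip")"
#   aws ecs run-task --cluster dynamo-GPU --launch-type EC2 \
#     --task-definition Dynamo-backend \
#     --overrides "$(build_overrides dynamo-prefill "$ip")"
```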
## 5. Testing
Find the public IP of the Dynamo frontend task on the task page. Run the following commands to query the endpoint.
```sh
export DYNAMO_IP_ADDRESS=TASK_PUBLIC_IP_ADDRESS
curl http://$DYNAMO_IP_ADDRESS:8000/v1/models
curl http://$DYNAMO_IP_ADDRESS:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream":false,
"max_tokens": 30
}'
```
You should see responses from the hosted endpoint.
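The endpoint may not answer immediately after the task starts, since the container has to come up and load the model first. A small polling loop can wait for `/v1/models` before you send the chat request. This is a sketch only: the `wait_for_endpoint` name, the retry count, and the delay are assumptions to tune for your setup.

```sh
# Poll a URL until it returns a successful HTTP status or the retries run out.
#   $1: URL to probe
#   $2: maximum number of attempts (default 30)
#   $3: seconds to sleep between attempts (default 10)
wait_for_endpoint() {
  url=$1
  retries=${2:-30}
  delay=${3:-10}
  attempt=0
  while [ "$attempt" -lt "$retries" ]; do
    # -s: quiet, -f: treat HTTP errors as failure, --max-time: bound each probe
    if curl -sf --max-time 5 "$url" >/dev/null; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage (not executed here):
#   wait_for_endpoint "http://$DYNAMO_IP_ADDRESS:8000/v1/models" && echo "frontend is up"
```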
{
"family": "Dynamo-tasks",
"containerDefinitions": [
{
"name": "etcd",
"image": "bitnami/etcd",
"cpu": 0,
"portMappings": [
{
"name": "2379",
"containerPort": 2379,
"hostPort": 2379,
"protocol": "tcp",
"appProtocol": "http"
},
{
"name": "2380",
"containerPort": 2380,
"hostPort": 2380,
"protocol": "tcp",
"appProtocol": "http"
}
],
"essential": true,
"environment": [
{
"name": "ALLOW_NONE_AUTHENTICATION",
"value": "YES"
}
],
"environmentFiles": [],
"mountPoints": [],
"volumesFrom": [],
"ulimits": [],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/Dynamo-tasks",
"mode": "non-blocking",
"awslogs-create-group": "true",
"max-buffer-size": "25m",
"awslogs-region": "us-east-2",
"awslogs-stream-prefix": "ecs"
},
"secretOptions": []
},
"systemControls": []
},
{
"name": "nats",
"image": "nats",
"cpu": 0,
"portMappings": [
{
"name": "4222",
"containerPort": 4222,
"hostPort": 4222,
"protocol": "tcp"
},
{
"name": "6222",
"containerPort": 6222,
"hostPort": 6222,
"protocol": "tcp"
},
{
"name": "8222",
"containerPort": 8222,
"hostPort": 8222,
"protocol": "tcp"
}
],
"essential": true,
"command": [
"-js",
"--trace"
],
"environment": [],
"environmentFiles": [],
"mountPoints": [],
"volumesFrom": [],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/Dynamo-tasks",
"mode": "non-blocking",
"awslogs-create-group": "true",
"max-buffer-size": "25m",
"awslogs-region": "us-east-2",
"awslogs-stream-prefix": "ecs"
},
"secretOptions": []
},
"systemControls": []
}
],
"taskRoleArn": "arn:aws:iam::AWS_ID:role/ecsTaskExecutionRole",
"executionRoleArn": "arn:aws:iam::AWS_ID:role/ecsTaskExecutionRole",
"networkMode": "awsvpc",
"volumes": [],
"placementConstraints": [],
"requiresCompatibilities": [
"FARGATE"
],
"cpu": "1024",
"memory": "3072",
"runtimePlatform": {
"cpuArchitecture": "X86_64",
"operatingSystemFamily": "LINUX"
},
"enableFaultInjection": false
}
{
"family": "Dynamo-frontend",
"containerDefinitions": [
{
"name": "dynamo-vllm-frontend",
"image": "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0",
"repositoryCredentials": {
"credentialsParameter": "arn:aws:secretsmanager:us-east-2:AWS_ID:secret:ngc_nvcr_access"
},
"cpu": 0,
"portMappings": [
{
"name": "8000",
"containerPort": 8000,
"hostPort": 8000,
"protocol": "tcp",
"appProtocol": "http"
}
],
"essential": true,
"entryPoint": [
"sh",
"-c"
],
"command": [
"cd components/backends/vllm && python -m dynamo.frontend --router-mode kv & python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager"
],
"environment": [
{
"name": "ETCD_ENDPOINTS",
"value": "http://IP_ADDRESS:2379"
},
{
"name": "NATS_SERVER",
"value": "nats://IP_ADDRESS:4222"
}
],
"environmentFiles": [],
"mountPoints": [],
"volumesFrom": [],
"ulimits": [],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/Dynamo-frontend",
"mode": "non-blocking",
"awslogs-create-group": "true",
"max-buffer-size": "25m",
"awslogs-region": "us-east-2",
"awslogs-stream-prefix": "ecs"
},
"secretOptions": []
},
"systemControls": [],
"resourceRequirements": [
{
"value": "1",
"type": "GPU"
}
]
}
],
"taskRoleArn": "arn:aws:iam::AWS_ID:role/ecsTaskExecutionRole",
"executionRoleArn": "arn:aws:iam::AWS_ID:role/ecsTaskExecutionRole",
"networkMode": "host",
"volumes": [],
"placementConstraints": [],
"requiresCompatibilities": [
"EC2"
],
"cpu": "2048",
"memory": "40960",
"runtimePlatform": {
"cpuArchitecture": "X86_64",
"operatingSystemFamily": "LINUX"
},
"enableFaultInjection": false
}
{
"family": "Dynamo-backend",
"containerDefinitions": [
{
"name": "dynamo-prefill",
"image": "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0",
"repositoryCredentials": {
"credentialsParameter": "arn:aws:secretsmanager:us-east-2:AWS_ID:secret:ngc_access"
},
"cpu": 0,
"portMappings": [],
"essential": true,
"entryPoint": [
"sh",
"-c"
],
"command": [
"cd components/backends/vllm && python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --is-prefill-worker"
],
"environment": [
{
"name": "ETCD_ENDPOINTS",
"value": "http://IP_ADDRESS:2379"
},
{
"name": "NATS_SERVER",
"value": "nats://IP_ADDRESS:4222"
}
],
"environmentFiles": [],
"mountPoints": [],
"volumesFrom": [],
"ulimits": [],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/Dynamo-backend",
"mode": "non-blocking",
"awslogs-create-group": "true",
"max-buffer-size": "25m",
"awslogs-region": "us-east-2",
"awslogs-stream-prefix": "ecs"
},
"secretOptions": []
},
"systemControls": [],
"resourceRequirements": [
{
"value": "1",
"type": "GPU"
}
]
}
],
"taskRoleArn": "arn:aws:iam::AWS_ID:role/ecsTaskExecutionRole",
"executionRoleArn": "arn:aws:iam::AWS_ID:role/ecsTaskExecutionRole",
"networkMode": "bridge",
"volumes": [],
"placementConstraints": [],
"requiresCompatibilities": [
"EC2"
],
"cpu": "2048",
"memory": "40960",
"runtimePlatform": {
"cpuArchitecture": "X86_64",
"operatingSystemFamily": "LINUX"
},
"enableFaultInjection": false
}