# vLLM Deployment Examples This directory contains examples for deploying vLLM models in both aggregated and disaggregated configurations. ## Use the Latest Release We recommend using the latest stable release of dynamo to avoid breaking changes: [![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest) You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with: ```bash git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ``` ## Prerequisites 1. Run Dynamo vLLM V1 docker: ```bash ./container/build.sh --framework VLLM_V1 --target dev ./container/run.sh --framework VLLM_V1 --target dev -it ``` Or install vLLM manually: ```bash export VLLM_REF=3c545c0c3b98ee642373a308197d750d0e449403 git clone https://github.com/vllm-project/vllm.git cd vllm git checkout $VLLM_REF VLLM_USE_PRECOMPILED=1 uv pip install -e . ``` If you are in the default vllm container remember to uninstall the old vllm using 'uv pip uninstall ai-dynamo-vllm' 2. Start required services: ```bash docker compose -f deploy/metrics/docker-compose.yml up -d ``` ## Running the Server ### Aggregated Deployment ```bash cd examples/vllm_v1 dynamo serve graphs.agg:Frontend -f configs/agg.yaml ``` ### Disaggregated Deployment ```bash cd examples/vllm_v1 dynamo serve graphs.disagg:Frontend -f configs/disagg.yaml ``` ## Testing the API Send a test request using curl: ```bash curl localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "prompt": "In the heart of Eldoria...", "stream": false, "max_tokens": 30 }' ``` For more detailed explenations, refer to the main [LLM examples README](../llm/README.md). ## Deepseek R1 To run DSR1 model please first follow the Ray setup from the [multinode documentation](../../docs/examples/multinode.md). ### Aggregated Deployment ```bash cd examples/vllm_v1 dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/agg.yaml ``` ### Disaggregated Deployment To create frontend with a single decode worker: ```bash cd examples/vllm_v1 dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/disagg.yaml ``` To create a single decode worker: ```bash cd examples/vllm_v1 dynamo serve components.worker:VllmDecodeWorker -f configs/deepseek_r1/disagg.yaml ``` To create a single prefill worker: ```bash cd examples/vllm_v1 dynamo serve components.worker:VllmPrefillWorker -f configs/deepseek_r1/disagg.yaml ``` ### Data Parallelism Deployment Additional configuration steps will be required for enabling DP for DSR1 model, as it typically requires setting up the DP groups across nodes. `configs/deepseek_r1/agg_dp.yaml` and `configs/deepseek_r1/disagg_dp.yaml` will be the replacement for aggregated deployment and disaggregated deployment. The below demonstration will use deployment of a single worker as an example, the reader should apply the same to any `dynamo serve` command that will create a worker. To create a single decode worker, take note of the IP address, referred as below, of the node, and: ```bash cd examples/vllm_v1 dynamo serve components.worker:VllmDecodeWorker -f configs/deepseek_r1/disagg_dp.yaml --VllmDecodeWorker.data_parallel_address= ``` The above command will create 1 of the 2 DP groups and the worker will be consdiered the head of the DP groups. Next we need to create a `VllmDpWorker` to create the rest of the DP groups, one for each group. ```bash cd examples/vllm_v1 # 'data_parallel_start_rank' == `dp_group_index * data_parallel_size_local` dynamo serve components.worker:VllmDpWorker -f configs/deepseek_r1/disagg_dp.yaml --VllmDpWorker.data_parallel_address= --VllmDpWorker.data_parallel_start_rank=8 # repeat above until all DP groups are created ``` ### Wide EP If running oustide of Dynamo vLLM V1 container please follow [vLLM guide](https://github.com/vllm-project/vllm/tree/main/tools/ep_kernels) to install EP kernels and install [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM). To run DSR1 with DEP16 (EP16 MoE and DP16 for other layers) with [DeepEP kernels](https://github.com/deepseek-ai/DeepEP) run on head node: ``` export VLLM_ALL2ALL_BACKEND="deepep_low_latency" # or "deepep_high_throughput" export VLLM_USE_DEEP_GEMM=1 export GLOO_SOCKET_IFNAME=eth3 # or another non IB interface that you can find with `ifconfig -a` cd examples/vllm_v1 dynamo serve components.worker:VllmDecodeWorker -f configs/deepseek_r1/disagg_dp.yaml --VllmDecodeWorker.data_parallel_address= --VllmDecodeWorker.enable_expert_parallel=true ``` on 2nd node: ``` export VLLM_ALL2ALL_BACKEND="deepep_low_latency" # or "deepep_high_throughput" export VLLM_USE_DEEP_GEMM=1 export GLOO_SOCKET_IFNAME=eth3 # or another non IB interface that you can find with `ifconfig -a` cd examples/vllm_v1 # 'data_parallel_start_rank' == `dp_group_index * data_parallel_size_local` dynamo serve components.worker:VllmDpWorker -f configs/deepseek_r1/disagg_dp.yaml --VllmDpWorker.data_parallel_address= --VllmDpWorker.data_parallel_start_rank=8 --VllmDpWorker.enable_expert_parallel=true ``` ## Testing Send a test request using curl: ```bash curl localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "deepseek-ai/DeepSeek-R1", "prompt": "In the heart of Eldoria...", "stream": false, "max_tokens": 30 }' ```