Unverified Commit f3e3d94a authored by Alec's avatar Alec Committed by GitHub
Browse files

refactor: vLLM to new Python UX (#1983)


Co-authored-by: default avatarGraham King <grahamk@nvidia.com>
parent 9f2356cb
<!-- <!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0 SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--> -->
# LLM Deployment Examples using vLLM # LLM Deployment using vLLM
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation. This directory contains a Dynamo vllm engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
## Deployment Architectures ## Deployment Architectures
...@@ -36,11 +24,11 @@ docker compose -f deploy/metrics/docker-compose.yml up -d ...@@ -36,11 +24,11 @@ docker compose -f deploy/metrics/docker-compose.yml up -d
### Build and Run docker ### Build and Run docker
```bash ```bash
./container/build.sh ./container/build.sh --framework VLLM
``` ```
```bash ```bash
./container/run.sh -it [--mount-workspace] ./container/run.sh -it --framework VLLM [--mount-workspace]
``` ```
This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks. This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks.
...@@ -74,7 +62,7 @@ Note: The above architecture illustrates all the components. The final component ...@@ -74,7 +62,7 @@ Note: The above architecture illustrates all the components. The final component
```bash ```bash
# requires one gpu # requires one gpu
cd examples/vllm cd components/backends/vllm
bash launch/agg.sh bash launch/agg.sh
``` ```
...@@ -82,7 +70,7 @@ bash launch/agg.sh ...@@ -82,7 +70,7 @@ bash launch/agg.sh
```bash ```bash
# requires two gpus # requires two gpus
cd examples/vllm cd components/backends/vllm
bash launch/agg_router.sh bash launch/agg_router.sh
``` ```
...@@ -90,7 +78,7 @@ bash launch/agg_router.sh ...@@ -90,7 +78,7 @@ bash launch/agg_router.sh
```bash ```bash
# requires two gpus # requires two gpus
cd examples/vllm cd components/backends/vllm
bash launch/disagg.sh bash launch/disagg.sh
``` ```
...@@ -98,7 +86,7 @@ bash launch/disagg.sh ...@@ -98,7 +86,7 @@ bash launch/disagg.sh
```bash ```bash
# requires three gpus # requires three gpus
cd examples/vllm cd components/backends/vllm
bash launch/disagg_router.sh bash launch/disagg_router.sh
``` ```
...@@ -108,7 +96,7 @@ This example is not meant to be performant but showcases dynamo routing to data ...@@ -108,7 +96,7 @@ This example is not meant to be performant but showcases dynamo routing to data
```bash ```bash
# requires four gpus # requires four gpus
cd examples/vllm cd components/backends/vllm
bash launch/dep.sh bash launch/dep.sh
``` ```
...@@ -146,7 +134,7 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director ...@@ -146,7 +134,7 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director
Example with disagg: Example with disagg:
```bash ```bash
cd ~/dynamo/examples/vllm/deploy cd ~/dynamo/components/backends/vllm/deploy
kubectl apply -f disagg.yaml kubectl apply -f disagg.yaml
``` ```
......
<!-- <!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0 SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--> -->
# Running Deepseek R1 with Wide EP # Running Deepseek R1 with Wide EP
...@@ -51,4 +39,4 @@ curl localhost:8080/v1/chat/completions \ ...@@ -51,4 +39,4 @@ curl localhost:8080/v1/chat/completions \
"stream": false, "stream": false,
"max_tokens": 30 "max_tokens": 30
}' }'
``` ```
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: nvidia.com/v1alpha1 apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment kind: DynamoGraphDeployment
metadata: metadata:
...@@ -50,7 +39,7 @@ spec: ...@@ -50,7 +39,7 @@ spec:
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4 image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
workingDir: /workspace/examples/vllm workingDir: /workspace/components/backends/vllm
args: args:
- dynamo - dynamo
- run - run
...@@ -94,6 +83,6 @@ spec: ...@@ -94,6 +83,6 @@ spec:
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4 image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
workingDir: /workspace/examples/vllm workingDir: /workspace/components/backends/vllm
args: args:
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log" - "python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: nvidia.com/v1alpha1 apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment kind: DynamoGraphDeployment
metadata: metadata:
...@@ -50,7 +39,7 @@ spec: ...@@ -50,7 +39,7 @@ spec:
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4 image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
workingDir: /workspace/examples/vllm workingDir: /workspace/components/backends/vllm
args: args:
- dynamo - dynamo
- run - run
...@@ -96,6 +85,6 @@ spec: ...@@ -96,6 +85,6 @@ spec:
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4 image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
workingDir: /workspace/examples/vllm workingDir: /workspace/components/backends/vllm
args: args:
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log" - "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: nvidia.com/v1alpha1 apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment kind: DynamoGraphDeployment
metadata: metadata:
...@@ -50,7 +39,7 @@ spec: ...@@ -50,7 +39,7 @@ spec:
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4 image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
workingDir: /workspace/examples/vllm workingDir: /workspace/components/backends/vllm
args: args:
- dynamo - dynamo
- run - run
...@@ -94,7 +83,7 @@ spec: ...@@ -94,7 +83,7 @@ spec:
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4 image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
workingDir: /workspace/examples/vllm workingDir: /workspace/components/backends/vllm
args: args:
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log" - "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"
VllmPrefillWorker: VllmPrefillWorker:
...@@ -133,6 +122,6 @@ spec: ...@@ -133,6 +122,6 @@ spec:
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4 image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
workingDir: /workspace/examples/vllm workingDir: /workspace/components/backends/vllm
args: args:
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager --is-prefill-worker 2>&1 | tee /tmp/vllm.log" - "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager --is-prefill-worker 2>&1 | tee /tmp/vllm.log"
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: nvidia.com/v1alpha1 apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment kind: DynamoGraphDeployment
metadata: metadata:
...@@ -50,7 +39,7 @@ spec: ...@@ -50,7 +39,7 @@ spec:
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4 image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
workingDir: /workspace/examples/vllm workingDir: /workspace/components/backends/vllm
args: args:
- dynamo - dynamo
- run - run
...@@ -94,7 +83,7 @@ spec: ...@@ -94,7 +83,7 @@ spec:
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4 image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
workingDir: /workspace/examples/vllm workingDir: /workspace/components/backends/vllm
args: args:
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log" - "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"
VllmPrefillWorker: VllmPrefillWorker:
...@@ -133,6 +122,6 @@ spec: ...@@ -133,6 +122,6 @@ spec:
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4 image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
workingDir: /workspace/examples/vllm workingDir: /workspace/components/backends/vllm
args: args:
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager --is-prefill-worker 2>&1 | tee /tmp/vllm.log" - "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager --is-prefill-worker 2>&1 | tee /tmp/vllm.log"
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: nvidia.com/v1alpha1 apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment kind: DynamoGraphDeployment
metadata: metadata:
...@@ -50,16 +39,9 @@ spec: ...@@ -50,16 +39,9 @@ spec:
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4 image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
workingDir: /workspace/examples/vllm workingDir: /workspace/components/backends/vllm
args: args:
- dynamo - "python3 -m dynamo.frontend --http-port 8080 --router-mode kv"
- run
- in=http
- out=dyn
- --http-port
- "8000"
- --router-mode
- kv
VllmDecodeWorker: VllmDecodeWorker:
dynamoNamespace: vllm-v1-disagg-router dynamoNamespace: vllm-v1-disagg-router
envFromSecret: hf-token-secret envFromSecret: hf-token-secret
...@@ -96,9 +78,9 @@ spec: ...@@ -96,9 +78,9 @@ spec:
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4 image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
workingDir: /workspace/examples/vllm workingDir: /workspace/components/backends/vllm
args: args:
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log" - "python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"
VllmPrefillWorker: VllmPrefillWorker:
dynamoNamespace: vllm-v1-disagg-router dynamoNamespace: vllm-v1-disagg-router
envFromSecret: hf-token-secret envFromSecret: hf-token-secret
...@@ -135,6 +117,6 @@ spec: ...@@ -135,6 +117,6 @@ spec:
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4 image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
workingDir: /workspace/examples/vllm workingDir: /workspace/components/backends/vllm
args: args:
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager --is-prefill-worker 2>&1 | tee /tmp/vllm.log" - "python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --is-prefill-worker 2>&1 | tee /tmp/vllm.log"
...@@ -5,7 +5,7 @@ set -e ...@@ -5,7 +5,7 @@ set -e
trap 'echo Cleaning up...; kill 0' EXIT trap 'echo Cleaning up...; kill 0' EXIT
# run ingress # run ingress
dynamo run in=http out=dyn & python -m dynamo.frontend &
# run worker # run worker
python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager --no-enable-prefix-caching python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --no-enable-prefix-caching
...@@ -5,9 +5,9 @@ set -e ...@@ -5,9 +5,9 @@ set -e
trap 'echo Cleaning up...; kill 0' EXIT trap 'echo Cleaning up...; kill 0' EXIT
# run ingress # run ingress
dynamo run in=http out=dyn --router-mode kv & python -m dynamo.frontend --router-mode kv &
# run workers # run workers
CUDA_VISIBLE_DEVICES=0 python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager & CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
CUDA_VISIBLE_DEVICES=1 python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager
...@@ -5,13 +5,13 @@ set -e ...@@ -5,13 +5,13 @@ set -e
trap 'echo Cleaning up...; kill 0' EXIT trap 'echo Cleaning up...; kill 0' EXIT
# run ingress # run ingress
dynamo run in=http out=dyn --router-mode kv & python -m dynamo.frontend --router-mode kv &
# Data Parallel Attention / Expert Parallelism # Data Parallel Attention / Expert Parallelism
# Routing to DP workers managed by Dynamo # Routing to DP workers managed by Dynamo
# Chose Qwen3-30B because its a small MOE that can fit on smaller GPUs (L40S for example) # Chose Qwen3-30B because its a small MOE that can fit on smaller GPUs (L40S for example)
for i in {0..3}; do for i in {0..3}; do
CUDA_VISIBLE_DEVICES=$i python3 components/main.py \ CUDA_VISIBLE_DEVICES=$i python3 -m dynamo.vllm \
--model Qwen/Qwen3-30B-A3B \ --model Qwen/Qwen3-30B-A3B \
--data-parallel-rank $i \ --data-parallel-rank $i \
--data-parallel-size 4 \ --data-parallel-size 4 \
......
...@@ -5,11 +5,11 @@ set -e ...@@ -5,11 +5,11 @@ set -e
trap 'echo Cleaning up...; kill 0' EXIT trap 'echo Cleaning up...; kill 0' EXIT
# run ingress # run ingress
dynamo run in=http out=dyn & python -m dynamo.frontend --router-mode kv &
CUDA_VISIBLE_DEVICES=0 python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager & CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
CUDA_VISIBLE_DEVICES=1 python3 components/main.py \ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \ --model Qwen/Qwen3-0.6B \
--enforce-eager \ --enforce-eager \
--is-prefill-worker --is-prefill-worker
...@@ -6,14 +6,14 @@ set -e ...@@ -6,14 +6,14 @@ set -e
trap 'echo Cleaning up...; kill 0' EXIT trap 'echo Cleaning up...; kill 0' EXIT
# run ingress # run ingress
dynamo run in=http out=dyn --router-mode kv & python -m dynamo.frontend --router-mode kv &
# routing will happen between the two decode workers # routing will happen between the two decode workers
CUDA_VISIBLE_DEVICES=0 python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager & CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
CUDA_VISIBLE_DEVICES=1 python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager & CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
CUDA_VISIBLE_DEVICES=2 python3 components/main.py \ CUDA_VISIBLE_DEVICES=2 python3 -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \ --model Qwen/Qwen3-0.6B \
--enforce-eager \ --enforce-eager \
--is-prefill-worker --is-prefill-worker
...@@ -76,7 +76,7 @@ trap 'echo Cleaning up...; kill 0' EXIT ...@@ -76,7 +76,7 @@ trap 'echo Cleaning up...; kill 0' EXIT
# run ingress if it's node 0 # run ingress if it's node 0
if [ $NODE_RANK -eq 0 ]; then if [ $NODE_RANK -eq 0 ]; then
DYN_LOG=debug dynamo-run in=http out=dyn --router-mode kv 2>&1 | tee $LOG_DIR/dsr1_dep_ingress.log & DYN_LOG=debug python -m dynamo.frontend --router-mode kv 2>&1 | tee $LOG_DIR/dsr1_dep_ingress.log &
fi fi
mkdir -p $LOG_DIR mkdir -p $LOG_DIR
...@@ -89,7 +89,7 @@ for ((i=0; i<GPUS_PER_NODE; i++)); do ...@@ -89,7 +89,7 @@ for ((i=0; i<GPUS_PER_NODE; i++)); do
VLLM_ALL2ALL_BACKEND="deepep_low_latency" \ VLLM_ALL2ALL_BACKEND="deepep_low_latency" \
VLLM_USE_DEEP_GEMM=1 \ VLLM_USE_DEEP_GEMM=1 \
VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1 \ VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1 \
python3 components/main.py \ python3 -m dynamo.vllm \
--model deepseek-ai/DeepSeek-R1 \ --model deepseek-ai/DeepSeek-R1 \
--data_parallel_size $DATA_PARALLEL_SIZE \ --data_parallel_size $DATA_PARALLEL_SIZE \
--data-parallel-rank $dp_rank \ --data-parallel-rank $dp_rank \
......
<!-- <!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0 SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--> -->
# Multi-node Examples # Multi-node Examples
...@@ -63,10 +51,10 @@ Deploy vLLM workers across multiple nodes for horizontal scaling: ...@@ -63,10 +51,10 @@ Deploy vLLM workers across multiple nodes for horizontal scaling:
**Node 1 (Head Node)**: Run ingress and first worker **Node 1 (Head Node)**: Run ingress and first worker
```bash ```bash
# Start ingress # Start ingress
dynamo run in=http out=dyn python -m dynamo.frontend --router-mode kv
# Start vLLM worker # Start vLLM worker
python3 components/main.py \ python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \ --model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 8 \ --tensor-parallel-size 8 \
--enforce-eager --enforce-eager
...@@ -75,7 +63,7 @@ python3 components/main.py \ ...@@ -75,7 +63,7 @@ python3 components/main.py \
**Node 2**: Run additional worker **Node 2**: Run additional worker
```bash ```bash
# Start vLLM worker # Start vLLM worker
python3 components/main.py \ python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \ --model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 8 \ --tensor-parallel-size 8 \
--enforce-eager --enforce-eager
...@@ -88,10 +76,10 @@ Deploy prefill and decode workers on separate nodes for optimized resource utili ...@@ -88,10 +76,10 @@ Deploy prefill and decode workers on separate nodes for optimized resource utili
**Node 1**: Run ingress and prefill workers **Node 1**: Run ingress and prefill workers
```bash ```bash
# Start ingress # Start ingress
dynamo run in=http out=dyn & python -m dynamo.frontend --router-mode kv &
# Start prefill worker # Start prefill worker
python3 components/main.py \ python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct --model meta-llama/Llama-3.3-70B-Instruct
--tensor-parallel-size 8 \ --tensor-parallel-size 8 \
--enforce-eager --enforce-eager
...@@ -100,7 +88,7 @@ python3 components/main.py \ ...@@ -100,7 +88,7 @@ python3 components/main.py \
**Node 2**: Run decode workers **Node 2**: Run decode workers
```bash ```bash
# Start decode worker # Start decode worker
python3 components/main.py \ python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct --model meta-llama/Llama-3.3-70B-Instruct
--tensor-parallel-size 8 \ --tensor-parallel-size 8 \
--enforce-eager \ --enforce-eager \
...@@ -117,6 +105,6 @@ For models requiring more GPUs than available on a single node such as tensor-pa ...@@ -117,6 +105,6 @@ For models requiring more GPUs than available on a single node such as tensor-pa
**Node 1**: First part of tensor-parallel model **Node 1**: First part of tensor-parallel model
```bash ```bash
# Start ingress # Start ingress
dynamo run in=http out=dyn & python -m dynamo.frontend --router-mode kv &
``` ```
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
uvloop
vllm==0.9.2
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
from dynamo.vllm.main import main
if __name__ == "__main__":
main()
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import asyncio import asyncio
import json import json
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import asyncio import asyncio
import logging import logging
...@@ -21,12 +9,13 @@ from copy import deepcopy ...@@ -21,12 +9,13 @@ from copy import deepcopy
from typing import AsyncGenerator from typing import AsyncGenerator
import msgspec import msgspec
from protocol import MyRequestOutput
from vllm.inputs import TokensPrompt from vllm.inputs import TokensPrompt
from vllm.sampling_params import SamplingParams from vllm.sampling_params import SamplingParams
from dynamo.runtime.logging import configure_dynamo_logging from dynamo.runtime.logging import configure_dynamo_logging
from .protocol import MyRequestOutput
configure_dynamo_logging() configure_dynamo_logging()
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import asyncio import asyncio
import logging import logging
...@@ -19,9 +7,6 @@ import os ...@@ -19,9 +7,6 @@ import os
import signal import signal
import uvloop import uvloop
from args import Config, configure_ports_with_etcd, overwrite_args, parse_args
from handlers import DecodeWorkerHandler, PrefillWorkerHandler
from publisher import StatLoggerFactory
from vllm.distributed.kv_events import ZmqEventPublisher from vllm.distributed.kv_events import ZmqEventPublisher
from vllm.usage.usage_lib import UsageContext from vllm.usage.usage_lib import UsageContext
from vllm.v1.engine.async_llm import AsyncLLM from vllm.v1.engine.async_llm import AsyncLLM
...@@ -35,6 +20,10 @@ from dynamo.llm import ( ...@@ -35,6 +20,10 @@ from dynamo.llm import (
from dynamo.runtime import DistributedRuntime, dynamo_worker from dynamo.runtime import DistributedRuntime, dynamo_worker
from dynamo.runtime.logging import configure_dynamo_logging from dynamo.runtime.logging import configure_dynamo_logging
from .args import Config, configure_ports_with_etcd, overwrite_args, parse_args
from .handlers import DecodeWorkerHandler, PrefillWorkerHandler
from .publisher import StatLoggerFactory
configure_dynamo_logging() configure_dynamo_logging()
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
...@@ -211,6 +200,9 @@ async def init(runtime: DistributedRuntime, config: Config): ...@@ -211,6 +200,9 @@ async def init(runtime: DistributedRuntime, config: Config):
handler.cleanup() handler.cleanup()
def main():
uvloop.run(worker())
if __name__ == "__main__": if __name__ == "__main__":
uvloop.install() main()
asyncio.run(worker())
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment