Commit 7a353e61 authored by Ryan McCormick, committed by GitHub

feat: Add experimental WideEP + EPLB dis-aggregated example for TRTLLM (#1690)


Co-authored-by: tanmayv25 <tanmay2592@gmail.com>
parent 47e7fde7
...@@ -25,11 +25,7 @@ max_num_tokens: 8192 ...@@ -25,11 +25,7 @@ max_num_tokens: 8192
max_seq_len: 8192 max_seq_len: 8192
kv_cache_config: kv_cache_config:
# With dp attention disabled: high free_gpu_memory_fraction is fine.
free_gpu_memory_fraction: 0.75 free_gpu_memory_fraction: 0.75
# With dp attention enabled: large ISL at high concurrency may need
# free_gpu_memory_fraction low to have enough available memory.
# free_gpu_memory_fraction: 0.30
# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 # NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
# NOTE: overlap_scheduler enabled by default since this commit and changed # NOTE: overlap_scheduler enabled by default since this commit and changed
......
...@@ -10,8 +10,13 @@ the set of nodes need to be launched together in the same MPI world, such as ...@@ -10,8 +10,13 @@ the set of nodes need to be launched together in the same MPI world, such as
via `mpirun` or `srun`. This is true regardless of whether the worker is via `mpirun` or `srun`. This is true regardless of whether the worker is
aggregated, prefill-only, or decode-only. aggregated, prefill-only, or decode-only.
In this document we will demonstrate an example of launching a multi-node TP16/EP16 In this document we will demonstrate two examples launching multinode workers
aggregated worker on a slurm cluster with `srun`. on a slurm cluster with `srun`:
1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16
worker across 4 GB200 nodes
2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node
TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode
worker (4 nodes) across a total of 8 GB200 nodes.
NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and
`start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or `start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or
...@@ -25,7 +30,7 @@ For simplicity of the example, we will make some assumptions about your slurm cl ...@@ -25,7 +30,7 @@ For simplicity of the example, we will make some assumptions about your slurm cl
testing, you should aim to allocate groups of nodes that are performantly testing, you should aim to allocate groups of nodes that are performantly
inter-connected, such as those in an NVL72 setup. inter-connected, such as those in an NVL72 setup.
2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis) 2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
SPANK plugin setup. In particular, the `srun_script.sh` script in this SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this
example will use `srun` arguments like `--container-image`, example will use `srun` arguments like `--container-image`,
`--container-mounts`, and `--container-env` that are added to `srun` by Pyxis. `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
If your cluster supports similar container based plugins, you may be able to If your cluster supports similar container based plugins, you may be able to
...@@ -34,10 +39,14 @@ For simplicity of the example, we will make some assumptions about your slurm cl ...@@ -34,10 +39,14 @@ For simplicity of the example, we will make some assumptions about your slurm cl
described [here](https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker). described [here](https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker).
This is the image that can be set to the `IMAGE` environment variable in later steps. This is the image that can be set to the `IMAGE` environment variable in later steps.
4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We 4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
will allocate 4 nodes below as a reference command. This is technically not will allocate 8 nodes below as a reference command to have enough capacity
a requirement, but makes iterations of testing/experimenting easier when to run both examples. If you plan to only run the aggregated example, you
you have a reserved set of nodes for a period of time. Make sure to set your will only need 4 nodes. If you customize the configurations to require a
`PARTITION` and `ACCOUNT` according to your slurm cluster setup: different number of nodes, you can adjust the number of allocated nodes
accordingly. Pre-allocating nodes is technically not a requirement,
but it makes iterations of testing/experimenting easier.
Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
```bash ```bash
# Set partition manually based on your slurm cluster's partition names # Set partition manually based on your slurm cluster's partition names
PARTITION="" PARTITION=""
...@@ -48,20 +57,21 @@ For simplicity of the example, we will make some assumptions about your slurm cl ...@@ -48,20 +57,21 @@ For simplicity of the example, we will make some assumptions about your slurm cl
--account="${ACCOUNT}" \ --account="${ACCOUNT}" \
--job-name="${ACCOUNT}-dynamo.trtllm" \ --job-name="${ACCOUNT}-dynamo.trtllm" \
-t 05:00:00 \ -t 05:00:00 \
--nodes 4 --nodes 8
``` ```
5. Lastly, we will assume you are inside an interactive shell on one of your allocated 5. Lastly, we will assume you are inside an interactive shell on one of your allocated
nodes, which should be the default behavior after executing the `salloc` command above. nodes, which may be the default behavior after executing the `salloc` command above
If not, then you should SSH into one of the allocated nodes. depending on the cluster setup. If not, then you should SSH into one of the allocated nodes.
## Launching Slurm Jobs ### Environment Variable Setup
This example aims to automate as much of the environment setup as possible, This example aims to automate as much of the environment setup as possible,
but all slurm clusters and environments are different, and you may need to but all slurm clusters and environments are different, and you may need to
dive into the scripts to make modifications based on your specific environment. dive into the scripts to make modifications based on your specific environment.
Assuming you have already allocated at least 4 nodes via `salloc`, and are Assuming you have already allocated your nodes via `salloc`, and are
inside an interactive shell on one of the allocated nodes: inside an interactive shell on one of the allocated nodes, set the
following environment variables:
```bash ```bash
# NOTE: IMAGE must be set manually for now # NOTE: IMAGE must be set manually for now
# To build an image, see the steps here: # To build an image, see the steps here:
...@@ -77,7 +87,7 @@ export IMAGE="<dynamo_trtllm_image>" ...@@ -77,7 +87,7 @@ export IMAGE="<dynamo_trtllm_image>"
# #
# NOTE: Currently, this example assumes that the local bash scripts and configs # NOTE: Currently, this example assumes that the local bash scripts and configs
# referenced are mounted into /mnt inside the container. If you want to # referenced are mounted into /mnt inside the container. If you want to
# customize the location of the scripts, make sure to modify `srun_script.sh` # customize the location of the scripts, make sure to modify `srun_aggregated.sh`
# accordingly for the new locations of `start_frontend_services.sh` and # accordingly for the new locations of `start_frontend_services.sh` and
# `start_trtllm_worker.sh`. # `start_trtllm_worker.sh`.
# #
...@@ -105,28 +115,68 @@ export MODEL_PATH="nvidia/DeepSeek-R1-FP4" ...@@ -105,28 +115,68 @@ export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
# By default this is inferred from MODEL_PATH, but when using locally downloaded # By default this is inferred from MODEL_PATH, but when using locally downloaded
# model weights, it can be nice to have explicit control over the name. # model weights, it can be nice to have explicit control over the name.
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4" export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
```
## Aggregated WideEP
Assuming you have at least 4 nodes allocated following the setup steps above,
follow these steps below to launch an **aggregated** deployment across 4 nodes:
# NOTE: This path assumes you have mounted the config file into /mnt inside ```bash
# the container. See the MOUNTS variable in srun_script.sh # Default set in srun_aggregated.sh, but can customize here.
export ENGINE_CONFIG="/mnt/agg_DEP16_dsr1.yaml" # export ENGINE_CONFIG="/mnt/engine_configs/wide_ep_agg.yaml"
# Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG # Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG
# The produce of NUM_NODES*NUM_GPUS_PER_NODE should match the number of # The product of NUM_NODES*NUM_GPUS_PER_NODE should match the number of
# total GPUs necessary to satisfy the requested parallelism. For example, # total GPUs necessary to satisfy the requested parallelism. For example,
# 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16. # 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16.
export NUM_NODES=4 # export NUM_NODES=4
# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
# export NUM_GPUS_PER_NODE=4
# Launches:
# - frontend + etcd/nats on current (head) node
# - one large aggregated trtllm worker across multiple nodes via MPI tasks
./srun_aggregated.sh
```
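The GPU arithmetic described in the comments above can be checked before launching. The following is a minimal sketch, not part of the example scripts: `check_gpu_count` is a hypothetical helper, and the required GPU count is passed in by hand on the assumption that it mirrors `tensor_parallel_size` in your `ENGINE_CONFIG`.

```bash
#!/bin/bash
# Hypothetical pre-flight helper (not part of srun_aggregated.sh).
# Verifies that nodes * gpus-per-node equals the GPU count the engine
# config is assumed to need (for example, 16 for TP16/EP16).
check_gpu_count() {
  local num_nodes="$1" gpus_per_node="$2" required_gpus="$3"
  local total_gpus=$(( num_nodes * gpus_per_node ))
  if (( total_gpus != required_gpus )); then
    echo "ERROR: ${num_nodes} nodes x ${gpus_per_node} gpus/node =" \
         "${total_gpus} gpus, but the config needs ${required_gpus}" >&2
    return 1
  fi
  echo "OK: ${total_gpus} gpus available for the requested parallelism"
}

# Example usage with the defaults from this example:
#   check_gpu_count "${NUM_NODES:-4}" "${NUM_GPUS_PER_NODE:-4}" 16
```

Running a check like this before `./srun_aggregated.sh` surfaces a node count mismatch immediately rather than after the worker has started loading.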
## Disaggregated WideEP
Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode)
following the setup above, follow these steps below to launch a **disaggregated**
deployment across 8 nodes:
> [!Tip]
> Make sure you have a fresh environment and that the aggregated
> example above is no longer deployed on the same set of nodes.
```bash
# Defaults set in srun_disaggregated.sh, but can customize here.
# export PREFILL_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_prefill.yaml"
# export DECODE_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_decode.yaml"
# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG
# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG
# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and
# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of
# GPUs necessary to satisfy the requested parallelism in each config.
# export NUM_PREFILL_NODES=4
# export NUM_DECODE_NODES=4
# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this. # GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
export NUM_GPUS_PER_NODE=4 # export NUM_GPUS_PER_NODE=4
# Launches frontend + etcd/nats on current (head) node. # Launches:
# Launches one large trtllm worker across multiple nodes via MPI tasks. # - frontend + etcd/nats on current (head) node.
./srun_script.sh # - one large prefill trtllm worker across multiple nodes via MPI tasks
# - one large decode trtllm worker across multiple nodes via MPI tasks
./srun_disaggregated.sh
``` ```
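Because the prefill and decode workers each take their own group of nodes, the allocation has to cover the sum of both. A minimal sketch of that check follows; `check_node_budget` is a hypothetical helper, not part of `srun_disaggregated.sh`, and the usage comment assumes you are inside the `salloc` allocation, where Slurm sets `SLURM_JOB_NUM_NODES`.

```bash
#!/bin/bash
# Hypothetical pre-flight helper (not part of srun_disaggregated.sh).
# Verifies that the allocation has enough nodes for both workers.
check_node_budget() {
  local prefill_nodes="$1" decode_nodes="$2" allocated_nodes="$3"
  local needed=$(( prefill_nodes + decode_nodes ))
  if (( needed > allocated_nodes )); then
    echo "ERROR: need ${needed} nodes (${prefill_nodes} prefill +" \
         "${decode_nodes} decode), but only ${allocated_nodes} allocated" >&2
    return 1
  fi
  echo "OK: ${needed} of ${allocated_nodes} allocated nodes will be used"
}

# Example usage inside an salloc session:
#   check_node_budget "${NUM_PREFILL_NODES:-4}" "${NUM_DECODE_NODES:-4}" \
#     "${SLURM_JOB_NUM_NODES}"
```

With the defaults in this example (4 prefill nodes and 4 decode nodes), the check passes on the 8-node allocation and fails on a 4-node one.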
## Understanding the Output ## Understanding the Output
1. The `srun_script.sh` launches two `srun` jobs. The first launches 1. The `srun_aggregated.sh` launches two `srun` jobs. The first launches
etcd, NATS, and the OpenAI frontend on the head node only etcd, NATS, and the OpenAI frontend on the head node only
called "node1" in the example output below. The second launches called "node1" in the example output below. The second launches
a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, each node a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, each node
...@@ -168,7 +218,9 @@ export NUM_GPUS_PER_NODE=4 ...@@ -168,7 +218,9 @@ export NUM_GPUS_PER_NODE=4
``` ```
5. At this point, with the worker fully initialized and detected by the frontend, 5. At this point, with the worker fully initialized and detected by the frontend,
it is now ready for inference. it is now ready for inference.
6. `srun_disaggregated.sh` follows a very similar flow, but launches
three `srun` jobs instead of two: one for the frontend, one for the prefill
worker, and one for the decode worker.
## Example Request ## Example Request
...@@ -195,7 +247,8 @@ curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \ ...@@ -195,7 +247,8 @@ curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
## Cleanup ## Cleanup
To clean up background `srun` processes launched by `srun_script.sh`, you can run: To clean up background `srun` processes launched by `srun_aggregated.sh` or
`srun_disaggregated.sh`, you can run:
```bash ```bash
pkill srun pkill srun
``` ```
...@@ -209,3 +262,14 @@ pkill srun ...@@ -209,3 +262,14 @@ pkill srun
serving example will be added in the near future. serving example will be added in the near future.
- WideEP configs in this directory are still being tested. A WideEP specific - WideEP configs in this directory are still being tested. A WideEP specific
example with documentation will be added once ready. example with documentation will be added once ready.
- There are known issues where WideEP workers may not cleanly shut down:
- This may lead to leftover shared memory files in `/dev/shm/moe_*`. For
now, you must manually clean these up before deploying again on the
same set of nodes.
- Similarly, there may be GPU memory left in-use after killing the `srun`
jobs. After cleaning up any leftover shared memory files as described
above, the GPU memory may slowly come back. You can run `watch nvidia-smi`
to check on this behavior. If you don't free the GPU memory before the
next deployment, you may get a CUDA OOM error while loading the model.
- There is mention of this issue in the relevant TRT-LLM blog
[here](https://github.com/NVIDIA/TensorRT-LLM/blob/6021a439ab9c29f4c46f721eeb59f6b992c425ea/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md#miscellaneous).
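The manual shared memory cleanup mentioned above could be scripted along these lines. This is a minimal sketch under stated assumptions: `cleanup_moe_shm` is a hypothetical helper, not something shipped with this example, and it only removes files matching `moe_*` in the directory it is given (`/dev/shm` by default).

```bash
#!/bin/bash
# Hypothetical cleanup helper (not part of the example scripts).
# Removes leftover WideEP shared memory files matching moe_* from the
# given directory, defaulting to /dev/shm.
cleanup_moe_shm() {
  local shm_dir="${1:-/dev/shm}"
  local f
  for f in "${shm_dir}"/moe_*; do
    # An unmatched glob stays literal, so skip paths that do not exist.
    [[ -e "${f}" ]] || continue
    echo "Removing leftover file: ${f}"
    rm -f -- "${f}"
  done
}

# Example usage on a node that previously hosted a WideEP worker:
#   cleanup_moe_shm
```

The helper needs to run on every node that hosted a worker. How you reach each node (SSH, `srun`, or something else) depends on your cluster, and you can still use `watch nvidia-smi` afterwards to confirm that GPU memory is being released.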
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
#
# Example of a Multi-node worker, but no WideEP or EPLB.
# See wide_ep*.yaml for WideEP example configs.
backend: pytorch backend: pytorch
tensor_parallel_size: 16 tensor_parallel_size: 16
moe_expert_parallel_size: 16 moe_expert_parallel_size: 16
......
...@@ -10,11 +10,9 @@ moe_backend: WideEP ...@@ -10,11 +10,9 @@ moe_backend: WideEP
# moe_max_num_tokens = max_batch_size * moe_expert_parallel_size # moe_max_num_tokens = max_batch_size * moe_expert_parallel_size
# 4096 = 256 * 16 # 4096 = 256 * 16
# moe_max_num_tokens: 4096 # moe_max_num_tokens: 4096
moe_load_balancer: /mnt/eplb.yaml moe_load_balancer: /mnt/engine_configs/eplb.yaml
# 36 TP/EP following example from: tensor_parallel_size: 16
# https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ep_load_balancer/README.md moe_expert_parallel_size: 16
tensor_parallel_size: 36
moe_expert_parallel_size: 36
enable_attention_dp: true enable_attention_dp: true
max_batch_size: 256 max_batch_size: 256
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
backend: pytorch
# WideEP related settings
moe_backend: WideEP
moe_load_balancer: /mnt/engine_configs/eplb.yaml
# TP/EP/PP/DP
tensor_parallel_size: 16
moe_expert_parallel_size: 16
pipeline_parallel_size: 1
enable_attention_dp: true
max_batch_size: 256
max_num_tokens: 256
# 8448 = 8192 ISL + 256 OSL
max_seq_len: 8448
kv_cache_config:
# With dp attention disabled: high free_gpu_memory_fraction is fine.
# free_gpu_memory_fraction: 0.85
# With dp attention enabled: large ISL at high concurrency may need
# free_gpu_memory_fraction low to have enough available memory.
free_gpu_memory_fraction: 0.30
# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
# NOTE: overlap_scheduler enabled by default since this commit and changed
# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
disable_overlap_scheduler: false
use_cuda_graph: true
cuda_graph_padding_enabled: true
# NOTE: For larger max batch size, you may want to add larger cuda graph
# batch sizes below to match.
cuda_graph_batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 256
print_iter_log: true
kv_cache_dtype: fp8
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
backend: pytorch
# WideEP related settings
moe_backend: WideEP
moe_load_balancer: /mnt/engine_configs/eplb.yaml
# TP/EP/PP/DP
tensor_parallel_size: 16
moe_expert_parallel_size: 16
pipeline_parallel_size: 1
enable_attention_dp: true
max_batch_size: 1
max_num_tokens: 8192
max_seq_len: 8192
kv_cache_config:
free_gpu_memory_fraction: 0.75
# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
# NOTE: overlap_scheduler enabled by default since this commit and changed
# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
disable_overlap_scheduler: true
print_iter_log: true
# NOTE: This dtype must match in both prefill/decode configs
kv_cache_dtype: fp8
...@@ -18,6 +18,8 @@ MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}" ...@@ -18,6 +18,8 @@ MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}"
NUM_NODES=${NUM_NODES:-4} NUM_NODES=${NUM_NODES:-4}
NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4} NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4}
export ENGINE_CONFIG="${ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_agg.yaml}"
# Automate settings of certain variables for convenience, but you are free # Automate settings of certain variables for convenience, but you are free
# to manually set these for more control as well. # to manually set these for more control as well.
ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)" ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
...@@ -55,12 +57,14 @@ srun \ ...@@ -55,12 +57,14 @@ srun \
# in practice you would probably set `srun --output ... --error ...` to pipe # in practice you would probably set `srun --output ... --error ...` to pipe
# the stdout/stderr to files. # the stdout/stderr to files.
echo "Launching multi-node worker in background." echo "Launching multi-node worker in background."
# Leaving TASK empty omits --task, so the worker defaults to aggregated mode
TASK="" \
srun \ srun \
--mpi pmix \ --mpi pmix \
--oversubscribe \ --oversubscribe \
--container-image "${IMAGE}" \ --container-image "${IMAGE}" \
--container-mounts "${MOUNTS}" \ --container-mounts "${MOUNTS}" \
--container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE \ --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \
--verbose \ --verbose \
--label \ --label \
-A "${ACCOUNT}" \ -A "${ACCOUNT}" \
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# This is one of the only variables that must be set currently, most of the rest may
# just work out of the box if following the steps in the README.
IMAGE="${IMAGE:-""}"
# Set to mount current host directory to /mnt inside the container as an example,
# but you may freely customize the mounts based on your cluster. A common practice
# is to mount paths to NFS storage for common scripts, model weights, etc.
# NOTE: This can be a comma separated list of multiple mounts as well.
DEFAULT_MOUNT="${PWD}:/mnt"
MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}"
NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4}
NUM_PREFILL_NODES=${NUM_PREFILL_NODES:-4}
PREFILL_ENGINE_CONFIG="${PREFILL_ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_prefill.yaml}"
NUM_DECODE_NODES=${NUM_DECODE_NODES:-4}
DECODE_ENGINE_CONFIG="${DECODE_ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_decode.yaml}"
# Automate settings of certain variables for convenience, but you are free
# to manually set these for more control as well.
ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
export HEAD_NODE="${SLURMD_NODENAME}"
export HEAD_NODE_IP="$(hostname -i)"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
if [[ -z ${IMAGE} ]]; then
echo "ERROR: You need to set the IMAGE environment variable to the " \
"Dynamo+TRTLLM docker image or .sqsh file from 'enroot import' " \
"See how to build one from source here: " \
"https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker"
exit 1
fi
# NOTE: Output streamed to stdout for ease of understanding the example, but
# in practice you would probably set `srun --output ... --error ...` to pipe
# the stdout/stderr to files.
echo "Launching frontend services in background."
srun \
--overlap \
--container-image "${IMAGE}" \
--container-mounts "${MOUNTS}" \
--verbose \
--label \
-A "${ACCOUNT}" \
-J "${ACCOUNT}-dynamo.trtllm" \
--nodelist "${HEAD_NODE}" \
--nodes 1 \
--jobid "${SLURM_JOB_ID}" \
/mnt/start_frontend_services.sh &
# NOTE: Output streamed to stdout for ease of understanding the example, but
# in practice you would probably set `srun --output ... --error ...` to pipe
# the stdout/stderr to files.
echo "Launching multi-node prefill worker in background."
TASK=prefill \
ENGINE_CONFIG=${PREFILL_ENGINE_CONFIG} \
srun \
--mpi pmix \
--oversubscribe \
--container-image "${IMAGE}" \
--container-mounts "${MOUNTS}" \
--container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \
--verbose \
--label \
-A "${ACCOUNT}" \
-J "${ACCOUNT}-dynamo.trtllm" \
--nodes "${NUM_PREFILL_NODES}" \
--ntasks-per-node "${NUM_GPUS_PER_NODE}" \
--jobid "${SLURM_JOB_ID}" \
/mnt/start_trtllm_worker.sh &
echo "Launching multi-node decode worker in background."
TASK=decode \
ENGINE_CONFIG=${DECODE_ENGINE_CONFIG} \
srun \
--mpi pmix \
--oversubscribe \
--container-image "${IMAGE}" \
--container-mounts "${MOUNTS}" \
--container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \
--verbose \
--label \
-A "${ACCOUNT}" \
-J "${ACCOUNT}-dynamo.trtllm" \
--nodes "${NUM_DECODE_NODES}" \
--ntasks-per-node "${NUM_GPUS_PER_NODE}" \
--jobid "${SLURM_JOB_ID}" \
/mnt/start_trtllm_worker.sh &
...@@ -22,6 +22,18 @@ if [[ -z ${ENGINE_CONFIG} ]]; then ...@@ -22,6 +22,18 @@ if [[ -z ${ENGINE_CONFIG} ]]; then
exit 1 exit 1
fi fi
EXTRA_ARGS=""
if [[ -n ${TASK} ]]; then
EXTRA_ARGS+="--task ${TASK}"
fi
# NOTE: When this script is run directly from srun, the environment variables
# for TRTLLM KV cache are not set. So we need to set them here.
# Related issue: https://github.com/ai-dynamo/dynamo/issues/1743
if [[ -z ${TRTLLM_USE_UCX_KVCACHE} ]] && [[ -z ${TRTLLM_USE_NIXL_KVCACHE} ]]; then
export TRTLLM_USE_UCX_KVCACHE=1
fi
# NOTE: trtllm_inc.py is a standalone python script that launches a Dynamo+TRTLLM # NOTE: trtllm_inc.py is a standalone python script that launches a Dynamo+TRTLLM
# worker and registers itself with the runtime. It is currently easier to wrap # worker and registers itself with the runtime. It is currently easier to wrap
# this standalone script with `trtllm-llmapi-launch` for MPI handling purposes, # this standalone script with `trtllm-llmapi-launch` for MPI handling purposes,
...@@ -30,4 +42,5 @@ trtllm-llmapi-launch \ ...@@ -30,4 +42,5 @@ trtllm-llmapi-launch \
python3 /workspace/launch/dynamo-run/src/subprocess/trtllm_inc.py \ python3 /workspace/launch/dynamo-run/src/subprocess/trtllm_inc.py \
--model-path "${MODEL_PATH}" \ --model-path "${MODEL_PATH}" \
--model-name "${SERVED_MODEL_NAME}" \ --model-name "${SERVED_MODEL_NAME}" \
--extra-engine-args "${ENGINE_CONFIG}" --extra-engine-args "${ENGINE_CONFIG}" \
${EXTRA_ARGS}