feat: Add experimental WideEP + EPLB dis-aggregated example for TRTLLM (#1690)

Co-authored-by: tanmayv25 <tanmay2592@gmail.com>

Unverified Commit 7a353e61 authored Jul 04, 2025 by Ryan McCormick Committed by GitHub Jul 03, 2025
10 changed files
--- a/examples/tensorrt_llm/configs/deepseek_r1/engine_configs/prefill_config.yaml
+++ b/examples/tensorrt_llm/configs/deepseek_r1/engine_configs/prefill_config.yaml
@@ -25,11 +25,7 @@ max_num_tokens: 8192
 max_seq_len: 8192
 kv_cache_config:
-  # With dp attention disabled: high free_gpu_memory_fraction is fine.
  free_gpu_memory_fraction: 0.75
-  # With dp attention enabled: large ISL at high concurrency may need
-  # free_gpu_memory_fraction low to have enough available memory.
-  # free_gpu_memory_fraction: 0.30
 # NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
 # NOTE: overlap_scheduler enabled by default since this commit and changed

--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/README.md
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/README.md
@@ -10,8 +10,13 @@ the set of nodes need to be launched together in the same MPI world, such as
 via `mpirun` or `srun`. This is true regardless of whether the worker is
 aggregated, prefill-only, or decode-only.
-In this document we will demonstrate an example of launching a multi-node TP16/EP16
+In this document we will demonstrate two examples of launching multi-node workers
-aggregated worker on a slurm cluster with `srun`.
+on a slurm cluster with `srun`:
+1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16
+   worker across 4 GB200 nodes
+2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node
+   TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode
+   worker (4 nodes) across a total of 8 GB200 nodes.
 NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and
 `start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or
@@ -25,7 +30,7 @@ For simplicity of the example, we will make some assumptions about your slurm cl
   testing, you should aim to allocate groups of nodes that are performantly
   inter-connected, such as those in an NVL72 setup.
 2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
-   SPANK plugin setup. In particular, the `srun_script.sh` script in this
+   SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this
   example will use `srun` arguments like `--container-image`,
   `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
   If your cluster supports similar container based plugins, you may be able to
@@ -34,10 +39,14 @@ For simplicity of the example, we will make some assumptions about your slurm cl
   described [here](https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker).
   This is the image that can be set to the `IMAGE` environment variable in later steps.
 4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
-   will allocate 4 nodes below as a reference command. This is technically not
+   will allocate 8 nodes below as a reference command to have enough capacity
-   a requirement, but makes iterations of testing/experimenting easier when
+   to run both examples. If you plan to only run the aggregated example, you
-   you have a reserved set of nodes for a period of time. Make sure to set your
+   will only need 4 nodes. If you customize the configurations to require a
-   `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
+   different number of nodes, you can adjust the number of allocated nodes
+   accordingly. Pre-allocating nodes is technically not a requirement,
+   but it makes iterations of testing/experimenting easier.
+   Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
    ```bash
    # Set partition manually based on your slurm cluster's partition names
    PARTITION=""
@@ -48,20 +57,21 @@ For simplicity of the example, we will make some assumptions about your slurm cl
      --account="${ACCOUNT}" \
      --job-name="${ACCOUNT}-dynamo.trtllm" \
      -t 05:00:00 \
-      --nodes 4
+      --nodes 8
    ```
 5. Lastly, we will assume you are inside an interactive shell on one of your allocated
-   nodes, which should be the default behavior after executing the `salloc` command above.
+   nodes, which may be the default behavior after executing the `salloc` command above
-   If not, then you should SSH into one of the allocated nodes.
+   depending on the cluster setup. If not, then you should SSH into one of the allocated nodes.
-## Launching Slurm Jobs
+### Environment Variable Setup
 This example aims to automate as much of the environment setup as possible,
 but all slurm clusters and environments are different, and you may need to
 dive into the scripts to make modifications based on your specific environment.
-Assuming you have already allocated at least 4 nodes via `salloc`, and are
+Assuming you have already allocated your nodes via `salloc`, and are
-inside an interactive shell on one of the allocated nodes:
+inside an interactive shell on one of the allocated nodes, set the
+following environment variables:
 ```bash
 # NOTE: IMAGE must be set manually for now
 # To build an image, see the steps here:
@@ -77,7 +87,7 @@ export IMAGE="<dynamo_trtllm_image>"
 #
 # NOTE: Currently, this example assumes that the local bash scripts and configs
 # referenced are mounted into /mnt inside the container. If you want to
-# customize the location of the scripts, make sure to modify `srun_script.sh`
+# customize the location of the scripts, make sure to modify `srun_aggregated.sh`
 # accordingly for the new locations of `start_frontend_services.sh` and
 # `start_trtllm_worker.sh`.
 #
@@ -105,28 +115,68 @@ export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
 # By default this is inferred from MODEL_PATH, but when using locally downloaded
 # model weights, it can be nice to have explicit control over the name.
 export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
+```
+## Aggregated WideEP
+Assuming you have at least 4 nodes allocated following the setup steps above,
+follow the steps below to launch an **aggregated** deployment across 4 nodes:
-# NOTE: This path assumes you have mounted the config file into /mnt inside
+```bash
-# the container. See the MOUNTS variable in srun_script.sh
+# Default set in srun_aggregated.sh, but can customize here.
-export ENGINE_CONFIG="/mnt/agg_DEP16_dsr1.yaml"
+# export ENGINE_CONFIG="/mnt/engine_configs/wide_ep_agg.yaml"
 # Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG
-# The produce of NUM_NODES*NUM_GPUS_PER_NODE should match the number of
+# The product of NUM_NODES*NUM_GPUS_PER_NODE should match the number of
 # total GPUs necessary to satisfy the requested parallelism. For example,
 # 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16.
-export NUM_NODES=4
+# export NUM_NODES=4
+# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
+# export NUM_GPUS_PER_NODE=4
+# Launches:
+# - frontend + etcd/nats on current (head) node
+# - one large aggregated trtllm worker across multiple nodes via MPI tasks
+./srun_aggregated.sh
+```
+## Disaggregated WideEP
+Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode)
+following the setup above, follow the steps below to launch a **disaggregated**
+deployment across 8 nodes:
+> [!Tip]
+> Make sure you have a fresh environment and that the aggregated
+> example above is no longer deployed on the same set of nodes.
+```bash
+# Defaults set in srun_disaggregated.sh, but can customize here.
+# export PREFILL_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_prefill.yaml"
+# export DECODE_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_decode.yaml"
+# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG
+# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG
+# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and
+# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of
+# GPUs necessary to satisfy the requested parallelism in each config.
+# export NUM_PREFILL_NODES=4
+# export NUM_DECODE_NODES=4
 # GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
-export NUM_GPUS_PER_NODE=4
+# export NUM_GPUS_PER_NODE=4
-# Launches frontend + etcd/nats on current (head) node.
+# Launches:
-# Launches one large trtllm worker across multiple nodes via MPI tasks.
+# - frontend + etcd/nats on current (head) node.
-./srun_script.sh
+# - one large prefill trtllm worker across multiple nodes via MPI tasks
+# - one large decode trtllm worker across multiple nodes via MPI tasks
+./srun_disaggregated.sh
 ```
 ## Understanding the Output
-1. The `srun_script.sh` launches two `srun` jobs. The first launches
+1. The `srun_aggregated.sh` launches two `srun` jobs. The first launches
   etcd, NATS, and the OpenAI frontend on the head node only
   called "node1" in the example output below. The second launches
   a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, each node
@@ -168,7 +218,9 @@ export NUM_GPUS_PER_NODE=4
    ```
 5. At this point, with the worker fully initialized and detected by the frontend,
   it is now ready for inference.
+6. `srun_disaggregated.sh` follows a very similar flow, but launches three
+   `srun` jobs instead of two: one for the frontend, one for the prefill worker,
+   and one for the decode worker.
 ## Example Request
@@ -195,7 +247,8 @@ curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
 ## Cleanup
-To cleanup background `srun` processes launched by `srun_script.sh`, you can run:
+To clean up background `srun` processes launched by `srun_aggregated.sh` or
+`srun_disaggregated.sh`, you can run:
 ```bash
 pkill srun
 ```
@@ -209,3 +262,14 @@ pkill srun
  serving example will be added in the near future.
 - WideEP configs in this directory are still being tested. A WideEP specific
  example with documentation will be added once ready.
+- There are known issues where WideEP workers may not cleanly shut down:
+    - This may lead to leftover shared memory files in `/dev/shm/moe_*`. For
+      now, you must manually clean these up before deploying again on the
+      same set of nodes.
+    - Similarly, there may be GPU memory left in-use after killing the `srun`
+      jobs. After cleaning up any leftover shared memory files as described
+      above, the GPU memory may slowly come back. You can run `watch nvidia-smi`
+      to check on this behavior. If you don't free the GPU memory before the
+      next deployment, you may get a CUDA OOM error while loading the model.
+    - There is mention of this issue in the relevant TRT-LLM blog
+      [here](https://github.com/NVIDIA/TensorRT-LLM/blob/6021a439ab9c29f4c46f721eeb59f6b992c425ea/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md#miscellaneous).
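The manual cleanup described in the known issues above could be scripted roughly as follows. This is a minimal sketch, not part of the commit: the function names and the optional directory argument are hypothetical, and only the `/dev/shm/moe_*` pattern comes from the README. It assumes the leftover files are regular files directly under the shared memory directory.

```bash
#!/bin/bash
# Hypothetical helper: list leftover WideEP shared memory files.
# The moe_* pattern comes from the README note; the directory argument
# (default /dev/shm) is an assumption added to make the helper testable.
list_leftover_moe_shm() {
  local shm_dir="${1:-/dev/shm}"
  # Only look at regular files directly inside the directory.
  find "${shm_dir}" -mindepth 1 -maxdepth 1 -type f -name 'moe_*' 2>/dev/null | sort
}

# Hypothetical helper: remove every file reported by the listing above.
clean_leftover_moe_shm() {
  local shm_dir="${1:-/dev/shm}"
  local f
  list_leftover_moe_shm "${shm_dir}" | while IFS= read -r f; do
    echo "Removing ${f}"
    rm -f -- "${f}"
  done
}
```

Shared memory files are local to each node, so a helper like this would presumably need to run on every node that hosted a worker, not only on the head node. Running `list_leftover_moe_shm` first shows what would be deleted before calling `clean_leftover_moe_shm`.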
--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/agg_DEP16_dsr1.yaml
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/agg_DEP16_dsr1.yaml
 # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
+#
+# Example of a Multi-node worker, but no WideEP or EPLB.
+# See wide_ep*.yaml for WideEP example configs.
 backend: pytorch
 tensor_parallel_size: 16
 moe_expert_parallel_size: 16

--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/eplb.yaml
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/eplb.yaml
--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/agg_wide_ep.yaml
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/agg_wide_ep.yaml
@@ -10,11 +10,9 @@ moe_backend: WideEP
 #   moe_max_num_tokens = max_batch_size * moe_expert_parallel_size
 #   4096 = 256 * 16
 # moe_max_num_tokens: 4096
-moe_load_balancer: /mnt/eplb.yaml
+moe_load_balancer: /mnt/engine_configs/eplb.yaml
-# 36 TP/EP following example from:
+tensor_parallel_size: 16
-# https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ep_load_balancer/README.md
+moe_expert_parallel_size: 16
-tensor_parallel_size: 36
-moe_expert_parallel_size: 36
 enable_attention_dp: true
 max_batch_size: 256
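The commented-out `moe_max_num_tokens` value in this config follows the formula stated in its comment. A small sketch of that arithmetic, with a hypothetical function name, may help when changing `max_batch_size` or `moe_expert_parallel_size`:

```bash
#!/bin/bash
# Hypothetical helper: compute moe_max_num_tokens from the formula in the
# config comment: max_batch_size * moe_expert_parallel_size.
moe_max_num_tokens() {
  local max_batch_size="$1" moe_expert_parallel_size="$2"
  echo $(( max_batch_size * moe_expert_parallel_size ))
}

# Values from this config: max_batch_size=256, moe_expert_parallel_size=16
# moe_max_num_tokens 256 16   # prints 4096
```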

--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+backend: pytorch
+# WideEP related settings
+moe_backend: WideEP
+moe_load_balancer: /mnt/engine_configs/eplb.yaml
+# TP/EP/PP/DP
+tensor_parallel_size: 16
+moe_expert_parallel_size: 16
+pipeline_parallel_size: 1
+enable_attention_dp: true
+max_batch_size: 256
+max_num_tokens: 256
+# 8448 = 8192 ISL + 256 OSL
+max_seq_len: 8448
+kv_cache_config:
+  # With dp attention disabled: high free_gpu_memory_fraction is fine.
+  # free_gpu_memory_fraction: 0.85
+  # With dp attention enabled: large ISL at high concurrency may need
+  # free_gpu_memory_fraction low to have enough available memory.
+  free_gpu_memory_fraction: 0.30
+# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
+# NOTE: overlap_scheduler enabled by default since this commit and changed
+# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
+# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
+disable_overlap_scheduler: false
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+# NOTE: For larger max batch size, you may want to add larger cuda graph
+# batch sizes below to match.
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+print_iter_log: true
+kv_cache_dtype: fp8
--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+backend: pytorch
+# WideEP related settings
+moe_backend: WideEP
+moe_load_balancer: /mnt/engine_configs/eplb.yaml
+# TP/EP/PP/DP
+tensor_parallel_size: 16
+moe_expert_parallel_size: 16
+pipeline_parallel_size: 1
+enable_attention_dp: true
+max_batch_size: 1
+max_num_tokens: 8192
+max_seq_len: 8192
+kv_cache_config:
+  free_gpu_memory_fraction: 0.75
+# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
+# NOTE: overlap_scheduler enabled by default since this commit and changed
+# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
+# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
+disable_overlap_scheduler: true
+print_iter_log: true
+# NOTE: This dtype must match in both prefill/decode configs
+kv_cache_dtype: fp8
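The note above says `kv_cache_dtype` must match between the prefill and decode configs. One way to check this before launching is sketched below. The function names are hypothetical, and the lookup is a plain text match on a top-level `kv_cache_dtype:` line rather than a real YAML parse, so it assumes the key is written exactly as in the configs in this commit.

```bash
#!/bin/bash
# Hypothetical helper: print the value of the first top-level
# "kv_cache_dtype: <value>" line in a config file.
kv_cache_dtype_of() {
  awk -F': *' '$1 == "kv_cache_dtype" { print $2; exit }' "$1"
}

# Hypothetical helper: fail if the two configs disagree, or if the prefill
# config does not set kv_cache_dtype at all.
check_kv_cache_dtype_match() {
  local prefill_dtype decode_dtype
  prefill_dtype="$(kv_cache_dtype_of "$1")"
  decode_dtype="$(kv_cache_dtype_of "$2")"
  if [[ -z "${prefill_dtype}" || "${prefill_dtype}" != "${decode_dtype}" ]]; then
    echo "ERROR: kv_cache_dtype mismatch: prefill='${prefill_dtype}' decode='${decode_dtype}'" >&2
    return 1
  fi
  echo "kv_cache_dtype matches: ${prefill_dtype}"
}
```

For example, `check_kv_cache_dtype_match engine_configs/wide_ep_prefill.yaml engine_configs/wide_ep_decode.yaml` could be run from the `multinode` directory, assuming the files are laid out as in this commit.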
--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/srun_script.sh
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/srun_script.sh
@@ -18,6 +18,8 @@ MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}"
 NUM_NODES=${NUM_NODES:-4}
 NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4}
+export ENGINE_CONFIG="${ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_agg.yaml}"
 # Automate settings of certain variables for convenience, but you are free
 # to manually set these for more control as well.
 ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
@@ -55,12 +57,14 @@ srun \
 # in practice you would probably set `srun --output ... --error ...` to pipe
 # the stdout/stderr to files.
 echo "Launching multi-node worker in background."
+# Leaving TASK empty omits --task, so the worker defaults to aggregated mode
+TASK="" \
 srun \
  --mpi pmix \
  --oversubscribe \
  --container-image "${IMAGE}" \
  --container-mounts "${MOUNTS}" \
-  --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE \
+  --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \
  --verbose \
  --label \
  -A "${ACCOUNT}" \

--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/srun_disaggregated.sh
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/srun_disaggregated.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# This is one of the only variables that must be set currently, most of the rest may
+# just work out of the box if following the steps in the README.
+IMAGE="${IMAGE:-""}"
+# Set to mount current host directory to /mnt inside the container as an example,
+# but you may freely customize the mounts based on your cluster. A common practice
+# is to mount paths to NFS storage for common scripts, model weights, etc.
+# NOTE: This can be a comma separated list of multiple mounts as well.
+DEFAULT_MOUNT="${PWD}:/mnt"
+MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}"
+NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4}
+NUM_PREFILL_NODES=${NUM_PREFILL_NODES:-4}
+PREFILL_ENGINE_CONFIG="${PREFILL_ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_prefill.yaml}"
+NUM_DECODE_NODES=${NUM_DECODE_NODES:-4}
+DECODE_ENGINE_CONFIG="${DECODE_ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_decode.yaml}"
+# Automate settings of certain variables for convenience, but you are free
+# to manually set these for more control as well.
+ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
+export HEAD_NODE="${SLURMD_NODENAME}"
+export HEAD_NODE_IP="$(hostname -i)"
+export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
+export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
+if [[ -z ${IMAGE} ]]; then
+  echo "ERROR: You need to set the IMAGE environment variable to the " \
+       "Dynamo+TRTLLM docker image or .sqsh file from 'enroot import'. " \
+       "See how to build one from source here: " \
+       "https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker"
+  exit 1
+fi
+# NOTE: Output streamed to stdout for ease of understanding the example, but
+# in practice you would probably set `srun --output ... --error ...` to pipe
+# the stdout/stderr to files.
+echo "Launching frontend services in background."
+srun \
+  --overlap \
+  --container-image "${IMAGE}" \
+  --container-mounts "${MOUNTS}" \
+  --verbose \
+  --label \
+  -A "${ACCOUNT}" \
+  -J "${ACCOUNT}-dynamo.trtllm" \
+  --nodelist "${HEAD_NODE}" \
+  --nodes 1 \
+  --jobid "${SLURM_JOB_ID}" \
+  /mnt/start_frontend_services.sh &
+# NOTE: Output streamed to stdout for ease of understanding the example, but
+# in practice you would probably set `srun --output ... --error ...` to pipe
+# the stdout/stderr to files.
+echo "Launching multi-node prefill worker in background."
+TASK=prefill \
+ENGINE_CONFIG=${PREFILL_ENGINE_CONFIG} \
+srun \
+  --mpi pmix \
+  --oversubscribe \
+  --container-image "${IMAGE}" \
+  --container-mounts "${MOUNTS}" \
+  --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \
+  --verbose \
+  --label \
+  -A "${ACCOUNT}" \
+  -J "${ACCOUNT}-dynamo.trtllm" \
+  --nodes "${NUM_PREFILL_NODES}" \
+  --ntasks-per-node "${NUM_GPUS_PER_NODE}" \
+  --jobid "${SLURM_JOB_ID}" \
+  /mnt/start_trtllm_worker.sh &
+echo "Launching multi-node decode worker in background."
+TASK=decode \
+ENGINE_CONFIG=${DECODE_ENGINE_CONFIG} \
+srun \
+  --mpi pmix \
+  --oversubscribe \
+  --container-image "${IMAGE}" \
+  --container-mounts "${MOUNTS}" \
+  --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \
+  --verbose \
+  --label \
+  -A "${ACCOUNT}" \
+  -J "${ACCOUNT}-dynamo.trtllm" \
+  --nodes "${NUM_DECODE_NODES}" \
+  --ntasks-per-node "${NUM_GPUS_PER_NODE}" \
+  --jobid "${SLURM_JOB_ID}" \
+  /mnt/start_trtllm_worker.sh &
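The README asks that `NUM_PREFILL_NODES*NUM_GPUS_PER_NODE` and `NUM_DECODE_NODES*NUM_GPUS_PER_NODE` match the parallelism in each engine config. The script above does not verify this, so a pre-flight check along these lines could catch mismatches early. This is a sketch with hypothetical function names. It also assumes that prefill and decode workers should fit within the allocated node count, which Slurm normally exposes inside a job as `SLURM_JOB_NUM_NODES`.

```bash
#!/bin/bash
# Hypothetical helper: succeed only if nodes * gpus_per_node equals the
# parallel size requested in the engine config (e.g. tensor_parallel_size).
check_gpu_count() {
  local nodes="$1" gpus_per_node="$2" parallel_size="$3"
  local total=$(( nodes * gpus_per_node ))
  if (( total != parallel_size )); then
    echo "ERROR: ${nodes} nodes x ${gpus_per_node} GPUs = ${total}, expected ${parallel_size}" >&2
    return 1
  fi
  echo "${total}"
}

# Hypothetical helper: succeed only if prefill + decode nodes fit in the
# number of allocated nodes passed as the third argument.
check_node_budget() {
  local prefill_nodes="$1" decode_nodes="$2" allocated_nodes="$3"
  local needed=$(( prefill_nodes + decode_nodes ))
  if (( needed > allocated_nodes )); then
    echo "ERROR: need ${needed} nodes but only ${allocated_nodes} allocated" >&2
    return 1
  fi
}

# Example with the defaults from this script (TP16 per worker, 8 nodes):
# check_gpu_count "${NUM_PREFILL_NODES}" "${NUM_GPUS_PER_NODE}" 16
# check_gpu_count "${NUM_DECODE_NODES}" "${NUM_GPUS_PER_NODE}" 16
# check_node_budget "${NUM_PREFILL_NODES}" "${NUM_DECODE_NODES}" "${SLURM_JOB_NUM_NODES}"
```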
--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/start_trtllm_worker.sh
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/start_trtllm_worker.sh
@@ -22,6 +22,18 @@ if [[ -z ${ENGINE_CONFIG} ]]; then
    exit 1
 fi
+EXTRA_ARGS=""
+if [[ -n ${TASK} ]]; then
+  EXTRA_ARGS+="--task ${TASK}"
+fi
+# NOTE: When this script is run directly from srun, the environment variables
+# for TRTLLM KV cache are not set. So we need to set them here.
+# Related issue: https://github.com/ai-dynamo/dynamo/issues/1743
+if [[ -z ${TRTLLM_USE_UCX_KVCACHE} ]] && [[ -z ${TRTLLM_USE_NIXL_KVCACHE} ]]; then
+    export TRTLLM_USE_UCX_KVCACHE=1
+fi
 # NOTE: trtllm_inc.py is a standalone python script that launches a Dynamo+TRTLLM
 # worker and registers itself with the runtime. It is currently easier to wrap
 # this standalone script with `trtllm-llmapi-launch` for MPI handling purposes,
@@ -30,4 +42,5 @@ trtllm-llmapi-launch \
  python3 /workspace/launch/dynamo-run/src/subprocess/trtllm_inc.py \
    --model-path "${MODEL_PATH}" \
    --model-name "${SERVED_MODEL_NAME}" \
-    --extra-engine-args "${ENGINE_CONFIG}"
+    --extra-engine-args "${ENGINE_CONFIG}" \
+    ${EXTRA_ARGS}
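The change above forwards `--task` only when `TASK` is set, relying on word splitting of the unquoted `${EXTRA_ARGS}` string. An alternative that avoids depending on word splitting is to collect optional arguments in a bash array. The sketch below is not what the commit does, and the function name is hypothetical; it only illustrates the same optional-flag logic.

```bash
#!/bin/bash
# Hypothetical helper: print the optional worker arguments, one per line,
# for a given TASK value. An empty TASK prints nothing, which matches the
# aggregated default described in the launch scripts.
build_extra_args() {
  local task="${1:-}"
  local -a args=()
  if [[ -n "${task}" ]]; then
    args+=(--task "${task}")
  fi
  if (( ${#args[@]} > 0 )); then
    printf '%s\n' "${args[@]}"
  fi
}

# Possible usage (bash 4+ for mapfile):
# mapfile -t EXTRA_ARGS < <(build_extra_args "${TASK}")
# some_command --other-flag "${EXTRA_ARGS[@]}"
```

With an array, each argument stays a separate word even if a value contains spaces, and an empty array expands to no arguments at all.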