chore: Upgrade to tensorrt-llm==1.2.0rc3 (#4645)

Commit c9d7d95f authored Nov 26, 2025 by Tanmay Verma; committed by GitHub Nov 26, 2025
4 changed files
--- a/container/Dockerfile.trtllm
+++ b/container/Dockerfile.trtllm
@@ -199,8 +199,6 @@ ENV NIXL_LIB_DIR=$NIXL_PREFIX/lib/${ARCH_ALT}-linux-gnu
 ENV NIXL_PLUGIN_DIR=$NIXL_LIB_DIR/plugins
 # workaround for pickle lib issue
 ENV OMPI_MCA_coll_ucc_enable=0
-# Use UCX KVCACHE by default
-ENV TRTLLM_USE_UCX_KVCACHE=1

 ARG DYNAMO_COMMIT_SHA
 ENV DYNAMO_COMMIT_SHA=$DYNAMO_COMMIT_SHA

--- a/container/build.sh
+++ b/container/build.sh
@@ -98,7 +98,7 @@ TRTLLM_GIT_URL=""
 DEFAULT_TENSORRTLLM_INDEX_URL="https://pypi.nvidia.com/"
 # TODO: Remove the version specification from here and use the ai-dynamo[trtllm] package.
 # Need to update the Dockerfile.trtllm to use the ai-dynamo[trtllm] package.
-DEFAULT_TENSORRTLLM_PIP_WHEEL="tensorrt-llm==1.2.0rc2"
+DEFAULT_TENSORRTLLM_PIP_WHEEL="tensorrt-llm==1.2.0rc3"
 TENSORRTLLM_PIP_WHEEL=""

 VLLM_BASE_IMAGE="nvcr.io/nvidia/cuda-dl-base"

--- a/docs/backends/trtllm/kv-cache-transfer.md
+++ b/docs/backends/trtllm/kv-cache-transfer.md
@@ -21,39 +21,18 @@ limitations under the License.

 In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:

-## Default Method: UCX
-By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode workers. UCX provides high-performance communication optimized for GPU-to-GPU transfers.
+## Default Method: NIXL
+By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library), with UCX (Unified Communication X) as the backend, for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.

-## Beta Method: NIXL
-TensorRT-LLM also supports using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
+### Specify Backends for NIXL

-**Note:** NIXL support in TensorRT-LLM is currently beta and may have some sharp edges.
+TODO: Add instructions for how to specify different backends for NIXL.

-## Using NIXL for KV Cache Transfer
+## Alternative Method: UCX

-**Note:** NIXL version shipped with current dynamo is not supported by tensorrt-llm<=1.2.0rc2. In order to use NIXL backend for KV cache transfer, users are required to build container image with tensorrt-llm>=1.2.0rc3.
+TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. There are two ways to enable UCX as the KV cache transfer backend:

-To enable NIXL for KV cache transfer in disaggregated serving:
+1. **Recommended:** Set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
+2. Alternatively, set the environment variable `TRTLLM_USE_UCX_KVCACHE=1` and configure `cache_transceiver_config.backend: DEFAULT` in the engine configuration YAML.

-1. **Build the container with NIXL support(tensorrt-llm==1.2.0rc3):**
-   ```bash
-   ./container/build.sh --framework trtllm \
-     --tensorrtllm-pip-wheel tensorrt-llm==1.2.0rc3
-   ```
-
-2. **Run the containerized environment:**
-   See [run container](./README.md#run-container) section to learn how to start the container image built in previous step.
-
-   Within container, unset `TRTLLM_USE_UCX_KVCACHE` variable so NIXL can be used instead of UCX.
-
-   ```bash
-    unset TRTLLM_USE_UCX_KVCACHE
-    ```
-
-3. **Start the disaggregated service:**
-   See [disaggregated serving](./README.md#disaggregated-serving) to see how to start the deployment.
-
-4. **Send the request:**
-   See [client](./README.md#client) section to learn how to send the request to deployment.
-
-**Important:** Ensure that ETCD and NATS services are running before starting the service.
+This flexibility allows users to choose the most suitable method for their deployment and compatibility requirements.
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -50,7 +50,7 @@ Repository = "https://github.com/ai-dynamo/dynamo.git"
 [project.optional-dependencies]
 trtllm =[
    "uvloop",
-    "tensorrt-llm==1.2.0rc2",
+    "tensorrt-llm==1.2.0rc3",
 ]

 vllm = [
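
Note on the `kv-cache-transfer.md` change: the recommended UCX option is written there in dotted form (`cache_transceiver_config.backend: UCX`). A minimal sketch of how that setting might look in an engine configuration YAML file, assuming the dotted path maps to a nested key as is usual in YAML. All other engine settings are omitted, and the surrounding file layout is an assumption, not taken from this commit:

```yaml
# Sketch only: selects UCX directly as the KV cache transfer backend.
# Assumes the dotted path from the docs expands to this nested form.
cache_transceiver_config:
  backend: UCX   # per the docs, DEFAULT is used instead when relying on the environment variable
```

Since this commit removes the UCX environment variable from `Dockerfile.trtllm`, a config-file setting like this would be the explicit way to opt back into UCX on a per-deployment basis.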