Commit c9d7d95f authored by Tanmay Verma, committed by GitHub

chore: Upgrade to tensorrt-llm==1.2.0rc3 (#4645)

parent d52f070f
@@ -199,8 +199,6 @@ ENV NIXL_LIB_DIR=$NIXL_PREFIX/lib/${ARCH_ALT}-linux-gnu
 ENV NIXL_PLUGIN_DIR=$NIXL_LIB_DIR/plugins
 # workaround for pickle lib issue
 ENV OMPI_MCA_coll_ucc_enable=0
-# Use UCX KVCACHE by default
-ENV TRTLLM_USE_UCX_KVCACHE=1
 ARG DYNAMO_COMMIT_SHA
 ENV DYNAMO_COMMIT_SHA=$DYNAMO_COMMIT_SHA
......
@@ -98,7 +98,7 @@ TRTLLM_GIT_URL=""
 DEFAULT_TENSORRTLLM_INDEX_URL="https://pypi.nvidia.com/"
 # TODO: Remove the version specification from here and use the ai-dynamo[trtllm] package.
 # Need to update the Dockerfile.trtllm to use the ai-dynamo[trtllm] package.
-DEFAULT_TENSORRTLLM_PIP_WHEEL="tensorrt-llm==1.2.0rc2"
+DEFAULT_TENSORRTLLM_PIP_WHEEL="tensorrt-llm==1.2.0rc3"
 TENSORRTLLM_PIP_WHEEL=""
 VLLM_BASE_IMAGE="nvcr.io/nvidia/cuda-dl-base"
......
@@ -21,39 +21,18 @@ limitations under the License.
 In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:
-## Default Method: UCX
+## Default Method: NIXL
-By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode workers. UCX provides high-performance communication optimized for GPU-to-GPU transfers.
+By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
-## Beta Method: NIXL
+### Specify Backends for NIXL
-TensorRT-LLM also supports using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
-**Note:** NIXL support in TensorRT-LLM is currently beta and may have some sharp edges.
+TODO: Add instructions for how to specify different backends for NIXL.
-## Using NIXL for KV Cache Transfer
+## Alternative Method: UCX
-**Note:** NIXL version shipped with current dynamo is not supported by tensorrt-llm<=1.2.0rc2. In order to use NIXL backend for KV cache transfer, users are required to build container image with tensorrt-llm>=1.2.0rc3.
+TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. There are two ways to enable UCX as the KV cache transfer backend:
-To enable NIXL for KV cache transfer in disaggregated serving:
+1. **Recommended:** Set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
+2. Alternatively, set the environment variable `TRTLLM_USE_UCX_KV_CACHE=1` and configure `cache_transceiver_config.backend: DEFAULT` in the engine configuration YAML.
-1. **Build the container with NIXL support(tensorrt-llm==1.2.0rc3):**
+This flexibility allows users to choose the most suitable method for their deployment and compatibility requirements.
-```bash
-./container/build.sh --framework trtllm \
---tensorrtllm-pip-wheel tensorrt-llm==1.2.0rc3
-```
-2. **Run the containerized environment:**
-See [run container](./README.md#run-container) section to learn how to start the container image built in previous step.
-Within container, unset `TRTLLM_USE_UCX_KVCACHE` variable so NIXL can be used instead of UCX.
-```bash
-unset TRTLLM_USE_UCX_KVCACHE
-```
-3. **Start the disaggregated service:**
-See [disaggregated serving](./README.md#disaggregated-serving) to see how to start the deployment.
-4. **Send the request:**
-See [client](./README.md#client) section to learn how to send the request to deployment.
-**Important:** Ensure that ETCD and NATS services are running before starting the service.
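The backend selection rules described in the updated documentation of this hunk can be sketched as a small Python helper. This is only an illustration of the documented behavior, not TensorRT-LLM's implementation: `resolve_kv_cache_backend` and its arguments are hypothetical names, and treating a missing `backend` key as `DEFAULT` is an assumption. The environment variable is spelled as in the new documentation text (`TRTLLM_USE_UCX_KV_CACHE`), which differs from the `TRTLLM_USE_UCX_KVCACHE` spelling removed from the Dockerfile in this commit, so it may be worth confirming which spelling the runtime reads.

```python
import os

# Hypothetical helper: models the selection rules described in the updated
# docs, not TensorRT-LLM's actual implementation.
UCX_ENV_VAR = "TRTLLM_USE_UCX_KV_CACHE"  # spelling taken from the new docs


def resolve_kv_cache_backend(config, env=None):
    """Return the KV cache transfer backend implied by a parsed engine config.

    `config` mirrors the engine configuration YAML after parsing, e.g.
    {"cache_transceiver_config": {"backend": "UCX"}}.
    """
    env = os.environ if env is None else env
    transceiver = config.get("cache_transceiver_config") or {}
    # Assumption: a missing backend key behaves like DEFAULT.
    backend = str(transceiver.get("backend", "DEFAULT")).upper()

    if backend != "DEFAULT":
        # An explicit backend in the YAML wins (the recommended route).
        return backend
    if env.get(UCX_ENV_VAR) == "1":
        # DEFAULT plus the environment variable selects UCX.
        return "UCX"
    # Per the updated docs, NIXL is the default method.
    return "NIXL"


if __name__ == "__main__":
    explicit = {"cache_transceiver_config": {"backend": "UCX"}}
    default = {"cache_transceiver_config": {"backend": "DEFAULT"}}
    print(resolve_kv_cache_backend(explicit, env={}))
    print(resolve_kv_cache_backend(default, env={UCX_ENV_VAR: "1"}))
    print(resolve_kv_cache_backend(default, env={}))
```

Passing `env` explicitly keeps the helper easy to exercise without mutating the process environment.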
@@ -50,7 +50,7 @@ Repository = "https://github.com/ai-dynamo/dynamo.git"
 [project.optional-dependencies]
 trtllm =[
 "uvloop",
-"tensorrt-llm==1.2.0rc2",
+"tensorrt-llm==1.2.0rc3",
 ]
 vllm = [
......
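The documentation text removed in this commit stated that NIXL-based KV cache transfer requires tensorrt-llm>=1.2.0rc3, which the new pin satisfies. A minimal sketch of that comparison follows, assuming version strings of the simple form `X.Y.Z` or `X.Y.ZrcN`. The helper names are hypothetical, and a real check would more likely rely on a full PEP 440 parser such as the `packaging` library.

```python
import re

# Matches only the simple forms used in this commit, e.g. "1.2.0" or "1.2.0rc3".
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?$")

# Minimum version taken from the note removed in this commit.
MIN_NIXL_VERSION = "1.2.0rc3"


def parse_version(text):
    """Turn 'X.Y.Z' or 'X.Y.ZrcN' into a tuple that sorts in release order."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    major, minor, patch, rc = match.groups()
    # A final release sorts after every release candidate of the same version.
    rc_rank = float("inf") if rc is None else int(rc)
    return (int(major), int(minor), int(patch), rc_rank)


def pinned_version(requirement):
    """Extract the version from an exact pin such as 'tensorrt-llm==1.2.0rc3'."""
    _name, sep, version = requirement.partition("==")
    if not sep:
        raise ValueError(f"not an exact pin: {requirement!r}")
    return version.strip()


def supports_nixl(requirement):
    """True if the pinned version is at least MIN_NIXL_VERSION."""
    return parse_version(pinned_version(requirement)) >= parse_version(MIN_NIXL_VERSION)


if __name__ == "__main__":
    for pin in ("tensorrt-llm==1.2.0rc2", "tensorrt-llm==1.2.0rc3"):
        print(pin, supports_nixl(pin))
```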