<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->



# KV Cache Transfer in Disaggregated Serving

In disaggregated serving architectures, the KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer: UCX (the default) and NIXL (experimental).

## Default Method: UCX
By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode workers. UCX provides high-performance communication optimized for GPU-to-GPU transfers.

## Experimental Method: NIXL
TensorRT-LLM also provides experimental support for using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.

**Note:** NIXL support in TensorRT-LLM is experimental and is not yet suitable for production environments.

## Using NIXL for KV Cache Transfer

**Note:** NIXL backend for TensorRT-LLM is currently only supported on AMD64 (x86_64) architecture. If you're running on ARM64, you'll need to use the default UCX method for KV cache transfer.
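
Before building, it can help to confirm which architecture the host reports. The following is a minimal sketch, not part of the project's tooling: `kv_transfer_method_for_arch` is a hypothetical helper name, and the mapping simply restates the note above (NIXL only on x86_64, UCX otherwise).

```bash
# Hypothetical helper (not part of TensorRT-LLM or this repository):
# maps an architecture string to the KV cache transfer methods that the
# note above describes as usable on it.
kv_transfer_method_for_arch() {
  case "$1" in
    x86_64|amd64) echo "nixl-or-ucx" ;;  # NIXL backend is supported here
    *)            echo "ucx" ;;          # e.g. aarch64/arm64: default UCX only
  esac
}

# Report the result for the current host.
arch="$(uname -m)"
echo "Detected ${arch}: $(kv_transfer_method_for_arch "${arch}")"
```

On an ARM64 Linux host, `uname -m` typically reports `aarch64`, which falls through to the UCX-only branch.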

To enable NIXL for KV cache transfer in disaggregated serving:

1. **Build the container with NIXL support:**
   The TensorRT-LLM wheel must be built from source with NIXL support. The `./container/build.sh` script caches previously built TensorRT-LLM wheels to reduce build time. If you have previously built a TensorRT-LLM wheel without NIXL support, you must delete the cached wheel to force a rebuild with NIXL support.

   **Remove cached TensorRT-LLM wheel (only if previously built without NIXL support):**
   ```bash
   rm -rf /tmp/trtllm_wheel
   ```

   **Build the container with NIXL support:**
   ```bash
   ./container/build.sh --framework trtllm \
     --use-default-experimental-tensorrtllm-commit \
     --trtllm-use-nixl-kvcache-experimental
   ```

   **Note:** Both `--use-default-experimental-tensorrtllm-commit` and `--trtllm-use-nixl-kvcache-experimental` flags are required to enable NIXL support.

2. **Run the containerized environment:**
   See the [run container](./README.md#run-container) section to learn how to start the container image built in the previous step.

3. **Start the disaggregated service:**
   See the [disaggregated serving](./README.md#disaggregated-serving) section to learn how to start the deployment.

4. **Send the request:**
   See the [client](./README.md#client) section to learn how to send a request to the deployment.
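
The cache cleanup in step 1 can be wrapped so that it is safe to re-run and reports what it did. This is only a sketch under the assumption that the cached wheel lives in `/tmp/trtllm_wheel`, as in the command above; `clear_wheel_cache` is a hypothetical helper, not something `./container/build.sh` provides.

```bash
# Hypothetical helper: removes a cached wheel directory if it exists.
# The default path is the one used in step 1 of this document.
clear_wheel_cache() {
  local cache_dir="${1:-/tmp/trtllm_wheel}"
  if [ -d "${cache_dir}" ]; then
    rm -rf -- "${cache_dir}"
    echo "removed ${cache_dir}"
  else
    echo "no cached wheel at ${cache_dir}"
  fi
}
```

Calling `clear_wheel_cache` with no argument targets `/tmp/trtllm_wheel`. Run it before `./container/build.sh` only if the earlier wheel was built without NIXL support; otherwise the cache saves build time.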

**Important:** Ensure that ETCD and NATS services are running before starting the service.
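
One way to verify this before launching is a simple TCP reachability check. The sketch below assumes both services run on the local host with their default client ports (2379 for etcd, 4222 for NATS); adjust the host and ports to match your deployment. `port_open` and `check_service` are hypothetical helper names, and the check relies on bash's `/dev/tcp` pseudo-device and the coreutils `timeout` command.

```bash
# Hypothetical helper: succeeds if a TCP connection to host:port can be opened.
# Uses bash's /dev/tcp pseudo-device, bounded by a 2-second timeout.
port_open() {
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Hypothetical helper: prints a status line for a named service and
# returns non-zero when the port cannot be reached.
check_service() {
  if port_open "$2" "$3"; then
    echo "$1 reachable at $2:$3"
  else
    echo "$1 NOT reachable at $2:$3"
    return 1
  fi
}
```

For example, `check_service etcd 127.0.0.1 2379 && check_service nats 127.0.0.1 4222` succeeds only when both ports accept connections. An open port does not prove the service is healthy, so treat this as a quick sanity check rather than a full health probe.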