"csrc/quantization/w8a8/fp8/amd/quant_utils.cuh" did not exist on "9fc983c707a9371951135d2e5a62183da7e09369"
trtllm-kv-cache-transfer.md 1.76 KB
Newer Older
1
2
3
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
4
title: KV Cache Transfer
5
6
---

7
8
9
10
For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).

---

11
12
In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:

13
14
## Using NIXL for KV Cache Transfer

15
Start the disaggregated service: See [Disaggregated Serving](./trtllm-examples.md#disaggregated) to learn how to start the deployment.
16

17
18
19
20
21
## Default Method: NIXL
By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.

### Specify Backends for NIXL

22
TensorRT-LLM supports two NIXL communication backends: UCX and LIBFABRIC. By default, UCX is used if no backend is explicitly specified. Dynamo currently only supports the UCX backend, as LIBFABRIC support is still a work in progress. Please do not change the NIXL backend in the Dynamo runtime image.
23
24

## Alternative Method: UCX
25

26
TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. To enable UCX as the KV cache transfer backend, set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
27

28
29
> [!Note]
> The environment variable `TRTLLM_USE_UCX_KVCACHE=1` with `cache_transceiver_config.backend: DEFAULT` does not enable UCX. You must explicitly set `backend: UCX` in the configuration.