<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->



# KV Cache Transfer in Disaggregated Serving

In disaggregated serving architectures, the KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer: NIXL (the default) and UCX.

## Using NIXL for KV Cache Transfer

To start the disaggregated service, see [Disaggregated Serving](./README.md#disaggregated).

## Default Method: NIXL

By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as the backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.

### Specify Backends for NIXL

NIXL supports multiple communication backends that can be configured via environment variables. By default, UCX is used if no backends are explicitly specified.

**Environment Variable Format:**
```bash
DYN_KVBM_NIXL_BACKEND_<BACKEND>=<value>
```

**Supported Backends:**
- `UCX` - Unified Communication X (default)
- `GDS` - GPU Direct Storage

**Examples:**
```bash
# Enable UCX backend (default behavior)
export DYN_KVBM_NIXL_BACKEND_UCX=true

# Enable GDS backend
export DYN_KVBM_NIXL_BACKEND_GDS=true

# Enable multiple backends
export DYN_KVBM_NIXL_BACKEND_UCX=true
export DYN_KVBM_NIXL_BACKEND_GDS=true

# Explicitly disable a backend
export DYN_KVBM_NIXL_BACKEND_GDS=false
```

**Valid Values:**
- `true`, `1`, `on`, `yes` - Enable the backend
- `false`, `0`, `off`, `no` - Disable the backend
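
Before launching workers, it can help to check that the values you exported are among the accepted spellings listed above. The following is a minimal sketch, not part of TensorRT-LLM or NIXL: `parse_backend_flag` and `report_backends` are hypothetical helper names, and the sketch assumes values are matched exactly as listed (how the library treats other capitalizations is not stated here).

```bash
# Hypothetical helper: classify a value using the spellings documented above.
# Assumption: only the exact lowercase spellings are treated as valid.
parse_backend_flag() {
  case "$1" in
    true|1|on|yes)  echo "enabled" ;;
    false|0|off|no) echo "disabled" ;;
    *)              echo "invalid"; return 1 ;;
  esac
}

# Hypothetical helper: print every DYN_KVBM_NIXL_BACKEND_* variable found in
# the current environment together with how its value would be classified.
report_backends() {
  env | grep '^DYN_KVBM_NIXL_BACKEND_' | while IFS='=' read -r name value; do
    printf '%s -> %s\n' \
      "${name#DYN_KVBM_NIXL_BACKEND_}" \
      "$(parse_backend_flag "$value" || true)"
  done
}

# Example usage: enable GDS with one of the accepted spellings, then inspect.
export DYN_KVBM_NIXL_BACKEND_GDS=on
report_backends
```

If `report_backends` prints nothing, no `DYN_KVBM_NIXL_BACKEND_*` variables are set in the current shell.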

> [!Note]
> If no `DYN_KVBM_NIXL_BACKEND_*` environment variables are set, UCX is used as the default backend.

## Alternative Method: UCX

TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. To enable UCX as the KV cache transfer backend, set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
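
As an illustration of the setting named above, the sketch below writes a minimal YAML fragment. It assumes the dotted path `cache_transceiver_config.backend` maps to a nested YAML key. The file name `engine_config.yaml` is hypothetical, so substitute the engine configuration file your deployment actually loads and merge the key into it rather than overwriting existing settings.

```bash
# Hypothetical file name; point this at your real engine configuration YAML.
CONFIG_FILE="${CONFIG_FILE:-engine_config.yaml}"

# Write only the relevant fragment. NOTE: this overwrites CONFIG_FILE, so for
# an existing configuration add the two lines below by hand instead.
cat > "$CONFIG_FILE" <<'EOF'
cache_transceiver_config:
  backend: UCX
EOF

# Show the result for a quick visual check.
cat "$CONFIG_FILE"
```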

> [!Note]
> The environment variable `TRTLLM_USE_UCX_KV_CACHE=1` with `cache_transceiver_config.backend: DEFAULT` does not enable UCX. You must explicitly set `backend: UCX` in the configuration.