---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
sidebar-title: Disagg Communication
subtitle: Best practices for prefill/decode worker communication on Kubernetes
---
# Disaggregated Inference Communication Guide
This guide explains how prefill and decode workers communicate in Dynamo's disaggregated inference architecture on Kubernetes. It answers the frequently asked question: **Why can't prefill and decode workers use NVLink to communicate on the same node?**
## Summary
- **NVLink cannot be used between Kubernetes pods** due to process isolation and GPU partitioning
- **RDMA (InfiniBand/RoCE) is required** for production disaggregated deployments
- **Without RDMA, expect 200-500x performance degradation** in Time To First Token (TTFT): observed ~98s TTFT with TCP vs ~200-500ms with RDMA
- **UCX is the communication layer** that NIXL uses to transfer KV cache between workers
1. **Process Isolation**: Kubernetes pods run in separate Linux namespaces. Even on the same node, Pod A cannot directly access Pod B's memory space.
2. **GPU Partitioning**: The Kubernetes device plugin assigns specific GPUs to each pod via `CUDA_VISIBLE_DEVICES`. Pod A's GPU 0 and Pod B's GPU 0 are physically different devices.
3. **Process/Namespace Isolation**: Each pod runs in a separate process namespace. NVLink peer-to-peer transfers require both GPUs to be within the same process so `cudaDeviceEnablePeerAccess()` can be called.
4. **Memory Registration**: NVLink transfers use `cudaMemcpy` with peer access enabled. This requires calling `cudaDeviceEnablePeerAccess()` - impossible across process boundaries.
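
As an illustration of the isolation described above, the sketch below declares prefill and decode as two separate services. It reuses the `resources.limits.gpu` layout from this guide; the `VLLMPrefillWorker` name and the GPU counts are assumptions chosen for the example, not a reference deployment. Each service is scheduled as its own pod, so each process sees only its own GPUs and cannot enable NVLink peer access to the other's.

```yaml
# Hypothetical sketch: service names and GPU counts are illustrative assumptions
VLLMPrefillWorker:
  resources:
    limits:
      gpu: "1"  # Separate pod: sees only its own GPU
VLLMDecodeWorker:
  resources:
    limits:
      gpu: "1"  # Separate pod: a physically different GPU, no peer access to the prefill GPU
```

KV cache moving between these two pods therefore has to go through a network transport rather than NVLink.
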
### Where NVLink DOES Work
NVLink works **within a pod** for parallelism strategies (TP, EP) where all GPUs are in the same process:
```yaml
# Decode worker with TP=4 uses NVLink between its 4 GPUs
VLLMDecodeWorker:
  resources:
    limits:
      gpu: "4"  # All 4 GPUs visible to single process
  args:
    - --tensor-parallel-size
    - "4"  # NVLink used for TP/EP communication within pod
```
---
## Supported Communication Options
### Transport Comparison
| Transport | Bandwidth | Latency | Same-Node | Cross-Node | GPU Direct |
**Best Practice**: Use RDMA even for same-node communication. The overhead is minimal and it provides consistent behavior whether pods land on the same or different nodes.
### Cross-Node Communication
When prefill and decode workers are on **different nodes**:
**Recommendation**: Use the `get_zcopy` rendezvous scheme with a threshold of `0`. KV cache transfers are always large, so every transfer should take the zero-copy path.
> **⚠️ AWS EFA Exception**: Do NOT use `get_zcopy` on AWS with Ubuntu 24.04 + Kernel ≥6.8. See [AWS EFA Configuration](#aws-efa-configuration) for required settings.
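
For non-AWS clusters, the recommendation above might translate into container environment variables along these lines. This is a minimal sketch: it assumes the threshold refers to UCX's `UCX_RNDV_THRESH` setting and that the variables are set on both the prefill and decode worker containers.

```yaml
# Sketch only: assumes UCX_RNDV_THRESH is the threshold meant above
env:
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"  # Zero-copy rendezvous; do not use on AWS EFA (see warning above)
  - name: UCX_RNDV_THRESH
    value: "0"  # Use the rendezvous path for every transfer size
```
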
```yaml
    value: "/tmp/ucx.log"  # Optional: log to file instead of stdout
```
**Note**: UCX statistics (`UCX_STATS_DEST`, `UCX_STATS_TRIGGER`) require UCX to be compiled with the `--enable-stats` flag, which is not enabled in default builds.
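
If a UCX build with statistics enabled is available, the two variables named above could be set as in the following sketch. The `stdout` and `exit` values are assumptions based on commonly documented UCX settings; confirm them against the UCX version in the container image.

```yaml
# Sketch only: requires UCX built with --enable-stats; values are assumptions to verify
env:
  - name: UCX_STATS_DEST
    value: "stdout"  # Where statistics are written
  - name: UCX_STATS_TRIGGER
    value: "exit"  # Dump statistics when the process exits
```
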
> On AWS Ubuntu 24.04 with Kernel ≥6.8, using `UCX_RNDV_SCHEME=get_zcopy` triggers a fatal `NIXL_ERR_BACKEND` crash. The EFA provider cannot register CUDA memory due to incomplete DMA-BUF support in `efa_nv_peermem`.
>
> **You MUST use the configuration below**; do not copy the standard InfiniBand settings.

> **Note: NIXL is migrating from UCX to libfabric for AWS**
>
> The Dynamo team is transitioning NIXL to use **libfabric** instead of UCX for AWS EFA deployments. This change is driven by:
> - **Better topology awareness**: libfabric provides hierarchical topology awareness similar to NCCL
> - **Native EFA support**: libfabric is the recommended communication layer for AWS EFA
>
> **Current status**: UCX over EFA works but is not recommended for production. Published AWS examples are functional but not performant. Check with the Dynamo team for libfabric availability timeline.

> **Note**: InfiniBand/RoCE numbers with GPUDirect are expected values based on hardware specifications and have not been validated. AWS measurements reflect EFA without functional GPUDirect RDMA (see [AWS EFA Configuration](#aws-efa-configuration) for details).