---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
sidebar-title: Disagg Communication
subtitle: Best practices for prefill/decode worker communication on Kubernetes
---

# Disaggregated Inference Communication Guide

This guide explains how prefill and decode workers communicate in Dynamo's disaggregated inference architecture on Kubernetes. It answers the frequently asked question: **Why can't prefill and decode workers use NVLink to communicate on the same node?**

## Summary

- **NVLink cannot be used between Kubernetes pods** due to process isolation and GPU partitioning
- **RDMA (InfiniBand/RoCE) is required** for production disaggregated deployments
- **Without RDMA, expect 200-500x performance degradation** in Time To First Token (TTFT): ~98s TTFT measured with TCP vs an expected ~200-500ms with RDMA
- **UCX is the communication layer** that NIXL uses to transfer KV cache between workers

---

## Architecture Overview

### Communication Stack

```text
┌─────────────────────────────────────────────────────────────────────────┐
│                         Dynamo Disaggregated Serving                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────────┐              ┌──────────────────┐                 │
│  │  Prefill Worker  │              │  Decode Worker   │                 │
│  │  (Pod A)         │              │  (Pod B)         │                 │
│  │                  │              │                  │                 │
│  │  ┌────────────┐  │              │  ┌────────────┐  │                 │
│  │  │ KV Cache   │  │   Transfer   │  │ KV Cache   │  │                 │
│  │  │ (GPU VRAM) │──┼──────────────┼─▶│ (GPU VRAM) │  │                 │
│  │  └────────────┘  │              │  └────────────┘  │                 │
│  └────────┬─────────┘              └────────┬─────────┘                 │
│           │                                  │                          │
├───────────┼──────────────────────────────────┼──────────────────────────┤
│           │          NIXL Library            │                          │
│           │    (KV Cache Transfer API)       │                          │
├───────────┼──────────────────────────────────┼──────────────────────────┤
│           │                                  │                          │
│           │              UCX                 │                          │
│           │   (Unified Communication X)      │                          │
│           │                                  │                          │
│  ┌────────┴──────────────────────────────────┴────────┐                 │
│  │                  Transport Layer                    │                 │
│  │                                                     │                 │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │                 │
│  │  │   rc_x/dc_x │  │  cuda_copy  │  │    tcp      │ │                 │
│  │  │   (RDMA)    │  │  (staging)  │  │  (fallback) │ │                 │
│  │  │             │  │             │  │             │ │                 │
│  │  │  InfiniBand │  │ GPU↔Host    │  │  Network    │ │                 │
│  │  │  or RoCE    │  │ memory copy │  │  sockets    │ │                 │
│  │  └─────────────┘  └─────────────┘  └─────────────┘ │                 │
│  └─────────────────────────────────────────────────────┘                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
```

### Component Responsibilities

| Component | Role | Location |
|-----------|------|----------|
| **NIXL** | High-level KV cache transfer API | Dynamo runtime library |
| **UCX** | Low-level communication framework | System library |
| **Transports** | Physical data movement | Hardware/kernel drivers |

---

## Why NVLink Cannot Be Used Between Pods

### The Fundamental Constraint

NVLink is a **direct GPU-to-GPU interconnect** that operates at the hardware level. It requires:

1. **Same process** - Both GPUs must be visible to a single process so `cudaDeviceEnablePeerAccess()` can be called
2. **Direct memory access** - Process must have permission to access both GPU memory regions
3. **Peer-to-peer mapping** - CUDA runtime must establish memory mappings between GPUs

**Kubernetes pods violate all three requirements:**

```text
┌─────────────────────────────────────────────────────────────────────────┐
│                        Physical Node (8× H100 GPUs)                      │
│                                                                          │
│  ┌─────────────────────────────┐    ┌─────────────────────────────┐    │
│  │       Pod A (Prefill)       │    │       Pod B (Decode)        │    │
│  │                             │    │                             │    │
│  │  Process Namespace: PID 1   │    │  Process Namespace: PID 1   │    │
│  │  CUDA_VISIBLE_DEVICES: 0,1  │    │  CUDA_VISIBLE_DEVICES: 2-7  │    │
│  │                             │    │                             │    │
│  │  ┌─────┐  ┌─────┐          │    │  ┌─────┐  ┌─────┐  ...     │    │
│  │  │GPU 0│  │GPU 1│          │    │  │GPU 2│  │GPU 3│          │    │
│  │  └─────┘  └─────┘          │    │  └─────┘  └─────┘          │    │
│  │       ↑ NVLink ↑            │    │       ↑ NVLink ↑            │    │
│  │       (works!)              │    │       (works!)              │    │
│  └─────────────────────────────┘    └─────────────────────────────┘    │
│                                                                          │
│            ╳ NO NVLink possible between pods ╳                          │
│                                                                          │
│  Reason: Separate process namespaces, separate CUDA contexts,           │
│          separate GPU device assignments                                 │
└─────────────────────────────────────────────────────────────────────────┘
```

### Technical Explanation

1. **Process Isolation**: Kubernetes pods run in separate Linux namespaces. Even on the same node, Pod A cannot directly access Pod B's memory space.

2. **GPU Partitioning**: The Kubernetes device plugin assigns specific GPUs to each pod via `CUDA_VISIBLE_DEVICES`. Each pod numbers its visible GPUs from 0, so Pod A's GPU 0 and Pod B's GPU 0 are physically different devices.

3. **Separate CUDA Contexts**: Each pod's process creates its own CUDA context. NVLink peer-to-peer transfers require both GPUs to be visible to the same process so `cudaDeviceEnablePeerAccess()` can be called.

4. **Memory Registration**: NVLink transfers use `cudaMemcpy` with peer access enabled. This requires calling `cudaDeviceEnablePeerAccess()` - impossible across process boundaries.

### Where NVLink DOES Work

NVLink works **within a pod** for parallelism strategies (TP, EP) where all GPUs are in the same process:

```yaml
# Decode worker with TP=4 uses NVLink between its 4 GPUs
VLLMDecodeWorker:
  resources:
    limits:
      gpu: "4"   # All 4 GPUs visible to single process
  args:
    - --tensor-parallel-size
    - "4"        # NVLink used for TP/EP communication within pod
```
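The rule in this section can be summarized as a small decision function. The sketch below is illustrative only: the `Gpu` class, the `best_transport` function, and the transport labels are hypothetical names introduced here, not part of Dynamo, NIXL, or UCX.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    node: str   # physical node name
    pod: str    # pod that owns the GPU
    index: int  # physical GPU index on the node


def best_transport(src: Gpu, dst: Gpu, rdma_available: bool) -> str:
    """Pick the fastest transport this guide treats as usable between two GPUs."""
    if src.node == dst.node and src.pod == dst.pod:
        # Same pod: both GPUs are visible to one process, so NVLink applies.
        return "nvlink"
    # Different pods, whether on the same node or not: NVLink is ruled out.
    return "rdma" if rdma_available else "tcp"


print(best_transport(Gpu("node-1", "decode", 2), Gpu("node-1", "decode", 3), True))
print(best_transport(Gpu("node-1", "prefill", 0), Gpu("node-1", "decode", 2), True))
```

Note that two pods on the same node get the same answer as two pods on different nodes, which is why the rest of this guide focuses on RDMA.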

---

## Supported Communication Options

### Transport Comparison

| Transport | Bandwidth | Latency | Same-Node | Cross-Node | GPU Direct |
|-----------|-----------|---------|-----------|------------|------------|
| **NVLink** | 450-900 GB/s | ~µs | ✅ (intra-pod only) | ❌ | ✅ |
| **InfiniBand RDMA** | 20-50 GB/s | ~1 µs | ✅ | ✅ | ✅ (with GPUDirect) |
| **RoCE RDMA** | 10-25 GB/s | ~2 µs | ✅ | ✅ | ✅ (with GPUDirect) |
| **TCP** | 1-3 GB/s | ~50 µs | ✅ | ✅ | ❌ (host staging) |
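
To get a feel for what these bandwidths mean for one request, the sketch below estimates the raw wire time for a single KV cache transfer. It rests on several assumptions that are not stated in this guide: a Llama-3.1-8B-style model shape (32 layers, 8 KV heads, head dimension 128), a 16-bit KV cache, no tensor parallelism, and the low end of each bandwidth range from the table. The function names are hypothetical.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence (K and V tensors for every layer)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def wire_time_ms(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds, ignoring setup and staging costs."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# Assumed model shape and prompt length (ISL=4000); adjust for your model.
size = kv_cache_bytes(tokens=4000, layers=32, kv_heads=8, head_dim=128)
for name, bandwidth in [("InfiniBand RDMA", 20.0), ("RoCE RDMA", 10.0), ("TCP", 1.0)]:
    print(f"{name}: {wire_time_ms(size, bandwidth):.1f} ms for {size / 1e6:.0f} MB")
```

This is a lower bound on wire time only. Measured TTFT overhead also includes memory registration, host staging, and queueing under load, so it can be far larger than this estimate.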

### Same-Node Communication

When prefill and decode workers are on the **same physical node**:

```text
┌─────────────────────────────────────────────────────────────────────────┐
│                             Same Node                                    │
│                                                                          │
│  ┌────────────────────┐                    ┌────────────────────┐       │
│  │   Prefill Pod      │                    │   Decode Pod       │       │
│  │                    │                    │                    │       │
│  │  ┌──────────────┐  │                    │  ┌──────────────┐  │       │
│  │  │ GPU 0 (VRAM) │  │                    │  │ GPU 2 (VRAM) │  │       │
│  │  └──────┬───────┘  │                    │  └──────▲───────┘  │       │
│  └─────────┼──────────┘                    └─────────┼──────────┘       │
│            │                                         │                   │
│            │         RDMA (InfiniBand/RoCE)          │                   │
│            └─────────────────────────────────────────┘                   │
│                                                                          │
│  Options (best to worst):                                                │
│  1. InfiniBand RDMA with GPUDirect    → GPU-to-GPU, bypasses CPU        │
│  2. RoCE RDMA with GPUDirect          → GPU-to-GPU, bypasses CPU        │
│  3. Host-staged RDMA                  → GPU→CPU→RDMA→CPU→GPU            │
│  4. TCP (fallback)                    → GPU→CPU→TCP→CPU→GPU             │
└─────────────────────────────────────────────────────────────────────────┘
```

**Best Practice**: Use RDMA even for same-node communication. The overhead is minimal and it provides consistent behavior whether pods land on the same or different nodes.

### Cross-Node Communication

When prefill and decode workers are on **different nodes**:

```text
┌──────────────────────────────┐         ┌──────────────────────────────┐
│           Node 1             │         │           Node 2             │
│                              │         │                              │
│  ┌────────────────────┐      │         │      ┌────────────────────┐  │
│  │   Prefill Pod      │      │         │      │   Decode Pod       │  │
│  │  ┌──────────────┐  │      │         │      │  ┌──────────────┐  │  │
│  │  │ GPU (VRAM)   │  │      │         │      │  │ GPU (VRAM)   │  │  │
│  │  └──────┬───────┘  │      │         │      │  └──────▲───────┘  │  │
│  └─────────┼──────────┘      │         │      └─────────┼──────────┘  │
│            │                 │         │                │             │
│  ┌─────────▼─────────┐       │         │       ┌────────┴────────┐   │
│  │    RDMA NIC       │       │         │       │    RDMA NIC     │   │
│  │  (InfiniBand/     │◄──────┼─────────┼──────▶│  (InfiniBand/   │   │
│  │   RoCE)           │       │ Network │       │   RoCE)         │   │
│  └───────────────────┘       │         │       └─────────────────┘   │
└──────────────────────────────┘         └──────────────────────────────┘
```

**Requirements for optimal cross-node performance:**
- InfiniBand or RoCE network fabric
- GPUDirect RDMA enabled (GPU memory registered with NIC)
- Proper UCX configuration

---

## UCX Configuration Reference

### Environment Variables

UCX behavior is controlled through environment variables. Set these on both prefill and decode worker pods.

#### Core Transport Selection

```yaml
env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
```

| Transport | Description | When to Use |
|-----------|-------------|-------------|
| `rc_x` | Reliable Connection (accelerated) | Primary RDMA transport |
| `rc` | Reliable Connection (standard) | Fallback RDMA |
| `dc_x` | Dynamically Connected (accelerated) | Scalable RDMA (many endpoints) |
| `dc` | Dynamically Connected (standard) | Fallback scalable RDMA |
| `cuda_copy` | GPU↔Host memory staging | Required for GPU buffers |
| `cuda_ipc` | CUDA IPC (same-node, same-pod) | Intra-pod GPU transfers |
| `tcp` | TCP sockets | Fallback when RDMA unavailable |
| `srd` | Scalable Reliable Datagram (AWS EFA) | AWS-specific (provided by EFA, not core UCX) |

**Excluding transports**: Use the `^` prefix to exclude transports (e.g., `UCX_TLS=^mm` excludes the `mm` shared-memory transports).

**Note**: When specifying `UCX_TLS` explicitly with GPU memory, you must include `cuda_copy` or `cuda_ipc` for UCX to recognize GPU buffers.
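
The note above is easy to miss when editing manifests, so a pre-deployment check can catch it. The following is a minimal sketch, not a UCX API: `check_ucx_tls` is a hypothetical helper, and it assumes that an empty value or `all` means UCX selects transports itself.

```python
GPU_TRANSPORTS = {"cuda_copy", "cuda_ipc"}


def check_ucx_tls(value: str) -> list[str]:
    """Return a list of problems with a UCX_TLS value used with GPU buffers."""
    value = value.strip()
    if not value or value == "all":
        return []  # nothing listed explicitly (assumed: UCX picks transports)
    if value.startswith("^"):
        # Exclusion list: only a problem if both GPU transports are excluded.
        excluded = {t.strip() for t in value[1:].split(",") if t.strip()}
        if GPU_TRANSPORTS <= excluded:
            return ["both cuda_copy and cuda_ipc are excluded"]
        return []
    listed = {t.strip() for t in value.split(",") if t.strip()}
    if not (listed & GPU_TRANSPORTS):
        return ["explicit list lacks cuda_copy or cuda_ipc"]
    return []


print(check_ucx_tls("rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"))
print(check_ucx_tls("rc_x,tcp"))
```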

#### Rendezvous Protocol Settings

```yaml
env:
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
```

| Variable | Value | Description |
|----------|-------|-------------|
| `UCX_RNDV_SCHEME` | `get_zcopy` | Zero-copy RDMA GET (receiver pulls data) |
| `UCX_RNDV_SCHEME` | `put_zcopy` | Zero-copy RDMA PUT (sender pushes data) |
| `UCX_RNDV_SCHEME` | `auto` | Let UCX choose based on message size |
| `UCX_RNDV_THRESH` | `0` | Use rendezvous for all message sizes |
| `UCX_RNDV_THRESH` | `8192` | Use rendezvous for messages ≥8KB |
| `UCX_RNDV_THRESH` | `auto` | Let UCX calculate optimal threshold |
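
The threshold rows in the table can be read as a simple comparison. The sketch below models only that rule as the table describes it; it is not UCX's internal logic, the function names are hypothetical, and it accepts only plain integers, `inf`, and `auto` (no unit suffixes).

```python
import math
from typing import Optional


def parse_thresh(value: str) -> Optional[float]:
    """Parse a UCX_RNDV_THRESH string; None stands for 'auto' (UCX decides)."""
    v = value.strip().lower()
    if v == "auto":
        return None
    if v == "inf":
        return math.inf
    return float(int(v))  # raises ValueError for anything else


def uses_rendezvous(message_bytes: int, thresh: str) -> Optional[bool]:
    """True if a message of this size would use rendezvous under the table's rule."""
    limit = parse_thresh(thresh)
    if limit is None:
        return None  # cannot tell without UCX's own estimate
    return message_bytes >= limit


for thresh in ("0", "8192", "inf"):
    print(thresh, uses_rendezvous(4096, thresh))
```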

**Recommendation**: Use `get_zcopy` with threshold `0` for KV cache transfers (always large).

> **⚠️ AWS EFA Exception**: Do NOT use `get_zcopy` on AWS with Ubuntu 24.04 + Kernel ≥6.8. See [AWS EFA Configuration](#aws-efa-configuration) for required settings.

#### Memory Registration

```yaml
env:
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"
```

| Method | Description |
|--------|-------------|
| `odp` | On-Demand Paging (dynamic registration) |
| `rcache` | Registration cache (reuse registrations) |
| `direct` | Direct registration (each transfer) |

#### Debugging and Diagnostics

```yaml
env:
  - name: UCX_LOG_LEVEL
    value: "info"        # Options: fatal, error, warn, info, debug, trace, data, func
  - name: UCX_LOG_FILE
    value: "/tmp/ucx.log" # Optional: log to file instead of stdout
```

**Note**: UCX statistics (`UCX_STATS_DEST`, `UCX_STATS_TRIGGER`) require UCX to be compiled with the `--enable-stats` flag, which default builds do not enable.

### Complete Production Configuration

```yaml
env:
  # Transport selection - RDMA with GPU support
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"

  # Rendezvous for large transfers
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"

  # Memory registration optimization
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"

  # RDMA settings
  - name: UCX_IB_GID_INDEX
    value: "3"           # RoCE v2 GID index (cluster-specific)
```

### AWS EFA Configuration

> **⚠️ Critical: Zero-Copy RDMA causes crashes on AWS Kernel 6.8+**
>
> On AWS Ubuntu 24.04 with Kernel ≥6.8, using `UCX_RNDV_SCHEME=get_zcopy` triggers a fatal `NIXL_ERR_BACKEND` crash. The EFA provider cannot register CUDA memory due to incomplete DMA-BUF support in `efa_nv_peermem`.
>
> **You MUST use the configuration below** — do not copy the standard InfiniBand settings.

> **Note: NIXL is migrating from UCX to libfabric for AWS**
> The Dynamo team is transitioning NIXL to use **libfabric** instead of UCX for AWS EFA deployments. This change is driven by:
> - **Better topology awareness**: libfabric provides hierarchical topology awareness similar to NCCL
> - **Native EFA support**: libfabric is the recommended communication layer for AWS EFA
>
> **Current status**: UCX over EFA works but is not recommended for production. Published AWS examples are functional but not performant. Check with the Dynamo team for libfabric availability timeline.

**Required AWS EFA Configuration** (Ubuntu 24.04 + Kernel ≥6.8):

```yaml
env:
  - name: UCX_TLS
    value: "srd,cuda_copy,tcp"    # SRD is EFA's RDMA transport
  - name: UCX_RNDV_SCHEME
    value: "auto"                  # DO NOT use get_zcopy - causes crashes
  - name: UCX_RNDV_THRESH
    value: "8192"                  # Avoid CUDA zero-copy for large transfers
```

**Why these settings are mandatory**:
- `UCX_RNDV_SCHEME=auto` prevents UCX from forcing zero-copy RDMA on CUDA buffers
- `UCX_RNDV_THRESH=8192` ensures large KV cache transfers use host-staging instead of GPU-direct (which fails)
- Using `get_zcopy` or threshold `0` will cause `remote invalid RD request` errors and worker crashes

**Known Limitations**:
- GPU Direct RDMA is non-functional on AWS EFA with Ubuntu 24.04 + kernel ≥6.8
- Expect 3x performance degradation compared to InfiniBand (host-staged transfers)
- For optimal disaggregated performance, consider clusters with InfiniBand/RoCE, or wait for libfabric support on AWS
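
Because the InfiniBand/RoCE and AWS EFA settings must not be mixed, it can help to generate the `env` list from one place. The sketch below shows one way to do that, using only the values listed in this guide. The platform labels (`infiniband`, `aws-efa`) and both function names are hypothetical.

```python
def ucx_env(platform: str) -> dict[str, str]:
    """Return the UCX settings this guide lists for a platform label."""
    if platform == "infiniband":  # hypothetical label: InfiniBand/RoCE clusters
        return {
            "UCX_TLS": "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc",
            "UCX_RNDV_SCHEME": "get_zcopy",
            "UCX_RNDV_THRESH": "0",
        }
    if platform == "aws-efa":  # hypothetical label: Ubuntu 24.04 + kernel >= 6.8
        return {
            "UCX_TLS": "srd,cuda_copy,tcp",
            "UCX_RNDV_SCHEME": "auto",  # get_zcopy crashes on this platform
            "UCX_RNDV_THRESH": "8192",
        }
    raise ValueError(f"unknown platform label: {platform!r}")


def to_k8s_env(env: dict[str, str]) -> list[dict[str, str]]:
    """Convert a mapping into the name/value list used under `env:` in a pod spec."""
    return [{"name": name, "value": value} for name, value in env.items()]


for entry in to_k8s_env(ucx_env("aws-efa")):
    print(entry)
```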

---

## Deployment Configuration

### Kubernetes Resource Requirements

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    VLLMPrefillWorker:
      resources:
        limits:
          gpu: "2"
      extraPodSpec:
        mainContainer:
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]      # Required for RDMA memory pinning
          resources:
            limits:
              rdma/ib: "2"           # RDMA resources (match TP size)
            requests:
              rdma/ib: "2"
```

### Required Capabilities and Resources

| Setting | Purpose | Notes |
|---------|---------|-------|
| `IPC_LOCK` capability | Pin memory for RDMA | Bypasses RLIMIT_MEMLOCK; required for `ibv_reg_mr()` to pin GPU/host buffers |
| `rdma/ib` resources | RDMA NIC access | Provided by RDMA device plugin |
| `sharedMemory.size` | IPC between processes | 16Gi for vLLM, 80Gi for TRT-LLM |
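
The first two rows of this table can be checked mechanically against a container spec before deployment. The sketch below assumes the container is available as a plain Python dict shaped like the `mainContainer` block above. `check_rdma_container` is a hypothetical helper, and the "match TP size" rule comes from the comment in the example manifest.

```python
def check_rdma_container(container: dict, tp_size: int) -> list[str]:
    """Return the RDMA-related problems found in one container spec."""
    problems = []
    caps = (container.get("securityContext", {})
                     .get("capabilities", {})
                     .get("add", []))
    if "IPC_LOCK" not in caps:
        problems.append("missing IPC_LOCK capability")
    resources = container.get("resources", {})
    for kind in ("limits", "requests"):
        count = resources.get(kind, {}).get("rdma/ib")
        if count is None:
            problems.append(f"no rdma/ib in resources.{kind}")
        elif int(count) != tp_size:
            problems.append(f"rdma/ib {kind}={count} does not match TP size {tp_size}")
    return problems


container = {
    "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
    "resources": {"limits": {"rdma/ib": "2"}, "requests": {"rdma/ib": "2"}},
}
print(check_rdma_container(container, tp_size=2))
```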

### Infrastructure Prerequisites

1. **RDMA Device Plugin**: Exposes `rdma/ib` resources to Kubernetes
   ```bash
   kubectl get nodes -o jsonpath='{.items[*].status.allocatable.rdma/ib}'
   ```

2. **InfiniBand/RoCE Network**: Physical RDMA fabric connecting nodes

3. **GPUDirect RDMA** (optional but recommended):
   - NVIDIA driver with GPUDirect enabled
   - `nvidia-peermem` kernel module loaded
   - NIC firmware supporting GPUDirect

---

## Diagnostics and Performance Validation

### Pre-Deployment Validation

#### 1. Verify RDMA Availability

```bash
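# Check RDMA devices on node (ibv_devinfo is provided by the ibverbs-utils
# package on Ubuntu; install it in the debug container if it is missing)
kubectl debug node/<node-name> -it --image=ubuntu:22.04 -- bash
ibv_devinfo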
```

Expected output shows InfiniBand or RoCE devices:
```text
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         28.35.2000
        ...
```

#### 2. Check UCX Transport Capabilities

```bash
# Inside a Dynamo worker pod
ucx_info -d
```

Look for GPU memory support:
```text
# Memory domain: mlx5_0
#     Component: ib
#     memory types: host (access,reg,cache), cuda (access,reg,cache)
#                                            ^^^^ GPU memory supported
```

**If you only see `host`**: GPUDirect RDMA is not working. KV transfers will use host staging.
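
This check can be scripted when validating many pods. The sketch below scans captured `ucx_info -d` output for memory domains whose `memory types` line mentions `cuda`. It assumes the output layout matches the excerpt above, and `cuda_capable_domains` is a hypothetical helper, not part of UCX.

```python
import re


def cuda_capable_domains(ucx_info_output: str) -> list[str]:
    """Return memory domains whose 'memory types' line lists cuda."""
    domains = []
    current = None
    for line in ucx_info_output.splitlines():
        match = re.search(r"Memory domain:\s*(\S+)", line)
        if match:
            current = match.group(1)
            continue
        if current and "memory types:" in line and re.search(r"\bcuda\b", line):
            domains.append(current)
    return domains


sample = """\
# Memory domain: mlx5_0
#     Component: ib
#     memory types: host (access,reg,cache), cuda (access,reg,cache)
"""
print(cuda_capable_domains(sample))
```

An empty result for the RDMA device corresponds to the host-only case described above.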

#### 3. Test UCX Performance

```bash
# Server (on decode worker pod)
ucx_perftest -t tag_bw -n 100 -s 134217728

# Client (on prefill worker pod)
ucx_perftest <server-ip> -t tag_bw -n 100 -s 134217728
```

**Expected bandwidth**:
- InfiniBand HDR: 20-25 GB/s per port
- RoCE 100GbE: 10-12 GB/s
- TCP fallback: 1-2 GB/s

### NIXL Benchmark Tool

Deploy the NIXL benchmark to validate end-to-end KV transfer performance:

```bash
cd deploy/pre-deployment/nixl
./build_and_deploy.sh
```

This deploys a benchmark that measures actual GPU-to-GPU transfer rates through NIXL.

### Runtime Diagnostics

#### Verify NIXL Backend Initialization

```bash
kubectl logs <worker-pod> | grep -i "NIXL\|UCX"
```

**Good output**:
```text
NIXL INFO Backend UCX was instantiated
```

**Bad output** (RDMA not working):
```text
UCX WARN no RDMA transports available
NIXL INFO falling back to TCP transport
```
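
For automated checks, the two outputs above can be told apart with a simple string match. The sketch below relies only on the log lines quoted in this section; actual log wording may differ between versions. The function name and the returned labels are hypothetical.

```python
def nixl_transport_status(log_text: str) -> str:
    """Classify worker logs using the example messages from this guide."""
    lowered = log_text.lower()
    # Check the failure messages first: a pod can log both lines.
    if "falling back to tcp" in lowered or "no rdma transports available" in lowered:
        return "tcp-fallback"
    if "backend ucx was instantiated" in lowered:
        return "ucx-initialized"
    return "unknown"


print(nixl_transport_status("NIXL INFO Backend UCX was instantiated"))
print(nixl_transport_status("UCX WARN no RDMA transports available"))
```

The input would typically be the captured output of `kubectl logs <worker-pod>`.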

#### Monitor Transfer Performance

Check Grafana dashboards for:
- **NIXL transfer bandwidth**: Should show GB/s, not MB/s
- **KV cache transfer latency**: Should be under 500ms for typical workloads

**Red flags indicating RDMA issues**:
- Transfer bandwidth under 1 GB/s
- TTFT > 10 seconds
- `Unsupported operation` errors in logs
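
The three red flags above can be combined into one alerting check. The sketch below uses the thresholds from this list. How the bandwidth and TTFT values are obtained (for example, from the Grafana data source) is left out, and `rdma_red_flags` is a hypothetical helper.

```python
def rdma_red_flags(bandwidth_gb_per_s: float, ttft_s: float, log_text: str) -> list[str]:
    """Return the red flags from this guide that the given observations trigger."""
    flags = []
    if bandwidth_gb_per_s < 1.0:
        flags.append("transfer bandwidth under 1 GB/s")
    if ttft_s > 10.0:
        flags.append("TTFT over 10 seconds")
    if "Unsupported operation" in log_text:
        flags.append("'Unsupported operation' errors in logs")
    return flags


print(rdma_red_flags(bandwidth_gb_per_s=0.4, ttft_s=98.0, log_text=""))
print(rdma_red_flags(bandwidth_gb_per_s=22.0, ttft_s=0.4, log_text=""))
```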

### Common Diagnostic Commands

```bash
# Check UCX transport selection
kubectl exec <pod> -- env | grep UCX

# Verify RDMA device visibility
kubectl exec <pod> -- ls /dev/infiniband/

# Check GPUDirect RDMA status (on node)
kubectl debug node/<node> -it --image=ubuntu:22.04 -- \
  nsenter -t 1 -m -u -n -p -- dmesg | grep -i "nvidia\|peermem\|gdr"

# Test basic connectivity between pods
kubectl exec <prefill-pod> -- ping -c 3 <decode-pod-ip>
```

---

## Performance Expectations

### KV Cache Transfer Overhead

| Configuration | TTFT Overhead | Source |
|---------------|---------------|--------|
| Aggregated (baseline) | 0 | No KV transfer needed |
| Disagg + InfiniBand RDMA with GPUDirect | +200-500ms | *Expected* based on hardware specs |
| Disagg + RoCE RDMA with GPUDirect | +300-800ms | *Expected* based on hardware specs |
| Disagg + Host-staged (no GPUDirect) | +1-3s | *Expected* - CPU bottleneck |
| Disagg + AWS EFA (without GPUDirect) | ~3x slower than aggregated | *Measured* on AWS p5.48xlarge |
| Disagg + TCP fallback | **+90-100s** | *Measured* ~98s TTFT on AWS p5.48xlarge |

> **Note**: InfiniBand/RoCE numbers with GPUDirect are expected values based on hardware specifications and have not been validated. AWS measurements reflect EFA without functional GPUDirect RDMA (see [AWS EFA Configuration](#aws-efa-configuration) for details).

### When Disaggregated Makes Sense

**Use disaggregated architecture when:**
- Output sequence length (OSL) > 1000 tokens (overhead amortized)
- You need independent scaling of prefill vs decode capacity
- Prefill and decode have different hardware requirements

**Use aggregated architecture when:**
- Low-latency TTFT is critical
- Short outputs (OSL under 500 tokens)
- RDMA is not available

### Break-Even Analysis

The KV transfer overhead is amortized across output tokens. Example data from **Llama-3.1-8B-Instruct** on AWS p5.48xlarge:

```text
Total Latency = TTFT + (OSL × ITL)

Example (Llama-3.1-8B, ISL=4000):
- Aggregated:    218ms + (OSL × 8.0ms)
- Disaggregated: 2400ms + (OSL × 7.8ms)

Break-even: 2400 - 218 = 2182ms overhead
            2182ms / (8.0 - 7.8)ms per token = 10,910 tokens

At OSL=2000: Disagg is 1.1x slower (acceptable)
At OSL=100:  Disagg is 3.1x slower (not recommended)
```
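
The same arithmetic can be reproduced for other TTFT and ITL measurements. The sketch below uses the example numbers above as defaults. The function names are hypothetical, and the linear latency model is the one stated in the formula, not a measured curve.

```python
from typing import Optional


def total_latency_ms(ttft_ms: float, itl_ms: float, osl: int) -> float:
    """Total Latency = TTFT + (OSL x ITL)."""
    return ttft_ms + osl * itl_ms


def break_even_osl(agg_ttft_ms: float, agg_itl_ms: float,
                   disagg_ttft_ms: float, disagg_itl_ms: float) -> Optional[float]:
    """Output length at which both setups have equal total latency, if any."""
    per_token_gain = agg_itl_ms - disagg_itl_ms
    if per_token_gain <= 0:
        return None  # disaggregated never catches up under this model
    return (disagg_ttft_ms - agg_ttft_ms) / per_token_gain


def slowdown(osl: int, agg=(218.0, 8.0), disagg=(2400.0, 7.8)) -> float:
    """Ratio of disaggregated to aggregated total latency at a given OSL."""
    return total_latency_ms(*disagg, osl) / total_latency_ms(*agg, osl)


print(f"break-even OSL: {break_even_osl(218.0, 8.0, 2400.0, 7.8):.0f} tokens")
for osl in (100, 2000):
    print(f"OSL={osl}: disaggregated is {slowdown(osl):.1f}x slower")
```

If the disaggregated ITL is not lower than the aggregated ITL, `break_even_osl` returns `None`, since the TTFT overhead is never recovered under this model.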

---

## Troubleshooting Guide

### Problem: TTFT is 10+ seconds

**Symptoms**: TTFT degrades from expected 200-500ms to 10+ seconds

**Root Cause**: RDMA not active, falling back to TCP

**Diagnosis**:
```bash
kubectl logs <worker-pod> | grep -i "transport\|UCX\|TCP"
```

**Solutions**:
1. Verify RDMA device plugin is installed
2. Add `rdma/ib` resource requests to pod spec
3. Add `IPC_LOCK` capability
4. Set UCX environment variables

### Problem: "Unsupported operation" errors

**Symptoms**: Logs show `Unexpected UCX error: Unsupported operation`

**Root Cause**: UCX attempting GPU RDMA on hardware that doesn't support it

**Solutions**:
1. Check if GPUDirect RDMA is enabled: `ucx_info -d | grep cuda`
2. If not supported, set `UCX_RNDV_THRESH=inf` to disable GPU RDMA
3. Verify `nvidia-peermem` module is loaded

### Problem: AWS EFA not using GPU Direct

**Symptoms**: 3x performance degradation on AWS despite EFA configured

**Root Cause**: GPU Direct RDMA not functional on kernel ≥6.8 with EFA

**Current Status**: This is a known limitation. Options:
1. Use a kernel older than 6.8 (for example, Ubuntu 22.04 with kernel 5.15)
2. Accept host-staging performance penalty
3. Wait for AWS to update EFA DMA-BUF support

### Problem: Intermittent transfer failures

**Symptoms**: Sporadic `getXferStatus: backend 'UCX' returned error status`

**Diagnosis**:
```bash
# Enable UCX debug logging
kubectl set env deployment/<worker> UCX_LOG_LEVEL=debug
kubectl logs <worker-pod> | grep -i error
```

**Common causes**:
- Network congestion or packet loss
- Mismatched UCX versions between pods
- RDMA resource exhaustion

---

## Quick Reference

### Minimum Viable RDMA Configuration

```yaml
env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"

securityContext:
  capabilities:
    add: ["IPC_LOCK"]

resources:
  limits:
    rdma/ib: "2"
  requests:
    rdma/ib: "2"
```

### Diagnostic Checklist

- [ ] `rdma/ib` resources visible: `kubectl get nodes -o jsonpath='{..allocatable.rdma/ib}'`
- [ ] UCX sees RDMA devices: `ucx_info -d | grep "Transport: rc"`
- [ ] UCX sees GPU memory: `ucx_info -d | grep "memory types.*cuda"`
- [ ] NIXL initialized with UCX: `kubectl logs <pod> | grep "Backend UCX"`
- [ ] Transfer bandwidth > 1 GB/s (check Grafana metrics)

---

## Related Documentation

- [Disaggregated Serving Architecture](../design-docs/disagg-serving.md)
- [AIConfigurator Deployment Guide](../features/disaggregated-serving/README.md)
- [NIXL Benchmark Deployment](../../deploy/pre-deployment/nixl/README.md)
- [KV Cache Transfer Methods](../backends/trtllm/trtllm-kv-cache-transfer.md)