event_plane.md 14.7 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Event Plane Architecture

This document describes Dynamo's event plane architecture, which handles service discovery, coordination, and event distribution using etcd and NATS.

## Overview

Dynamo's coordination layer adapts to the deployment environment:

| Deployment | Service Discovery | KV Events | Request Plane |
|------------|-------------------|-----------|---------------|
| **Kubernetes** (with operator) | Native K8s (CRDs, EndpointSlices) | NATS (optional) | TCP |
| **Bare metal / Local** (default) | etcd | NATS (optional) | TCP |

> **Note:** The runtime always defaults to `kv_store` (etcd) for service discovery. Kubernetes deployments must explicitly set `DYN_DISCOVERY_BACKEND=kubernetes` - the Dynamo operator handles this automatically.

```
┌─────────────────────────────────────────────────────────────────────┐
│                    Coordination Layer                                │
│                                                                      │
│  ┌─────────────────────────┐    ┌─────────────────────────────────┐ │
│  │   Service Discovery     │    │            NATS                 │ │
│  │                         │    │         (Optional)              │ │
│  │  • K8s: CRDs + API      │    │  • KV Cache Events              │ │
│  │  • Bare metal: etcd     │    │  • Router Replica Sync          │ │
│  │                         │    │  • JetStream Persistence        │ │
│  └─────────────────────────┘    └─────────────────────────────────┘ │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
                    │                          │
         ┌──────────┴──────────┐    ┌─────────┴──────────┐
         ▼                     ▼    ▼                    ▼
    ┌─────────┐          ┌─────────┐              ┌─────────┐
    │Frontend │          │ Planner │              │ Worker  │
    └─────────┘          └─────────┘              └─────────┘
```

## Kubernetes-Native Service Discovery

When running on Kubernetes with the Dynamo operator, service discovery uses native Kubernetes resources instead of etcd.

### Configuration

The operator explicitly sets:
```bash
DYN_DISCOVERY_BACKEND=kubernetes
```

> **Important:** This must be explicitly configured. The runtime defaults to `kv_store` in all environments.

### How It Works

1. **DynamoWorkerMetadata CRD**: Workers register their endpoints by creating/updating DynamoWorkerMetadata custom resources
2. **EndpointSlices**: Used to signal readiness status to the system
3. **K8s API Watches**: Components watch for CRD changes to discover available endpoints

### Benefits

- No external etcd cluster required
- Native integration with Kubernetes lifecycle
- Automatic cleanup when pods terminate
- Works with standard K8s RBAC

### Environment Variables (Injected by Operator)

| Variable | Description |
|----------|-------------|
| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
| `POD_NAME` | Current pod name |
| `POD_NAMESPACE` | Current namespace |
| `POD_UID` | Pod unique identifier |

---

## etcd Architecture (Default for All Deployments)

When `DYN_DISCOVERY_BACKEND=kv_store` (the global default), etcd is used for service discovery.

### Connection Configuration

etcd connection is configured via environment variables:

| Variable | Description | Default |
|----------|-------------|---------|
| `ETCD_ENDPOINTS` | Comma-separated etcd URLs | `http://localhost:2379` |
| `ETCD_AUTH_USERNAME` | Basic auth username | None |
| `ETCD_AUTH_PASSWORD` | Basic auth password | None |
| `ETCD_AUTH_CA` | CA certificate path (TLS) | None |
| `ETCD_AUTH_CLIENT_CERT` | Client certificate path | None |
| `ETCD_AUTH_CLIENT_KEY` | Client key path | None |

Example:
```bash
export ETCD_ENDPOINTS=http://etcd-0:2379,http://etcd-1:2379,http://etcd-2:2379
```

### Lease Management

Each `DistributedRuntime` maintains a primary lease with etcd:

```
┌────────────────────┐         ┌──────────────┐
│ DistributedRuntime │◄────────│ Primary Lease │
│                    │         │  TTL: 10s     │
│  • Namespace       │         └───────┬───────┘
│  • Components      │                 │
│  • Endpoints       │                 │ Keep-Alive
│                    │                 │ Heartbeat
└────────────────────┘                 ▼
                               ┌──────────────┐
                               │     etcd     │
                               └──────────────┘
```

**Lease Lifecycle:**

1. **Creation**: Lease created during `DistributedRuntime` initialization
2. **Keep-Alive**: Background task sends heartbeats at 50% of remaining TTL
3. **Expiration**: If heartbeats stop, lease expires after TTL (10 seconds default)
4. **Cleanup**: All keys associated with the lease are automatically deleted

**Automatic Recovery:**

- Reconnection with exponential backoff (50ms to 5s)
- Deadline-based retry logic
- Cancellation token propagation

### Service Discovery

Endpoints are registered in etcd for dynamic discovery:

**Key Format:**
```
/services/{namespace}/{component}/{endpoint}/{instance_id}
```

**Example:**
```
/services/vllm-agg/backend/generate/694d98147d54be25
```

**Registration Data:**
```json
{
  "namespace": "vllm-agg",
  "component": "backend",
  "endpoint": "generate",
  "instance_id": 7587888160958628000,
  "transport": {
    "tcp": "192.168.1.10:9999"
  }
}
```

### Discovery Queries

The discovery system supports multiple query patterns:

| Query Type | Pattern | Use Case |
|------------|---------|----------|
| `AllEndpoints` | `/services/` | List all services |
| `NamespacedEndpoints` | `/services/{namespace}/` | Filter by namespace |
| `ComponentEndpoints` | `/services/{namespace}/{component}/` | Filter by component |
| `Endpoint` | `/services/{namespace}/{component}/{endpoint}/` | Specific endpoint |

### Watch Functionality

Clients watch etcd prefixes for real-time updates:

```python
# Client watches for endpoint changes
watcher = etcd.watch_prefix("/services/vllm-agg/backend/generate/")

for event in watcher:
    if event.type == "PUT":
        # New endpoint registered
        add_endpoint(event.value)
    elif event.type == "DELETE":
        # Endpoint removed (worker died)
        remove_endpoint(event.key)
```

**Watch Features:**

- Initial state retrieval with `get_and_watch_prefix()`
- Automatic reconnection on stream failure
- Revision tracking for no-event-loss guarantees
- Event types: `PUT` (create/update) and `DELETE`

### Distributed Locks

etcd provides distributed locking for coordination:

**Lock Types:**

| Type | Key Pattern | Behavior |
|------|-------------|----------|
| Write Lock | `v1/{prefix}/writer` | Exclusive (no readers/writers) |
| Read Lock | `v1/{prefix}/readers/{id}` | Shared (multiple readers) |

**Operations:**

```rust
// Non-blocking write lock
let lock = client.try_write_lock("my_resource").await?;

// Blocking read lock with polling (100ms intervals)
let lock = client.read_lock_with_wait("my_resource").await?;
```

## NATS Architecture

### When NATS is Used

NATS is used for:

1. **KV Cache Events**: Real-time KV cache state updates for routing
2. **Router Replica Sync**: Synchronizing router state across replicas
3. **Legacy Request Plane**: NATS-based request transport (optional)

### Configuration

| Variable | Description | Default |
|----------|-------------|---------|
| `NATS_SERVER` | NATS server URL | `nats://localhost:4222` |

### Disabling NATS

For deployments without KV-aware routing:

```bash
# Disable NATS and KV events
python -m dynamo.frontend --no-kv-events
```

This enables "approximate mode" for KV routing without event persistence.

### Event Publishing

Components publish events to NATS subjects:

```rust
pub trait EventPublisher {
    async fn publish(&self, event: &str, data: &[u8]) -> Result<()>;
    async fn publish_serialized<T: Serialize>(&self, event: &str, data: &T) -> Result<()>;
}
```

**Subject Naming:**
```
{base_subject}.{event_name}
```

Example:
```
vllm-agg.backend.kv_cache_update
```

### Event Subscription

Components subscribe to events:

```rust
pub trait EventSubscriber {
    async fn subscribe(&self, topic: &str) -> Result<Subscriber>;
    async fn subscribe_typed<T: DeserializeOwned>(&self, topic: &str) -> Result<TypedSubscriber<T>>;
}
```

### JetStream Persistence

For durable event delivery, NATS JetStream provides:

- Message persistence
- Replay from offset
- Consumer groups for load balancing
- Acknowledgment tracking

## Key-Value Store Abstraction

Dynamo provides a unified KV store interface supporting multiple backends:

### Supported Backends

| Backend | Use Case | Configuration |
|---------|----------|---------------|
| `EtcdStore` | Production deployments | `ETCD_ENDPOINTS` |
| `MemoryStore` | Testing, development | Default |
| `NatsStore` | NATS-only deployments | `NATS_SERVER` |
| `FileStore` | Local persistence | File path |

### Store Interface

```rust
pub trait KvStore {
    async fn get(&self, bucket: &str, key: &str) -> Result<Option<Vec<u8>>>;
    async fn put(&self, bucket: &str, key: &str, value: &[u8]) -> Result<()>;
    async fn delete(&self, bucket: &str, key: &str) -> Result<()>;
    async fn watch(&self, bucket: &str) -> Result<WatchStream>;
}
```

### Buckets

Data is organized into logical buckets:

| Bucket | Purpose |
|--------|---------|
| `v1/instances` | Endpoint instance registry |
| `v1/mdc` | Model deployment cards |

## Typed Prefix Watcher

For type-safe watching of etcd prefixes:

```rust
// Watch and maintain HashMap of deserialized values
let watcher = watch_prefix_with_extraction::<DiscoveryInstance>(
    &etcd_client,
    "/services/vllm-agg/",
    lease_id_extractor,
    value_extractor,
).await?;

// Receive updates via watch channel
let instances = watcher.borrow();
```

**Key Extractors:**

| Extractor | Description |
|-----------|-------------|
| `lease_id()` | Use lease ID as key |
| `key_string()` | Extract key with prefix stripping |
| `full_key_string()` | Use full etcd key |

## Reliability Features

### Connection Resilience

**etcd Reconnection:**
- Exponential backoff: 50ms to 5s
- Deadline-based retry logic
- Mutex ensures single concurrent reconnect

**NATS Reconnection:**
- Built-in reconnection in NATS client
- Configurable max reconnect attempts
- Buffering during disconnection

### Lease-Based Cleanup

When a worker crashes or loses connectivity:

1. Keep-alive heartbeats stop
2. Lease expires after TTL (10 seconds)
3. All registered endpoints automatically deleted
4. Clients receive DELETE watch events
5. Traffic reroutes to healthy workers

### Transaction Safety

etcd transactions ensure atomic operations:

```rust
// Atomic create-if-not-exists
let txn = Txn::new()
    .when([Compare::create_revision(key, CompareOp::Equal, 0)])
    .and_then([Op::put(key, value, options)]);

etcd_client.txn(txn).await?;
```

This prevents race conditions in concurrent service registration.

## Operational Modes

### Kubernetes Mode (Requires Explicit Configuration)

Native Kubernetes service discovery:

```bash
# Operator explicitly sets this (not auto-detected):
export DYN_DISCOVERY_BACKEND=kubernetes

# Workers register via K8s CRDs
python -m dynamo.vllm --model Qwen/Qwen3-0.6B

# Frontend discovers workers via K8s API
python -m dynamo.frontend
```

No etcd or NATS required for basic operation when using K8s discovery.

### KV Store Mode (Global Default)

Full service discovery with etcd:

```bash
# This is the default - no configuration needed
# export DYN_DISCOVERY_BACKEND=kv_store  # (implicit)

# Workers register with etcd
python -m dynamo.vllm --model Qwen/Qwen3-0.6B

# Frontend discovers workers via etcd
python -m dynamo.frontend
```

### KV-Aware Routing (Optional)

Enable NATS for KV cache event tracking:

```bash
# Default: KV events enabled (requires NATS)
python -m dynamo.frontend --router-mode kv

# Disable KV events for prediction-based routing (no NATS)
python -m dynamo.frontend --router-mode kv --no-kv-events
```

With `--no-kv-events`:
- Router predicts cache state based on routing decisions
- TTL-based expiration and LRU pruning
- No NATS infrastructure required

## Best Practices

### 1. Use Kubernetes Discovery on K8s

The Dynamo operator automatically sets `DYN_DISCOVERY_BACKEND=kubernetes` for pods. No additional setup required when using the operator.

### 2. For Bare Metal: Deploy etcd Cluster

For bare-metal production deployments, deploy a 3-node etcd cluster for high availability.

### 3. Configure Appropriate TTLs (etcd mode)

Balance between detection speed and overhead:

- **Short TTL (5s)**: Faster failure detection, more keep-alive traffic
- **Long TTL (30s)**: Less overhead, slower detection

### 4. KV Routing Without NATS

For simpler deployments without NATS:

```bash
# Use prediction-based KV routing
python -m dynamo.frontend --router-mode kv --no-kv-events
```

This provides KV-aware routing with reduced accuracy but no NATS dependency.

## Related Documentation

- [Distributed Runtime](distributed_runtime.md) - Runtime architecture
- [Request Plane](request_plane.md) - Request transport configuration
- [Fault Tolerance](../fault_tolerance/README.md) - Failure handling