Unverified Commit 4f9a190c authored by Biswa Panda's avatar Biswa Panda Committed by GitHub
Browse files

docs: add overview doc for GPU Memory Service (GMS) (#5920)


Co-authored-by: default avatarSchwinn Saereesitthipitak <schwinns@nvidia.com>
parent 2176c431
# GPU Memory Service (GMS)
## Overview
The **GPU Memory Service (GMS)** is an out-of-process GPU memory manager that decouples ownership of GPU memory from the processes that use it. This enables:
- **Zero-copy sharing** of GPU memory across multiple processes
- **Data survival** across process crashes
- **Fast model loading** via memory import instead of disk I/O for subsequent workers
GMS provides PyTorch integration via `CUDAPluggableAllocator` and pre-built integrations for inference frameworks like **vLLM** and **SGLang**.
## Problem Statement
In traditional LLM inference deployments, each worker process:
1. Loads model weights from disk/network into GPU memory
2. Owns that GPU memory for the lifetime of the process
3. Cannot share weights with other workers on the same GPU
This leads to:
- **Slow worker startup** (weight loading is I/O bound)
- **Memory waste** (duplicate weights when running multiple workers)
- **No crash resilience** (GPU memory lost when process dies)
## Solution Architecture
```
┌──────────────────────────────────────────────────────────────────────────────────────┐
│ │
│ ┌────────────────────┐ ┌─────────────────────────────────────────┐ │
│ │ GMS Server │ │ GMSClientMemoryManager (Writer) │ │
│ │ │ │ │ │
│ │ ┌────────────────┐ │ │ ┌─────────────────────────────────┐ │ │
│ │ │ Memory Manager │ │ ◄── Unix ───────►│ │ GMSRPCClient │ │ │
│ │ └────────────────┘ │ Socket │ └─────────────────────────────────┘ │ │
│ │ │ + │ │ │
│ │ ┌────────────────┐ │ FD │ Writer-only: allocate_and_map, commit │ │
│ │ │ State Machine │ │ (SCM_RIGHTS) └─────────────────────────────────────────┘ │
│ │ └────────────────┘ │ │
│ │ │ ┌─────────────────────────────────────────┐ │
│ │ ┌────────────────┐ │ │ GMSClientMemoryManager (Reader) │ │
│ │ │ Metadata Store │ │ │ │ │
│ │ └────────────────┘ │ ◄── Unix ───────►│ ┌─────────────────────────────────┐ │ │
│ │ │ Socket │ │ GMSRPCClient │ │ │
│ └────────────────────┘ + │ └─────────────────────────────────┘ │ │
│ FD │ │ │
│ (SCM_RIGHTS) │ Reader-only: import_allocation, │ │
│ │ unmap, remap │ │
│ └─────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────────────┘
```
## Core Components
GMS follows a client-server architecture where the **server** owns GPU memory allocations and the **clients** map that memory into their own address spaces. The key insight is that the socket connection itself acts as a distributed lock.
### Server
The GMS server runs as an independent process that manages GPU memory without ever mapping it to its own address space. This design allows the server to:
- **Survive GPU driver failures** - no CUDA context means no vulnerability to driver resets
- **Outlive client processes** - memory persists across client crashes
- **Arbitrate access** - enforce single-writer, multiple-reader semantics
The server consists of three main components:
1. **Memory Manager** - Allocates physical GPU memory via CUDA VMM (`cuMemCreate`) and exports shareable file descriptors (`cuMemExportToShareableHandle`). Critically, it never calls `cuMemMap` - clients handle all virtual address mapping.
2. **State Machine (FSM)** - Manages the global lock state and enforces access rules that ensures consistency across multiple clients. See [State Machine](#state-machine) below for details.
3. **Metadata Store** - Key-value store for tensor metadata (shapes, dtypes, offsets), enabling clients to reconstruct model structure.
### Client
Clients connect to the server to acquire locks and access GPU memory. Two client classes are provided:
1. **GMSRPCClient** - Low-level RPC client for direct protocol access. Handles socket communication, msgpack serialization, and file descriptor passing via `SCM_RIGHTS`. The socket connection **is** the lock - connection lifetime equals lock lifetime, providing automatic crash resilience.
2. **GMSClientMemoryManager** - High-level client that wraps `GMSRPCClient` and handles all CUDA VMM operations for memory import and mapping safely:
- Imports file descriptors and converts them to CUDA memory handles
- Reserves virtual address space and maps physical memory
- Sets appropriate access permissions (RW for writers, RO for readers)
- Supports **unmap/remap** for VA-stable memory release under memory pressure
> **Note**: Always use `GMSClientMemoryManager` to interact with GMS from client code. The low-level `GMSRPCClient` is an implementation detail and should not be used directly.
### Memory Allocation and Import Flow
The following diagram shows how `GMSClientMemoryManager` interacts with the server and GPU. **Writers** allocate new memory while **readers** import existing allocations - both flows share the same export/import/map sequence.
```mermaid
sequenceDiagram
participant C as GMSClientMemoryManager
participant S as GMS Server
participant GPU as GPU Memory
%% Connection
C->>S: Connect (Unix Socket)
C->>S: HandshakeRequest(lock_type)
S-->>C: HandshakeResponse(granted_lock)
%% Allocation (Writer only)
rect rgb(255, 245, 230)
Note over C,GPU: Writer only: Allocate new memory
C->>S: AllocateRequest(size, tag)
S->>GPU: cuMemCreate(size)
GPU-->>S: handle
S-->>C: AllocateResponse(allocation_id)
end
%% Export/Import (Both Writer and Reader)
Note over C,GPU: Both Writer and Reader: Export and map
C->>S: ExportRequest(allocation_id)
S->>GPU: cuMemExportToShareableHandle(handle)
GPU-->>S: fd
S-->>C: Response + fd (via SCM_RIGHTS)
C->>GPU: cuMemImportFromShareableHandle(fd)
C->>GPU: cuMemAddressReserve(size)
C->>GPU: cuMemMap(va, handle)
C->>GPU: cuMemSetAccess(va, RW or RO)
Note over C,GPU: Memory now accessible at VA
```
---
## State Machine
The server maintains a finite state machine (FSM) that governs lock acquisition and memory access. The state is **derived** from the current connections rather than stored explicitly.
### States and Transitions
```mermaid
stateDiagram-v2
[*] --> EMPTY
EMPTY --> RW : RW_CONNECT
RW --> COMMITTED : RW_COMMIT
RW --> EMPTY : RW_ABORT
COMMITTED --> RW : RW_CONNECT
COMMITTED --> RO : RO_CONNECT
RO --> RO : RO_CONNECT
RO --> RO : RO_DISCONNECT (not last)
RO --> COMMITTED : RO_DISCONNECT (last)
```
### State Descriptions
| State | Description | Can Connect RW | Can Connect RO |
|-------|-------------|:--------------:|:--------------:|
| `EMPTY` | No connections, no committed weights | ✓ | ✗ |
| `RW` | Writer connected (exclusive access) | ✗ | ✗ |
| `COMMITTED` | Weights published, no active connections | ✓ | ✓ |
| `RO` | One or more readers connected (shared access) | ✗ | ✓ |
### Events
| Event | Trigger | Description |
|-------|---------|-------------|
| `RW_CONNECT` | Writer connects | Acquires exclusive write lock |
| `RW_COMMIT` | Writer calls `commit()` | Publishes weights, releases lock |
| `RW_ABORT` | Writer disconnects without commit | Discards allocations, releases lock |
| `RO_CONNECT` | Reader connects | Acquires shared read lock |
| `RO_DISCONNECT` | Reader disconnects | Releases shared lock; if last reader, returns to COMMITTED |
### Lock Semantics
The socket connection **is** the lock:
- **Crash resilience**: Connection close (including process crash) automatically releases the lock
- **No explicit unlock**: Eliminates forgotten locks and deadlocks
- **Atomic transitions**: State changes happen atomically with socket operations
---
## Sequence Diagrams
### Writer Flow (Cold Start)
The first worker loads weights from disk and publishes them to GMS.
```mermaid
sequenceDiagram
participant W as Writer Process
participant C as GMSClientMemoryManager
participant S as GMS Server
W->>C: new GMSClientMemoryManager(mode=RW)
C->>S: HandshakeRequest(lock_type=RW)
S-->>C: HandshakeResponse(success=true)
loop For each tensor
W->>C: allocate_and_map(size, tag)
Note over C,S: See Memory Allocation Flow above
W->>C: metadata_put(key, allocation_id, offset, shape)
end
W->>C: commit()
C->>S: CommitRequest()
S->>S: FSM: RW → COMMITTED
S-->>C: CommitResponse(success=true)
```
### Reader Flow (Warm Start)
Subsequent workers import weights from GMS instead of loading from disk.
```mermaid
sequenceDiagram
participant R as Reader Process
participant C as GMSClientMemoryManager
participant S as GMS Server
R->>C: new GMSClientMemoryManager(mode=RO)
C->>S: HandshakeRequest(lock_type=RO)
S-->>C: HandshakeResponse(success=true, committed=true)
R->>C: metadata_list()
S-->>C: keys=[...]
loop For each tensor key
R->>C: metadata_get(key)
S-->>C: allocation_id, offset, shape
R->>C: import_allocation(allocation_id)
Note over C,S: See Memory Import Flow above
end
Note over R,C: Keep connection open during inference
```
### Unmap/Remap Flow (Memory Pressure)
Readers can temporarily release GPU memory while preserving virtual address reservations. This enables "shadow engine" patterns where inactive workers release memory for active ones.
```mermaid
sequenceDiagram
participant R as Reader Process
participant C as GMSClientMemoryManager
participant S as GMS Server
participant GPU as GPU Memory
Note over R,GPU: Need to temporarily release GPU memory
R->>C: unmap()
C->>GPU: cudaDeviceSynchronize()
loop For each mapping
C->>GPU: cuMemUnmap(va)
C->>GPU: cuMemRelease(handle)
Note over C: Keep VA reservation!
end
C->>C: Save memory_layout_hash
C->>S: Close socket (release RO lock)
S->>S: FSM: RO → COMMITTED (if last reader)
Note over R,GPU: GPU memory released, VA preserved
Note over R,GPU: Another writer could modify weights here
R->>C: remap()
C->>S: HandshakeRequest(lock_type=RO)
S->>S: FSM: COMMITTED → RO
S-->>C: HandshakeResponse(success=true)
C->>S: GetStateHashRequest()
S-->>C: GetStateHashResponse(hash)
alt hash == saved_hash
loop For each preserved VA
C->>S: ExportRequest(allocation_id)
S-->>C: Response + fd
C->>GPU: cuMemImportFromShareableHandle(fd)
C->>GPU: cuMemMap(same_va, handle)
Note over C: Tensors valid at same addresses!
end
else hash != saved_hash
C-->>R: StaleMemoryLayoutError
Note over R: Must re-import from scratch
end
```
### Auto-Mode (RW_OR_RO)
The `RW_OR_RO` mode automatically selects writer or reader based on server state, simplifying multi-worker deployments.
```mermaid
sequenceDiagram
participant P as Process
participant C as GMSClientMemoryManager
participant S as GMS Server
Note over P,S: Auto-mode: Writer if first, Reader if weights exist
P->>C: new GMSClientMemoryManager(mode=RW_OR_RO)
C->>S: HandshakeRequest(lock_type=RW_OR_RO)
alt No committed weights AND no RW holder
S->>S: Grant RW lock
S->>S: FSM: EMPTY → RW
S-->>C: HandshakeResponse(granted=RW, committed=false)
Note over P: First process - load from disk
else Weights already committed
S->>S: Grant RO lock
S->>S: FSM: COMMITTED → RO
S-->>C: HandshakeResponse(granted=RO, committed=true)
Note over P: Subsequent process - import from GMS
else RW held by another
S->>S: Wait for RO availability
S->>S: FSM: COMMITTED → RO
S-->>C: HandshakeResponse(granted=RO, committed=true)
Note over P: Wait for writer to finish
end
```
---
## Key Design Decisions
### 1. No VA Mapping on Server
The server never maps memory to virtual addresses (`cuMemMap`). This means:
- **No CUDA context** required on the server
- Server can survive GPU driver resets
- Memory management is fully delegated to clients
### 2. Socket-as-Lock
The socket connection **is** the lock:
- RW lock: Exclusive connection (only one RW at a time)
- RO lock: Shared connection (multiple RO allowed)
- Lock release = socket close (automatic on crash)
Benefits:
- **Crash resilience**: If a reader crashes, its lock is automatically released
- **No explicit unlock**: No forgotten locks or deadlocks
### 3. VA-Stable Unmap/Remap
During `unmap()`:
- Physical memory is released (`cuMemUnmap` + `cuMemRelease`)
- VA reservations are **kept** (`cuMemAddressReserve` still valid)
During `remap()`:
- Same VAs are reused for mapping
- **Tensor pointers remain valid** (no need to update PyTorch tensors)
### 4. Memory Layout Hash
On commit, the server computes a hash of:
- All allocation IDs, sizes, and tags
- All metadata entries
On `remap()`, this hash is checked:
- If match: Safe to remap (layout unchanged)
- If mismatch: Raise `StaleMemoryLayoutError` (must re-import)
**Important**: This detects **structural** changes, not **content** changes.
Weight values can be modified in-place (e.g., RL training updates) as long as the structure is preserved.
---
## Wire Protocol
### Message Format
```
┌──────────────┬────────────────────────────────────────┐
│ Length (4B) │ msgpack-encoded Message │
│ big-endian │ │
└──────────────┴────────────────────────────────────────┘
```
### FD Passing
File descriptors are passed out-of-band using Unix socket `SCM_RIGHTS`:
```python
# Server side (send FD)
socket.send_fds(sock, [message_bytes], [fd])
# Client side (receive FD)
data, fds, _, _ = socket.recv_fds(sock, bufsize, maxfds=1)
fd = fds[0] if fds else -1
```
---
## API Reference
### GMSClientMemoryManager
```python
class GMSClientMemoryManager:
def __init__(
socket_path: str,
mode: RequestedLockType, # RW, RO, or RW_OR_RO
device: int = 0,
timeout_ms: Optional[int] = None,
): ...
# Properties
@property mode: GrantedLockType # Actual granted mode
@property is_connected: bool
@property is_unmapped: bool
@property total_bytes: int
# Allocation (RW only)
def allocate_and_map(size: int, tag: str = "default") -> int # Returns VA
def free_mapping(va: int) -> None
def clear_all() -> int # Returns count cleared
# Import (RO or RW)
def import_allocation(allocation_id: str) -> int # Returns VA
# Metadata (RW: put/delete, RO: get/list)
def metadata_put(key: str, allocation_id: str, offset: int, value: bytes) -> bool
def metadata_get(key: str) -> Optional[Tuple[str, int, bytes]]
def metadata_list(prefix: str = "") -> List[str]
def metadata_delete(key: str) -> bool
# Lifecycle
def commit() -> bool # Publish weights, release RW lock
def switch_to_read(timeout_ms: Optional[int] = None) -> None
def unmap() -> None # Release RO lock, preserve VAs
def remap(timeout_ms: Optional[int] = None) -> bool
def close() -> None
```
## Limitations
1. **Single-GPU per server**: Each GMS server manages one GPU device
2. **CUDA VMM required**: Requires a GPU with Virtual Memory Management support. Check at runtime via `CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED` - there is no guaranteed minimum compute capability
3. **No content validation**: Remap doesn't detect in-place weight modifications
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment