- Physical memory is released (`cuMemUnmap` + `cuMemRelease`)
- VA reservations are **kept** (`cuMemAddressReserve` still valid)
During `remap_all_vas()`:
- Same VAs are reused for mapping
- **Tensor pointers remain valid** (no need to update PyTorch tensors)
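The address-stability property above can be illustrated with a toy model. This is only a sketch: `VaSpace` and its methods are hypothetical names that stand in for the CUDA driver calls, and nothing here touches a GPU.

```python
class VaSpace:
    """Toy model of one reserved VA range (hypothetical, not the CUDA driver API)."""

    def __init__(self, base_va: int, size: int):
        self.base_va = base_va  # stands in for the result of cuMemAddressReserve
        self.size = size
        self.backing = None     # physical handle currently mapped, if any

    def map(self, handle: str) -> int:
        self.backing = handle   # stands in for mapping physical memory at base_va
        return self.base_va

    def unmap(self) -> None:
        self.backing = None     # stands in for cuMemUnmap + cuMemRelease
        # base_va is deliberately left untouched: the reservation survives.


space = VaSpace(base_va=0x7F00_0000_0000, size=2 << 20)
ptr_before = space.map("handle-A")
space.unmap()
ptr_after = space.map("handle-B")  # different physical memory, same address
print(ptr_before == ptr_after)
```

Because the pointer value never changes, anything holding `ptr_before` (a PyTorch tensor, for instance) can keep using it once the range is mapped again.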
...
@@ -354,7 +358,7 @@ On commit, the server computes a hash of:
- All allocation IDs, sizes, and tags
- All metadata entries
On `remap_all_vas()`, this hash is checked:
- If match: Safe to remap (layout unchanged)
- If mismatch: Raise `StaleMemoryLayoutError` (must re-import)
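A minimal sketch of the check described above, assuming a SHA-256 over canonical JSON. The hash function, the field names (`id`, `size`, `tag`), and the helpers `layout_hash` and `check_layout` are illustrative assumptions; the server's actual encoding is not specified here, and `StaleMemoryLayoutError` is redefined locally as a stand-in.

```python
import hashlib
import json


class StaleMemoryLayoutError(RuntimeError):
    """Local stand-in for the error named above."""


def layout_hash(allocations, metadata) -> str:
    # Canonical form (sorted allocations, sorted keys) so equal layouts hash equally.
    payload = {
        "allocations": sorted([a["id"], a["size"], a["tag"]] for a in allocations),
        "metadata": metadata,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def check_layout(saved_hash, allocations, metadata) -> None:
    # Mismatch means the committed layout changed while the client was unmapped.
    if layout_hash(allocations, metadata) != saved_hash:
        raise StaleMemoryLayoutError("layout changed since last import; re-import required")


allocs = [{"id": "a1", "size": 4096, "tag": "weights"}]
meta = {"model": "demo"}
saved = layout_hash(allocs, meta)
check_layout(saved, allocs, meta)  # unchanged layout: returns without raising
```

Hashing a canonical serialization keeps the check independent of the order in which allocations were created.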
...
@@ -393,46 +397,121 @@ fd = fds[0] if fds else -1
### GMSClientMemoryManager
The API is organized in two tiers. **Tier 2 (convenience)** is what integrations normally use. **Tier 1 (atomic)** exposes individual operations for advanced callers.
```python
class GMSClientMemoryManager:
    def __init__(self, socket_path: str, *, device: int = 0): ...

    # Properties
    @property
    def mode(self) -> GrantedLockType: ...  # Actual granted mode

    def unmap_all_vas(self) -> None: ...  # Sync + unmap all, preserve VA reservations
    def remap_all_vas(self) -> None: ...  # Re-import at preserved VAs (checks layout hash)
    def reallocate_all_handles(self, tag="default") -> None: ...  # Fresh server handles for preserved VAs
    def close(self, free: bool = False) -> None: ...
```
## Limitations
1. **Single-GPU per server**: Each GMS server manages one GPU device.
2. **CUDA VMM required**: Requires a GPU with Virtual Memory Management support. Check at runtime via `CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED`; there is no guaranteed minimum compute capability.
GMS provides pre-built integrations for vLLM and SGLang. Enable GMS by passing `--load-format gms` when launching an engine.
### How It Works
When `--load-format gms` is set:
1. **A GMS server must already be running** for the target GPU device. The engine connects to it via a Unix socket derived from the GPU UUID.
2. The engine uses `RW_OR_RO` mode by default: the **first** process gets RW (loads weights from disk, commits to GMS), and **subsequent** processes get RO (import weights from GMS metadata).
3. Weights are managed by GMS; KV cache is managed by the framework's own allocator (e.g., vLLM's `CuMemAllocator`).
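The `RW_OR_RO` behavior in step 2 can be pictured with a small sketch. The rule used here (grant RW while nothing is committed, RO afterwards) is an assumption drawn from the description above; the `Lock` enum and `resolve_rw_or_ro` are hypothetical names, and the real server may decide differently.

```python
from enum import Enum


class Lock(Enum):
    RW = "rw"
    RO = "ro"
    RW_OR_RO = "rw_or_ro"


def resolve_rw_or_ro(weights_committed: bool) -> Lock:
    """Assumed rule: writer if nothing is committed yet, reader otherwise."""
    return Lock.RO if weights_committed else Lock.RW


committed = False
first = resolve_rw_or_ro(committed)   # first process: loads from disk, then commits
committed = True
second = resolve_rw_or_ro(committed)  # later process: imports the committed weights
print(first.name, second.name)
```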
#### vLLM
```bash
python -m dynamo.vllm \
  --model <model> \
  --load-format gms \
  --enable-sleep-mode \
  --gpu-memory-utilization 0.9
```
The integration uses a custom worker class (`GMSWorker`) that:
- Establishes the GMS connection early in `init_device()` so vLLM's `MemorySnapshot` can account for committed weights
- Registers a custom model loader (`GMSModelLoader`) for the `gms` load format
- Patches `torch.cuda.empty_cache` to avoid releasing GMS-managed memory
- Routes weight allocation through a `CUDAPluggableAllocator` backed by GMS
#### SGLang
```bash
python -m dynamo.sglang \
  --model-path <model> \
  --load-format gms \
  --enable-memory-saver \
  --mem-fraction-static 0.9
```
The integration patches `torch_memory_saver` to route weight operations through GMS:
- Weights (`"weights"` / `"model_weights"` tags) go through `GMSMemorySaverImpl`
- Other tags (e.g., `"kv_cache"`) are delegated to the default torch mempool implementation
- The `--enable-memory-saver` flag is required to activate the memory saver pathway
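The tag-based routing above amounts to a simple dispatch. The sketch below is illustrative only: `GMS_TAGS` and `backend_for` are hypothetical names, and the returned strings stand in for `GMSMemorySaverImpl` and the default torch mempool implementation.

```python
GMS_TAGS = {"weights", "model_weights"}  # tags listed above as GMS-managed


def backend_for(tag: str) -> str:
    """Return which (stand-in) implementation would handle this tag."""
    return "gms" if tag in GMS_TAGS else "default"


for tag in ("weights", "model_weights", "kv_cache"):
    print(tag, "->", backend_for(tag))
```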
### Shadow Engine Failover (Sleep / Wake)
Both integrations support releasing and reclaiming GPU memory for shadow engine patterns. The API names differ by framework:
- **vLLM**: `sleep` / `wake_up` (via `/engine/sleep` and `/engine/wake_up` HTTP endpoints)
- **SGLang**: `release_memory_occupation` / `resume_memory_occupation` (via the corresponding HTTP endpoints)
Under the hood, sleeping calls `unmap_all_vas()` + `disconnect()` to release GPU memory while preserving VA reservations, and waking calls `connect(RO)` + `remap_all_vas()` to re-import weights at the same virtual addresses. Tensor pointers remain valid, so no model re-initialization is needed.
This enables a shadow engine to release its GPU memory, let a primary engine use the GPU, and then reclaim the memory after the primary is killed.
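The call order described above can be sketched with a stub that only records what is invoked. `RecordingManager`, `sleep`, and `wake` are hypothetical helpers, not the real `GMSClientMemoryManager`, and the string `"RO"` is a placeholder for the read-only lock type.

```python
class RecordingManager:
    """Stub that records calls in order; performs no GPU work."""

    def __init__(self):
        self.calls = []

    def unmap_all_vas(self):
        self.calls.append("unmap_all_vas")

    def disconnect(self):
        self.calls.append("disconnect")

    def connect(self, mode):
        self.calls.append(f"connect({mode})")

    def remap_all_vas(self):
        self.calls.append("remap_all_vas")


def sleep(mgr):
    mgr.unmap_all_vas()  # release physical memory, keep VA reservations
    mgr.disconnect()     # drop the server connection


def wake(mgr):
    mgr.connect("RO")    # reconnect as a reader
    mgr.remap_all_vas()  # re-import weights at the preserved addresses


mgr = RecordingManager()
sleep(mgr)
wake(mgr)
print(mgr.calls)
```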
### Configuration via `model_loader_extra_config`
To force read-only mode (import only, never load from disk), pass `gms_read_only` via the framework's `--model-loader-extra-config` flag.
This forces `RO` lock mode instead of the default `RW_OR_RO` auto-detection. The engine will only import existing committed weights and fail if none are available.
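As a hedged illustration, the snippet below builds such a command-line fragment. It assumes the flag takes a JSON object and that `gms_read_only` is a boolean; neither detail is confirmed here, so check the framework's own documentation for the exact value format.

```python
import json
import shlex

# Assumption: the flag accepts a JSON object and the key takes a boolean.
extra_config = {"gms_read_only": True}

args = [
    "--load-format", "gms",
    "--model-loader-extra-config", json.dumps(extra_config),
]
command_fragment = shlex.join(args)  # shell-quoted, safe to paste into a launch command
print(command_fragment)
```

Serializing with `json.dumps` and quoting with `shlex.join` avoids hand-written quoting mistakes when the JSON is embedded in a shell command.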