The SGLang checkpoint engine integration loads model weights through a distributed checkpoint loading system. It significantly reduces model loading time, especially for large models and multi-node setups, by parallelizing weight loading across multiple processes and nodes.
## Overview
The checkpoint engine integration allows SGLang to:
- Load model weights in parallel using multiple processes
- Distribute weight loading across multiple nodes to increase effective disk bandwidth
- Overlap weight loading with other initialization tasks like CUDA graph capture
- Support both single-node and multi-node deployments
## Installation
First, install the checkpoint engine package:
```bash
pip install 'checkpoint-engine[p2p]'
```
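A quick import check confirms the installation before wiring it into SGLang (the Python module name `checkpoint_engine` is assumed from the package name):
```bash
# Should print the message without raising ImportError
python -c "import checkpoint_engine; print('checkpoint-engine OK')"
```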
## Architecture
The system consists of two main components:
1. **SGLang Server**: Runs with the `--wait-for-initial-weights` flag to wait for weights before becoming ready
2. **Checkpoint Engine Workers**: Separate processes (managed by torchrun) that load and distribute model weights
The checkpoint engine uses a parameter server architecture and supports three transfer modes, selected with `--update-method` (sketched just below):
- **Broadcast mode**: Weights are broadcast from loading processes to inference processes
- **P2P mode**: Direct peer-to-peer weight transfer between processes
- **All mode**: Combination of both broadcast and P2P methods
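A minimal sketch of switching modes; only the `--update-method` value changes relative to the broadcast examples later on this page (all flags are documented under Configuration Options below):
```bash
# Same launch as the broadcast examples below, but with P2P transfer
python -m sglang.srt.checkpoint_engine.update \
--update-method p2p \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```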
## Usage Examples
### Single Node Setup
**Terminal 1 - Launch SGLang Server:**
```bash
python -m sglang.launch_server \
--model-path Qwen/Qwen3-8B \
--tp 8 \
--load-format dummy \
--wait-for-initial-weights
```
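With `--wait-for-initial-weights`, the server holds off on serving until the checkpoint engine delivers weights. One way to watch for readiness is to poll SGLang's health endpoint; this sketch assumes the default port 30000 (adjust if you pass `--port`):
```bash
# Poll /health until the server responds with success
until curl -sf http://localhost:30000/health > /dev/null; do
  echo "waiting for initial weights..."
  sleep 5
done
echo "server is ready"
```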
**Terminal 2 - Run Checkpoint Engine:**
Using the sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
examples/checkpoint_engine/update.py \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```
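Once the update finishes and the server becomes ready, a quick smoke test against SGLang's `/generate` endpoint confirms that the loaded weights actually serve requests (default port 30000 assumed):
```bash
curl -s http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 8}}'
```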
### Multi-Node Setup (2 Nodes)
In this setup each node runs its own independent TP=8 server, while the checkpoint engine workers span both nodes so that each node reads only part of the weights from disk.
**Node 0:**
Launch SGLang server:
```bash
python -m sglang.launch_server \
--model-path Qwen/Qwen3-8B \
--tp 8 \
--load-format dummy \
--wait-for-initial-weights \
--host [IP]
```
Run checkpoint engine:
Using the sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
--nnodes 2 \
--node-rank 0 \
--master-addr [IP] \
--master-port 29500 \
examples/checkpoint_engine/update.py \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```
**Node 1:**
Launch SGLang server:
```bash
python -m sglang.launch_server \
--model-path Qwen/Qwen3-8B \
--tp 8 \
--load-format dummy \
--wait-for-initial-weights \
--host [IP]
```
Run checkpoint engine:
Using the sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
--nnodes 2 \
--node-rank 1 \
--master-addr [IP] \
--master-port 29500 \
examples/checkpoint_engine/update.py \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```
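If the torchrun launch above hangs at rendezvous, confirm from node 1 that node 0's master address and port (29500 in these commands) are reachable; `nc` is one quick check:
```bash
# Run from node 1; replace [IP] with node 0's address
nc -zv [IP] 29500
```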
### Multi-Node Setup with Tensor Parallelism (TP=16)
Here the two nodes form a single TP=16 server (`--tp 16` combined with `--nnodes 2` and `--dist-init-addr`), rather than two independent servers as above.
**Node 0:**
Launch SGLang server:
```bash
python -m sglang.launch_server \
--model-path Qwen/Qwen3-8B \
--tp 16 \
--load-format dummy \
--wait-for-initial-weights \
--host [IP] \
--dist-init-addr [IP]:9120 \
--nnodes 2 \
--node-rank 0
```
Run checkpoint engine:
Using the sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 16
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
--nnodes 2 \
--node-rank 0 \
--master-addr [IP] \
--master-port 29500 \
examples/checkpoint_engine/update.py \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 16
```
**Node 1:**
Launch SGLang server:
```bash
python -m sglang.launch_server \
--model-path Qwen/Qwen3-8B \
--tp 16 \
--load-format dummy \
--wait-for-initial-weights \
--host [IP] \
--dist-init-addr [IP]:9120 \
--nnodes 2 \
--node-rank 1
```
Run checkpoint engine:
Using the sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 16
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
--nnodes 2 \
--node-rank 1 \
--master-addr [IP] \
--master-port 29500 \
examples/checkpoint_engine/update.py \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 16
```
## Configuration Options
### SGLang Server Options
- `--load-format dummy`: Use dummy format for initial loading (allows overlapping with other tasks)
- `--wait-for-initial-weights`: Wait for checkpoint engine to provide weights before becoming ready
- `--host`: Host address for multi-node setups
- `--dist-init-addr`: Distributed initialization address for tensor parallelism
### Checkpoint Engine Options
- `--update-method`: Weight update method (`broadcast`, `p2p`, or `all`)
- `--checkpoint-path`: Path to model checkpoint directory
- `--inference-parallel-size`: Number of inference parallel processes
- `--endpoint`: SGLang server endpoint (default: `http://localhost:19730`)
- `--checkpoint-name`: Name for the checkpoint (default: `my-checkpoint-iter-0`)
- `--save-metas-file`: File to save checkpoint metadata
- `--load-metas-file`: File to load checkpoint metadata from
- `--uds`: Unix domain socket path for communication
- `--weight-version`: Version identifier for weights
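As a sketch of how these options combine (the values are illustrative, not required), a broadcast update that names the checkpoint, tags a weight version, and targets an explicit server endpoint might look like:
```bash
python -m sglang.srt.checkpoint_engine.update \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8 \
--endpoint http://localhost:19730 \
--checkpoint-name my-checkpoint-iter-0 \
--weight-version v1
```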
## Performance Benefits
The checkpoint engine saves time in two main ways:
1. **Multi-node loading**: Each node loads only a portion of the weights from disk, effectively multiplying the available disk bandwidth; the more nodes participate, the greater the speedup. Preliminary tests show a 20-second speedup when loading DeepSeek-R1 on H20-3e with two nodes.
2. **Single-process optimization**: Launching the server with `--load-format dummy` lets the disk-to-CPU transfer overlap with CUDA graph capture and other initialization tasks, saving additional time.
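To measure the benefit on your own hardware, wrapping the update command in the shell's `time` builtin gives a rough end-to-end number for the weight-loading phase:
```bash
# Compare this wall-clock time across single-node and multi-node runs
time python -m sglang.srt.checkpoint_engine.update \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```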
## Troubleshooting
- Ensure checkpoint engine package is installed: `pip install 'checkpoint-engine[p2p]'`
- Verify network connectivity between nodes in multi-node setups
- Check that the checkpoint path contains valid model files
- Monitor logs for connection errors between SGLang server and checkpoint engine
- Use `--sleep-time` parameter to add delays if needed for debugging