# Checkpoint Engine Integration

The SGLang checkpoint engine integration provides an efficient way to load model weights using a distributed checkpoint loading system. It significantly reduces model loading time, especially for large models and multi-node setups, by parallelizing weight loading across multiple processes and nodes.

## Overview

The checkpoint engine integration allows SGLang to:

- Load model weights in parallel using multiple processes
- Distribute weight loading across multiple nodes to increase effective disk bandwidth
- Overlap weight loading with other initialization tasks, such as CUDA graph capture
- Support both single-node and multi-node deployments

## Installation

First, install the checkpoint engine package:

```bash
pip install 'checkpoint-engine[p2p]'
```

## Architecture

The system consists of two main components:

1. **SGLang Server**: Runs with the `--wait-for-initial-weights` flag and waits for weights before becoming ready
2. **Checkpoint Engine Workers**: Separate processes (managed by torchrun) that load and distribute model weights

The checkpoint engine uses a parameter server architecture and supports three update modes:

- **Broadcast mode**: Weights are broadcast from loading processes to inference processes
- **P2P mode**: Direct peer-to-peer weight transfer between processes
- **All mode**: Combination of both broadcast and P2P methods

## Usage Examples

### Single Node Setup

**Terminal 1 - Launch SGLang Server:**

```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights
```

**Terminal 2 - Run Checkpoint Engine:**

Using the sglang entrypoint:

```bash
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```

Using torchrun directly:

```bash
torchrun --nproc-per-node 8 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```

### Multi-Node Setup (2 Nodes)

**Node 0:**

Launch the SGLang server:

```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP]
```

Run the checkpoint engine, either through the sglang entrypoint (recommended):

```bash
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```

or through torchrun directly:

```bash
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 0 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```

**Node 1:**

Launch the SGLang server:

```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP]
```

Run the checkpoint engine, either through the sglang entrypoint (recommended):

```bash
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```

or through torchrun directly:

```bash
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 1 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
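Whichever setup you use, the server only starts serving once the checkpoint engine has delivered the weights. A quick way to verify this from the command line is sketched below; it assumes the server's HTTP port matches the checkpoint engine's `--endpoint` (default `http://localhost:19730`, see Configuration Options below), so adjust the URL if your server listens elsewhere.

```bash
# Poll until the server reports healthy; it stays unready while
# --wait-for-initial-weights is still pending.
until curl -sf http://localhost:19730/health; do
    sleep 5
done

# Send a small generation request to confirm that real weights have
# replaced the dummy-initialized ones.
curl -s http://localhost:19730/generate \
    -H 'Content-Type: application/json' \
    -d '{"text": "Hello, my name is", "sampling_params": {"max_new_tokens": 8}}'
```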
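The examples above all use broadcast mode. The `p2p` method can reuse checkpoint metadata across runs via the `--save-metas-file` and `--load-metas-file` options listed under Configuration Options below. The following single-node sketch assumes the metadata saved by an initial run can seed a later p2p update; the metas file path and the exact flag combination are illustrative, not prescribed:

```bash
# Initial run: register the checkpoint and save its metadata
# (/tmp/qwen3-metas.pkl is an illustrative path).
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8 \
    --checkpoint-name my-checkpoint-iter-0 \
    --save-metas-file /tmp/qwen3-metas.pkl

# Later run: load the saved metadata and transfer weights peer-to-peer.
python -m sglang.srt.checkpoint_engine.update \
    --update-method p2p \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8 \
    --load-metas-file /tmp/qwen3-metas.pkl
```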
### Multi-Node Setup with Tensor Parallelism (TP=16)

**Node 0:**

Launch the SGLang server (note that `--tp 16` is the total tensor-parallel size across both nodes):

```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 16 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP] \
    --dist-init-addr [IP]:9120 \
    --nnodes 2 \
    --node-rank 0
```

Run the checkpoint engine, either through the sglang entrypoint (recommended):

```bash
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```

or through torchrun directly:

```bash
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 0 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```

**Node 1:**

Launch the SGLang server:

```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 16 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP] \
    --dist-init-addr [IP]:9120 \
    --nnodes 2 \
    --node-rank 1
```

Run the checkpoint engine, either through the sglang entrypoint (recommended):

```bash
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```

or through torchrun directly:

```bash
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 1 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```

## Configuration Options

### SGLang Server Options

- `--load-format dummy`: Initialize with dummy weights so that loading the real weights can overlap with other startup tasks
- `--wait-for-initial-weights`: Wait for the checkpoint engine to provide weights before the server reports ready
- `--host`: Host address for multi-node setups
- `--dist-init-addr`: Distributed initialization address for multi-node tensor parallelism

### Checkpoint Engine Options

- `--update-method`: Weight update method (`broadcast`, `p2p`, or `all`)
- `--checkpoint-path`: Path to the model checkpoint directory
- `--inference-parallel-size`: Number of inference parallel processes
- `--endpoint`: SGLang server endpoint (default: `http://localhost:19730`)
- `--checkpoint-name`: Name for the checkpoint (default: `my-checkpoint-iter-0`)
- `--save-metas-file`: File in which to save checkpoint metadata
- `--load-metas-file`: File from which to load checkpoint metadata
- `--uds`: Unix domain socket path for communication
- `--weight-version`: Version identifier for the weights

## Performance Benefits

The checkpoint engine saves time in two main ways:

1. **Multi-node loading**: Each node reads only a portion of the weights from disk, which multiplies the effective disk bandwidth; the more nodes participate, the greater the speedup. Preliminary tests show about a 20-second reduction in load time for DeepSeek-R1 on H20-3e with two nodes.
2. **Single-process optimization**: Starting the server with `--load-format dummy` lets the disk-to-CPU weight transfer overlap with CUDA graph capture and other initialization tasks, saving additional time.

## Troubleshooting

- Ensure the checkpoint engine package is installed: `pip install 'checkpoint-engine[p2p]'`
- Verify network connectivity between nodes in multi-node setups
- Check that the checkpoint path contains valid model files
- Monitor the logs for connection errors between the SGLang server and the checkpoint engine
- Use the `--sleep-time` parameter to add delays if needed for debugging

## References

- [Checkpoint Engine Repository](https://github.com/MoonshotAI/checkpoint-engine)