Unverified Commit 7f028b07 authored by Philip Kiely - Baseten, committed by GitHub

Fix formatting in long code blocks (#10528)

parent 0abb41c7
@@ -29,51 +29,93 @@ The "❌" and "✅" symbols in the table above under "Page Size > 1" indicate wh
- FlashInfer (Default for Non-Hopper Machines, e.g., A100, A40)
```bash
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --attention-backend flashinfer
python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-V3 --attention-backend flashinfer --trust-remote-code
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--attention-backend flashinfer
python3 -m sglang.launch_server \
--tp 8 \
--model deepseek-ai/DeepSeek-V3 \
--attention-backend flashinfer \
--trust-remote-code
```
- FlashAttention 3 (Default for Hopper Machines, e.g., H100, H200, H20)
```bash
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --attention-backend fa3
python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-V3 --trust-remote-code --attention-backend fa3
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--attention-backend fa3
python3 -m sglang.launch_server \
--tp 8 \
--model deepseek-ai/DeepSeek-V3 \
--trust-remote-code \
--attention-backend fa3
```
- Triton
```bash
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --attention-backend triton
python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-V3 --attention-backend triton --trust-remote-code
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--attention-backend triton
python3 -m sglang.launch_server \
--tp 8 \
--model deepseek-ai/DeepSeek-V3 \
--attention-backend triton \
--trust-remote-code
```
- Torch Native
```bash
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --attention-backend torch_native
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--attention-backend torch_native
```
- FlashMLA
```bash
python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-R1 --attention-backend flashmla --trust-remote-code
python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-R1 --attention-backend flashmla --kv-cache-dtype fp8_e4m3 --trust-remote-code
python3 -m sglang.launch_server \
--tp 8 \
--model deepseek-ai/DeepSeek-R1 \
--attention-backend flashmla \
--trust-remote-code
python3 -m sglang.launch_server \
--tp 8 \
--model deepseek-ai/DeepSeek-R1 \
--attention-backend flashmla \
--kv-cache-dtype fp8_e4m3 \
--trust-remote-code
```
- TRTLLM MLA (Optimized for Blackwell Architecture, e.g., B200)
```bash
python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-R1 --attention-backend trtllm_mla --trust-remote-code
python3 -m sglang.launch_server \
--tp 8 \
--model deepseek-ai/DeepSeek-R1 \
--attention-backend trtllm_mla \
--trust-remote-code
```
- TRTLLM MLA with FP8 KV Cache (Higher concurrency, lower memory footprint)
```bash
python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-R1 --attention-backend trtllm_mla --kv-cache-dtype fp8_e4m3 --trust-remote-code
python3 -m sglang.launch_server \
--tp 8 \
--model deepseek-ai/DeepSeek-R1 \
--attention-backend trtllm_mla \
--kv-cache-dtype fp8_e4m3 \
--trust-remote-code
```
- Ascend
```bash
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --attention-backend ascend
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--attention-backend ascend
```
- Wave
```bash
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --attention-backend wave
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--attention-backend wave
```
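Whichever backend you launch with, a quick way to confirm the server is up is to hit its health and generate endpoints once loading finishes. This is a minimal sketch assuming the default host and port (`127.0.0.1:30000`); adjust it if you passed `--host` or `--port`.
```bash
# Basic liveness check (assumes the default port 30000).
curl http://127.0.0.1:30000/health

# Small test generation against the native /generate endpoint.
curl -s http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16}}'
```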
## Steps to add a new attention backend
......
@@ -34,22 +34,88 @@ uv pip install mooncake-transfer-engine
### Llama Single Node
```bash
$ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill --disaggregation-ib-device mlx5_roce0
$ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30001 --base-gpu-id 1 --disaggregation-ib-device mlx5_roce0
$ python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode prefill \
--disaggregation-ib-device mlx5_roce0
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode decode \
--port 30001 \
--base-gpu-id 1 \
--disaggregation-ib-device mlx5_roce0
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://127.0.0.1:30000 \
--decode http://127.0.0.1:30001 \
--host 0.0.0.0 \
--port 8000
```
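Once the prefill worker, decode worker, and router are all running, client traffic goes to the router on port 8000. A minimal smoke test, assuming the router exposes the standard OpenAI-compatible chat completions endpoint:
```bash
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 32
      }'
```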
### DeepSeek Multi-Node
```bash
# prefill 0
$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-ib-device ${device_name} --disaggregation-mode prefill --host ${local_ip} --port 30000 --trust-remote-code --dist-init-addr ${prefill_master_ip}:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-ib-device ${device_name} \
--disaggregation-mode prefill \
--host ${local_ip} \
--port 30000 \
--trust-remote-code \
--dist-init-addr ${prefill_master_ip}:5000 \
--nnodes 2 \
--node-rank 0 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8
# prefill 1
$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-ib-device ${device_name} --disaggregation-mode prefill --host ${local_ip} --port 30000 --trust-remote-code --dist-init-addr ${prefill_master_ip}:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-ib-device ${device_name} \
--disaggregation-mode prefill \
--host ${local_ip} \
--port 30000 \
--trust-remote-code \
--dist-init-addr ${prefill_master_ip}:5000 \
--nnodes 2 \
--node-rank 1 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8
# decode 0
$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-ib-device ${device_name} --disaggregation-mode decode --host ${local_ip} --port 30001 --trust-remote-code --dist-init-addr ${decode_master_ip}:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8 --max-running-requests 128
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-ib-device ${device_name} \
--disaggregation-mode decode \
--host ${local_ip} \
--port 30001 \
--trust-remote-code \
--dist-init-addr ${decode_master_ip}:5000 \
--nnodes 2 \
--node-rank 0 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8 \
--max-running-requests 128
# decode 1
$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-ib-device ${device_name} --disaggregation-mode decode --host ${local_ip} --port 30001 --trust-remote-code --dist-init-addr ${decode_master_ip}:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8 --max-running-requests 128
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-ib-device ${device_name} \
--disaggregation-mode decode \
--host ${local_ip} \
--port 30001 \
--trust-remote-code \
--dist-init-addr ${decode_master_ip}:5000 \
--nnodes 2 \
--node-rank 1 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8 \
--max-running-requests 128
```
### Advanced Configuration
@@ -98,22 +164,88 @@ pip install . --config-settings=setup-args="-Ducx_path=/path/to/ucx"
### Llama Single Node
```bash
$ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill --disaggregation-transfer-backend nixl
$ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30001 --base-gpu-id 1 --disaggregation-transfer-backend nixl
$ python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode decode \
--port 30001 \
--base-gpu-id 1 \
--disaggregation-transfer-backend nixl
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://127.0.0.1:30000 \
--decode http://127.0.0.1:30001 \
--host 0.0.0.0 \
--port 8000
```
### DeepSeek Multi-Node
```bash
# prefill 0
$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-transfer-backend nixl --disaggregation-mode prefill --host ${local_ip} --port 30000 --trust-remote-code --dist-init-addr ${prefill_master_ip}:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-transfer-backend nixl \
--disaggregation-mode prefill \
--host ${local_ip} \
--port 30000 \
--trust-remote-code \
--dist-init-addr ${prefill_master_ip}:5000 \
--nnodes 2 \
--node-rank 0 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8
# prefill 1
$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-transfer-backend nixl --disaggregation-mode prefill --host ${local_ip} --port 30000 --trust-remote-code --dist-init-addr ${prefill_master_ip}:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-transfer-backend nixl \
--disaggregation-mode prefill \
--host ${local_ip} \
--port 30000 \
--trust-remote-code \
--dist-init-addr ${prefill_master_ip}:5000 \
--nnodes 2 \
--node-rank 1 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8
# decode 0
$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-transfer-backend nixl --disaggregation-mode decode --host ${local_ip} --port 30001 --trust-remote-code --dist-init-addr ${decode_master_ip}:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8 --max-running-requests 128
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-transfer-backend nixl \
--disaggregation-mode decode \
--host ${local_ip} \
--port 30001 \
--trust-remote-code \
--dist-init-addr ${decode_master_ip}:5000 \
--nnodes 2 \
--node-rank 0 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8 \
--max-running-requests 128
# decode 1
$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-transfer-backend nixl --disaggregation-mode decode --host ${local_ip} --port 30001 --trust-remote-code --dist-init-addr ${decode_master_ip}:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8 --max-running-requests 128
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-transfer-backend nixl \
--disaggregation-mode decode \
--host ${local_ip} \
--port 30001 \
--trust-remote-code \
--dist-init-addr ${decode_master_ip}:5000 \
--nnodes 2 \
--node-rank 1 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8 \
--max-running-requests 128
```
## ASCEND
@@ -135,16 +267,44 @@ export ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE=true
### Llama Single Node
```bash
$ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill --disaggregation-transfer-backend ascend
$ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30001 --base-gpu-id 1 --disaggregation-transfer-backend ascend
$ python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode prefill \
--disaggregation-transfer-backend ascend
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode decode \
--port 30001 \
--base-gpu-id 1 \
--disaggregation-transfer-backend ascend
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://127.0.0.1:30000 \
--decode http://127.0.0.1:30001 \
--host 0.0.0.0 \
--port 8000
```
### DeepSeek Multi-Node
```bash
# prefill 0
$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-transfer-backend ascend --disaggregation-mode prefill --host ${local_ip} --port 30000 --trust-remote-code --dist-init-addr ${prefill_master_ip}:5000 --nnodes 1 --node-rank 0 --tp-size 16
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-transfer-backend ascend \
--disaggregation-mode prefill \
--host ${local_ip} \
--port 30000 \
--trust-remote-code \
--dist-init-addr ${prefill_master_ip}:5000 \
--nnodes 1 \
--node-rank 0 \
--tp-size 16
# decode 0
$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-transfer-backend ascend --disaggregation-mode decode --host ${local_ip} --port 30001 --trust-remote-code --dist-init-addr ${decode_master_ip}:5000 --nnodes 1 --node-rank 0 --tp-size 16
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-transfer-backend ascend \
--disaggregation-mode decode \
--host ${local_ip} \
--port 30001 \
--trust-remote-code \
--dist-init-addr ${decode_master_ip}:5000 \
--nnodes 1 \
--node-rank 0 \
--tp-size 16
```
@@ -43,10 +43,20 @@ You can find all arguments by `python3 -m sglang.launch_server --help`
```bash
# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-8B-Instruct \
--tp 4 \
--dist-init-addr sgl-dev-0:50000 \
--nnodes 2 \
--node-rank 0
# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-8B-Instruct \
--tp 4 \
--dist-init-addr sgl-dev-0:50000 \
--nnodes 2 \
--node-rank 1
```
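With both nodes up, requests are typically served over HTTP from the node launched with `--node-rank 0` (default port 30000 unless `--port` is set). A quick sanity check might look like the sketch below; replace `sgl-dev-0` with a hostname or IP reachable from the client.
```bash
# Hypothetical test request against the rank-0 node.
curl -s http://sgl-dev-0:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 8}}'
```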
Please consult the documentation below and [server_args.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py) to learn more about the arguments you may provide when launching a server.
......
@@ -158,7 +158,14 @@ The precompilation process typically takes around 10 minutes to complete.
**Usage**:
Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
```
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 --trust-remote-code --tp 8
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--speculative-algorithm EAGLE \
--speculative-num-steps 1 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 2 \
--trust-remote-code \
--tp 8
```
- The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` can be searched with the [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for a given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can achieve a speedup for larger batch sizes.
- The FlashAttention 3, FlashMLA, and Triton backends fully support MTP usage. For the FlashInfer backend (`--attention-backend flashinfer`) with speculative decoding, the `--speculative-eagle-topk` parameter should be set to `1`, as in the sketch below. MTP support for the CutlassMLA and TRTLLM MLA backends is still under development.
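As a concrete example, a launch that pairs the FlashInfer backend with MTP under that constraint could look like the following sketch; it only combines flags already shown above and is not an officially tuned configuration.
```bash
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--attention-backend flashinfer \
--speculative-algorithm EAGLE \
--speculative-num-steps 1 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 2 \
--trust-remote-code \
--tp 8
```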
@@ -177,7 +184,14 @@ See [Reasoning Parser](https://docs.sglang.ai/advanced_features/separate_reasoni
Add the arguments `--tool-call-parser deepseekv3` and `--chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja` (recommended) to enable this feature. For example (running on one H20 node):
```
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-0324 --tp 8 --port 30000 --host 0.0.0.0 --mem-fraction-static 0.9 --tool-call-parser deepseekv3 --chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja
python3 -m sglang.launch_server \
--model deepseek-ai/DeepSeek-V3-0324 \
--tp 8 \
--port 30000 \
--host 0.0.0.0 \
--mem-fraction-static 0.9 \
--tool-call-parser deepseekv3 \
--chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja
```
Sample Request:
......
@@ -43,7 +43,12 @@ export PYTHON_EXECUTION_BACKEND=UV
Launch the server with the demo tool server:
`python3 -m sglang.launch_server --model-path openai/gpt-oss-120b --tool-server demo --tp 2`
```bash
python3 -m sglang.launch_server \
--model-path openai/gpt-oss-120b \
--tool-server demo \
--tp 2
```
For production usage, sglang can act as an MCP client for multiple services. An [example tool server](https://github.com/openai/gpt-oss/tree/main/gpt-oss-mcp-server) is provided. Start the servers and point sglang to them:
```bash
......
@@ -11,7 +11,10 @@ Ongoing optimizations are tracked in the [Roadmap](https://github.com/sgl-projec
To serve Llama 4 models on 8xH100/H200 GPUs:
```bash
python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --tp 8 --context-length 1000000
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tp 8 \
--context-length 1000000
```
### Configuration Tips
@@ -29,7 +32,16 @@ python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-In
**Usage**:
Add arguments `--speculative-draft-model-path`, `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
```
python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct --speculative-algorithm EAGLE3 --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --trust-remote-code --tp 8 --context-length 1000000
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--trust-remote-code \
--tp 8 \
--context-length 1000000
```
- **Note**: The Llama 4 draft model *nvidia/Llama-4-Maverick-17B-128E-Eagle3* can only recognize conversations in chat mode; see the sample request below.
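Accordingly, it makes sense to exercise this setup through the chat completions endpoint so the chat template is applied. A minimal sketch, assuming the server runs locally on the default port 30000:
```bash
curl -s http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
        "messages": [{"role": "user", "content": "Summarize tensor parallelism in one sentence."}],
        "max_tokens": 64
      }'
```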
@@ -50,11 +62,21 @@ Commands:
```bash
# Llama-4-Scout-17B-16E-Instruct model
python -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --port 30000 --tp 8 --mem-fraction-static 0.8 --context-length 65536
python -m sglang.launch_server \
--model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
--port 30000 \
--tp 8 \
--mem-fraction-static 0.8 \
--context-length 65536
lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0
# Llama-4-Maverick-17B-128E-Instruct
python -m sglang.launch_server --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct --port 30000 --tp 8 --mem-fraction-static 0.8 --context-length 65536
python -m sglang.launch_server \
--model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--port 30000 \
--tp 8 \
--mem-fraction-static 0.8 \
--context-length 65536
lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0
```
......
@@ -21,7 +21,13 @@ python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --tp 4
Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
``` bash
python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --tp 4 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-algo NEXTN
python3 -m sglang.launch_server \
--model Qwen/Qwen3-Next-80B-A3B-Instruct \
--tp 4 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-algo NEXTN
```
Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/10233).
@@ -258,7 +258,10 @@ Detailed example in [structured outputs](../advanced_features/structured_outputs
Launch a server with the `--enable-custom-logit-processor` flag enabled.
```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --enable-custom-logit-processor
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-8B-Instruct \
--port 30000 \
--enable-custom-logit-processor
```
Define a custom logit processor that will always sample a specific token id.
......
@@ -8,7 +8,10 @@ It should just work for most official models such as Llama-2/Llama-3.
If needed, you can also override the chat template when launching the server:
```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
python -m sglang.launch_server \
--model-path meta-llama/Llama-2-7b-chat-hf \
--port 30000 \
--chat-template llama-2
```
If the chat template you are looking for is missing, you are welcome to contribute it or load it from a file.
@@ -30,7 +33,10 @@ You can load the JSON format, which is defined by `conversation.py`.
```
```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json
python -m sglang.launch_server \
--model-path meta-llama/Llama-2-7b-chat-hf \
--port 30000 \
--chat-template ./my_model_template.json
```
## Jinja Format
@@ -38,5 +44,8 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
You can also use the [Jinja template format](https://huggingface.co/docs/transformers/main/en/chat_templating) as defined by Hugging Face Transformers.
```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.jinja
python -m sglang.launch_server \
--model-path meta-llama/Llama-2-7b-chat-hf \
--port 30000 \
--chat-template ./my_model_template.jinja
```
@@ -7,9 +7,19 @@
```bash
# replace 172.16.4.52:20000 with your own node ip address and port of the first node
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-405B-Instruct \
--tp 16 \
--dist-init-addr 172.16.4.52:20000 \
--nnodes 2 \
--node-rank 0
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-405B-Instruct \
--tp 16 \
--dist-init-addr 172.16.4.52:20000 \
--nnodes 2 \
--node-rank 1
```
Note that Llama 405B (fp8) can also be launched on a single node.
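For reference, a single-node fp8 launch could look like the sketch below; the FP8 checkpoint name is an assumption, so substitute whichever 405B FP8 weights you actually use.
```bash
# Hypothetical single-node launch of an FP8 checkpoint on 8 GPUs.
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
--tp 8
```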
......
@@ -139,7 +139,10 @@ This section describes how to set up the monitoring stack (Prometheus + Grafana)
1. **Start your SGLang server with metrics enabled:**
```bash
python -m sglang.launch_server --model-path <your_model_path> --port 30000 --enable-metrics
python -m sglang.launch_server \
--model-path <your_model_path> \
--port 30000 \
--enable-metrics
```
Replace `<your_model_path>` with the actual path to your model (e.g., `meta-llama/Meta-Llama-3.1-8B-Instruct`). Ensure the server is accessible from the monitoring stack (you might need `--host 0.0.0.0` if running in Docker). By default, the metrics endpoint will be available at `http://<sglang_server_host>:30000/metrics`.
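You can verify the endpoint directly before wiring up Prometheus; a minimal check, assuming the server runs locally on the default port:
```bash
curl http://localhost:30000/metrics
```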
@@ -212,6 +215,17 @@ You can customize the setup by modifying these files. For instance, you might ne
#### Check if the metrics are being collected
Run `python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5` to generate some requests.
Run:
```
python3 -m sglang.bench_serving \
--backend sglang \
--dataset-name random \
--num-prompts 3000 \
--random-input 1024 \
--random-output 1024 \
--random-range-ratio 0.5
```
to generate some requests.
Then you should be able to see the metrics in the Grafana dashboard.