Unverified Commit 76ca91df authored by Chayenne, committed by GitHub

Docs/CI: Enable Fake Finish for Docs Only PR (#3350)

parent cdae77b0
...@@ -6,11 +6,13 @@ on:
paths:
- "python/sglang/**"
- "test/**"
- "docs/**"
pull_request:
branches: [ main ]
paths:
- "python/sglang/**"
- "test/**"
- "docs/**"
workflow_dispatch:
inputs:
version:
...@@ -27,9 +29,38 @@ concurrency:
cancel-in-progress: true
jobs:
filter:
runs-on: ubuntu-latest
outputs:
run_tests: ${{ steps.set_run_tests.outputs.run_tests }}
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Filter changes
id: filter
uses: dorny/paths-filter@v2
with:
filters: |
docs:
- 'docs/**'
sglang:
- 'python/sglang/**'
test:
- 'test/**'
- name: Set run_tests output
id: set_run_tests
run: |
if [ "${{ steps.filter.outputs.sglang }}" == "true" ] || [ "${{ steps.filter.outputs.test }}" == "true" ]; then
echo "run_tests=true" >> $GITHUB_OUTPUT
else
echo "run_tests=false" >> $GITHUB_OUTPUT
fi
unit-test-frontend:
needs: filter
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
github.event.pull_request.draft == false &&
needs.filter.outputs.run_tests == 'true'
runs-on: 1-gpu-runner
steps:
- name: Checkout code
...@@ -48,7 +79,10 @@ jobs:
python3 run_suite.py --suite per-commit
unit-test-backend-1-gpu:
needs: filter
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
github.event.pull_request.draft == false &&
needs.filter.outputs.run_tests == 'true'
runs-on: 1-gpu-runner
strategy:
fail-fast: false
...@@ -70,13 +104,14 @@ jobs:
RANGE=${{ matrix.range }}
range_begin=${RANGE%-*}
range_end=${RANGE#*-}
cd test/srt
python3 run_suite.py --suite per-commit --range-begin ${range_begin} --range-end ${range_end}
unit-test-backend-2-gpu:
needs: filter
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
github.event.pull_request.draft == false &&
needs.filter.outputs.run_tests == 'true'
runs-on: 2-gpu-runner
steps:
- name: Checkout code
...@@ -113,7 +148,10 @@ jobs:
python3 test_moe_ep.py
performance-test-1-gpu-part-1:
needs: filter
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
github.event.pull_request.draft == false &&
needs.filter.outputs.run_tests == 'true'
runs-on: 1-gpu-runner
steps:
- name: Checkout code
...@@ -155,9 +193,11 @@ jobs:
cd test/srt
python3 -m unittest test_bench_serving.TestBenchServing.test_online_latency_eagle
performance-test-1-gpu-part-2:
needs: filter
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
github.event.pull_request.draft == false &&
needs.filter.outputs.run_tests == 'true'
runs-on: 1-gpu-runner
steps:
- name: Checkout code
...@@ -188,7 +228,10 @@ jobs:
python3 -m unittest test_bench_serving.TestBenchServing.test_offline_throughput_default_fp8
performance-test-2-gpu:
needs: filter
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
github.event.pull_request.draft == false &&
needs.filter.outputs.run_tests == 'true'
runs-on: 2-gpu-runner
steps:
- name: Checkout code
...@@ -224,9 +267,11 @@ jobs:
cd test/srt
python3 -m unittest test_bench_serving.TestBenchServing.test_moe_offline_throughput_without_radix_cache
accuracy-test-1-gpu:
needs: filter
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
github.event.pull_request.draft == false &&
needs.filter.outputs.run_tests == 'true'
runs-on: 1-gpu-runner
steps:
- name: Checkout code
...@@ -237,7 +282,6 @@ jobs:
FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu124/torch2.5/flashinfer' || 'https://flashinfer.ai/whl/cu124/torch2.5/flashinfer' }}
run: |
bash scripts/ci_install_dependency.sh
git clone https://github.com/merrymercy/human-eval.git
cd human-eval
pip install -e .
...@@ -248,9 +292,11 @@ jobs:
cd test/srt
python3 test_eval_accuracy_large.py
accuracy-test-2-gpu:
needs: filter
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
github.event.pull_request.draft == false &&
needs.filter.outputs.run_tests == 'true'
runs-on: 2-gpu-runner
steps:
- name: Checkout code
...@@ -261,7 +307,6 @@ jobs:
FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu124/torch2.5/flashinfer' || 'https://flashinfer.ai/whl/cu124/torch2.5/flashinfer' }}
run: |
bash scripts/ci_install_dependency.sh
git clone https://github.com/merrymercy/human-eval.git
cd human-eval
pip install -e .
...@@ -272,8 +317,8 @@ jobs:
cd test/srt
python3 test_moe_eval_accuracy_large.py
finish:
if: always()
needs: [
unit-test-frontend, unit-test-backend-1-gpu, unit-test-backend-2-gpu,
performance-test-1-gpu-part-1, performance-test-1-gpu-part-2, performance-test-2-gpu,
...
...@@ -60,7 +60,7 @@ The core features include:
references/accuracy_evaluation.md
references/custom_chat_template.md
references/deepseek.md
references/multi_node.md
references/modelscope.md
references/contribution_guide.md
references/troubleshooting.md
...
# DeepSeek Model Usage and Optimizations
SGLang provides several optimizations specifically designed for the DeepSeek model to boost its inference speed. This document outlines current optimizations for DeepSeek. Additionally, the SGLang team is actively developing enhancements for [DeepSeek V3](https://github.com/sgl-project/sglang/issues/2591).
## Launch DeepSeek V3 with SGLang
SGLang is recognized as one of the top engines for [DeepSeek model inference](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3).
### Download Weights
If you encounter errors when starting the server, make sure the weights have finished downloading. It is recommended to download them beforehand, or to restart the server until all weights are downloaded. Please refer to the [installation and launch instructions](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#installation--launch) and the [DeepSeek V3 official guide](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#61-inference-with-deepseek-infer-demo-example-only) for downloading the weights.
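One way to pre-download the weights is with the Hugging Face CLI; this is only a sketch, and the local directory below is an arbitrary example:

```bash
# Install the Hugging Face CLI and pre-fetch the DeepSeek V3 weights
pip install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir /data/DeepSeek-V3
```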
### Launch with One Node of 8 H200
Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended). **Note that DeepSeek V3 is already in FP8, so it should not be run with quantization arguments such as `--quantization fp8 --kv-cache-dtype fp8_e5m2`.** Also, `--enable-dp-attention` can be useful to improve throughput for DeepSeek V3/R1. Please refer to [Data Parallelism Attention](https://docs.sglang.ai/references/deepseek.html#multi-head-latent-attention-mla-throughput-optimizations) for details.
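As a minimal launch sketch (assuming the weights are referenced by the Hugging Face model id `deepseek-ai/DeepSeek-V3`; substitute a local path if you downloaded them manually):

```bash
# Launch DeepSeek V3 on a single node, tensor-parallel across 8 GPUs
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
```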
### Running Examples on Multi-Node
- [Serving with two H20*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes).
- [Serving with two H200*8 nodes and docker](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker).
## Optimizations
### Multi-head Latent Attention (MLA) Throughput Optimizations
**Description**: [MLA](https://arxiv.org/pdf/2405.04434) is an innovative attention mechanism introduced by the DeepSeek team, aimed at improving inference efficiency. SGLang has implemented specific optimizations for this, including:
- **Weight Absorption**: By applying the associative law of matrix multiplication to reorder computation steps, this method balances computation and memory access and improves efficiency in the decoding phase.
- **Triton Decoding Kernel Optimization**: In the MLA decoding kernel, there is only one KV head. This optimization reduces memory access to the KV cache by processing multiple query heads within one block, accelerating the decoding process.
- **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enable efficient FP8 inference. Additionally, we have implemented a Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
...@@ -24,7 +44,7 @@ Overall, with these optimizations, we have achieved up to a 7x acceleration in o
**Reference**: Check [Blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [Slides](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/lmsys_1st_meetup_deepseek_mla.pdf) for more details.
### Data Parallelism Attention
**Description**: This optimization involves data parallelism (DP) for the MLA attention mechanism of DeepSeek Series Models, which allows for a significant reduction in the KV cache size, enabling larger batch sizes. Each DP worker independently handles different types of batches (prefill, decode, idle), which are then synchronized before and after processing through the Mixture-of-Experts (MoE) layer.
...@@ -40,17 +60,18 @@ Overall, with these optimizations, we have achieved up to a 7x acceleration in o
**Reference**: Check [Blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models).
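As a hedged usage sketch (the tensor-parallel and data-parallel sizes below are illustrative for an 8-GPU node), DP attention is enabled with `--enable-dp-attention` together with a matching `--dp` size:

```bash
# Enable data parallelism attention: each DP worker keeps its own, smaller KV cache
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --dp 8 --enable-dp-attention --trust-remote-code
```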
### Multi Node Tensor Parallelism
**Description**: For users with limited memory on a single node, SGLang supports serving DeepSeek Series Models, including DeepSeek V3, across multiple nodes using tensor parallelism. This approach partitions the model parameters across multiple GPUs or nodes to handle models that are too large for one node's memory.
**Usage**: Check [here](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208) for usage examples.
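A minimal two-node sketch, assuming 8 GPUs per node and a placeholder address `10.0.0.1:5000` for the first node:

```bash
# On node 0 (replace 10.0.0.1:5000 with the first node's reachable IP and a free port)
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# On node 1 (use the same --dist-init-addr as node 0)
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
```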
### Block-wise FP8
**Description**: SGLang implements block-wise FP8 quantization with two key optimizations:
- **Activation**: E4M3 format using per-token-per-128-channel sub-vector scales with online casting.
- **Weight**: Per-128x128-block quantization for better numerical stability.
**Usage**: Turned on by default for DeepSeek V3 models.
# Run Multi-Node Inference
## Llama 3.1 405B
**Run 405B (fp16) on Two Nodes**
```bash
# replace 172.16.4.52:20000 with your own node ip address and port of the first node
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0
# replace 172.18.45.52:20000 with your own node ip address and port of the second node
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr 172.18.45.52:20000 --nnodes 2 --node-rank 1
```
Note that Llama 405B (fp8) can also be launched on a single node.
```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
```
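Once a server is up, it can be smoke-tested from node rank 0; the sketch below assumes the default port `30000`:

```bash
# Send a small generation request to the SGLang native /generate endpoint (default port 30000)
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
```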
## DeepSeek V3/R1
Please refer to the [DeepSeek documentation](https://docs.sglang.ai/references/deepseek.html#running-examples-on-multi-node) for reference.