"vscode:/vscode.git/clone" did not exist on "98a2b3d2d8a6c78da9d63d5911ba3300e2c00ce3"
Unverified Commit 5cc11705 authored by Chayenne, committed by GitHub

Doc: add block-wise FP8 in dpsk model reference (#2830)

parent 11fffbc9
...@@ -9,11 +9,14 @@ SGLang provides several optimizations specifically designed for the DeepSeek mod
- **Weight Absorption**: By applying the associative law of matrix multiplication to reorder computation steps, this method balances computation and memory access and improves efficiency in the decoding phase (a short NumPy sketch of this reordering is shown below).
- **Triton Decoding Kernel Optimization**: In the MLA decoding kernel, there is only one KV head. This optimization reduces memory access to the KV cache by processing multiple query heads within one block, accelerating the decoding process.
- **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enable efficient FP8 inference. Additionally, we have implemented a Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
- **CUDA Graph & Torch.compile**: Both MLA and Mixture of Experts (MoE) are compatible with CUDA Graph and Torch.compile, which reduces latency and accelerates decoding speed for small batch sizes.
Overall, with these optimizations, we have achieved up to a 7x acceleration in output throughput compared to the previous version.
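The weight-absorption trick is plain associativity: scoring the query against up-projected latent KV vectors gives the same result as folding the up-projection into the query once and scoring directly in the latent space. A minimal NumPy sketch with made-up dimensions (not the actual MLA projection shapes or SGLang code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_latent, d_head = 4, 512, 128           # illustrative sizes only

c_kv = rng.standard_normal((n_tokens, d_latent))   # compressed (latent) KV cache
W_uk = rng.standard_normal((d_latent, d_head))     # up-projection from latent space to key space
q = rng.standard_normal((1, d_head))               # one query head for the current decode step

# Naive order: decompress every cached token into full keys, then score against the query.
scores_naive = q @ (c_kv @ W_uk).T

# Absorbed order: fold W_uk into the query once, then score directly in the latent space.
q_absorbed = q @ W_uk.T
scores_absorbed = q_absorbed @ c_kv.T

assert np.allclose(scores_naive, scores_absorbed)
```

In the absorbed order, the per-step work that scales with the number of cached tokens no longer includes decompressing each token into full keys, which is what pays off during decoding.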
![Multi-head Latent Attention for DeepSeek Series Models](https://lmsys.org/images/blog/sglang_v0_3/deepseek_mla.svg)
**Usage**: MLA optimization is enabled by default. To disable it, use `--disable-mla`.
...@@ -22,6 +25,7 @@ Overall, with these optimizations, we have achieved up to a 7x acceleration in o
## Data Parallelism Attention
**Description**: This optimization involves data parallelism (DP) for the MLA attention mechanism of DeepSeek Series Models, which allows for a significant reduction in the KV cache size, enabling larger batch sizes. Each DP worker independently handles different types of batches (prefill, decode, idle), which are then synchronized before and after processing through the Mixture-of-Experts (MoE) layer.
![Data Parallelism Attention for DeepSeek Series Models](https://lmsys.org/images/blog/sglang_v0_4/dp_attention.svg)
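To see why this shrinks the KV cache: MLA has effectively a single latent KV head, so under plain tensor-parallel attention the compressed KV cache cannot be sharded across heads and each GPU ends up holding a copy for the whole running batch, whereas a DP worker only caches the requests it owns. A rough back-of-the-envelope sketch; the constants are approximate DeepSeek-V3 MLA figures used purely for illustration:

```python
# Approximate per-token KV-cache footprint for DeepSeek-V3 MLA with an FP8 KV cache:
# (kv_lora_rank + qk_rope_head_dim) elements per layer, 61 layers, 1 byte per element.
kv_bytes_per_token = (512 + 64) * 61               # roughly 35 KB per cached token

num_gpus = 8
resident_tokens = 200_000                          # tokens cached across the running batch

# Plain TP attention: the single-head latent KV cache is replicated on every GPU.
tp_kv_per_gpu = resident_tokens * kv_bytes_per_token

# DP attention: each worker only stores the KV cache for its own share of requests.
dp_kv_per_gpu = (resident_tokens // num_gpus) * kv_bytes_per_token

print(f"TP attention : {tp_kv_per_gpu / 2**30:.2f} GiB of KV cache per GPU")
print(f"DP attention : {dp_kv_per_gpu / 2**30:.2f} GiB of KV cache per GPU")
```

Freeing that replicated KV memory is what enables the larger batch sizes mentioned above.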
**Usage**: This optimization is aimed at improving throughput and should be used in scenarios with high QPS (Queries Per Second). Data Parallelism Attention can be enabled with `--enable-dp-attention` for DeepSeek Series Models.
...@@ -29,6 +33,16 @@ Overall, with these optimizations, we have achieved up to a 7x acceleration in o
**Reference**: Check [Blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models).
## Multi Node Tensor Parallelism
**Description**: For users with limited memory on a single node, SGLang supports serving DeepSeek Series Models, including DeepSeek V3, across multiple nodes using tensor parallelism. This approach partitions the model parameters across multiple GPUs or nodes to handle models that are too large for one node's memory.
**Usage**: Check [here](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208) for usage examples.
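As a toy illustration of the partitioning idea (not SGLang's actual sharding code), a linear layer's weight can be split column-wise so that each rank stores and multiplies only its own shard; in a real multi-GPU or multi-node deployment, the final concatenation below would be a collective (all-gather) across devices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16))        # a small batch of activations
W = rng.standard_normal((16, 32))       # full weight matrix; in practice too large for one device

tp_size = 4
shards = np.split(W, tp_size, axis=1)   # rank i keeps only its column slice of W

# Each rank computes its partial output independently from its own shard.
partials = [x @ w_shard for w_shard in shards]

# Gathering the partial outputs reproduces the unpartitioned result.
y = np.concatenate(partials, axis=1)
assert np.allclose(y, x @ W)
```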
## Block-wise FP8
**Description**: SGLang implements block-wise FP8 quantization with two key optimizations (a short sketch follows at the end of this section):
- **Activation**: E4M3 format using per-token-per-128-channel sub-vector scales with online casting.
- **Weight**: Per-128x128-block quantization for better numerical stability.
**Usage**: This optimization is turned on by default for DeepSeek V3 models.
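A minimal PyTorch sketch of the two groupings described above; the function names, the max-abs scaling, and the clamping are illustrative assumptions rather than SGLang's actual FP8 kernels:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448 for the E4M3 format

def quantize_activation(x: torch.Tensor, group: int = 128):
    """Per-token-per-128-channel activation quantization ('online casting').
    x: [num_tokens, hidden] with hidden divisible by `group`."""
    t, h = x.shape
    xg = x.view(t, h // group, group)
    scale = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / FP8_MAX
    x_fp8 = (xg / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8.view(t, h), scale.squeeze(-1)    # scales: [num_tokens, hidden // group]

def quantize_weight(w: torch.Tensor, block: int = 128):
    """Per-128x128-block weight quantization.
    w: [out_features, in_features], both divisible by `block`."""
    o, i = w.shape
    wb = w.view(o // block, block, i // block, block)
    scale = wb.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-4) / FP8_MAX
    w_fp8 = (wb / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8.view(o, i), scale.view(o // block, i // block)

x_fp8, x_scale = quantize_activation(torch.randn(4, 512))
w_fp8, w_scale = quantize_weight(torch.randn(512, 512))
```

At matmul time, the FP8 operands are multiplied and the per-group scales are applied to recover the result in higher precision.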