Commit f6772f14 (unverified), authored by Baizhou Zhang, committed by GitHub

[Fix] Turn off DeepGEMM by default (#5263)

parent ac5b78ba
@@ -136,7 +136,9 @@ With data parallelism attention enabled, we have achieved up to **1.9x** decodin
 - **Weight**: Per-128x128-block quantization for better numerical stability.
-**Usage**: Turn on by default for DeepSeek V3 models.
+- **DeepGEMM**: The [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) kernel library designed for FP8 matrix multiplications. Note that enabling DeepGEMM will cause large compilation overhead during the first few runs.
+**Usage**: The activation and weight optimizations above are turned on by default for DeepSeek V3 models. DeepGEMM is turned off by default and can be enabled with the environment variable `SGL_ENABLE_JIT_DEEPGEMM=1`.
 ### Multi-token Prediction
 **Description**: SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved by **1.8x** for batch size 1 and **1.5x** for batch size 32 respectively on H200 TP8 setting.
......
@@ -45,7 +45,9 @@ if _is_cuda:
     from sgl_kernel import sgl_per_token_group_quant_fp8, sgl_per_token_quant_fp8
     sm_version = get_device_sm()
-    if sm_version == 90 and get_bool_env_var("SGL_ENABLE_JIT_DEEPGEMM", default="true"):
+    if sm_version == 90 and get_bool_env_var(
+        "SGL_ENABLE_JIT_DEEPGEMM", default="false"
+    ):
         _enable_jit_deepgemm = True
......
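The effect of flipping the default can be illustrated with a small standalone sketch. The `get_bool_env_var` below is a hypothetical stand-in written for this example, not the project's actual helper: it assumes that `1` and `true` (case-insensitive) count as enabled, which is consistent with the `SGL_ENABLE_JIT_DEEPGEMM=1` usage documented in this commit. The `deepgemm_enabled` function is likewise an illustrative name that mirrors the condition in the hunk above.

```python
import os


def get_bool_env_var(name: str, default: str = "false") -> bool:
    """Hypothetical stand-in for the helper used in the diff.

    Assumption: "1" and "true" (any case) mean enabled; anything else is off.
    """
    return os.environ.get(name, default).strip().lower() in ("1", "true")


def deepgemm_enabled(sm_version: int) -> bool:
    """Mirror the gate in the hunk: SM 90 only, and only when the flag is set."""
    return sm_version == 90 and get_bool_env_var(
        "SGL_ENABLE_JIT_DEEPGEMM", default="false"
    )


if __name__ == "__main__":
    # Flag unset: the new default of "false" applies, so DeepGEMM stays off.
    os.environ.pop("SGL_ENABLE_JIT_DEEPGEMM", None)
    print(deepgemm_enabled(90))

    # Explicit opt-in on an SM 90 device.
    os.environ["SGL_ENABLE_JIT_DEEPGEMM"] = "1"
    print(deepgemm_enabled(90))

    # Flag set, but the device is not SM 90, so the gate stays closed.
    print(deepgemm_enabled(80))
```

Because the check in the diff sits at module level under `if _is_cuda:`, the variable presumably needs to be set in the environment before the module is imported, for example when launching the server process.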