**Weight**: Per-128x128-block quantization for better numerical stability.
**DeepGEMM**: [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) is a kernel library designed for FP8 matrix multiplications. Note that enabling DeepGEMM incurs a large JIT compilation overhead during the first few runs.
**Usage**: The activation and weight optimizations above are turned on by default for DeepSeek V3 models. DeepGEMM is turned off by default and can be enabled with the environment variable `SGL_ENABLE_JIT_DEEPGEMM=1`.
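To make the block-wise scheme concrete, here is a minimal PyTorch sketch of per-128x128-block FP8 weight quantization. It is an illustration only: `quantize_weight_blockwise` and `FP8_MAX` are hypothetical names, the E4M3 target format is an assumption, and SGLang/DeepGEMM perform this inside fused kernels rather than in Python.

```python
import torch

FP8_MAX = 448.0  # max finite value of torch.float8_e4m3fn

def quantize_weight_blockwise(w: torch.Tensor, block: int = 128):
    """Illustrative per-(block x block) FP8 weight quantization.

    Each tile gets its own scale, so a single outlier only perturbs its
    own 128x128 tile instead of the whole tensor -- the numerical
    stability benefit described above.
    """
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    # Split the matrix into a (rows/block) x (cols/block) grid of tiles.
    tiles = w.reshape(rows // block, block, cols // block, block)
    # One scale per tile, mapping the tile's max |value| onto the FP8 range.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = (amax / FP8_MAX).to(torch.float32)
    q = (tiles / scale).to(torch.float8_e4m3fn)
    return q.reshape(rows, cols), scale.reshape(rows // block, cols // block)

# Round-trip check: dequantized weights approximate the originals.
w = torch.randn(256, 256)
q, scale = quantize_weight_blockwise(w)
deq = (q.reshape(2, 128, 2, 128).to(torch.float32)
       * scale.reshape(2, 1, 2, 1)).reshape(256, 256)
print((w - deq).abs().max())  # small per-tile quantization error
```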
### Multi-token Prediction
**Description**: SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed is improved by **1.8x** for batch size 1 and **1.5x** for batch size 32 in the H200 TP8 setting.
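The source of the speedup is standard speculative decoding: the MTP head drafts several future tokens, the target model scores all of them in one forward pass, and every accepted draft token saves a full decoding step. Below is a minimal, hypothetical sketch of the greedy accept rule only; SGLang's actual EAGLE path verifies a top-k draft tree with batched kernels.

```python
def accept_draft_tokens(draft_tokens: list[int], target_argmax: list[int]) -> list[int]:
    """Greedy chain verification for speculative decoding (illustrative).

    draft_tokens:  n tokens proposed by the MTP/EAGLE draft head.
    target_argmax: the target model's argmax at each of the n + 1
                   positions, from a single batched verification pass.

    Returns the longest matching prefix plus the target's own next
    token, so each step emits between 1 and n + 1 tokens.
    """
    accepted = []
    for drafted, expected in zip(draft_tokens, target_argmax):
        if drafted != expected:
            break  # first disagreement ends acceptance
        accepted.append(drafted)
    # The target's prediction at the first mismatch (or after the last
    # accepted token) is exact, so it is always emitted as a bonus token.
    accepted.append(target_argmax[len(accepted)])
    return accepted

# Draft proposes 3 tokens; the target agrees with the first two, so this
# single step emits 3 tokens instead of 1.
print(accept_draft_tokens([5, 9, 2], [5, 9, 7, 4]))  # -> [5, 9, 7]
```

Speculative decoding gains shrink as the batch grows and GPU compute saturates, which is consistent with the smaller 1.5x improvement at batch size 32.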