Unverified Commit 71ed0183 authored by Baizhou Zhang, committed by GitHub

[doc] Update document for flashinfer mla (#3907)

parent 8b681d77
@@ -133,6 +133,7 @@ Please consult the documentation below to learn more about the parameters you ma
* `attention_backend`: The backend for attention computation and KV cache management.
* `sampling_backend`: The backend for sampling.
* `enable_flashinfer_mla`: Enable the flashinfer MLA wrapper for attention computation. It can improve the throughput of DeepSeek models.
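
As a usage illustration, here is a minimal launch sketch with this option turned on; the model path is only an example, and starting the server through `subprocess` is just one convenient way to invoke the standard `sglang.launch_server` entry point. The command-line form of the flag (`--enable-flashinfer-mla`) matches the DeepSeek optimization notes later in this diff.

```python
# Minimal launch sketch (illustrative, not an official recipe): start an SGLang
# server with the flashinfer MLA wrapper enabled. The model path is an example.
import subprocess
import sys

cmd = [
    sys.executable, "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V2-Lite",  # example DeepSeek model
    "--trust-remote-code",
    "--enable-flashinfer-mla",                       # use flashinfer MLA kernels
]
subprocess.run(cmd, check=True)
```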
## Constrained Decoding
@@ -113,7 +113,7 @@ Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/be
- **Weight Absorption**: By applying the associative law of matrix multiplication to reorder computation steps, this method balances computation and memory access and improves efficiency in the decoding phase (see the sketch after this list).
- **Triton Decoding Kernel Optimization**: In the MLA decoding kernel, there is only one KV head. This optimization reduces memory access to the KV cache by processing multiple query heads within one block, accelerating the decoding process.
- **Flashinfer MLA Wrapper**: When the `--enable-flashinfer-mla` argument is provided, the server uses MLA kernels customized by Flashinfer. This optimization can be significant in long-context scenarios. For more details, refer to [this document](https://docs.flashinfer.ai/api/mla.html).
- **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enable efficient FP8 inference. Additionally, we have implemented a Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
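
To make the weight-absorption item above concrete, here is a minimal NumPy sketch, not SGLang's actual kernel, of how the associative law lets decoding score a query directly against the compressed latent cache instead of first expanding it into full keys. The dimensions, variable names, and single-head simplification are illustrative assumptions.

```python
import numpy as np

d_c, d_head, seq_len = 512, 128, 4096          # latent dim, head dim, cached tokens
rng = np.random.default_rng(0)

W_uk = rng.standard_normal((d_head, d_c))      # key up-projection for one head (illustrative)
c_kv = rng.standard_normal((seq_len, d_c))     # compressed KV cache (latent vectors)
q = rng.standard_normal(d_head)                # query for one head at the current decode step

# Naive order: expand every cached latent into a full key, then dot with the query.
k = c_kv @ W_uk.T                              # (seq_len, d_head) -- large intermediate
scores_naive = k @ q                           # (seq_len,)

# Absorbed order: fold W_uk into the query once, then score against the latent cache directly.
q_absorbed = W_uk.T @ q                        # (d_c,) -- computed once per query
scores_absorbed = c_kv @ q_absorbed            # (seq_len,) -- no per-token key expansion

# Associativity guarantees both orders produce the same attention scores.
assert np.allclose(scores_naive, scores_absorbed)
```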