Unverified Commit 71ed0183 authored by Baizhou Zhang, committed by GitHub

[doc] Update document for flashinfer mla (#3907)

parent 8b681d77
@@ -133,6 +133,7 @@ Please consult the documentation below to learn more about the parameters you ma
* `attention_backend`: The backend for attention computation and KV cache management.
* `sampling_backend`: The backend for sampling.
* `enable_flashinfer_mla`: Enable the flashinfer MLA wrapper for attention computation. It can improve the throughput of DeepSeek models.
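
As a usage illustration, here is a minimal launch sketch with this option turned on; the model path is only an example, and starting the server through `subprocess` is just one convenient way to invoke the standard `sglang.launch_server` entry point. The command-line form of the flag (`--enable-flashinfer-mla`) matches the DeepSeek optimization notes later in this diff.

```python
# Minimal launch sketch (illustrative, not an official recipe): start an SGLang
# server with the flashinfer MLA wrapper enabled. The model path is an example.
import subprocess
import sys

cmd = [
    sys.executable, "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V2-Lite",  # example DeepSeek model
    "--trust-remote-code",
    "--enable-flashinfer-mla",                       # use flashinfer MLA kernels
]
subprocess.run(cmd, check=True)
```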
## Constrained Decoding
@@ -113,7 +113,7 @@ Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/be
- **Weight Absorption**: By applying the associative law of matrix multiplication to reorder computation steps, this method balances computation and memory access and improves efficiency in the decoding phase (see the sketch after this list).
- **Triton Decoding Kernel Optimization**: In the MLA decoding kernel, there is only one KV head. This optimization reduces memory access to the KV cache by processing multiple query heads within one block, accelerating the decoding process.
- **Flashinfer MLA Wrapper**: When the `--enable-flashinfer-mla` argument is provided, the server uses MLA kernels customized by Flashinfer. This optimization can be significant in long-context scenarios. For more details, refer to [this document](https://docs.flashinfer.ai/api/mla.html).
- **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enable efficient FP8 inference. Additionally, we have implemented a Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
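
To make the weight-absorption item above concrete, here is a minimal NumPy sketch, not SGLang's actual kernel, of how the associative law lets decoding score a query directly against the compressed latent cache instead of first expanding it into full keys. The dimensions, variable names, and single-head simplification are illustrative assumptions.

```python
import numpy as np

d_c, d_head, seq_len = 512, 128, 4096          # latent dim, head dim, cached tokens
rng = np.random.default_rng(0)

W_uk = rng.standard_normal((d_head, d_c))      # key up-projection for one head (illustrative)
c_kv = rng.standard_normal((seq_len, d_c))     # compressed KV cache (latent vectors)
q = rng.standard_normal(d_head)                # query for one head at the current decode step

# Naive order: expand every cached latent into a full key, then dot with the query.
k = c_kv @ W_uk.T                              # (seq_len, d_head) -- large intermediate
scores_naive = k @ q                           # (seq_len,)

# Absorbed order: fold W_uk into the query once, then score against the latent cache directly.
q_absorbed = W_uk.T @ q                        # (d_c,) -- computed once per query
scores_absorbed = c_kv @ q_absorbed            # (seq_len,) -- no per-token key expansion

# Associativity guarantees both orders produce the same attention scores.
assert np.allclose(scores_naive, scores_absorbed)
```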