Unverified Commit 3e02526b authored by Baizhou Zhang, committed by GitHub

[Doc] Add experimental tag for flashinfer mla (#3925)

parent d8a98a2c
@@ -133,7 +133,7 @@ Please consult the documentation below to learn more about the parameters you ma…
 * `attention_backend`: The backend for attention computation and KV cache management.
 * `sampling_backend`: The backend for sampling.
-* `enable_flashinfer_mla`: The backend for flashinfer MLA wrapper. It can optimize the throughput of deepseek models.
+* `enable_flashinfer_mla`: The backend for flashinfer MLA wrapper that accelerates deepseek models. (In Experiment Stage)
 ## Constrained Decoding
...
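
To make the parameters in the hunk above concrete, here is a minimal launch sketch. It assumes an installed `sglang` package, the usual dash-separated CLI form of these arguments, and a placeholder model path and port; beyond `--attention-backend` and `--sampling-backend`, nothing in it comes from this diff.

```python
# Minimal sketch, not part of the commit: start an SGLang server with explicit
# attention and sampling backends. The model path and port are placeholders.
import subprocess

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    "--attention-backend", "flashinfer",  # attention computation and KV cache management
    "--sampling-backend", "flashinfer",   # sampling backend
    "--port", "30000",                    # placeholder port
]
subprocess.run(cmd, check=True)
```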
@@ -85,7 +85,7 @@ Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/be…
 - **Weight Absorption**: By applying the associative law of matrix multiplication to reorder computation steps, this method balances computation and memory access and improves efficiency in the decoding phase.
-- **Flashinfer MLA Wrapper**: By providing `--enable-flashinfer-mla` argument, the server will use MLA kernels customized by Flashinfer. This optimization can be significant under long context scenarios. More details can be referred to [this document](https://docs.flashinfer.ai/api/mla.html).
+- **Flashinfer MLA Wrapper**: By providing `--enable-flashinfer-mla` argument, the server will use MLA kernels customized by Flashinfer. More details can be referred to [this document](https://docs.flashinfer.ai/api/mla.html). (In Experiment Stage)
 - **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enables efficient FP8 inference. Additionally, we have implemented Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
...
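
As a hedged usage sketch for the Flashinfer MLA wrapper described above: only the `--enable-flashinfer-mla` flag is taken from the diff; the DeepSeek checkpoint name, port, wait time, and `/generate` request payload are assumptions.

```python
# Sketch only: launch a DeepSeek model with the experimental flashinfer MLA
# wrapper enabled, then send a single request to confirm the server responds.
# Checkpoint name, port, wait time, and request payload are assumptions.
import subprocess
import time

import requests

server = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V2-Lite",  # placeholder DeepSeek checkpoint
    "--enable-flashinfer-mla",                       # experimental Flashinfer MLA kernels
    "--trust-remote-code",
    "--port", "30000",
])

time.sleep(120)  # crude wait for weights to load; poll a health endpoint in practice

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Briefly explain multi-head latent attention.",
        "sampling_params": {"max_new_tokens": 64},
    },
    timeout=60,
)
print(resp.json())
server.terminate()
```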