Unverified Commit ac053100 authored by Baizhou Zhang, committed by GitHub

[Docs] Modify ep related server args and remove cublas part of deepseek (#3732)

parent d5d80ab4
@@ -83,8 +83,8 @@ Please consult the documentation below to learn more about the parameters you ma
* `load_balance_method`: Will be deprecated. Load balancing strategy for data parallel requests.

### Expert parallelism
-* `ep_size`: Distribute the experts onto multiple GPUs for MoE models. Remember to shard the model weights with `tp_size=ep_size`; for detailed benchmarking, refer to [this PR](https://github.com/sgl-project/sglang/pull/2203).
+* `enable_ep_moe`: Enables expert parallelism that distributes the experts onto multiple GPUs for MoE models.
+* `ep_size`: The size of EP. Please shard the model weights with `tp_size=ep_size`; for detailed benchmarking, refer to [this PR](https://github.com/sgl-project/sglang/pull/2203). If not set, `ep_size` will be automatically set to `tp_size`.
## Memory and scheduling
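For illustration, a minimal sketch of the expert-parallelism arguments above, assuming the offline `sglang.Engine` API forwards these server arguments as keyword arguments (the model path and GPU count are hypothetical):

```python
# A minimal sketch, assuming `sglang.Engine` accepts these server args
# as keyword arguments; model path and GPU count are illustrative.
import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V2-Lite",  # hypothetical MoE checkpoint
    tp_size=8,           # shard the model weights with tp_size = ep_size
    enable_ep_moe=True,  # distribute the experts across the 8 GPUs
    ep_size=8,           # may be omitted; defaults to tp_size
)
print(llm.generate("The capital of France is"))
llm.shutdown()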
@@ -179,7 +179,6 @@ Please consult the documentation below to learn more about the parameters you ma
* `enable_mixed_chunk`: Enables mixing prefill and decode, see [this discussion](https://github.com/sgl-project/sglang/discussions/1163).
* `enable_dp_attention`: Enable [Data Parallelism Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models) for DeepSeek models. Note that you need to choose `dp_size = tp_size` for this.
-* `enable_ep_moe`: Enables expert parallelism, see the description of `ep_size`.
* `enable_torch_compile`: Torch compile the model. This is an experimental feature.
* `torch_compile_max_bs`: The maximum batch size when using `torch_compile`.
* `cuda_graph_max_bs`: Adjust the maximum batch size when using CUDA graph. By default this is chosen for you based on GPU specifics.
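A similar hedged sketch for the optimization flags above (values are illustrative only; note that `enable_dp_attention` requires `dp_size = tp_size`):

```python
# Hedged sketch combining the optimization flags listed above, again
# assuming `sglang.Engine` forwards them; all values are illustrative.
import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V2-Lite",  # hypothetical checkpoint
    tp_size=8,
    dp_size=8,                  # enable_dp_attention requires dp_size = tp_size
    enable_dp_attention=True,
    enable_torch_compile=True,  # experimental
    torch_compile_max_bs=16,    # cap the batch size under torch.compile
    cuda_graph_max_bs=32,       # override the GPU-specific default
)
```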
@@ -77,21 +77,3 @@ Overall, with these optimizations, we have achieved up to a 7x acceleration in o
- **Weight**: Per-128x128-block quantization for better numerical stability.

**Usage**: turned on by default for DeepSeek V3 models.
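To make the per-128x128-block scheme above concrete, here is a toy sketch of how block-wise weight scales could be computed (illustrative only; not the actual SGLang/DeepSeek kernel, and shape handling is simplified):

```python
# Toy sketch of per-128x128-block FP8 weight quantization;
# illustrative only, not the actual SGLang/DeepSeek kernel.
import torch

def quantize_per_block(w: torch.Tensor, block: int = 128):
    rows, cols = w.shape  # assumed divisible by `block` for simplicity
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    blocks = w.reshape(rows // block, block, cols // block, block)
    # One scale per 128x128 block: the block's max magnitude maps to fp8_max.
    scale = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12) / fp8_max
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return q.reshape(rows, cols), scale.reshape(rows // block, cols // block)

w = torch.randn(256, 256)
q, scales = quantize_per_block(w)  # scales is a 2x2 grid of block scales
```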
-### Cublas Grouped Gemm
-**Description**: The [Grouped Gemm API](https://docs.nvidia.com/cuda/cublas/index.html#cublasgemmgroupedbatchedex) provided by cuBLAS 12.5 is integrated into SGLang to accelerate settings where a group of matrix multiplications with different shapes needs to be executed. Typical examples are expert parallelism in MoE layers and LoRA modules when serving multiple LoRA adapters.
-**Usage**: SGLang currently supports only PyTorch 2.5, which ships together with CUDA 12.4 packages. Users need a CUDA environment >= 12.5 and must force-upgrade the cuBLAS package as follows:
-1. Make sure the system CUDA version is >= 12.5 with `nvcc -V`.
-2. Install SGLang following the [official documentation](https://docs.sglang.ai/start/install.html).
-3. Reinstall cuBLAS 12.5 through `pip install nvidia-cublas-cu12==12.5.3.2` so that the cuBLAS package is upgraded.
-4. Compile the new sgl-kernel library with `cd sgl-kernel && make build`.
-Then the cuBLAS grouped GEMM kernel can be imported with
-```python
-from sgl_kernel import cublas_grouped_gemm
-```
-Currently cuBLAS only supports the grouped GEMM kernel for fp16/bf16/fp32 tensors, so fp8 tensors cannot be used.
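For reference on what the removed section described: a grouped GEMM executes a set of independent matmuls with differing shapes in one kernel launch. Since this page does not show the `cublas_grouped_gemm` signature, only the semantics are sketched below, in plain PyTorch:

```python
# Plain-PyTorch reference for grouped GEMM semantics: independent
# matmuls with different shapes, as in MoE expert layers. The cuBLAS
# kernel fuses these into a single launch; this loop is only a
# semantic reference, not the sgl_kernel API.
import torch

shapes = [(32, 64, 128), (16, 64, 256), (8, 64, 512)]  # illustrative (m, k, n)
# bf16 is one of the dtypes the cuBLAS grouped GEMM supports (fp16/bf16/fp32).
a = [torch.randn(m, k, dtype=torch.bfloat16) for m, k, _ in shapes]
b = [torch.randn(k, n, dtype=torch.bfloat16) for _, k, n in shapes]

outputs = [ai @ bi for ai, bi in zip(a, b)]  # one GEMM per group
```

The point of the grouped kernel is to replace N separate launches with a single one, which matters most when each individual matmul is small.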