Unverified Commit 7c1692aa authored by Chayenne, committed by GitHub

Docs: Reorganize dpsk links (#3900)

parent 8f019c7d
@@ -2,18 +2,8 @@
The SGLang and DeepSeek teams collaborated to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs **from day one**. SGLang also supports [MLA optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [DP attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models), making SGLang one of the best open-source LLM engines for running DeepSeek models. SGLang is the inference engine recommended by the official [DeepSeek team](https://github.com/deepseek-ai/DeepSeek-V3/tree/main?tab=readme-ov-file#62-inference-with-sglang-recommended).
Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.
For the optimizations SGLang makes for the DeepSeek series of models, please refer to [DeepSeek Model Optimizations in SGLang](https://docs.sglang.ai/references/deepseek.html).
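As a quick, hedged illustration of the DP attention feature mentioned above: the flag and model path below come from SGLang's general server arguments rather than this README, so treat them as assumptions and adjust to your setup.

```bash
# Sketch: launch DeepSeek V3 with data-parallel attention enabled on 8 GPUs.
# --enable-dp-attention is assumed to be available in your SGLang version.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --enable-dp-attention \
  --trust-remote-code
```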
## Hardware Recommendation
- 8 x NVIDIA H200 GPUs
If you do not have GPUs with large enough memory, please try multi-node tensor parallelism. There is an example serving with [2 H20 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208) below.
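As a rough sketch of what multi-node tensor parallelism looks like (the addresses, port, and model path are placeholders; the linked 2 x H20 example below is the authoritative reference):

```bash
# Node 0 (assumes this node's address is 10.0.0.1; adjust to your cluster)
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
  --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# Node 1 (points at the same --dist-init-addr, with its own rank)
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
  --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
```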
For running on AMD MI300X, use [Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) as a reference.
## Installation & Launch
If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded.
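One way to download the weights beforehand is with the Hugging Face CLI; this is a hedged sketch and not part of the original instructions:

```bash
# Pre-fetch the checkpoint into the local HF cache so server startup
# does not have to wait on (or race) the download.
pip install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-V3
```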
@@ -183,6 +173,15 @@ python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --host http://10.0.
python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1:30000 --batch-size 1 --input-len 128 --output-len 128
```
### Example: Serving with 8 A100/A800 with AWQ Quantization
AWQ does not support BF16, so add the `--dtype half` flag if AWQ is used for quantization. One example is as follows:
```bash
python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half
```
### Example: Serving on any cloud or Kubernetes with SkyPilot
SkyPilot helps you find the cheapest available GPUs across any cloud or existing Kubernetes cluster and launch distributed serving with a single command. See details [here](https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1).
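For illustration, that single command typically looks like the sketch below; the task file name `deepseek-r1.yaml` is an assumption, so use the YAML provided in the linked SkyPilot directory.

```bash
# Launch a cluster named deepseek-r1 from a SkyPilot task YAML
# (the file name is hypothetical; see the linked repo for the real one).
sky launch -c deepseek-r1 deepseek-r1.yaml
```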
...
@@ -124,6 +124,8 @@ drun -p 30000:30000 \
--port 30000
```
[Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) could also be a good reference.
### Running Llama3.1
Running Llama3.1 is nearly identical. The only difference is in the model specified when starting the server, shown by the following example command:
...
@@ -2,9 +2,88 @@
SGLang provides several optimizations specifically designed for the DeepSeek model to boost its inference speed. This document outlines current optimizations for DeepSeek. Additionally, the SGLang team is actively developing enhancements for [DeepSeek V3](https://github.com/sgl-project/sglang/issues/2591).
Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.
## Launch DeepSeek V3 with SGLang
SGLang is recognized as one of the top engines for [DeepSeek model inference](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3). Refer to [installation and launch](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#installation--launch) to learn how to run fast inference of DeepSeek V3/R1 with SGLang. To run DeepSeek V3/R1 models, the requirements are as follows:
<table>
<thead>
<tr>
<th>Weight Type</th>
<th>Configuration</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3"><b>Full precision FP8 (recommended)</b></td>
<td>8 x H200</td>
</tr>
<tr>
<td>8 x MI300X</td>
</tr>
<tr>
<td>2 x 8 x H100/800/20</td>
</tr>
<tr>
<td rowspan="4">Full precision BF16</td>
<td>2 x 8 x H200</td>
</tr>
<tr>
<td>2 x 8 x MI300X</td>
</tr>
<tr>
<td>4 x 8 x H100/800/20</td>
</tr>
<tr>
<td>4 x 8 x A100/A800</td>
</tr>
<tr>
<td rowspan="2">Quantized weights (AWQ)</td>
<td>8 x H100/800/20</td>
</tr>
<tr>
<td>8 x A100/A800</td>
</tr>
</tbody>
</table>
Detailed commands for reference:
- [8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended)
- [8 x MI300X](https://docs.sglang.ai/references/amd.html#running-deepseek-v3)
- [2 x 8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes)
- [4 x 8 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes)
- [8 x A100 (AWQ)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-8-a100a800-with-awq-quantization)
### Download Weights
@@ -87,11 +166,3 @@ Overall, with these optimizations, we have achieved up to a 7x acceleration in o
1. **Question**: What should I do if model loading takes too long and NCCL timeout occurs?
   **Answer**: You can try adding `--dist-timeout 3600` when launching the model, which allows a 1-hour timeout (a launch sketch is shown after this list).
2. **Question**: How to use quantized DeepSeek models?
**Answer**: AWQ does not support BF16, so add the `--dtype half` flag if AWQ is used for quantization. One example is as follows:
```bash
python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half
```
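Returning to question 1 above, a hedged sketch of a launch command with the extended timeout (the model path and `--tp` value here are illustrative, not prescribed by this document):

```bash
# Allow up to 1 hour for weight loading and initialization before the
# distributed process group times out.
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 \
  --trust-remote-code --dist-timeout 3600
```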