"git@developer.sourcefind.cn:renzhc/diffusers_dcu.git" did not exist on "181688012a2abadc93b316d91a513bee36193615"
Unverified commit b7d05594 authored by Lianmin Zheng, committed by GitHub

Update docs (#1768)


Co-authored-by: Chayenne Zhao <zhaochenyang20@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
parent 80a90547
@@ -56,10 +56,12 @@ You can install SGLang using any of the methods below.
pip install --upgrade pip
pip install "sglang[all]"
# Install FlashInfer accelerated kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```
**Important: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.**
### Method 2: From source
```
# Use the latest release branch
@@ -69,10 +71,12 @@ cd sglang
pip install --upgrade pip
pip install -e "python[all]"
# Install FlashInfer accelerated kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```
**Important: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.**
### Method 3: Using docker
The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your Hugging Face Hub [token](https://huggingface.co/docs/hub/en/security-tokens).
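For orientation, a representative invocation is sketched below; the exact command in the docs may differ, and the model path, port, and cache mount are assumptions.
```
docker run --gpus all -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 30000
```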
@@ -226,7 +230,8 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps the CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currently.
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
- To enable fp8 weight quantization, add `--quantization fp8` on an fp16 checkpoint, or directly load an fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`. A combined launch command is sketched after this list.
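As a rough illustration of how these options compose (an assumption, not a tested recommendation; check that the flags you pick are compatible with your model and hardware):
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-overlap-scheduler \
  --kv-cache-dtype fp8_e5m2 \
  --chunked-prefill-size 4096
```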
@@ -247,7 +252,6 @@ We also provide an inference engine **without an HTTP server**. For example,
```python
import sglang as sgl
def main():
    prompts = [
        "Hello, my name is",
@@ -267,12 +271,8 @@ if __name__ == "__main__":
    main()
```
This can be used for offline batch inference and building custom servers.
You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).
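A complete, self-contained version of this pattern might look like the following; the second prompt, the sampling parameters, and the model path are illustrative assumptions rather than the exact values in the linked example.
```python
import sglang as sgl

def main():
    # Illustrative prompts and sampling parameters (assumed values).
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}

    # Create the offline engine; no HTTP server is started.
    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3-8B-Instruct")

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt!r}\nGenerated: {output['text']!r}\n")

if __name__ == "__main__":
    main()
```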
### Supported Models
@@ -440,7 +440,6 @@ print(state["answer_1"])
```
#### More Examples
Anthropic and VertexAI (Gemini) models are also supported.
You can find more examples at [examples/quick_start](examples/frontend_language/quick_start).
@@ -79,7 +79,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
- To enable fp8 weight quantization, add `--quantization fp8` on an fp16 checkpoint, or directly load an fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
@@ -100,7 +100,6 @@ We also provide an inference engine **without an HTTP server**. For example,
```python
import sglang as sgl
def main():
    prompts = [
        "Hello, my name is",
@@ -120,12 +119,8 @@ if __name__ == "__main__":
    main()
```
This can be used for offline batch inference and building custom servers.
You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).
### Supported Models
@@ -68,7 +68,6 @@ print(state["answer_1"])
```
#### More Examples
Anthropic and VertexAI (Gemini) models are also supported.
You can find more examples at [examples/quick_start](https://github.com/sgl-project/sglang/tree/main/examples/frontend_language/quick_start).
@@ -6,11 +6,11 @@ Achieving a large batch size is the most important thing for attaining high throughput.
When the server is running at full load, look for the following in the log:
```Decode batch. #running-req: 233, #token: 370959, token usage: 0.82, gen throughput (token/s): 4594.01, #queue-req: 317```
### Tune Your Request Submission Speed
`#queue-req` indicates the number of requests in the queue. If you frequently see `#queue-req == 0`, it suggests you are bottlenecked by the request submission speed.
A healthy range for `#queue-req` is `50 - 500`.
On the other hand, do not make `#queue-req` too large because it will also increase the scheduling overhead on the server.
### Tune `--schedule-conservativeness`
@@ -31,6 +31,10 @@ If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096`.
If OOM happens during decoding, try to decrease `--max-running-requests`.
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding. A command combining these memory knobs is sketched below.
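As an illustration only (the numbers are placeholders, not tuned recommendations), these memory-related options can be combined like this:
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --chunked-prefill-size 4096 \
  --max-running-requests 128 \
  --mem-fraction-static 0.8
```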
### Try Advanced Options
- To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps the CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currently.
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
### (Minor) Tune `--schedule-policy`
If you have many shared prefixes, use the default `--schedule-policy lpm`. `lpm` stands for longest prefix match.
When you have no shared prefixes at all or you always send the requests with the shared prefixes together,
@@ -7,23 +7,27 @@ You can install SGLang using any of the methods below.
pip install --upgrade pip
pip install "sglang[all]"
# Install FlashInfer accelerated kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```
**Important: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.**
### Method 2: From source
```
# Use the latest release branch
git clone -b v0.3.4.post1 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
# Install FlashInfer accelerated kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```
**Important: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.**
### Method 3: Using docker
The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your Hugging Face Hub [token](https://huggingface.co/docs/hub/en/security-tokens).
@@ -94,3 +98,4 @@ sky status --endpoint 30000 sglang
### Common Notes
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend runs on a GPU-enabled machine. To install the frontend, run `pip install sglang`; for the backend, use `pip install "sglang[srt]"`. This lets you write SGLang programs locally and execute them by connecting to the remote backend, as in the sketch below.
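For example, a minimal frontend-only program might connect to a backend started elsewhere with `python -m sglang.launch_server`; the endpoint URL and the question are placeholders.
```python
import sglang as sgl

# Point the frontend at a backend server running on another machine
# (placeholder address; use your server's host and port).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://my-gpu-host:30000"))

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

state = qa.run(question="What is the capital of France?")
print(state["answer"])
```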