- To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currenly.
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currenly.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
- To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
...
...
@@ -128,7 +129,7 @@ You can view the full example [here](https://github.com/sgl-project/sglang/tree/
## Frontend: Structured Generation Language (SGLang)
# Frontend: Structured Generation Language (SGLang)
The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may found it easier to use for complex prompting workflow.
### Quick Start
The example below shows how to use sglang to answer a mulit-turn question.
The example below shows how to use SGLang to answer a multi-turn question.