Unverified commit 0ab7bcaf authored by Lianmin Zheng, committed by GitHub

Simplify documentation in README.md (#1851)

parent 438526a8
# Backend: SGLang Runtime (SRT)

The SGLang Runtime (SRT) is an efficient serving engine.

### Quick Start
...
@@ -79,6 +79,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps the CPU scheduler with GPU computation and can accelerate almost all workloads. This does not currently work with constrained decoding.
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not currently work with FP8.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
- To enable FP8 weight quantization, add `--quantization fp8` when serving an FP16 checkpoint, or directly load an FP8 checkpoint without specifying any extra arguments (see the sketch after this list).
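As a rough illustration, the FP8 route from the last bullet looks like the following; the model path and port are placeholders, and flag availability can vary across SGLang versions, so treat this as a sketch rather than a canonical command:
```
# Sketch: serve an FP16 checkpoint with FP8 weight quantization enabled.
# Model path and port are placeholders; not every flag above composes with every other.
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --quantization fp8 \
  --port 30000
```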
...
@@ -128,7 +129,7 @@ You can view the full example [here](https://github.com/sgl-project/sglang/tree/
- Llama / Llama 2 / Llama 3 / Llama 3.1
- Mistral / Mixtral / Mistral NeMo
- Gemma / Gemma 2
- Qwen / Qwen 2 / Qwen 2 MoE / Qwen 2 VL
- DeepSeek / DeepSeek 2
- OLMoE
- [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)
...
@@ -151,6 +152,7 @@ You can view the full example [here](https://github.com/sgl-project/sglang/tree/
- MiniCPM / MiniCPM 3
- XVERSE / XVERSE MoE
- SmolLM
- GLM-4
**Embedding Models**
...
@@ -173,6 +175,17 @@ Launch [Qwen2-7B-Instruct](https://www.modelscope.cn/models/qwen/qwen2-7b-instru
```
SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
```
Or start it with Docker:
```bash
docker run --gpus all \
-p 30000:30000 \
-v ~/.cache/modelscope:/root/.cache/modelscope \
--env "SGLANG_USE_MODELSCOPE=true" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
```
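Once the container is running, one quick sanity check is to query the server's OpenAI-compatible endpoint; the port and model name below are assumptions carried over from the Docker command above:
```bash
# Assumes the Docker server above is listening on localhost:30000.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```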
</details>

#### Run Llama 3.1 405B
...
# Frontend: Structured Generation Language (SGLang)
The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may find it easier to use for complex prompting workflows.
### Quick Start

The example below shows how to use SGLang to answer a multi-turn question.

#### Using Local Models

First, launch a server with
...
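The quick-start code itself is elided in this diff. As a minimal sketch (not the canonical README example), a multi-turn program built from the frontend primitives `function`, `system`, `user`, `assistant`, and `gen` might look like this, assuming a local server on port 30000:
```python
# A minimal sketch of a two-turn chat using SGLang's frontend primitives.
# The endpoint URL, question strings, and max_tokens values are placeholders.
from sglang import (
    RuntimeEndpoint,
    assistant,
    function,
    gen,
    set_default_backend,
    system,
    user,
)

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

# Point the frontend at a locally launched server.
set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_question.run(
    question_1="What is the capital of France?",
    question_2="Name two landmarks there.",
)
for m in state.messages():
    print(m["role"], ":", m["content"])
```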
# Install

You can install SGLang using any of the methods below.
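The method list is elided here, but the common pip route is a one-liner; this is a sketch, so check the full install section for the current extras and any additional wheel indexes (e.g. for flashinfer):
```bash
# Sketch: install SGLang with all optional serving dependencies.
pip install --upgrade pip
pip install "sglang[all]"
```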
...