Unverified Commit b20daf98 authored by Lianmin Zheng, committed by GitHub

Update README.md (#1198)

parent f6af3a65
@@ -17,7 +17,7 @@ SGLang is a fast serving framework for large language models and vision language
It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
The core features include:
- **Fast Backend Runtime**: Efficient serving with RadixAttention for prefix caching, jump-forward constrained decoding, continuous batching, token attention (paged attention), tensor parallelism, FlashInfer kernels, and quantization (AWQ/FP8/GPTQ/Marlin).
- **Flexible Frontend Language**: Enables easy programming of LLM applications with chained generation calls, advanced prompting, control flow, multiple modalities, parallelism, and external interactions.
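
For flavor, a minimal sketch of the frontend language described above (hedged: it assumes an SGLang server is already running on localhost:30000, and the questions and state names are illustrative):

```python
import sglang as sgl

@sgl.function
def multi_turn_qa(s, question_1, question_2):
    # Chained generation calls: each sgl.gen extends the same prompt state.
    s += "Q: " + question_1 + "\n"
    s += "A: " + sgl.gen("answer_1", max_tokens=64, stop="\n") + "\n"
    s += "Q: " + question_2 + "\n"
    s += "A: " + sgl.gen("answer_2", max_tokens=64, stop="\n")

# Point the frontend at a running SGLang server (address is an assumption).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_qa.run(
    question_1="What is the capital of France?",
    question_2="And of Germany?",
)
print(state["answer_2"])
```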
## News
@@ -248,17 +248,19 @@ Instructions for supporting a new model are [here](https://github.com/sgl-projec
#### Use Models From ModelScope
<details>

To use a model from [ModelScope](https://www.modelscope.cn), set the environment variable SGLANG_USE_MODELSCOPE.
```
export SGLANG_USE_MODELSCOPE=true
```
Launch [Qwen2-7B-Instruct](https://www.modelscope.cn/models/qwen/qwen2-7b-instruct) Server
```
SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
```
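
A minimal sketch of querying the launched server from Python (hedged: assumes the default port above and SGLang's native `/generate` endpoint; the prompt and sampling parameters are illustrative):

```python
import requests

# Send a generation request to the server started above.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
print(response.json()["text"])
```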
</details>

#### Run Llama 3.1 405B
<details>
```bash
# Run 405B (fp8) on a single node
@@ -272,6 +274,8 @@ GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/
GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph
```
</details>
### Benchmark Performance
- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for `launch_server.py`. Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this unit test does not. For accurate large batch testing, consider using `sglang.bench_serving`.
@@ -407,7 +411,7 @@ def tip_suggestion(s):
    s += "In summary" + sgl.gen("summary")
```
#### Multi-Modality
Use `sgl.image` to pass an image as input.
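
A minimal sketch of such a call (hedged: assumes a vision-language model, e.g. a LLaVA variant, is being served; the function and variable names are illustrative):

```python
import sglang as sgl

@sgl.function
def image_qa(s, image_file, question):
    # sgl.image embeds the image in the prompt next to the text question.
    s += sgl.user(sgl.image(image_file) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))
```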
```python
@@ -461,7 +465,7 @@ def character_gen(s, name):
    s += sgl.gen("json_output", max_tokens=256, regex=character_regex)
```
See also [json_decode.py](examples/usage/json_decode.py) for an additional example of specifying formats with Pydantic models.
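
For flavor, a hedged sketch of the `regex` argument with an invented stand-in pattern (an ISO date), since `character_regex` is defined elsewhere in the README:

```python
import sglang as sgl

# Invented stand-in pattern: constrain the generated text to an ISO date.
date_regex = r"\d{4}-\d{2}-\d{2}"

@sgl.function
def birthday_gen(s, name):
    # Decoding is masked so the output always matches date_regex.
    s += name + " was born on " + sgl.gen("date", max_tokens=16, regex=date_regex)
```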
#### Batching
Use `run_batch` to run a batch of requests with continuous batching.
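
A minimal sketch (hedged: the `text_qa` function and the questions are illustrative, and a default backend is assumed to be set):

```python
import sglang as sgl

@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

# Submit several requests at once; the runtime batches them continuously.
states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
    ],
    progress_bar=True,
)
print(states[0]["answer"])
```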
@@ -523,7 +527,6 @@ def chat_example(s):
- The `choices` argument in `sgl.gen` is implemented by computing the [token-length normalized log probabilities](https://blog.eleuther.ai/multiple-choice-normalization/) of all choices and selecting the one with the highest probability (see the toy sketch after this list).
- The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.
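
A toy sketch of the `choices` selection rule in plain Python (the log probabilities are invented; this is an illustration, not the library's implementation):

```python
# Token-length normalized selection: sum each choice's token log-probs,
# divide by its token count, and pick the choice with the highest average.
def pick_choice(logprobs_per_choice):
    def normalized(lps):
        return sum(lps) / len(lps)
    return max(logprobs_per_choice, key=lambda c: normalized(logprobs_per_choice[c]))

# "yes" scores -0.15 and beats "no" at about -0.72, despite "no" having more tokens.
print(pick_choice({"yes": [-0.2, -0.1], "no": [-0.05, -1.2, -0.9]}))
```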
## Benchmark And Performance
![8b_throughput](https://lmsys.org/images/blog/sglang_llama3/8b_throughput.svg) ![8b_throughput](https://lmsys.org/images/blog/sglang_llama3/8b_throughput.svg)
![70b_fp8_throughput](https://lmsys.org/images/blog/sglang_llama3/70b_fp8_throughput.svg) ![70b_fp8_throughput](https://lmsys.org/images/blog/sglang_llama3/70b_fp8_throughput.svg)