Commit 9f009261 authored by Lianmin Zheng

Improve docs

parent 159cc741
@@ -44,12 +44,8 @@ pip install -e "python[all]"
```
### Notes
- If you are using older GPUs (NVIDIA V100, T4), please install the correct Triton compiler version to avoid some known bugs.
  - For NVIDIA T4, please use `pip install "triton>=2.2.0"`.
  - For NVIDIA V100, please install the [nightly](https://triton-lang.org/main/getting-started/installation.html) version.
- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
## Quick Start
The example below shows how to use sglang to answer a multi-turn question.
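The full program is not included in this diff excerpt. As a rough sketch of such a multi-turn program, assuming the OpenAI backend from the install note above (`pip install "sglang[openai]"`) and an `OPENAI_API_KEY` in the environment; the function name, questions, and model choice are only illustrative:

```python
import sglang as sgl

@sgl.function
def multi_turn_question(s, question_1, question_2):
    # Build the chat turn by turn; each gen() captures the model's reply
    # under the given variable name.
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=256))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=256))

# Illustrative backend/model choice; a local sglang server backend works the same way.
sgl.set_default_backend(sgl.OpenAI("gpt-3.5-turbo"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two popular attractions there.",
)
print(state["answer_1"])
print(state["answer_2"])
```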
@@ -367,7 +363,8 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7
```
- See [flashinfer.md](docs/flashinfer.md) on accelerating inference using highly optimized CUDA kernels.
- See [hyperparameter_tuning.md](docs/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
### Supported Models
- Llama
@@ -5,6 +5,7 @@
Achieving a large batch size is the most important factor for attaining high throughput.
When the server is running at full load, look for the following in the log:
```[gpu_id=0] #running-req: 233, #token: 370959, token usage: 0.82, gen throughput (token/s): 4594.01, #queue-req: 417```
### Tune Your Request Submission Speed
@@ -22,10 +23,10 @@ On the other hand, if you see `token usage` very high and you frequently see war
### Tune `--dp-size` and `--tp-size`
Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism for throughput. Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism for throughput.
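For example, on a node with eight GPUs and a model that fits in a single GPU's memory, replicating the model is usually the better starting point (the GPU count and model below are illustrative):

```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --dp-size 8 --tp-size 1
```

If the model does not fit on one GPU, raise `--tp-size` only as much as needed to fit the weights and KV cache, and spend the remaining GPUs on `--dp-size`.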
### (Minor) Tune `--max-prefill-tokens`, `--mem-fraction-static`, `--max-running-requests`
If you see out of memory (OOM) errors, you can decrease these parameters.
If OOM happens during prefill, try to decrease `--max-prefill-tokens`.
If OOM happens during decoding, try to decrease `--max-running-requests`.
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
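As an illustrative starting point when hitting OOM (the specific values are guesses to be tuned against the `token usage` and throughput reported in the log):

```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7 --max-prefill-tokens 8192 --max-running-requests 128
```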
### (Minor) Tune `--schedule-heuristic`