sglang / Commits / 9f009261

Commit 9f009261 authored Jun 01, 2024 by Lianmin Zheng

Improve docs

parent 159cc741
Showing 2 changed files with 7 additions and 9 deletions

README.md (+2, -5)
docs/hyperparameter_tuning.md (+5, -4)
README.md
...
@@ -44,12 +44,8 @@ pip install -e "python[all]"
```
### Notes
- If you are using older GPUs (NVIDIA V100, T4), please pick the correct triton compiler version to avoid some known bugs.
- For NVIDIA T4, please use `pip install "triton>=2.2.0"`.
- For NVIDIA V100, please install the [nightly](https://triton-lang.org/main/getting-started/installation.html) version.
- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`
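For context, not part of this commit: the OpenAI-only install pairs with sglang's OpenAI backend. A minimal sketch of that path, with an illustrative model name and prompt, assuming `OPENAI_API_KEY` is set:

```
# Sketch only: pairs with `pip install "sglang[openai]"`.
import sglang as sgl

@sgl.function
def quick_answer(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

sgl.set_default_backend(sgl.OpenAI("gpt-3.5-turbo"))  # model name is illustrative
state = quick_answer.run(question="What is SGLang?")
print(state["answer"])
```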
## Quick Start
The example below shows how to use sglang to answer a multi-turn question.
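The diff elides the example itself. As a reminder, it is along the lines of the sketch below, which assumes a local server already running at http://localhost:30000 and uses placeholder questions:

```
from sglang import (assistant, function, gen, set_default_backend,
                    system, user, RuntimeEndpoint)

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

# Assumes a server launched with `python -m sglang.launch_server ... --port 30000`.
set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions.",
)
for message in state.messages():
    print(message["role"], ":", message["content"])
```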
...
@@ -367,7 +363,8 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7
```
- You can turn on [flashinfer](docs/flashinfer.md) to accelerate the inference by using highly optimized CUDA kernels.
- See [flashinfer.md](docs/flashinfer.md) on accelerating inference using highly optimized CUDA kernels.
- See [hyperparameter_tuning.md](docs/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
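As an aside, not part of this commit: once the server above is running, a quick way to exercise it is the native `/generate` endpoint that the README of this period documents. The prompt and sampling parameters here are illustrative:

```
import requests

# Assumes the launch_server command above is running on port 30000.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Once upon a time,",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
print(response.json()["text"])
```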
### Supported Models
- Llama
...
docs/hyperparameter_tuning.md
...
@@ -5,6 +5,7 @@
Achieving a large batch size is the most important thing for attaining high throughput.
When the server is running at full load, look for the following in the log:
```
[gpu_id=0] #running-req: 233, #token: 370959, token usage: 0.82, gen throughput (token/s): 4594.01, #queue-req: 417
```
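Not part of the commit, but for illustration: a small parser for this log format makes the numbers the guide tells you to watch (`token usage`, `#queue-req`) easy to track. The helper name and regex are assumptions based only on the example line above:

```
import re

# Hypothetical helper, not part of sglang: pull the metrics out of a
# decode-phase log line like the one shown above.
LOG_PATTERN = re.compile(
    r"#running-req: (?P<running>\d+), #token: (?P<tokens>\d+), "
    r"token usage: (?P<usage>[\d.]+), "
    r"gen throughput \(token/s\): (?P<tps>[\d.]+), #queue-req: (?P<queued>\d+)"
)

def parse_decode_log(line):
    """Return the metrics as numbers, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    groups = match.groupdict()
    return {
        "running_reqs": int(groups["running"]),
        "tokens": int(groups["tokens"]),
        "token_usage": float(groups["usage"]),
        "gen_tps": float(groups["tps"]),
        "queued_reqs": int(groups["queued"]),
    }

example = ("[gpu_id=0] #running-req: 233, #token: 370959, token usage: 0.82, "
           "gen throughput (token/s): 4594.01, #queue-req: 417")
print(parse_decode_log(example))
```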
### Tune Your Request Submission Speed
...
@@ -22,10 +23,10 @@ On the other hand, if you see `token usage` very high and you frequently see war
### Tune `--dp-size` and `--tp-size`
Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism for throughput.
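For illustration only: on a machine with two GPUs and enough memory per GPU, this guidance translates into a launch along these lines (flag values are assumptions, not from this commit):

```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --dp-size 2 --tp-size 1
```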
### (Minor) Tune `--max-prefill-tokens`, `--mem-fraction-static`, `--max-running-requests`
If you see out of memory (OOM) errors, you can decrease these parameters.
If OOM happens during prefill, try to decrease `--max-prefill-tokens`.
If OOM happens during decoding, try to decrease `--max-running-requests`.
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
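For illustration, not a recommendation from this commit: combining these knobs with the launch command shown in the README hunk might look like the line below, with the values chosen arbitrarily as conservative starting points:

```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 \
  --mem-fraction-static 0.7 --max-prefill-tokens 8192 --max-running-requests 256
```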
### (Minor) Tune `--schedule-heuristic`
...