Unverified Commit d557319a authored by Michael Yao, committed by GitHub

[Docs] Fix links and grammar issues (#4162)

parent 95085d65
@@ -7,13 +7,14 @@ It should just work for most official models such as Llama-2/Llama-3.
If needed, you can also override the chat template when launching the server:
```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
```
If the chat template you are looking for is missing, you are welcome to contribute it or load it from a file.
## JSON Format
You can load a chat template in the JSON format defined by `conversation.py`.
```json
@@ -28,13 +29,14 @@ You can load the JSON format, which is defined by `conversation.py`.
}
```
```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json
```
## Jinja Format
You can also use the [Jinja template format](https://huggingface.co/docs/transformers/main/en/chat_templating) as defined by Hugging Face Transformers.
```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.jinja
```
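For reference, a minimal ChatML-style Jinja template could look like the sketch below. This is only an illustration: the `<|im_start|>`/`<|im_end|>` role markers are assumptions, so use whatever markers your model was trained with.
```bash
# Write a minimal ChatML-style Jinja template to a file (illustrative markers only;
# adjust the role markers to match your model's training format).
cat > ./my_model_template.jinja << 'EOF'
{% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
EOF
```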
@@ -28,11 +28,12 @@ If you see `decode out of memory happened` occasionally but not frequently, it i
### Tune `--dp-size` and `--tp-size`
Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism. Refer to the [sglang router](../router/router.md) for better data parallelism instead of using the `dp_size` parameter.
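As a rough illustration (the sizes here are arbitrary placeholders, not a recommendation), combining both forms of parallelism looks like this:
```bash
# Hypothetical sizes: 2 data-parallel replicas, each sharded across 2 GPUs (4 GPUs total).
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --dp-size 2 --tp-size 2
```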
## Avoid out-of-memory by Tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
If you see out of memory (OOM) errors, you can try to tune the following parameters; a combined example is sketched after the list.
- If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
- If OOM happens during decoding, try to decrease `--max-running-requests`.
- You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
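A minimal sketch combining these flags (the values are starting points to tune from, not recommendations):
```bash
# Smaller prefill chunks, fewer concurrent decoding requests, and a smaller KV cache pool.
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
  --chunked-prefill-size 4096 \
  --max-running-requests 128 \
  --mem-fraction-static 0.8
```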
@@ -48,16 +49,15 @@ If you want to deploy a model on many different machines, you can ship the `torc
1. Generate the cache by setting `TORCHINDUCTOR_CACHE_DIR` and running the model once.
```bash
TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile
```
2. Copy the cache folder to other machines and launch the server with `TORCHINDUCTOR_CACHE_DIR`, as sketched below.
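A minimal sketch of step 2, assuming a hypothetical host name `other-machine` and the same cache path on both sides:
```bash
# Copy the compiled cache to the target machine.
scp -r /root/inductor_root_cache user@other-machine:/root/
# Then, on the target machine, reuse the cache when launching the server.
TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile
```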
## Tune `--schedule-policy`
If the workload has many shared prefixes, use the default `--schedule-policy lpm`, where `lpm` stands for longest prefix match.
If you have no shared prefixes at all, or you always send requests with shared prefixes together,
you can try `--schedule-policy fcfs`, where `fcfs` stands for first come, first served. This policy has lower scheduling overhead.
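For example, to switch to first-come-first-served scheduling:
```bash
# Use FCFS scheduling when requests share no prefixes.
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --schedule-policy fcfs
```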
@@ -3,7 +3,7 @@
SGLang supports various quantization methods, including offline quantization and online dynamic quantization.
Offline quantization loads pre-quantized model weights directly during inference. This is required for quantization methods
such as GPTQ and AWQ, which collect and pre-compute various statistics from the original weights using a calibration dataset.
Online quantization dynamically computes scaling parameters—such as the maximum/minimum values of model weights—during runtime.
Like NVIDIA FP8 training's [delayed scaling](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#Mixed-precision-training-with-FP8) mechanism, online quantization calculates the appropriate scaling factors
@@ -12,7 +12,8 @@ on-the-fly to convert high-precision weights into a lower-precision format.
**Note: For better performance, usability, and convenience, offline quantization is recommended over online quantization.**
If you use a pre-quantized model, do not add `--quantization` to enable online quantization at the same time.
For popular pre-quantized models, please visit the [ModelCloud](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2)
or [NeuralMagic](https://huggingface.co/collections/neuralmagic) collections on Hugging Face for quality-validated quantized models.
Quantized models must be validated via benchmarks post-quantization
to guard against abnormal quantization loss regressions.
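For instance, a pre-quantized FP8 checkpoint (the model name below is only an example; pick any checkpoint from those collections) is launched without any `--quantization` flag, since the quantization scheme is read from the checkpoint:
```bash
# The quantization config comes from the checkpoint itself; do not pass --quantization here.
python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --port 30000 --host 0.0.0.0
```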
@@ -111,9 +112,9 @@ python3 -m sglang.launch_server \
  --port 30000 --host 0.0.0.0
```
Our team is working on supporting more online quantization methods. SGLang will soon support methods including but not limited to `["awq", "gptq", "marlin", "gptq_marlin", "awq_marlin", "bitsandbytes", "gguf"]`.
SGLang also supports quantization methods based on [torchao](https://github.com/pytorch/ao). You can simply specify `--torchao-config` in the command line to enable this feature. For example, if you want to enable `int4wo-128` for the model `meta-llama/Meta-Llama-3.1-8B-Instruct`, you can launch the server with the following command:
```bash
python3 -m sglang.launch_server \
@@ -122,7 +123,7 @@ python3 -m sglang.launch_server \
  --port 30000 --host 0.0.0.0
```
SGLang supports the following quantization methods based on torchao: `["int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row", "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256"]`.
Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), the `"int8dq"` method currently has some bugs when used together with CUDA graph capture, so we suggest disabling CUDA graph capture when using the `"int8dq"` method. Namely, please use the following command:
@@ -134,8 +135,6 @@ python3 -m sglang.launch_server \
  --port 30000 --host 0.0.0.0
```
## Reference
- [GPTQModel](https://github.com/ModelCloud/GPTQModel)
...