Unverified Commit 023d0a73 authored by Xiaoyu Zhang, committed by GitHub
fix small typos in docs (#2047)

parent 32c9a7ec
@@ -79,8 +79,8 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
 ```
-- To enable the experimental overlapped scheduler, add `--enable-overlap-schedule`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currenly.
-- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currenly.
+- To enable the experimental overlapped scheduler, add `--enable-overlap-schedule`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currently.
+- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
 - To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
 - To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
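
The options listed in this hunk can be combined on one command line. As a rough illustration only, the Python sketch below assembles such a command without running it. The flag names and values are copied from the documentation text above; the helper `build_launch_command` is hypothetical and not part of SGLang. Rejecting torch.compile together with any FP8 option is a conservative reading of the note "This does not work for FP8 currently", which is an assumption about how far that restriction reaches.

```python
import shlex

# Base command taken from the documentation text in this diff.
BASE = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Meta-Llama-3-8B-Instruct",
]


def build_launch_command(overlap_schedule=False, torch_compile=False,
                         torchao_config=None, quantization=None,
                         kv_cache_dtype=None):
    """Hypothetical helper: return the launch command as a list of arguments."""
    uses_fp8 = quantization == "fp8" or (kv_cache_dtype or "").startswith("fp8")
    if torch_compile and uses_fp8:
        # Assumption: treat both FP8 weights and FP8 KV cache as incompatible.
        raise ValueError("--enable-torch-compile is documented as not working with FP8")

    cmd = list(BASE)
    if overlap_schedule:
        cmd.append("--enable-overlap-schedule")
    if torch_compile:
        cmd.append("--enable-torch-compile")
    if torchao_config:
        cmd += ["--torchao-config", torchao_config]
    if quantization:
        cmd += ["--quantization", quantization]
    if kv_cache_dtype:
        cmd += ["--kv-cache-dtype", kv_cache_dtype]
    return cmd


if __name__ == "__main__":
    # Print the command for inspection instead of executing it.
    print(shlex.join(build_launch_command(overlap_schedule=True,
                                          kv_cache_dtype="fp8_e5m2")))
```

Building the command as a list and printing it with `shlex.join` keeps the sketch a dry run; passing the same list to `subprocess.run` would start the server on a machine where SGLang is installed.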
......
 # Frontend: Structured Generation Language (SGLang)
-The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may found it easier to use for complex prompting workflow.
+The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may find it easier to use for complex prompting workflow.
 ## Quick Start
 The example below shows how to use SGLang to answer a multi-turn question.
......
@@ -31,8 +31,8 @@ If you see out of memory (OOM) errors, you can try to tune the following paramet
 - You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
 ### Try Advanced Options
-- To enable the experimental overlapped scheduler, add `--enable-overlap-schedule`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currenly.
-- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currenly.
+- To enable the experimental overlapped scheduler, add `--enable-overlap-schedule`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currently.
+- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
 ### Tune `--schedule-policy`
 If the workload has many shared prefixes, use the default `--schedule-policy lpm`. `lpm` stands for longest prefix match.
......
@@ -11,4 +11,4 @@ If you see out of memory (OOM) errors, you can try to tune the following paramet
 ## CUDA error: an illegal memory access was encountered
 This error may be due to kernel errors or out-of-memory issues.
 - If it is a kernel error, it is not easy to fix. Please file an issue on the GitHub.
-- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." Please refer to the above seciton to avoid the OOM.
+- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." Please refer to the above section to avoid the OOM.