You can also use the [Jinja template format](https://huggingface.co/docs/transformers/main/en/chat_templating) as defined by Hugging Face Transformers.
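For instance, the sketch below writes a minimal Hugging Face-style Jinja chat template to disk and points the server at it. The template body and model name are only illustrative, and it assumes `--chat-template` accepts the path to a `.jinja` file as described above.

```
# Illustrative only: a minimal HF-style Jinja chat template.
cat > my_chat_template.jinja << 'EOF'
{% for message in messages %}{{ '<|' + message['role'] + '|>\n' + message['content'] + '\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>\n' }}{% endif %}
EOF

# Launch the server with the custom template (model name is an example).
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --chat-template ./my_chat_template.jinja
```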
### Tune `--dp-size` and `--tp-size`
Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism. For better data parallelism, use the [sglang router](../router/router.md) rather than the `--dp-size` parameter.
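As a rough sketch (flag values and model name are illustrative), tensor and data parallelism can be combined on a single server, while the router-based variant launches multiple replicas and load-balances across them; the second command assumes the `sglang-router` package is installed, and its exact options may differ by version.

```
# Plain server: 2-way tensor parallel x 2-way data parallel (4 GPUs).
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tp-size 2 --dp-size 2

# Router-based data parallelism: the router spawns and balances the replicas.
python3 -m sglang_router.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dp-size 2
```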
## Avoid out-of-memory by tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
If you see out-of-memory (OOM) errors, you can try to tune the following parameters; an example command is sketched after this list.
- If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
- If OOM happens during decoding, try to decrease `--max-running-requests`.
- You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
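A combined sketch of these knobs (the specific values are illustrative starting points, not recommendations):

```
# Conservative memory settings: smaller prefill chunks, fewer concurrent
# decoding requests, and a smaller static fraction for the KV cache pool.
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --chunked-prefill-size 4096 \
  --max-running-requests 128 \
  --mem-fraction-static 0.8
```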
...
...
If you want to deploy a model on many different machines, you can ship the `torch.compile` cache to these machines and skip the compilation steps.
1. Generate the cache by setting `TORCHINDUCTOR_CACHE_DIR` and running the model once.
2. Copy the cache directory to the other machines and set `TORCHINDUCTOR_CACHE_DIR` to the same path before launching the server.
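A minimal sketch of these two steps (the cache path, host name, and model name are placeholders):

```
# Step 1: generate the Inductor cache on one machine.
export TORCHINDUCTOR_CACHE_DIR=/tmp/inductor_cache
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-torch-compile

# Step 2: ship the cache and reuse it on another machine.
scp -r /tmp/inductor_cache other-machine:/tmp/inductor_cache
# On the other machine, export the same TORCHINDUCTOR_CACHE_DIR before
# launching the server with --enable-torch-compile to reuse the cache.
```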
SGLang supports various quantization methods, including offline quantization and online dynamic quantization.
Offline quantization loads pre-quantized model weights directly during inference. This is required for quantization methods
such as GPTQ and AWQ, which collect and pre-compute various statistics from the original weights using the calibration dataset.
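For example, a pre-quantized checkpoint can be served directly; the AWQ checkpoint name below is only an example, and the quantization method is typically picked up from the checkpoint's configuration.

```
# Serve an offline-quantized (AWQ) checkpoint directly.
python3 -m sglang.launch_server \
  --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
```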
Online quantization dynamically computes scaling parameters—such as the maximum/minimum values of model weights—during runtime.
Like NVIDIA FP8 training's [delayed scaling](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#Mixed-precision-training-with-FP8) mechanism, online quantization calculates the appropriate scaling factors
on-the-fly to convert high-precision weights into a lower-precision format.
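As a sketch of online quantization (assuming FP8 dynamic quantization is supported for this model), pass `--quantization` explicitly when launching an unquantized checkpoint:

```
# Online (dynamic) FP8 quantization of an unquantized checkpoint.
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8
```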
**Note: For better performance, usability and convenience, offline quantization is recommended over online quantization.**
If you use a pre-quantized model, do not add `--quantization` to enable online quantization at the same time.
For popular pre-quantized models, please visit the [ModelCloud](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2)
or [NeuralMagic](https://huggingface.co/collections/neuralmagic) collections on Hugging Face for quality-validated quantized models.
Quantized models must be validated with benchmarks after quantization to guard against abnormal quantization loss regressions.
Our team is working on supporting more online quantization methods. SGLang will soon support methods including but not limited to `["awq", "gptq", "marlin", "gptq_marlin", "awq_marlin", "bitsandbytes", "gguf"]`.
SGLang also supports quantization methods based on [torchao](https://github.com/pytorch/ao). You can simply specify `--torchao-config` on the command line to enable this feature. For example, if you want to enable `int4wo-128` for the model `meta-llama/Meta-Llama-3.1-8B-Instruct`, you can launch the server with a command like the following sketch:
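```
# Sketch: int4 weight-only quantization with group size 128 via torchao
# (host/port and other serving flags omitted).
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --torchao-config int4wo-128
```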
SGLang supports the following quantization methods based on torchao: `["int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row", "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256"]`.
Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), the `"int8dq"` method currently has some bugs when used together with CUDA graph capture. We therefore suggest disabling CUDA graph capture when using the `"int8dq"` method, namely with a command like the following sketch:
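```
# Sketch: torchao int8dq with CUDA graph capture disabled
# (host/port and other serving flags omitted).
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --torchao-config int8dq \
  --disable-cuda-graph
```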