- [ ] Format your code according to [Format code with pre-commit](https://docs.sglang.ai/developer_guide/contribution_guide.html#format-code-with-pre-commit).
- [ ] Add unit tests according to [Run and add unit tests](https://docs.sglang.ai/developer_guide/contribution_guide.html#run-and-add-unit-tests).
- [ ] Update documentation according to [Write documentations](https://docs.sglang.ai/developer_guide/contribution_guide.html#write-documentations).
- [ ] Provide accuracy and speed benchmark results according to [Test the accuracy](https://docs.sglang.ai/developer_guide/contribution_guide.html#test-the-accuracy) and [Benchmark the speed](https://docs.sglang.ai/developer_guide/contribution_guide.html#benchmark-the-speed).
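As a quick local reference for the formatting item above, the checks can be run before committing; a minimal sketch, assuming `pre-commit` hooks are configured in the repository:

```bash
# Install pre-commit once, register the hooks, then run all formatters/linters.
pip install pre-commit
pre-commit install
pre-commit run --all-files
```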
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total. We recommend [SGLang Router](../advanced_features/router.md) for data parallelism.
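A minimal sketch of such a command (the model path is a placeholder; adjust it to your setup). Two data-parallel replicas, each with 2-way tensor parallelism, use 4 GPUs in total:

```bash
# Hypothetical example: --dp 2 replicas x --tp 2 shards = 4 GPUs.
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --dp 2 --tp 2
```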
This document describes the sampling parameters of the SGLang Runtime, which are used by the runtime's low-level `/generate` endpoint.
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](openai_api_completions.ipynb).
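For illustration, a minimal call against the OpenAI-compatible endpoint might look like the following sketch (it assumes a server already running on the default port 30000; the sampling values are arbitrary):

```python
import openai

# Point the official OpenAI client at the local SGLang server.
client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="default",  # SGLang serves the launched model under this name
    messages=[{"role": "user", "content": "List three prime numbers."}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```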
## `/generate` Endpoint
The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](native_api.ipynb). The object is defined at `io_struct.py::GenerateReqInput`. You can also read the source code to find more arguments and docs.
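As a concrete sketch, here is a request with a few common sampling parameters (it assumes a server on the default port 30000; defaults may differ by version):

```python
import requests

# POST a prompt with sampling parameters to the native /generate endpoint.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0.7,
            "top_p": 0.95,
            "max_new_tokens": 32,
        },
    },
)
print(response.json()["text"])
```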
Another effective strategy is to review the file modification history and contact the relevant developers.
If you modify files protected by code owners, their approval is required to merge the code.
## General Code Style
- Avoid code duplication. If the same code snippet (more than five lines) appears multiple times, extract it into a shared function.
- Minimize device synchronization. Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, whenever possible. Use vectorized code instead (see the sketch after this list).
- Keep files concise. If a file exceeds 2,000 lines of code, split it into multiple smaller files.
- Prioritize extreme efficiency. SGLang is a runtime, and most of your code runs on the critical path for every request. Optimize every minor overhead as much as possible.
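To make the synchronization point concrete, here is an illustrative sketch (not code from the SGLang tree) of replacing a per-element `.item()` loop with a vectorized equivalent:

```python
import torch

scores = torch.randn(1024, device="cuda" if torch.cuda.is_available() else "cpu")

# Slow: every .item() call forces a CPU-GPU synchronization inside the loop.
count_slow = 0
for i in range(scores.numel()):
    if scores[i].item() > 0:
        count_slow += 1

# Fast: one vectorized comparison and reduction stay on the device;
# the result only crosses to the CPU if and when it is actually needed.
count_fast = (scores > 0).sum()
```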
| **DeepSeek** (v1, v2, v3/R1) | `deepseek-ai/DeepSeek-R1` | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. [SGLang provides DeepSeek v3/R1 model-specific optimizations](../basic_usage/deepseek.md) and a [Reasoning Parser](../advanced_features/separate_reasoning.ipynb)|
| **Qwen** (3, 3MoE, 2.5, 2 series) | `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-30B-A3B` | Alibaba’s latest Qwen3 series for complex reasoning, language understanding, and generation tasks; supports MoE variants along with previous generations (2.5, 2, etc.). [SGLang provides a Qwen3-specific reasoning parser](../advanced_features/separate_reasoning.ipynb)|
| **Llama** (2, 3.x, 4 series) | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | Meta’s open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and new Llama 4) with well-recognized performance. [SGLang provides Llama-4 model-specific optimizations](../basic_usage/llama4.md) |
| **Mistral** (Mixtral, NeMo, Small3) | `mistralai/Mistral-7B-Instruct-v0.2` | Open 7B LLM by Mistral AI with strong performance; extended into MoE (“Mixtral”) and NeMo Megatron variants for larger scale. |
| **Gemma** (v1, v2, v3) | `google/gemma-3-1b-it` | Google’s family of efficient multilingual models (1B–27B); Gemma 3 offers a 128K context window, and its larger (4B+) variants support vision input. |
| **Phi** (Phi-1.5, Phi-2, Phi-3, Phi-4, Phi-MoE series) | `microsoft/Phi-4-multimodal-instruct`, `microsoft/Phi-3.5-MoE-instruct` | Microsoft’s Phi family of small models (1.3B–5.6B); Phi-4-multimodal (5.6B) processes text, images, and speech; Phi-4-mini is a high-accuracy text model; and Phi-3.5-MoE is a mixture-of-experts model. |
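Any model in the table can be served with the standard launch command; for instance (model path taken from the table above):

```bash
# Serve Qwen3-0.6B on the default port (30000).
python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B
```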
The Transformers fallback supports most of the quantization methods available in SGLang (except GGUF). See the [Quantization page](../advanced_features/quantization.md) for more information.