<!-- Thank you for your contribution! The following guidelines will help improve your pull request and facilitate feedback. If anything is unclear, don't hesitate to submit your pull request and ask the maintainers for assistance. You can also join our Slack community at https://slack.sglang.ai to discuss further. -->
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve. -->
## Modifications
<!-- Describe the changes made in this PR. -->
## Accuracy Tests
<!-- If this PR affects model outputs (e.g., changes to kernels or the model forward code), please provide accuracy test results. Ref: https://docs.sglang.ai/references/accuracy_evaluation.html -->
## Benchmarking and Profiling
<!-- If this PR is expected to impact performance, please provide benchmark and profiling results. Ref: https://docs.sglang.ai/references/benchmark_and_profiling.html -->
## Checklist
- [ ] Format your code according to the [Code Formatting with Pre-Commit](https://docs.sglang.ai/references/contribution_guide.html#code-formatting-with-pre-commit) guide.
- [ ] Add unit tests as outlined in [Running Unit Tests](https://docs.sglang.ai/references/contribution_guide.html#running-unit-tests-adding-to-ci).
- [ ] Update documentation / docstrings / example tutorials as needed, according to [Writing Documentation](https://docs.sglang.ai/references/contribution_guide.html#writing-documentation-running-docs-ci).
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to [Benchmark and Profiling](https://docs.sglang.ai/references/benchmark_and_profiling.html) and [Accuracy Results](https://docs.sglang.ai/references/accuracy_evaluation.html).
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
We recommend new contributors start with writing documentation, which helps you quickly understand the SGLang codebase. Most documentation files are located under the `docs/` folder. We prefer **Jupyter Notebooks** over Markdown so that all examples can be executed and validated by our docs CI pipeline.
Update your Jupyter notebooks in the appropriate subdirectories under `docs/`. If you add new files, remember to update `index.rst` (or relevant `.rst` files) accordingly.
- **`pre-commit run --all-files`** manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks **before** creating a Pull Request.
- **Do not commit** directly to the `main` branch. Always create a new branch (e.g., `feature/my-new-feature`), push your changes, and open a PR from that branch, as sketched below.
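A minimal sketch of this workflow (the branch name and commit message are illustrative):

```bash
# Run the formatting and lint checks before opening a PR
pip install pre-commit
pre-commit install
pre-commit run --all-files

# Work on a feature branch instead of main
git checkout -b feature/my-new-feature
git add -A
git commit -m "Describe your change"
git push origin feature/my-new-feature  # then open a PR from this branch on GitHub
```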
```bash
# 1) Compile all Jupyter notebooks
# This step can take a long time (10+ minutes). You can skip it if you are confident your added files are correct.
make compile
make html

# 2) Compile and preview the documentation locally with auto-build
# This will automatically rebuild the docs when files change
```
---
## Documentation Style Guidelines

- For common functionalities, we prefer **Jupyter Notebooks** over Markdown so that all examples can be executed and validated by our docs CI pipeline. For complex features (e.g., distributed serving), Markdown is preferred.
- Keep the documentation execution time in mind when writing interactive Jupyter notebooks. Each interactive notebook is run and compiled against every commit to ensure it stays runnable, so apply a few tips to reduce the documentation compilation time:
  - Use small models (e.g., `qwen/qwen2.5-0.5b-instruct`) for most cases to reduce server launch time.
  - Reuse the launched server as much as possible to reduce server launch time.
- Do not use absolute links (e.g., `https://docs.sglang.ai/get_started/install.html`). Always prefer relative links (e.g., `../get_started/install.md`).
- Follow the existing examples to learn how to launch a server, send a query, and handle other common patterns.

### **Port Allocation and CI Efficiency**

**To launch and kill the server:**

```python
from sglang.test.test_utils import is_in_ci
```
- **Dynamic Port Allocation**: Avoids port conflicts by selecting an available port at runtime, enabling multiple server instances to run in parallel.
- **Optimized for CI**: The `patch` version of `launch_server_cmd` and `sgl.Engine()` in CI environments helps manage GPU memory dynamically, preventing conflicts and improving test parallelism.
- **Better Parallel Execution**: Ensures smooth concurrent tests by avoiding fixed port collisions and optimizing memory usage.
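A minimal sketch of this launch/terminate pattern, assuming the `launch_server_cmd`, `wait_for_server`, and `terminate_process` helpers from `sglang.utils` (in CI, a patched `launch_server_cmd` is swapped in as described above):

```python
from sglang.utils import launch_server_cmd, terminate_process, wait_for_server

# launch_server_cmd picks a free port at runtime and returns it with the process handle
server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0"
)
wait_for_server(f"http://localhost:{port}")

# ... send the example queries to http://localhost:{port} ...

# Always terminate the server at the end of the notebook to free the GPU
terminate_process(server_process)
```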
### **Model Selection**
For demonstrations in the docs, **prefer smaller models** to reduce memory consumption and speed up inference. Running larger models in CI can lead to instability due to memory constraints.
### **Prompt Alignment Example**
When designing prompts, ensure they align with SGLang's structured formatting. For example:
```python
prompt="""You are an AI assistant. Answer concisely and accurately.
User: What is the capital of France?
Assistant: The capital of France is Paris."""
```
This keeps responses aligned with expected behavior and improves reliability across the documentation examples.
Note that CUDA graph consumes more memory, so you may need to reduce `--mem-fraction-static`.
### Tune `--dp-size` and `--tp-size`
Data parallelism is better for throughput: when there is enough GPU memory, always favor it. For data parallelism, prefer the [sglang router](../advanced_features/router.md) over the `dp_size` parameter (a launch sketch follows).
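A minimal sketch, assuming the router's co-launch entry point `sglang_router.launch_server` and an illustrative model path:

```bash
# Launch one router plus 2 data-parallel workers
python3 -m sglang_router.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dp-size 2
```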
### Try other options
- `torch.compile` accelerates small models on small batch sizes. You can enable it with `--enable-torch-compile`.
- Try other quantization methods (e.g., FP8 quantization with `--quantization fp8`).
- Try other parallelism strategies (e.g., [expert parallelism](https://lmsys.org/blog/2025-05-05-large-scale-ep/)) or DP attention for DeepSeek models (with `--enable-dp-attention --dp-size 8`).
- If the workload has many shared prefixes, try `--schedule-policy lpm`. Here, `lpm` stands for longest prefix match. It reorders requests to encourage more cache hits but introduces more scheduling overhead. (A combined launch example is sketched below.)
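A minimal sketch combining a few of these options (the model path is illustrative; enable only the flags that fit your workload):

```bash
python3 -m sglang.launch_server \
  --model-path qwen/qwen2.5-0.5b-instruct \
  --enable-torch-compile \
  --quantization fp8 \
  --schedule-policy lpm
```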
## Crash Dump and Replay

Sometimes the server might crash, and you may want to debug the cause.
SGLang supports crash dumping, which dumps all requests from the 5 minutes before the crash, allowing you to replay the requests and debug the cause later.
To enable crash dumping, use `--crash-dump-folder /tmp/crash_dump`. The server dumps the requests into a pickle file for every 100 requests.
To replay a request dump, use `scripts/playground/replay_request_dump.py`.
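A minimal sketch (the model path is illustrative; check the replay script for its exact arguments):

```bash
# Launch with crash dumping enabled
python3 -m sglang.launch_server \
  --model-path qwen/qwen2.5-0.5b-instruct \
  --crash-dump-folder /tmp/crash_dump

# After a crash, replay the dumped requests with the playground script
python3 scripts/playground/replay_request_dump.py
```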
You can find all server arguments by running `python3 -m sglang.launch_server --help`.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports other [quantization strategies (INT8/FP8)](https://github.com/sgl-project/sglang/blob/v0.3.6/python/sglang/srt/server_args.py#L671) as well.
- To enable fp8 weight quantization, add `--quantization fp8` on an fp16 checkpoint, or directly load an fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](../references/custom_chat_template.md).
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. For example, if you have two nodes with two GPUs each and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port; you can then use the commands sketched below. If you encounter a deadlock, try adding `--disable-cuda-graph`.
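A minimal sketch of those commands, using an illustrative model path with the hostname and port above:

```bash
# On the first node (sgl-dev-0)
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0

# On the second node
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
```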
"SGLang now provides an EAGLE-based (EAGLE-2/EAGLE-3) speculative decoding option. Our implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines.\n",
"SGLang now provides an EAGLE-based (EAGLE-2/EAGLE-3) speculative decoding option. Our implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines.\n",
"**Note:** Currently, Speculative Decoding in SGLang is compatible with radix cache and chunked prefill.\n",
"To enable EAGLE speculative decoding the following parameters are relevant:\n",
"To enable EAGLE speculative decoding the following parameters are relevant:\n",
"* `speculative_draft_model_path`: Specifies draft model. This parameter is required.\n",
"* `speculative_draft_model_path`: Specifies draft model. This parameter is required.\n",
"* `speculative_num_steps`: Depth of autoregressive drafting. Increases speculation range but risks rejection cascades. Default is 5.\n",
"* `speculative_num_steps`: Depth of autoregressive drafting. Increases speculation range but risks rejection cascades. Default is 5.\n",
"\n",
"* `speculative_eagle_topk`: Branching factor per step. Improves candidate diversity, will lead to higher acceptance rate, but more lead to higher memory/compute consumption. Default is 4.\n",
"* `speculative_eagle_topk`: Branching factor per step. Improves candidate diversity, will lead to higher acceptance rate, but more lead to higher memory/compute consumption. Default is 4.\n",
"\n",
"* `speculative_num_draft_tokens`: Maximum parallel verification capacity. Allows deeper tree evaluation but will lead to higher GPU memory usage. Default is 8.\n",
"* `speculative_num_draft_tokens`: Maximum parallel verification capacity. Allows deeper tree evaluation but will lead to higher GPU memory usage. Default is 8.\n",
"\n",
"\n",
"These parameters are the same for EAGLE-2 and EAGLE-3."
"These parameters are the same for EAGLE-2 and EAGLE-3.\n",
"\n",
"You can find the best combinations of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py).\n",
"\n",
"In the documentation below, we set `--cuda-graph-max-bs` to be a small value for faster engine startup. For your own workloads, please tune the above parameters together with `--cuda-graph-max-bs`, `--max-running-requests`, `--mem-fraction-static` for the best performance. "
"- [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints.\n",
"- [XGrammar](https://github.com/mlc-ai/xgrammar)(default): Supports JSON schema, regular expression, and EBNF constraints.\n",
"- [XGrammar](https://github.com/mlc-ai/xgrammar)(default): Supports JSON schema, regular expression, and EBNF constraints.\n",
"- [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints.\n",
"- [Llguidance](https://github.com/guidance-ai/llguidance): Supports JSON schema, regular expression, and EBNF constraints.\n",
"- [Llguidance](https://github.com/guidance-ai/llguidance): Supports JSON schema, regular expression, and EBNF constraints.\n",
"\n",
"\n",
"We suggest using XGrammar for its better performance and utility. XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md). For more details, see [XGrammar technical overview](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar).\n",
"We suggest using XGrammar for its better performance and utility. XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md). For more details, see [XGrammar technical overview](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar).\n",