Unverified commit 2449a0af authored by Lianmin Zheng, committed by GitHub

Refactor the docs (#9031)

parent 0f229c07
# Install SGLang

You can install SGLang using one of the methods below.

This page primarily applies to common NVIDIA GPU platforms.
For other or newer platforms, please refer to the dedicated pages for [NVIDIA Blackwell GPUs](../platforms/blackwell_gpu.md), [AMD GPUs](../platforms/amd_gpu.md), [Intel Xeon CPUs](../platforms/cpu_server.md), [NVIDIA Jetson](../platforms/nvidia_jetson.md), and [Ascend NPUs](../platforms/ascend_npu.md).

## Method 1: With pip or uv

It is recommended to use uv for faster installation:
```bash
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]>=0.5.0rc0"
```
**Quick fixes to common problems**

- If you encounter `OSError: CUDA_HOME environment variable is not set`, set it to your CUDA install root with either of the following solutions (see the sketch after this list):
  1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable.
  2. Install FlashInfer first following the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.
- SGLang currently uses torch 2.8 and flashinfer for torch 2.8. If you want to install flashinfer separately, please refer to the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html). Please note that the FlashInfer PyPI package is called `flashinfer-python` instead of `flashinfer`.
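As a minimal sketch of the two fixes above (the CUDA version shown is only an example; adjust it to your installation):

```bash
# Option 1: point CUDA_HOME at your CUDA toolkit, then install SGLang
export CUDA_HOME=/usr/local/cuda-12.8   # example path, use your actual version
uv pip install "sglang[all]>=0.5.0rc0"

# Option 2: install FlashInfer explicitly first (the PyPI package is flashinfer-python)
pip install flashinfer-python
uv pip install "sglang[all]>=0.5.0rc0"
```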
## Method 2: From source

```bash
# Use the last release branch
git clone -b v0.5.0rc0 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"
```

**Quick fixes to common problems**

- If you want to develop SGLang, it is recommended to use docker. Please refer to [setup docker container](../developer_guide/development_guide_using_docker.md#setup-docker-container). The docker image is `lmsysorg/sglang:dev`.
- SGLang currently uses torch 2.8 and flashinfer for torch 2.8. If you want to install flashinfer separately, please refer to the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html). Please note that the FlashInfer PyPI package is called `flashinfer-python` instead of `flashinfer`.
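After either method, a quick way to confirm the installation is to import the package and launch a small server (the model path here is only an example; any supported model works):

```bash
python3 -c "import sglang; print(sglang.__version__)"
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
```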
## Method 3: Using docker

The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
docker run --gpus all \
    ...
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
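Once the container is running, you can sanity-check the endpoint with a request to the native generate API (the prompt and sampling parameters below are arbitrary examples):

```bash
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"temperature": 0, "max_new_tokens": 16}}'
```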
## Method 4: Using Kubernetes

Please check out [OME](https://github.com/sgl-project/ome), a Kubernetes operator for enterprise-grade management and serving of large language models (LLMs).

<details>
<summary>More</summary>

1. Option 1: For single-node serving (typically when the model fits into the GPUs of one node), run `kubectl apply -f docker/k8s-sglang-service.yaml` to create a Kubernetes deployment and service, with llama-31-8b as an example.

2. Option 2: For multi-node serving (usually when a large model requires more than one GPU node, such as `DeepSeek-R1`), modify the LLM model path and arguments as necessary, then run `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml` to create a two-node Kubernetes StatefulSet and serving service.

A basic verification sketch follows this section.

</details>
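After applying either manifest, standard kubectl commands can be used to confirm the pods and service are up (the exact resource names depend on the manifest you applied):

```bash
kubectl get pods   # wait until the sglang pods are Running
kubectl get svc    # find the service exposing the server port (30000 by default)
```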
## Method 5: Using docker compose

<details>
<summary>More</summary>

...

2. Execute the command `docker compose up -d` in your terminal.

</details>
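Once the stack is up, the usual compose commands can be used to check on it:

```bash
docker compose ps        # confirm the sglang service is running
docker compose logs -f   # follow the server logs
```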
## Method 6: Run on Kubernetes or Clouds with SkyPilot

<details>

...
## Common Notes

- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` (see the example after this list) and open an issue on GitHub.
- To reinstall flashinfer locally, use the following command: `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
- If you only need to use OpenAI API models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install "sglang[srt]"`. `srt` is the abbreviation of SGLang runtime.
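For example, a server launched with the fallback kernels mentioned above might look like this (the model path is illustrative):

```bash
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend triton \
    --sampling-backend pytorch \
    --host 0.0.0.0 --port 30000
```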
...

.. toctree::
   :maxdepth: 1
   :caption: Get Started

   get_started/install.md

.. toctree::
   :maxdepth: 1
   :caption: Basic Usage

   basic_usage/send_request.ipynb
   basic_usage/openai_api.rst
   basic_usage/offline_engine_api.ipynb
   basic_usage/native_api.ipynb
   basic_usage/sampling_params.md
   basic_usage/deepseek.md
   basic_usage/gpt_oss.md
   basic_usage/llama4.md

.. toctree::
   :maxdepth: 1
   :caption: Advanced Features

   advanced_features/server_arguments.md
   advanced_features/hyperparameter_tuning.md
   advanced_features/speculative_decoding.ipynb
   advanced_features/structured_outputs.ipynb
   advanced_features/structured_outputs_for_reasoning_models.ipynb
   advanced_features/function_calling.ipynb
   advanced_features/separate_reasoning.ipynb
   advanced_features/quantization.md
   advanced_features/lora.ipynb
   advanced_features/pd_disaggregation.md
   advanced_features/vlm_query.ipynb
   advanced_features/router.md
   advanced_features/observability.md
   advanced_features/attention_backend.md

.. toctree::
   :maxdepth: 1
   :caption: Supported Models

   ...
   supported_models/multimodal_language_models.md
   supported_models/embedding_models.md
   supported_models/reward_models.md
   supported_models/rerank_models.md
   supported_models/support_new_models.md
   supported_models/transformers_fallback.md
   supported_models/modelscope.md

.. toctree::
   :maxdepth: 1
   :caption: Hardware Platforms

   platforms/amd_gpu.md
   platforms/blackwell_gpu.md
   platforms/cpu_server.md
   platforms/tpu.md
   platforms/nvidia_jetson.md
   platforms/ascend_npu.md

.. toctree::
   :maxdepth: 1
   :caption: Developer Guide

   developer_guide/contribution_guide.md
   developer_guide/development_guide_using_docker.md
   developer_guide/benchmark_and_profiling.md

.. toctree::
   :maxdepth: 1
   :caption: References

   references/faq.md
   references/environment_variables.md
   references/production_metrics.md
   references/custom_chat_template.md
   references/frontend/frontend_index.rst
   references/multi_node_deployment/multi_node_index.rst
   references/learn_more.md
# AMD GPUs

This document describes how to run SGLang on AMD GPUs. If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues).

## System Configuration

When using AMD GPUs (such as MI300X), certain system-level optimizations help ensure stable performance. Here we take MI300X as an example. AMD provides official documentation for MI300X optimization and system tuning:

- [AMD MI300X Tuning Guides](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html)
- [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/vllm-benchmark.html)
- [AMD Instinct MI300X System Optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html)
- [AMD Instinct MI300X Workload Optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html)
- [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)

**NOTE:** We strongly recommend reading these docs and guides entirely to fully utilize your system.
...

Again, please go through the entire documentation to confirm your system is using the recommended configuration.
## Install SGLang

You can install SGLang using one of the methods below.

### Install from Source

```bash
# Use the last release branch
git clone -b v0.5.0rc0 https://github.com/sgl-project/sglang.git
cd sglang

# Compile sgl-kernel
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install

# Install sglang python package
cd ..
pip install -e "python[all_hip]"
```
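As a quick, optional sanity check after the build (assuming a ROCm-visible GPU; the commands only verify that the packages import and the devices are detected):

```bash
python3 -c "import sgl_kernel, sglang; print(sglang.__version__)"
rocm-smi   # confirm the GPUs are visible to ROCm
```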
### Install Using Docker (Recommended)

The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile.rocm](https://github.com/sgl-project/sglang/tree/main/docker).
The steps below show how to build and use an image.

1. Build the docker image.

   If you use pre-built images, you can skip this step and replace `sglang_image` with the pre-built image names in the steps below.

   ```bash
   docker build -t sglang_image -f Dockerfile.rocm .
   ```

2. ...

   ```bash
   ...
      -v /data:/data'
   ```

   If you are using RDMA, please note that:

   1. `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
   2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.

3. Launch the server.

   **NOTE:** Replace `<secret>` below with your [huggingface hub token](https://huggingface.co/docs/hub/en/security-tokens).
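For reference, a launch in this style might look like the following, assuming the container alias from step 2 is named `drun` (the model path and port are illustrative; adjust volumes and devices to your setup):

```bash
drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    sglang_image \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```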
......
# Ascend NPUs
## Install
TODO
## Examples
TODO
# Blackwell GPUs
We will release pre-built wheels soon. Until then, please compile from source or use the Blackwell docker images from [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).
## B200 with x86 CPUs
TODO
## GB200/GB300 with ARM CPUs
TODO
# CPU Servers

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
Specifically, SGLang is well optimized for CPUs equipped with Intel® AMX® instructions,
......
# NVIDIA Jetson Orin

## Prerequisites
......
# TPU
Support for TPU is under active development. Please stay tuned.
# Measuring Model Accuracy in SGLang
This guide shows how to evaluate model accuracy using SGLang's [built-in benchmarks](https://github.com/sgl-project/sglang/tree/b045841baeff37a5601fcde23fa98bd09d942c36/benchmark). If your PR modifies the model side (e.g., kernels or model architecture), please include accuracy results on the crucial benchmarks.
## Benchmarking Model Accuracy
This is a reference workflow for the [MMLU benchmark](https://github.com/sgl-project/sglang/tree/main/benchmark/mmlu). For more details or other benchmarks, please refer to the README in each specific benchmark folder under [sglang/benchmark](https://github.com/sgl-project/sglang/tree/b045841baeff37a5601fcde23fa98bd09d942c36/benchmark).
```bash
# Step 1: Download the dataset
bash download_data.sh

# Step 2: Launch the server
# (choose the model, the port, and the static memory fraction as needed)
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-Math-1.5B-Instruct \
    --port 30000 \
    --mem-fraction-static 0.8

# Step 3: Run the benchmark script
python3 bench_sglang.py --nsub 10  # test 10 subjects

# Step 4: Extract the accuracy
cat result.jsonl | grep -oP '"accuracy": \K\d+\.\d+'
```
## Customizing Benchmark Scripts
Some benchmark implementations may differ from ours, causing accuracy discrepancies. For example, to match [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math)'s reported 76.8% GSM8K accuracy, some customization is required.
```python
# The GSM8K benchmark script includes few shot examples for evaluation by default.
# Here we exclude them.
for i in range(len(lines[num_shots:num_questions])):
questions.append(get_one_example(lines, i, False))
labels.append(get_answer_value(lines[i]["answer"]))
```
```python
@sgl.function
def few_shot_gsm8k(s, question):
# System prompt given in https://github.com/QwenLM/Qwen2.5-Math
s += sgl.system("Please reason step by step, and put your final answer within \\boxed{}.") # Include system prompt
s += few_shot_examples + question
# Stopwords given in evaluation/math_eval.py of the Qwen2.5-Math repo
s += sgl.gen(
"answer", max_tokens=2048, stop=["Question", "Assistant:", "</s>", "<|im_end|>", "<|endoftext|>"]
)
```
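As a rough usage sketch, the modified function can be run against a local server through the frontend API (the endpoint URL and the question are placeholders, and `few_shot_examples` is assumed to be defined by the surrounding script as in the original):

```python
from sglang import RuntimeEndpoint
from sglang.lang.api import set_default_backend

# Point the frontend at a running SGLang server (URL is an example)
set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = few_shot_gsm8k.run(
    question="Janet has 3 apples and buys 2 more. How many apples does she have in total?"
)
print(state["answer"])
```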
These adjustments should reproduce the reported accuracy.
## Extending Evaluation Capabilities
1. **Contribute New Benchmarks**
* Follow our [contribution guidelines](../references/contribution_guide.md) to add new test scripts
2. **Request Implementations**
* Feel free to open an issue describing your evaluation needs
3. **Use Alternative Tools**
* [OpenCompass](https://opencompass.org.cn)
* [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)
Multi-Node Deployment
==========================
.. toctree::
:maxdepth: 1
multi_node.md
deploy_on_k8s.md
disaggregation/lws_pd_deploy.md
Multi-Node Deployment
==========================
.. toctree::
:maxdepth: 1
deepseek.md
Developer Reference
==========================
.. toctree::
:maxdepth: 1
development_guide_using_docker.md
release_process.md
setup_github_runner.md
# Troubleshooting and Frequently Asked Questions

## Troubleshooting
This page lists common errors and tips for resolving them.
### CUDA Out of Memory
If you encounter out-of-memory (OOM) errors, you can adjust the following parameters (an example launch command follows this list):
- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
- Another common cause of OOM is requesting input logprobs for a long prompt, as this requires significant memory. To address this, set `logprob_start_len` in your sampling parameters to include only the necessary parts. If you do need input logprobs for a long prompt, reduce `--mem-fraction-static`.
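For instance, a launch command applying several of these memory-saving settings might look like this (the model path and exact values are illustrative, not recommendations):

```bash
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --chunked-prefill-size 4096 \
    --max-running-requests 128 \
    --mem-fraction-static 0.8 \
    --port 30000
```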
### CUDA Error: Illegal Memory Access Encountered
This error may result from kernel errors or out-of-memory issues:
- If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub.
- If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues.
## Frequently Asked Questions
### The results are not deterministic, even with a temperature of 0
You may notice that when you send the same request twice, the results from the engine will be slightly different, even when the temperature is set to 0.
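A simple way to observe this (assuming a server is already running locally on port 30000; the prompt is arbitrary):

```bash
# Send the identical request twice with temperature 0 and compare the outputs
for i in 1 2; do
  curl -s http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "Explain continuous batching in one sentence.", "sampling_params": {"temperature": 0, "max_new_tokens": 32}}'
  echo
done
```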
......
Frontend Language
=================
.. toctree::
:maxdepth: 1
:caption: Frontend Language
frontend_tutorial.ipynb
choices_methods.md
...

"metadata": {},
"outputs": [],
"source": [
"from sglang import assistant_begin, assistant_end\n",
"from sglang import assistant, function, gen, system, user\n",
"from sglang import image\n",
"from sglang import RuntimeEndpoint\n",
"from sglang.lang.api import set_default_backend\n",
"from sglang.srt.utils import load_image\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import print_highlight, terminate_process, wait_for_server\n",
"\n",
"\n",
"server_process, port = launch_server_cmd(\n",
"    \"python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0\"\n",
")\n",
......
General Guidance
================
.. toctree::
:maxdepth: 1
contribution_guide.md
troubleshooting.md
faq.md
learn_more.md
modelscope.md
environment_variables.md
production_metrics.md
Hardware Supports
=================
.. toctree::
:maxdepth: 1
amd.md
nvidia_jetson.md
cpu.md
# Learn more

You can find more blogs, slides, and videos about SGLang at [https://github.com/sgl-project/sgl-learning-materials](https://github.com/sgl-project/sgl-learning-materials).
The latest SGLang features and updates are shared through the [LMSYS blog](https://lmsys.org/blog/).
The 2025 H2 roadmap can be found at this [issue](https://github.com/sgl-project/sglang/issues/7736).