"audio_classification_scripts/run_dropout_sweep.yaml" did not exist on "5b5167d890bb150e5e82685993b17f1daf109e79"
Commit 118f1fc7 authored by maxiao1's avatar maxiao1
Browse files

sglangv0.5.2 & support Qwen3-Next-80B-A3B-Instruct

parents
# Development Guide Using Docker
## Setup VSCode on a Remote Host
(Optional - you can skip this step if you plan to run sglang dev container locally)
1. On the remote host, download the `code` CLI from [https://code.visualstudio.com/download](https://code.visualstudio.com/download) and run `code tunnel` in a shell.
Example
```bash
wget https://vscode.download.prss.microsoft.com/dbazure/download/stable/fabdb6a30b49f79a7aba0f2ad9df9b399473380f/vscode_cli_alpine_x64_cli.tar.gz
tar xf vscode_cli_alpine_x64_cli.tar.gz
# https://code.visualstudio.com/docs/remote/tunnels
./code tunnel
```
2. On your local machine, press F1 in VSCode and choose "Remote Tunnels: Connect to Tunnel".
## Setup Docker Container
### Option 1. Use the default dev container automatically from VSCode
There is a `.devcontainer` folder in the sglang repository root folder to allow VSCode to automatically start up within dev container. You can read more about this VSCode extension in VSCode official document [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers).
![image](https://github.com/user-attachments/assets/6a245da8-2d4d-4ea8-8db1-5a05b3a66f6d)
(*Figure 1: Diagram from VSCode official documentation [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers).*)
To enable this, you only need to:
1. Start Visual Studio Code and install [VSCode dev container extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers).
2. Press F1, then type and choose "Dev Containers: Open Folder in Container".
3. Input the `sglang` local repo path in your machine and press enter.
The first time you open the project in the dev container may take longer due to the docker pull and build. Once it succeeds, you should see an indicator at the bottom left of the status bar showing that you are in a dev container:
![image](https://github.com/user-attachments/assets/650bba0b-c023-455f-91f9-ab357340106b)
Now when you run `sglang.launch_server` in the VSCode terminal or start debugging using F5, sglang server will be started in the dev container with all your local changes applied automatically:
![image](https://github.com/user-attachments/assets/748c85ba-7f8c-465e-8599-2bf7a8dde895)
### Option 2. Start up containers manually (advanced)
The following startup command is an example for internal development by the SGLang team. You can **modify or add directory mappings as needed**, especially for model weight downloads, to prevent repeated downloads by different Docker containers.
❗️ **Note on RDMA**
1. `--network host` and `--privileged` are required for RDMA. If you don't need RDMA, you can remove them, but keeping them does no harm. Thus, we enable these two flags by default in the commands below.
2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.
```bash
# Change the name to yours
docker run -itd --shm-size 32g --gpus all -v <volumes-to-mount> --ipc=host --network=host --privileged --name sglang_dev lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_dev /bin/zsh
```
Some useful volumes to mount are:
1. **Huggingface model cache**: mounting the model cache avoids re-downloading models every time the docker container restarts. The default location on Linux is `~/.cache/huggingface/`.
2. **SGLang repository**: code changes in the local SGLang repository will be automatically synced to the dev container.
Example 1: Mounting the local cache folder `/opt/dlami/nvme/.cache` but not the SGLang repo. Use this when you prefer to manually transfer local code changes to the dev container.
```bash
docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh
```
Example 2: Mounting both the HuggingFace cache and the local SGLang repo. Local code changes are automatically synced to the dev container, as SGLang is installed in editable mode in the dev image.
```bash
docker run -itd --shm-size 32g --gpus all -v $HOME/.cache/huggingface/:/root/.cache/huggingface -v $HOME/src/sglang:/sgl-workspace/sglang --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh
```
## Debug SGLang with VSCode Debugger
1. Open `launch.json` in VSCode (create it if it does not exist).
2. Add the following config and save. Note that you can edit the script as needed to apply different parameters or debug a different program (e.g., a benchmark script).
```JSON
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python Debugger: launch_server",
            "type": "debugpy",
            "request": "launch",
            "module": "sglang.launch_server",
            "console": "integratedTerminal",
            "args": [
                "--model-path", "meta-llama/Llama-3.2-1B",
                "--host", "0.0.0.0",
                "--port", "30000",
                "--trust-remote-code"
            ],
            "justMyCode": false
        }
    ]
}
```
3. Press "F5" to start. VSCode debugger will ensure that the program will pause at the breakpoints even if the program is running at remote SSH/Tunnel host + dev container.
## Profile
```bash
# Change batch size, input, output and add `--disable-cuda-graph` (for easier analysis)
# e.g. DeepSeek V3
nsys profile -o deepseek_v3 python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8 --disable-cuda-graph
```
## Evaluation
```bash
# e.g. gsm8k 8 shot
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
```
# PyPI Package Release Process
## Update the version in code
Update the package version in `python/pyproject.toml` and `python/sglang/__init__.py`.
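For example, a minimal sketch of the bump, assuming the previous version was `0.5.1` and the new one is `0.5.2` (verify the actual version strings in both files before editing):
```bash
# Hypothetical bump from 0.5.1 to 0.5.2; adjust to the real version numbers
sed -i 's/^version = "0.5.1"/version = "0.5.2"/' python/pyproject.toml
sed -i 's/__version__ = "0.5.1"/__version__ = "0.5.2"/' python/sglang/__init__.py
```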
## Upload the PyPI package
```
pip install build twine
```
```
cd python
bash upload_pypi.sh
```
## Make a release in GitHub
Make a new release at https://github.com/sgl-project/sglang/releases/new.
# Set Up Self-Hosted Runners for GitHub Action
## Add a Runner
### Step 1: Start a docker container.
You can mount a folder for the shared huggingface model weights cache. The command below uses `/tmp/huggingface` as an example.
```
docker pull nvidia/cuda:12.1.1-devel-ubuntu22.04
# Nvidia
docker run --shm-size 128g -it -v /tmp/huggingface:/hf_home --gpus all nvidia/cuda:12.1.1-devel-ubuntu22.04 /bin/bash
# AMD
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.5.0rc1-rocm630 /bin/bash
# AMD just the last 2 GPUs
docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.5.0rc1-rocm630 /bin/bash
```
### Step 2: Configure the runner by `config.sh`
Run these commands inside the container.
```
apt update && apt install -y curl python3-pip git
export RUNNER_ALLOW_RUNASROOT=1
```
Then follow https://github.com/sgl-project/sglang/settings/actions/runners/new?arch=x64&os=linux to run `config.sh` (an example invocation is shown after the notes below).
**Notes**
- You do not need to specify the runner group.
- Give it a name (e.g., `test-sgl-gpu-0`) and some labels (e.g., `1-gpu-runner`). The labels can be edited later in GitHub Settings.
- You do not need to change the work folder.
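A sketch of the resulting `config.sh` invocation, assuming `<RUNNER_TOKEN>` is the registration token shown on the GitHub settings page above (the name and labels are just the examples from the notes):
```bash
./config.sh --url https://github.com/sgl-project/sglang \
    --token <RUNNER_TOKEN> \
    --name test-sgl-gpu-0 \
    --labels 1-gpu-runner \
    --unattended
```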
### Step 3: Run the runner by `run.sh`
- Set up environment variables
```
export HF_HOME=/hf_home
export SGLANG_IS_IN_CI=true
export HF_TOKEN=hf_xxx
export OPENAI_API_KEY=sk-xxx
export CUDA_VISIBLE_DEVICES=0
```
- Run it forever
```
while true; do ./run.sh; echo "Restarting..."; sleep 2; done
```
# Install SGLang
You can install SGLang using one of the methods below.
This page primarily applies to common NVIDIA GPU platforms.
For other or newer platforms, please refer to the dedicated pages for [NVIDIA Blackwell GPUs](../platforms/blackwell_gpu.md), [AMD GPUs](../platforms/amd_gpu.md), [Intel Xeon CPUs](../platforms/cpu_server.md), [NVIDIA Jetson](../platforms/nvidia_jetson.md), [Ascend NPUs](../platforms/ascend_npu.md).
## Method 1: With pip or uv
It is recommended to use uv for faster installation:
```bash
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]>=0.5.2"
```
**Quick fixes to common problems**
- If you encounter `OSError: CUDA_HOME environment variable is not set`, set it to your CUDA installation root with either of the following solutions:
1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable.
2. Install FlashInfer first following [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.
## Method 2: From source
```bash
# Use the last release branch
git clone -b v0.5.2 https://github.com/sgl-project/sglang.git
cd sglang
# Install the python packages
pip install --upgrade pip
pip install -e "python[all]"
```
**Quick fixes to common problems**
- If you want to develop SGLang, it is recommended to use docker. Please refer to [setup docker container](../developer_guide/development_guide_using_docker.md#setup-docker-container). The docker image is `lmsysorg/sglang:dev`.
## Method 3: Using docker
The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
## Method 4: Using Kubernetes
Please check out [OME](https://github.com/sgl-project/ome), a Kubernetes operator for enterprise-grade management and serving of large language models (LLMs).
<details>
<summary>More</summary>
1. Option 1: For single-node serving (typically when the model fits into the GPUs on one node)
Execute `kubectl apply -f docker/k8s-sglang-service.yaml` to create the k8s deployment and service, using llama-3.1-8b as an example.
2. Option 2: For multi-node serving (usually when a large model requires more than one GPU node, such as `DeepSeek-R1`)
Modify the LLM model path and arguments as necessary, then execute `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml` to create a two-node k8s StatefulSet and serving service.
</details>
## Method 5: Using docker compose
<details>
<summary>More</summary>
> This method is recommended if you plan to serve it as a service.
> A better approach is to use the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml).
1. Copy the [compose.yml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine
2. Execute the command `docker compose up -d` in your terminal.
</details>
## Method 6: Run on Kubernetes or Clouds with SkyPilot
<details>
<summary>More</summary>
To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).
1. Install SkyPilot and set up Kubernetes cluster or cloud access: see [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
2. Deploy on your own infra with a single command and get the HTTP API endpoint:
<details>
<summary>SkyPilot YAML: <code>sglang.yaml</code></summary>
```yaml
# sglang.yaml
envs:
HF_TOKEN: null
resources:
image_id: docker:lmsysorg/sglang:latest
accelerators: A100
ports: 30000
run: |
conda deactivate
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
```
</details>
```bash
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml
# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
```
3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).
</details>
## Common Notes
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
- To reinstall flashinfer locally, use the following command: `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
- If you only need to use OpenAI API models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install sglang[srt]`. `srt` is the abbreviation of SGLang runtime.
SGLang Documentation
====================
SGLang is a fast serving framework for large language models and vision language models.
It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
The core features include:
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-lora batching.
- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
- **Extensive Model Support**: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
- **Active Community**: SGLang is open-source and backed by an active community with wide industry adoption.
.. toctree::
:maxdepth: 1
:caption: Get Started
get_started/install.md
.. toctree::
:maxdepth: 1
:caption: Basic Usage
basic_usage/send_request.ipynb
basic_usage/openai_api.rst
basic_usage/offline_engine_api.ipynb
basic_usage/native_api.ipynb
basic_usage/sampling_params.md
basic_usage/deepseek.md
basic_usage/gpt_oss.md
basic_usage/llama4.md
basic_usage/qwen3.md
.. toctree::
:maxdepth: 1
:caption: Advanced Features
advanced_features/server_arguments.md
advanced_features/hyperparameter_tuning.md
advanced_features/speculative_decoding.ipynb
advanced_features/structured_outputs.ipynb
advanced_features/structured_outputs_for_reasoning_models.ipynb
advanced_features/tool_parser.ipynb
advanced_features/separate_reasoning.ipynb
advanced_features/quantization.md
advanced_features/lora.ipynb
advanced_features/pd_disaggregation.md
advanced_features/vlm_query.ipynb
advanced_features/router.md
advanced_features/observability.md
advanced_features/attention_backend.md
.. toctree::
:maxdepth: 1
:caption: Supported Models
supported_models/generative_models.md
supported_models/multimodal_language_models.md
supported_models/embedding_models.md
supported_models/reward_models.md
supported_models/rerank_models.md
supported_models/support_new_models.md
supported_models/transformers_fallback.md
supported_models/modelscope.md
.. toctree::
:maxdepth: 1
:caption: Hardware Platforms
platforms/amd_gpu.md
platforms/blackwell_gpu.md
platforms/cpu_server.md
platforms/tpu.md
platforms/nvidia_jetson.md
platforms/ascend_npu.md
.. toctree::
:maxdepth: 1
:caption: Developer Guide
developer_guide/contribution_guide.md
developer_guide/development_guide_using_docker.md
developer_guide/benchmark_and_profiling.md
developer_guide/bench_serving.md
.. toctree::
:maxdepth: 1
:caption: References
references/faq.md
references/environment_variables.md
references/production_metrics.md
references/multi_node_deployment/multi_node_index.rst
references/custom_chat_template.md
references/frontend/frontend_index.rst
references/learn_more.md
# AMD GPUs
This document describes how to run SGLang on AMD GPUs. If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues).
## System Configuration
When using AMD GPUs (such as MI300X), certain system-level optimizations help ensure stable performance. Here we take MI300X as an example. AMD provides official documentation for MI300X optimization and system tuning:
- [AMD MI300X Tuning Guides](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html)
- [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/vllm-benchmark.html)
- [AMD Instinct MI300X System Optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html)
- [AMD Instinct MI300X Workload Optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html)
- [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)
**NOTE:** We strongly recommend reading these docs and guides entirely to fully utilize your system.
Below are a few key settings to confirm or enable for SGLang:
### Update GRUB Settings
In `/etc/default/grub`, append the following to `GRUB_CMDLINE_LINUX`:
```text
pci=realloc=off iommu=pt
```
Afterward, run `sudo update-grub` (or your distro’s equivalent) and reboot.
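After rebooting, you can sanity-check that the flags are active by inspecting the kernel command line (a quick check, not part of the official tuning guides):
```bash
# Both flags should appear in the output
grep -oE 'pci=realloc=off|iommu=pt' /proc/cmdline
```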
### Disable NUMA Auto-Balancing
```bash
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
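# Optional quick check: this should now print 0
cat /proc/sys/kernel/numa_balancing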
```
You can automate or verify this change using [this helpful script](https://github.com/ROCm/triton/blob/rocm_env/scripts/amd/env_check.sh).
Again, please go through the entire documentation to confirm your system is using the recommended configuration.
## Install SGLang
You can install SGLang using one of the methods below.
### Install from Source
```bash
# Use the last release branch
git clone -b v0.5.2 https://github.com/sgl-project/sglang.git
cd sglang
# Compile sgl-kernel
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install
# Install sglang python package
cd ..
pip install -e "python[all_hip]"
```
### Install Using Docker (Recommended)
The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile.rocm](https://github.com/sgl-project/sglang/tree/main/docker).
The steps below show how to build and use an image.
1. Build the docker image.
If you use pre-built images, you can skip this step and replace `sglang_image` with the pre-built image names in the steps below.
```bash
docker build -t sglang_image -f Dockerfile.rocm .
```
2. Create a convenient alias.
```bash
alias drun='docker run -it --rm --network=host --privileged --device=/dev/kfd --device=/dev/dri \
--ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $HOME/dockerx:/dockerx \
-v /data:/data'
```
If you are using RDMA, please note that:
- `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
- You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.
3. Launch the server.
**NOTE:** Replace `<secret>` below with your [huggingface hub token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
sglang_image \
python3 -m sglang.launch_server \
--model-path NousResearch/Meta-Llama-3.1-8B \
--host 0.0.0.0 \
--port 30000
```
4. To verify the setup, you can run a benchmark in another terminal or refer to [other docs](https://docs.sglang.ai/backend/openai_api_completions.html) to send requests to the engine.
```bash
drun sglang_image \
python3 -m sglang.bench_serving \
--backend sglang \
--dataset-name random \
--num-prompts 4000 \
--random-input 128 \
--random-output 128
```
With your AMD system properly configured and SGLang installed, you can now fully leverage AMD hardware to power SGLang’s machine learning capabilities.
## Examples
### Running DeepSeek-V3
The only difference when running DeepSeek-V3 is the model passed to `--model-path` when starting the server. Here's an example command:
```bash
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
--env "HF_TOKEN=<secret>" \
sglang_image \
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--tp 8 \
--trust-remote-code \
--host 0.0.0.0 \
--port 30000
```
[Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) could also be a good reference.
### Running Llama3.1
Running Llama3.1 is nearly identical to running DeepSeek-V3. The only difference is in the model specified when starting the server, shown by the following example command:
```bash
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
--env "HF_TOKEN=<secret>" \
sglang_image \
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--tp 8 \
--trust-remote-code \
--host 0.0.0.0 \
--port 30000
```
### Warmup Step
When the server displays `The server is fired up and ready to roll!`, it means the startup is successful.
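Once the server reports it is ready, you can send a quick request from another terminal to confirm it responds. A minimal check using SGLang's native `/generate` endpoint, assuming the server is reachable on port 30000:
```bash
curl http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16}}'
```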
# Ascend NPUs
You can install SGLang using any of the methods below. Please go through the `System Settings` section to ensure your cluster runs at maximum performance. Feel free to open an issue [here at sglang](https://github.com/sgl-project/sglang/issues) if you run into any problems.
## System Settings
### CPU performance power scheme
The default CPU power scheme on Ascend hardware is `ondemand`, which can hurt performance; changing it to `performance` is recommended.
```shell
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Make sure changes are applied successfully
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor # shows performance
```
### Disable NUMA balancing
```shell
sudo sysctl -w kernel.numa_balancing=0
# Check
cat /proc/sys/kernel/numa_balancing # shows 0
```
### Prevent swapping out system memory
```shell
sudo sysctl -w vm.swappiness=10
# Check
cat /proc/sys/vm/swappiness # shows 10
```
## Installing SGLang
### Method 1: Installing from source with prerequisites
#### Python Version
Only `python==3.11` is currently supported. To avoid breaking the system's pre-installed Python, consider installing with [conda](https://github.com/conda/conda).
```shell
conda create --name sglang_npu python=3.11
conda activate sglang_npu
```
#### MemFabric Adaptor
_TODO: MemFabric is still a work in progress and will not be open sourced until August/September 2025. For now, we release it as a prebuilt wheel package._
_Notice: The prebuilt wheel package targets `aarch64`; please leave an issue [here at sglang](https://github.com/sgl-project/sglang/issues) to let us know if you need an `amd64` build._
MemFabric Adaptor is a drop-in replacement for the Mooncake Transfer Engine that enables KV cache transfer on Ascend NPU clusters.
```shell
MF_WHL_NAME="mf_adapter-1.0.0-cp311-cp311-linux_aarch64.whl"
MEMFABRIC_URL="https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/${MF_WHL_NAME}"
wget -O "${MF_WHL_NAME}" "${MEMFABRIC_URL}" && pip install "./${MF_WHL_NAME}"
```
#### Pytorch and Pytorch Framework Adaptor on Ascend
Only `torch==2.6.0` is currently supported due to limitations of NPUgraph and Triton-on-Ascend; a more generalized version will be released by the end of September 2025.
```shell
PYTORCH_VERSION=2.6.0
TORCHVISION_VERSION=0.21.0
pip install torch==$PYTORCH_VERSION torchvision==$TORCHVISION_VERSION --index-url https://download.pytorch.org/whl/cpu
PTA_VERSION="v7.1.0.1-pytorch2.6.0"
PTA_NAME="torch_npu-2.6.0.post1-cp311-cp311-manylinux_2_28_aarch64.whl"
PTA_URL="https://gitee.com/ascend/pytorch/releases/download/${PTA_VERSION}/${PTA_NAME}"
wget -O "${PTA_NAME}" "${PTA_URL}" && pip install "./${PTA_NAME}"
```
#### vLLM
vLLM is still a major prerequisite on Ascend NPUs. Because of the `torch==2.6.0` limitation, only vLLM v0.8.5 is supported.
```shell
VLLM_TAG=v0.8.5
git clone --depth 1 https://github.com/vllm-project/vllm.git --branch $VLLM_TAG
(cd vllm && VLLM_TARGET_DEVICE="empty" pip install -v -e .)
```
#### Triton on Ascend
_Notice:_ We recommend installing triton-ascend from source due to its rapid development; the version on PyPI cannot keep up for now. This should be resolved in September 2025, after which `pip install` will be the only installation method.
Please follow Triton-on-Ascend's [installation guide from source](https://gitee.com/ascend/triton-ascend#2%E6%BA%90%E4%BB%A3%E7%A0%81%E5%AE%89%E8%A3%85-triton-ascend) to install the latest `triton-ascend` package.
#### DeepEP-compatible Library
We also provide a DeepEP-compatible library as a drop-in replacement for deepseek-ai's DeepEP library; see the [installation guide](https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/deep_ep/README.md).
#### Installing SGLang from source
```shell
# Use the last release branch
git clone -b v0.5.2 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e python[srt_npu]
```
### Method 2: Using docker
__Notice:__ `--privileged` and `--network=host` are required by RDMA, which is typically needed by Ascend NPU clusters.
__Notice:__ The following docker command is based on Atlas 800I A3 machines. If you are using Atlas 800I A2, make sure only `davinci[0-7]` are mapped into the container.
```shell
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker
# Build the docker image
docker build -t sglang-npu:main -f Dockerfile.npu .
alias drun='docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
--device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
--device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
--device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \
--device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \
--device=/dev/davinci_manager --device=/dev/hisi_hdc \
--volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
--volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--volume /etc/ascend_install.info:/etc/ascend_install.info \
--volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/'
drun --env "HF_TOKEN=<secret>" \
sglang-npu:main \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend --host 0.0.0.0 --port 30000
```
## Examples
### Running DeepSeek-V3
Running DeepSeek with PD disaggregation on 2 x Atlas 800I A3.
Model weights can be found [here](https://modelers.cn/models/State_Cloud/Deepseek-R1-bf16-hfd-w8a8).
Prefill:
```shell
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ASCEND_MF_STORE_URL="tcp://<PREFILL_HOST_IP>:<PORT>"
drun sglang-npu:main \
python3 -m sglang.launch_server --model-path State_Cloud/DeepSeek-R1-bf16-hfd-w8a8 \
--trust-remote-code \
--attention-backend ascend \
--mem-fraction-static 0.8 \
--quantization w8a8_int8 \
--tp-size 16 \
--dp-size 1 \
--nnodes 1 \
--node-rank 0 \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 6657 \
--disaggregation-transfer-backend ascend \
--dist-init-addr <PREFILL_HOST_IP>:6688 \
--host <PREFILL_HOST_IP> \
--port 8000
```
Decode:
```shell
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ASCEND_MF_STORE_URL="tcp://<PREFILL_HOST_IP>:<PORT>"
export HCCL_BUFFSIZE=200
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=24
drun sglang-npu:main \
python3 -m sglang.launch_server --model-path State_Cloud/DeepSeek-R1-bf16-hfd-w8a8 \
--trust-remote-code \
--attention-backend ascend \
--mem-fraction-static 0.8 \
--quantization w8a8_int8 \
--enable-deepep-moe \
--deepep-mode low_latency \
--tp-size 16 \
--dp-size 1 \
--ep-size 16 \
--nnodes 1 \
--node-rank 0 \
--disaggregation-mode decode \
--disaggregation-transfer-backend ascend \
--dist-init-addr <DECODE_HOST_IP>:6688 \
--host <DECODE_HOST_IP> \
--port 8001
```
Mini_LB:
```shell
drun sglang-npu:main \
python -m sglang.srt.disaggregation.launch_lb \
--prefill http://<PREFILL_HOST_IP>:8000 \
--decode http://<DECODE_HOST_IP>:8001 \
--host 127.0.0.1 --port 5000
```
# Blackwell GPUs
We will release pre-built wheels soon. Until then, please compile from source or use the Blackwell docker images from [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).
## B200 with x86 CPUs
TODO
## GB200/GB300 with ARM CPUs
TODO
# CPU Servers
This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
Specifically, SGLang is well optimized for CPUs equipped with Intel® AMX® instructions,
i.e., 4th generation or newer Intel® Xeon® Scalable Processors.
## Optimized Model List
Many popular LLMs are optimized to run efficiently on CPU,
including notable open-source models such as the Llama and Qwen series
and the high-quality reasoning model DeepSeek-R1.
| Model Name | BF16 | w8a8_int8 | FP8 |
|:---:|:---:|:---:|:---:|
| DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | |
| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) | |
| QwQ-32B | | [RedHatAI/QwQ-32B-quantized.w8a8](https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8) | |
| DeepSeek-Distilled-Llama | | [RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8](https://huggingface.co/RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8) | |
| Qwen3-235B | | | [Qwen/Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8) |
**Note:** The model identifiers listed in the table above
have been verified on 6th Gen Intel® Xeon® P-core platforms.
## Installation
### Install Using Docker
It is recommended to use Docker for setting up the SGLang environment.
A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.xeon) is provided to facilitate the installation.
Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker
# Build the docker image
docker build -t sglang-cpu:main -f Dockerfile.xeon .
# Initiate a docker container
docker run \
-it \
--privileged \
--ipc=host \
--network=host \
-v /dev/shm:/dev/shm \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 30000:30000 \
-e "HF_TOKEN=<secret>" \
sglang-cpu:main /bin/bash
```
### Install From Source
If you prefer to install SGLang in a bare-metal environment,
use the command list below.
Note that the environment variable `SGLANG_USE_CPU_ENGINE=1`
is required to enable the SGLang service with the CPU engine.
```bash
# Create and activate a conda environment
conda create -n sgl-cpu python=3.12 -y
conda activate sgl-cpu
# Optional: Set PyTorch CPU as primary pip install channel to avoid installing CUDA version
pip config set global.index-url https://download.pytorch.org/whl/cpu
pip config set global.extra-index-url https://pypi.org/simple
# Check if some conda related environment variables have been set
env | grep -i conda
# The following environment variable settings are required
# if they have not been set properly
export CONDA_EXE=$(which conda)
export CONDA_ROOT=${CONDA_EXE}/../..
export CONDA_PREFIX=${CONDA_ROOT}/envs/sgl-cpu
export PATH=${PATH}:${CONDA_ROOT}/bin:${CONDA_ROOT}/condabin
# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>
# Install SGLang dependent libs, and build SGLang main package
pip install --upgrade pip setuptools
conda install -y libsqlite==3.48.0 gperftools tbb libnuma numactl
pip install -e "python[all_cpu]"
pip install torch==2.7.1 torchvision==0.22.1 triton==3.3.1 --force-reinstall
# Build the CPU backend kernels
cd sgl-kernel
cp pyproject_cpu.toml pyproject.toml
pip install .
# Other required environment variables
# Recommend to set these in ~/.bashrc in order not to set every time in a new terminal
export SGLANG_USE_CPU_ENGINE=1
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so:${CONDA_PREFIX}/lib/libtcmalloc.so:${CONDA_PREFIX}/lib/libtbbmalloc.so.2
```
## Launch of the Serving Engine
Example command to launch SGLang serving:
```bash
python -m sglang.launch_server \
--model <MODEL_ID_OR_PATH> \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--host 0.0.0.0 \
--tp 6
```
Notes:
1. For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.
2. The flag `--tp 6` specifies that tensor parallelism will be applied with 6 ranks (TP6).
On a CPU platform, each TP rank corresponds to a sub-NUMA cluster (SNC).
You can query how many SNCs are available from the operating system (see the example after these notes).
The specified TP size must not exceed the total number of available SNCs;
if it is smaller, the system automatically uses the first `n` SNCs,
and exceeding the total results in an error.
To specify the cores to be used, we need to explicitly set the environment variable `SGLANG_CPU_OMP_THREADS_BIND`.
For example, if we want to run the SGLang service using the first 40 cores of each SNC on a Xeon® 6980P server,
which has 43-43-42 cores on the 3 SNCs of a socket, we should set:
```bash
export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
```
Please be aware that with `SGLANG_CPU_OMP_THREADS_BIND` set,
the memory available to each rank may not be determined in advance.
You may need to set an appropriate `--max-total-tokens` to avoid out-of-memory errors.
3. For optimizing decoding with torch.compile, please add the flag `--enable-torch-compile`.
To specify the maximum batch size when using torch compile, set the flag `--torch-compile-max-bs`.
For example, `--enable-torch-compile --torch-compile-max-bs 4` means using torch compile and setting the
maximum batch size to 4.
4. A warmup step is automatically triggered when the service is started.
The server is ready when you see the log `The server is fired up and ready to roll!`.
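A quick way to check how many NUMA nodes (SNCs) the system exposes, using standard Linux tools (the exact counts depend on the BIOS SNC setting):
```bash
# Each sub-NUMA cluster appears as a NUMA node; the TP size should not exceed this count
lscpu | grep -i "numa node(s)"
numactl --hardware | head -n 1
```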
## Benchmarking with Requests
You can benchmark the performance via the `bench_serving` script.
Run the command in another terminal.
```bash
python -m sglang.bench_serving \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1 \
--request-rate inf \
--random-range-ratio 1.0
```
Detailed explanations of the parameters can be looked up with the command:
```bash
python -m sglang.bench_serving -h
```
Additionally, the requests can be formed with
[OpenAI Completions API](https://docs.sglang.ai/basic_usage/openai_api_completions.html)
and sent via the command line (e.g. using `curl`) or via your own script.
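For instance, a minimal `curl` request to the OpenAI-compatible completions route, assuming the server from the launch example above listens on the default port 30000 and `<MODEL_ID_OR_PATH>` matches the model you launched:
```bash
curl http://localhost:30000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "<MODEL_ID_OR_PATH>",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0
    }'
```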
## Example: Running DeepSeek-R1
An example command to launch the service for W8A8 DeepSeek-R1 on a Xeon® 6980P server:
```bash
python -m sglang.launch_server \
--model meituan/DeepSeek-R1-Channel-INT8 \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--quantization w8a8_int8 \
--host 0.0.0.0 \
--mem-fraction-static 0.8 \
--tp 6
```
Similarly, an example command to launch the service for FP8 DeepSeek-R1 would be:
```bash
python -m sglang.launch_server \
--model deepseek-ai/DeepSeek-R1 \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--host 0.0.0.0 \
--mem-fraction-static 0.8 \
--tp 6
```
Then you can test with the `bench_serving` command or construct your own requests
following [the benchmarking example](#benchmarking-with-requests).
# NVIDIA Jetson Orin
## Prerequisites
Before starting, ensure the following:
- [**NVIDIA Jetson AGX Orin Devkit**](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/) is set up with **JetPack 6.1** or later.
- **CUDA Toolkit** and **cuDNN** are installed.
- Verify that the Jetson AGX Orin is in **high-performance mode**:
```bash
sudo nvpmodel -m 0
```
* * * * *
## Installing and running SGLang with Jetson Containers
Clone the jetson-containers github repository:
```
git clone https://github.com/dusty-nv/jetson-containers.git
```
Run the installation script:
```
bash jetson-containers/install.sh
```
Build the container image:
```
jetson-containers build sglang
```
Run the container:
```
jetson-containers run $(autotag sglang)
```
Or you can also manually run a container with this command:
```
docker run --runtime nvidia -it --rm --network=host IMAGE_NAME
```
* * * * *
Running Inference
-----------------------------------------
Launch the server:
```bash
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--device cuda \
--dtype half \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192
```
The reduced precision and limited context length (`--dtype half --context-length 8192`) are due to the limited computational resources of the [NVIDIA Jetson kit](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/). A detailed explanation can be found in [Server Arguments](../advanced_features/server_arguments.md).
After launching the engine, refer to [Chat completions](https://docs.sglang.ai/backend/openai_api_completions.html#Usage) to test the usability.
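For example, a minimal chat-completions request against the server launched above (assuming the default port 30000 and the OpenAI-compatible route):
```bash
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "messages": [{"role": "user", "content": "Briefly introduce yourself."}],
        "max_tokens": 64
    }'
```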
* * * * *
Running quantization with TorchAO
-------------------------------------
TorchAO is recommended for NVIDIA Jetson Orin.
```bash
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--device cuda \
--dtype bfloat16 \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192 \
--torchao-config int4wo-128
```
This enables TorchAO's int4 weight-only quantization with a group size of 128, which improves memory efficiency.
* * * * *
Structured output with XGrammar
-------------------------------
Please refer to [SGLang doc structured output](../advanced_features/structured_outputs.ipynb).
* * * * *
Thanks to the support from [Nurgaliyev Shakhizat](https://github.com/shahizat), [Dustin Franklin](https://github.com/dusty-nv) and [Johnny Núñez Cano](https://github.com/johnnynunez).
References
----------
- [NVIDIA Jetson AGX Orin Documentation](https://developer.nvidia.com/embedded/jetson-agx-orin)
# TPU
The support for TPU is under active development. Please stay tuned.
# Custom Chat Template
**NOTE**: There are two chat template systems in SGLang project. This document is about setting a custom chat template for the OpenAI-compatible API server (defined at [conversation.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/conversation.py)). It is NOT related to the chat template used in the SGLang language frontend (defined at [chat_template.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/chat_template.py)).
By default, the server uses the chat template specified in the model tokenizer from Hugging Face.
It should just work for most official models such as Llama-2/Llama-3.
If needed, you can also override the chat template when launching the server:
```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
```
If the chat template you are looking for is missing, you are welcome to contribute it or load it from a file.
## JSON Format
You can load the JSON format, which is defined by `conversation.py`.
```json
{
"name": "my_model",
"system": "<|im_start|>system",
"user": "<|im_start|>user",
"assistant": "<|im_start|>assistant",
"sep_style": "CHATML",
"sep": "<|im_end|>",
"stop_str": ["<|im_end|>", "<|im_start|>"]
}
```
```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json
```
## Jinja Format
You can also use the [Jinja template format](https://huggingface.co/docs/transformers/main/en/chat_templating) as defined by Hugging Face Transformers.
```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.jinja
```
# Environment Variables
SGLang supports various environment variables that can be used to configure its runtime behavior. This document provides a comprehensive list and aims to stay updated over time.
*Note: SGLang uses two prefixes for environment variables: `SGL_` and `SGLANG_`. This is likely due to historical reasons. While both are currently supported for different settings, future versions might consolidate them.*
## General Configuration
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_USE_MODELSCOPE` | Enable using models from ModelScope | `false` |
| `SGLANG_HOST_IP` | Host IP address for the server | `0.0.0.0` |
| `SGLANG_PORT` | Port for the server | auto-detected |
| `SGLANG_LOGGING_CONFIG_PATH` | Custom logging configuration path | Not set |
| `SGLANG_DISABLE_REQUEST_LOGGING` | Disable request logging | `false` |
| `SGLANG_HEALTH_CHECK_TIMEOUT` | Timeout for health check in seconds | `20` |
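For example, these are ordinary environment variables set before launching the server (the model path here is only illustrative):
```bash
# Illustrative only: enable ModelScope downloads and raise the health-check timeout
export SGLANG_USE_MODELSCOPE=true
export SGLANG_HEALTH_CHECK_TIMEOUT=60
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --port 30000
```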
## Performance Tuning
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_ENABLE_TORCH_INFERENCE_MODE` | Control whether to use torch.inference_mode | `false` |
| `SGLANG_ENABLE_TORCH_COMPILE` | Enable torch.compile | `true` |
| `SGLANG_SET_CPU_AFFINITY` | Enable CPU affinity setting (often set to `1` in Docker builds) | `0` |
| `SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN` | Allows the scheduler to overwrite longer context length requests (often set to `1` in Docker builds) | `0` |
| `SGLANG_IS_FLASHINFER_AVAILABLE` | Control FlashInfer availability check | `true` |
| `SGLANG_SKIP_P2P_CHECK` | Skip P2P (peer-to-peer) access check | `false` |
| `SGL_CHUNKED_PREFIX_CACHE_THRESHOLD` | Sets the threshold for enabling chunked prefix caching | `8192` |
| `SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION` | Enable RoPE fusion in Fused Multi-Layer Attention | `1` |
## DeepGEMM Configuration (Advanced Optimization)
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGL_ENABLE_JIT_DEEPGEMM` | Enable Just-In-Time compilation of DeepGEMM kernels | `"true"` |
| `SGL_JIT_DEEPGEMM_PRECOMPILE` | Enable precompilation of DeepGEMM kernels | `"true"` |
| `SGL_JIT_DEEPGEMM_COMPILE_WORKERS` | Number of workers for parallel DeepGEMM kernel compilation | `4` |
| `SGL_IN_DEEPGEMM_PRECOMPILE_STAGE` | Indicator flag used during the DeepGEMM precompile script | `"false"` |
| `SGL_DG_CACHE_DIR` | Directory for caching compiled DeepGEMM kernels | `~/.cache/deep_gemm` |
| `SGL_DG_USE_NVRTC` | Use NVRTC (instead of Triton) for JIT compilation (Experimental) | `"0"` |
| `SGL_USE_DEEPGEMM_BMM` | Use DeepGEMM for Batched Matrix Multiplication (BMM) operations | `"false"` |
## Memory Management
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_DEBUG_MEMORY_POOL` | Enable memory pool debugging | `false` |
| `SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION` | Clip max new tokens estimation for memory planning | `4096` |
| `SGLANG_DETOKENIZER_MAX_STATES` | Maximum states for detokenizer | Default value based on system |
| `SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK` | Disable checks for memory imbalance across Tensor Parallel ranks | Not set (defaults to enabled check) |
## Model-Specific Options
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_USE_AITER` | Use the AITER optimized implementation | `false` |
| `SGLANG_INT4_WEIGHT` | Enable INT4 weight quantization | `false` |
| `SGLANG_MOE_PADDING` | Enable MoE padding (sets padding size to 128 if value is `1`, often set to `1` in Docker builds) | `0` |
| `SGLANG_FORCE_FP8_MARLIN` | Force using FP8 MARLIN kernels even if other FP8 kernels are available | `false` |
| `SGLANG_ENABLE_FLASHINFER_GEMM` | Use flashinfer kernels when running blockwise fp8 GEMM on Blackwell GPUs | `false` |
| `SGLANG_SUPPORT_CUTLASS_BLOCK_FP8` | Use Cutlass kernels when running blockwise fp8 GEMM on Hopper or Blackwell GPUs | `false` |
| `SGLANG_CUTLASS_MOE` | Use Cutlass FP8 MoE kernel on Blackwell GPUs | `false` |
## Distributed Computing
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_BLOCK_NONZERO_RANK_CHILDREN` | Control blocking of non-zero rank children processes | `1` |
| `SGL_IS_FIRST_RANK_ON_NODE` | Indicates if the current process is the first rank on its node | `"true"` |
| `SGLANG_PP_LAYER_PARTITION` | Pipeline parallel layer partition specification | Not set |
## Testing & Debugging (Internal/CI)
*These variables are primarily used for internal testing, continuous integration, or debugging.*
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_IS_IN_CI` | Indicates if running in CI environment | `false` |
| `SGLANG_AMD_CI` | Indicates running in AMD CI environment | `0` |
| `SGLANG_TEST_RETRACT` | Enable retract decode testing | `false` |
| `SGLANG_RECORD_STEP_TIME` | Record step time for profiling | `false` |
| `SGLANG_TEST_REQUEST_TIME_STATS` | Test request time statistics | `false` |
| `SGLANG_CI_SMALL_KV_SIZE` | Use small KV cache size in CI | Not set |
## Profiling & Benchmarking
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_TORCH_PROFILER_DIR` | Directory for PyTorch profiler output | `/tmp` |
| `SGLANG_PROFILE_WITH_STACK` | Set `with_stack` option (bool) for PyTorch profiler (capture stack trace) | `true` |
## Storage & Caching
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_DISABLE_OUTLINES_DISK_CACHE` | Disable Outlines disk cache | `true` |
# Troubleshooting and Frequently Asked Questions
## Troubleshooting
This page lists common errors and tips for resolving them.
### CUDA Out of Memory
If you encounter out-of-memory (OOM) errors, you can adjust the following parameters (a combined example follows this list):
- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
- Another common case for OOM is requesting input logprobs for a long prompt as it requires significant memory. To address this, set `logprob_start_len` in your sampling parameters to include only the necessary parts. If you do need input logprobs for a long prompt, try reducing `--mem-fraction-static`.
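A minimal sketch combining these mitigations for a single-GPU deployment (the model path and values are illustrative; tune them for your workload and hardware):
```bash
# Illustrative values only
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --mem-fraction-static 0.7 \
    --chunked-prefill-size 4096 \
    --max-running-requests 32
```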
### CUDA Error: Illegal Memory Access Encountered
This error may result from kernel errors or out-of-memory issues:
- If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub.
- If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues.
## Frequently Asked Questions
### The results are not deterministic, even with a temperature of 0
You may notice that when you send the same request twice, the results from the engine will be slightly different, even when the temperature is set to 0.
From our initial investigation, this indeterminism arises from two factors: dynamic batching and prefix caching. Roughly speaking, dynamic batching accounts for about 95% of the indeterminism, while prefix caching accounts for the remaining portion. The server runs dynamic batching under the hood. Different batch sizes can cause PyTorch/CuBLAS to dispatch to different CUDA kernels, which can lead to slight numerical differences. This difference accumulates across many layers, resulting in nondeterministic output when the batch size changes. Similarly, when prefix caching is enabled, it can also dispatch to different kernels. Even when the computations are mathematically equivalent, small numerical differences from different kernel implementations lead to the final nondeterministic outputs.
To achieve more deterministic outputs in the current code, you can add `--disable-radix-cache` and send only one request at a time. The results will be mostly deterministic under this setting.
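For instance, a minimal sketch of such a launch (the model path is illustrative):
```bash
# Disable prefix (radix) caching for more deterministic outputs
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --disable-radix-cache
```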
We are still investigating the root causes and potential solutions. In the short term, we may introduce a "deterministic mode" that uses more padding to address the variance caused by dynamic batching. This mode will be more deterministic but slower.
We have two issues to track our progress:
- The deterministic mode is tracked at [https://github.com/sgl-project/sglang/issues/1729](https://github.com/sgl-project/sglang/issues/1729).
- The per-request random seed is tracked at [https://github.com/sgl-project/sglang/issues/1335](https://github.com/sgl-project/sglang/issues/1335).
# Choices Methods in SGLang
This doc describes the choices methods supported by SGLang.
The optional `choices_method` arg determines how options supplied to SGLang's `choices` primitive are selected. Only the `RuntimeEndpoint` backend supports the `choices_method` arg. Other backends, such as `OpenAI`, have bespoke selection implementations due to API limitations.
## Methods
### Token Length Normalized
Token length normalized is the default SGLang choices method. It selects the option with the highest average logprob across all of its tokens.
Usage example (alternatively, simply omit the `choices_method` arg):
```python
@sgl.function
def example(s):
s += sgl.user("What is the capital of France?")
s += sgl.assistant(
sgl.gen(
"answer",
choices=["London", "Paris", "Berlin"],
choices_method=sgl.token_length_normalized,
)
)
```
This can perform poorly if an option contains many tokens, where its later tokens are predicted with high confidence based on its earlier tokens. For instance, even strong models will fail the above example if the specified options are `["Paris", "Antidisestablishmentarianism"]`.
### Greedy Token Selection
Greedy token selection simply selects the option with the highest logprob for its initial token. For overlapping options where one option is a subset of a longer option, the logprobs of the shorter option are extended using its average logprob for comparison against the longer option.
Usage example:
```python
@sgl.function
def example(s):
s += sgl.user("What is the capital of France?")
s += sgl.assistant(
sgl.gen(
"answer",
choices=["London", "Paris", "Berlin"],
choices_method=sgl.greedy_token_selection,
)
)
```
This can perform poorly if an option misleads the model down a bad path based on an attractive initial token. For instance, greedy selection will result in an incorrect response for this example:
```python
@sgl.function
def us_president_example(s):
s += sgl.user("Name a US president.")
s += sgl.assistant(
sgl.gen(
"answer",
choices=["Donald Duck", "Millard Fillmore"],
choices_method=sgl.greedy_token_selection,
)
)
```
### Unconditional Likelihood Normalized
Unconditional likelihood normalized selects the option with the highest average token logprob once normalized by the unconditional token logprobs, as described in [this EleutherAI blogpost](https://blog.eleuther.ai/multiple-choice-normalization/). This method incurs an additional LLM call to obtain the unconditional likelihoods.
Usage example:
```python
@sgl.function
def example(s):
s += sgl.user("What is the capital of France?")
s += sgl.assistant(
sgl.gen(
"answer",
choices=["London", "Paris", "Berlin"],
choices_method=sgl.unconditional_likelihood_normalized,
)
)
```
Frontend Language
=================
.. toctree::
:maxdepth: 1
:caption: Frontend Language
frontend_tutorial.ipynb
choices_methods.md
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SGLang Frontend Language"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SGLang frontend language can be used to define simple and easy prompts in a convenient, structured way."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Launch A Server\n",
"\n",
"Launch the server in your terminal and wait for it to initialize."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sglang import assistant_begin, assistant_end\n",
"from sglang import assistant, function, gen, system, user\n",
"from sglang import image\n",
"from sglang import RuntimeEndpoint\n",
"from sglang.lang.api import set_default_backend\n",
"from sglang.srt.utils import load_image\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import print_highlight, terminate_process, wait_for_server\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
"print(f\"Server started on http://localhost:{port}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set the default backend. Note: Besides the local server, you may use also `OpenAI` or other API endpoints."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"set_default_backend(RuntimeEndpoint(f\"http://localhost:{port}\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Usage\n",
"\n",
"The most simple way of using SGLang frontend language is a simple question answer dialog between a user and an assistant."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def basic_qa(s, question):\n",
" s += system(f\"You are a helpful assistant than can answer questions.\")\n",
" s += user(question)\n",
" s += assistant(gen(\"answer\", max_tokens=512))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"state = basic_qa(\"List 3 countries and their capitals.\")\n",
"print_highlight(state[\"answer\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multi-turn Dialog\n",
"\n",
"SGLang frontend language can also be used to define multi-turn dialogs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def multi_turn_qa(s):\n",
" s += system(f\"You are a helpful assistant than can answer questions.\")\n",
" s += user(\"Please give me a list of 3 countries and their capitals.\")\n",
" s += assistant(gen(\"first_answer\", max_tokens=512))\n",
" s += user(\"Please give me another list of 3 countries and their capitals.\")\n",
" s += assistant(gen(\"second_answer\", max_tokens=512))\n",
" return s\n",
"\n",
"\n",
"state = multi_turn_qa()\n",
"print_highlight(state[\"first_answer\"])\n",
"print_highlight(state[\"second_answer\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Control flow\n",
"\n",
"You may use any Python code within the function to define more complex control flows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def tool_use(s, question):\n",
" s += assistant(\n",
" \"To answer this question: \"\n",
" + question\n",
" + \". I need to use a \"\n",
" + gen(\"tool\", choices=[\"calculator\", \"search engine\"])\n",
" + \". \"\n",
" )\n",
"\n",
" if s[\"tool\"] == \"calculator\":\n",
" s += assistant(\"The math expression is: \" + gen(\"expression\"))\n",
" elif s[\"tool\"] == \"search engine\":\n",
" s += assistant(\"The key word to search is: \" + gen(\"word\"))\n",
"\n",
"\n",
"state = tool_use(\"What is 2 * 2?\")\n",
"print_highlight(state[\"tool\"])\n",
"print_highlight(state[\"expression\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parallelism\n",
"\n",
"Use `fork` to launch parallel prompts. Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def tip_suggestion(s):\n",
" s += assistant(\n",
" \"Here are two tips for staying healthy: \"\n",
" \"1. Balanced Diet. 2. Regular Exercise.\\n\\n\"\n",
" )\n",
"\n",
" forks = s.fork(2)\n",
" for i, f in enumerate(forks):\n",
" f += assistant(\n",
" f\"Now, expand tip {i+1} into a paragraph:\\n\"\n",
" + gen(\"detailed_tip\", max_tokens=256, stop=\"\\n\\n\")\n",
" )\n",
"\n",
" s += assistant(\"Tip 1:\" + forks[0][\"detailed_tip\"] + \"\\n\")\n",
" s += assistant(\"Tip 2:\" + forks[1][\"detailed_tip\"] + \"\\n\")\n",
" s += assistant(\n",
" \"To summarize the above two tips, I can say:\\n\" + gen(\"summary\", max_tokens=512)\n",
" )\n",
"\n",
"\n",
"state = tip_suggestion()\n",
"print_highlight(state[\"summary\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Constrained Decoding\n",
"\n",
"Use `regex` to specify a regular expression as a decoding constraint. This is only supported for local models."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def regular_expression_gen(s):\n",
" s += user(\"What is the IP address of the Google DNS servers?\")\n",
" s += assistant(\n",
" gen(\n",
" \"answer\",\n",
" temperature=0,\n",
" regex=r\"((25[0-5]|2[0-4]\\d|[01]?\\d\\d?).){3}(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\",\n",
" )\n",
" )\n",
"\n",
"\n",
"state = regular_expression_gen()\n",
"print_highlight(state[\"answer\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `regex` to define a `JSON` decoding schema."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"character_regex = (\n",
" r\"\"\"\\{\\n\"\"\"\n",
" + r\"\"\" \"name\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n",
" + r\"\"\" \"house\": \"(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)\",\\n\"\"\"\n",
" + r\"\"\" \"blood status\": \"(Pure-blood|Half-blood|Muggle-born)\",\\n\"\"\"\n",
" + r\"\"\" \"occupation\": \"(student|teacher|auror|ministry of magic|death eater|order of the phoenix)\",\\n\"\"\"\n",
" + r\"\"\" \"wand\": \\{\\n\"\"\"\n",
" + r\"\"\" \"wood\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n",
" + r\"\"\" \"core\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n",
" + r\"\"\" \"length\": [0-9]{1,2}\\.[0-9]{0,2}\\n\"\"\"\n",
" + r\"\"\" \\},\\n\"\"\"\n",
" + r\"\"\" \"alive\": \"(Alive|Deceased)\",\\n\"\"\"\n",
" + r\"\"\" \"patronus\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n",
" + r\"\"\" \"bogart\": \"[\\w\\d\\s]{1,16}\"\\n\"\"\"\n",
" + r\"\"\"\\}\"\"\"\n",
")\n",
"\n",
"\n",
"@function\n",
"def character_gen(s, name):\n",
" s += user(\n",
" f\"{name} is a character in Harry Potter. Please fill in the following information about this character.\"\n",
" )\n",
" s += assistant(gen(\"json_output\", max_tokens=256, regex=character_regex))\n",
"\n",
"\n",
"state = character_gen(\"Harry Potter\")\n",
"print_highlight(state[\"json_output\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Batching \n",
"\n",
"Use `run_batch` to run a batch of prompts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def text_qa(s, question):\n",
" s += user(question)\n",
" s += assistant(gen(\"answer\", stop=\"\\n\"))\n",
"\n",
"\n",
"states = text_qa.run_batch(\n",
" [\n",
" {\"question\": \"What is the capital of the United Kingdom?\"},\n",
" {\"question\": \"What is the capital of France?\"},\n",
" {\"question\": \"What is the capital of Japan?\"},\n",
" ],\n",
" progress_bar=True,\n",
")\n",
"\n",
"for i, state in enumerate(states):\n",
" print_highlight(f\"Answer {i+1}: {states[i]['answer']}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Streaming \n",
"\n",
"Use `stream` to stream the output to the user."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def text_qa(s, question):\n",
" s += user(question)\n",
" s += assistant(gen(\"answer\", stop=\"\\n\"))\n",
"\n",
"\n",
"state = text_qa.run(\n",
" question=\"What is the capital of France?\", temperature=0.1, stream=True\n",
")\n",
"\n",
"for out in state.text_iter():\n",
" print(out, end=\"\", flush=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Complex Prompts\n",
"\n",
"You may use `{system|user|assistant}_{begin|end}` to define complex prompts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def chat_example(s):\n",
" s += system(\"You are a helpful assistant.\")\n",
" # Same as: s += s.system(\"You are a helpful assistant.\")\n",
"\n",
" with s.user():\n",
" s += \"Question: What is the capital of France?\"\n",
"\n",
" s += assistant_begin()\n",
" s += \"Answer: \" + gen(\"answer\", max_tokens=100, stop=\"\\n\")\n",
" s += assistant_end()\n",
"\n",
"\n",
"state = chat_example()\n",
"print_highlight(state[\"answer\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multi-modal Generation\n",
"\n",
"You may use SGLang frontend language to define multi-modal prompts.\n",
"See [here](https://docs.sglang.ai/supported_models/generative_models.html) for supported models."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
"print(f\"Server started on http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"set_default_backend(RuntimeEndpoint(f\"http://localhost:{port}\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ask a question about an image."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def image_qa(s, image_file, question):\n",
" s += user(image(image_file) + question)\n",
" s += assistant(gen(\"answer\", max_tokens=256))\n",
"\n",
"\n",
"image_url = \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"\n",
"image_bytes, _ = load_image(image_url)\n",
"state = image_qa(image_bytes, \"What is in the image?\")\n",
"print_highlight(state[\"answer\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
# Learn more
You can find more blogs, slides, and videos about SGLang at [https://github.com/sgl-project/sgl-learning-materials](https://github.com/sgl-project/sgl-learning-materials).
The latest SGLang features and updates are shared through the [LMSYS blog](https://lmsys.org/blog/).
The 2025 H2 roadmap can be found at this [issue](https://github.com/sgl-project/sglang/issues/7736).
# Deploy On Kubernetes
This document is for deploying a RoCE network-based SGLang two-node inference service on a Kubernetes (K8S) cluster.
[LeaderWorkerSet (LWS)](https://github.com/kubernetes-sigs/lws) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads. A major use case is for multi-host/multi-node distributed inference.
SGLang can also be deployed with LWS on Kubernetes for distributed model serving.
Please see this guide for more details on deploying SGLang on Kubernetes using LWS.
Here we take the deployment of DeepSeek-R1 as an example.
## Prerequisites
1. At least two Kubernetes nodes are required, each with eight H20 GPUs (the example below runs tensor parallelism of size 16 across the two nodes).
2. Make sure your K8S cluster has LWS correctly installed. If it hasn't been set up yet, please follow the [installation instructions](https://github.com/kubernetes-sigs/lws/blob/main/site/content/en/docs/installation/_index.md). **Note:** For LWS versions ≤0.5.x, you must use the Downward API to obtain `LWS_WORKER_INDEX` (see the sketch below), as native support for this feature was introduced in v0.6.0.
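For those older LWS versions, the following is a minimal sketch of the Downward API approach; it assumes the worker pods carry the `leaderworkerset.sigs.k8s.io/worker-index` label and would go into the `env` section of the sglang containers:
```yaml
env:
  - name: LWS_WORKER_INDEX
    valueFrom:
      fieldRef:
        # Read the worker index from the pod label set by LWS.
        fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
```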
## Basic example
For the basic example documentation, refer to [Deploy Distributed Inference Service with SGLang and LWS on GPUs](https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/sglang).
However, that document only covers the basic NCCL socket mode.
In this section, we’ll make some simple modifications to adapt the setup to the RDMA scenario.
## RDMA RoCE case
* Check your env:
```bash
[root@node1 ~]# ibstatus
Infiniband device 'mlx5_bond_0' port 1 status:
default gid: fe80:0000:0000:0000:0225:9dff:fe64:c79a
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (2X NDR)
link_layer: Ethernet
Infiniband device 'mlx5_bond_1' port 1 status:
default gid: fe80:0000:0000:0000:0225:9dff:fe6e:c3ec
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (2X NDR)
link_layer: Ethernet
Infiniband device 'mlx5_bond_2' port 1 status:
default gid: fe80:0000:0000:0000:0225:9dff:fe73:0dd7
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (2X NDR)
link_layer: Ethernet
Infiniband device 'mlx5_bond_3' port 1 status:
default gid: fe80:0000:0000:0000:0225:9dff:fe36:f7ff
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (2X NDR)
link_layer: Ethernet
```
* Prepare the `lws.yaml` file for deploying on k8s.
```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
name: sglang
spec:
replicas: 1
leaderWorkerTemplate:
size: 2
restartPolicy: RecreateGroupOnPodRestart
leaderTemplate:
metadata:
labels:
role: leader
spec:
dnsPolicy: ClusterFirstWithHostNet
hostNetwork: true
hostIPC: true
containers:
- name: sglang-leader
image: sglang:latest
securityContext:
privileged: true
env:
- name: NCCL_IB_GID_INDEX
value: "3"
command:
- python3
- -m
- sglang.launch_server
- --model-path
- /work/models
- --mem-fraction-static
- "0.93"
- --torch-compile-max-bs
- "8"
- --max-running-requests
- "20"
- --tp
- "16" # Size of Tensor Parallelism
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20000
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
- --host
- "0.0.0.0"
- --port
- "40000"
resources:
limits:
nvidia.com/gpu: "8"
ports:
- containerPort: 40000
readinessProbe:
tcpSocket:
port: 40000
initialDelaySeconds: 15
periodSeconds: 10
volumeMounts:
- mountPath: /dev/shm
name: dshm
- name: model
mountPath: /work/models
- name: ib
mountPath: /dev/infiniband
volumes:
- name: dshm
emptyDir:
medium: Memory
- name: model
hostPath:
path: '< your models dir >' # modify this to match your model directory
- name: ib
hostPath:
path: /dev/infiniband
workerTemplate:
spec:
dnsPolicy: ClusterFirstWithHostNet
hostNetwork: true
hostIPC: true
containers:
- name: sglang-worker
image: sglang:latest
securityContext:
privileged: true
env:
- name: NCCL_IB_GID_INDEX
value: "3"
command:
- python3
- -m
- sglang.launch_server
- --model-path
- /work/models
- --mem-fraction-static
- "0.93"
- --torch-compile-max-bs
- "8"
- --max-running-requests
- "20"
- --tp
- "16" # Size of Tensor Parallelism
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20000
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
resources:
limits:
nvidia.com/gpu: "8"
volumeMounts:
- mountPath: /dev/shm
name: dshm
- name: model
mountPath: /work/models
- name: ib
mountPath: /dev/infiniband
volumes:
- name: dshm
emptyDir:
medium: Memory
- name: ib
hostPath:
path: /dev/infiniband
- name: model
hostPath:
path: /data1/models/deepseek_v3_moe
---
apiVersion: v1
kind: Service
metadata:
name: sglang-leader
spec:
selector:
leaderworkerset.sigs.k8s.io/name: sglang
role: leader
ports:
- protocol: TCP
port: 40000
targetPort: 40000
```
* Then apply the manifest with `kubectl apply -f lws.yaml`. Listing the pods with `kubectl get pods` should show output like this:
```text
NAME READY STATUS RESTARTS AGE
sglang-0 0/1 Running 0 9s
sglang-0-1 1/1 Running 0 9s
```
Wait for the sglang leader (`sglang-0`) status to change to 1/1, which indicates it is `Ready`.
You can use the command `kubectl logs -f sglang-0` to view the logs of the leader node.
Once successful, you should see output like this:
```text
[2025-02-17 05:27:24 TP1] Capture cuda graph end. Time elapsed: 84.89 s
[2025-02-17 05:27:24 TP6] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP0] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP7] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP3] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP2] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP4] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP1] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP5] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24] INFO: Started server process [1]
[2025-02-17 05:27:24] INFO: Waiting for application startup.
[2025-02-17 05:27:24] INFO: Application startup complete.
[2025-02-17 05:27:24] INFO: Uvicorn running on http://0.0.0.0:40000 (Press CTRL+C to quit)
[2025-02-17 05:27:25] INFO: 127.0.0.1:48908 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-17 05:27:25 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-17 05:27:32] INFO: 127.0.0.1:48924 - "POST /generate HTTP/1.1" 200 OK
[2025-02-17 05:27:32] The server is fired up and ready to roll!
```
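As a quick smoke test, you can send a request to the leader through the `sglang-leader` Service defined above. This is a sketch that assumes the Service name resolves from where you run it (for example, another pod in the same namespace); the prompt and sampling parameters are arbitrary.
```bash
# Query SGLang's native /generate endpoint exposed by the leader on port 40000.
curl -s http://sglang-leader:40000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
```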
If the server does not start up successfully, follow the steps below to troubleshoot.
### Debug
* Set `NCCL_DEBUG=TRACE` to check whether the problem is in NCCL communication.
The trace output is usually enough to diagnose most NCCL-related issues.
***Notice: If `NCCL_DEBUG=TRACE` produces no output in the container environment while the process is stuck or the issue is hard to diagnose, try switching to a different container image; some images do not handle standard error output properly.***
#### RoCE scenario
* Please make sure that RDMA devices are available in the cluster environment.
* Please make sure that the nodes in the cluster have Mellanox NICs with RoCE support. In this example, we use Mellanox ConnectX-5 NICs with the proper OFED driver installed. If not, please refer to [Install OFED Driver](https://docs.nvidia.com/networking/display/mlnxofedv461000/installing+mellanox+ofed) to install the driver.
* Check your env:
```shell
$ lspci -nn | grep Eth | grep Mellanox
0000:7f:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0000:7f:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0000:c7:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0000:c7:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0001:08:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0001:08:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0001:a2:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0001:a2:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
```
* Check the OFED driver:
```shell
ofed_info -s
OFED-internal-23.07-0.5.0:
```
* Show RDMA link status and check IB devices:
```shell
$ rdma link show
8/1: mlx5_bond_0/1: state ACTIVE physical_state LINK_UP netdev reth0
9/1: mlx5_bond_1/1: state ACTIVE physical_state LINK_UP netdev reth2
10/1: mlx5_bond_2/1: state ACTIVE physical_state LINK_UP netdev reth4
11/1: mlx5_bond_3/1: state ACTIVE physical_state LINK_UP netdev reth6
$ ibdev2netdev
```
* Test RoCE network speed on the host:
```shell
yum install qperf
# on the server, just run:
qperf
# on the client:
qperf -t 60 -cm1 <server_ip> rc_rdma_write_bw
```
* Check that RDMA is accessible inside your container:
```shell
# ibv_devices
# ibv_devinfo
```
## Keys to success
* In the YAML configuration above, pay attention to the NCCL environment variables. For older versions of NCCL, check the `NCCL_IB_GID_INDEX` setting.
* `NCCL_SOCKET_IFNAME` is also crucial, although in a containerized environment this typically isn't an issue (see the sketch after this list).
* In some cases it is also necessary to configure `GLOO_SOCKET_IFNAME` correctly.
* `NCCL_DEBUG` is essential for troubleshooting, but it sometimes does not print error logs inside containers. This can be related to the Docker image you are using; try switching images if needed.
* Avoid Docker images based on Ubuntu 18.04, as they tend to have compatibility issues.
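The following is a minimal sketch of how these variables could be added to the `env` section of the sglang containers in `lws.yaml`. The interface name `bond0` and the values shown are placeholders; adjust them to match your nodes.
```yaml
env:
  - name: NCCL_IB_GID_INDEX     # required for RoCE in this example
    value: "3"
  - name: NCCL_SOCKET_IFNAME    # NIC used for NCCL bootstrap traffic (placeholder)
    value: "bond0"
  - name: GLOO_SOCKET_IFNAME    # NIC used by the Gloo/torch distributed backend (placeholder)
    value: "bond0"
  - name: NCCL_DEBUG            # enable only while troubleshooting; very verbose
    value: "TRACE"
```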
## Remaining issues
* In Kubernetes, Docker, or Containerd environments, we use hostNetwork to prevent performance degradation.
* We utilize privileged mode, which isn’t secure. Additionally, in containerized environments, full GPU isolation cannot be achieved.
## TODO
* Integrate with [k8s-rdma-shared-dev-plugin](https://github.com/Mellanox/k8s-rdma-shared-dev-plugin).
apiVersion: v1
kind: Service
metadata:
name: deepseekr10528-decode-main
spec:
selector:
leaderworkerset.sigs.k8s.io/name: deepseekr10528-decode-main
role: leader
ports:
- protocol: TCP
port: 30000
targetPort: 30000