SGLang is a high-performance serving framework for large language models and vision-language models.
It is designed to deliver low-latency and high-throughput inference across a wide range of setups, from a single GPU to large distributed clusters.
Its core features include:
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
- **Extensive Model Support**: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GLM, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse), and reward models (Skywork), with easy extensibility for integrating new models. Compatible with most Hugging Face models and OpenAI APIs.
- **Extensive Hardware Support**: Runs on NVIDIA GPUs (GB200/B300/H100/A100/Spark), AMD GPUs (MI355/MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more.
- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, supporting chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions (see the sketch after this list).
- **Active Community**: SGLang is open-source and supported by a vibrant community with widespread industry adoption, powering over 300,000 GPUs worldwide.
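As a taste of the frontend language, here is a minimal sketch. It assumes an SGLang server is already running locally; the port, the model behind the endpoint, and the example questions are placeholders.

```python
import sglang as sgl

# Point the frontend at a locally running SGLang server (the port is a placeholder).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def multi_turn_qa(s, question_1, question_2):
    # Chained generation calls: the second answer conditions on the first.
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=128))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=128))

state = multi_turn_qa.run(
    question_1="What is the capital of France?",
    question_2="Name two museums located there.",
)
print(state["answer_1"])
print(state["answer_2"])
```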
SGLang has been deployed at large scale, generating trillions of tokens in production each day. It is trusted and adopted by a wide range of leading enterprises and institutions, including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, AWS, Atlas Cloud, Voltage Park, Nebius, DataCrunch, Novita, InnoMatrix, MIT, UCLA, the University of Washington, Stanford, UC Berkeley, Tsinghua University, Jam & Tea Studios, Baseten, and other major technology organizations across North America and Asia. As an open-source LLM inference engine, SGLang has become the de facto industry standard, with deployments running on over 300,000 GPUs worldwide.
SGLang is currently hosted under the non-profit open-source organization [LMSYS](https://lmsys.org/about/).
It is recommended to use uv for faster installation:
```bash
pip install --upgrade pip
pip install uv
uv pip install sglang --prerelease=allow
```
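After installation, one way to sanity-check the setup is to launch a server and query it through the OpenAI-compatible API. The sketch below assumes the server is reachable on port 30000; the model name is only an example, so substitute any supported checkpoint.

```python
# Assumes a server was started first, for example:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
# (The model path and port are placeholders.)
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    temperature=0,
    max_tokens=32,
)
print(response.choices[0].message.content)
```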
**Quick fixes to common problems**
...
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
- To reinstall flashinfer locally, use the following command: `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
- If you only need to use OpenAI API models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without a GPU, while the backend runs on a GPU-enabled machine. To install the frontend, run `pip install sglang`; for the backend, use `pip install sglang[srt]` (`srt` stands for SGLang runtime). A frontend-only sketch follows these notes.
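Here is a minimal frontend-only sketch, assuming `pip install "sglang[openai]"` and an `OPENAI_API_KEY` set in the environment; the model name is only an example.

```python
import sglang as sgl

# Use the OpenAI backend, so no local GPU runtime is required
# (the model name is a placeholder).
sgl.set_default_backend(sgl.OpenAI("gpt-4o-mini"))

@sgl.function
def tell_joke(s):
    s += sgl.user("Tell me a short joke.")
    s += sgl.assistant(sgl.gen("joke", max_tokens=64))

print(tell_joke.run()["joke"])
```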