Unverified Commit 67e34c56 authored by Lianmin Zheng, committed by GitHub

Fix install instructions and pyproject.tomls (#11781)

parent 1d726528
......@@ -46,14 +46,15 @@
</details>
## About
SGLang is a fast serving framework for large language models and vision language models.
It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
The core features include:
SGLang is a high-performance serving framework for large language models and vision-language models.
It is designed to deliver low-latency and high-throughput inference across a wide range of setups, from a single GPU to large distributed clusters.
Its core features include:
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-lora batching.
- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
- **Extensive Model Support**: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
- **Active Community**: SGLang is open-source and backed by an active community with wide industry adoption.
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
- **Extensive Model Support**: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GLM, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse), and reward models (Skywork), with easy extensibility for integrating new models. Compatible with most Hugging Face models and OpenAI APIs.
- **Extensive Hardware Support**: Runs on NVIDIA GPUs (GB200/B300/H100/A100/Spark), AMD GPUs (MI355/MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more.
- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, supporting chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
- **Active Community**: SGLang is open-source and supported by a vibrant community with widespread industry adoption, powering over 300,000 GPUs worldwide.
## Getting Started
- [Install SGLang](https://docs.sglang.ai/get_started/install.html)
......@@ -69,7 +70,8 @@ Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-s
[Development Roadmap (2025 H2)](https://github.com/sgl-project/sglang/issues/7736)
## Adoption and Sponsorship
SGLang has been deployed at large scale, generating trillions of tokens in production each day. It is trusted and adopted by a wide range of leading enterprises and institutions, including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, AWS, Atlas Cloud, Voltage Park, Nebius, DataCrunch, Novita, InnoMatrix, MIT, UCLA, the University of Washington, Stanford, UC Berkeley, Tsinghua University, Jam & Tea Studios, Baseten, and other major technology organizations across North America and Asia. As an open-source LLM inference engine, SGLang has become the de facto industry standard, with deployments running on over 1,000,000 GPUs worldwide.
SGLang has been deployed at large scale, generating trillions of tokens in production each day. It is trusted and adopted by a wide range of leading enterprises and institutions, including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, AWS, Atlas Cloud, Voltage Park, Nebius, DataCrunch, Novita, InnoMatrix, MIT, UCLA, the University of Washington, Stanford, UC Berkeley, Tsinghua University, Jam & Tea Studios, Baseten, and other major technology organizations across North America and Asia. As an open-source LLM inference engine, SGLang has become the de facto industry standard, with deployments running on over 300,000 GPUs worldwide.
SGLang is currently hosted under the non-profit open-source organization [LMSYS](https://lmsys.org/about/).
<img src="https://raw.githubusercontent.com/sgl-project/sgl-learning-materials/refs/heads/main/slides/adoption.png" alt="logo" width="800" margin="10px"></img>
......
......@@ -12,7 +12,7 @@ It is recommended to use uv for faster installation:
```bash
pip install --upgrade pip
pip install uv
uv pip install sglang --upgrade
uv pip install sglang --prerelease=allow
```
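A quick, illustrative way to confirm which wheel was resolved (not part of the official instructions):

```bash
# Show the installed sglang version; prerelease builds look like 0.5.3rc0,
# post releases like 0.5.3.post3.
pip show sglang
```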
**Quick fixes to common problems**
......@@ -129,5 +129,3 @@ sky status --endpoint 30000 sglang
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
- To reinstall flashinfer locally, use the following command: `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
- If you only need to use OpenAI API models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime: you can install the frontend on a machine without a GPU and set up the backend on a GPU-enabled machine. To install the frontend, run `pip install sglang`; for the backend, use `pip install "sglang[srt]"` (`srt` is short for SGLang runtime). See the sketch after this list.
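Putting the notes above together, here is a minimal sketch of the split install plus the kernel fallback. The model path is only an illustration; substitute your own.

```bash
# Frontend only (no GPU required)
pip install sglang

# Backend runtime on a GPU machine ("srt" = SGLang runtime)
pip install "sglang[srt]"

# If FlashInfer misbehaves on an sm75+ GPU, fall back to the Triton/PyTorch kernels
# (the model path below is illustrative).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attention-backend triton --sampling-backend pytorch
```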
SGLang Documentation
====================
SGLang is a fast serving framework for large language models and vision language models.
It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
The core features include:
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-lora batching.
- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
- **Extensive Model Support**: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
- **Active Community**: SGLang is open-source and backed by an active community with wide industry adoption.
SGLang is a high-performance serving framework for large language models and vision-language models.
It is designed to deliver low-latency and high-throughput inference across a wide range of setups, from a single GPU to large distributed clusters.
Its core features include:
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
- **Extensive Model Support**: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GLM, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse), and reward models (Skywork), with easy extensibility for integrating new models. Compatible with most Hugging Face models and OpenAI APIs.
- **Extensive Hardware Support**: Runs on NVIDIA GPUs (GB200/B300/H100/A100/Spark), AMD GPUs (MI355/MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more.
- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, supporting chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
- **Active Community**: SGLang is open-source and supported by a vibrant community with widespread industry adoption, powering over 300,000 GPUs worldwide.
.. toctree::
:maxdepth: 1
......
# TPU
TPU support is under active development. Please stay tuned.
See https://github.com/sgl-project/sglang-jax
......@@ -13,6 +13,7 @@ classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: Apache Software License",
]
dependencies = [
"IPython",
"aiohttp",
......@@ -21,6 +22,7 @@ dependencies = [
"build",
"compressed-tensors",
"cuda-python",
"decord2",
"datasets",
"einops",
"fastapi",
......@@ -73,7 +75,12 @@ dependencies = [
]
[project.optional-dependencies]
decord = ["decord2"]
tracing = [
"opentelemetry-api",
"opentelemetry-exporter-otlp",
"opentelemetry-exporter-otlp-proto-grpc",
"opentelemetry-sdk",
]
test = [
"accelerate",
"expecttest",
......@@ -86,13 +93,10 @@ test = [
"sentence_transformers",
"tabulate",
]
tracing = [
"opentelemetry-api",
"opentelemetry-exporter-otlp",
"opentelemetry-exporter-otlp-proto-grpc",
"opentelemetry-sdk",
]
all = ["sglang[test]", "sglang[decord]"]
all = []
dev = ["sglang[test]"]
# Temporary tags
cu130 = [
"torch==2.9.0",
"torchaudio==2.9.0",
......@@ -104,13 +108,9 @@ cu130_all = [
"sglang[cu130]"
]
# The following will be deprecated in 2 weeks
dev = ["sglang[test]", "sglang[decord]"]
all_aarch64 = ["sglang[test]"]
blackwell = ["sglang[test]", "sglang[decord]"]
blackwell_aarch64 = ["sglang[test]"]
# To be deprecated in 2 weeks
blackwell = ["sglang[dev]"]
blackwell_aarch64 = ["sglang[dev]"]
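The net effect of the reshuffle above: `decord2` becomes a core dependency, `all` is now empty, `dev` simply pulls in the test tooling, and the Blackwell tags alias `dev`. A minimal sketch of a source install under the new extras, assuming you run it from the directory containing this pyproject.toml:

```bash
# Editable install with the test/dev tooling
pip install -e ".[dev]"

# OpenTelemetry tracing stays behind its own extra
pip install -e ".[tracing]"
```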
[project.urls]
"Homepage" = "https://github.com/sgl-project/sglang"
......
......@@ -5,85 +5,88 @@ build-backend = "setuptools.build_meta"
[project]
name = "sglang"
version = "0.5.3rc0"
version = "0.5.3.post3"
description = "SGLang is a fast serving framework for large language models and vision language models."
readme = "README.md"
requires-python = ">=3.10"
license = { file = "LICENSE" }
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: Apache Software License",
"Programming Language :: Python :: 3",
"License :: OSI Approved :: Apache Software License",
]
dependencies = [
"aiohttp",
"anthropic>=0.20.0",
"blobfile==3.0.0",
"build",
"compressed-tensors",
"datasets",
"decord",
"einops",
"fastapi",
"hf_transfer",
"huggingface_hub",
"intel-openmp",
"interegular",
"IPython",
"llguidance>=0.7.11,<0.8.0",
"modelscope",
"msgspec",
"ninja",
"numpy",
"openai==1.99.1",
"openai-harmony==0.0.4",
"orjson",
"outlines==0.1.11",
"packaging",
"partial_json_parser",
"pillow",
"prometheus-client>=0.20.0",
"psutil",
"pybase64",
"pydantic",
"python-multipart",
"pyzmq>=25.1.2",
"requests",
"scipy",
"sentencepiece",
"setproctitle",
"soundfile==0.13.1",
"tiktoken",
"timm==1.0.16",
"torchao==0.9.0",
"tqdm",
"transformers==4.57.1",
"uvicorn",
"uvloop",
"xgrammar==0.1.25",
"IPython",
"aiohttp",
"anthropic>=0.20.0",
"blobfile==3.0.0",
"build",
"compressed-tensors",
"datasets",
"decord",
"einops",
"fastapi",
"hf_transfer",
"huggingface_hub",
"intel-openmp",
"interegular",
"llguidance>=0.7.11,<0.8.0",
"modelscope",
"msgspec",
"ninja",
"numpy",
"openai-harmony==0.0.4",
"openai==1.99.1",
"orjson",
"outlines==0.1.11",
"packaging",
"partial_json_parser",
"pillow",
"prometheus-client>=0.20.0",
"psutil",
"py-spy",
"pybase64",
"pydantic",
"python-multipart",
"pyzmq>=25.1.2",
"requests",
"scipy",
"sentencepiece",
"setproctitle",
"soundfile==0.13.1",
"tiktoken",
"timm==1.0.16",
"torchao==0.9.0",
"tqdm",
"transformers==4.57.1",
"uvicorn",
"uvloop",
"xgrammar==0.1.25",
"grpcio==1.75.1", # keep it align with compile_proto.py
"grpcio-tools==1.75.1", # keep it align with compile_proto.py
"grpcio-reflection==1.75.1", # required by srt/entrypoints/grpc_server.py
]
[project.optional-dependencies]
tracing = [
"opentelemetry-sdk",
"opentelemetry-api",
"opentelemetry-exporter-otlp",
"opentelemetry-exporter-otlp-proto-grpc",
"opentelemetry-sdk",
"opentelemetry-api",
"opentelemetry-exporter-otlp",
"opentelemetry-exporter-otlp-proto-grpc",
]
test = [
"accelerate",
"expecttest",
"jsonlines",
"matplotlib",
"pandas",
"peft",
"sentence_transformers",
"pytest",
"tabulate",
"accelerate",
"expecttest",
"jsonlines",
"matplotlib",
"pandas",
"peft",
"pytest",
"sentence_transformers",
"tabulate",
]
dev = ["sglang", "sglang[test]"]
all = []
dev = ["sglang[test]"]
[project.urls]
"Homepage" = "https://github.com/sgl-project/sglang"
......@@ -91,31 +94,33 @@ dev = ["sglang", "sglang[test]"]
[tool.setuptools.package-data]
"sglang" = [
"srt/layers/moe/fused_moe_triton/configs/*/*.json",
"srt/layers/quantization/configs/*.json",
"srt/mem_cache/storage/hf3fs/hf3fs_utils.cpp",
"srt/layers/moe/fused_moe_triton/configs/*/*.json",
"srt/layers/quantization/configs/*.json",
"srt/mem_cache/storage/hf3fs/hf3fs_utils.cpp",
"srt/speculative/cpp_ngram/*.cpp",
"srt/speculative/cpp_ngram/*.h",
]
[tool.setuptools.packages.find]
exclude = [
"assets*",
"benchmark*",
"docs*",
"dist*",
"playground*",
"scripts*",
"tests*",
"assets*",
"benchmark*",
"docs*",
"dist*",
"playground*",
"scripts*",
"tests*",
]
[tool.wheel]
exclude = [
"assets*",
"benchmark*",
"docs*",
"dist*",
"playground*",
"scripts*",
"tests*",
"assets*",
"benchmark*",
"docs*",
"dist*",
"playground*",
"scripts*",
"tests*",
]
[tool.codespell]
......
......@@ -10,76 +10,77 @@ readme = "README.md"
requires-python = ">=3.10"
license = { file = "LICENSE" }
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: Apache Software License",
"Programming Language :: Python :: 3",
"License :: OSI Approved :: Apache Software License",
]
dependencies = ["aiohttp", "requests", "tqdm", "numpy", "IPython", "setproctitle"]
[project.optional-dependencies]
runtime_common = [
"blobfile==3.0.0",
"build",
"compressed-tensors",
"datasets",
"einops",
"fastapi",
"hf_transfer",
"huggingface_hub",
"interegular",
"llguidance>=0.7.11,<0.8.0",
"modelscope",
"msgspec",
"ninja",
"openai==1.99.1",
"openai-harmony==0.0.4",
"orjson",
"outlines==0.1.11",
"packaging",
"partial_json_parser",
"pillow",
"prometheus-client>=0.20.0",
"psutil",
"pybase64",
"pydantic",
"pynvml",
"python-multipart",
"pyzmq>=25.1.2",
"scipy",
"sentencepiece",
"soundfile==0.13.1",
"timm==1.0.16",
"tiktoken",
"torchao==0.9.0",
"transformers==4.57.1",
"uvicorn",
"uvloop",
"xgrammar==0.1.25",
"IPython",
"aiohttp",
"anthropic>=0.20.0",
"blobfile==3.0.0",
"build",
"compressed-tensors",
"decord2",
"datasets",
"einops",
"fastapi",
"hf_transfer",
"huggingface_hub",
"interegular",
"llguidance>=0.7.11,<0.8.0",
"modelscope",
"msgspec",
"ninja",
"numpy",
"openai-harmony==0.0.4",
"openai==1.99.1",
"orjson",
"outlines==0.1.11",
"packaging",
"partial_json_parser",
"pillow",
"prometheus-client>=0.20.0",
"psutil",
"py-spy",
"pybase64",
"pydantic",
"python-multipart",
"pyzmq>=25.1.2",
"requests",
"scipy",
"sentencepiece",
"setproctitle",
"soundfile==0.13.1",
"tiktoken",
"timm==1.0.16",
"torchao==0.9.0",
"tqdm",
"transformers==4.57.1",
"uvicorn",
"uvloop",
"xgrammar==0.1.25",
"grpcio==1.75.1", # keep it align with compile_proto.py
"grpcio-tools==1.75.1", # keep it align with compile_proto.py
"grpcio-reflection==1.75.1", # required by srt/entrypoints/grpc_server.py
]
tracing = [
"opentelemetry-sdk",
"opentelemetry-api",
"opentelemetry-exporter-otlp",
"opentelemetry-exporter-otlp-proto-grpc",
]
srt = [
"sglang[runtime_common]",
"sgl-kernel==0.3.15",
"torch==2.8.0",
"torchaudio==2.8.0",
"torchvision",
"cuda-python",
"flashinfer_python==0.4.0",
"opentelemetry-sdk",
"opentelemetry-api",
"opentelemetry-exporter-otlp",
"opentelemetry-exporter-otlp-proto-grpc",
]
# HIP (Heterogeneous-computing Interface for Portability) for AMD
# => base docker rocm/vllm-dev:20250114, not from public vllm whl
srt_hip = [
"sglang[runtime_common]",
"torch",
"petit_kernel==0.0.2",
"wave-lang==3.7.0",
"sglang[runtime_common]",
"torch",
"petit_kernel==0.0.2",
"wave-lang==3.7.0",
]
# https://docs.sglang.ai/platforms/ascend_npu.html
......@@ -89,29 +90,24 @@ srt_npu = ["sglang[runtime_common]"]
# https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html
srt_hpu = ["sglang[runtime_common]"]
openai = ["openai==1.99.1", "tiktoken"]
anthropic = ["anthropic>=0.20.0"]
litellm = ["litellm>=1.0.0"]
torch_memory_saver = ["torch_memory_saver==0.0.9rc1"]
decord = ["decord"]
test = [
"accelerate",
"expecttest",
"jsonlines",
"matplotlib",
"pandas",
"peft",
"sentence_transformers",
"pytest",
"tabulate",
"accelerate",
"expecttest",
"gguf",
"jsonlines",
"matplotlib",
"pandas",
"peft",
"pytest",
"sentence_transformers",
"tabulate",
]
all = ["sglang[srt]", "sglang[openai]", "sglang[anthropic]", "sglang[torch_memory_saver]", "sglang[decord]"]
all_hip = ["sglang[srt_hip]", "sglang[openai]", "sglang[anthropic]", "sglang[decord]"]
all_hpu = ["sglang[srt_hpu]", "sglang[openai]", "sglang[anthropic]", "sglang[decord]"]
all_npu = ["sglang[srt_npu]", "sglang[openai]", "sglang[anthropic]", "sglang[decord]"]
all_hip = ["sglang[srt_hip]"]
all_npu = ["sglang[srt_npu]"]
all_hpu = ["sglang[srt_hpu]"]
dev = ["sglang[all]", "sglang[test]"]
dev_hip = ["sglang[all_hip]", "sglang[test]"]
dev_npu = ["sglang[all_npu]", "sglang[test]"]
dev_hpu = ["sglang[all_hpu]", "sglang[test]"]
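With this reorganization, each platform umbrella extra resolves to just its runtime, and the `dev_*` tags layer the test suite on top. An illustrative source install on non-NVIDIA hardware, assuming it is run from the directory containing this pyproject.toml:

```bash
# AMD GPUs (HIP / ROCm)
pip install -e ".[dev_hip]"

# Ascend NPUs
pip install -e ".[dev_npu]"
```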
[project.urls]
......@@ -120,31 +116,33 @@ dev_hpu = ["sglang[all_hpu]", "sglang[test]"]
[tool.setuptools.package-data]
"sglang" = [
"srt/layers/moe/fused_moe_triton/configs/*/*.json",
"srt/layers/quantization/configs/*.json",
"srt/mem_cache/storage/hf3fs/hf3fs_utils.cpp",
"srt/layers/moe/fused_moe_triton/configs/*/*.json",
"srt/layers/quantization/configs/*.json",
"srt/mem_cache/storage/hf3fs/hf3fs_utils.cpp",
"srt/speculative/cpp_ngram/*.cpp",
"srt/speculative/cpp_ngram/*.h",
]
[tool.setuptools.packages.find]
exclude = [
"assets*",
"benchmark*",
"docs*",
"dist*",
"playground*",
"scripts*",
"tests*",
"assets*",
"benchmark*",
"docs*",
"dist*",
"playground*",
"scripts*",
"tests*",
]
[tool.wheel]
exclude = [
"assets*",
"benchmark*",
"docs*",
"dist*",
"playground*",
"scripts*",
"tests*",
"assets*",
"benchmark*",
"docs*",
"dist*",
"playground*",
"scripts*",
"tests*",
]
[tool.codespell]
......
......@@ -6,84 +6,87 @@ build-backend = "setuptools.build_meta"
[project]
name = "sglang"
version = "0.5.3rc0"
version = "0.5.3.post3"
description = "SGLang is a fast serving framework for large language models and vision language models."
readme = "README.md"
requires-python = ">=3.10"
license = { file = "LICENSE" }
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: Apache Software License",
"Programming Language :: Python :: 3",
"License :: OSI Approved :: Apache Software License",
]
dependencies = [
"aiohttp",
"anthropic>=0.20.0",
"blobfile==3.0.0",
"build",
"compressed-tensors",
"datasets",
"decord",
"einops",
"fastapi",
"hf_transfer",
"huggingface_hub",
"interegular",
"IPython",
"llguidance>=0.7.11,<0.8.0",
"modelscope",
"msgspec",
"ninja",
"numpy",
"openai==1.99.1",
"openai-harmony==0.0.4",
"orjson",
"outlines==0.1.11",
"packaging",
"partial_json_parser",
"pillow",
"prometheus-client>=0.20.0",
"psutil",
"pybase64",
"pydantic",
"python-multipart",
"pyzmq>=25.1.2",
"requests",
"scipy",
"sentencepiece",
"setproctitle",
"soundfile==0.13.1",
"tiktoken",
"timm==1.0.16",
"torchao==0.9.0",
"tqdm",
"transformers==4.57.1",
"uvicorn",
"uvloop",
"xgrammar==0.1.25",
"IPython",
"aiohttp",
"anthropic>=0.20.0",
"blobfile==3.0.0",
"build",
"compressed-tensors",
"datasets",
"decord",
"einops",
"fastapi",
"hf_transfer",
"huggingface_hub",
"interegular",
"llguidance>=0.7.11,<0.8.0",
"modelscope",
"msgspec",
"ninja",
"numpy",
"openai-harmony==0.0.4",
"openai==1.99.1",
"orjson",
"outlines==0.1.11",
"packaging",
"partial_json_parser",
"pillow",
"prometheus-client>=0.20.0",
"psutil",
"py-spy",
"pybase64",
"pydantic",
"python-multipart",
"pyzmq>=25.1.2",
"requests",
"scipy",
"sentencepiece",
"setproctitle",
"soundfile==0.13.1",
"tiktoken",
"timm==1.0.16",
"torchao==0.9.0",
"tqdm",
"transformers==4.57.1",
"uvicorn",
"uvloop",
"xgrammar==0.1.25",
"grpcio==1.75.1", # keep it align with compile_proto.py
"grpcio-tools==1.75.1", # keep it align with compile_proto.py
"grpcio-reflection==1.75.1", # required by srt/entrypoints/grpc_server.py
]
[project.optional-dependencies]
tracing = [
"opentelemetry-sdk",
"opentelemetry-api",
"opentelemetry-exporter-otlp",
"opentelemetry-exporter-otlp-proto-grpc",
"opentelemetry-sdk",
"opentelemetry-api",
"opentelemetry-exporter-otlp",
"opentelemetry-exporter-otlp-proto-grpc",
]
test = [
"accelerate",
"expecttest",
"jsonlines",
"matplotlib",
"pandas",
"peft",
"sentence_transformers",
"pytest",
"tabulate",
"accelerate",
"expecttest",
"jsonlines",
"matplotlib",
"pandas",
"peft",
"pytest",
"sentence_transformers",
"tabulate",
]
dev = ["sglang", "sglang[test]"]
all = []
dev = ["sglang[test]"]
[project.urls]
"Homepage" = "https://github.com/sgl-project/sglang"
......@@ -91,31 +94,33 @@ dev = ["sglang", "sglang[test]"]
[tool.setuptools.package-data]
"sglang" = [
"srt/layers/moe/fused_moe_triton/configs/*/*.json",
"srt/layers/quantization/configs/*.json",
"srt/mem_cache/storage/hf3fs/hf3fs_utils.cpp",
"srt/layers/moe/fused_moe_triton/configs/*/*.json",
"srt/layers/quantization/configs/*.json",
"srt/mem_cache/storage/hf3fs/hf3fs_utils.cpp",
"srt/speculative/cpp_ngram/*.cpp",
"srt/speculative/cpp_ngram/*.h",
]
[tool.setuptools.packages.find]
exclude = [
"assets*",
"benchmark*",
"docs*",
"dist*",
"playground*",
"scripts*",
"tests*",
"assets*",
"benchmark*",
"docs*",
"dist*",
"playground*",
"scripts*",
"tests*",
]
[tool.wheel]
exclude = [
"assets*",
"benchmark*",
"docs*",
"dist*",
"playground*",
"scripts*",
"tests*",
"assets*",
"benchmark*",
"docs*",
"dist*",
"playground*",
"scripts*",
"tests*",
]
[tool.codespell]
......
......@@ -623,7 +623,7 @@ class ModelRunner:
server_args.disable_chunked_prefix_cache = True
if not server_args.disable_chunked_prefix_cache:
logger.info("Chunked prefix cache is turned on.")
log_info_on_rank0(logger, "Chunked prefix cache is turned on.")
if server_args.attention_backend == "aiter":
if self.model_config.context_len > 8192:
......
......@@ -253,7 +253,6 @@ class ServerArgs:
log_requests: bool = False
log_requests_level: int = 2
crash_dump_folder: Optional[str] = None
crash_on_nan: bool = False
show_time_cost: bool = False
enable_metrics: bool = False
enable_metrics_for_all_schedulers: bool = False
......@@ -1899,12 +1898,6 @@ class ServerArgs:
default=ServerArgs.crash_dump_folder,
help="Folder path to dump requests from the last 5 min before a crash (if any). If not specified, crash dumping is disabled.",
)
parser.add_argument(
"--crash-on-nan",
type=str,
default=ServerArgs.crash_on_nan,
help="Crash the server on nan logprobs.",
)
parser.add_argument(
"--show-time-cost",
action="store_true",
......