Unverified commit 2e8e7e35, authored by Lianmin Zheng and committed by GitHub

Improve docs and developer guide (#9044)

parent 2449a0af
@@ -18,7 +18,7 @@
## Checklist
- [ ] Format your code according to the [Format code with pre-commit](https://docs.sglang.ai/developer_guide/contribution_guide.html#format-code-with-pre-commit).
- [ ] Add unit tests according to the [Run and add unit tests](https://docs.sglang.ai/developer_guide/contribution_guide.html#run-and-add-unit-tests).
- [ ] Update documentation according to [Write documentations](https://docs.sglang.ai/developer_guide/contribution_guide.html#write-documentations).
- [ ] Provide accuracy and speed benchmark results according to [Test the accuracy](https://docs.sglang.ai/developer_guide/contribution_guide.html#test-the-accuracy) and [Benchmark the speed](https://docs.sglang.ai/developer_guide/contribution_guide.html#benchmark-the-speed).
@@ -14,7 +14,7 @@ You can find all arguments by `python3 -m sglang.launch_server --help`
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total. We recommend [SGLang Router](../advanced_features/router.md) for data parallelism.
```bash
python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
...
# Sampling Parameters
This doc describes the sampling parameters of the SGLang Runtime. It is the low-level endpoint of the runtime.
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](openai_api_completions.ipynb).
## `/generate` Endpoint
The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](native_api.ipynb). The object is defined at `io_struct.py::GenerateReqInput`. You can also read the source code to find more arguments and docs.
| Argument | Type/Default | Description |
|----------------------------|------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
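For instance, a request with a few of these parameters can be sent as follows (a minimal sketch, assuming a local server on the default port 30000; the prompt and sampling values are illustrative):

```python
import requests

# Minimal /generate call: "text" plus a sampling_params dict mirrors
# io_struct.py::GenerateReqInput. Prompt and values are illustrative.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(response.json()["text"])
```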
@@ -135,7 +135,7 @@ for chunk in response.iter_lines(decode_unicode=False):
print("")
```
A detailed example is available in [openai compatible api](openai_api_completions.ipynb).
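For context, the loop excerpted in the hunk above comes from a streaming client along these lines (a sketch; the server address and prompt are illustrative). Each streamed line carries a `data: `-prefixed JSON payload, terminated by `data: [DONE]`:

```python
import json
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"]
        # Print only the newly generated suffix of the running output.
        print(output[prev:], end="", flush=True)
        prev = len(output)
print("")
```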
### Multimodal
@@ -176,7 +176,7 @@ The `image_data` can be a file name, a URL, or a base64 encoded string. See also
Streaming is supported in a similar manner as [above](#streaming).
A detailed example is available in [OpenAI API Vision](openai_api_vision.ipynb).
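A minimal multimodal request sketch follows; the image placeholder token in `text` is an assumption that depends on the served model's chat template, and the URL is illustrative:

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        # The "<image>" placeholder is model-dependent; check your
        # model's chat template for the correct image token.
        "text": "<image>\nDescribe this image in one sentence.",
        # A file name, a URL, or a base64 encoded string.
        "image_data": "https://example.com/cat.jpg",
        "sampling_params": {"temperature": 0, "max_new_tokens": 64},
    },
)
print(response.json()["text"])
```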
### Structured Outputs (JSON, Regex, EBNF)
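A hedged sketch of constrained decoding via a JSON schema is below; the schema and prompt are illustrative. Regex and EBNF constraints work analogously through the `regex` and `ebnf` keys in `sampling_params`:

```python
import json
import requests

# The schema is passed as a JSON string inside sampling_params.
schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
})

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Give information about the capital of France in JSON.",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 128,
            "json_schema": schema,
        },
    },
)
print(response.json()["text"])
```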
...
@@ -69,9 +69,10 @@ Another effective strategy is to review the file modification history and contact
If you modify files protected by code owners, their approval is required to merge the code.
## General Code Style
- Avoid code duplication. If the same code snippet (more than five lines) appears multiple times, extract it into a shared function.
- Minimize device synchronization. Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, whenever possible. Use vectorized code instead (see the sketch after this list).
- Keep files concise. If a file exceeds 2,000 lines of code, split it into multiple smaller files.
- Prioritize extreme efficiency. SGLang is a runtime, and most of your code runs on the critical path for every request. Optimize every minor overhead as much as possible.
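To illustrate the synchronization point above, here is a short sketch (the tensor and threshold are made up, and a CUDA device is assumed):

```python
import torch

scores = torch.rand(4096, device="cuda")

# Slow: every .item() call forces a CPU-GPU synchronization.
count = 0
for i in range(scores.numel()):
    if scores[i].item() > 0.5:
        count += 1

# Fast: a single vectorized comparison keeps the work on the device.
count = (scores > 0.5).sum()
```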
## Tips for newcomers
...
@@ -87,7 +87,7 @@ The core features include:
references/faq.md
references/environment_variables.md
references/production_metrics.md
references/custom_chat_template.md
references/frontend/frontend_index.rst
references/multi_node_deployment/multi_node_index.rst
references/learn_more.md
@@ -66,7 +66,7 @@ This enables TorchAO's int4 weight-only quantization with a 128-group size. The
* * * * *
Structured output with XGrammar
-------------------------------
Please refer to the [SGLang structured output doc](../advanced_features/structured_outputs.ipynb).
* * * * *
Thanks for the support from [shahizat](https://github.com/shahizat).
...
@@ -25,9 +25,9 @@ in the GitHub search bar.
| Model Family (Variants) | Example HuggingFace Identifier | Description |
|-------------------------------------|--------------------------------------------------|----------------------------------------------------------------------------------------|
| **DeepSeek** (v1, v2, v3/R1) | `deepseek-ai/DeepSeek-R1` | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. [SGLang provides DeepSeek v3/R1 model-specific optimizations](../basic_usage/deepseek.md) and a [Reasoning Parser](../advanced_features/separate_reasoning.ipynb) |
| **Qwen** (3, 3MoE, 2.5, 2 series) | `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-30B-A3B` | Alibaba’s latest Qwen3 series for complex reasoning, language understanding, and generation tasks; supports MoE variants along with previous generations (2.5, 2, etc.). [SGLang provides a Qwen3-specific reasoning parser](../advanced_features/separate_reasoning.ipynb) |
| **Llama** (2, 3.x, 4 series) | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | Meta’s open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and new Llama 4) with well-recognized performance. [SGLang provides Llama-4 model-specific optimizations](../basic_usage/llama4.md) |
| **Mistral** (Mixtral, NeMo, Small3) | `mistralai/Mistral-7B-Instruct-v0.2` | Open 7B LLM by Mistral AI with strong performance; extended into MoE (“Mixtral”) and NeMo Megatron variants for larger scale. |
| **Gemma** (v1, v2, v3) | `google/gemma-3-1b-it` | Google’s family of efficient multilingual models (1B–27B); Gemma 3 offers a 128K context window, and its larger (4B+) variants support vision input. |
| **Phi** (Phi-1.5, Phi-2, Phi-3, Phi-4, Phi-MoE series) | `microsoft/Phi-4-multimodal-instruct`, `microsoft/Phi-3.5-MoE-instruct` | Microsoft’s Phi family of small models (1.3B–5.6B); Phi-4-multimodal (5.6B) processes text, images, and speech; Phi-4-mini is a high-accuracy text model; and Phi-3.5-MoE is a mixture-of-experts model. |
...
@@ -27,7 +27,7 @@ standard LLM support:
3. **Multimodal Data Processor**:
Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register it as your
model’s dedicated processor (a hedged sketch follows this list).
See the [multimodal processors directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/multimodal/processors)
for more details.
4. **Handle Multimodal Tokens**:
...
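To make step 3 concrete, here is a minimal, hypothetical sketch. Only `BaseMultimodalProcessor` comes from the text above; the import path, the `models` binding convention, the hook name, and `MyVLM` are assumptions to be checked against the processors directory linked in step 3:

```python
# Hypothetical sketch only: the exact base-class path and hook names may
# differ across SGLang versions; consult the real processors for reference.
from sglang.srt.multimodal.processors.base_processor import BaseMultimodalProcessor

from my_package.modeling import MyVLM  # your model implementation (assumption)


class MyVLMProcessor(BaseMultimodalProcessor):
    # Assumed convention: binding the processor to the model class
    # registers it as the model's dedicated processor.
    models = [MyVLM]

    async def process_mm_data_async(self, image_data, input_text, request_obj, **kwargs):
        # Convert raw image_data (file names, URLs, or base64 strings) into
        # tensors and expanded image tokens for the model's forward pass.
        ...
```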
@@ -18,7 +18,7 @@ python3 -m sglang.launch_server \
### Quantization
The Transformers fallback supports most of the quantization methods available in SGLang (except GGUF). See the [Quantization page](../advanced_features/quantization.md) for more information about supported quantization in SGLang.
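For example, a quantized checkpoint might be served through the fallback along these lines (a sketch: the checkpoint is illustrative, and the `model_impl` and `quantization` argument names are assumptions to verify against `python3 -m sglang.launch_server --help`):

```python
import sglang as sgl

# Sketch: serve an AWQ-quantized model through the Transformers fallback.
# Argument names below are assumptions; check them against --help.
llm = sgl.Engine(
    model_path="Qwen/Qwen2.5-7B-Instruct-AWQ",  # illustrative checkpoint
    model_impl="transformers",
    quantization="awq",
)
print(llm.generate("The capital of France is")["text"])
```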
### Remote code
...