Unverified commit 2449a0af authored by Lianmin Zheng, committed by GitHub

Refactor the docs (#9031)

parent 0f229c07
......@@ -5,13 +5,21 @@
"id": "0",
"metadata": {},
"source": [
"# Querying Qwen-VL"
"# Query Vision Language Model"
]
},
{
"cell_type": "markdown",
"id": "1",
"metadata": {},
"source": [
"## Querying Qwen-VL"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1",
"id": "2",
"metadata": {},
"outputs": [],
"source": [
......@@ -26,7 +34,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "2",
"id": "3",
"metadata": {},
"outputs": [
{
......@@ -61,7 +69,6 @@
"import requests\n",
"from PIL import Image\n",
"\n",
"from sglang.srt.entrypoints.openai.protocol import ChatCompletionRequest\n",
"from sglang.srt.conversation import chat_templates\n",
"\n",
"image = Image.open(\n",
......@@ -83,16 +90,16 @@
},
{
"cell_type": "markdown",
"id": "3",
"id": "4",
"metadata": {},
"source": [
"## Query via the offline Engine API"
"### Query via the offline Engine API"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4",
"id": "5",
"metadata": {},
"outputs": [
{
......@@ -121,7 +128,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "5",
"id": "6",
"metadata": {},
"outputs": [
{
......@@ -139,16 +146,16 @@
},
{
"cell_type": "markdown",
"id": "6",
"id": "7",
"metadata": {},
"source": [
"## Query via the offline Engine API, but send precomputed embeddings"
"### Query via the offline Engine API, but send precomputed embeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7",
"id": "8",
"metadata": {},
"outputs": [
{
......@@ -181,7 +188,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "8",
"id": "9",
"metadata": {},
"outputs": [
{
......@@ -212,16 +219,16 @@
},
{
"cell_type": "markdown",
"id": "9",
"id": "10",
"metadata": {},
"source": [
"# Querying Llama 4 (Vision)"
"## Querying Llama 4 (Vision)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10",
"id": "11",
"metadata": {},
"outputs": [],
"source": [
......@@ -236,7 +243,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "11",
"id": "12",
"metadata": {},
"outputs": [
{
......@@ -271,7 +278,6 @@
"import requests\n",
"from PIL import Image\n",
"\n",
"from sglang.srt.entrypoints.openai.protocol import ChatCompletionRequest\n",
"from sglang.srt.conversation import chat_templates\n",
"\n",
"image = Image.open(\n",
......@@ -295,16 +301,16 @@
},
{
"cell_type": "markdown",
"id": "12",
"id": "13",
"metadata": {},
"source": [
"## Query via the offline Engine API"
"### Query via the offline Engine API"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "13",
"id": "14",
"metadata": {},
"outputs": [
{
......@@ -416,7 +422,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "14",
"id": "15",
"metadata": {},
"outputs": [
{
......@@ -435,16 +441,16 @@
},
{
"cell_type": "markdown",
"id": "15",
"id": "16",
"metadata": {},
"source": [
"## Query via the offline Engine API, but send precomputed embeddings"
"### Query via the offline Engine API, but send precomputed embeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16",
"id": "17",
"metadata": {},
"outputs": [
{
......@@ -480,7 +486,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "17",
"id": "18",
"metadata": {},
"outputs": [
{
......
......@@ -57,22 +57,20 @@ To run DeepSeek V3/R1 models, the requirements are as follows:
Detailed commands for reference:
- [8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended)
- [8 x MI300X](https://docs.sglang.ai/references/amd.html#running-deepseek-v3)
- [8 x MI300X](../platforms/amd_gpu.md#running-deepseek-v3)
- [2 x 8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes)
- [4 x 8 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes)
- [8 x A100 (AWQ)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-8-a100a800-with-awq-quantization)
- [16 x A100 (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization)
- [32 x L40S (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-32-l40s-with-int8-quantization)
- [Xeon 6980P CPU](https://docs.sglang.ai/references/cpu.html#example-running-deepseek-r1)
- [Xeon 6980P CPU](../platforms/cpu_server.md#example-running-deepseek-r1)
### Download Weights
If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded. Please refer to the official [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#61-inference-with-deepseek-infer-demo-example-only) guide to download the weights.
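For example, one way to pre-download the weights is with the `huggingface_hub` client (a sketch; the local directory is illustrative and assumes the package is installed):

```python
# Sketch: pre-download DeepSeek-V3 weights so the server does not download them at startup.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V3",
    local_dir="/models/DeepSeek-V3",  # hypothetical path; point --model-path here when launching
)
```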
### Caching `torch.compile`
The DeepSeek series has huge model weights, so it takes some time to compile the model with `torch.compile` for the first time if you have added the `--enable-torch-compile` flag. You can refer to [this guide](https://docs.sglang.ai/backend/hyperparameter_tuning.html#try-advanced-options) to optimize the caching of compilation results, so that the cache can speed up the next startup.
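As a minimal sketch (assuming a standard PyTorch Inductor setup; the linked tuning guide is the authoritative reference), you can point the compile cache at a persistent directory before launching the server so that later startups reuse it:

```python
# Sketch: reuse torch.compile artifacts across restarts by keeping the Inductor
# cache in a persistent directory (TORCHINDUCTOR_CACHE_DIR is a standard PyTorch setting).
import os

os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/root/.cache/sglang_compile"  # illustrative path

# Launch the server from this environment with --enable-torch-compile so the
# compiled artifacts written here are picked up on the next startup.
```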
### Launch with one node of 8 x H200
Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended). **Note that Deepseek V3 is already in FP8. So we should not run it with any quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.
Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#installation--launch).
**Note that Deepseek V3 is already in FP8**, so we should not run it with any quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.
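For illustration, a single-node launch sketch using the helper from `sglang.utils` (the linked example is the authoritative reference command; the flags below are assumptions):

```python
# Sketch: launch DeepSeek-V3 on one 8-GPU node. No quantization flags are passed,
# since the checkpoint is already FP8.
from sglang.utils import launch_server_cmd, wait_for_server

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 "
    "--tp 8 --trust-remote-code"
)
wait_for_server(f"http://localhost:{port}")
```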
### Running examples on Multi-node
......@@ -221,7 +219,6 @@ Important Notes:
2. To receive more consistent tool call results, it is recommended to use `--chat-template examples/chat_template/tool_chat_template_deepseekv3.jinja`. It provides an improved unified prompt.
## FAQ
**Q: Model loading is taking too long, and I'm encountering an NCCL timeout. What should I do?**
......
# GPT OSS Usage
Please refer to [https://github.com/sgl-project/sglang/issues/8833](https://github.com/sgl-project/sglang/issues/8833).
......@@ -6,7 +6,7 @@
"source": [
"# SGLang Native APIs\n",
"\n",
"Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce these following APIs:\n",
"Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce the following APIs:\n",
"\n",
"- `/generate` (text generation model)\n",
"- `/get_model_info`\n",
......@@ -21,8 +21,9 @@
"- `/start_expert_distribution_record`\n",
"- `/stop_expert_distribution_record`\n",
"- `/dump_expert_distribution_record`\n",
"- A full list of these APIs can be found at [http_server.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/entrypoints/http_server.py)\n",
"\n",
"We mainly use `requests` to test these APIs in the following examples. You can also use `curl`."
"We mainly use `requests` to test these APIs in the following examples. You can also use `curl`.\n"
]
},
{
......@@ -38,24 +39,12 @@
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\"\n",
")\n",
"## To run qwen2.5-0.5b-instruct model on the Ascend-Npu, you can execute the following command:\n",
"# server_process, port = launch_server_cmd(\n",
"# \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --device npu --tp 2 --attention-backend torch_native\"\n",
"# )\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
......@@ -65,7 +54,7 @@
"metadata": {},
"source": [
"## Generate (text generation model)\n",
"Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](./sampling_params.md)."
"Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](sampling_params.md)."
]
},
{
......@@ -74,6 +63,8 @@
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"url = f\"http://localhost:{port}/generate\"\n",
"data = {\"text\": \"What is the capital of France?\"}\n",
"\n",
......@@ -81,11 +72,6 @@
"print_highlight(response.json())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
......@@ -141,8 +127,6 @@
"metadata": {},
"outputs": [],
"source": [
"# get_server_info\n",
"\n",
"url = f\"http://localhost:{port}/get_server_info\"\n",
"\n",
"response = requests.get(url)\n",
......@@ -197,8 +181,6 @@
"metadata": {},
"outputs": [],
"source": [
"# flush cache\n",
"\n",
"url = f\"http://localhost:{port}/flush_cache\"\n",
"\n",
"response = requests.post(url)\n",
......@@ -270,7 +252,7 @@
"source": [
"## Encode (embedding model)\n",
"\n",
"Encode text into embeddings. Note that this API is only available for [embedding models](openai_api_embeddings.html#openai-apis-embedding) and will raise an error for generation models.\n",
"Encode text into embeddings. Note that this API is only available for [embedding models](openai_api_embeddings.ipynb) and will raise an error for generation models.\n",
"Therefore, we launch a new server to server an embedding model."
]
},
......
......@@ -64,25 +64,11 @@
"source": [
"# launch the offline engine\n",
"import asyncio\n",
"import io\n",
"import os\n",
"\n",
"from PIL import Image\n",
"import requests\n",
"import sglang as sgl\n",
"\n",
"from sglang.srt.conversation import chat_templates\n",
"from sglang.test.test_utils import is_in_ci\n",
"import sglang.test.doc_patch\n",
"from sglang.utils import async_stream_and_merge, stream_and_merge\n",
"\n",
"if is_in_ci():\n",
" import patch\n",
"else:\n",
" import nest_asyncio\n",
"\n",
" nest_asyncio.apply()\n",
"\n",
"\n",
"llm = sgl.Engine(model_path=\"qwen/qwen2.5-0.5b-instruct\")"
]
},
......
OpenAI-Compatible APIs
======================
.. toctree::
:maxdepth: 1
openai_api_completions.ipynb
openai_api_vision.ipynb
openai_api_embeddings.ipynb
......@@ -14,7 +14,7 @@
"- `chat/completions`\n",
"- `completions`\n",
"\n",
"Check out other tutorials to learn about [vision APIs](https://docs.sglang.ai/backend/openai_api_vision.html) for vision-language models and [embedding APIs](https://docs.sglang.ai/backend/openai_api_embeddings.html) for embedding models."
"Check out other tutorials to learn about [vision APIs](openai_api_vision.ipynb) for vision-language models and [embedding APIs](openai_api_embeddings.ipynb) for embedding models."
]
},
{
......@@ -32,18 +32,11 @@
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --mem-fraction-static 0.8\"\n",
" \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
......@@ -93,9 +86,69 @@
"\n",
"The chat completions API accepts OpenAI Chat Completions API's parameters. Refer to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details.\n",
"\n",
"SGLang extends the standard API with the `extra_body` parameter, allowing for additional customization. One key option within `extra_body` is `chat_template_kwargs`, which can be used to pass arguments to the chat template processor.\n",
"SGLang extends the standard API with the `extra_body` parameter, allowing for additional customization. One key option within `extra_body` is `chat_template_kwargs`, which can be used to pass arguments to the chat template processor."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = client.chat.completions.create(\n",
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n",
" },\n",
" {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n",
" {\n",
" \"role\": \"assistant\",\n",
" \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n",
" },\n",
" {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n",
" ],\n",
" temperature=0.3, # Lower temperature for more focused responses\n",
" max_tokens=128, # Reasonable length for a concise response\n",
" top_p=0.95, # Slightly higher for better fluency\n",
" presence_penalty=0.2, # Mild penalty to avoid repetition\n",
" frequency_penalty=0.2, # Mild penalty for more natural language\n",
" n=1, # Single response is usually more stable\n",
" seed=42, # Keep for reproducibility\n",
")\n",
"\n",
"#### Enabling Model Thinking/Reasoning\n",
"print_highlight(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Streaming mode is also supported."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stream = client.chat.completions.create(\n",
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
" messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n",
" stream=True,\n",
")\n",
"for chunk in stream:\n",
" if chunk.choices[0].delta.content is not None:\n",
" print(chunk.choices[0].delta.content, end=\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Enabling Model Thinking/Reasoning\n",
"\n",
"You can use `chat_template_kwargs` to enable or disable the model's internal thinking or reasoning process output. Set `\"enable_thinking\": True` within `chat_template_kwargs` to include the reasoning steps in the response. This requires launching the server with a compatible reasoning parser.\n",
"\n",
......@@ -160,61 +213,6 @@
"Here is an example of a detailed chat completion request using standard OpenAI parameters:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = client.chat.completions.create(\n",
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n",
" },\n",
" {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n",
" {\n",
" \"role\": \"assistant\",\n",
" \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n",
" },\n",
" {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n",
" ],\n",
" temperature=0.3, # Lower temperature for more focused responses\n",
" max_tokens=128, # Reasonable length for a concise response\n",
" top_p=0.95, # Slightly higher for better fluency\n",
" presence_penalty=0.2, # Mild penalty to avoid repetition\n",
" frequency_penalty=0.2, # Mild penalty for more natural language\n",
" n=1, # Single response is usually more stable\n",
" seed=42, # Keep for reproducibility\n",
")\n",
"\n",
"print_highlight(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Streaming mode is also supported."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stream = client.chat.completions.create(\n",
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
" messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n",
" stream=True,\n",
")\n",
"for chunk in stream:\n",
" if chunk.choices[0].delta.content is not None:\n",
" print(chunk.choices[0].delta.content, end=\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
......@@ -282,7 +280,7 @@
"source": [
"## Structured Outputs (JSON, Regex, EBNF)\n",
"\n",
"For OpenAI compatible structured outputs API, refer to [Structured Outputs](https://docs.sglang.ai/backend/structured_outputs.html#OpenAI-Compatible-API) for more details.\n"
"For OpenAI compatible structured outputs API, refer to [Structured Outputs](../advanced_features/structured_outputs.ipynb) for more details.\n"
]
},
{
......
......@@ -9,7 +9,7 @@
"SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
"A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/embeddings).\n",
"\n",
"This tutorial covers the embedding APIs for embedding models. For a list of the supported models see the [corresponding overview page](https://docs.sglang.ai/supported_models/embedding_models.html)\n"
"This tutorial covers the embedding APIs for embedding models. For a list of the supported models see the [corresponding overview page](../supported_models/embedding_models.md)\n"
]
},
{
......@@ -27,13 +27,7 @@
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"embedding_process, port = launch_server_cmd(\n",
......
......@@ -10,7 +10,7 @@
"A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).\n",
"This tutorial covers the vision APIs for vision language models.\n",
"\n",
"SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](https://docs.sglang.ai/supported_models/multimodal_language_models).\n",
"SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](../supported_models/multimodal_language_models.md).\n",
"\n",
"As an alternative to the OpenAI API, you can also use the [SGLang offline engine](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py)."
]
......@@ -30,13 +30,7 @@
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"vision_process, port = launch_server_cmd(\n",
......
# Sampling Parameters
This doc describes the sampling parameters of the SGLang Runtime, which are used by the runtime's low-level `/generate` endpoint.
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](./openai_api_completions.ipynb).
## `/generate` Endpoint
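For orientation, a minimal `/generate` request with explicit sampling parameters looks roughly like this (a sketch assuming a server is already running on the default port 30000):

```python
# Sketch: call the low-level /generate endpoint with explicit sampling parameters.
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(response.json())
```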
......@@ -53,7 +52,7 @@ The object is defined at `sampling_params.py::SamplingParams`. You can also read
### Constrained decoding
Please refer to our dedicated guide on [constrained decoding](./structured_outputs.ipynb) for the following parameters.
Please refer to our dedicated guide on [constrained decoding](../advanced_features/structured_outputs.ipynb) for the following parameters.
| Argument | Type/Default | Description |
|-----------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
......@@ -185,9 +184,9 @@ You can specify a JSON schema, regular expression or [EBNF](https://en.wikipedia
SGLang supports two grammar backends:
- [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints.
- [XGrammar](https://github.com/mlc-ai/xgrammar) (default): Supports JSON schema, regular expression, and EBNF constraints.
- XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).
- [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints.
If you want to use the Outlines backend instead, you can pass the `--grammar-backend outlines` flag:
......@@ -252,7 +251,7 @@ response = requests.post(
print(response.json())
```
Detailed example in [structured outputs](./structured_outputs.ipynb).
Detailed example in [structured outputs](../advanced_features/structured_outputs.ipynb).
### Custom logit processor
......
......@@ -7,9 +7,9 @@
"# Sending Requests\n",
"This notebook provides a quick-start guide to use SGLang in chat completions after installation.\n",
"\n",
"- For Vision Language Models, see [OpenAI APIs - Vision](../backend/openai_api_vision.ipynb).\n",
"- For Embedding Models, see [OpenAI APIs - Embedding](../backend/openai_api_embeddings.ipynb) and [Encode (embedding model)](../backend/native_api.html#Encode-(embedding-model)).\n",
"- For Reward Models, see [Classify (reward model)](../backend/native_api.html#Classify-(reward-model))."
"- For Vision Language Models, see [OpenAI APIs - Vision](openai_api_vision.ipynb).\n",
"- For Embedding Models, see [OpenAI APIs - Embedding](openai_api_embeddings.ipynb) and [Encode (embedding model)](native_api.html#Encode-(embedding-model)).\n",
"- For Reward Models, see [Classify (reward model)](native_api.html#Classify-(reward-model))."
]
},
{
......@@ -25,16 +25,10 @@
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.test_utils import is_in_ci\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"# This is equivalent to running the following command in your terminal\n",
"\n",
"# python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\n",
"\n",
"server_process, port = launch_server_cmd(\n",
......
......@@ -30,62 +30,76 @@
[PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) is a convenient basic tool to inspect kernel execution time, call stack, and kernel overlap and occupancy.
### Profile a server with `sglang.bench_serving`

```bash
# set trace path
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log

# start server
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct

# send profiling request from client
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile
```
Please make sure that `SGLANG_TORCH_PROFILER_DIR` is set on both the server and the client side; otherwise, the trace file cannot be generated correctly. A safe way is to set `SGLANG_TORCH_PROFILER_DIR` in your shell's rc file (e.g., `~/.bashrc` for bash).
### Profile a server with `sglang.bench_offline_throughput`
```bash
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log

# profile one batch with bench_one_batch.py
# batch size can be controlled with --batch argument
python3 -m sglang.bench_one_batch --model-path meta-llama/Llama-3.1-8B-Instruct --batch 32 --input-len 1024 --output-len 10 --profile

# profile multiple batches with bench_offline_throughput.py
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
```
### Profile a server with `sglang.profiler`
When the server is running (e.g., processing a decoding request), you can start live profiling immediately by sending a profile request to the server.

You can do this by running `python3 -m sglang.profiler`. For example:

```
# Terminal 1: Send a generation request
python3 -m sglang.test.send_one

# Terminal 2: Before the above request finishes, quickly launch the following command in a separate terminal.
# It will generate a profile of the above request for several decoding batches.
python3 -m sglang.profiler
```
### Possible PyTorch bugs
If you encounter the following error (for example, when using Qwen 2.5 VL):
```bash
RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/autograd/profiler_python.cpp":983, please report a bug to PyTorch. Python replay stack is empty.
```
This is likely a PyTorch bug reported in [Bug: vLLM Profiler](https://github.com/vllm-project/vllm/issues/18240) and [Bug: torch.profiler.profile](https://github.com/pytorch/pytorch/issues/101632). As a workaround, you can disable `with_stack` via an environment variable, as follows:
```bash
export SGLANG_PROFILE_WITH_STACK=False
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
```
### View traces
Trace files can be loaded and visualized from:
1. https://ui.perfetto.dev/ (any browser)
2. chrome://tracing (Chrome browser only)
If the browser cannot open the trace file due to its large size,
the client can generate a small trace file (<100MB) by controlling the number of prompts and the lengths of the prompt outputs.
For example, when profiling a server:
```bash
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 2 --sharegpt-output-len 100 --profile
```
This command sets the number of prompts to 2 with the `--num-prompts` argument and limits the length of the output sequences to 100 with the `--sharegpt-output-len` argument, which produces a small trace file that the browser can open smoothly.
Additionally, if you want to locate the SGLang Python source code from a CUDA kernel in the trace, you need to disable CUDA Graph when starting the server. This can be done with the `--disable-cuda-graph` flag in the launch command.
## Profile with Nsight
......
......@@ -2,9 +2,9 @@
Welcome to **SGLang**! We appreciate your interest in contributing. This guide provides a concise overview of how to set up your environment, run tests, build documentation, and open a Pull Request (PR). Whether you’re fixing a small bug or developing a major feature, we encourage following these steps for a smooth contribution process.
## Setting Up & Building from Source
## Install SGLang from Source
### Fork and Clone the Repository
### Fork and clone the repository
**Note**: New contributors do **not** have the write permission to push to the official SGLang repo. Please fork the repository under your GitHub account, then clone your fork locally.
......@@ -12,11 +12,11 @@ Welcome to **SGLang**! We appreciate your interest in contributing. This guide p
git clone https://github.com/<your_user_name>/sglang.git
```
### Install Dependencies & Build
### Build from source
Refer to [Install SGLang from Source](https://docs.sglang.ai/start/install.html#method-2-from-source) documentation for more details on setting up the necessary dependencies.
Refer to [Install SGLang from Source](../get_started/install.md#method-2-from-source).
## Code Formatting with Pre-Commit
## Format code with pre-commit
We use [pre-commit](https://pre-commit.com/) to maintain consistent code style checks. Before pushing your changes, please run:
......@@ -29,18 +29,54 @@ pre-commit run --all-files
- **`pre-commit run --all-files`** manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks **before** creating a Pull Request.
- **Do not commit** directly to the `main` branch. Always create a new branch (e.g., `feature/my-new-feature`), push your changes, and open a PR from that branch.
## Running Unit Tests & Adding to CI
## Run and add unit tests
SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unittest.html) framework. For detailed instructions on running tests and adding them to CI, please refer to [test/README.md](https://github.com/sgl-project/sglang/tree/main/test/README.md).
If you add a new feature or fix a bug, please add corresponding unit tests to ensure coverage and prevent regression.
SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unittest.html) framework.
For detailed instructions on running tests and integrating them into CI, refer to [test/README.md](https://github.com/sgl-project/sglang/tree/main/test/README.md).
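As a starting point, a new test can follow the standard `unittest` skeleton below (a generic sketch; see `test/README.md` for how existing SGLang suites are organized and registered in CI):

```python
# Generic unittest skeleton for a new feature test (illustrative only).
import unittest


class TestMyNewFeature(unittest.TestCase):
    def test_basic_behavior(self):
        # Replace with assertions that exercise your change.
        self.assertEqual(1 + 1, 2)


if __name__ == "__main__":
    unittest.main()
```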
## Writing Documentation & Running Docs CI
## Write documentation
We recommend new contributors start by writing documentation, which helps you quickly understand the SGLang codebase.
For more details, please refer to [docs/README.md](https://github.com/sgl-project/sglang/tree/main/docs/README.md).
## Tips for Newcomers
## Test the accuracy
If your code changes the model output, please run the accuracy tests. A quick sanity check is the few-shot GSM8K.
```
# Launch a server
python3 -m sglang.launch_server --model Qwen/Qwen2-7B-Instruct
# Evaluate
python3 -m sglang.test.few_shot_gsm8k --num-questions 200
```
Please note that the above script is primarily a sanity check, not a rigorous accuracy or speed test.
This test can have significant variance (1%–5%) in accuracy due to batching and the non-deterministic nature of the inference engine.
Also, do not rely on the "Latency/Output throughput" from this script, as it is not a proper speed test.
GSM8K is too easy for state-of-the-art models nowadays. Please try your own more challenging accuracy tests.
You can find additional accuracy eval examples in:
- [test_eval_accuracy_large.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_eval_accuracy_large.py)
- [test_gpt_oss_1gpu.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_gpt_oss_1gpu.py)
## Benchmark the speed
Refer to [Benchmark and Profiling](../developer_guide/benchmark_and_profiling.md).
## Request a Review
You can identify potential reviewers for your code by checking the [code owners](https://github.com/sgl-project/sglang/blob/main/.github/CODEOWNERS) and [reviewers](https://github.com/sgl-project/sglang/blob/main/.github/REVIEWERS.md) files.
Another effective strategy is to review the file modification history and contact individuals who have frequently edited the files.
If you modify files protected by code owners, their approval is required to merge the code.
## General Code Style
- Avoid code duplication. If the same code snippet (more than 5 lines) appears multiple times, extract it into a shared function.
- Minimize device synchronization. Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, as much as possible. Use vectorized code instead (see the sketch after this list).
- Keep files short. If a file exceeds 2,000 lines of code, please split it into multiple smaller files.
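For the device-synchronization guideline, here is a small illustrative sketch (hypothetical tensors, not code from the repository):

```python
# Sketch: avoid per-element .item() calls, each of which forces a CPU-GPU sync.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
scores = torch.rand(1024, device=device)

# Slow: synchronizes once per element.
count_slow = sum(1 for s in scores if s.item() > 0.5)

# Fast: a single vectorized reduction; only one sync when converting to a Python int.
count_fast = int((scores > 0.5).sum())

assert count_slow == count_fast
```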
## Tips for newcomers
If you want to contribute but don’t have a specific idea in mind, pick issues labeled [“good first issue” or “help wanted”](https://github.com/sgl-project/sglang/issues?q=is%3Aissue+label%3A%22good+first+issue%22%2C%22help+wanted%22). These tasks typically have lower complexity and provide an excellent introduction to the codebase. Also check out this [code walk-through](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through) for a deeper look into SGLang’s workflow.
If you have any questions or want to start a discussion, please feel free to ask in our [Slack channel](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2um0ad92q-LkU19KQTxCGzlCgRiOiQEw).
If you have any questions or want to start a discussion, please feel free to ask in our [Slack channel](https://slack.sglang.ai).
Thank you for your interest in SGLang. Happy coding!
import weakref

import sglang.srt.server_args as server_args_mod
from sglang.utils import execute_shell_command, reserve_port

# Conservative defaults used when building the docs, to keep notebook runs lightweight.
DEFAULT_MAX_RUNNING_REQUESTS = 200
DEFAULT_MAX_TOTAL_TOKENS = 20480

_original_post_init = server_args_mod.ServerArgs.__post_init__


def patched_post_init(self):
    _original_post_init(self)
    if self.max_running_requests is None:
        self.max_running_requests = DEFAULT_MAX_RUNNING_REQUESTS
    if self.max_total_tokens is None:
        self.max_total_tokens = DEFAULT_MAX_TOTAL_TOKENS
    self.disable_cuda_graph = True


# Apply the docs-friendly defaults to every ServerArgs created after this import.
server_args_mod.ServerArgs.__post_init__ = patched_post_init

# Keep each reserved port's lock socket alive for as long as its server process exists.
process_socket_map = weakref.WeakKeyDictionary()


def launch_server_cmd(command: str, host: str = "0.0.0.0", port: int = None):
    """
    Launch the server using the given command.
    If no port is specified, a free port is reserved.
    """
    if port is None:
        port, lock_socket = reserve_port(host)
    else:
        lock_socket = None

    extra_flags = (
        f"--max-running-requests {DEFAULT_MAX_RUNNING_REQUESTS} "
        f"--max-total-tokens {DEFAULT_MAX_TOTAL_TOKENS} "
        f"--disable-cuda-graph"
    )

    full_command = f"{command} --port {port} {extra_flags}"
    process = execute_shell_command(full_command)

    if lock_socket is not None:
        process_socket_map[process] = lock_socket

    return process, port
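For reference, the refactored notebooks above use this helper roughly as follows:

```python
# Usage sketch, mirroring the notebook cells in this change.
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, terminate_process

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0"
)
wait_for_server(f"http://localhost:{port}")
# ... send requests against http://localhost:{port} ...
terminate_process(server_process)
```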