Unverified Commit 062c48d2 authored by Shi Shuai, committed by GitHub

[Docs] Add Support for Pydantic Structured Output Format (#2697)

parent b6e0cfb5
# Backend: SGLang Runtime (SRT)
The SGLang Runtime (SRT) is an efficient serving engine.
## Quick Start
Launch a server
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```
Send a request
```
curl http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "Once upon a time,",
"sampling_params": {
"max_new_tokens": 16,
"temperature": 0
}
}'
```
Learn more about the argument specification, streaming, and multi-modal support [here](../references/sampling_params.md).
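The native `/generate` endpoint also supports streaming. Below is a minimal Python sketch, assuming the server emits server-sent events (`data: {...}` lines, ending with `data: [DONE]`) when `"stream": true` is set; see the sampling parameters reference above for the authoritative protocol.
```python
import json
import requests

# Stream from the native /generate endpoint (assumes SSE-style "data:" lines).
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Once upon a time,",
        "sampling_params": {"max_new_tokens": 64, "temperature": 0},
        "stream": True,
    },
    stream=True,
)

prev = 0
for line in response.iter_lines(decode_unicode=True):
    if not line.startswith("data:"):
        continue
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        break
    # Each event carries the full text generated so far; print only the new suffix.
    text = json.loads(payload)["text"]
    print(text[prev:], end="", flush=True)
    prev = len(text)
```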
## OpenAI Compatible API
In addition, the server supports OpenAI-compatible APIs.
```python
import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Text completion
response = client.completions.create(
    model="default",
    prompt="The capital of France is",
    temperature=0,
    max_tokens=32,
)
print(response)

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)

# Text embedding
response = client.embeddings.create(
    model="default",
    input="How are you today",
)
print(response)
```
It supports streaming, vision, and almost all features of the Chat/Completions/Models/Batch endpoints specified by the [OpenAI API Reference](https://platform.openai.com/docs/api-reference/).
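For example, streaming through the OpenAI-compatible endpoint follows the standard OpenAI protocol; here is a short sketch reusing the `client` from above (chunks arrive as the usual `delta` payloads):
```python
# Stream a chat completion; chunks follow the standard OpenAI streaming format.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
    temperature=0,
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
```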
## Additional Server Arguments
- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
@@ -94,35 +32,6 @@
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
```
## Engine Without HTTP Server
We also provide an inference engine **without an HTTP server**. For example,
```python
import sglang as sgl


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")


if __name__ == "__main__":
    main()
```
This can be used for offline batch inference and building custom servers.
You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).
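As an illustration of the custom-server use case, here is a minimal sketch that wraps the engine in a [FastAPI](https://fastapi.tiangolo.com/) app. FastAPI and the `/complete` route are illustrative choices, not part of SGLang, and the exact `Engine` return format may vary across versions (assumed here to match the batch example above).
```python
# A minimal custom server on top of sgl.Engine; run with `uvicorn server:app`
# (assuming this file is saved as server.py).
import sglang as sgl
from fastapi import FastAPI

app = FastAPI()
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


@app.post("/complete")
def complete(prompt: str):
    # generate() on a list of prompts returns a list of dicts with a "text" field
    outputs = llm.generate([prompt], {"temperature": 0.8, "top_p": 0.95})
    return {"text": outputs[0]["text"]}
```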
## Use Models From ModelScope
<details>
<summary>More</summary>
@@ -16,16 +16,11 @@
"SGLang supports two grammar backends:\n",
"\n",
"- [Outlines](https://github.com/dottxt-ai/outlines) (default): Supports JSON schema and regular expression constraints.\n",
"- [XGrammar](https://github.com/mlc-ai/xgrammar): Supports JSON schema and EBNF constraints and currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).\n",
"\n",
"We suggest using XGrammar whenever possible for its better performance. For more details, see [XGrammar technical overview](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar).\n",
"\n",
"To use XGrammar, simply add `--grammar-backend xgrammar` when launching the server. If no backend is specified, Outlines will be used as the default."
]
},
{
@@ -35,13 +30,6 @@
"## OpenAI Compatible API"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -68,7 +56,64 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### JSON" "### JSON\n",
"\n",
"you can directly define a JSON schema or use [Pydantic](https://docs.pydantic.dev/latest/) to define and validate the response."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Using Pydantic**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pydantic import BaseModel, Field\n",
"\n",
"\n",
"# Define the schema using Pydantic\n",
"class CapitalInfo(BaseModel):\n",
" name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n",
" population: int = Field(..., description=\"Population of the capital city\")\n",
"\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": \"Give me the information of the capital of France in the JSON format.\",\n",
" },\n",
" ],\n",
" temperature=0,\n",
" max_tokens=128,\n",
" response_format={\n",
" \"type\": \"json_schema\",\n",
" \"json_schema\": {\n",
" \"name\": \"foo\",\n",
" # convert the pydantic model to json schema\n",
" \"schema\": CapitalInfo.model_json_schema(),\n",
" },\n",
" },\n",
")\n",
"\n",
"response_content = response.choices[0].message.content\n",
"# validate the JSON response by the pydantic model\n",
"capital_info = CapitalInfo.model_validate_json(response_content)\n",
"print_highlight(f\"Validated response: {capital_info.model_dump_json()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**JSON Schema Directly**\n"
]
},
{
@@ -225,15 +270,64 @@
"### JSON"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Using Pydantic**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import requests\n", "import requests\n",
"import json\n",
"from pydantic import BaseModel, Field\n",
"\n",
"\n",
"# Define the schema using Pydantic\n",
"class CapitalInfo(BaseModel):\n",
" name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n",
" population: int = Field(..., description=\"Population of the capital city\")\n",
"\n",
"\n",
"# Make API request\n",
"response = requests.post(\n",
" \"http://localhost:30010/generate\",\n",
" json={\n",
" \"text\": \"Here is the information of the capital of France in the JSON format.\\n\",\n",
" \"sampling_params\": {\n",
" \"temperature\": 0,\n",
" \"max_new_tokens\": 64,\n",
" \"json_schema\": json.dumps(CapitalInfo.model_json_schema()),\n",
" },\n",
" },\n",
")\n",
"print_highlight(response.json())\n",
"\n",
"\n", "\n",
"response_data = json.loads(response.json()[\"text\"])\n",
"# validate the response by the pydantic model\n",
"capital_info = CapitalInfo.model_validate(response_data)\n",
"print_highlight(f\"Validated response: {capital_info.model_dump_json()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**JSON Schema Directly**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"json_schema = json.dumps(\n", "json_schema = json.dumps(\n",
" {\n", " {\n",
" \"type\": \"object\",\n", " \"type\": \"object\",\n",
...@@ -379,6 +473,13 @@ ...@@ -379,6 +473,13 @@
"### JSON" "### JSON"
] ]
}, },
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Using Pydantic**"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -386,7 +487,49 @@
"outputs": [],
"source": [
"import json\n",
"from pydantic import BaseModel, Field\n",
"\n",
"\n",
"prompts = [\n",
" \"Give me the information of the capital of China in the JSON format.\",\n",
" \"Give me the information of the capital of France in the JSON format.\",\n",
" \"Give me the information of the capital of Ireland in the JSON format.\",\n",
"]\n",
"\n",
"\n",
"# Define the schema using Pydantic\n",
"class CapitalInfo(BaseModel):\n",
" name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n",
" population: int = Field(..., description=\"Population of the capital city\")\n",
"\n",
"\n",
"sampling_params = {\n",
" \"temperature\": 0.1,\n",
" \"top_p\": 0.95,\n",
" \"json_schema\": json.dumps(CapitalInfo.model_json_schema()),\n",
"}\n",
"\n",
"outputs = llm_xgrammar.generate(prompts, sampling_params)\n",
"for prompt, output in zip(prompts, outputs):\n",
" print_highlight(\"===============================\")\n",
" print_highlight(f\"Prompt: {prompt}\") # validate the output by the pydantic model\n",
" capital_info = CapitalInfo.model_validate_json(output[\"text\"])\n",
" print_highlight(f\"Validated output: {capital_info.model_dump_json()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**JSON Schema Directly**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prompts = [\n", "prompts = [\n",
" \"Give me the information of the capital of China in the JSON format.\",\n", " \"Give me the information of the capital of China in the JSON format.\",\n",
" \"Give me the information of the capital of France in the JSON format.\",\n", " \"Give me the information of the capital of France in the JSON format.\",\n",
@@ -29,7 +29,7 @@ The core features include:
backend/native_api.ipynb
backend/offline_engine_api.ipynb
backend/structured_outputs.ipynb
backend/server_arguments.md
.. toctree::
@@ -2,9 +2,9 @@
Welcome to **SGLang**! We appreciate your interest in contributing. This guide provides a concise overview of how to set up your environment, run tests, build documentation, and open a Pull Request (PR). Whether you’re fixing a small bug or developing a major feature, we encourage following these steps for a smooth contribution process.
## Setting Up & Building from Source
### Fork and Clone the Repository
**Note**: SGLang does **not** accept PRs on the main repo. Please fork the repository under your GitHub account, then clone your fork locally.
@@ -13,7 +13,7 @@ git clone https://github.com/<your_user_name>/sglang.git
cd sglang
```
### Install Dependencies & Build
Refer to [Install SGLang](https://sgl-project.github.io/start/install.html) documentation for more details on setting up the necessary dependencies.
@@ -32,7 +32,7 @@ cd sglang/python
pip install .
```
## Code Formatting with Pre-Commit
We use [pre-commit](https://pre-commit.com/) to maintain consistent code style checks. Before pushing your changes, please run:
@@ -45,11 +45,11 @@ pre-commit run --all-files
- **`pre-commit run --all-files`** manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks **before** creating a Pull Request.
- **Do not commit** directly to the `main` branch. Always create a new branch (e.g., `feature/my-new-feature`), push your changes, and open a PR from that branch.
## Writing Documentation & Running Docs CI
Most documentation files are located under the `docs/` folder. We prefer **Jupyter Notebooks** over Markdown so that all examples can be executed and validated by our docs CI pipeline.
### Docs Workflow
Add or update your Jupyter notebooks in the appropriate subdirectories under `docs/`. If you add new files, remember to update `index.rst` (or relevant `.rst` files) accordingly.
@@ -114,11 +114,11 @@ llm.shutdown()
```
## Running Unit Tests & Adding to CI
SGLang uses Python’s built-in [unittest](https://docs.python.org/3/library/unittest.html) framework. You can run tests either individually or in suites.
### Test Backend Runtime
```bash
cd sglang/test/srt
@@ -133,7 +133,7 @@ python3 -m unittest test_srt_endpoint.TestSRTEndpoint.test_simple_decode
python3 run_suite.py --suite minimal
```
### Test Frontend Language
```bash
cd sglang/test/lang
@@ -149,13 +149,13 @@ python3 -m unittest test_openai_backend.TestOpenAIBackend.test_few_shot_qa
python3 run_suite.py --suite minimal
```
### Adding or Updating Tests in CI
- Create new test files under `test/srt` or `test/lang` depending on the type of test.
- Ensure they are referenced in the respective `run_suite.py` (e.g., `test/srt/run_suite.py` or `test/lang/run_suite.py`) so they’re picked up in CI.
- In CI, all tests run automatically. You may modify the workflows in [`.github/workflows/`](https://github.com/sgl-project/sglang/tree/main/.github/workflows) to add custom test groups or extra checks.
### Writing Elegant Test Cases
- Examine existing tests in [sglang/test](https://github.com/sgl-project/sglang/tree/main/test) for practical examples.
- Keep each test function focused on a single scenario or piece of functionality.
@@ -164,7 +164,7 @@ python3 run_suite.py --suite minimal
- Clean up resources to avoid side effects and preserve test independence.
## Tips for Newcomers
If you want to contribute but don’t have a specific idea in mind, pick issues labeled [“good first issue” or “help wanted”](https://github.com/sgl-project/sglang/issues?q=is%3Aissue+label%3A%22good+first+issue%22%2C%22help+wanted%22). These tasks typically have lower complexity and provide an excellent introduction to the codebase. Also check out this [code walk-through](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through) for a deeper look into SGLang’s workflow.