The SGLang and DeepSeek teams collaborated to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs **from day one**. SGLang also supports [MLA optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [DP attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models), making SGLang one of the best open-source LLM engines for running DeepSeek models. SGLang is the inference engine recommended by the official [DeepSeek team](https://github.com/deepseek-ai/DeepSeek-V3/tree/main?tab=readme-ov-file#62-inference-with-sglang-recommended).
Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.
## Hardware Recommendation
- 8 x NVIDIA H200 GPUs
If you do not have GPUs with large enough memory, please try multi-node tensor parallelism. There is an example serving with [2 H20 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208) below.
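As a sketch, two H20 nodes (8 GPUs each) can serve the model with tensor parallelism spanning 16 GPUs. The IP address and port below are placeholders; substitute the first node's actual address:

```shell
# On the first node (assumed IP 10.0.0.1; port 5000 is arbitrary):
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
    --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# On the second node, pointing at the same rendezvous address:
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
    --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
```

Once both ranks join, requests can be sent to the first node's HTTP port as usual.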
## Installation & Launch
...
...
"You can specify a JSON schema, [regular expression](https://en.wikipedia.org/wiki/Regular_expression) or [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) to constrain the model output. The model output will be guaranteed to follow the given constraints. Only one constraint parameter (`json_schema`, `regex`, or `ebnf`) can be specified for a request.\n",
"\n",
"SGLang supports two grammar backends:\n",
"\n",
"- [Outlines](https://github.com/dottxt-ai/outlines) (default): Supports JSON schema and regular expression constraints.\n",
"- [XGrammar](https://github.com/mlc-ai/xgrammar): Supports JSON schema and EBNF constraints.\n",
" - XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md)\n",
"\n",
"Initialize the XGrammar backend using the `--grammar-backend xgrammar` flag\n",
"## Structured Outputs (JSON, Regex, EBNF)\n",
"You can specify a JSON schema, [regular expression](https://en.wikipedia.org/wiki/Regular_expression) or [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) to constrain the model output. The model output will be guaranteed to follow the given constraints. Only one constraint parameter (`json_schema`, `regex`, or `ebnf`) can be specified for a request.\n",
"\n",
"[JSON Schema](https://json-schema.org/): Formats output into structured JSON objects with validation rules.\n",
"\n",
"[EBNF (Extended Backus-Naur Form)](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form): Defines complex syntax rules, especially for recursive patterns like nested structures.\n",
"\n",
"[Regular Expressions](https://en.wikipedia.org/wiki/Regular_expression): Matches text patterns for simple validation and formatting.\n",
"\n",
"SGLang supports two grammar backends: [Outlines](https://github.com/dottxt-ai/outlines) (default) and [XGrammar](https://github.com/mlc-ai/xgrammar). We suggest using XGrammar whenever possible for its better performance. For more details, see the [XGrammar technical overview](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar).\n",
"\n",
"* XGrammar backend: JSON schema and EBNF constraints\n",
"* Outlines backend: JSON schema and regular expression constraints\n",
"\n",
"Initialize the XGrammar backend using the `--grammar-backend xgrammar` flag"
The `sampling_params` follows this format:
```python
# The maximum number of output tokens
max_new_tokens: int = 128,
# Stop when hitting any of the strings in this list
stop: Optional[Union[str, List[str]]] = None,
# Stop when hitting any of the token_ids in this list
stop_token_ids: Optional[List[int]] = [],
# Sampling temperature
temperature: float = 1.0,
...
...
top_p: float = 1.0,
# Top-k sampling
top_k: int = -1,
# Min-p sampling
min_p: float = 0.0,
# Whether to ignore EOS token
ignore_eos: bool = False,
# Whether to skip the special tokens during detokenization
skip_special_tokens: bool = True,
# Whether to add spaces between special tokens during detokenization
spaces_between_special_tokens: bool = True,
# Do parallel sampling and return `n` outputs
n: int = 1,
## Structured Outputs
# Only one of the below three can be set for a request.
# Constrain the output to follow a given JSON schema.
json_schema: Optional[str] = None,
# Constrain the output to follow a given regular expression.
regex: Optional[str] = None,
# Constrain the output to follow a given EBNF grammar.
ebnf: Optional[str] = None,
## Penalties
# Float that penalizes new tokens based on their frequency in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
...
...
The `image_data` can be a file name, a URL, or a base64 encoded string.
Streaming is supported in a manner similar to the [above](#streaming).
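A minimal sketch of consuming the streaming `/generate` endpoint. The server URL and prompt are assumptions, and the HTTP call itself is left commented out so the snippet stands alone; each streamed event is a `data: {...}` line, with the stream terminated by `data: [DONE]`:

```python
import json

# Request payload for the native /generate endpoint with streaming enabled.
payload = {
    "text": "The capital of France is",
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    "stream": True,
}

# import requests
# response = requests.post("http://localhost:30000/generate", json=payload, stream=True)
# for chunk in response.iter_lines(decode_unicode=True):
#     # Skip keep-alive blanks and the final sentinel, then parse the JSON event.
#     if chunk and chunk.startswith("data:") and chunk != "data: [DONE]":
#         print(json.loads(chunk[len("data:"):])["text"], end="", flush=True)
```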
### Structured Outputs (JSON, Regex, EBNF)
You can specify a JSON schema, regular expression or [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) to constrain the model output. The model output will be guaranteed to follow the given constraints. Only one constraint parameter (`json_schema`, `regex`, or `ebnf`) can be specified for a request.

SGLang supports two grammar backends:

- [Outlines](https://github.com/dottxt-ai/outlines) (default): Supports JSON schema and regular expression constraints.
- [XGrammar](https://github.com/mlc-ai/xgrammar): Supports JSON schema and EBNF constraints.
  - XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md)

Initialize the XGrammar backend using the `--grammar-backend xgrammar` flag.
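A minimal sketch of a constrained request against the native `/generate` endpoint. The server URL, prompt, and schema are assumptions for illustration; note that only one of `json_schema`, `regex`, or `ebnf` may be set per request, so the constraint is passed as a single sampling parameter:

```python
import json

# Hypothetical schema: force the model to emit a JSON object with two fields.
city_schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
})

payload = {
    "text": "Give me information about the capital of France.",
    "sampling_params": {
        "max_new_tokens": 64,
        "temperature": 0,
        "json_schema": city_schema,  # mutually exclusive with `regex` and `ebnf`
    },
}

# import requests
# response = requests.post("http://localhost:30000/generate", json=payload)
# print(response.json()["text"])  # guaranteed to match city_schema
```

The same payload shape works with `"regex"` or `"ebnf"` in place of `"json_schema"` when using the corresponding backend.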