Unverified Commit a97df791 authored by Lianmin Zheng, committed by GitHub

Clean up readme and arguments of chunked prefill (#1022)

parent 33d61356
@@ -139,23 +139,23 @@ print(response)
 It supports streaming, vision, and most features of the Chat/Completions/Models/Batch endpoints specified by the [OpenAI API Reference](https://platform.openai.com/docs/api-reference/).
 ### Additional Server Arguments
-- Add `--tp 2` to enable tensor parallelism. If it indicates `peer access is not supported between these two devices`, add `--enable-p2p-check` option.
+- Add `--tp 2` to enable multi-GPU tensor parallelism. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --tp 2
 ```
-- Add `--dp 2` to enable data parallelism. It can also be used together with tp. Data parallelism is better for throughput if there is enough memory.
+- Add `--dp 2` to enable multi-GPU data parallelism. It can also be used together with tensor parallelism. Data parallelism is better for throughput if there is enough memory.
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --dp 2 --tp 2
 ```
-- If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
+- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --mem-fraction-static 0.7
 ```
-- If you see out-of-memory errors during prefill for long prompts on a model that supports long context, consider using chunked prefill.
+- See [hyperparameter_tuning.md](docs/en/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
+- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
 ```
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --chunked-prefill-size 8192
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --chunked-prefill-size 2048
 ```
-- See [hyperparameter_tuning.md](docs/en/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
 - Add `--nnodes 2` to run tensor parallelism on multiple nodes. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port.
 ```
 # Node 0
@@ -165,13 +165,13 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
 ```
 - If the model does not have a template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/en/custom_chat_template.md).
-- To enable fp8 quantization, you can add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable experimental torch.compile support, you can add `--enable-torch-compile`. It accelerates small models on small batch sizes.
+- To enable fp8 quantization, you can add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 ### Supported Models
 - Llama / Llama 2 / Llama 3 / Llama 3.1
-- Mistral / Mixtral
+- Mistral / Mixtral / Mistral NeMo
 - Gemma / Gemma 2
 - Qwen / Qwen 2 / Qwen 2 MoE
 - DeepSeek / DeepSeek 2
@@ -189,7 +189,6 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 - Grok
 - ChatGLM
 - InternLM 2
-- Mistral NeMo
 Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/en/model_support.md).
@@ -231,7 +230,7 @@ GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/
 ```
 ## Frontend: Structured Generation Language (SGLang)
-The frontend language can be used with local models or API models.
+The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may find it easier to use for complex prompting workflows.
 ### Quick Start
 The example below shows how to use sglang to answer a multi-turn question.
...
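The two README bullets above mention `--quantization fp8` and `--enable-torch-compile` without a command example. The following launch commands are an illustrative sketch only (not part of this commit); the model path and port are reused from the README examples above.
```
# Illustrative only: serve a fp16 checkpoint with fp8 quantization enabled.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --quantization fp8

# Illustrative only: turn on the experimental torch.compile path,
# which mainly helps small models at small batch sizes.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --enable-torch-compile
```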
@@ -118,11 +118,7 @@ class ModelTpServer:
             trust_remote_code=server_args.trust_remote_code,
         )
         self.max_total_num_tokens = self.model_runner.max_total_num_tokens
-        self.max_prefill_tokens = (
-            16384
-            if server_args.max_prefill_tokens is None
-            else server_args.max_prefill_tokens
-        )
+        self.max_prefill_tokens = server_args.max_prefill_tokens
         self.max_running_requests = min(
             (
                 self.max_total_num_tokens // 2
...
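The hunk above drops the hard-coded `16384` fallback in `ModelTpServer`; the default now lives on `ServerArgs.max_prefill_tokens` (see the next file). The flag can still be overridden at launch. A hedged sketch, where the value `32768` is purely illustrative:
```
# Illustrative only: raise the prefill-batch token budget above the 16384 default.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --max-prefill-tokens 32768
```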
@@ -43,10 +43,11 @@ class ServerArgs:
     # Memory and scheduling
     mem_fraction_static: Optional[float] = None
-    max_prefill_tokens: Optional[int] = None
     max_running_requests: Optional[int] = None
     max_num_reqs: Optional[int] = None
     max_total_tokens: Optional[int] = None
+    chunked_prefill_size: int = -1
+    max_prefill_tokens: int = 16384
     schedule_policy: str = "lpm"
     schedule_conservativeness: float = 1.0
@@ -69,9 +70,6 @@ class ServerArgs:
     dp_size: int = 1
     load_balance_method: str = "round_robin"
-    # Chunked Prefill
-    chunked_prefill_size: Optional[int] = None
     # Optimization/debug options
     disable_flashinfer: bool = False
     disable_flashinfer_sampling: bool = False
@@ -97,6 +95,10 @@ class ServerArgs:
         if self.served_model_name is None:
             self.served_model_name = self.model_path
+        if self.chunked_prefill_size <= 0:
+            # Disable chunked prefill
+            self.chunked_prefill_size = None
         if self.mem_fraction_static is None:
             if self.tp_size >= 16:
                 self.mem_fraction_static = 0.79
@@ -108,6 +110,7 @@ class ServerArgs:
             self.mem_fraction_static = 0.87
         else:
             self.mem_fraction_static = 0.88
+
         if isinstance(self.additional_ports, int):
             self.additional_ports = [self.additional_ports]
         elif self.additional_ports is None:
@@ -232,12 +235,6 @@ class ServerArgs:
             default=ServerArgs.mem_fraction_static,
             help="The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors.",
         )
-        parser.add_argument(
-            "--max-prefill-tokens",
-            type=int,
-            default=ServerArgs.max_prefill_tokens,
-            help="The maximum number of tokens in a prefill batch. The real bound will be the maximum of this value and the model's maximum context length.",
-        )
         parser.add_argument(
             "--max-running-requests",
             type=int,
@@ -256,6 +253,18 @@ class ServerArgs:
             default=ServerArgs.max_total_tokens,
             help="The maximum number of tokens in the memory pool. If not specified, it will be automatically calculated based on the memory usage fraction. This option is typically used for development and debugging purposes.",
         )
+        parser.add_argument(
+            "--chunked-prefill-size",
+            type=int,
+            default=ServerArgs.chunked_prefill_size,
+            help="The maximum number of tokens in a chunk for the chunked prefill. Setting this to -1 means disabling chunked prefill",
+        )
+        parser.add_argument(
+            "--max-prefill-tokens",
+            type=int,
+            default=ServerArgs.max_prefill_tokens,
+            help="The maximum number of tokens in a prefill batch. The real bound will be the maximum of this value and the model's maximum context length.",
+        )
         parser.add_argument(
             "--schedule-policy",
             type=str,
@@ -353,14 +362,6 @@ class ServerArgs:
         )
         parser.add_argument("--node-rank", type=int, help="The node rank.")
-        # Chunked prefill
-        parser.add_argument(
-            "--chunked-prefill-size",
-            type=int,
-            default=ServerArgs.chunked_prefill_size,
-            help="The size of the chunked prefill.",
-        )
         # Optimization/debug options
         parser.add_argument(
             "--disable-flashinfer",
...
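With the reworked arguments above, `--chunked-prefill-size` defaults to `-1` and any non-positive value is normalized to `None` in `__post_init__`, which disables chunked prefill. A hedged sketch of how the flag would be used; the chunk size `4096` is an illustrative value, not a recommendation from this commit:
```
# Illustrative only: enable chunked prefill with a 4096-token chunk.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --chunked-prefill-size 4096

# Illustrative only: explicitly disable chunked prefill (same as the new default of -1).
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --chunked-prefill-size -1
```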