# Server Arguments
This page provides a list of server arguments used in the command line to configure the behavior
and performance of the language model server during deployment. These arguments enable users to
customize key aspects of the server, including model selection, parallelism policies,
memory management, and optimization techniques.
## Common launch commands
- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total. We recommend the [SGLang Router](../router/router.md) for data parallelism.
```bash
python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
```
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
- See [hyperparameter tuning](hyperparameter_tuning.md) for guidance on tuning hyperparameters for better performance.
- For Docker and Kubernetes runs, you need to set up shared memory, which is used for communication between processes. See `--shm-size` for Docker and the `/dev/shm` size update for Kubernetes manifests.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable `torch.compile` acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. By default, the cache path is `/tmp/torchinductor_root`; you can customize it with the `TORCHINDUCTOR_CACHE_DIR` environment variable. For more details, please refer to the [PyTorch official documentation](https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html) and [Enabling cache for torch.compile](https://docs.sglang.ai/backend/hyperparameter_tuning.html#enabling-cache-for-torch-compile).
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports other [quantization strategies (INT8/FP8)](https://github.com/sgl-project/sglang/blob/v0.3.6/python/sglang/srt/server_args.py#L671) as well.
- To enable fp8 weight quantization, add `--quantization fp8` on an fp16 checkpoint, or directly load an fp8 checkpoint without specifying any arguments.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](custom_chat_template.md).
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port; then you can use the following commands. If you encounter a deadlock, try adding `--disable-cuda-graph`.
```bash
# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0
# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
```
Please consult the documentation below and [server_args.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py) to learn more about the arguments you may provide when launching a server.
## Model, processor and tokenizer
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `model_path` | Path to the model that will be served. | None |
| `tokenizer_path` | Path to the tokenizer. Defaults to the `model_path`. | None |
| `tokenizer_mode` | See the [tokenizer documentation](https://huggingface.co/docs/transformers/en/main_classes/tokenizer) for the available modes. | `auto` |
| `load_format` | The format the weights are loaded in. | `auto` |
| `trust_remote_code` | If `true`, use locally cached config files; otherwise, use the remote configs from Hugging Face. | `False` |
| `dtype` | Dtype used for the model. | `bfloat16` |
| `kv_cache_dtype` | Dtype of the KV cache. | `dtype` |
| `context_length` | The number of tokens our model can process, *including the input*. Note that extending the default might lead to strange behavior. | None |
| `device` | The device the model is placed on. | None |
| `chat_template` | The chat template to use. For multi-modal chat templates, see the [vision API docs](https://docs.sglang.ai/backend/openai_api_vision.ipynb#Chat-Template). **Make sure the correct `chat_template` is passed, or performance may degrade.** | None |
| `is_embedding` | Set to `true` to perform [embedding](./openai_api_embeddings.ipynb) / [encode](https://docs.sglang.ai/backend/native_api#Encode-(embedding-model)) and [reward](https://docs.sglang.ai/backend/native_api#Classify-(reward-model)) tasks. | `False` |
| `revision` | Adjust if a specific version of the model should be used. | None |
| `skip_tokenizer_init` | Set to `true` to provide the tokens to the engine and get the output tokens directly, typically used in RLHF. See this [example](https://github.com/sgl-project/sglang/blob/main/examples/runtime/token_in_token_out/). | `False` |
| `json_model_override_args` | Override the model config with the provided JSON. | `"{}"` |
| `delete_ckpt_after_loading` | Delete the model checkpoint after loading the model. | `False` |
| `disable_fast_image_processor` | Use the base image processor instead of the fast image processor (the default). See [details](https://huggingface.co/docs/transformers/main/en/main_classes/image_processor#image-processor). | `False` |
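
These arguments map directly to launch flags (underscores become dashes on the command line). As a minimal sketch, a launch that pins the dtype and the context window might look like this; the values are illustrative, not recommendations:

```bash
# Illustrative values: serve in float16 with an explicit 8k context window.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype float16 --context-length 8192
```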
## Serving: HTTP & API

### HTTP Server configuration
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `host` | Host for the HTTP server. | `"127.0.0.1"` |
| `port` | Port for the HTTP server. | `30000` |
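
For example, to expose the server on all interfaces on a custom port (illustrative values):

```bash
# Illustrative: bind to all interfaces on port 8000 instead of the defaults.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 --port 8000
```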
### API configuration
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `api_key` | Sets an API key for the server and the OpenAI-compatible API. | None |
| `file_storage_path` | Directory for storing uploaded or generated files from API calls. | `"sglang_storage"` |
| `enable_cache_report` | If set, includes detailed usage of cached tokens in the response usage. | `False` |
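
As a sketch, a key set at launch is then presented by clients in the standard OpenAI-style `Authorization` header; the key value below is a placeholder:

```bash
# Illustrative: protect the server with an API key (placeholder value).
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --api-key sk-my-secret-key

# Clients authenticate with the usual OpenAI-style bearer header.
curl http://127.0.0.1:30000/v1/models -H "Authorization: Bearer sk-my-secret-key"
```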
## Parallelism

### Tensor parallelism
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `tp_size` | The number of GPUs the model weights get sharded over. Mainly for saving memory rather than for high throughput; see [this tutorial on how tensor parallelism works](https://pytorch.org/tutorials/intermediate/TP_tutorial.html#how-tensor-parallel-works). | `1` |
### Data parallelism
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `dp_size` | Will be deprecated. The number of data-parallel copies of the model. The [SGLang router](../router/router.md) is recommended instead of the current naive data parallelism. | `1` |
| `load_balance_method` | Will be deprecated. Load balancing strategy for data-parallel requests. | `"round_robin"` |
### Expert parallelism
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `enable_ep_moe` | Enables expert parallelism, which distributes the experts of MoE models onto multiple GPUs. | `False` |
| `ep_size` | The size of EP. Please shard the model weights with `tp_size=ep_size`; for detailed benchmarking, refer to [this PR](https://github.com/sgl-project/sglang/pull/2203). | `1` |
| `enable_deepep_moe` | Enables expert parallelism that distributes the experts of the DeepSeek-V3 model onto multiple GPUs, based on `deepseek-ai/DeepEP`. | `False` |
| `deepep_mode` | Selects the mode when DeepEP MoE is enabled: `normal`, `low_latency`, or `auto`. `auto` means `low_latency` for decode batches and `normal` for prefill batches. | `auto` |
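
As a sketch, serving an MoE model with its experts sharded across two GPUs might look like the following; the model choice is illustrative:

```bash
# Illustrative: shard the experts of an MoE model across 2 GPUs.
python -m sglang.launch_server --model-path mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tp 2 --enable-ep-moe --ep-size 2
```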
## Memory and scheduling
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `mem_fraction_static` | Fraction of the free GPU memory used for static memory like model weights and KV cache. Increase it if building the KV cache fails; decrease it if CUDA runs out of memory. | None |
| `max_running_requests` | The maximum number of requests to run concurrently. | None |
| `max_total_tokens` | The maximum number of tokens that can be stored in the KV cache. Mainly used for debugging. | None |
| `chunked_prefill_size` | Perform prefill in chunks of this size. Larger sizes speed up prefill but increase VRAM usage. Decrease if CUDA runs out of memory. | None |
| `max_prefill_tokens` | Token budget for how many tokens can be accepted in one prefill batch. The actual limit is the max of this value and `context_length`. | `16384` |
| `schedule_policy` | The scheduling policy to control how waiting prefill requests are processed by a single engine. | `"fcfs"` |
| `schedule_conservativeness` | Controls how conservative the server is when accepting new requests. High conservativeness may cause starvation; low conservativeness may reduce performance. | `1.0` |
| `cpu_offload_gb` | Amount of RAM (in GB) to reserve for offloading model parameters to the CPU. | `0` |
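
These knobs are the usual first stop for out-of-memory issues. A conservative launch might look like the following sketch (illustrative values; see [hyperparameter tuning](hyperparameter_tuning.md) for how to pick them):

```bash
# Illustrative: smaller KV cache pool, capped concurrency, smaller prefill chunks.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --mem-fraction-static 0.8 --max-running-requests 32 --chunked-prefill-size 4096
```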
## Other runtime options
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `stream_interval` | Interval (in tokens) for streaming responses. Smaller values lead to smoother streaming; larger values improve throughput. | `1` |
| `random_seed` | Can be used to enforce more deterministic behavior. | None |
| `watchdog_timeout` | Timeout (in seconds) for the watchdog thread before it kills the server if batch generation takes too long. | `300` |
| `download_dir` | Overrides the default Hugging Face cache directory for model weights. | None |
| `base_gpu_id` | Sets the first GPU to use when distributing the model across the available GPUs. | `0` |
| `allow_auto_truncate` | Automatically truncate requests that exceed the maximum input length. | `False` |
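
For example (illustrative values), to pin a seed for more reproducible runs and start placing the model from the second GPU:

```bash
# Illustrative: fixed seed, streaming every 4 tokens, placement starting at GPU 1.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --random-seed 42 --stream-interval 4 --base-gpu-id 1
```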
## Logging
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `log_level` | Global log verbosity. | `"info"` |
| `log_level_http` | Separate verbosity level for the HTTP server logs. If unset, defaults to `log_level`. | None |
| `log_requests` | Logs the inputs and outputs of all requests for debugging. | `False` |
| `log_requests_level` | Ranges from 0 to 2: level 0 shows only some basic metadata in requests; levels 1 and 2 show request details (e.g., text, images), with level 1 limiting output to 2048 characters. | `0` |
| `show_time_cost` | Prints or logs detailed timing info for internal operations (helpful for performance tuning). | `False` |
| `enable_metrics` | Exports Prometheus-like metrics for request usage and performance. | `False` |
| `decode_log_interval` | How often (in tokens) to log decode progress. | `40` |
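
A debugging-oriented launch might combine these as follows; this is a sketch, not a production configuration:

```bash
# Illustrative: verbose logs, per-request logging, and Prometheus-style metrics.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --log-level debug --log-requests --enable-metrics
```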
## Multi-node distributed serving
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `dist_init_addr` | The TCP address used for initializing PyTorch's distributed backend (e.g., `192.168.0.2:25000`). | None |
| `nnodes` | Total number of nodes in the cluster. See the [Llama 405B guide](https://docs.sglang.ai/references/llama_405B.html#run-405b-fp16-on-two-nodes). | `1` |
| `node_rank` | Rank (ID) of this node among the `nnodes` in the distributed setup. | `0` |
## LoRA
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `lora_paths` | A list of adapters to apply to your model. Each batch element uses the corresponding LoRA adapter. Currently `cuda_graph` and `radix_attention` are not supported with this option, so they must be disabled manually. See the related [issues](https://github.com/sgl-project/sglang/issues/2929). | None |
| `max_loras_per_batch` | Maximum number of LoRAs allowed in a running batch, including the base model. | `8` |
| `lora_backend` | Backend used to run GEMM kernels for LoRA modules. Can be `triton` or `flashinfer`. | `triton` |
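
A sketch of a LoRA launch, with the currently unsupported features disabled as noted above; the adapter path is a hypothetical placeholder:

```bash
# Illustrative: serve with one LoRA adapter (placeholder path); CUDA graph and
# radix attention are not yet supported together with LoRA, so disable them.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --lora-paths my-org/my-lora-adapter \
  --disable-cuda-graph --disable-radix-cache
```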
## Kernel backend
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `attention_backend` | The backend for attention computation and KV cache management. Can be `fa3`, `flashinfer`, `triton`, `cutlass_mla`, or `torch_native`. When deploying DeepSeek models, use this argument to specify the MLA backend. | None |
| `sampling_backend` | The backend used for sampling. | None |
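
For example, to force a specific attention backend rather than letting the server choose (the choice here is illustrative):

```bash
# Illustrative: explicitly select the Triton attention backend.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --attention-backend triton
```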
## Constrained Decoding
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `grammar_backend` | The grammar backend for constrained decoding. See [detailed usage](https://docs.sglang.ai/backend/structured_outputs.html). | None |
| `constrained_json_whitespace_pattern` | Use with the `Outlines` grammar backend to allow JSON with syntactic newlines, tabs, or multiple spaces. See [details](https://dottxt-ai.github.io/outlines/latest/reference/generation/json/#using-pydantic). | None |
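
As a sketch, selecting the Outlines backend and relaxing the JSON whitespace pattern; the regex shown is an illustrative example, not a required value:

```bash
# Illustrative: Outlines grammar backend with a relaxed JSON whitespace pattern.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --grammar-backend outlines --constrained-json-whitespace-pattern "[\n\t ]*"
```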
## Speculative decoding
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `speculative_draft_model_path` | The draft model path for speculative decoding. | None |
| `speculative_algorithm` | The algorithm for speculative decoding. Currently [EAGLE](https://arxiv.org/html/2406.16858v1) and [EAGLE3](https://arxiv.org/pdf/2503.01840) are supported. Note that the radix cache, chunked prefill, and overlap scheduler are disabled when using EAGLE speculative decoding. | None |
| `speculative_num_steps` | How many draft passes we run before verifying. | None |
| `speculative_num_draft_tokens` | The number of tokens proposed in a draft. | None |
| `speculative_eagle_topk` | The number of top candidates we keep for verification at each step for [EAGLE](https://arxiv.org/html/2406.16858v1). | None |
| `speculative_token_map` | Optional. The path to the high-frequency token list of [FR-Spec](https://arxiv.org/html/2502.14856v1), used for accelerating [EAGLE](https://arxiv.org/html/2406.16858v1). | None |
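
A sketch of an EAGLE launch; the draft model and the numbers are illustrative (the draft model is assumed to be available on Hugging Face and must match your base model):

```bash
# Illustrative: EAGLE speculative decoding with a matching draft model.
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
  --speculative-num-steps 3 --speculative-eagle-topk 4 --speculative-num-draft-tokens 16
```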
## Double Sparsity
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `enable_double_sparsity` | Enables [double sparsity](https://arxiv.org/html/2408.07092v2), which increases throughput. | `False` |
| `ds_channel_config_path` | The double sparsity config. See [this guide on how to generate the config for your model](https://github.com/andy-yang-1/DoubleSparse/tree/main/config). | None |
| `ds_heavy_channel_num` | Number of channel indices to keep for each layer. | `32` |
| `ds_heavy_token_num` | Number of tokens used for attention during decode. Sparse decoding is skipped if `min_seq_len` in the batch is less than this number. | `256` |
| `ds_heavy_channel_type` | The type of heavy channels. Options are `q`, `k`, or `qk`. | `qk` |
| `ds_sparse_decode_threshold` | Sparse decoding is not applied if `max_seq_len` in the batch is less than this threshold. | `4096` |
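
A sketch of a double-sparsity launch; the config path is a hypothetical placeholder, so generate a real one with the repo linked above:

```bash
# Illustrative: double sparsity with a pre-generated channel config (placeholder path).
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-double-sparsity --ds-channel-config-path ./ds_channel_config.json
```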
## Debug options

*Note: For the best possible performance, we recommend staying with the defaults and using these options only for debugging.*
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `disable_radix_cache` | Disable the [Radix](https://lmsys.org/blog/2024-01-17-sglang/) backend for prefix caching. | `False` |
| `disable_cuda_graph` | Disable [CUDA Graph](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/) for model forward passes. Use if you encounter uncorrectable CUDA ECC errors. | `False` |
| `disable_cuda_graph_padding` | Disable CUDA Graph when padding is needed; otherwise, still use CUDA Graph. | `False` |
| `disable_outlines_disk_cache` | Disable the disk cache for the Outlines grammar backend. | `False` |
| `disable_custom_all_reduce` | Disable usage of the custom all-reduce kernel. | `False` |
| `disable_mla` | Disable [Multi-Head Latent Attention](https://arxiv.org/html/2405.04434v5) for DeepSeek models. | `False` |
| `disable_overlap_schedule` | Disable the [Overlap Scheduler](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#zero-overhead-batch-scheduler). | `False` |
| `enable_nan_detection` | Enable a warning if the logits contain `NaN`. | `False` |
| `enable_p2p_check` | Turns off the default of always allowing P2P (peer-to-peer) access when accessing the GPU. | `False` |
| `triton_attention_reduce_in_fp32` | In Triton kernels, cast the intermediate attention result to `float32`. | `False` |
## Optimization

*Note: Some of these options are still in an experimental stage.*
| Arguments | Description | Defaults |
|-----------|-------------|----------|
| `enable_mixed_chunk` | Enables mixing prefill and decode; see [this discussion](https://github.com/sgl-project/sglang/discussions/1163). | `False` |
| `enable_dp_attention` | Enable [Data Parallelism Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models) for DeepSeek models. | `False` |
| `enable_torch_compile` | Compile the model with `torch.compile`. Note that compiling a model takes a long time but yields a great performance boost. The compiled model can also be [cached for future use](https://docs.sglang.ai/backend/hyperparameter_tuning.html#enabling-cache-for-torch-compile). | `False` |
| `torch_compile_max_bs` | The maximum batch size when using `torch_compile`. | `32` |
| `cuda_graph_max_bs` | Adjust the maximum batch size when using CUDA Graph. By default this is chosen for you based on GPU specifics. | None |
| `cuda_graph_bs` | The batch sizes to capture by `CudaGraphRunner`. By default this is done for you. | None |
| `torchao_config` | Experimental feature that optimizes the model with [torchao](https://github.com/pytorch/ao). Possible choices are: `int8dq`, `int8wo`, `int4wo-<group_size>`, `fp8wo`, `fp8dq-per_tensor`, `fp8dq-per_row`. | `int8dq` |
| `triton_attention_num_kv_splits` | Adjusts the number of KV splits in Triton kernels. | `8` |
| `enable_flashinfer_mla` | Use the attention backend with the FlashInfer MLA wrapper for DeepSeek models. **This argument will be deprecated in the next release. Please use `--attention_backend flashinfer` instead to enable FlashInfer MLA.** | `False` |
| `flashinfer_mla_disable_ragged` | Disable the use of the ragged prefill wrapper for the FlashInfer MLA attention backend. Only use it when FlashInfer is being used as the MLA backend. | `False` |
| `disable_chunked_prefix_cache` | Disable the use of the chunked prefix cache for DeepSeek models. Only use it when FA3 is the attention backend. | `False` |
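
For example, combining `torch.compile` with a torchao quantization config (illustrative values; compilation adds significant startup time):

```bash
# Illustrative: torch.compile acceleration plus int4 weight-only torchao quantization.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-torch-compile --torchao-config int4wo-128
```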