| CUDA Graph | Capturing and replaying entire sequences of GPU operations as a single graph, thereby reducing kernel launch overhead and synchronization delays | ✅ | `--disable-cuda-graph` |
| Radix Cache | Organizes the KV cache in a radix tree, enabling automatic detection and reuse of shared prompt prefixes across multiple generation calls, thereby reducing redundant computations | ✅ | `--disable-radix-cache` |
| Flashinfer MLA | Multi-head Latent Attention (MLA) implemented by FlashInfer, replacing the default Triton attention backend | ❌ | `--enable-flashinfer-mla` |
| Speculative Decoding (`Next-N`) | Dynamically generating a context-aware draft token tree with a smaller, well-calibrated model and then verifying these tokens in parallel with the original LLM, thereby reducing expensive forward passes while preserving output quality | ❌ | `--speculative-algorithm`,<br>`--speculative-draft`,<br>`--speculative-num-steps`,<br>`--speculative-eagle-topk`,<br>`--speculative-num-draft-tokens` |
| Tensor Parallelism (`tp`) | Splitting the heavy tensor operations—such as the matrix multiplications in self-attention and feedforward layers—across multiple GPUs, thereby lowering the per-device memory burden and enabling simultaneous computation for reduced latency | ✅ (=1) | `--tp-size` |
| Expert Parallelism (`EP-MoE`) | Distributing the computation of different expert subnetworks across multiple devices, thereby reducing memory constraints and communication overhead while enabling simultaneous, efficient processing of input tokens | ❌ | `--enable-ep-moe`,<br>`--ep-size` |
| Data Parallelism Attention (`DP-Attention`) | Partitioning the MLA attention across DP workers—each handling independent prefill, decode, and idle batches—to significantly reduce per-worker KV cache size and enable larger, more efficient batch processing | ❌ | `--enable-dp-attention` |
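
For reference, a minimal launch command wiring a few of these flags together might look like the sketch below. The model path and TP degree are placeholders rather than recommended settings, and flag availability may vary across SGLang versions.

```bash
# Minimal sketch of a baseline launch; model path and TP degree are placeholders.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp-size 8 \
  --enable-flashinfer-mla \
  --trust-remote-code
```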
## General Advice
* Speculative Decoding works well at low concurrency (fewer than 32 concurrent requests), but its performance degrades quickly as concurrency increases.
* `CUDA Graph` boosts inference performance significantly at the cost of increased memory usage. Disabling `CUDA Graph` can be a worthwhile trade-off when the freed memory allows higher concurrency and better overall throughput.
* `DP-Attention` is a must at high concurrency (more than 256 concurrent requests), but it hurts per-request decoding speed. Illustrative configurations for both regimes are sketched below.
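
As a rough illustration of these trade-offs, here are two hedged configurations: one for the low-concurrency regime with speculative decoding and one for the high-concurrency regime with `DP-Attention`. The model path, draft-model path, and all numeric values are placeholders to adapt, not benchmarked recommendations.

```bash
# Low concurrency (< 32): speculative decoding; Radix Cache is disabled because of
# the incompatibility listed under Known Issues. Draft model path and numeric
# values are placeholders.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp-size 8 \
  --speculative-algorithm NEXTN \
  --speculative-draft <nextn-draft-model-path> \
  --speculative-num-steps 2 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 4 \
  --disable-radix-cache

# High concurrency (> 256): DP-Attention, optionally trading CUDA Graph for extra
# memory headroom and larger batches.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp-size 8 \
  --enable-dp-attention \
  --disable-cuda-graph \
  --max-running-requests 256
```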
## Known Issues
* Speculative Decoding is not compatible with:
  - `Flashinfer MLA`
  - `Radix Cache`
  - `DP-Attention`
  - Both `CUDA Graph` and `Torch Compile` enabled simultaneously
* `EP-MoE` is not supported with both `CUDA Graph` and `Torch Compile` enabled
* To run `DP-Attention` at high concurrency, you must first run a warmup phase at low concurrency (e.g. `bs=16`, `total req=32`) to avoid CUDA out-of-memory errors; a warmup sketch follows below.
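
A minimal warmup sketch, assuming the server's native `/generate` endpoint on the default port 30000; adjust the URL and sampling parameters to your deployment. It sends 32 short requests with at most 16 in flight, mirroring the `bs=16`, `total req=32` suggestion above.

```bash
# Hedged warmup sketch: 2 waves x 16 concurrent requests = 32 total.
for wave in 1 2; do
  for i in $(seq 16); do
    curl -s -o /dev/null http://127.0.0.1:30000/generate \
      -H "Content-Type: application/json" \
      -d '{"text": "warmup", "sampling_params": {"max_new_tokens": 32}}' &
  done
  wait  # let the current wave of 16 requests finish before starting the next
done
```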
- [^1]: DeepSeek-R1 cannot give correct output, or has precision issues, when quantization is used (fixed in [b110084](https://github.com/sgl-project/sglang/commit/b110084654a1986f0148901190e2f280c605476f)).
- [^2]: TPS@1 (Tokens Per Second for a single request) is read directly from SGLang's logging.
- [^3]: CUDA error at graph capture.
- [^4]: CUDA out of memory.
- [^5]: Requires setting `mem-fraction-static=0.7` to avoid OOM errors.
- [^6]: TypeError: object of type 'NoneType' has no len().
- [^7]: All statistics are collected from the test bench. Token count is calculated using the same tokenizer used in inference.
- [^8]: Average Throughput(prefill+decode, token/s) = (total tokens)/(total time).
- [^9]: Average Decoding Throughput = (sum of (output tokens/duration) for each successful request)/(number of successful requests).
- [^10]: The maximum number of requests run concurrently on an SGLang backend, controlled by `--max-running-requests`.
- [^11]: Tested by [Lzhang-Hub](https://github.com/sgl-project/sglang/issues/3956#issuecomment-2700514223).