Unverified commit 7a30fa87, authored by Zazzle516, committed by GitHub

[Doc] Clarify cudagraph capture size logic and default behavior in scheduler (#18698)


Signed-off-by: Zazzle516 <2405677060@qq.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent f82f7a89
@@ -3579,30 +3579,40 @@ class VllmConfig:
     def _set_cudagraph_sizes(self):
         """
-        cudagraph batchsize padding logic:
-        `[1, 2, 4] + [8 * i for i in range(1, 1025)]` is a list of all possible
-        batch sizes that cudagraph will capture.
-        Depending on the engine's configuration of `max_num_seqs`, the
-        candidate batch sizes to capture cudagraph will shrink to the subset
-        which just cover the range of `[1, max_num_seqs]`. In the common case,
-        `max_num_seqs` is 256, and the cudagraph batch sizes will be
-        `[1, 2, 4, 8, 16, 24, 32, 40, ..., 256]`.
-        However, if users specify the cudagraph capture sizes through
-        compilation config, we will use the specified sizes instead.
+        vLLM defines the default candidate list of batch sizes for CUDA graph
+        capture as:
+        ```python
+        max_graph_size = min(max_num_seqs * 2, 512)
+        # 1, 2, 4, then multiples of 8 up to max_graph_size
+        cuda_graph_sizes = [1, 2, 4, 8, 16, 24, 32, 40, ..., max_graph_size]
         In the end, `vllm_config.compilation_config.cudagraph_capture_sizes`
         will be the final sizes to capture cudagraph (in descending order).
-        During runtime, if batchsize is larger than
-        `vllm_config.compilation_config.cudagraph_capture_sizes`,
-        no cudagraph will be used.
-        If the batch size is no larger than
-        `vllm_config.compilation_config.cudagraph_capture_sizes`,
-        we can quickly find the padded graph size for a given batch size by
-        looking up `vllm_config.compilation_config.bs_to_padded_graph_size`.
+        These sizes are used to capture and reuse CUDA graphs for
+        performance-critical paths (e.g., decoding). Capturing enables
+        significantly faster kernel dispatch by avoiding Python overhead. The
+        list is then filtered based on `max_num_batched_tokens` (e.g., 8192 on
+        most GPUs), which controls the total allowed number of tokens in a
+        batch. Since each sequence may have a variable number of tokens, the
+        maximum usable batch size will depend on actual sequence lengths.
+        Example:
+            With `max_num_batched_tokens = 8192`, and typical sequences
+            averaging ~32 tokens, most practical batch sizes fall below 256.
+            However, the system will still allow capture sizes up to 512 if
+            shape and memory permit.
+        Note:
+            If users explicitly specify cudagraph capture sizes in the
+            compilation config, those will override this default logic.
+            At runtime:
+            - If batch size <= one of the `cudagraph_capture_sizes`, the closest
+            padded CUDA graph will be used.
+            - If batch size > largest `cudagraph_capture_sizes`, cudagraph will
+            not be used.
         """
         # calculate the default `batch_size_capture_list`
......
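
The behavior described in the new docstring can be sketched in isolation. This is a minimal illustration, not vLLM's implementation: the helper names `default_capture_sizes` and `padded_graph_size` are hypothetical, the filtering by `max_num_batched_tokens` is not modeled, and "closest padded CUDA graph" is assumed to mean the smallest captured size that is at least the batch size.

```python
from bisect import bisect_left
from typing import Optional


def default_capture_sizes(max_num_seqs: int, cap: int = 512) -> list[int]:
    """Hypothetical helper: 1, 2, 4, then multiples of 8 up to the limit."""
    # Upper bound taken from the docstring: min(max_num_seqs * 2, 512).
    max_graph_size = min(max_num_seqs * 2, cap)
    sizes = [1, 2, 4] + list(range(8, max_graph_size + 1, 8))
    # Drop candidates above the limit (only matters when the limit is below 4).
    return [s for s in sizes if s <= max_graph_size]


def padded_graph_size(batch_size: int, capture_sizes: list[int]) -> Optional[int]:
    """Hypothetical helper: smallest captured size >= batch_size, else None."""
    ordered = sorted(capture_sizes)
    idx = bisect_left(ordered, batch_size)
    # None models the "cudagraph will not be used" case from the docstring.
    return ordered[idx] if idx < len(ordered) else None


if __name__ == "__main__":
    sizes = default_capture_sizes(max_num_seqs=16)
    print(sizes)
    print(padded_graph_size(17, sizes))
    print(padded_graph_size(33, sizes))
```

Under these assumptions, `max_num_seqs = 16` yields `[1, 2, 4, 8, 16, 24, 32]`; a batch of 17 pads up to 24, and a batch of 33 exceeds the largest captured size, so no graph applies.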