[Feature] support sequence parallelism using compilation pass (#16155)

Signed-off-by: cascade812 <cascade812@outlook.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

[Feature] support sequence parallelism using compilation pass (#16155)
Signed-off-by: cascade812 <cascade812@outlook.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
690fe019 · cascade · GitHub · ed7a29d9 · 690fe019
Unverified Commit 690fe019 authored Apr 27, 2025 by cascade Committed by GitHub Apr 27, 2025
Hide whitespace changes
Inline Side-by-side

Showing with 9 additions and 1 deletion

vllm/v1/worker/gpu_model_runner.py vllm/v1/worker/gpu_model_runner.py +9 -1

No files found.
--- a/vllm/v1/worker/gpu_model_runner.py
+++ b/vllm/v1/worker/gpu_model_runner.py
@@ -1027,7 +1027,15 @@ class GPUModelRunner(LoRAModelRunnerMixin):
                num_scheduled_tokens)
        else:
            # Eager mode.
-            num_input_tokens = num_scheduled_tokens
+            # Pad tokens to multiple of tensor_parallel_size when
+            # enabled collective fusion for SP
+            tp_size = self.vllm_config.parallel_config.tensor_parallel_size
+            if self.vllm_config.compilation_config.pass_config. \
+                enable_sequence_parallelism and tp_size > 1:
+                from vllm.utils import round_up
+                num_input_tokens = round_up(num_scheduled_tokens, tp_size)
+            else:
+                num_input_tokens = num_scheduled_tokens
        attn_metadata.num_input_tokens = num_input_tokens

        # _prepare_inputs may reorder the batch, so we must gather multi