[V1][TPU] Do not compile sampling more than needed (#15883)

Signed-off-by: NickLucche <nlucches@redhat.com>

Commit bd7599d3, authored Apr 03, 2025 by Nicolò Lucchesi, committed by GitHub Apr 03, 2025

Showing 1 changed file with 3 additions and 1 deletion

vllm/v1/worker/tpu_model_runner.py +3 -1

--- a/vllm/v1/worker/tpu_model_runner.py
+++ b/vllm/v1/worker/tpu_model_runner.py
@@ -862,7 +862,9 @@ class TPUModelRunner:
                out = self.model.sample_from_hidden(dummy_hidden,
                                                    sampling_meta)
                out = out.cpu()
-                if num_reqs_to_sample >= self.max_num_reqs:
+                # Requests can't be more than tokens. But do compile for the
+                # next bigger value in case num_tokens uses bucketed padding.
+                if num_reqs_to_sample >= min(num_tokens, self.max_num_reqs):
                    break
                # Make sure to compile the `max_num_reqs` upper-limit case
                num_reqs_to_sample = _get_padded_num_reqs_with_upper_limit(
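The effect of the new stop condition can be illustrated with a small standalone simulation. This is only a sketch, not the code in tpu_model_runner.py: `MIN_NUM_REQS`, `next_padded_num_reqs`, and `sampled_sizes` are hypothetical names, the power-of-two padding rule and the starting value of 8 are assumptions standing in for `_get_padded_num_reqs_with_upper_limit`, and appending to a list stands in for one sampling compilation.

```python
# Hypothetical stand-ins; the real helper and constants live in vLLM.
MIN_NUM_REQS = 8  # assumed smallest padded request count


def next_padded_num_reqs(x: int, upper_limit: int) -> int:
    """Assumed padding rule: round x up to a power of two, clamp to the limit."""
    padded = MIN_NUM_REQS if x <= MIN_NUM_REQS else 1 << (x - 1).bit_length()
    return min(padded, upper_limit)


def sampled_sizes(num_tokens: int, max_num_reqs: int, cap_by_tokens: bool) -> list[int]:
    """Return the request counts that would each trigger one sampling compile."""
    sizes = []
    num_reqs_to_sample = MIN_NUM_REQS
    while True:
        sizes.append(num_reqs_to_sample)  # stands in for one compilation
        # Old condition stops only at max_num_reqs; the new one also stops
        # once the request count has reached the number of tokens.
        limit = min(num_tokens, max_num_reqs) if cap_by_tokens else max_num_reqs
        if num_reqs_to_sample >= limit:
            break
        num_reqs_to_sample = next_padded_num_reqs(num_reqs_to_sample + 1,
                                                  max_num_reqs)
    return sizes


old = sampled_sizes(num_tokens=16, max_num_reqs=64, cap_by_tokens=False)
new = sampled_sizes(num_tokens=16, max_num_reqs=64, cap_by_tokens=True)
print(old, new)
```

Under these assumptions, a 16-token batch with `max_num_reqs=64` goes from four simulated compilations to two. When `num_tokens` falls between padded sizes (for example 20), the loop still visits the next larger size (32) before stopping, which matches the comment in the diff about bucketed padding.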