[Model] Fix Gemma 4 token repetition by dynamic BOS injection for PT models (#39842)

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>

Commit 6dc94914 (unverified), authored Apr 15, 2026 by Luciano Martins, committed by GitHub Apr 15, 2026

Showing with 7 additions and 2 deletions

vllm/model_executor/models/gemma4_mm.py +7 -2

--- a/vllm/model_executor/models/gemma4_mm.py
+++ b/vllm/model_executor/models/gemma4_mm.py
@@ -167,9 +167,14 @@ class Gemma4ProcessingInfo(BaseProcessingInfo):
         Setting ``add_special_tokens=False`` here prevents the duplicate and
         ensures both ``llm.generate()`` and the chat/completions API behave
-        correctly.
+        correctly for IT models. For PT models (without chat template), we
+        keep the default (True) to ensure BOS is added for raw prompts.
         """
+        tokenizer = self.ctx.get_tokenizer()
+        has_chat_template = getattr(tokenizer, "chat_template", None) is not None
         params = super().get_default_tok_params()
+        if has_chat_template:
             params = params.with_kwargs(add_special_tokens=False)
         return params
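
The branch added in this hunk can be exercised in isolation. The sketch below is not vLLM code: `StubTokenizer` and `default_add_special_tokens` are hypothetical names introduced for illustration, and only the `getattr(..., "chat_template", None)` check is taken from the diff. It assumes, as the docstring states, that IT checkpoints ship a chat template that already emits BOS and PT checkpoints ship none.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubTokenizer:
    """Hypothetical stand-in for a tokenizer object; not a vLLM or HF class."""

    chat_template: Optional[str] = None


def default_add_special_tokens(tokenizer: object) -> bool:
    """Return the add_special_tokens default implied by the patched branch."""
    # Same check as the diff: a missing attribute counts as "no template".
    has_chat_template = getattr(tokenizer, "chat_template", None) is not None
    # IT models: the template already embeds BOS, so do not add it again.
    # PT models: no template, so keep the default and let BOS be added.
    return not has_chat_template


# Assumed setup: an IT-style tokenizer with a template, a PT-style one without.
it_tokenizer = StubTokenizer(chat_template="{{ bos_token }}{{ messages }}")
pt_tokenizer = StubTokenizer()

print(default_add_special_tokens(it_tokenizer))  # False
print(default_add_special_tokens(pt_tokenizer))  # True
```

Using `getattr` with a `None` default means a tokenizer class that lacks the `chat_template` attribute entirely is treated the same as a PT tokenizer, so BOS is still added for raw prompts.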