[Model] Fix Gemma 4 token repetition by dynamic BOS injection for PT models (#39842)

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>
(cherry picked from commit 6dc94914)

Commit b1568cf4 authored Apr 15, 2026 by Luciano Martins Committed by khluu Apr 15, 2026

Showing with 7 additions and 2 deletions

vllm/model_executor/models/gemma4_mm.py +7 -2

--- a/vllm/model_executor/models/gemma4_mm.py
+++ b/vllm/model_executor/models/gemma4_mm.py
@@ -168,10 +168,15 @@ class Gemma4ProcessingInfo(BaseProcessingInfo):
        Setting ``add_special_tokens=False`` here prevents the duplicate and
        ensures both ``llm.generate()`` and the chat/completions API behave
-        correctly.
+        correctly for IT models. For PT models (without chat template), we
+        keep the default (True) to ensure BOS is added for raw prompts.
        """
+        tokenizer = self.ctx.get_tokenizer()
+        has_chat_template = getattr(tokenizer, "chat_template", None) is not None
        params = super().get_default_tok_params()
-        params = params.with_kwargs(add_special_tokens=False)
+        if has_chat_template:
+            params = params.with_kwargs(add_special_tokens=False)
        return params
    def get_hf_processor(self, **kwargs: object) -> Gemma4Processor:
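
The branching introduced by this change can be illustrated in isolation. The following is a minimal sketch, not the vLLM implementation: `TokParams`, `default_tok_params`, and the `SimpleNamespace` tokenizers are hypothetical stand-ins, and the base default of `add_special_tokens=True` is assumed from the docstring in the diff. Only the `getattr(tokenizer, "chat_template", None) is not None` check and the `with_kwargs(add_special_tokens=False)` call mirror the patch.

```python
from dataclasses import dataclass, field, replace
from types import SimpleNamespace


@dataclass(frozen=True)
class TokParams:
    """Hypothetical stand-in for the tokenization params object in the diff."""

    kwargs: dict = field(default_factory=dict)

    def with_kwargs(self, **overrides):
        # Return a copy with the overrides merged in; the original is unchanged.
        return replace(self, kwargs={**self.kwargs, **overrides})


def default_tok_params(tokenizer) -> TokParams:
    # Assumed base default: special tokens (including BOS) are added.
    params = TokParams({"add_special_tokens": True})
    # Same check as the patch: a tokenizer with a chat template is treated as IT.
    has_chat_template = getattr(tokenizer, "chat_template", None) is not None
    if has_chat_template:
        # IT model: disable special tokens so the prompt does not get a duplicate.
        params = params.with_kwargs(add_special_tokens=False)
    return params


# Hypothetical tokenizers: one with a chat template (IT), one without (PT),
# and one that lacks the attribute entirely.
it_tokenizer = SimpleNamespace(chat_template="{{ bos_token }}{{ messages }}")
pt_tokenizer = SimpleNamespace(chat_template=None)
bare_tokenizer = SimpleNamespace()

for name, tok in [("IT", it_tokenizer), ("PT", pt_tokenizer), ("bare", bare_tokenizer)]:
    print(name, default_tok_params(tok).kwargs)
```

Because the check is `is not None` rather than a truthiness test, a tokenizer whose `chat_template` is an empty string would still take the IT branch in this sketch, and the same holds for the expression used in the patch.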