Unverified Commit 69b2fe92 authored by Michael Yang, committed by GitHub

fix: qwen25vl assign samebatch in multimodal input (#10789)

Setting SameBatch on the vision start token is problematic because that token is shared with other inputs that also use images. The input can then be cached, so the runner never sees SameBatch, and the value may also be incorrect since it can refer to a different image.

Assigning SameBatch to the image's input token resolves this by ensuring it is assigned to the inputs that correspond to the image.

Not setting SameBatch correctly may cause panics during inference, since images are no longer guaranteed to be in the same batch.
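
For illustration, here is a minimal, self-contained Go sketch of the before/after placement of SameBatch. The Input struct, token IDs, and patch count below are simplified stand-ins for ollama's input.Input and the qwen25vl token values, not the upstream definitions.

```go
package main

import "fmt"

// Input is a simplified stand-in for ollama's input.Input; only the fields
// relevant to this change are modeled (hypothetical, for illustration).
type Input struct {
	Token     int32
	SameBatch int // number of following inputs that must stay in this input's batch
}

func main() {
	const (
		visionStartToken int32 = 151652 // hypothetical token IDs
		imageToken       int32 = 151655
	)
	patchesPerChunk := 4 // assumed patch count for one image chunk

	// Before: SameBatch rode on the vision start token. That token is the
	// same for every image, so a prompt-cache hit can skip it and the runner
	// never sees SameBatch, or sees a value computed for a different image.
	before := []Input{
		{Token: visionStartToken, SameBatch: patchesPerChunk + 1},
		{Token: imageToken},
	}

	// After: SameBatch is carried by the image token, which is tied to this
	// image (upstream it also carries the multimodal tensor and hash), so the
	// batching constraint always travels with the image it belongs to.
	after := []Input{
		{Token: visionStartToken},
		{Token: imageToken, SameBatch: patchesPerChunk},
	}

	fmt.Println("before:", before)
	fmt.Println("after: ", after)
}
```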
parent 9ed8bf14
@@ -121,13 +121,14 @@ func (m *Model) PostTokenize(inputs []input.Input) ([]input.Input, error) {
 			patchesPerChunk := inp.Multimodal[0].Tensor.Dim(1)
 			// First add the vision start token
-			result = append(result, input.Input{Token: visionStartToken, SameBatch: patchesPerChunk + 1})
+			result = append(result, input.Input{Token: visionStartToken})
 			// Add the image token with the multimodal tensor data at the first position
 			result = append(result, input.Input{
 				Token:          imageToken,
 				Multimodal:     inp.Multimodal,
 				MultimodalHash: inp.MultimodalHash,
+				SameBatch:      patchesPerChunk,
 			})
 			// Add the placeholder tokens for the remaining positions (tokensPerGrid-1)