model/parsers/qwen3coder.go · 34efbbd3f02c024fc3256ae7d7799abb9cb98e8f · OpenDAS / ollama

parsers: fix unicode handling for qwen3-coder · 05ba4ca1

Devon Rifkin authored Sep 25, 2025

When trimming whitespace at the end of every chunk, we were iterating
backwards over the string byte-by-byte instead of rune-by-rune.

As an example of how this can cause corruption, suppose we have the
multi-byte character ✅ (`"\u2705"`), which is represented in UTF-8 as
the three bytes `0xE2 0x9C 0x85`. Taken on its own, the byte `0x85`
converts to the rune U+0085 (NEL), which passes `unicode.IsSpace()`.
Because we were iterating byte-by-byte, we mistakenly sliced in the
middle of the rune, removing `0x85` and leaving `0xE2 0x9C`, which,
beyond being the wrong place to slice, is not even valid UTF-8.

`trailingWhitespaceLen()` was modified to count from the end in a
rune-aware way. Tests with various multibyte unicode characters were
also added.
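
One way to count trailing whitespace in a rune-aware way is to decode
runes from the end with `utf8.DecodeLastRuneInString`. The following is a
minimal sketch under that assumption, not necessarily how
`trailingWhitespaceLen()` is written in this commit; the function name
`runeAwareTrailingWhitespaceLen` is hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// runeAwareTrailingWhitespaceLen returns the number of bytes occupied by
// trailing whitespace in s, decoding whole runes from the end so that a
// multi-byte character is never split. (Hypothetical name, for illustration.)
func runeAwareTrailingWhitespaceLen(s string) int {
	total := 0
	for len(s) > 0 {
		// Decode the last complete rune and its width in bytes. Invalid
		// UTF-8 yields utf8.RuneError with size 1, which is not whitespace.
		r, size := utf8.DecodeLastRuneInString(s)
		if !unicode.IsSpace(r) {
			break
		}
		total += size
		s = s[:len(s)-size]
	}
	return total
}

func main() {
	for _, s := range []string{"ok ✅", "ok ✅ \n", "ok\u0085"} {
		n := runeAwareTrailingWhitespaceLen(s)
		fmt.Printf("%q -> trailing=%d trimmed=%q\n", s, n, s[:len(s)-n])
	}
}
```

Here a real NEL (`"\u0085"`, encoded as the two bytes `0xC2 0x85`) is
still counted as whitespace, while the `0x85` inside ✅ is left alone.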


Fixes: #12414


qwen3coder.go 13.3 KB
