    parsers: fix unicode handling for qwen3-coder · 05ba4ca1
    Devon Rifkin authored
    When trimming whitespace at the end of every chunk, we were iterating
    backwards over the string byte-by-byte instead of rune-by-rune.
    
    As an example of how this can cause corruption, suppose we have the
    multi-byte character ✅ (`"\u2705"`), which is represented in UTF-8 as
    the three bytes `0xE2 0x9C 0x85`. Read as a code point on its own, `0x85`
    is NEL (U+0085), which passes `unicode.IsSpace()`. Because we were
    iterating byte-by-byte, this caused us to mistakenly slice in the middle
    of the rune, removing `0x85` and leaving `0xE2 0x9C`, which, beyond being
    the incorrect place to slice, is not even valid UTF-8.
    
    `trailingWhitespaceLen()` was modified to count from the end in a
    rune-aware way. Tests with various multibyte unicode characters were
    also added.
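
    A minimal sketch of what counting from the end in a rune-aware way can
    look like, assuming `utf8.DecodeLastRuneInString` from the standard
    library is used to step backwards. The function name
    `runeAwareTrailingWhitespaceLen` is hypothetical, and the actual
    `trailingWhitespaceLen()` in this commit may differ.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// runeAwareTrailingWhitespaceLen returns the number of bytes of trailing
// whitespace in s, decoding one whole rune at a time from the end.
// Hypothetical sketch, not the function from the commit.
func runeAwareTrailingWhitespaceLen(s string) int {
	total := 0
	remaining := s
	for len(remaining) > 0 {
		// Decode the last complete rune and its width in bytes.
		// Invalid encodings come back as (utf8.RuneError, 1), which is
		// not whitespace, so the loop stops there.
		r, size := utf8.DecodeLastRuneInString(remaining)
		if !unicode.IsSpace(r) {
			break
		}
		total += size
		remaining = remaining[:len(remaining)-size]
	}
	return total
}

func main() {
	fmt.Println(runeAwareTrailingWhitespaceLen("done \u2705"))     // 0
	fmt.Println(runeAwareTrailingWhitespaceLen("done \u2705 \n")) // 2
	fmt.Println(runeAwareTrailingWhitespaceLen("done\u0085"))     // 2: a real NEL is two bytes in UTF-8
}
```

    Returning a byte count rather than a rune count keeps the result
    directly usable as a slice offset, e.g. `s[:len(s)-n]`.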
    
    
    Fixes: #12414
qwen3coder.go 13.3 KB