    kvcache: Clean up sliding window state with independent batches · 1fc35f12
    Jesse Gross authored
    Sliding window models (e.g. gpt-oss, gemma3) remove tokens that
    are outside the cache's window each time we start a new forward pass.
    
    The cache storage needs to hold the window size for each sequence
    plus the batch size, since the batch needs to attend to the full
    window. This means that we store more than a window's worth of
    tokens while processing the batch.
    
    When the next batch arrives, we currently look only at the
    sequences in the incoming batch to slide the window forward.
    However, we also need to clean up the other sequences that might
    be occupying space in the batch processing buffer to ensure each
    sequence is only using its window size of storage. Failure to do
    this can result in "no kv cache slot found" errors.
    
    Fixes: #10127
causal.go 20.7 KB