    kvcache: Enable SWA to retain additional entries · 4183bb05
    Jesse Gross authored
    Models that use sliding window attention can only resume a sequence
    from the cache if it falls within the saved windows. This works well
    if the next message picks up where the old one left off. However, it
    generally prevents a partial prefix match unless the entire conversation
    falls within the sliding window.
    
    This is a problem for reasoning models, whose traces are supposed
    to be removed from future messages. The next prompt then diverges
    from the cached sequence partway through, forcing the entire
    history to be re-evaluated.
    
    This change allows models to specify that a larger amount of the
    history be retained in memory, to allow more partial resumption.
    It still respects the window that the model was trained on for
    token generation.
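
The resumption rule described above can be sketched in Go. This is a simplified illustration and not the code in causal.go: the `swaCache` type, its fields, and the `attendRange`, `oldestRetained`, and `canResume` methods are hypothetical names. It also assumes a token at position p attends to the `window` positions before it. The point is that attention always uses `window`, while the resume check uses the larger `retain` count.

```go
package main

import "fmt"

// swaCache is a hypothetical, simplified model of a sliding-window KV cache.
// It tracks only positions, not the actual key/value tensors.
type swaCache struct {
	window int // attention window the model was trained on
	retain int // trailing positions kept in memory; assumed >= window
	length int // number of tokens processed so far
}

// attendRange returns the half-open range [lo, hi) of cached positions that a
// token at position pos attends to. It depends only on window, never on retain.
func (c swaCache) attendRange(pos int) (lo, hi int) {
	lo = pos - c.window
	if lo < 0 {
		lo = 0
	}
	return lo, pos
}

// oldestRetained returns the lowest position still held in memory.
func (c swaCache) oldestRetained() int {
	if c.length > c.retain {
		return c.length - c.retain
	}
	return 0
}

// canResume reports whether evaluation can restart at position prefixLen,
// reusing the cache for the first prefixLen tokens. That requires every
// position the next token attends to to still be in memory.
func (c swaCache) canResume(prefixLen int) bool {
	if prefixLen < 0 || prefixLen > c.length {
		return false
	}
	if prefixLen == 0 {
		return true // nothing is reused, so nothing needs to be cached
	}
	lo, _ := c.attendRange(prefixLen)
	return lo >= c.oldestRetained()
}

func main() {
	plain := swaCache{window: 4, retain: 4, length: 10}
	extended := swaCache{window: 4, retain: 8, length: 10}

	fmt.Println(plain.canResume(10))   // true: continues where the sequence left off
	fmt.Println(plain.canResume(8))    // false: position 4 was already evicted
	fmt.Println(extended.canResume(8)) // true: positions 2..9 are still retained
	fmt.Println(extended.canResume(5)) // false: position 1 is outside the retained span
}
```

In this sketch, raising `retain` from 4 to 8 makes a prefix match at position 8 reusable, while `attendRange` still limits each token to the trained window of 4.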
causal.go 19.7 KB