    kvcache: Enable SWA to retain additional entries · 4183bb05
    Jesse Gross authored
    Models that use sliding window attention can only resume a sequence
    from the cache if it falls within the saved windows. This works well
    if the next message picks up where the old one left off. However, it
    generally prevents a partial prefix match unless the entire conversation
    falls within the sliding window.
    
    This is a problem for reasoning models, whose traces are supposed
    to be removed from future messages. The new prompt then diverges
    from the cached sequence at a point outside the window, forcing the
    entire history to be re-evaluated.
    
    This change allows models to specify that more of the history be
    retained in memory, so that more partial prefix matches can be resumed.
    It still respects the window that the model was trained on for
    token generation.
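
The distinction between what is retained and what is attended to can be sketched as follows. This is a minimal illustration, not the code from this commit: the `swaCache` type, its `window` and `memory` fields, and the `canResume` and `visible` helpers are hypothetical names, and the sketch assumes a token at position p attends to the `window` most recent positions including itself.

```go
package main

import "fmt"

// swaCache is a hypothetical, simplified model of a sliding window cache.
// It is an illustration only and does not mirror the real kvcache types.
type swaCache struct {
	window int // attention window the model was trained on
	memory int // trailing positions kept in memory (assumed memory >= window)
}

// oldestRetained returns the lowest position still cached after seqLen
// tokens have been processed.
func (c swaCache) oldestRetained(seqLen int) int {
	if seqLen <= c.memory {
		return 0
	}
	return seqLen - c.memory
}

// canResume reports whether a prompt that shares its first prefixLen tokens
// with a cached sequence of seqLen tokens can continue from the cache.
// The first new token sits at position prefixLen and needs the previous
// window-1 positions to still be present.
func (c swaCache) canResume(seqLen, prefixLen int) bool {
	if prefixLen > seqLen {
		return false
	}
	need := prefixLen - c.window + 1
	if need < 0 {
		need = 0
	}
	return need >= c.oldestRetained(seqLen)
}

// visible reports whether cached position q may be attended to from
// position p. Only the trained window matters here, never memory.
func (c swaCache) visible(q, p int) bool {
	return q <= p && q > p-c.window
}

func main() {
	small := swaCache{window: 4, memory: 4}
	large := swaCache{window: 4, memory: 64}

	// A 100-token conversation whose next prompt only matches the first 60.
	fmt.Println(small.canResume(100, 60), large.canResume(100, 60)) // false true

	// Extra retained entries do not widen the attention window.
	fmt.Println(large.visible(50, 60), large.visible(57, 60)) // false true
}
```

Under these assumptions, retaining 64 positions instead of 4 lets the partial prefix match resume, while attention during generation is still limited to the 4-position window.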
causal_test.go 21.7 KB