    [BugFix] Eagerly abort cancelled final-step requests (#29987) · dc264bce
    Nick Hill authored
    
    
    Currently, when a request is cancelled while executing its final
    step, its completion is handled by normal stop processing (e.g.
    length or stop token), so the abort has no effect. This is typically
    not a problem, but when a kv connector is involved, the connector
    sees the request as having completed successfully rather than
    having been aborted.
    
    This is problematic for disaggregated prefill, which frees kv
    cache blocks if the request was aborted but not if it completed
    successfully. Since the cancelled request will never be sent to
    the decode side, its kv cache blocks remain pinned until the
    fall-back timeout expires. The problem is exacerbated when many
    requests are cancelled and/or there are large prefills whose
    forward pass takes a long time (since the window in which a
    cancellation can arrive during the final step is bigger).
    
    This PR fixes the problem by processing pending aborts
    immediately prior to processing model output each step; we process
    only aborts, not new requests, since it's preferable for latency to
    process model outputs before new incoming requests.
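    
    As a rough illustration of that ordering (not the actual core.py
    change), the toy sketch below drains only abort messages from an
    input queue right before handling model output and leaves new
    requests queued in their original order. All names here (ToyCore,
    Msg, MsgType, inbox, process_model_output) are hypothetical and
    exist only for this example.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum, auto


class MsgType(Enum):
    ADD = auto()
    ABORT = auto()


@dataclass
class Msg:
    type: MsgType
    req_id: str


class ToyCore:
    """Hypothetical stand-in for an engine core; not vLLM's real class."""

    def __init__(self):
        self.inbox = deque()   # pending client messages, in arrival order
        self.running = {}      # req_id -> tokens left before a length stop
        self.finished = {}     # req_id -> finish reason

    def _process_pending_aborts(self):
        # Handle ABORT messages now; keep everything else queued, in order.
        deferred = deque()
        while self.inbox:
            msg = self.inbox.popleft()
            if msg.type is MsgType.ABORT:
                if msg.req_id in self.running:
                    del self.running[msg.req_id]
                    self.finished[msg.req_id] = "aborted"
            else:
                deferred.append(msg)
        self.inbox = deferred

    def process_model_output(self, req_ids_with_new_token):
        # Aborts are applied first, so a request cancelled during its
        # final step is recorded as aborted rather than as a normal stop.
        self._process_pending_aborts()
        for req_id in req_ids_with_new_token:
            if req_id not in self.running:
                continue  # aborted while the forward pass was in flight
            self.running[req_id] -= 1
            if self.running[req_id] == 0:
                del self.running[req_id]
                self.finished[req_id] = "length"
```

    In this sketch a request on its final step that has a queued abort
    ends up with finish reason "aborted", while a queued new request
    stays in the inbox until the next regular input-processing pass.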
    
    Fixes #26400.
    Signed-off-by: Nick Hill <nhill@redhat.com>
core.py 57.1 KB