    llama: Ensure KV cache is fully defragmented. · 08a832b4
    Jesse Gross authored
    Sometimes the KV cache requires defragmentation even without
    triggering the threshold heuristic. In this case, decoding
    will not be able to find a KV cache slot. This is particularly
    difficult for the caller to handle if it happens in between
    ubatches. To avoid this, we should immediately trigger a defrag.
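
    The retry described above can be sketched roughly as follows. This is
    a toy model, not the code from this commit: the cache is assumed to be
    a plain vector of occupied flags, and find_slot, defrag_all, and
    find_slot_or_defrag are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Toy cache: true means the cell is occupied. Hypothetical stand-in,
// not the real llama.cpp KV cache structure.
using ToyCache = std::vector<bool>;

// Return the start of the first run of n contiguous free cells, if any.
std::optional<std::size_t> find_slot(const ToyCache & cells, std::size_t n) {
    if (n == 0) {
        return std::size_t{0};
    }
    std::size_t run = 0;
    for (std::size_t i = 0; i < cells.size(); ++i) {
        run = cells[i] ? 0 : run + 1;
        if (run == n) {
            return i + 1 - n;
        }
    }
    return std::nullopt;
}

// Compact all occupied cells to the front of the cache.
void defrag_all(ToyCache & cells) {
    const auto used = static_cast<std::size_t>(
        std::count(cells.begin(), cells.end(), true));
    assert(used <= cells.size());
    std::fill(cells.begin(), cells.begin() + used, true);
    std::fill(cells.begin() + used, cells.end(), false);
}

// Try to find a slot; if that fails, defragment immediately and retry
// once, regardless of any fragmentation threshold.
std::optional<std::size_t> find_slot_or_defrag(ToyCache & cells, std::size_t n) {
    if (auto slot = find_slot(cells, n)) {
        return slot;
    }
    defrag_all(cells);
    return find_slot(cells, n);
}
```

    In this sketch the caller only sees a failure when the cache truly
    lacks n free cells, not when the free cells are merely scattered.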
    
    In addition, a heavily fragmented cache can require more than
    max_moves to defragment. Currently, we stop when we hit the limit,
    but this can leave a cache that still does not have adequate space
    even after defragmentation is triggered. Instead, we should do
    multiple batches of processing until everything is complete.
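
    A minimal sketch of the batched approach, again on a toy cache of
    occupied flags rather than the real data structures. The names
    defrag_pass, defrag_until_done, and is_compact are hypothetical;
    only max_moves comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One bounded pass: move occupied cells from the tail into holes near
// the front, performing at most max_moves moves. Returns the number of
// moves actually made (0 means the cache is already compact).
std::size_t defrag_pass(std::vector<bool> & cells, std::size_t max_moves) {
    std::size_t moves = 0;
    std::size_t lo = 0;             // candidate hole index
    std::size_t hi = cells.size();  // one past the candidate occupied cell
    while (moves < max_moves) {
        while (lo < hi && cells[lo]) {
            ++lo;                   // advance to the first hole
        }
        while (hi > lo && !cells[hi - 1]) {
            --hi;                   // retreat to the last occupied cell
        }
        if (lo >= hi) {
            break;                  // no occupied cell remains after a hole
        }
        cells[lo] = true;           // fill the hole
        cells[hi - 1] = false;      // free the tail cell
        ++moves;
    }
    return moves;
}

// Repeat bounded passes until a pass has nothing left to move.
// Returns the number of passes that moved at least one cell.
std::size_t defrag_until_done(std::vector<bool> & cells, std::size_t max_moves) {
    assert(max_moves > 0);          // a zero limit could never make progress
    std::size_t passes = 0;
    while (defrag_pass(cells, max_moves) > 0) {
        ++passes;
    }
    return passes;
}

// True if no occupied cell appears after a free one.
bool is_compact(const std::vector<bool> & cells) {
    bool seen_free = false;
    for (bool occupied : cells) {
        if (!occupied) {
            seen_free = true;
        } else if (seen_free) {
            return false;
        }
    }
    return true;
}
```

    With a limit of one move per pass, a single pass leaves this toy
    cache fragmented, while looping until a pass makes no moves always
    ends with a compact cache.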
    
    Fixes #7949
llama.cpp 972 KB