    feat: prefill chunking (#2600) · a6a0c97e
    OlivierDehaene authored
    
    
    * wip
    
    * rollback
    
    * refactor to use prefix/postfix naming + fix all_input_ids_tensor
    
    * maybe patching vlms?
    
    * fix filter and concat
    
    * wip, no filter, no concat
    
    * current
    
    * add prepare_for_prefill
    
    * working
    
    * load tested
    
    * re-create slots
    
    * re-create slots
    
    * fix slot_filtering_indices
    
    * feedback loop
    
    * remove log
    
    * fix benchmarker
    
    * fix vlm and seq2seq
    
    * rename to cache and input lengths
    
    * fix prefill logprobs
    
    * fix launcher
    
    * fix logprobs?
    
    * idk at this point
    
    * max input length
    
    * omfg
    
    * remove debugging lines
    
    * fix tests
    
    * fix mllama
    
    * fix cargo tests
    
    * remove support chunking for paged
    
    * Fixing non-blocked attentions
    
    * Fixing dtype + AMD, Ipex targets.
    
    * lint fix.
    
    * rename
    
    * Fix prefix_caching variable, remove defaults in server (confusing a lot
    of the time).
    
    * Add simple resolution when the user specifies ATTENTION=paged.
    
    * Put back non default simple tests.
    
    * Fix env name
    
    ---------
    Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
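
    As background on the feature named in the commit title, here is a minimal sketch of the idea behind prefill chunking, assuming a fixed per-chunk token budget; the chunked_prefill function, its chunk_size parameter, and the forward callback are illustrative placeholders rather than TGI's actual API:

        # A minimal, hypothetical sketch of prefill chunking (not TGI's actual code):
        # instead of running one forward pass over the whole prompt, the prefill is
        # split into fixed-size chunks so the KV cache grows a chunk at a time and a
        # single long prompt cannot monopolize a batch.

        from typing import Callable, List


        def chunked_prefill(
            input_ids: List[int],
            chunk_size: int,
            forward: Callable[[List[int], int], object],
        ):
            """Prefill `input_ids` in chunks of at most `chunk_size` tokens.

            `forward(tokens, cache_length)` stands in for a model forward pass that
            appends `tokens` to the KV cache starting at position `cache_length`
            and returns the logits for the last token of the chunk.
            """
            cache_length = 0  # tokens already written to the KV cache (the "prefix")
            last_logits = None
            while cache_length < len(input_ids):
                chunk = input_ids[cache_length:cache_length + chunk_size]
                last_logits = forward(chunk, cache_length)
                cache_length += len(chunk)
            # Only the logits from the final chunk are needed to sample the first
            # generated token; earlier chunks exist solely to populate the cache.
            return last_logits
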
model.py