    Lots of improvements (Still 2 allocators) (#2449) · e415b690
    Nicolas Patry authored
    
    
    * Making prefix/flashinfer the default and running the full release tests.
    
    * Include flashinfer in the docker.
    
    * Using prebuilt.
    
    * Allowing window_left_size (dummy version).
    
    * Disabling flashinfer/prefix caching on odd head_dim
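    
    The guard presumably looks something like the sketch below; the function name, the
    fallback backend and the returned tuple are illustrative assumptions, not the actual
    TGI code.
    
    ```python
    # Hypothetical sketch: pick the attention backend and the prefix-caching
    # default, falling back when head_dim is odd (the case this commit disables).
    def resolve_attention(head_dim: int, requested: str = "flashinfer") -> tuple[str, bool]:
        """Return (attention_backend, prefix_caching_enabled)."""
        if requested == "flashinfer" and head_dim % 2 != 0:
            # The fallback choice here is an assumption; the point is only that
            # flashinfer and prefix caching get switched off together.
            return "paged", False
        return requested, True
    
    
    print(resolve_attention(head_dim=111))  # ('paged', False)
    print(resolve_attention(head_dim=128))  # ('flashinfer', True)
    ```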
    
    * Disable prefix caching for lora.
    
    * More specific codes.
    
    * Update lock
    
    * Updating integration tests with the new values for FI/FD.
    
    Remove paged as a default too, and use FD everywhere.
    
    * Update cargo lock?
    
    * Upgrade to Rust 1.80 because of bitstream...
    
    * Rust 1.80 everywhere.
    
    * Forgot last default place.
    
    * Apply suggestions from code review
    Co-authored-by: drbh <david.richard.holtz@gmail.com>
    
    * Updated flake lock
    
    * Tmp
    
    * Upgrade the resolution system for fewer resolution errors.
    
    * Remove lambda for cleaner function.
    
    * Handling debugger.
    
    * Override the env in server tests.
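    
    A generic way to do that with pytest is `monkeypatch`; the snippet below is only an
    illustration, and the environment variable names in it are assumptions.
    
    ```python
    import os
    
    import pytest
    
    
    @pytest.fixture
    def attention_env(monkeypatch):
        # Force a specific backend / prefix-caching setting for one test only;
        # the variable names are assumed, not taken from the TGI test suite.
        monkeypatch.setenv("ATTENTION", "flashdecoding")
        monkeypatch.setenv("USE_PREFIX_CACHING", "0")
        yield
    
    
    def test_env_is_overridden(attention_env):
        assert os.environ["ATTENTION"] == "flashdecoding"
        assert os.environ["USE_PREFIX_CACHING"] == "0"
    ```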
    
    * Is this enough to make it work?
    
    * This seems to be working.
    
    * Downgrade some logs.
    
    * Fixing the default for vlm.
    
    * Don't enable prefix caching on VLM just yet.
    
    * Change `add_special_tokens` so that chat input gets the correct tokens
    (this matters a lot now with prefix caching).
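    
    Chat templates typically render the special tokens into the prompt string themselves,
    so encoding that string with the wrong `add_special_tokens` value either duplicates or
    drops them, and with prefix caching the very first tokens decide whether a cached
    prefix matches. A small illustration with an arbitrary ungated tokenizer:
    
    ```python
    from transformers import AutoTokenizer
    
    # Any tokenizer works for the illustration; the point is only that the flag
    # changes the leading tokens of the encoded prompt.
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    
    print(tok("hello world", add_special_tokens=True).input_ids)   # [101, 7592, 2088, 102]
    print(tok("hello world", add_special_tokens=False).input_ids)  # [7592, 2088]
    ```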
    
    * Fixing prefix caching for flashdecoding.
    
    * Update all models.
    
    * Fixed flashinfer version.
    
    * add_special_tokens is internal only
    
    * Fixing seqlen with the new vlms.
    
    * Fixing the issue with `add_special_tokens` not being passed around.
    
    * Fixing the test.
    
    * Removing encoder_decoder (seq2seq).
    
    * Update the chat test.
    
    * Fixing the batching tokenization in flash causal lm.
    
    * Truncating left for radix purposes.
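    
    For reference, left-sided truncation with Hugging Face tokenizers is controlled by
    `truncation_side`; this is a generic illustration of that setting, not the TGI code path.
    
    ```python
    from transformers import AutoTokenizer
    
    tok = AutoTokenizer.from_pretrained("gpt2")  # example model
    tok.truncation_side = "left"  # drop tokens from the start when over max_length
    
    ids = tok("one two three four five", truncation=True, max_length=3).input_ids
    print(tok.decode(ids))  # only the last tokens of the input survive
    ```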
    
    * Oops this doesn't belong here.
    
    * Put back default pure shell.
    
    * Update server tests
    
    - Default to throughput test in k6
    - Use TGI_WIGGLE_ROOM to adjust wiggle room
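    
    Reading such a knob presumably looks like the sketch below; the default value and the
    way the margin is applied are assumptions.
    
    ```python
    import os
    
    
    def usable_memory(free_bytes: int) -> int:
        # Hypothetical sketch: scale the measured free memory by a configurable
        # safety margin before sizing the KV cache; the 0.95 default is an assumption.
        wiggle_room = float(os.environ.get("TGI_WIGGLE_ROOM", "0.95"))
        return int(free_bytes * wiggle_room)
    
    
    print(usable_memory(24 * 1024**3))  # bytes usable out of 24 GiB free
    ```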
    
    * Only n_heads / process_group.size() are necessary.
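    
    In a tensor-parallel setup each rank only holds its own shard of the attention heads,
    which is why n_heads divided by the process-group size is all that matters locally.
    A minimal sketch, with the helper and its arguments assumed for illustration:
    
    ```python
    # Minimal sketch (not the actual TGI code): each tensor-parallel rank keeps
    # num_heads // world_size query heads and its share of the KV heads.
    def local_head_counts(num_heads: int, num_kv_heads: int, world_size: int) -> tuple[int, int]:
        assert num_heads % world_size == 0, "heads must divide evenly across ranks"
        return num_heads // world_size, max(1, num_kv_heads // world_size)
    
    
    print(local_head_counts(num_heads=32, num_kv_heads=8, world_size=4))  # (8, 2)
    ```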
    
    * Revert the integration tests change (seems linked to the head_size
    modification).
    
    * Adding error message when assert is violated.
    
    * Fixing the free algorithm to handle cases where the common prefix is
    smaller.
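    
    The idea seems to be that on free, only the blocks past the prefix still shared with
    the radix cache go back to the free list, and that this shared prefix can be shorter
    than it was at allocation time. A toy sketch with hypothetical names:
    
    ```python
    # Toy sketch (hypothetical names, not the TGI allocator): the first
    # `common_prefix_blocks` blocks stay referenced by the radix cache, the rest
    # are returned. Clamping guards against a prefix shorter than expected.
    def free_blocks(allocated: list[int], common_prefix_blocks: int, free_list: list[int]) -> None:
        common_prefix_blocks = min(common_prefix_blocks, len(allocated))
        free_list.extend(allocated[common_prefix_blocks:])
    
    
    free: list[int] = []
    free_blocks(allocated=[10, 11, 12, 13], common_prefix_blocks=2, free_list=free)
    print(free)  # [12, 13]
    ```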
    
    * Apply suggestions from code review
    Co-authored-by: OlivierDehaene <olivier@huggingface.co>
    
    * Update server/text_generation_server/layers/attention/common.py
    Co-authored-by: OlivierDehaene <olivier@huggingface.co>
    
    * Fix disabling prefix caching - Fix windowing checks.
    
    * Revert the Cohere tokenizer change (for now using a revision instead).
    
    * Fmt.
    
    ---------
    Co-authored-by: drbh <david.richard.holtz@gmail.com>
    Co-authored-by: OlivierDehaene <olivier@huggingface.co>