    Auto max prefill (#2797) · 5df80590
    Nicolas Patry authored
    * Attempt at automatic max batch prefill.
    
    * Taking into account number of shards.
    
    * Adding more cards.
    
    * Adding A100 + H100
    
    * Adding a few more cards.
    
    * Logprobs cost too much.
    
    * h100 better name, and keep factor of 2
    
    * Damn inflated sparse tflops.
    
    * Typo in h100.
    
    * Updated the flops calculation (checked with fvcore).
    
    * Chunking by default.
    
    * Fix prefix caching for chat completion since we removed logprobs.
    
    * More tests.
    
    * Dropping all the prefill logprobs.
    
    * Add a flag that enables users to get logprobs back.
    
    * Repairing prompt token counting.
    
    * Fixing a few tests.
    
    * Remove some scaffolding.
    
    * Attempting to reduce the issues (workarounds for now).
lib.rs 55.8 KB