  1. 15 Jul, 2024 1 commit
    • feat: simple mistral lora integration tests (#2180) · 5a650669
      drbh authored
      * feat: simple mistral lora integration tests
      
      * fix: include args in docker launcher
      
      * fix: disable cuda graphs with lora and warn
      
      * fix: adjust docs and precommit issues
      
      * fix: re update docs
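The "disable cuda graphs with lora and warn" fix above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name, argument names, and warning text are all hypothetical.

```python
import warnings

def resolve_cuda_graphs(cuda_graphs, lora_adapters):
    """Return the CUDA-graph batch sizes to capture (hypothetical sketch).

    CUDA graphs replay a fixed, pre-recorded kernel sequence, which does
    not mix well with dynamically selected LoRA adapter weights, so the
    launcher disables graph capture and warns instead of failing later.
    """
    if lora_adapters and cuda_graphs:
        warnings.warn("LoRA adapters are enabled: disabling CUDA graphs.")
        return []  # an empty list means: capture no graphs
    return cuda_graphs
```

Callers can then pass the resolved list onward unchanged, so the rest of the startup path needs no LoRA-specific branching.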
  2. 12 Jul, 2024 2 commits
  3. 11 Jul, 2024 2 commits
  4. 09 Jul, 2024 4 commits
    • Move quantized weight handling out of the `Weights` class (#2194) · 8511669c
      Daniël de Kok authored
      Quantized weights were loaded in the `Weights` class, but this was
      getting quite unwieldy: every higher-level weight-loading method had
      become a long conditional covering all the different quantizers.
      
      This change moves loading of quantized weights out of the `Weights`
      class. This is done by defining a simple `WeightsLoader` interface
      that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
      and `MarlinWeightsLoader`. These implementations are in the quantizers'
      respective modules. The `Weights` class provides the low-level load
      operations (such as loading tensors or sharded tensors), but delegates
      loads that need quantizer-specific weight processing to a loader. The
      loaders still use the low-level functionality provided by `Weights`.
      
      I initially tried making a hierarchy where a class like `GPTQWeights`
      would inherit from `Weights`. But it is not very flexible (e.g. does
      not work well with the new weight storage mock used in tests) and
      the implicit indirections made the code harder to follow.
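The `WeightsLoader` split described in this commit can be sketched as a small interface plus a default implementation that delegates to the low-level loads. Method names and signatures here are illustrative assumptions, not the project's actual API:

```python
from abc import ABC, abstractmethod

class WeightsLoader(ABC):
    """Quantizer-specific weight loading, kept outside `Weights` (sketch)."""

    @abstractmethod
    def get_multi_weights_row(self, weights, prefix):
        """Load a row-parallel weight, applying any quantizer processing."""

class DefaultWeightsLoader(WeightsLoader):
    """Unquantized path: just delegate to the low-level tensor load."""

    def get_multi_weights_row(self, weights, prefix):
        # `weights` provides only low-level operations such as
        # get_tensor / get_sharded; no quantizer conditionals here.
        return weights.get_tensor(f"{prefix}.weight")
```

Quantizer modules (Exl2, GPTQ, Marlin) would each provide their own `WeightsLoader` implementation, so `Weights` itself stays free of per-quantizer branches.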
    • Updating the self check (#2209) · 4c976fb4
      Nicolas Patry authored
      * Updating the self check
      
      * Fix.
      
      * Revert the CLI.
      
      * CLI.
      
      * Space.
      
      * Revert cargo update.
    • Fixed README ToC (#2196) · f5ba9bfd
      vinkamath authored
      
      Co-authored-by: Vinayak Kamath <Vinayak.Kamath@target.com>
    • Adding sanity check to openapi docs. · fe710af2
      Nicolas Patry authored
  5. 08 Jul, 2024 10 commits
  6. 05 Jul, 2024 6 commits
    • Consistently take `prefix` in model constructors (#2191) · 05c094fc
      Daniël de Kok authored
      * Consistently take `prefix` in model constructors
      
      * Release test check fix
      
      * Misc refactor-related fixes
    • GPTQ CI improvements (#2151) · 67ef0649
      Daniël de Kok authored
      * Add more representative Llama GPTQ test
      
      The Llama GPTQ test is updated to use a model with the commonly-used
      quantizer config format and activation sorting. The old test is
      kept around (but renamed) since it tests the format produced by
      `text-generation-server quantize`.
      
      * Add support for manually triggering a release build
    • Fix Starcoder2 after refactor (#2189) · b67d4633
      Daniël de Kok authored
    • Hotfixing after refactor. · 853d4eb9
      Nicolas Patry authored
    • Refactor dead code - Removing all `flash_xxx.py` files. (#2166) · fb2f74e2
      Nicolas Patry authored
      * Refactor dead code.
      
      * First working step.
      
      * Remove a lot of duplicated code.
      
      * More dead code.
      
      * More cleanup.
      
      * Fix Santacoder test.
      
      * Fixing the simple tests.
      
      * Fixing sharding.
      
      * Fixes for VLM.
      
      * Fixing santacoder (num_kv_heads hardcoded).
      
      * Removing more dead code.
      
      * Fixing `config.n_head`.
      
      * Stopping earlier because of `<end_of_utterance>` in idefics2.
      
      * Addresses comments.
      
      * Removing the dead code.
      
      * Fuse back mistral into FlashCausalLM.
      
      * Finish removal.
      
      * Fixing docs + causal_lm `batch_class`.
      
      * Fixing docs + causal_lm.
      
      * Add default to Gemma Causality.
      
      * Default value for gemma/gemma2.
      
      * Wrong default.
    • Adding "longrope" for Phi-3 (#2172) (#2179) · c6bcadf8
      Aaron Mihalik authored
  7. 04 Jul, 2024 1 commit
  8. 03 Jul, 2024 5 commits
  9. 02 Jul, 2024 6 commits
  10. 01 Jul, 2024 3 commits
    • [Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940) · 4327210e
      Nicolas Patry authored
      * Using flash decoding
      
      Conditional flashdecoding.
      
      Fix max_q.
      
      Working kvcache
      
      Working version with flash decoding.
      
      Make it work for mistral.
      
      Fix after rebase.
      
      Less intrusive.
      
      Revert changes in modeling.
      
      Speedup flashdecoding.
      
      Hack to make other models work.
      
      Fixing non flash decoding llama path.
      
      Router logic knows about page size.
      
      Missing 2 models.
      
      Missing cohere.
      
      Fixing cohere flash decoding.
      
      Revamped all this architecture.
      
      Fix cohere.
      
      Fixing falcon.
      
      Enabling custom block size schedule.
      
      Update router/src/infer.rs
      
      Not sending preallocated output.
      
      * Making it work on non flash decoding.
      
      * Fix Cohere.
      
      * Fix non decoding paths.
      
      * Rebased.
      
      * No need for cache_manager anymore.
      
      * Update?
      
      * "ipex" -> "cpu"
      
      * These do not belong.
      
      * Factoring cu_seqlen_qk for better abstracting over every model.
      
      * Fixing non flash tests/imports.
      
      * Changing return everywhere.
      
      * Update mistral past.
      
      * Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).
      
      * Fixup mistral clamping (had issues with cuda graphs).
      
      * No need to recreate anything actually.
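The "Router logic knows about page size" item above can be illustrated with a small sketch. The function name and the page size value are assumptions for illustration; the point is only that moving from PagedAttention's small blocks to FlashDecoding-style pages means the router must round each request's token budget up to whole pages:

```python
def pages_needed(prompt_tokens, max_new_tokens, page_size=256):
    """KV-cache pages to reserve for one request (hypothetical sketch).

    With a larger page size, the router rounds the total token budget
    up to whole pages rather than small fixed-size blocks.
    """
    total_tokens = prompt_tokens + max_new_tokens
    return -(-total_tokens // page_size)  # ceiling division
```

A scheduler using this would multiply the result by the per-page KV footprint to decide whether a request fits in the remaining cache.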
    • Fixing baichuan override. (#2158) · 4f55f158
      Nicolas Patry authored
    • GH router. (#2153) · d0225b10
      Nicolas Patry authored