1. 05 Jul, 2024 3 commits
    • Hotfixing after refactor. · 853d4eb9
      Nicolas Patry authored
    • Refactor dead code - Removing all `flash_xxx.py` files. (#2166) · fb2f74e2
      Nicolas Patry authored
      * Refactor dead code.
      
      * First working step.
      
      * Remove a lot of duplicated code.
      
      * More dead code.
      
      * More cleanup.
      
      * Fix Santacoder test.
      
      * Fixing the simple tests.
      
      * Fixing sharding.
      
      * Fixes for VLM.
      
      * Fixing santacoder (num_kv_heads hardcoded).
      
      * Removing more dead code.
      
      * Fixing `config.n_head`.
      
      * Stopping earlier because of `<end_of_utterance>` in idefics2.
      
      * Addresses comments.
      
      * Removing the dead code.
      
      * Fuse back mistral into FlashCausalLM.
      
      * Finish removal.
      
      * Fixing docs + causal_lm `batch_class`.
      
      * Fixing docs + causal_lm.
      
      * Add default to Gemma Causality.
      
      * Default value for gemma/gemma2.
      
      * Wrong default.
    • Adding "longrope" for Phi-3 (#2172) (#2179) · c6bcadf8
      Aaron Mihalik authored
      Adding "longrope" for phi-3
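Adding a new rope-scaling type usually comes down to recognizing it during config validation and dispatching to the matching position-embedding implementation. A minimal sketch of that validation step — the helper name and the set of supported types are assumptions for illustration, not TGI's actual code:

```python
# Hypothetical validation helper; TGI's real rope-scaling handling differs.
SUPPORTED_ROPE_TYPES = {"linear", "dynamic", "yarn", "longrope"}

def rope_scaling_type(rope_scaling):
    """Return the rope-scaling type from a model config dict, or None."""
    if rope_scaling is None:
        return None
    # Configs in the wild use either "type" or "rope_type" as the key.
    rtype = rope_scaling.get("type") or rope_scaling.get("rope_type")
    if rtype not in SUPPORTED_ROPE_TYPES:
        raise ValueError(f"Unsupported rope scaling type: {rtype!r}")
    return rtype
```

With `"longrope"` added to the supported set, a Phi-3 config carrying `{"type": "longrope"}` validates instead of being rejected.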
  2. 04 Jul, 2024 1 commit
  3. 03 Jul, 2024 5 commits
  4. 02 Jul, 2024 6 commits
  5. 01 Jul, 2024 12 commits
  6. 28 Jun, 2024 1 commit
  7. 27 Jun, 2024 6 commits
  8. 25 Jun, 2024 6 commits
    • drbh authored · be2d3803
    • Add support for Marlin 2:4 sparsity (#2102) · f1f98e36
      Daniël de Kok authored
      This change adds support for 2:4 sparsity when using Marlin
      quantization. The 2:4 kernel is used when:
      
      * The quantizer is `marlin`;
      * the quantizer checkpoint format is `marlin_24`.
      
      Fixes #2098.
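The two conditions above can be sketched as a selection predicate. The helper name is hypothetical; TGI's actual kernel dispatch is more involved:

```python
def use_marlin_24_kernel(quantizer: str, checkpoint_format: str) -> bool:
    """Return True when the 2:4 sparse Marlin kernel should be selected.

    Hypothetical helper mirroring the two conditions from the commit
    message: the quantizer is `marlin` AND the checkpoint format is
    `marlin_24`. Any other combination falls back to other kernels.
    """
    return quantizer == "marlin" and checkpoint_format == "marlin_24"
```

Only `("marlin", "marlin_24")` selects the sparse kernel; a `marlin` quantizer with a dense checkpoint, or a different quantizer entirely, does not.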
    • Support AWQ quantization with bias (#2117) · 14980df2
      Daniël de Kok authored
      When the AWQ quantizer was used with a layer that uses a bias,
      the bias tensor was not correctly passed/used. Instead, the
      value `true`/`1.0` was added to the linear transformation.
      
      Correctly pass through the bias when it is not `None`.
      
      Fixes #2106.
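The shape of the fix can be sketched with a plain-Python dense stand-in for the quantized linear layer (hypothetical function, not TGI's AWQ code, which operates on packed tensors):

```python
def quantized_linear(x, weight, bias=None):
    """Dense stand-in for an AWQ linear layer's forward pass.

    The bug described above: when a bias existed, the old path
    effectively added the truthiness value 1.0 instead of the bias
    tensor. The fix: add the actual bias values, and only when the
    bias is not None.
    """
    # y = W x, written out with plain lists for clarity.
    out = [sum(xi * wij for xi, wij in zip(x, row)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out
```

With the identity weight, `quantized_linear([1.0, 2.0], [[1, 0], [0, 1]], bias=[0.5, -0.5])` returns `[1.5, 1.5]` rather than the off-by-one `[2.0, 3.0]` the bug would have produced.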
    • Enable multiple LoRa adapters (#2010) · 04e1af94
      drbh authored
      * feat: first draft load multiple lora
      
      * feat: load weights within layer and refactor lora pass
      
      * fix: refactor and reduce lora math
      
      * feat: baseline impl single request multi lora support
      
      * feat: prefer lorax implementation and port loading logic
      
      * fix: prefer adapter_data and refactors
      
      * feat: prefer lorax's custom punica kernels and add mlp loras
      
      * fix: adjust batch for bgmv
      
      * fix: adjust adapter_segments logic when in batch
      
      * fix: refactor and move changes to v3 proto
      
      * fix: pass model_id for all flash causal lms
      
      * fix: pass model_id for all causal and seq2seq lms
      
      * fix: add model_id to model test
      
      * feat: add lora support to mistral and refactors
      
      * feat: prefer model id in request
      
      * fix: include rust code for adapter id
      
      * feat: bump launcher and add new lora docs
      
      * feat: support base model generation and refactors
      
      * fix: rename doc to retry ci build
      
      * feat: support for vlm models
      
      * fix: add adapter_data param and avoid missing layers
      
      * fix: add adapter_data param to phi and neox
      
      * fix: update all models forwards to include adapter_data
      
      * fix: add model_id to IdeficsCausalLM
      
      * Update lora.md
      
      Fixed a typo
      
      * Update lora.md
      
      Fixing spam image
      
      * fix: add lora kernel to dockerfile, support running without kernels and refactors
      
      * fix: avoid dockerfile conflict
      
      * fix: refactors and adjust flash llama lora logic
      
      * fix: skip llama test due to CI issue (temp)
      
      * fix: skip llama test CI (temp) 2
      
      * fix: revert skips and prefer updated ci token for tests
      
      * fix: refactors and helpful comments
      
      * fix: add noop in TensorParallelAdapterRowLinear too
      
      * fix: refactor and move shard_lora_weights logic
      
      * fix: exit early if no adapter_data
      
      ---------
      Co-authored-by: Derek <datavistics@gmail.com>
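The per-request adapter selection that the steps above build toward can be sketched with a plain-Python LoRA forward pass. All names here are hypothetical, and TGI's real implementation batches requests and uses punica-style custom kernels rather than per-request Python loops:

```python
def lora_linear(x, base_w, adapters, adapter_id=None):
    """One linear layer serving multiple LoRA adapters.

    y = W x + scaling * B (A x) when a request names an adapter,
    plain y = W x (base model generation) when adapter_id is None.
    `adapters` maps adapter id -> (A, B, scaling) low-rank pairs.
    """
    # Base projection y = W x, written out with lists for clarity.
    y = [sum(xi * wij for xi, wij in zip(x, row)) for row in base_w]
    if adapter_id is not None:
        A, B, scaling = adapters[adapter_id]
        # Down-projection h = A x (rank r), then up-projection B h.
        h = [sum(xi * aij for xi, aij in zip(x, row)) for row in A]
        delta = [sum(hi * bij for hi, bij in zip(h, row)) for row in B]
        y = [yi + scaling * di for yi, di in zip(y, delta)]
    return y
```

Keeping `adapter_id` optional is what lets one deployment serve both base-model requests and requests that name any of the loaded adapters, which is the behavior the bullets above describe.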
    • Fix CI. (#2118) · a2a97b05
      Nicolas Patry authored
      Fix clippy.
    • Add pytest release marker (#2114) · fc9c3153
      Daniël de Kok authored
      * Add pytest release marker
      
      Annotate a test with `@pytest.mark.release` and it only gets run
      with `pytest integration-tests --release`.
      
      * Mark many models as `release` to speed up CI
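A marker/option pair like this is conventionally wired up in `conftest.py` through standard pytest hooks. A sketch under the assumption that the usual pattern is used (the actual TGI conftest may differ):

```python
# conftest.py — register a --release flag and skip release-marked
# tests unless it is given.
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--release",
        action="store_true",
        default=False,
        help="also run tests marked with @pytest.mark.release",
    )

def pytest_configure(config):
    # Register the marker so `pytest --strict-markers` accepts it.
    config.addinivalue_line("markers", "release: release-only tests")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--release"):
        return  # --release given: run everything, including release tests
    skip_release = pytest.mark.skip(reason="needs --release option to run")
    for item in items:
        if "release" in item.keywords:
            item.add_marker(skip_release)
```

Marking the slow model tests as `release` then keeps them out of every ordinary CI run while `pytest integration-tests --release` still exercises them.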