1. 08 Aug, 2024 3 commits
  2. 07 Aug, 2024 1 commit
  3. 06 Aug, 2024 3 commits
  4. 05 Aug, 2024 1 commit
    • fix: attempt forward on flash attn2 to check hardware support (#2335) · 215ed3ad
      drbh authored
      * fix: attempt forward on flash attn2 to check hardware support
      
      * fix: warn window_size_left when using flash attn 1
      
      * fix: prefer version check over test op and avoid window_size_left if not flash attn2
      
      * fix: improve conditional and error message
      
      * fix: update sliding window conditional
      
      * fix: simplify changes and revert model changes
      
      * fix: avoid changing conditional
      
      * fix: typo tweak
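      The probe described above can be sketched as: attempt a tiny forward pass through
      FlashAttention-2 and treat any failure as missing hardware support, falling back to
      flash attn 1 otherwise. A minimal sketch, assuming the `flash_attn` package's
      `flash_attn_func` entry point; the helper name is hypothetical and this is not the
      repository's actual check.

      import torch

      def supports_flash_attn_v2() -> bool:
          # Probe FlashAttention-2 with a tiny forward pass; any import or
          # kernel error (e.g. unsupported GPU architecture) is treated as
          # "not supported" so the caller can fall back.
          try:
              from flash_attn import flash_attn_func
          except ImportError:
              return False
          try:
              q = torch.randn(1, 1, 1, 64, dtype=torch.float16, device="cuda")
              flash_attn_func(q, q, q)
              return True
          except Exception:
              return False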
  5. 01 Aug, 2024 2 commits
  6. 31 Jul, 2024 2 commits
  7. 30 Jul, 2024 1 commit
  8. 29 Jul, 2024 2 commits
  9. 26 Jul, 2024 2 commits
    • feat: add ruff and resolve issue (#2262) · bab02ff2
      drbh authored
      * feat: add ruff and resolve issue
      
      * fix: update client exports and adjust after rebase
      
      * fix: adjust syntax to avoid circular import
      
      * fix: adjust client ruff settings
      
      * fix: lint and refactor import check and avoid model enum as global names
      
      * fix: improve fbgemm_gpu check and lints
      
      * fix: update lints
      
      * fix: prefer comparing model enum over str
      
      * fix: adjust lints and ignore specific rules
      
      * fix: avoid unneeded quantize check
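      One of the fixes above, avoiding a circular import, usually comes down to deferring an
      import from module scope to call time. A minimal sketch of that pattern; the module and
      class names are illustrative, not the repository's.

      def get_default_client():
          # Importing lazily breaks the cycle: loading this module no longer
          # pulls in the client package, which in turn imports this module.
          from my_package.client import Client  # illustrative import path

          return Client(base_url="http://127.0.0.1:8080")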
  10. 25 Jul, 2024 1 commit
  11. 24 Jul, 2024 4 commits
  12. 23 Jul, 2024 5 commits
  13. 22 Jul, 2024 3 commits
  14. 21 Jul, 2024 1 commit
  15. 20 Jul, 2024 1 commit
    • feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248) · 53ec0b79
      OlivierDehaene authored
      * feat(fp8): add support for fbgemm
      
      * allow loading fp8 weights directly
      
      * update outlines
      
      * fix makefile
      
      * build fbgemm
      
      * avoid circular import and fix dockerfile
      
      * add default dtype
      
      * refactored weights loader
      
      * fix auto conversion
      
      * fix quantization config parsing
      
      * force new nccl on install
      
      * missing get_weights implementation
      
      * increase timeout
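      To illustrate what loading fp8 weights involves, here is a minimal per-tensor FP8 (e4m3)
      quantization sketch in PyTorch (assuming a recent torch with `torch.float8_e4m3fn`):
      scale the tensor so its largest magnitude fits the FP8 range, cast, and keep the scale
      for dequantization at matmul time. This shows only the general idea, not the fbgemm
      kernel path added in this PR.

      import torch

      def fp8_quantize(weight: torch.Tensor):
          # Per-tensor scaling: map the largest magnitude onto the FP8 e4m3 range.
          finfo = torch.finfo(torch.float8_e4m3fn)
          scale = finfo.max / weight.abs().max().clamp(min=1e-12)
          qweight = (weight * scale).clamp(min=finfo.min, max=finfo.max)
          qweight = qweight.to(torch.float8_e4m3fn)
          # Return the inverse scale, needed to dequantize (or to fold into the matmul).
          return qweight, scale.reciprocal().float()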
  16. 19 Jul, 2024 6 commits
    • Add support for Deepseek V2 (#2224) · e52be9bb
      Daniël de Kok authored
      Deepseek V2 is a MoE model from Deepseek. Relevant differences
      compared to other models:
      
      - Grouped top-K in expert selection (a sketch follows below).
      - mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
        configuration options.
      - `mscale_all_dim` is also used in scaling attention softmax.
      - Permuting of the query/key representations before applying rotary
        embeddings.
      - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
        So, we need weight loaders that support quantized weights. To this
        end, `{Weights,WeightLoader}.get_weight` was added.
      - The query/key head dimensionality differs from that of the value,
        so we need to pad during attention.
      - Heads with size 192 need an extension to our paged attention
        fork, and we need to ensure that the KV cache is allocated with the
        correct size.
      - Shared experts.
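      A rough sketch of the grouped top-K routing mentioned in the list above, assuming router
      scores of shape `[num_tokens, num_experts]`; the parameter names (`n_groups`,
      `topk_groups`, `top_k`) are illustrative and this is not the repository's implementation.

      import torch

      def grouped_topk(scores: torch.Tensor, n_groups: int, topk_groups: int, top_k: int):
          # Experts are split into n_groups groups (num_experts must be divisible
          # by n_groups). Each group is ranked by its best expert score, only the
          # topk_groups best groups are kept, and the final top_k experts are
          # chosen among the surviving groups.
          num_tokens, num_experts = scores.shape
          group_scores = scores.view(num_tokens, n_groups, -1).max(dim=-1).values
          group_idx = group_scores.topk(topk_groups, dim=-1).indices
          group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
          score_mask = (
              group_mask.unsqueeze(-1)
              .expand(num_tokens, n_groups, num_experts // n_groups)
              .reshape(num_tokens, num_experts)
          )
          masked_scores = scores.masked_fill(score_mask == 0, float("-inf"))
          topk_weights, topk_ids = masked_scores.topk(top_k, dim=-1)
          return topk_weights, topk_ids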
    • Daniël de Kok authored · 3b41e93a
    • Daniël de Kok authored · 18db78f2
    • Improve the handling of quantized weights (#2250) · ba291dad
      Daniël de Kok authored
      * Improve the handling of quantized weights
      
      Handling of quantized weights was split between two mechanisms:
      
      - For quantized checkpoints, we used the new weight loader
        infrastructure.
      - For quantization while loading (EETQ, FP8, bitsandbytes) we
        instead relied on conditionals in `get_linear`.
      
      Weight loaders support context managers to selectively load
      particular layers with different weight loaders, which is useful
      for models like Idefics2 AWQ, which uses a quantized text model,
      but unquantized vision and connector models. However, the context
      manager would be overridden by `get_linear`, which string-checks
      `quantizer`. Also, the context manager would not work with
      EETQ, FP8, and bitsandbytes.
      
      This change migrates all quantizers to the weight loader infrastructure.
      This has several benefits:
      
      - We can use context managers with all quantizers.
      - All the implementation details move down to the quantizer layers;
        `get_linear` does not need to know how to handle quantized linear
        layers.
      - All quantizer weights are strongly typed; we don't pass around
        raw tensors.
      - We don't have to pass around the `quantizer` string everywhere.
      
      * Exclude non-MLP layers when using FP8 quantization with Llama
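      A minimal sketch of the context-manager pattern described above: a `Weights` object holds
      a default weights loader, and selected layers can temporarily be loaded with a different
      one (e.g. an unquantized vision tower next to an AWQ-quantized text model). The names are
      illustrative and simplified relative to the actual classes.

      from contextlib import contextmanager

      class Weights:
          def __init__(self, loader):
              # The loader encapsulates all quantizer-specific details, so callers
              # never branch on a quantization string or pass around raw tensors.
              self.loader = loader

          @contextmanager
          def use_loader(self, loader):
              # Temporarily swap the active loader for a subset of layers.
              previous = self.loader
              self.loader = loader
              try:
                  yield
              finally:
                  self.loader = previous

          def get_weights(self, prefix: str):
              return self.loader.load(self, prefix)

      # Usage sketch (loader classes are hypothetical):
      #   weights = Weights(AwqWeightsLoader(...))
      #   with weights.use_loader(UnquantizedWeightsLoader(...)):
      #       vision = weights.get_weights("vision_model")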
  17. 18 Jul, 2024 1 commit
  18. 16 Jul, 2024 1 commit