  1. 23 Jul, 2024 5 commits
  2. 22 Jul, 2024 6 commits
  3. 21 Jul, 2024 1 commit
  4. 20 Jul, 2024 3 commits
  5. 19 Jul, 2024 9 commits
    • Add support for Deepseek V2 (#2224) · e52be9bb
      Daniël de Kok authored
      Deepseek V2 is a MoE model from Deepseek. Relevant variations
      compared to other models:
      
      - Grouped top-K in expert selection (a rough sketch follows this
        list).
      - mscale in YaRN is calculated using the `mscale` and `mscale_all_dim`
        configuration options.
      - `mscale_all_dim` is also used in scaling attention softmax.
      - Permutation of the query/key representations before applying rotary
        embeddings.
      - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`),
        so we need weight loads that support quantized weights. To this
        end, `{Weights,WeightLoader}.get_weight` was added.
      - The query/key head dimensionality differs from that of the value,
        so we need to pad during attention.
      - Heads of size 192 need an extension to our paged attention
        fork, and we need to ensure that the KV cache is allocated with
        the correct size.
      - Shared experts.
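
      A rough sketch of the grouped top-K routing, assuming a
      PyTorch-style router; the function name and arguments are
      illustrative, not the actual TGI implementation:

      ```python
      import torch

      def grouped_topk(scores, n_group, topk_group, top_k):
          # scores: [tokens, n_experts] non-negative routing scores
          # (e.g. post-softmax); experts form `n_group` contiguous groups.
          tokens, n_experts = scores.shape
          group_scores = scores.view(tokens, n_group, -1).max(dim=-1).values
          # Keep the `topk_group` best groups...
          group_idx = torch.topk(group_scores, k=topk_group, dim=-1).indices
          group_mask = torch.zeros_like(group_scores)
          group_mask.scatter_(1, group_idx, 1.0)
          # ...and zero out experts in discarded groups before the final top-k.
          expert_mask = (
              group_mask.unsqueeze(-1)
              .expand(tokens, n_group, n_experts // n_group)
              .reshape(tokens, n_experts)
          )
          masked = scores.masked_fill(expert_mask == 0, 0.0)
          return torch.topk(masked, k=top_k, dim=-1)
      ```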
    • fix: adjust default tool choice (#2244) · 68a9685f
      drbh authored
      * fix: adjust default tool choice
      
      * feat: improve tool choice syntax and response parsing/errors
      
      * fix: remove dev tests
      
      * feat: add ToolChoice to docs
    • add usage stats to toctree (#2260) · 40f5dc3e
      Erik Kaunismäki authored
      quick fix
    • usage stats and crash reports (#2220) · 4c19593a
      Erik Kaunismäki authored
      
      
      * draft of usage stats
      
      * fix wrong link
      
      * launcher doesn't need sysinfo dep
      
      * only tokenizer class instead of whole struct
      
      * unused import
      
      * fix clippy errors
      
      * update openAPI doc
      
      * cargo fmt
      
      * fix error in passing flags to router
      
      * try again to update docs
      
      * run pre-commit locally
      
      * Update router/src/main.rs
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      
      * Update router/src/main.rs
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      
      * on crash use anonymous error event
      
      * delete json_output and ngrok
      
      * more robust way of checking if is in container
      
      * more robust nvidia smi
      
      * parse xpu more robustly
      
      * fix errors
      
      * add nvidia-smi details in docs
      
      * cargo fmt
      
      * fix clippy
      
      * should make docs check pass
      
      * Update router/src/usage_stats.rs
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      
      * error reason can't be in nested json
      
      * cargo fmt
      
      ---------
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>
    • 3b41e93a
      Daniël de Kok authored
    • 18db78f2
      Daniël de Kok authored
    • Improve the handling of quantized weights (#2250) · ba291dad
      Daniël de Kok authored
      * Improve the handling of quantized weights
      
      Handling of quantized weights was split between two mechanisms:

      - For quantized checkpoints, we used the new weight loader
        infrastructure.
      - For quantization while loading (EETQ, FP8, bitsandbytes), we
        instead relied on conditionals in `get_linear`.

      Weight loaders support context managers to selectively load
      particular layers with different weight loaders. This is useful
      for models like Idefics2 AWQ, which uses a quantized text model
      but unquantized vision and connector models (see the sketch at the
      end of this entry). However, the context manager would be
      overridden by `get_linear`, which string-checks `quantizer`, and
      it would not work with EETQ, FP8, or bitsandbytes.
      
      This change migrates all quantizers to the weight loader infrastructure.
      This has several benefits:
      
      - We can use context managers with all quantizers.
      - All the implementation details move down to the quantizer layers;
        `get_linear` does not need to know how to handle quantized linear
        layers.
      - All quantizer weights are strongly typed; we don't pass around
        raw tensors.
      - We don't have to pass around the `quantizer` string everywhere.
      
      * Exclude non-MLP layers when using FP8 quantization with Llama
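
      A rough illustration of the context-manager pattern described
      above; `use_loader` and the surrounding names are assumptions made
      for this sketch, not necessarily TGI's exact API:

      ```python
      from contextlib import contextmanager

      class Weights:
          """Low-level tensor loading; quantizer-specific processing is
          delegated to the active weights loader (simplified)."""

          def __init__(self, weights_loader):
              self.weights_loader = weights_loader

          @contextmanager
          def use_loader(self, weights_loader):
              # Temporarily swap the loader, e.g. to load Idefics2's vision
              # and connector models unquantized while the text model is AWQ.
              old_loader = self.weights_loader
              self.weights_loader = weights_loader
              try:
                  yield
              finally:
                  self.weights_loader = old_loader

          def get_weights_col(self, prefix: str):
              # Delegate quantizer-specific handling to the active loader.
              return self.weights_loader.get_weights_col(self, prefix)

      # Usage (hypothetical):
      # with weights.use_loader(DefaultWeightsLoader()):
      #     vision_model = load_vision_tower(weights)
      ```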
  6. 18 Jul, 2024 1 commit
  7. 16 Jul, 2024 3 commits
  8. 15 Jul, 2024 3 commits
  9. 12 Jul, 2024 2 commits
  10. 11 Jul, 2024 2 commits
  11. 09 Jul, 2024 4 commits
    • Move quantized weight handling out of the `Weights` class (#2194) · 8511669c
      Daniël de Kok authored
      Quantized weights were loaded in the `Weights` class, but this was
      getting quite unwieldy: every higher-level method to load weights
      was a long conditional covering all the different quantizers.
      
      This change moves loading of quantized weights out of the `Weights`
      class. This is done by defining a simple `WeightsLoader` interface
      that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
      and `MarlinWeightsLoader`. These implementations are in the quantizers'
      respective modules. The `Weights` class provides the low-level load
      operations (such as loading tensors or sharded tensors), but delegates
      loads that need quantizer-specific weight processing to a loader. The
      loaders still use the low-level functionality provided by `Weights`.
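
      A condensed sketch of this delegation, with simplified signatures
      that do not mirror TGI's exact interface:

      ```python
      from abc import ABC, abstractmethod

      class WeightsLoader(ABC):
          """Interface for quantizer-specific weight processing."""

          @abstractmethod
          def get_weights_col(self, weights: "Weights", prefix: str):
              ...

      class MarlinWeightsLoader(WeightsLoader):
          def get_weights_col(self, weights, prefix):
              # Only the Marlin-specific assembly lives here; the raw
              # tensor loads come from the low-level `Weights` methods.
              B = weights.get_sharded(f"{prefix}.B", dim=1)
              s = weights.get_sharded(f"{prefix}.s", dim=1)
              return {"B": B, "s": s}

      class Weights:
          def __init__(self, loader: WeightsLoader):
              self.loader = loader

          def get_sharded(self, name: str, dim: int):
              ...  # low-level sharded tensor loading elided

          def get_weights_col(self, prefix: str):
              # Delegate quantizer-specific processing to the loader.
              return self.loader.get_weights_col(self, prefix)
      ```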
      
      I initially tried making a hierarchy where a class like `GPTQWeights`
      would inherit from `Weights`. However, it was not very flexible
      (e.g., it does not work well with the new weight storage mock used
      in tests), and the implicit indirections made the code harder to follow.
    • Updating the self check (#2209) · 4c976fb4
      Nicolas Patry authored
      * Updating the self check
      
      * Fix.
      
      * Revert the CLI.
      
      * cli.
      
      * Space.
      
      * Revert cargo update.
    • Fixed README ToC (#2196) · f5ba9bfd
      vinkamath authored
      
      Co-authored-by: Vinayak Kamath <Vinayak.Kamath@target.com>
    • Adding sanity check to openapi docs. · fe710af2
      Nicolas Patry authored
  12. 08 Jul, 2024 1 commit