1. 29 Aug, 2024 1 commit
  2. 28 Aug, 2024 1 commit
  3. 16 Aug, 2024 3 commits
  4. 09 Aug, 2024 2 commits
  5. 08 Aug, 2024 2 commits
  6. 05 Aug, 2024 1 commit
  7. 31 Jul, 2024 1 commit
  8. 29 Jul, 2024 1 commit
    • Run ci api key (#2315) · 583d37a2
      Erik Kaunismäki authored

      * Add API_Key for Auth and conditionally add authorisation for non info/health endpoints.
      
      * change name to info routes
      
      * Fix comment
      
      * convert strings to lowercase for case insensitive comparison
      
      * convert header to string
      
      * fixes and update docs
      
      * update docs again
      
      * revert wrong update
      
      ---------
      Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>
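      For context, a minimal client-side sketch of how such an API key would be supplied, assuming the router compares a bearer token from the `Authorization` header (case-insensitively, per the commits above) against its configured key; the URL, key, and payload here are illustrative:

      ```python
      import requests

      # Hypothetical values: the real key is whatever the server was configured with.
      API_KEY = "my-secret-key"
      BASE_URL = "http://localhost:3000"

      # Info/health routes stay open; generation routes require the key.
      resp = requests.post(
          f"{BASE_URL}/generate",
          headers={"Authorization": f"Bearer {API_KEY}"},
          json={"inputs": "Hello", "parameters": {"max_new_tokens": 10}},
      )
      print(resp.json())
      ```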
  9. 23 Jul, 2024 1 commit
  10. 19 Jul, 2024 3 commits
    • Add support for Deepseek V2 (#2224) · e52be9bb
      Daniël de Kok authored
      Deepseek V2 is a MoE model from Deepseek. Relevant variations
      compared to other models:
      
      - Grouped top-K in expert selection (see the sketch after this list).
      - mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
        configuration options.
      - `mscale_all_dim` is also used in scaling attention softmax.
      - Permuting of the query/key representations before applying rotary
        embeddings.
      - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`),
        so we need weight loading that supports quantized weights. To this
        end, `{Weights,WeightLoader}.get_weight` was added.
      - The query/key head dimensionality differs from that of the value,
        so we need to pad during attention.
      - Heads of size 192 need an extension to our paged attention
        fork, and we need to ensure that the KV cache is allocated with the
        correct size.
      - Shared experts.
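
      A minimal sketch of the grouped top-K routing idea, assuming router scores of shape `[tokens, n_experts]` split evenly into groups: the best-scoring groups are chosen first, and experts are then selected only within them. Names and shapes are illustrative, not the exact TGI implementation.

      ```python
      import torch

      def grouped_topk(scores: torch.Tensor, n_groups: int, topk_groups: int, top_k: int):
          """Pick top_k experts, restricted to the topk_groups best-scoring groups."""
          n_tokens, n_experts = scores.shape
          # Score each group by its best expert.
          group_scores = scores.view(n_tokens, n_groups, -1).max(dim=-1).values
          group_idx = group_scores.topk(topk_groups, dim=-1).indices
          # Mask out experts that belong to non-selected groups.
          group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
          expert_mask = (
              group_mask.unsqueeze(-1)
              .expand(n_tokens, n_groups, n_experts // n_groups)
              .reshape(n_tokens, n_experts)
          )
          masked = scores.masked_fill(expert_mask == 0, float("-inf"))
          return masked.topk(top_k, dim=-1)
      ```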
    • add usage stats to toctree (#2260) · 40f5dc3e
      Erik Kaunismäki authored
      quick fix
    • usage stats and crash reports (#2220) · 4c19593a
      Erik Kaunismäki authored

      * draft of usage stats
      
      * fix wrong link
      
      * launcher doesn't need sysinfo dep
      
      * only tokenizer class instead of whole struct
      
      * unused import
      
      * fix clippy errors
      
      * update openAPI doc
      
      * cargo fmt
      
      * fix error in passing flags to router
      
      * try again to update docs
      
      * run pre-commit locally
      
      * Update router/src/main.rs
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      
      * Update router/src/main.rs
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      
      * on crash use anonymous error event
      
      * delete json_output and ngrok
      
      * more robust way of checking if is in container
      
      * more robust nvidia smi
      
      * parse xpu more robustly
      
      * fix errors
      
      * add nvidia-smi details in docs
      
      * cargo fmt
      
      * fix clippy
      
      * should make docs check pass
      
      * Update router/src/usage_stats.rs
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      
      * error reason can't be in nested json
      
      * cargo fmt
      
      ---------
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>
  11. 09 Jul, 2024 2 commits
  12. 08 Jul, 2024 1 commit
  13. 05 Jul, 2024 1 commit
    • Refactor dead code - Removing all `flash_xxx.py` files. (#2166) · fb2f74e2
      Nicolas Patry authored
      * Refactor dead code.
      
      * First working step.
      
      * Remove a lot of duplicated code.
      
      * More dead code.
      
      * More cleanup.
      
      * Fix Santacoder test.
      
      * Fixing the simple tests.
      
      * Fixing sharding.
      
      * Fixes for VLM.
      
      * Fixing santacoder (num_kv_heads hardcoded).
      
      * Removing more dead code.
      
      * Fixing `config.n_head`.
      
      * Stopping earlier because of `<end_of_utterance>` in idefics2.
      
      * Addresses comments.
      
      * Removing the dead code.
      
      * Fuse back mistral into FlashCausalLM.
      
      * Finish removal.
      
      * Fixing docs + causal_lm `batch_class`.
      
      * Fixing docs + causal_lm.
      
      * Add default to Gemma Causality.
      
      * Default value for gemma/gemma2.
      
      * Wrong default.
  14. 04 Jul, 2024 1 commit
  15. 27 Jun, 2024 2 commits
  16. 25 Jun, 2024 3 commits
    • Enable multiple LoRa adapters (#2010) · 04e1af94
      drbh authored

      * feat: first draft load multiple lora
      
      * feat: load weights within layer and refactor lora pass
      
      * fix: refactor and reduce lora math
      
      * feat: baseline impl single request multi lora support
      
      * feat: prefer lorax implementation and port loading logic
      
      * fix: prefer adapter_data and refactors
      
      * feat: prefer lorax's custom punica kernels and add mlp loras
      
      * fix: adjust batch for bgmv
      
      * fix: adjust adapter_segments logic when in batch
      
      * fix: refactor and move changes to v3 proto
      
      * fix: pass model_id for all flash causal lms
      
      * fix: pass model_id for all causal and seq2seq lms
      
      * fix: add model_id to model test
      
      * feat: add lora support to mistral and refactors
      
      * feat: prefer model id in request
      
      * fix: include rust code for adapter id
      
      * feat: bump launcher and add new lora docs
      
      * feat: support base model generation and refactors
      
      * fix: rename doc to retry ci build
      
      * feat: support for vlm models
      
      * fix: add adapter_data param and avoid missing layers
      
      * fix: add adapter_data param to phi and neox
      
      * fix: update all models forwards to include adapter_data
      
      * fix: add model_id to IdeficsCausalLM
      
      * Update lora.md
      
      Fixed a typo
      
      * Update lora.md
      
      Fixing spam image
      
      * fix: add lora kernel to dockerfile, support running without kernels and refactors
      
      * fix: avoid dockerfile conflict
      
      * fix: refactors and adjust flash llama lora logic
      
      * fix: skip llama test due to CI issue (temp)
      
      * fix: skip llama test CI (temp) 2
      
      * fix: revert skips and prefer updated ci token for tests
      
      * fix: refactors and helpful comments
      
      * fix: add noop in TensorParallelAdapterRowLinear too
      
      * fix: refactor and move shard_lora_weights logic
      
      * fix: exit early if no adapter_data
      
      ---------
      Co-authored-by: Derek <datavistics@gmail.com>
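      A rough usage sketch, assuming the launcher loads adapters at startup and requests select one via an `adapter_id` parameter as the new lora docs describe; the flag name, adapter names, and port are assumptions:

      ```python
      import requests

      # Server started with something like:
      #   text-generation-launcher --model-id base/model \
      #       --lora-adapters org/adapter-a,org/adapter-b
      resp = requests.post(
          "http://localhost:3000/generate",
          json={
              "inputs": "What is Deep Learning?",
              # Route this request through one of the loaded LoRA adapters.
              "parameters": {"max_new_tokens": 64, "adapter_id": "org/adapter-a"},
          },
      )
      print(resp.json()["generated_text"])
      ```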
    • Add OTLP Service Name Environment Variable (#2076) · 1869ee2f
      KevinDuffy94 authored
      * Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069
      
      * Update Docs
      
      * Update README.md
      
      * Update Launcher Docs
      
      * Update Launcher Docs
      Removing Option
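      As a hedged illustration of the new variable (treat the exact name `OTLP_SERVICE_NAME` and its value as assumptions; the launcher binary name is real):

      ```python
      import os
      import subprocess

      # Sketch only: set the OTLP service name before launching TGI so that
      # traces are reported under a custom service name.
      env = {**os.environ, "OTLP_SERVICE_NAME": "tgi-production-router"}
      subprocess.run(
          ["text-generation-launcher", "--model-id", "my-org/my-model"],
          env=env,
      )
      ```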
    • Support `HF_TOKEN` environment variable (#2066) · 3447c722
      Lucain authored

      * Support HF_TOKEN environment variable
      
      * Load test.
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
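      A sketch of the intended resolution order, assuming the legacy `HUGGING_FACE_HUB_TOKEN` variable remains honored as a fallback:

      ```python
      import os

      def resolve_hf_token() -> str | None:
          # Prefer the standardized HF_TOKEN; fall back to the older variable.
          return os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN")
      ```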
  17. 14 Jun, 2024 2 commits
  18. 06 Jun, 2024 1 commit
    • Add support for Marlin-quantized models · 4594e6fa
      Daniël de Kok authored
      This change adds support for Marlin-quantized models. Marlin is an
      FP16xINT4 matmul kernel, which provides good speedups when decoding
      batches of 16-32 tokens. It supports quantized models with symmetric
      quantization, group size -1 or 128, and 4-bit weights.
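      Restated as a hedged sketch of the compatibility constraints above (field names are illustrative, loosely following GPTQ-style quantization configs):

      ```python
      def is_marlin_compatible(quant_cfg: dict) -> bool:
          """Constraints from the commit message: 4-bit, symmetric quantization,
          group size -1 (per-column) or 128."""
          return (
              quant_cfg.get("bits") == 4
              and quant_cfg.get("sym", False)
              and quant_cfg.get("group_size") in (-1, 128)
          )
      ```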
      
      Tested with:
      
      - Llama 2
      - Llama 3
      - Phi 3
  19. 04 Jun, 2024 1 commit
    • fix: update triton implementation reference (#2002) · fec0167a
      Emmanuel Ferdman authored
      PR #1986 moved the `flash_attn_triton.py` file. This PR adjusts the
      references to its new location accordingly.

      Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
  20. 31 May, 2024 2 commits
  21. 30 May, 2024 1 commit
    • Add support for exl2 quantization · 36dd1601
      Daniël de Kok authored
      Mostly straightforward, changes to existing code:
      
      * Wrap quantizer parameters in a small wrapper to avoid passing
        around untyped tuples and needing to repack them as a dict
        (see the sketch after this list).
      * Move scratch space computation to warmup, because we need the
        maximum input sequence length to avoid allocating huge
        scratch buffers that OOM.
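
      A minimal sketch of such a wrapper, with field names assumed from the usual exl2 tensor layout rather than taken from this commit:

      ```python
      from dataclasses import dataclass

      import torch

      @dataclass
      class Exl2Weight:
          """Typed container for exl2 quantizer parameters, replacing untyped
          tuples/dicts passed between layers (field names are assumptions)."""
          q_weight: torch.Tensor
          q_scale: torch.Tensor
          q_scale_max: torch.Tensor
          q_groups: torch.Tensor
          q_invperm: torch.Tensor
      ```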
  22. 28 May, 2024 1 commit
    • Upgrade to Axum 0.7 and Hyper 1.0 (Breaking change: disabled ngrok tunneling). (#1959) · e76b9824
      Nicolas Patry authored
      - Axum upgraded to Hyper 1.0, and most of the ecosystem has switched,
      so it's our turn now.
      - [ngrok-rust](https://github.com/ngrok/ngrok-rust/pull/137/files)
      hasn't yet, and hasn't for several months now, so let's disable the
      feature for the time being.
  23. 27 May, 2024 1 commit
  24. 23 May, 2024 1 commit
  25. 22 May, 2024 1 commit
    • Creating doc automatically for supported models. (#1929) · 2f243a1a
      Nicolas Patry authored
  26. 21 May, 2024 1 commit
    • docs: Fix grafana dashboard url (#1925) · 904ff369
      Junlin Zhou authored
      Fixes an incorrect URL in the monitoring doc.
  27. 20 May, 2024 1 commit
  28. 17 May, 2024 1 commit