1. 15 Jul, 2024 1 commit
    • drbh's avatar
      feat: simple mistral lora integration tests (#2180) · 5a650669
      drbh authored
      * feat: simple mistral lora integration tests
      
      * fix: include args in docker launcher
      
      * fix: disable cuda graphs with lora and warn
      
      * fix: adjust docs and precommit issues
      
      * fix: re update docs
      5a650669
  2. 05 Jul, 2024 2 commits
    • Daniël de Kok's avatar
      GPTQ CI improvements (#2151) · 67ef0649
      Daniël de Kok authored
      * Add more representative Llama GPTQ test
      
      The Llama GPTQ test is updated to use a model with the commonly-used
      quantizer config format and activation sorting. The old test is
      kept around (but renamed) since it tests the format produced by
      `text-generation-server quantize`.
      
      * Add support for manually triggering a release build
      67ef0649
    • Nicolas Patry's avatar
      Refactor dead code - Removing all `flash_xxx.py` files. (#2166) · fb2f74e2
      Nicolas Patry authored
      * Refactor dead code.
      
      * First working step.
      
      * Remove a lot of duplicated code.
      
      * More dead code.
      
      * More cleanup.
      
      * Fix Santacoder test.
      
      * Fixing the simple tests.
      
      * Fixing sharding.
      
      * Fixes for VLM.
      
      * Fixing santacoder (num_kv_heads hardcoded).
      
      * Removing more dead code.
      
      * Fixing `config.n_head`.
      
      * Stopping earlier because of `<end_of_utterance>` in idefics2.
      
      * Addresses comments.
      
      * Removing the dead code.
      
      * Fuse back mistral into FlashCausalLM.
      
      * Finish removal.
      
      * Fixing docs + causal_lm `batch_class`.
      
      * Fixing docs + causal.lm.
      
      * Add default to Gemma Causality.
      
      * Default value for gemma/gemma2.
      
      * Wrong default.
      fb2f74e2
  3. 01 Jul, 2024 1 commit
    • Daniël de Kok's avatar
      Use GPTQ-Marlin for supported GPTQ configurations (#2111) · 2ce80194
      Daniël de Kok authored
      GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
      let's use it by default if the kernels are installed, the GPU supports
      it, and the kernels support the configuration.
      
      For models generated by `text-generation-server quantize`, use
      `sym=False`. This subcommand has used asymmetric quantization since
      the beginning, and incorrectly reporting the model as symmetric would
      select GPTQ-Marlin (which does not support asymmetric quantization).
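
The selection rule described in this commit message can be sketched as below. This is only an illustration under stated assumptions, not the repository's code: `choose_gptq_kernel`, its boolean flags, and the fallback label `"gptq"` are hypothetical names.

```python
def choose_gptq_kernel(kernels_installed: bool, gpu_supported: bool,
                       config_supported: bool, sym: bool) -> str:
    """Return the kernel family for a GPTQ checkpoint (hypothetical helper)."""
    # GPTQ-Marlin is chosen only when every precondition from the commit
    # message holds. Asymmetric checkpoints (sym=False) are never routed to it.
    if kernels_installed and gpu_supported and config_supported and sym:
        return "gptq-marlin"
    return "gptq"  # assumed label for whichever non-Marlin kernel is used


# A checkpoint written by `text-generation-server quantize` reports sym=False,
# so it falls back even when everything else would allow GPTQ-Marlin.
fallback = choose_gptq_kernel(True, True, True, sym=False)
```

Reporting `sym=False` is what keeps asymmetric checkpoints away from a kernel that cannot handle them.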
      2ce80194
  4. 27 Jun, 2024 1 commit
  5. 25 Jun, 2024 2 commits
  6. 24 Jun, 2024 1 commit
    • Nicolas Patry's avatar
      New runner. Manual squash. (#2110) · 480d3b33
      Nicolas Patry authored
      * New runner. Manual squash.
      
      * Network host.
      
      * Put back trufflehog with proper extension.
      
      * No network host ?
      
      * Moving buildx install after tailscale ?
      
      * 1.79
      480d3b33
  7. 17 Jun, 2024 1 commit
    • Daniël de Kok's avatar
      Support different image sizes in prefill in VLMs (#2065) · e9037708
      Daniël de Kok authored
      When a batch contained images of different sizes during prefill, the
      server would fail (see e.g. #2056). Images were processed separately and
      then concatenated. However, this can fail for images with different sizes.
      
      Fix this by preprocessing all images in the batch together, so that the
      image processor can ensure that all image tensors have compatible sizes.
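
To make the failure mode concrete, here is a small pure-Python sketch. It is not the image processor used by the server: `pad_batch` is a hypothetical helper, and zero-padding to the largest height and width in the batch is an assumed strategy chosen only for illustration.

```python
from typing import List

Image = List[List[int]]  # a single channel, stored as rows of pixel values


def pad_batch(images: List[Image], fill: int = 0) -> List[Image]:
    """Pad every image to the largest height and width found in the batch."""
    max_h = max(len(img) for img in images)
    max_w = max(len(row) for img in images for row in img)
    padded = []
    for img in images:
        # Extend each row to the common width, then add rows up to the common height.
        rows = [row + [fill] * (max_w - len(row)) for row in img]
        rows += [[fill] * max_w for _ in range(max_h - len(rows))]
        padded.append(rows)
    return padded


small = [[1, 2], [3, 4]]                        # 2x2 image
large = [[5, 6, 7], [8, 9, 10], [11, 12, 13]]   # 3x3 image
batch = pad_batch([small, large])
# After batch-level preprocessing every image has the same shape.
shapes = {(len(img), len(img[0])) for img in batch}
```

Processing the images one by one cannot know the common target size, which is why the sizes have to be reconciled at batch level.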
      e9037708
  8. 14 Jun, 2024 1 commit
    • Daniël de Kok's avatar
      Add support for GPTQ Marlin (#2052) · 093a27c5
      Daniël de Kok authored
      Add support for GPTQ Marlin kernels
      
      GPTQ Marlin extends the Marlin kernels to support common GPTQ
      configurations:
      
      - bits: 4 or 8
      - groupsize: -1, 32, 64, or 128
      - desc_act: true/false
      
      Using the GPTQ Marlin kernels requires repacking the parameters in the
      Marlin quantizer format.
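
The supported configurations listed above can be expressed as a simple check. This is a sketch only: `GPTQConfig` and `gptq_marlin_supports` are hypothetical names, and the real code may check additional conditions (for example GPU capability).

```python
from dataclasses import dataclass

# Values copied from the list in the commit message.
SUPPORTED_BITS = {4, 8}
SUPPORTED_GROUPSIZES = {-1, 32, 64, 128}


@dataclass(frozen=True)
class GPTQConfig:  # hypothetical container for the quantizer settings
    bits: int
    groupsize: int
    desc_act: bool


def gptq_marlin_supports(cfg: GPTQConfig) -> bool:
    # desc_act may be either True or False, so it does not restrict support.
    return cfg.bits in SUPPORTED_BITS and cfg.groupsize in SUPPORTED_GROUPSIZES
```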
      
      The kernels were contributed by Neural Magic to vLLM. We vendor them
      here for convenience.
      093a27c5
  9. 11 Jun, 2024 1 commit
  10. 06 Jun, 2024 1 commit
    • Daniël de Kok's avatar
      Add support for Marlin-quantized models · 4594e6fa
      Daniël de Kok authored
      This change adds support for Marlin-quantized models. Marlin is an
      FP16xINT4 matmul kernel, which provides good speedups decoding batches
      of 16-32 tokens. It supports quantized models with symmetric
      quantization, groupsize -1 or 128, and 4-bit.
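
As a rough illustration of what "symmetric 4-bit quantization" means, the sketch below quantizes a few weights and packs them into one integer. It is an assumption-laden toy: the function names are hypothetical, the packing order is arbitrary, and Marlin's actual memory layout and kernels are different.

```python
from typing import List, Tuple


def quantize_sym4(weights: List[float]) -> Tuple[int, float]:
    """Quantize up to 8 weights symmetrically to 4 bits and pack them into one int."""
    assert len(weights) <= 8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # symmetric: 0.0 maps to integer 0
    packed = 0
    for i, w in enumerate(weights):
        q = max(-8, min(7, round(w / scale)))    # signed 4-bit integer
        packed |= (q + 8) << (4 * i)             # store it with an offset of 8
    return packed, scale


def dequantize_sym4(packed: int, scale: float, n: int) -> List[float]:
    """Unpack n 4-bit values and rescale them back to floats."""
    return [(((packed >> (4 * i)) & 0xF) - 8) * scale for i in range(n)]


weights = [0.7, -0.35, 0.0, 0.1]
packed, scale = quantize_sym4(weights)
restored = dequantize_sym4(packed, scale, len(weights))
```

Because there is no zero point, a single scale per group is enough to reconstruct the weights, which is the property the symmetric-only restriction relies on.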
      
      Tested with:
      
      - Llama 2
      - Llama 3
      - Phi 3
      4594e6fa
  11. 30 May, 2024 2 commits
    • Daniël de Kok's avatar
      Gemma GPTQ checks: skip logprob checks · 967ced2f
      Daniël de Kok authored
      This test fails somewhat regularly due to non-determinism, and its
      primary purpose is to verify that we correctly load a model which
      doesn't have `float16` as the default dtype.
      967ced2f
    • Daniël de Kok's avatar
      Add support for exl2 quantization · 36dd1601
      Daniël de Kok authored
      Mostly straightforward. Changes to existing code:
      
      * Wrap quantizer parameters in a small wrapper to avoid passing
        around untyped tuples and needing to repack them as a dict.
      * Move scratch space computation to warmup, because we need the
        maximum input sequence length to avoid allocating huge
        scratch buffers that OOM.
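
The first point, replacing untyped tuples with a small wrapper, might look roughly like this. The class and field names are hypothetical and only show the idea; they are not taken from the repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuantWeight:  # hypothetical wrapper; field names are assumptions
    q_weight: List[int]
    q_scale: List[float]
    q_groups: List[int]


def describe(weight: QuantWeight) -> str:
    # Fields are accessed by name, so reordering them cannot silently swap values
    # the way positional tuple unpacking can.
    return f"{len(weight.q_weight)} packed values, {len(weight.q_groups)} groups"


w = QuantWeight(q_weight=[1, 2, 3], q_scale=[0.5], q_groups=[0])
summary = describe(w)
```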
      36dd1601
  12. 28 May, 2024 1 commit
    • Daniël de Kok's avatar
      Fix (non-container) pytest stdout buffering-related lock-up · f20463e4
      Daniël de Kok authored
      Two issues:
      
      1. When one of the stdout/stderr pipe buffers of a process started
         with `subprocess.Popen` is full, the process can get blocked until
         the buffer is drained.
      2. Calling `Popen.wait` can deadlock when called before draining
         the pipe buffers (if they are full).
      
      This avoids the issue altogether by giving the child process a
      temporary file to write to.
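
A minimal sketch of the approach, assuming a child that writes more than a typical pipe buffer holds. The child command and variable names are illustrative and not taken from the test suite.

```python
import subprocess
import sys
import tempfile

# The child writes far more than a typical pipe buffer can hold.
child_code = "import sys; sys.stdout.write('x' * 1_000_000)"

with tempfile.TemporaryFile("w+") as log:
    proc = subprocess.Popen([sys.executable, "-c", child_code],
                            stdout=log, stderr=subprocess.STDOUT)
    # Safe: the child writes to a file, so it cannot block on a full pipe
    # and `wait` does not need the parent to drain anything first.
    returncode = proc.wait(timeout=60)
    log.seek(0)
    output = log.read()
```

With `stdout=subprocess.PIPE` instead, the same `wait` call could hang, because nothing would be reading from the pipe while the child fills it.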
      f20463e4
  13. 27 May, 2024 2 commits
  14. 24 May, 2024 1 commit
    • Nicolas Patry's avatar
      Fix seeded output. (#1949) · d32e33bd
      Nicolas Patry authored
      d32e33bd
  15. 16 May, 2024 1 commit
  16. 15 May, 2024 1 commit
    • Daniël de Kok's avatar
      Add GPT-2 with flash attention (#1889) · b5bc6e5c
      Daniël de Kok authored
      # What does this PR do?
      
      This change adds `FlashGPT2ForCausalLM` and wires it up. The model
      itself is pretty straightforward; the main differences from other
      models are that it uses trained position embeddings and that all
      weight matrices are transposed (due to the use of Conv1D in the
      upstream model).
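
The transposition can be illustrated with plain Python lists. This sketch assumes the Conv1D-style layout stores weights as (in_features, out_features) while a linear layer stores (out_features, in_features); the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def conv1d_forward(x: List[float], w_in_out: Matrix) -> List[float]:
    """Conv1D-style layout: weight is (in_features, out_features), y = x @ W."""
    return [sum(x[i] * w_in_out[i][o] for i in range(len(x)))
            for o in range(len(w_in_out[0]))]


def linear_forward(x: List[float], w_out_in: Matrix) -> List[float]:
    """Linear-style layout: weight is (out_features, in_features), y = x @ W^T."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w_out_in]


x = [1.0, 2.0]
conv_weight = [[1.0, 0.0, 2.0],   # 2 inputs -> 3 outputs
               [0.0, 1.0, 3.0]]
# Loading a Conv1D-style checkpoint into a linear layer requires a transpose.
same = conv1d_forward(x, conv_weight) == linear_forward(x, transpose(conv_weight))
```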
      
      
      b5bc6e5c
  17. 23 Apr, 2024 1 commit
    • Nicolas Patry's avatar
      Idefics2. (#1756) · bfddfa59
      Nicolas Patry authored
      bfddfa59
  18. 18 Apr, 2024 1 commit
  19. 17 Apr, 2024 1 commit
  20. 16 Apr, 2024 1 commit
  21. 12 Apr, 2024 2 commits
    • OlivierDehaene's avatar
      v2.0.0 (#1736) · c38a7d7d
      OlivierDehaene authored
      c38a7d7d
    • Nicolas Patry's avatar
      Improve the defaults for the launcher (#1727) · 1b2670c8
      Nicolas Patry authored
      # What does this PR do?
      
      - Renamed `max_input_length` to `max_input_tokens` for consistency
      (backward-compatible change; it will complain if both are set).
      - Will now use the model config to set defaults for `max_input_tokens`,
      `max_total_tokens`, and `max_batch_total_tokens`.
      - Capping the values at 16k in order to save VRAM on behalf of users
      (overridable by simply setting the values).
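
One way to read these rules is sketched below. It is not the launcher's code: `resolve_max_input_tokens` is a hypothetical function, `16_000` is an assumed reading of "16k", raising an error when both flags are set is an assumption, and the exact value the launcher derives from the model config may differ.

```python
from typing import Optional

CAP = 16_000  # assumed reading of "16k"


def resolve_max_input_tokens(max_input_tokens: Optional[int],
                             max_input_length: Optional[int],
                             model_max_positions: int) -> int:
    """Hypothetical resolution of the new flag, its legacy alias, and the default."""
    if max_input_tokens is not None and max_input_length is not None:
        # Assumed behavior for "both are set": refuse instead of guessing.
        raise ValueError("set only one of max_input_tokens / max_input_length")
    explicit = max_input_tokens if max_input_tokens is not None else max_input_length
    if explicit is not None:
        return explicit                     # user-provided values are not capped
    return min(model_max_positions, CAP)    # default derived from the model config
```

The cap applies only to the derived default, so a user who wants a longer context can still ask for it explicitly.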
      
      1b2670c8
  22. 09 Apr, 2024 1 commit
    • Nicolas Patry's avatar
      Adding Llava-Next (Llava 1.6) with full support. (#1709) · 4634b00c
      Nicolas Patry authored
      # What does this PR do?
      
      - Changed all models to extract `embed_tokens` in order to enable llava
      to separately call the embeddings and the core model layers.
      - Added VlmCausalLM to inherit from FlashMistral in order to be
      maximally supported. The only added logic sits on top and parses images
      into pixel values, preallocates input_ids space for the image
      embeddings, and passes them to the model.
      - Added Clip for the vision tower.
      - Didn't add flash for the vision tower since there's no padding anyway.
      - Added heuristic (potentially incomplete) to calculate number of
      features *before* calculating the clip patches (allows for easier logic
      reuse of the LLM under the hood).
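
The idea of preallocating `input_ids` space and then filling it with image embeddings can be sketched as follows. Everything here is hypothetical: the placeholder id, the helper names, and the use of strings as stand-ins for embedding vectors are assumptions made to keep the example small.

```python
from typing import Dict, List

IMAGE_TOKEN_ID = -1  # hypothetical placeholder id, not a real vocabulary entry


def reserve_image_slots(input_ids: List[int], num_features: int) -> List[int]:
    """Expand each image placeholder into `num_features` reserved slots."""
    out: List[int] = []
    for tok in input_ids:
        out.extend([IMAGE_TOKEN_ID] * num_features if tok == IMAGE_TOKEN_ID else [tok])
    return out


def merge_embeddings(ids: List[int], text_emb: Dict[int, str],
                     image_emb: List[str]) -> List[str]:
    """Fill reserved slots with image features and the rest with text embeddings."""
    features = iter(image_emb)
    return [next(features) if tok == IMAGE_TOKEN_ID else text_emb[tok] for tok in ids]


ids = reserve_image_slots([10, IMAGE_TOKEN_ID, 11], num_features=3)
merged = merge_embeddings(ids, {10: "t10", 11: "t11"}, ["i0", "i1", "i2"])
```

Knowing the number of features before running the vision tower is what allows the slots to be reserved up front, so the language model sees a sequence of the right length.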
      
      
      Still needs to be done:
      
      - [x] Implement the image parsing on the controller side, to avoid
      downloading images n times (once per TP shard), to refuse too-large
      requests early, and to avoid issues where truncation actually truncates
      the image.
      - [ ] Make sure it works with quantization properly.
      - [x] Make sure it works with TP>1
      
      
      
      4634b00c
  23. 04 Apr, 2024 1 commit
    • Nicolas Patry's avatar
      Add cuda graphs sizes and make it default. (#1703) · 99874eae
      Nicolas Patry authored
      # What does this PR do?
      
      ```
      text-generation-launcher --model-id XXX # Uses cuda graphs by default
      text-generation-launcher --model-id XXX --cuda-graphs "1,2"  # Restricts the number of cuda graphs, which saves VRAM
      text-generation-launcher --model-id XXX --cuda-graphs "0"  # Disables cuda graphs entirely
      ```
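
How such a flag value might be interpreted is sketched below. `parse_cuda_graphs` is a hypothetical helper, not the launcher's parser, and treating the numbers as sizes to capture is an assumption based on the PR title.

```python
from typing import List


def parse_cuda_graphs(value: str) -> List[int]:
    """Parse a comma-separated list of sizes; a 0 entry disables CUDA graphs."""
    sizes = [int(part) for part in value.split(",") if part.strip()]
    if 0 in sizes:
        return []  # "0" disables CUDA graphs entirely
    return sorted(set(sizes))
```

An empty result means no graphs are captured, while a short list limits how many graphs are captured and therefore how much VRAM they use.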
      99874eae
  24. 29 Mar, 2024 1 commit
  25. 22 Mar, 2024 2 commits
  26. 21 Mar, 2024 3 commits
    • Nicolas Patry's avatar
      Repair idefics integration tests. (#1663) · deb440b3
      Nicolas Patry authored
      deb440b3
    • drbh's avatar
      fix: improve tool type, bump pydantic and outlines (#1650) · de6cb15f
      drbh authored
      This PR resolves a couple of issues:

      - [X] adjusts the tool response to align with OpenAI's tools response
      type
      - [X] bumps pydantic to `2.6.4` in all apps (resolves dependency issue
      when running tests)
      - [X] bump `outlines` version and fix import for new name
      de6cb15f
    • drbh's avatar
      fix: prefer spaces url over temp url (#1662) · 4f09c80c
      drbh authored
      This PR fixes the broken URLs in the idefics tests that were causing CI to fail.
      4f09c80c
  27. 01 Mar, 2024 1 commit
  28. 29 Feb, 2024 1 commit
    • drbh's avatar
      fix: Handle concurrent grammar requests (#1610) · 343aa7a1
      drbh authored
      This PR fixes parallel grammar requests. Currently, grammar states are
      not concatenated correctly when a new request is added to the batch,
      which results in incorrect generation. This PR updates the `concatenate`
      function to correctly include the previous states.
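
The kind of fix described here can be illustrated with a heavily simplified batch. The `Batch` class and its fields are hypothetical and stand in for the real batch type; only the idea of carrying per-request grammar state through concatenation is taken from the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Batch:  # hypothetical, heavily simplified batch
    request_ids: List[int] = field(default_factory=list)
    grammar_states: List[int] = field(default_factory=list)


def concatenate(batches: List[Batch]) -> Batch:
    merged = Batch()
    for batch in batches:
        merged.request_ids.extend(batch.request_ids)
        # Carry over the existing state of every request instead of resetting it.
        merged.grammar_states.extend(batch.grammar_states)
    return merged


running = Batch(request_ids=[1], grammar_states=[7])   # already mid-generation
incoming = Batch(request_ids=[2], grammar_states=[0])  # new request, initial state
merged = concatenate([running, incoming])
```

If the merged batch dropped or reset the state of request 1, its next token would be constrained as if generation had just started, which matches the incorrect output the PR describes.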
      
      fixes: #1601
      343aa7a1
  29. 28 Feb, 2024 4 commits