1. 12 Jul, 2023 3 commits
    • feat(server): Implements sharding for non divisible `vocab_size`. (#583) · 67347950
      Nicolas Patry authored
      - The code is relatively easy (just disable the checks on Embedding and
      Head).
      
      This cannot be done as easily for hidden_dim/head_dim. It is relatively
      easy on some models (classic MHA), but it would make the other models
      (MQA) much more complex, and GPTQ quantization is another quite hairy
      piece of code.
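      
      As an illustration of what sharding a non-divisible `vocab_size` involves,
      here is a minimal sketch of per-rank slice bounds using ceiling division.
      The helper name `shard_bounds` and its parameters are hypothetical and
      are not taken from the TGI code; the commit itself only says that the
      divisibility checks on Embedding and Head were disabled.

```python
from typing import Tuple


def shard_bounds(size: int, rank: int, world_size: int) -> Tuple[int, int]:
    """Return the [start, stop) row range owned by `rank` (hypothetical helper)."""
    # Ceiling division: every row is covered even when `size` is not a
    # multiple of `world_size`; the last shard simply ends up shorter.
    block_size = (size + world_size - 1) // world_size
    start = min(rank * block_size, size)
    stop = min(start + block_size, size)
    return start, stop


if __name__ == "__main__":
    # Example: 32003 vocabulary rows spread over 4 shards.
    for rank in range(4):
        print(rank, shard_bounds(32003, rank, 4))
```

      With this scheme the shards can have unequal lengths, which is presumably
      why the same trick is harder for hidden_dim/head_dim, where downstream
      shapes have to line up across ranks.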
    • fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590) · 2c4bf882
      ssmi153 authored
      # What does this PR do?
      
      This fixes a typo and extends the GPTQ_BITS environment variables
      through to the second method which requires the same logic. Please let
      me know if there's anything I've misunderstood in this change.
      
      Thanks @Narsil for the original fix.
    • feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580) · 5bd2ab65
      Nicolas Patry authored
      # What does this PR do?
      
      Some models are already converted and do not have those values in the
      file; this enables users to use them with less friction.
      
      Went for a purely env-based approach because adding flags would end up
      (imo) very tedious to maintain. There's a lot of sanitization to do:
      those flags would be errors if not used in conjunction with
      `--quantize gptq`. The flags would then need to exist in both the
      launcher and the server, and be passed through every function call.
      
      This PR is intended as an easy escape hatch, not the de facto method to
      use GPTQ in TGI.
      
      Fixes #500
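      
      The escape hatch described above can be sketched as follows: use the
      values stored with the model when they exist, and otherwise fall back to
      the `GPTQ_BITS` and `GPTQ_GROUPSIZE` environment variables. This is only
      an illustrative sketch; the function `resolve_gptq_params` and the file
      keys `gptq_bits` / `gptq_groupsize` are assumptions made for the example,
      not the server's actual code.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_gptq_params(
    file_values: Mapping[str, int],
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[int, int]:
    """Return (bits, groupsize), preferring file values over the environment."""
    env = os.environ if env is None else env

    def lookup(file_key: str, env_key: str) -> int:
        # Hypothetical file keys: values shipped with the model win.
        if file_key in file_values:
            return int(file_values[file_key])
        # Fallback for already-converted models that lack the values.
        if env_key in env:
            return int(env[env_key])
        raise RuntimeError(
            f"{file_key} not found in the model file and {env_key} is not set"
        )

    bits = lookup("gptq_bits", "GPTQ_BITS")
    groupsize = lookup("gptq_groupsize", "GPTQ_GROUPSIZE")
    return bits, groupsize
```

      Keeping both lookups in one helper means every code path that needs the
      values applies the same fallback, which is the kind of duplication the
      follow-up fix in #590 had to address.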
  2. 30 Jun, 2023 1 commit
  3. 26 Jun, 2023 1 commit
    • feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438) · aefde28b
      Nicolas Patry authored
      Let's start discussing implementation.
      
      - Need to expose the quantization scripts (either include them here or
      add docs on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
      - Make sure GPTQ works for multiple models (priority to Falcon).
      
      Currently it means that every place where we use `get_{tensor|sharded}`
      has to check for quantization.
      
      My idea is to reintegrate as much as possible into `utils/layer.py` by
      expanding `load_multi` to be a bit more generic.
      This might require some thinking, but ultimately the
      `qweight,qzeros,scales,g_idx` should be in a single place, and
      independent of bias presence.
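      
      One way to picture the "single place" idea is a small container that
      always groups the four GPTQ tensors, with the bias loaded separately and
      optionally. This is a hedged sketch only: `GPTQWeight`,
      `load_gptq_linear`, and the `get_tensor` / `has_tensor` callables are
      hypothetical names chosen for the example and do not describe the actual
      `load_multi` implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

# The four tensors named in the PR description.
GPTQ_SUFFIXES = ("qweight", "qzeros", "scales", "g_idx")


@dataclass
class GPTQWeight:
    """Groups the quantized tensors so callers never handle them one by one."""
    qweight: Any
    qzeros: Any
    scales: Any
    g_idx: Any


def load_gptq_linear(
    get_tensor: Callable[[str], Any],
    has_tensor: Callable[[str], bool],
    prefix: str,
) -> Tuple[GPTQWeight, Optional[Any]]:
    """Load the GPTQ tensors under `prefix`; the bias is optional and separate."""
    weight = GPTQWeight(
        **{name: get_tensor(f"{prefix}.{name}") for name in GPTQ_SUFFIXES}
    )
    bias_name = f"{prefix}.bias"
    bias = get_tensor(bias_name) if has_tensor(bias_name) else None
    return weight, bias
```

      Because the bias lookup is independent of the quantized tensors, layers
      with and without a bias go through the same loading path.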
      
      ---------
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
  4. 23 Jun, 2023 1 commit
  5. 08 Jun, 2023 1 commit