Commits · c8a01d759173483efc2135c4e7506b23e14e7fc4 · OpenDAS / text-generation-inference

25 Jul, 2023 1 commit

feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671) · a0d55358

Nicolas Patry authored Jul 25, 2023

- The current PR is not great because we're sidestepping
  `Weights.__init__`, but `Weights` shouldn't require anything related
  to the config or the model_id, as it aims to be a simple wrapper
  over multi-file loading.
- The ideal solution would be to use something like a Rust enum
  ```
  enum Quantize {
      Bitsandbytes(Bitsandbytes),
      GPTQ { bits: usize, groupsize: usize },
  }
  ```
  and pass that around during load. Unfortunately we don't
  have access to this, so for now, sidestepping seems easier.

- Re-enabling groupsize<0 with exllama (confirmed it works.)

Helps #601 

As a next step, we should make sure our quantization script uses that
format and make it standard.
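
A minimal sketch of the lookup described above, not the code from this PR: prefer `quantize_config.json` in the model directory and only fall back to environment variables when the file is missing. The helper `load_gptq_params`, the `GPTQParams` dataclass, the JSON key names (`bits`, `group_size`) and the `GPTQ_GROUPSIZE` variable are assumptions made for illustration; only `quantize_config.json` and `GPTQ_BITS` are named in the commit.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class GPTQParams:
    """Hypothetical container for the two GPTQ settings mentioned above."""
    bits: int
    groupsize: int


def load_gptq_params(model_dir: str) -> Optional[GPTQParams]:
    """Prefer quantize_config.json; fall back to env variables if it is absent."""
    config_path = Path(model_dir) / "quantize_config.json"
    if config_path.is_file():
        with config_path.open() as f:
            data = json.load(f)
        # Key names are an assumption about the file layout.
        return GPTQParams(bits=int(data["bits"]), groupsize=int(data["group_size"]))

    bits = os.environ.get("GPTQ_BITS")
    groupsize = os.environ.get("GPTQ_GROUPSIZE")  # assumed companion variable
    if bits is not None and groupsize is not None:
        return GPTQParams(bits=int(bits), groupsize=int(groupsize))
    return None


# Usage: a model directory that ships its own quantization settings.
with tempfile.TemporaryDirectory() as model_dir:
    Path(model_dir, "quantize_config.json").write_text(
        json.dumps({"bits": 4, "group_size": -1})
    )
    params = load_gptq_params(model_dir)
```

Keeping the settings next to the weights means the loader does not depend on how the process was launched, which is the motivation stated in the commit title.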



a0d55358

04 Jul, 2023 2 commits
- feat(server): use latest flash attention commit (#543) · 31e2253a
  OlivierDehaene authored Jul 04, 2023
```
@njhill FYI
```
  31e2253a
- fix(server): Handle loading from local files for MPT (#534) · 2a101207
  Antoni Baum authored Jul 04, 2023
```
This PR allows the MPT model to be loaded from local files. Without this
change, an exception is thrown by the `hf_hub_download` function if
`model_id` is a local path.
```
  2a101207
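
A minimal sketch of the behaviour described in the commit above, assuming a hypothetical helper `resolve_model_file` (not the function used in the repository): when `model_id` is a local directory the file is read from disk, and `hf_hub_download` is only called otherwise.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_model_file(model_id: str, filename: str, revision: Optional[str] = None) -> str:
    """Return a local path to `filename`, downloading only if `model_id` is not a directory."""
    local_dir = Path(model_id)
    if local_dir.is_dir():
        candidate = local_dir / filename
        if not candidate.is_file():
            raise FileNotFoundError(f"{filename} not found in local model directory {model_id}")
        return str(candidate)

    # Imported lazily so the local-path branch works without the hub client installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=model_id, filename=filename, revision=revision)


# Usage: a local checkout never touches the network.
with tempfile.TemporaryDirectory() as model_dir:
    Path(model_dir, "config.json").write_text("{}")
    local_path = resolve_model_file(model_dir, "config.json")
    found_locally = Path(local_path).is_file()
```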
03 Jul, 2023 1 commit

feat(server): Add Non flash MPT. (#514) · 1da07e85

Nicolas Patry authored Jul 03, 2023

# What does this PR do?


This adds a non-flash version of MPT.
Flash is harder because we need to create a CUDA kernel of
flash attention that supports a bias.

Fixes
https://github.com/huggingface/text-generation-inference/issues/361
Fixes
https://github.com/huggingface/text-generation-inference/issues/491
Fixes
https://github.com/huggingface/text-generation-inference/issues/290

1da07e85

30 Jun, 2023 1 commit
- feat: Add the option to force another dtype than `f16`. (#513) · ecf6dc3a
  Nicolas Patry authored Jun 30, 2023
  
  ecf6dc3a
08 Jun, 2023 1 commit

feat(server): Rework model loading (#344) · abd58ff8

Nicolas Patry authored Jun 08, 2023

# What does this PR do?

Reworked the loading logic. The idea is to use cleaner loading code:

- Remove the need for `no_init_weights`
- Remove all the weird `bnb_linear`, `load_weights` and
`post_load_weights`.

New code layout:

- New class `Weights` in charge of loading the weights from
multiple files into the appropriate tensors (potentially sharded)
- TP layers are now "shells": they contain the code that knows what kind of
sharding we need, plus the eventual `all_reduce`. They do not inherit from
linear, but they contain some kind of Linear instead
- The contained linear can be either FastLinear, BnbLinear or, next, a GPTQ
Linear.
- All modeling code is explicitly made for sharding; the process group is
just a no-op for non-sharded code (removes a lot of test cases)
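
The layout above can be sketched roughly as follows. This is a toy illustration that uses plain Python lists instead of tensors, not the code from this PR: the `Weights` constructor signature, `get_sharded`, and the `RowParallelDot` shell are assumptions made for the example.

```python
from typing import Callable, Dict, List


class Weights:
    """Toy stand-in for a multi-file weight loader (lists instead of tensors)."""

    def __init__(self, files: Dict[str, Dict[str, List[float]]], rank: int, world_size: int):
        self.files = files
        self.rank = rank
        self.world_size = world_size
        # Map every tensor name to the single file that contains it.
        self.routing: Dict[str, str] = {}
        for filename, tensors in files.items():
            for name in tensors:
                if name in self.routing:
                    raise RuntimeError(
                        f"{name} found in both {self.routing[name]} and {filename}"
                    )
                self.routing[name] = filename

    def get_tensor(self, name: str) -> List[float]:
        return self.files[self.routing[name]][name]

    def get_sharded(self, name: str) -> List[float]:
        """Return this rank's contiguous slice of a 1-D weight."""
        values = self.get_tensor(name)
        if len(values) % self.world_size != 0:
            raise ValueError(f"{name} cannot be split evenly across {self.world_size} ranks")
        block = len(values) // self.world_size
        start = self.rank * block
        return values[start : start + block]


class RowParallelDot:
    """A "shell": holds its weight shard plus the reduction to apply afterwards."""

    def __init__(self, weight_shard: List[float], all_reduce: Callable[[float], float]):
        self.weight_shard = weight_shard
        self.all_reduce = all_reduce

    def forward(self, input_shard: List[float]) -> float:
        partial = sum(w * x for w, x in zip(self.weight_shard, input_shard))
        return self.all_reduce(partial)


# Non-sharded usage: world_size=1 and the reduction is a no-op.
files = {"model-00001.safetensors": {"proj.weight": [1.0, 2.0, 3.0, 4.0]}}
weights = Weights(files, rank=0, world_size=1)
layer = RowParallelDot(weights.get_sharded("proj.weight"), all_reduce=lambda x: x)
result = layer.forward([1.0, 1.0, 1.0, 1.0])
```

With more than one rank, each shell computes a partial result on its own slice and a real `all_reduce` would sum the partials, so the same modeling code serves both cases.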

![Screenshot from 2023-05-19 23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)

---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>

abd58ff8

31 May, 2023 1 commit
- fix(server): fix bnb quantization for CausalLM models (#385) · 337afb28
  OlivierDehaene authored May 31, 2023
  
  337afb28
30 May, 2023 1 commit
- fix(server): fix quantization · bf7f1d54
  OlivierDehaene authored May 30, 2023
  
  bf7f1d54
26 May, 2023 1 commit
- feat(server): support vectorized warpers in flash causal lm (#317) · 62f91f78
  OlivierDehaene authored May 26, 2023
```
Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>
```
  62f91f78
23 May, 2023 2 commits
- feat(server): support trust_remote_code (#363) · e3e487dc
  OlivierDehaene authored May 23, 2023
  
  e3e487dc
- feat(server): support fp16 for t5 (#360) · cfaa8580
  OlivierDehaene authored May 23, 2023
```
Fixes #349
```
  cfaa8580
22 May, 2023 1 commit

feat(server): Support BLOOMChat-176B (#348) (#351) · e649bf9a

OlivierDehaene authored May 22, 2023



@njhill, 
temporary workaround to be able to run our CI as secrets are not
available to runners run by external contributors. I will ask around to
see if there is a better way.
Co-authored-by: Nick Hill <nickhill@us.ibm.com>

e649bf9a

16 May, 2023 1 commit
- fix(server): fix decode token (#334) · 5a582261
  OlivierDehaene authored May 16, 2023
```
Fixes #333

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
```
  5a582261
15 May, 2023 3 commits

feat: add snapshot testing (#282) · e71471be
OlivierDehaene authored May 15, 2023

e71471be

Removing dead variables. (#327) · d7a97aa0

Nicolas Patry authored May 15, 2023


d7a97aa0

Lifting check_unitialized. (#325) · 91e674bb

Nicolas Patry authored May 15, 2023

# What does this PR do?

Lifting check_unitialized.


91e674bb

12 May, 2023 1 commit

feat(server): GPTQ quantization (step1) (#277) · 76a48cd3

Nicolas Patry authored May 12, 2023

Changes only the type from `bool` to `Option<Enum>` pretty much
everywhere.
- Use `Optional[str]` in Python (easier to manage than importing the type
everywhere), except for the CLI, to get proper validation
- Updated all models to gracefully handle the new values (error out on an
unknown value, or on gptq since it is not implemented yet).
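
A rough sketch of the handling described above, assuming a hypothetical `check_quantize` helper and `Quantization` enum (neither name comes from this commit): the value travels as `Optional[str]`, unknown values are rejected, and `gptq` is recognised but errors out because it is not implemented at this step.

```python
from enum import Enum
from typing import Optional


class Quantization(str, Enum):
    """Assumed set of accepted values for illustration."""
    bitsandbytes = "bitsandbytes"
    gptq = "gptq"


def check_quantize(quantize: Optional[str]) -> Optional[Quantization]:
    """Validate an Optional[str] quantization setting before loading a model."""
    if quantize is None:
        return None  # no quantization requested
    try:
        mode = Quantization(quantize)
    except ValueError:
        raise ValueError(f"Unknown quantization method: {quantize!r}") from None
    if mode is Quantization.gptq:
        raise NotImplementedError("gptq is accepted as a value but not implemented yet")
    return mode


# Usage: the previous boolean flag maps to either None or "bitsandbytes".
selected = check_quantize("bitsandbytes")
```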


76a48cd3

10 May, 2023 2 commits
- feat(server): use float16 (#304) · 745f596c
  OlivierDehaene authored May 10, 2023
  
  745f596c
- feat(server): shard token decode (#303) · 68e9d6ab
  OlivierDehaene authored May 10, 2023
  
  68e9d6ab
03 May, 2023 1 commit
- feat(server): support hf endpoint weight layout (#266) · 85aa7e2e
  OlivierDehaene authored May 03, 2023
  
  85aa7e2e
21 Apr, 2023 1 commit
- feat(router): add device and dtype info (#215) · 343437c7
  OlivierDehaene authored Apr 21, 2023
  
  343437c7
12 Apr, 2023 2 commits
- feat(server): support sharded santacoder (#167) · 880a76ee
  OlivierDehaene authored Apr 12, 2023
  
  880a76ee
- feat(server): optimize decode for sane tokenizers (#170) · 5fa8ae04
  OlivierDehaene authored Apr 12, 2023
  
  5fa8ae04
11 Apr, 2023 1 commit

feat(server): support OPT models (#55) · f26dfd0d

OlivierDehaene authored Apr 11, 2023

OPT models do not all have a `tokenizer.json` file on the hub at the
moment. Can't merge for now.

f26dfd0d

09 Apr, 2023 1 commit
- feat(router): make router input validation optional (#164) · 99879600
  OlivierDehaene authored Apr 09, 2023
  
  99879600
07 Mar, 2023 1 commit
- feat(clients): Python client (#103) · 3fef90d5
  OlivierDehaene authored Mar 07, 2023
  
  3fef90d5
06 Mar, 2023 1 commit
- feat: allow local models (#101) · cd5961b5
  OlivierDehaene authored Mar 06, 2023
```
closes #99
```
  cd5961b5
14 Feb, 2023 1 commit
- feat: add safetensors conversion (#63) · 0fbc6919
  OlivierDehaene authored Feb 14, 2023
  
  0fbc6919
03 Feb, 2023 1 commit
- feat(router): refactor API and add openAPI schemas (#53) · 20c3c594
  OlivierDehaene authored Feb 03, 2023
  
  20c3c594
31 Jan, 2023 3 commits
- feat(server): Support GPT-Neox (#39) · f830706b
  OlivierDehaene authored Jan 31, 2023
  
  f830706b
- fix(server): fix quantization for sharded models (#45) · c6e8b944
  OlivierDehaene authored Jan 31, 2023
  
  c6e8b944
- fix(server): fix seeding with multiple shards (#44) · 54fec931
  OlivierDehaene authored Jan 31, 2023
  
  54fec931
30 Jan, 2023 1 commit
- feat: Support sampling seeding (#37) · cd298bc5
  OlivierDehaene authored Jan 30, 2023
```
Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>
```
  cd298bc5
26 Jan, 2023 1 commit
- feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33) · ce960be0
  OlivierDehaene authored Jan 26, 2023
  
  ce960be0
20 Jan, 2023 2 commits
- fix(server): Fix position ids (#28) · 1f570d18
  OlivierDehaene authored Jan 20, 2023
  
  1f570d18
- feat(server): Support SantaCoder (#26) · 15511edc
  OlivierDehaene authored Jan 20, 2023
  
  15511edc
17 Jan, 2023 1 commit

fix(server): Minor refactorization using new_zeros (#24) · e6d3eb5d

Nick Hill authored Jan 17, 2023

- Fix some type hints, in particular base tokenizer class
- Make use of the `tensor.new_zeros`/`tensor.new_empty` methods
- Simplify env var string parsing in launcher

e6d3eb5d

15 Dec, 2022 1 commit
- feat: Return logprobs (#8) · 32a25306
  OlivierDehaene authored Dec 15, 2022
  
  32a25306
08 Dec, 2022 1 commit
- feat(server): Add model tests (#6) · a2985036
  OlivierDehaene authored Dec 08, 2022
  
  a2985036
05 Dec, 2022 1 commit

fix(batching): Avoid theoretical hang in batcher loop (#5) · 31d76e23

Nick Hill authored Dec 05, 2022



- Avoid theoretical hang in batcher loop
- Avoid a couple of clones in the router generate method
- Keep attention mask tensors as integers
- Remove num_heads attribute
Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>

31d76e23