- 20 Jul, 2023 1 commit
-
-
cdawg authored
# What does this PR do? Fixes a bug that appeared with PR #587, which fixed issue #552 (see the discussion in #552). With PR #587, the `trust_remote_code` variable is present in the function signature but is no longer passed to `AutoModelForCausalLM`. This prevents models like Falcon, which require `trust_remote_code`, from being quantized. This PR fixes the issue. ## Who can review? @Narsil
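A minimal sketch of the fix, assuming a `quantize` entry point along the lines of TGI's server code (the function name and signature here are illustrative, not the exact TGI API):
```python
from transformers import AutoModelForCausalLM

def quantize(model_id: str, trust_remote_code: bool = False):
    # Before the fix, trust_remote_code was accepted in the signature but
    # silently dropped; forwarding it lets remote-code models (e.g. Falcon) load.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=trust_remote_code,
    )
    return model
```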
-
- 19 Jul, 2023 4 commits
-
-
Nicolas Patry authored
-
Nicolas Patry authored
-
OlivierDehaene authored
-
OlivierDehaene authored
-
- 18 Jul, 2023 5 commits
-
-
OlivierDehaene authored
-
OlivierDehaene authored
-
Nicolas Patry authored
-
OlivierDehaene authored
-
Nicolas Patry authored
# What does this PR do? Reworks the quantization script so that it stays universal (not Llama-specific) but works on more configurations (no need for 2 GPUs, lower RAM usage). Still need to investigate the potential differences in quantization results.
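One way to get the memory profile described (single GPU, lower RAM) is to keep the model on CPU and quantize one layer at a time on the GPU. A rough sketch of that pattern, where `quantize_layer` is a hypothetical stand-in for the actual GPTQ step and `gpt2` is just a stand-in model:
```python
import torch
from transformers import AutoModelForCausalLM

def quantize_layer(layer):
    # Hypothetical placeholder for the real GPTQ quantization of one layer.
    return layer

model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16)
device = "cuda" if torch.cuda.is_available() else "cpu"

for i, layer in enumerate(model.transformer.h):
    layer.to(device)  # only one layer resident on the GPU at a time
    model.transformer.h[i] = quantize_layer(layer).to("cpu")
```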
-
- 17 Jul, 2023 2 commits
-
-
OlivierDehaene authored
-
Nicolas Patry authored
-
- 15 Jul, 2023 1 commit
-
-
OlivierDehaene authored
-
- 14 Jul, 2023 1 commit
-
-
OlivierDehaene authored
-
- 13 Jul, 2023 4 commits
-
-
OlivierDehaene authored
Closes #589 and #602
-
OlivierDehaene authored
-
OlivierDehaene authored
-
- 12 Jul, 2023 9 commits
-
-
ssmi153 authored
# What does this PR do? When passing in environment variables like GPTQ_BITS, we still get errors thrown from TGI because the try/except block is catching the wrong type of error. This PR aims to fix that. @Narsil - let me know if this is how you want this formatted. My Python is a little shaky, so I hope this syntax is correct.
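The bug pattern being fixed, as a hedged illustration (the real TGI code differs):
```python
import os

# int("foo") raises ValueError, not KeyError, so a handler written only for
# KeyError lets the error escape. Catching both covers missing and malformed values.
try:
    gptq_bits = int(os.environ["GPTQ_BITS"])
except (KeyError, ValueError):
    gptq_bits = None
```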
-
OlivierDehaene authored
-
Nicolas Patry authored
- The code is relatively easy (just disable the checks on Embedding and Head). This cannot be done in the same easy fashion for hidden_dim/head_dim: it is relatively easy on some models (classic MHA), but it would make the other models (MQA) much more complex, and GPTQ quantization is another quite hairy piece of code.
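Roughly what "disable the checks" allows: sharding a vocab dimension that does not divide evenly across ranks, e.g. with ceil-sized blocks so the last rank absorbs the remainder. A sketch under those assumptions, not TGI's actual sharding code:
```python
import math
import torch

def shard_rows(weight: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    # Ceil division: every rank gets block_size rows except possibly the last.
    block_size = math.ceil(weight.shape[0] / world_size)
    start = rank * block_size
    return weight[start:start + block_size]

w = torch.randn(50257, 768)  # vocab size not divisible by 4
shards = [shard_rows(w, r, 4) for r in range(4)]
assert sum(s.shape[0] for s in shards) == w.shape[0]
```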
-
ssmi153 authored
# What does this PR do? This fixes a typo and extends the GPTQ_BITS environment variables through to the second method, which requires the same logic. Please let me know if there's anything I've misunderstood in this change. Thanks @Narsil for the original fix.
-
Adam Kowalski authored
The logger is referenced during the apex import but is never imported, causing a NameError.
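The failure mode, sketched (the logging module is an assumption; TGI's actual import layout may differ):
```python
from loguru import logger  # the missing import

try:
    import apex  # optional dependency
except ImportError:
    # Without the import above, this line raised NameError instead of warning.
    logger.warning("apex is not installed; falling back to torch LayerNorm")
```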
-
Nicolas Patry authored
Fixes #541
-
Nicolas Patry authored
-
Nicolas Patry authored
# What does this PR do? Some models are already converted and do not have those values in the file; this enables users to use them with less friction. Went for a pure env-based approach because adding flags would end up (imo) very tedious to maintain: there's a lot of validation to do, those flags would be errors if not used in conjunction with `--quantize gptq`, and the flags would need to exist in the launcher and the server, passed throughout all function calls. This PR is intended as an easy escape hatch, not the de facto method to use GPTQ in TGI. Fixes #500
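The escape hatch as described: prefer values stored with the converted weights, and fall back to the environment for checkpoints that predate them. A hedged sketch; the file name, keys, and env var names are assumptions:
```python
import json
import os

def get_gptq_params(config_path: str = "quantize_config.json"):
    # Prefer the values stored alongside the converted weights...
    if os.path.exists(config_path):
        with open(config_path) as f:
            cfg = json.load(f)
        return cfg["bits"], cfg["group_size"]
    # ...and fall back to the env-based escape hatch otherwise.
    return int(os.environ["GPTQ_BITS"]), int(os.environ["GPTQ_GROUPSIZE"])
```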
-
Nicolas Patry authored
fix(server): Fix RW code (it's remote code, so the architecture check doesn't work to determine which weights to keep). (#579) Fixes #555
-
- 10 Jul, 2023 1 commit
-
-
OlivierDehaene authored
Close #571
-
- 07 Jul, 2023 1 commit
-
-
Nicolas Patry authored
- Look at the `transformers` base class to check for `_keys_to_ignore_on_load_missing` or `_tied_weights`, which are the standard attributes used to select the keys NOT to save on disk (since they are ignored on load) - Modified safetensors code (to be reflected in safetensors even if it's an internal function). - Will not work for trust_remote_code=True repos (like santacoder). Should help with: https://github.com/huggingface/text-generation-inference/issues/555 and https://github.com/huggingface/text-generation-inference/pull/501 and https://github.com/huggingface/text-generation-inference/issues/556 and https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593
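In spirit, the conversion drops keys the model class marks as ignorable before saving. A simplified sketch of that filtering (not the modified safetensors internals):
```python
import re

def keys_to_save(state_dict: dict, model_cls) -> dict:
    # _keys_to_ignore_on_load_missing holds regex patterns the transformers
    # base class treats as safe to drop, since they are rebuilt on load.
    patterns = getattr(model_cls, "_keys_to_ignore_on_load_missing", None) or []
    return {
        name: tensor
        for name, tensor in state_dict.items()
        if not any(re.search(p, name) for p in patterns)
    }
```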
-
- 06 Jul, 2023 2 commits
-
-
OlivierDehaene authored
-
OlivierDehaene authored
-
- 05 Jul, 2023 1 commit
-
-
OlivierDehaene authored
# What does this PR do? In title: adds a `--hostname` argument to the router to support something like `--hostname ::`. Tested with:
```commandline
cargo run -- --port 8080 --hostname ::
curl -I -X GET 'http://[::1]:8080/health'  # failed before this commit
```
Co-authored-by: Phil Chen <philchen2000@gmail.com>
-
- 04 Jul, 2023 5 commits
-
-
OlivierDehaene authored
@njhill FYI
-
Nick Hill authored
See https://github.com/huggingface/transformers/pull/24111. I didn't add validation to the `__init__` method since it's not done for other values/warpers.
-
Antoni Baum authored
This PR allows the MPT model to be loaded from local files. Without this change, the `hf_hub_download` function throws an exception if `model_id` is a local path.
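The usual shape of such a fix: skip `hf_hub_download` when `model_id` already points at a local directory. A sketch, with `resolve_file` as a hypothetical helper name:
```python
import os
from huggingface_hub import hf_hub_download

def resolve_file(model_id: str, filename: str) -> str:
    local = os.path.join(model_id, filename)
    if os.path.isdir(model_id) and os.path.exists(local):
        return local  # local checkout: no Hub call, no exception
    return hf_hub_download(repo_id=model_id, filename=filename)
```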
-
Nicolas Patry authored
-
Antoni Baum authored
# What does this PR do? For consistency and ease of use (you can just run `make` to install vllm without any extra steps).
-
- 03 Jul, 2023 1 commit
-
-
Nicolas Patry authored
# What does this PR do? This adds a non-flash version of MPT. Flash is harder because we would need a bias-ready CUDA kernel for flash attention. Fixes https://github.com/huggingface/text-generation-inference/issues/361 Fixes https://github.com/huggingface/text-generation-inference/issues/491 Fixes https://github.com/huggingface/text-generation-inference/issues/290
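The bias in question is ALiBi, which MPT adds to its attention scores and which the flash kernels of the time did not accept. A minimal sketch of that bias (for power-of-two head counts, per the ALiBi paper's slope schedule):
```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # One slope per head: a geometric sequence 2^(-8/n), 2^(-16/n), ...
    slopes = torch.tensor(
        [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]
    )
    distances = torch.arange(seq_len).view(1, 1, -1)  # (1, 1, seq)
    return -distances * slopes.view(-1, 1, 1)         # (heads, 1, seq)

# Added to the (heads, query, key) attention scores before the softmax.
```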
-
- 01 Jul, 2023 1 commit
-
-
OlivierDehaene authored
-
- 30 Jun, 2023 1 commit
-
-
OlivierDehaene authored
-