Commits · ba291dad9f5cd0061e5ee4cab332db42c434e369 · OpenDAS / text-generation-inference

19 Jul, 2024 1 commit

Improve the handling of quantized weights (#2250) · ba291dad

Daniël de Kok authored Jul 19, 2024

* Improve the handling of quantized weights

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader
  infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes), we
  instead relied on conditionals in `get_linear`.

Weight loaders support context managers to selectively load
particular layers with different weight loaders, which is useful
for models like Idefics2 AWQ, which uses a quantized text model,
but unquantized vision and connector models. However, the context
manager would be overridden by `get_linear`, which string-checks
`quantizer`. Also, the context manager would not work with
EETQ, FP8, and bitsandbytes.

This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers;
  `get_linear` does not need to know how to handle quantizer linear
  layers.
- All quantizer weights are strongly typed; we don't pass around
  raw tensors.
- We don't have to pass around the `quantizer` string everywhere.
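
As a rough illustration of the idea described above, the sketch below shows how a context manager can temporarily swap the weight loader for selected layers, and how each loader can return a typed weight object instead of a raw tensor. All class and method names here (`Weights`, `use_loader`, `AWQLoader`, `UnquantizedLoader`, `AWQWeight`, `UnquantizedWeight`) are hypothetical and chosen for this example; the actual code in the repository may be organized differently.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class UnquantizedWeight:
    """Hypothetical typed wrapper for a plain weight."""
    prefix: str


@dataclass
class AWQWeight:
    """Hypothetical typed wrapper for an AWQ-quantized weight."""
    prefix: str


class UnquantizedLoader:
    def get_weights(self, prefix: str) -> UnquantizedWeight:
        return UnquantizedWeight(prefix)


class AWQLoader:
    def get_weights(self, prefix: str) -> AWQWeight:
        return AWQWeight(prefix)


class Weights:
    """Delegates loading to whichever loader is currently active."""

    def __init__(self, loader):
        self.loader = loader

    def get_weights(self, prefix: str):
        return self.loader.get_weights(prefix)

    @contextmanager
    def use_loader(self, loader):
        # Swap the loader for the duration of the `with` block and
        # restore the previous one afterwards, even if loading fails.
        previous = self.loader
        self.loader = loader
        try:
            yield
        finally:
            self.loader = previous


# Quantized text model, unquantized vision model (as in Idefics2 AWQ).
weights = Weights(AWQLoader())
text = weights.get_weights("text_model.q_proj")
with weights.use_loader(UnquantizedLoader()):
    vision = weights.get_weights("vision_model.fc1")
```

Because the active loader alone decides what kind of weight is produced, code further down the stack can dispatch on the weight's type rather than on a `quantizer` string.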

* Exclude non-MLP layers when using FP8 quantization with Llama


25 Jun, 2024 1 commit

Add pytest release marker (#2114) · fc9c3153

Daniël de Kok authored Jun 25, 2024

* Add pytest release marker

A test annotated with `@pytest.mark.release` only runs when the suite
is invoked with `pytest integration-tests --release`.

* Mark many models as `release` to speed up CI
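
One way such a marker could be wired up in a `conftest.py` is sketched below. This is an assumption about the mechanism, not a copy of the repository's implementation: it registers a `--release` flag and deselects tests carrying the `release` marker unless the flag is given. The hook names and the `parser.addoption`, `config.addinivalue_line`, `config.getoption`, and `item.get_closest_marker` calls are standard pytest APIs; the help strings are made up for this example.

```python
# conftest.py (sketch; the real project may skip instead of deselect)


def pytest_addoption(parser):
    # Register the --release command-line flag; it defaults to off.
    parser.addoption(
        "--release",
        action="store_true",
        default=False,
        help="also run tests marked with @pytest.mark.release",
    )


def pytest_configure(config):
    # Declare the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "release: run only with --release")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--release"):
        return  # flag given: keep every collected test
    kept, dropped = [], []
    for item in items:
        # get_closest_marker returns None when the test lacks the marker.
        if item.get_closest_marker("release") is None:
            kept.append(item)
        else:
            dropped.append(item)
    if dropped:
        # Report the dropped tests as deselected and shrink the run list.
        config.hook.pytest_deselected(items=dropped)
        items[:] = kept
```

With this in place, a plain `pytest integration-tests` run would leave out the marked tests, and adding `--release` would include them again.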


06 Jun, 2024 1 commit

Add support for Marlin-quantized models · 4594e6fa

Daniël de Kok authored Jun 05, 2024

This change adds support for Marlin-quantized models. Marlin is an
FP16xINT4 matmul kernel that provides good speedups when decoding
batches of 16-32 tokens. It supports 4-bit quantized models with
symmetric quantization and a groupsize of -1 or 128.

Tested with:

- Llama 2
- Llama 3
- Phi 3
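
The constraints stated in this commit message (4-bit, symmetric, groupsize -1 or 128) can be expressed as a small compatibility check. The sketch below is illustrative only: `QuantConfig` and `marlin_compatible` are hypothetical names, and reading a groupsize of -1 as "one group spanning the whole input dimension" is an assumption based on the common GPTQ convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantConfig:
    """Hypothetical description of a quantized checkpoint."""
    bits: int
    groupsize: int  # -1 assumed to mean a single group per column
    sym: bool       # True for symmetric quantization


def marlin_compatible(cfg: QuantConfig) -> bool:
    # Mirrors the three conditions listed in the commit message.
    return cfg.sym and cfg.bits == 4 and cfg.groupsize in (-1, 128)


supported = marlin_compatible(QuantConfig(bits=4, groupsize=128, sym=True))
unsupported = marlin_compatible(QuantConfig(bits=4, groupsize=64, sym=True))
```

A loader could use a check like this to decide whether to select the Marlin kernel or fall back to another implementation.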


24 Jul, 2023 1 commit
- feat: add cuda memory fraction (#659) · 73a4d65d
  OlivierDehaene authored Jul 24, 2023
```
Close #673
```
21 Jul, 2023 1 commit

feat(server): Add exllama GPTQ CUDA kernel support #553 (#666) · d5b5bc75

Nicolas Patry authored Jul 21, 2023

Just trying to get the integration tests to pass.


---------
Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>


02 Jun, 2023 1 commit
- feat(server): only compute prefill logprobs when asked (#406) · 895c5f15
  OlivierDehaene authored Jun 02, 2023
```
Close #288
```
31 May, 2023 1 commit
- increase health checks · 444400b4
  OlivierDehaene authored May 31, 2023
  
26 May, 2023 1 commit
- feat(server): support vectorized warpers in flash causal lm (#317) · 62f91f78
  OlivierDehaene authored May 26, 2023
```
Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>
```
16 May, 2023 1 commit
- feat(integration-tests): improve comparison and health checks (#336) · dbdc587d
  OlivierDehaene authored May 16, 2023
  
15 May, 2023 1 commit
- feat: add snapshot testing (#282) · e71471be
  OlivierDehaene authored May 15, 2023
  