Commits · cfa73b5c99bc009903fbc340f8b77a6d4674455d · OpenDAS / text-generation-inference

27 Aug, 2024 1 commit

drbh authored Aug 26, 2024



* fix[router]: Fix tools not passed in chat template
Signed-off-by: GitHub <noreply@github.com>

* feat: improve default tool serialization and lints

* feat: refactor tool logic to include notify_error in prompt and adjust typing

* fix: adjust non tool template apply

* fix: simplify tool grammar logic and improve schema

* feat: avoid skip tool test and avoid empty tool prompts

* fix: increase test client timeout for grammar compilation tests

---------
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>

cfa73b5c

16 Aug, 2024 2 commits

All integration tests back everywhere (too many failed CI). (#2428) · e4201f44

Nicolas Patry authored Aug 16, 2024

* All integration tests back everywhere (too many failed CI).

* Upgrade integration tests after 12.4

* Attempt to remove the specifed compute cap.

* Common arch list.

* Punica uses raw ASM which is not valid on 9.0 apparently.

e4201f44

Further fixes. (#2426) · c7ab1810

Nicolas Patry authored Aug 16, 2024

* Further fixes.

* Update the conftest to allow NaN (first logprob).

* Fix the condition.

c7ab1810

15 Aug, 2024 2 commits

Upgrading the tests to match the current workings. (#2423) · 1b0aa062
Nicolas Patry authored Aug 15, 2024

1b0aa062

Fixing exl2 and other quanize tests again. (#2419) · 57b34958

Nicolas Patry authored Aug 15, 2024

* Fixing exl2 and other quanize tests again.

* Mark exl2 as non release (so CI tests them, needs to be removed latet).

* Fixing exl2 (by disabling cuda graphs)

* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).

* Removing serde override.

* Go back to released exl2 and remove log.

* Adding warnings for deprecated bitsandbytes + upgrade info to warn.

57b34958

12 Aug, 2024 1 commit
- Upgrade fbgemm (#2398) · 9c739651
  Nicolas Patry authored Aug 12, 2024
```
* Upgrade fbgemm

* Fix fbgemm version
```
  9c739651
08 Aug, 2024 1 commit

Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371) · a379d553

drbh authored Aug 07, 2024



* Fix the bug

* fix: run lints

* fix: small syntax tweak

---------
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>

a379d553

29 Jul, 2024 1 commit
- fix: adjust test snapshots and small refactors (#2323) · 0b95693f
  drbh authored Jul 29, 2024
```
* fix: adjust test snapshots and small refactors

* fix: revert non snapshot changes
```
  0b95693f
26 Jul, 2024 1 commit

feat: add ruff and resolve issue (#2262) · bab02ff2

drbh authored Jul 26, 2024

* feat: add ruff and resolve issue

* fix: update client exports and adjust after rebase

* fix: adjust syntax to avoid circular import

* fix: adjust client ruff settings

* fix: lint and refactor import check and avoid model enum as global names

* fix: improve fbgemm_gpu check and lints

* fix: update lints

* fix: prefer comparing model enum over str

* fix: adjust lints and ignore specific rules

* fix: avoid unneeded quantize check

bab02ff2

25 Jul, 2024 3 commits
- Fixing idefics on g6 tests. (#2306) · 17ed42be
  Nicolas Patry authored Jul 25, 2024
  
  17ed42be
- Some small fixes for the Torch 2.4.0 update (#2304) · 9256d7c3
  Daniël de Kok authored Jul 25, 2024
```
* Fix GPTQ autotune data type to be compatible with Torch 2.4.0

* Update poetry lock file

* Fix small PaliGemma logprob differences after the torch update
```
  9256d7c3
- Using g6 instead of g5. (#2281) · 26614057
  Nicolas Patry authored Jul 25, 2024
```
* Using g6 instead of g5.

* Update the idefics2 snapshot.
```
  26614057
22 Jul, 2024 2 commits

Softcapping for gemma2. (#2273) · 6aeb6690

Nicolas Patry authored Jul 22, 2024

* Softcapping for gemma2.

* Less clutter.

* No access to transformers config, only config_dict here.

* 0.0 is the null value in the C++ API.

6aeb6690

fix(server): fix fp8 weight loading (#2268) · 4844ff79

OlivierDehaene authored Jul 22, 2024

* fix(server): fix fp8 weight loading

* fixed scales loading

* update snap

* revert default dtype

4844ff79

20 Jul, 2024 1 commit
- Add FP8 release test (#2261) · e5c1d6d6
  Daniël de Kok authored Jul 20, 2024
  
  e5c1d6d6
19 Jul, 2024 2 commits

Add support for Deepseek V2 (#2224) · e52be9bb

Daniël de Kok authored Jul 19, 2024

Deepseek V2 is a MoE model from Deepseek. Relevant variations
compared to other models:

- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
  configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary
  embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
  So, we need weight loads that supports quantized weights. To this
  end `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value,
  so we need to pad during attention.
- Heads with size 192, needs an extension to our paged attention
  fork and we need to ensure that the KV cache is allocated with the
  correct size.
- Shared experts.

e52be9bb

Improve the handling of quantized weights (#2250) · ba291dad

Daniël de Kok authored Jul 19, 2024

* Improve the handling of quantized weights

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader
  infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes) we
  instead relied on conditional in `get_linear`.

Weight loaders support context managers to selectively load
particular layers with different weight loaders, which is useful
for models like Idefics2 AWQ, which uses a quantized text model,
but unquantized vision and connector models. However, the context
manager would be overrided by `get_linear`, which string-checks
`quantizer`. Also, the context manager would not work with
EETQ, FP8, and bitsandbytes.

This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers,
  `get_linear` does not need to know how to handle quantizer linear
  layers.
- All quantizer weights are strongly typed, we don't pass around
  raw tensors.
- We don't have to pass around the `quantizer` string everywhere.

* Exclude non-MLP layers when using FP8 quantization with Llama

ba291dad

15 Jul, 2024 1 commit

feat: simple mistral lora integration tests (#2180) · 5a650669

drbh authored Jul 15, 2024

* feat: simple mistral lora integration tests

* fix: include args in docker launcher

* fix: disable cuda graphs with lora and warn

* fix: adjust docs and precommit issues

* fix: re update docs

5a650669

05 Jul, 2024 2 commits

GPTQ CI improvements (#2151) · 67ef0649

Daniël de Kok authored Jul 05, 2024

* Add more representative Llama GPTQ test

The Llama GPTQ test is updated to use a model with the commonly-used
quantizer config format and activation sorting. The old test is
kept around (but renamed) since it tests the format produced by
`text-generation-server quantize`.

* Add support for manually triggering a release build

67ef0649

Refactor dead code - Removing all `flash_xxx.py` files. (#2166) · fb2f74e2

Nicolas Patry authored Jul 05, 2024

* Refactor dead code.

* First working step.

* Remove a lot of duplicated code.

* More dead code.

* More cleanup.

* Fix Santacoder test.

* Fixing the simple tests.

* Fixing sharding.

* Fixes for VLM.

* Fixing santacoder (num_kv_heads hardcoded).

* Removing more dead code.

* Fixing `config.n_head`.

* Stopping earlier because of `<end_of_utterance>` in idefics2.

* Addresses comments.

* Removing the dead code.

* Fuse back mistral into FlashCausalLM.

* Finish removal.

* Fixing docs + causal_lm `batch_class`.

* Fixing docs + causal.lm.

* Add default to Gemma Causality.

* Default value for gemma/gemma2.

* Wrong default.

fb2f74e2

01 Jul, 2024 1 commit

Use GPTQ-Marlin for supported GPTQ configurations (#2111) · 2ce80194

Daniël de Kok authored Jul 01, 2024

GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
let's use it by default if the kernels are installed, the GPU supports
it, and the kernels support the configuration.

For models generated by `text-generation-server quantize`, use
`sym=False`. This subcommand symmetric quantization since the beginning
and incorrectly reporting the model to be symmetric will use
GPTQ-Marlin (which does not support asymmetric quantization).

2ce80194

27 Jun, 2024 1 commit

Idefics2: sync added image tokens with transformers (#2080) · dd2d91b0

Daniël de Kok authored Jun 27, 2024

Before this change, the number of reserved image tokens was not the
same as the number of images. Fixes #2029.

While at it, also remove all the image token handling duplication
in `prepare_input`.

dd2d91b0

25 Jun, 2024 2 commits

Add pytest release marker (#2114) · fc9c3153

Daniël de Kok authored Jun 25, 2024

* Add pytest release marker

Annotate a test with `@pytest.mark.release` and it only gets run
with `pytest integration-tests --release`.

* Mark many models as `release` to speed up CI

fc9c3153

Support `HF_TOKEN` environment variable (#2066) · 3447c722

Lucain authored Jun 25, 2024



* Support HF_TOKEN environement variable

* Load test.

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

3447c722

24 Jun, 2024 1 commit

New runner. Manual squash. (#2110) · 480d3b33

Nicolas Patry authored Jun 24, 2024

* New runner. Manual squash.

* Network host.

* Put back trufflehog with proper extension.

* No network host ?

* Moving buildx install after tailscale ?

* 1.79

480d3b33

17 Jun, 2024 1 commit

Support different image sizes in prefill in VLMs (#2065) · e9037708

Daniël de Kok authored Jun 17, 2024

When a batch contained images if different sizes during prefill, the
server would fail (see e.g. #2056). Images were processed separately and
then concatenated. However, this can fail for images with different sizes.

Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.

e9037708

14 Jun, 2024 1 commit

Add support for GPTQ Marlin (#2052) · 093a27c5

Daniël de Kok authored Jun 14, 2024

Add support for GPTQ Marlin kernels

GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:

- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false

Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.

The kernels were contributed by Neural Magic to VLLM. We vendor them
here for convenience.

093a27c5

11 Jun, 2024 1 commit

Support chat response format (#2046) · 376a0b7a

drbh authored Jun 11, 2024

* feat: support response_format in chat

* fix: adjust typos

* fix: add trufflehog lint

376a0b7a

06 Jun, 2024 1 commit

Add support for Marlin-quantized models · 4594e6fa

Daniël de Kok authored Jun 05, 2024

This change adds support for Marlin-quantized models. Marlin is an
FP16xINT4 matmul kernel, which provides good speedups decoding batches
of 16-32 tokens. It supports quantized models with symmetric
quantization, groupsize -1 or 128, and 4-bit.

Tested with:

- Llama 2
- Llama 3
- Phi 3

4594e6fa

30 May, 2024 2 commits

Gemma GPTQ checks: skip logprob checks · 967ced2f

Daniël de Kok authored May 30, 2024

This test fails somewhat regularly due to non-determinism and this
test is primarily to verify that we are loading a model which doesn't
have `float16` as the default dtype correctly.

967ced2f

Add support for exl2 quantization · 36dd1601

Daniël de Kok authored May 28, 2024

Mostly straightforward, changes to existing code:

* Wrap quantizer parameters in a small wrapper to avoid passing
  around untyped tuples and needing to repack them as a dict.
* Move scratch space computation to warmup, because we need the
  maximum input sequence length to avoid allocating huge
  scratch buffers that OOM.

36dd1601

28 May, 2024 1 commit

Fix (non-container) pytest stdout buffering-related lock-up · f20463e4

Daniël de Kok authored May 28, 2024

Two issues:

1. When one of the stdout/stderr pipe buffers of a process started
   with `subprocess.Popen` is full, the process can get blocked until
   the buffer is drained.
2. Calling `Popen.wait` can deadlock when called before draining
   the pipe buffers (if they are full).

This avoids the issue altogether by giving the child process a
temporary file to write to.

f20463e4

27 May, 2024 2 commits

Fix GPTQ for models which do not have float16 at the default dtype (simpler) (#1953) · a401c83c

Daniël de Kok authored May 27, 2024

# What does this PR do?

Fix GPTQ for models which do not have float16 at the default dtype

Before this change GPTQ models would not work if the model's default
data type is not `float16`. For example, Gemma GPTQ models would fail
because the default dtype of Gemma is `bfloat16`. There are two issues:

If the default `dtype` is not `float16`, the quantizer's `float16`
parameters get converted to that dtype. The kernels cannot deal
with non-`float16` types. The same applies to inputs of quantized ops.

This is resolved by setting the dtype of gptq/awq-quantized models to
`float16`.

Simpler version of #1951.

**Draft:** just testing...

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->

a401c83c

Fix (flash) Gemma prefix and enable tests · 9231098f
Daniël de Kok authored May 24, 2024

9231098f

24 May, 2024 1 commit

Fix seeded output. (#1949) · d32e33bd

Nicolas Patry authored May 24, 2024

# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->

d32e33bd

16 May, 2024 1 commit

Pali gemma modeling (#1895) · 40213c95

drbh authored May 16, 2024

This PR adds paligemma modeling code

Blog post: https://huggingface.co/blog/paligemma
Transformers PR: https://github.com/huggingface/transformers/pull/30814

install the latest changes and run with
```bash
# get the weights
# text-generation-server download-weights gv-hf/PaliGemma-base-224px-hf

# run TGI
text-generation-launcher --model-id gv-hf/PaliGemma-base-224px-hf
```


basic example sending various requests
```python
from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:3000")


images = [
    "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/cow_beach_1.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png

",
]

prompts = [
    "What animal is in this image?",
    "Name three colors in this image.",
    "What are 10 colors in this image?",
    "Where is the cow standing?",
    "answer en Where is the cow standing?",
    "Is there a bird in the image?",
    "Is ther a cow in the image?",
    "Is there a rabbit in the image?",
    "how many birds are in the image?",
    "how many rabbits are in the image?",
]

for img in images:
    print(f"\nImage: {img.split('/')[-1]}")
    for prompt in prompts:
        inputs = f"![]({img}){prompt}\n"
        json_data = {
            "inputs": inputs,
            "parameters": {
                "max_new_tokens": 30,
                "do_sample": False,
            },
        }
        generated_output = client.text_generation(prompt, max_new_tokens=30, stream=False)
        print([f"{prompt}\n{generated_output}"])

```

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

40213c95

15 May, 2024 1 commit

Add GPT-2 with flash attention (#1889) · b5bc6e5c

Daniël de Kok authored May 15, 2024

# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

This change adds `FlashGPT2ForCausalLM` and wires it up. The model
itself is pretty straightforward, the main difference from other models
is that it uses trained position embeddings and that all weight matrices
are transposed compared to other models (due to the use of Conv1D in the
upstream model).


<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [x] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [x] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil 

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->

b5bc6e5c

23 Apr, 2024 1 commit

Idefics2. (#1756) · bfddfa59

Nicolas Patry authored Apr 23, 2024

# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->

bfddfa59

18 Apr, 2024 1 commit
- v2.0.1 · 2d0a7173
  OlivierDehaene authored Apr 18, 2024
  
  2d0a7173
17 Apr, 2024 1 commit

feat: accept list as prompt and use first string (#1702) · 06c3d4b1

drbh authored Apr 17, 2024

This PR allows the `CompletionRequest.prompt` to be sent as a string or
array of strings. When an array is sent the first value will be used if
it's a string; otherwise the according error will be thrown

Fixes:
https://github.com/huggingface/text-generation-inference/issues/1690
Similar to: https://github.com/vllm-project/vllm/pull/323/files

06c3d4b1