Commits · 173387eae543ee23884ebc95b8a1330b4e75c21f · OpenDAS / text-generation-inference

30 May, 2024 1 commit
- add custom vllm source code · 70056d1e
  huangwb authored May 29, 2024
  
  70056d1e
24 Apr, 2024 1 commit
- first runnable TGI changes on DCU platform · 25e8c688
  huangwb authored Apr 24, 2024
  
  25e8c688
10 Apr, 2024 1 commit
- fix: fix CohereForAI/c4ai-command-r-plus (#1707) · ad9d6288
  OlivierDehaene authored Apr 10, 2024
```
@Narsil @drbh this will update flash attention v2 and vllm.
You will need to re-install them.
```
  ad9d6288
22 Mar, 2024 1 commit
- feat: cohere (#1660) · 1e9bcd9d
  OlivierDehaene authored Mar 22, 2024
  
  1e9bcd9d
16 Feb, 2024 1 commit
- v1.4.1 (#1568) · 4139054b
  OlivierDehaene authored Feb 16, 2024
  
  4139054b
08 Feb, 2024 1 commit

Impl simple mamba model (#1480) · bd405e03

drbh authored Feb 08, 2024

This draft PR is a work-in-progress implementation of the mamba model.
It currently loads weights and produces correct logits after a single
pass.

The model still needs to be integrated correctly so that it produces
tokens as expected, and optimized to avoid copies and unnecessary
operations at runtime.

#### Helpful resources
- [Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu and Tri Dao)](https://arxiv.org/abs/2312.00752)
- https://github.com/johnma2006/mamba-minimal
- https://github.com/huggingface/candle/blob/main/candle-examples/examples/mamba-minimal/model.rs
- https://github.com/huggingface/transformers/pull/28094



Note: this dev work currently targets `state-spaces/mamba-130m`, so
please use that model if you want to test. Additionally, the prefill
needs to be limited when starting the router:
`cargo run -- --max-batch-prefill-tokens 768 --max-input-length 768`


## Update / Current State

Integration tests have been added and basic functionality such as model
loading is supported.

```bash
cd integration-tests
pytest -vv models/test_fused_kernel_mamba.py
```
- [x] add tests
- [x] load model
- [x] make simple request 
- [ ] resolve warmup issue
- [ ] resolve output issues


Fetch the models tested during development:
```bash
text-generation-server download-weights state-spaces/mamba-130m
text-generation-server download-weights state-spaces/mamba-1.4b
text-generation-server download-weights state-spaces/mamba-2.8b
```

Run the server:
```bash
cd server
MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 python text_generation_server/cli.py serve state-spaces/mamba-2.8b
```

Start the router:
```bash
cargo run
```

Make a request:
```bash
curl -s localhost:3000/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' | jq
```

Response:
```json
{
  "generated_text": "\n\nDeep learning is a machine learning technique that uses a deep neural network to learn from data."
}
```
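The same request can be built from Python using only the standard library. This is a minimal sketch rather than an official client: the helper names `build_generate_request` and `generate` are hypothetical, and the URL, prompt, and `max_new_tokens` value simply mirror the curl command above.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=20, base_url="http://localhost:3000"):
    """Build (but do not send) a POST request for the /generate route.

    Hypothetical helper: base_url assumes the router listens on localhost:3000,
    as in the curl example.
    """
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the `generated_text` field.

    Requires a running router; extra keyword arguments are forwarded to
    build_generate_request.
    """
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["generated_text"]
```

With the server and router running, `generate("What is Deep Learning?")` would return the generated text as a string.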

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

bd405e03

11 Dec, 2023 1 commit
- feat: mixtral (#1328) · 3a521c92
  OlivierDehaene authored Dec 11, 2023
  
  3a521c92
27 Nov, 2023 1 commit

Add RoCm support (#1243) · b2b5df0e

fxmarty authored Nov 27, 2023



This PR adds support for AMD Instinct MI210 & MI250 GPUs, with paged
attention and FAv2 support.

Remaining items to discuss, on top of possible others:
* Should we have a
`ghcr.io/huggingface/text-generation-inference:1.1.0+rocm` hosted image,
or is it too early?
* Should we set up a CI on MI210/MI250? I don't have access to the
runners of TGI though.
* Are we comfortable with those changes being directly in TGI, or do we
need a fork?

---------
Co-authored-by: Felix Marty <felix@hf.co>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: Your Name <you@example.com>

b2b5df0e

23 Nov, 2023 1 commit
- chore: update to torch 2.1.0 (#1182) · 35509ff5
  OlivierDehaene authored Nov 23, 2023
```
Close #1142
```
  35509ff5
27 Sep, 2023 1 commit

Support eetq weight only quantization (#1068) · 95a4bb69

Nicolas Patry authored Sep 27, 2023

---------
Co-authored-by: zhaosida <zhaosida@corp.netease.com>

95a4bb69

25 Sep, 2023 1 commit

Add AWQ quantization inference support (#1019) (#1054) · c5de7cd8

Nicolas Patry authored Sep 25, 2023

# Add AWQ quantization inference support

Fixes https://github.com/huggingface/text-generation-inference/issues/781

This PR (partially) adds support for AWQ quantization for inference.
More information on AWQ [here](https://arxiv.org/abs/2306.00978). In
general, AWQ is faster and more accurate than GPTQ, which is currently
supported by TGI.

This PR installs 4-bit GEMM custom CUDA kernels released by AWQ authors
(in `requirements.txt`, just one line change).

A quick way to test this PR is to bring up TGI as follows:

```
text-generation-server download-weights abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq

text-generation-launcher \
--huggingface-hub-cache ~/.cache/huggingface/hub/ \
--model-id abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq \
--trust-remote-code --port 8080 \
--max-input-length 2048 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 \
--quantize awq
```

Please note:
* This PR was tested with FlashAttention v2 and vLLM.
* This PR adds support for AWQ inference, not for quantizing the models.
That needs to be done outside of TGI; instructions are
[here](https://github.com/mit-han-lab/llm-awq/tree/f084f40bd996f3cf3a0633c1ad7d9d476c318aaa).
* This PR only adds support for `FlashLlama` models for now.
* Multi-GPU setup has not been tested.
* No integration tests have been added so far; they will be added later
if maintainers are interested in this change.
* This PR can be tested on any of the models released
[here](https://huggingface.co/abhinavkulkarni?sort_models=downloads#models).

Please refer to the linked issue for benchmarks of
[abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq](https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq)
vs
[TheBloke/Llama-2-7b-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ).

Please note that AWQ has released faster (and, in the case of Llama,
fused) kernels for 4-bit GEMM, currently at the top of the `main` branch
at https://github.com/mit-han-lab/llm-awq, but this PR uses an older
commit that has been tested to work. We can switch to the latest commit
later on.
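
To give a rough idea of what 4-bit, group-wise weight storage involves, here is a minimal pure-Python sketch. It assumes the `w4-g128` suffix in the model names above follows the usual convention of 4-bit weights with a group size of 128. It is not the AWQ algorithm itself (no activation-aware scaling, no packed layout, no CUDA kernels), and all function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Asymmetric uniform quantization of one group of floats to `bits`-bit integers."""
    qmax = (1 << bits) - 1          # 15 for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax
    if scale == 0.0:                # constant group: any non-zero scale works
        scale = 1.0
    q = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return q, scale, lo             # integers plus per-group scale and zero point


def dequantize_group(q, scale, zero):
    """Map the integers of one group back to approximate float weights."""
    return [v * scale + zero for v in q]


def quantize_groupwise(row, group_size=128, bits=4):
    """Quantize a weight row in independent groups of `group_size` values."""
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


def dequantize_groupwise(groups):
    """Concatenate the dequantized groups back into one row."""
    out = []
    for q, scale, zero in groups:
        out.extend(dequantize_group(q, scale, zero))
    return out
```

Each group keeps its own scale and zero point, so the reconstruction error of a weight is bounded by half of its group's scale; smaller groups track the local weight range more closely at the cost of storing more metadata.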

## Who can review?

@OlivierDehaene OR @Narsil

---------
Co-authored-by: Abhinav M Kulkarni <abhinavkulkarni@gmail.com>
Co-authored-by: Abhinav Kulkarni <abhinav@concentric.ai>

c5de7cd8

27 Jul, 2023 1 commit
- fix(server): fix missing datasets in quantize · 2efd46ef
  OlivierDehaene authored Jul 27, 2023
  
  2efd46ef
18 Jul, 2023 1 commit
- feat(server): flash attention v2 (#624) · 3b71c385
  OlivierDehaene authored Jul 18, 2023
  
  3b71c385
04 Jul, 2023 1 commit

fix: Update server/Makefile to include Makefile-vllm (#520) · 8405581f

Antoni Baum authored Jul 04, 2023

# What does this PR do?

Includes `Makefile-vllm` in `server/Makefile` for consistency and ease
of use: you can just run `make` to install vllm without any extra steps.

8405581f

08 Jun, 2023 1 commit

feat(server): Rework model loading (#344) · abd58ff8

Nicolas Patry authored Jun 08, 2023

# What does this PR do?

Reworked the loading logic. The idea is to use cleaner loading code:

- Remove the need for `no_init_weights`.
- Remove all the weird `bnb_linear`, `load_weights` and
`post_load_weights` code.

New code layout:

- A new class `Weights` is in charge of loading the weights from
multiple files into the appropriate tensors (potentially sharded).
- TP layers are now "shells": they contain the code that knows what kind
of sharding is needed, plus the eventual `all_reduce`. They do not
inherit from linear; they contain some kind of Linear instead.
- The contained linear can be either FastLinear or BnbLinear, with a
GPTQ Linear next.
- All modeling code is explicitly written for sharding; the process
group is just a no-op for non-sharded code (which removes a lot of test
cases).
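
The layout described above can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the code from the PR: tensors are nested lists instead of real tensors, checkpoint files are plain dicts, and the method names (`get_tensor`, `get_sharded`, `load`) and the `TensorParallelColumnLinear` class are hypothetical. Only the `Weights` and `FastLinear` names come from the description.

```python
class FastLinear:
    """Plain linear layer over a (rows x cols) weight stored as nested lists."""

    def __init__(self, weight):
        self.weight = weight

    def forward(self, x):
        # One dot product per output row.
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


class Weights:
    """Resolves tensor names across several files and returns this rank's shard."""

    def __init__(self, files, rank, world_size):
        self.files = files              # assumption: one dict per checkpoint file
        self.rank = rank
        self.world_size = world_size
        self.routing = {}               # tensor name -> index of the file holding it
        for idx, tensors in enumerate(files):
            for name in tensors:
                if name in self.routing:
                    raise ValueError(f"{name} found in more than one file")
                self.routing[name] = idx

    def get_tensor(self, name):
        return self.files[self.routing[name]][name]

    def get_sharded(self, name, dim):
        tensor = self.get_tensor(name)
        size = len(tensor) if dim == 0 else len(tensor[0])
        if size % self.world_size != 0:
            raise ValueError(f"{name}: size {size} not divisible by {self.world_size}")
        block = size // self.world_size
        start, stop = self.rank * block, (self.rank + 1) * block
        if dim == 0:
            return tensor[start:stop]                   # slice of the output rows
        return [row[start:stop] for row in tensor]      # slice of the input columns


class TensorParallelColumnLinear:
    """Shell layer: decides the sharding, delegates the math to a contained linear."""

    def __init__(self, linear):
        self.linear = linear            # contained, not inherited

    @classmethod
    def load(cls, weights, name):
        # Column parallelism: each rank owns a slice of the output features.
        return cls(FastLinear(weights.get_sharded(name, dim=0)))

    def forward(self, x):
        return self.linear.forward(x)
```

With `world_size=1` the shard is the whole tensor, so the same modeling code serves the non-sharded case. A row-parallel shell would shard along `dim=1` and sum the partial outputs across ranks, which is where the `all_reduce` comes in.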

![Screenshot from 2023-05-19 23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)

---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>

abd58ff8

16 May, 2023 1 commit
- fix(server): fix decode token (#334) · 5a582261
  OlivierDehaene authored May 16, 2023
```
Fixes #333

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
```
  5a582261
20 Apr, 2023 1 commit
- feat(router): drop requests when client closes the channel (#202) · 709d8936
  OlivierDehaene authored Apr 20, 2023
  
  709d8936
19 Apr, 2023 1 commit
- fix(docker): remove unused dependencies (#205) · 6837b2eb
  OlivierDehaene authored Apr 19, 2023
  
  6837b2eb
16 Apr, 2023 1 commit
- fix(docker): fix docker image dependencies (#187) · 7a1ba585
  OlivierDehaene authored Apr 17, 2023
  
  7a1ba585
09 Apr, 2023 1 commit
- feat(docker): improve flash_attention caching (#160) · 1883d8ec
  OlivierDehaene authored Apr 09, 2023
  
  1883d8ec
27 Mar, 2023 1 commit

feat(server): Add mypy-protobuf (#141) · 8e8dd984

Nick Hill authored Mar 27, 2023

Generates .pyi files for protobuf stubs which provide strong typing
information. Very helpful for IDE auto-completion, etc.

8e8dd984

24 Mar, 2023 1 commit
- feat(server): flash neoX (#133) · 05e9a796
  OlivierDehaene authored Mar 24, 2023
  
  05e9a796
15 Mar, 2023 1 commit
- fix(server): add position ids to neox (#126) · 8ad60b75
  OlivierDehaene authored Mar 15, 2023
  
  8ad60b75
13 Mar, 2023 1 commit
- fix(server): revert gpt-neox optims (#123) · cbd36aa4
  OlivierDehaene authored Mar 13, 2023
  
  cbd36aa4
07 Mar, 2023 1 commit
- feat(clients): Python client (#103) · 3fef90d5
  OlivierDehaene authored Mar 07, 2023
  
  3fef90d5
03 Mar, 2023 2 commits
- v0.3.2 (#97) · 1c19b093
  OlivierDehaene authored Mar 03, 2023
  
  1c19b093
- feat(server): fix transformers commit (#96) · 0b6807ca
  OlivierDehaene authored Mar 03, 2023
  
  0b6807ca
13 Feb, 2023 1 commit
- feat: add distributed tracing (#62) · 9af45414
  OlivierDehaene authored Feb 13, 2023
  
  9af45414
24 Jan, 2023 1 commit
- fix(dockerfile): fix docker build (#32) · 13e7044a
  OlivierDehaene authored Jan 24, 2023
  
  13e7044a
08 Dec, 2022 1 commit
- feat(server): Add model tests (#6) · a2985036
  OlivierDehaene authored Dec 08, 2022
  
  a2985036
01 Dec, 2022 1 commit
- feat(server): Support Galactica (#4) · daa1d81d
  OlivierDehaene authored Dec 01, 2022
  
  daa1d81d
08 Nov, 2022 1 commit
- fix(server): Fix Transformers fork version · fa43fb71
  OlivierDehaene authored Nov 08, 2022
  
  fa43fb71
07 Nov, 2022 1 commit
- feat(server): Improved doc · 4236e41b
  OlivierDehaene authored Nov 07, 2022
  
  4236e41b
03 Nov, 2022 1 commit
- fix(models): Revert buggy support for AutoModel · 755fc0e4
  OlivierDehaene authored Nov 03, 2022
  
  755fc0e4
28 Oct, 2022 1 commit
- feat(server): Support all AutoModelForCausalLM on a best effort basis · 3cf6368c
  OlivierDehaene authored Oct 28, 2022
  
  3cf6368c
22 Oct, 2022 1 commit

feat(server): Use safetensors · c8ce9b25

Nicolas Patry authored Oct 22, 2022


Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>

c8ce9b25

20 Oct, 2022 1 commit
- v0.1.0 · f16f2f5a
  Olivier Dehaene authored Oct 18, 2022
  
  f16f2f5a
08 Oct, 2022 1 commit
- Init · 295831a4
  Olivier Dehaene authored Oct 08, 2022
  
  295831a4