Commits · 10e6f292956705bebd7b9feeb96a9289c4ce20a2 · OpenDAS / text-generation-inference

24 Sep, 2024 1 commit
- chore: Add old V2 backend (#2551) · 10e6f292
  OlivierDehaene authored Sep 24, 2024
```
* wip

* added v2
```
  10e6f292
09 Aug, 2024 1 commit

drbh authored Aug 09, 2024



* Fix unsigned integer underflow

Passing --max-batch-size to the launcher actually had no effect
because after a few requests the max_size passed to State::next_batch
would underflow, becoming a large positive number.

In the scheduler, as soon as the cached batch size reached the
max_batch_size the max_size passed to next_batch becomes 0.
Since the only check in that function is
```
if Some(batch_requests.len()) == max_size {
    break;
}
```
and it's called after the `batch_requests.len()` has
become 1, it doesn't do anything to prevent more than 0
requests from being batched.

Now we have a cached batch in the server that is larger than
max_batch_size, and `max_size - batch_size as usize`
underflows.
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

* fix: update v3 scheduler and ensure max_batch_size > 0

---------
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
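
The underflow described in this commit message can be reproduced in isolation. The following is a minimal sketch, not the router's actual code: `remaining_wrapping`, `remaining_saturating`, and `take_batch` are hypothetical names, and the batching loop is a simplified stand-in for `State::next_batch`. It assumes the limit is checked before a request is admitted, so a limit of 0 admits nothing.

```rust
use std::collections::VecDeque;

/// Release-mode behaviour of `max_batch_size - cached_batch_size` on `usize`,
/// made explicit with `wrapping_sub` (a debug build would panic instead).
fn remaining_wrapping(max_batch_size: usize, cached_batch_size: usize) -> usize {
    max_batch_size.wrapping_sub(cached_batch_size)
}

/// Clamped alternative: the result never goes below zero.
fn remaining_saturating(max_batch_size: usize, cached_batch_size: usize) -> usize {
    max_batch_size.saturating_sub(cached_batch_size)
}

/// Hypothetical stand-in for a batching loop.
/// The limit is checked *before* a request is admitted, so `Some(0)`
/// yields an empty batch instead of letting one request slip through.
fn take_batch(queue: &mut VecDeque<u32>, max_size: Option<usize>) -> Vec<u32> {
    let mut batch = Vec::new();
    loop {
        if let Some(limit) = max_size {
            if batch.len() >= limit {
                break;
            }
        }
        match queue.pop_front() {
            Some(id) => batch.push(id),
            None => break,
        }
    }
    batch
}
```

With an equality check performed only after the first push, a limit of 0 is never matched; the `>=` comparison before the push is one way to close that gap under the assumptions above.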

6d06473c

08 Jul, 2024 1 commit

update to metrics 0.23.0 or could work with metrics-exporter-promethe… (#2190) · 58effe78

Wang, Yi authored Jul 08, 2024



update to metrics 0.23.0 or could work with metrics-exporter-prometheus 0.15.1
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

58effe78

25 Jun, 2024 1 commit

Enable multiple LoRa adapters (#2010) · 04e1af94

drbh authored Jun 25, 2024



* feat: first draft load multiple lora

* feat: load weights within layer and refactor lora pass

* fix: refactor and reduce lora math

* feat: baseline impl single request multi lora support

* feat: prefer lorax implementation and port loading logic

* fix: prefer adapter_data and refactors

* feat: prefer lorax's custom punica kernels and add mlp loras

* fix: adjust batch for bgmv

* fix: adjust adapter_segments logic when in batch

* fix: refactor and move changes to v3 proto

* fix: pass model_id for all flash causal lms

* fix: pass model_id for all causal and seq2seq lms

* fix: add model_id to model test

* feat: add lora support to mistral and refactors

* feat: prefer model id in request

* fix: include rust code for adapter id

* feat: bump launcher and add new lora docs

* feat: support base model generation and refactors

* fix: rename doc to retry ci build

* feat: support if vlm models

* fix: add adapter_data param and avoid missing layers

* fix: add adapter_data param to phi and neox

* fix: update all models forwards to include adapter_data

* fix: add model_id to IdeficsCausalLM

* Update lora.md

Fixed a typo

* Update lora.md

Fixing spam image

* fix: add lora kernel to dockerfile, support running without kernels and refactors

* fix: avoid dockerfile conflict

* fix: refactors and adjust flash llama lora logic

* fix: skip llama test due to CI issue (temp)

* fix: skip llama test CI (temp) 2

* fix: revert skips and prefer updated ci token for tests

* fix: refactors and helpful comments

* fix: add noop in TensorParallelAdapterRowLinear too

* fix: refactor and move shard_lora_weights logic

* fix: exit early if no adapter_data

---------
Co-authored-by: Derek <datavistics@gmail.com>

04e1af94

04 Jun, 2024 1 commit

feat: add SchedulerV3 (#1996) · 757223b3

OlivierDehaene authored Jun 04, 2024

- Refactor code to allow supporting multiple versions of the
generate.proto at the same time
- Add v3/generate.proto (ISO to generate.proto for now but allow for
future changes without impacting v2 backends)
- Add Schedule trait to abstract queuing and batching mechanisms that
will be different in the future
- Add SchedulerV2/V3 impl
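
As a rough illustration of what a trait that abstracts queuing and batching might look like, here is a minimal sketch. All names (`Scheduler`, `Request`, `FifoScheduler`, `TokenBudgetScheduler`, `batch_sizes`) are hypothetical, and the two batching policies are invented for the example; they do not describe how SchedulerV2 and SchedulerV3 actually differ.

```rust
use std::collections::VecDeque;

/// Hypothetical request type used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Request {
    id: u64,
    input_tokens: usize,
}

/// Hypothetical trait hiding the queuing and batching policy from callers.
trait Scheduler {
    fn enqueue(&mut self, request: Request);
    fn next_batch(&mut self) -> Vec<Request>;
}

/// Policy A (assumption): batches are bounded by a request count.
struct FifoScheduler {
    queue: VecDeque<Request>,
    max_requests: usize,
}

impl Scheduler for FifoScheduler {
    fn enqueue(&mut self, request: Request) {
        self.queue.push_back(request);
    }
    fn next_batch(&mut self) -> Vec<Request> {
        let n = self.max_requests.min(self.queue.len());
        self.queue.drain(..n).collect()
    }
}

/// Policy B (assumption): batches are bounded by a token budget.
/// A request larger than the whole budget is never scheduled in this sketch.
struct TokenBudgetScheduler {
    queue: VecDeque<Request>,
    max_tokens: usize,
}

impl Scheduler for TokenBudgetScheduler {
    fn enqueue(&mut self, request: Request) {
        self.queue.push_back(request);
    }
    fn next_batch(&mut self) -> Vec<Request> {
        let mut batch = Vec::new();
        let mut used = 0;
        while let Some(front) = self.queue.front() {
            if used + front.input_tokens > self.max_tokens {
                break;
            }
            used += front.input_tokens;
            batch.push(self.queue.pop_front().unwrap());
        }
        batch
    }
}

/// Caller code depends only on the trait, not on a concrete version.
fn batch_sizes(scheduler: &mut dyn Scheduler) -> Vec<usize> {
    let mut sizes = Vec::new();
    loop {
        let batch = scheduler.next_batch();
        if batch.is_empty() {
            break;
        }
        sizes.push(batch.len());
    }
    sizes
}
```

The point of the design is that code such as `batch_sizes` stays unchanged when a new scheduler version is added behind the same trait.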

757223b3

03 Jun, 2024 1 commit

router: send the input as chunks to the backend · df71aafd

Daniël de Kok authored Jun 03, 2024

Before this change, the generation input was sent to the backend as a
single string, encoding images as Base64 and packing them in
Markdown-style links.

This change adds a new chunked input representation that separates text
chunks from image chunks. Image chunks contain binary data (for smaller
message sizes) and the image's MIME type.

The stringly-typed inputs are still sent to support backends that do not
support chunked inputs yet.
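
The difference between the two representations can be sketched as follows. This is an illustration under assumptions, not the router's types: `InputChunk`, `to_legacy_string`, and `base64_encode` are hypothetical names, and the exact Markdown/data-URI layout of the legacy string is assumed.

```rust
/// Hypothetical chunked representation: text and images are kept apart.
#[derive(Debug, Clone, PartialEq)]
enum InputChunk {
    Text(String),
    /// Raw image bytes plus the image's MIME type.
    Image { data: Vec<u8>, mimetype: String },
}

/// Standard Base64 (with padding), written out to keep the sketch dependency-free.
fn base64_encode(bytes: &[u8]) -> String {
    const TABLE: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::new();
    for group in bytes.chunks(3) {
        let b0 = group[0] as u32;
        let b1 = group.get(1).copied().unwrap_or(0) as u32;
        let b2 = group.get(2).copied().unwrap_or(0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(TABLE[((n >> 18) & 63) as usize] as char);
        out.push(TABLE[((n >> 12) & 63) as usize] as char);
        out.push(if group.len() > 1 { TABLE[((n >> 6) & 63) as usize] as char } else { '=' });
        out.push(if group.len() > 2 { TABLE[(n & 63) as usize] as char } else { '=' });
    }
    out
}

/// Assumed legacy rendering: one string, images inlined as Markdown-style
/// links that carry a Base64 data URI.
fn to_legacy_string(chunks: &[InputChunk]) -> String {
    let mut out = String::new();
    for chunk in chunks {
        match chunk {
            InputChunk::Text(text) => out.push_str(text),
            InputChunk::Image { data, mimetype } => {
                out.push_str(&format!("![](data:{};base64,{})", mimetype, base64_encode(data)));
            }
        }
    }
    out
}
```

Base64 emits four characters for every three input bytes, which is why carrying the raw bytes in an image chunk gives a smaller message than the inlined string.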

df71aafd

12 Apr, 2024 2 commits

Improve the defaults for the launcher (#1727) · 1b2670c8

Nicolas Patry authored Apr 12, 2024

# What does this PR do?

- Renamed `max_input_length` into `max_input_tokens` for consistency
(backward compatible change, will yell if both are set.)
- Will now use the config for `max_input_tokens`, `max_total_tokens` and
`max_batch_total_tokens`.
- Capping the values to 16k in order to save VRAM on behalf of users
(overridable by simply setting the values).
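
The resolution rules listed above can be sketched roughly as follows. This is a hedged illustration, not the launcher's code: `resolve_max_input_tokens` and `DEFAULT_CAP` are hypothetical names, the cap is assumed to be 16,384 (the PR text only says "16k"), and the model config is assumed to supply a single maximum value.

```rust
/// Assumed value for the "16k" cap; the exact number is not stated in the PR text.
const DEFAULT_CAP: usize = 16_384;

/// Hypothetical resolver for the renamed option.
/// - `max_input_tokens`: the new name.
/// - `max_input_length`: the old name, still accepted for backward compatibility.
/// - `config_max`: the value derived from the model config (assumption).
fn resolve_max_input_tokens(
    max_input_tokens: Option<usize>,
    max_input_length: Option<usize>,
    config_max: usize,
) -> Result<usize, String> {
    match (max_input_tokens, max_input_length) {
        // Setting both names is ambiguous, so it is rejected.
        (Some(_), Some(_)) => Err(
            "both `max_input_tokens` and `max_input_length` are set; use `max_input_tokens`"
                .to_string(),
        ),
        // An explicit value from either name wins and is not capped.
        (Some(value), None) | (None, Some(value)) => Ok(value),
        // No explicit value: fall back to the config, capped to save VRAM.
        (None, None) => Ok(config_max.min(DEFAULT_CAP)),
    }
}
```

Only the default path is capped in this sketch, which matches the stated intent that users can override the cap simply by setting the value.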

1b2670c8

fix(router): fix a possible deadlock in next_batch (#1731) · c2c98725
OlivierDehaene authored Apr 12, 2024

c2c98725

15 Feb, 2024 1 commit

Outlines guided generation (#1539) · cef0553d

drbh authored Feb 15, 2024

This WIP PR starts to add grammar support via outlines. Currently this
PR supports very simple regex grammars and does not optimize for
precompiling or caching grammar FSMs.

todo:
- [X] add simple outlines guidance to `NextTokenChooser`
- [X] update protos for grammar
- [X] update generation params API
- [X] constrain simple grammar
- [ ] support parsing more complex grammar into fsm
- [ ] support all grammar types supported by outlines
- [ ] explore optimizations to avoid recompiling grammars

guided request
```bash
curl -s 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data-raw '{
    "inputs": "make an email for david: \n",
    "parameters": {
        "max_new_tokens": 6,
        "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+"
    }
}' | jq
```
response
```json
{
  "generated_text": "david@example.com"
}
```

unguided request
```bash
curl -s 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "inputs": "make an email for david: \n",
    "parameters": {
        "max_new_tokens": 6
    }
}' | jq
```
response
```json
{
  "generated_text": "    email = 'david"
}
```
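
The contrast between the guided and unguided responses comes from restricting token choice to what the grammar's state machine still allows. The sketch below illustrates that idea only; it does not use outlines or the server's `NextTokenChooser`. The DFA is hand-written for a simplified pattern `[a-z]+@[a-z]+` (an assumption chosen for brevity), and `step`, `advance`, and `choose_token` are hypothetical names.

```rust
/// Hand-written DFA for the toy pattern `[a-z]+@[a-z]+`.
/// States: 0 = start, 1 = local part, 2 = just after '@', 3 = domain (accepting).
fn step(state: u8, c: char) -> Option<u8> {
    match (state, c) {
        (0, 'a'..='z') | (1, 'a'..='z') => Some(1),
        (1, '@') => Some(2),
        (2, 'a'..='z') | (3, 'a'..='z') => Some(3),
        _ => None,
    }
}

/// Feeds a whole (possibly multi-character) token through the DFA.
/// Returns `None` as soon as any character is rejected.
fn advance(state: u8, token: &str) -> Option<u8> {
    token.chars().try_fold(state, step)
}

/// Greedy guided choice: the highest-scoring token that keeps the DFA alive,
/// together with the state reached after emitting it.
fn choose_token<'a>(state: u8, vocab: &[&'a str], scores: &[f32]) -> Option<(&'a str, u8)> {
    vocab
        .iter()
        .zip(scores)
        .filter_map(|(token, score)| advance(state, token).map(|next| (*token, next, *score)))
        .max_by(|a, b| a.2.total_cmp(&b.2))
        .map(|(token, next, _)| (token, next))
}
```

With a vocabulary of `["david", "@", "example", " ", "42"]`, the tokens `" "` and `"42"` are never chosen regardless of their scores, because the DFA rejects them in every state.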

cef0553d

09 Feb, 2024 1 commit
- feat(router): add max_batch_size (#1542) · 53214633
  OlivierDehaene authored Feb 09, 2024
```
Some hardware requires a maximum batch size.
```
  53214633
08 Feb, 2024 1 commit
- feat(server): add frequency penalty (#1541) · 09b7c26b
  OlivierDehaene authored Feb 08, 2024
  
  09b7c26b
11 Dec, 2023 1 commit
- Speculative (#1308) · 9ecfa16b
  Nicolas Patry authored Dec 11, 2023
  
  9ecfa16b
23 Oct, 2023 1 commit
- feat: remove flume (#1184) · f9910d13
  OlivierDehaene authored Oct 23, 2023
  
  f9910d13
28 Sep, 2023 1 commit
- feat: add mistral model (#1071) · 3b56d766
  OlivierDehaene authored Sep 28, 2023
  
  3b56d766
28 Aug, 2023 1 commit

Rebased #617 (#868) · 211b54ac

Nicolas Patry authored Aug 28, 2023

---------
Co-authored-by: Vincent Brouwers <vincent.brouwers@ing.com>

211b54ac

19 Jul, 2023 1 commit
- feat(server): auto max_batch_total_tokens for flash att models (#630) · fe80f536
  OlivierDehaene authored Jul 19, 2023
  
  fe80f536
30 Jun, 2023 1 commit
- feat(server): add paged attention to flash models (#516) · e74bd41e
  OlivierDehaene authored Jun 30, 2023
```
Closes #478
```
  e74bd41e
23 Jun, 2023 1 commit
- fix(router): add timeout on flume sends (#488) · bd3a9d8e
  OlivierDehaene authored Jun 23, 2023
  
  bd3a9d8e
16 Jun, 2023 1 commit
- feat(router): add ngrok integration (#453) · f59fb8b6
  OlivierDehaene authored Jun 16, 2023
  
  f59fb8b6
02 Jun, 2023 1 commit
- feat(server): only compute prefill logprobs when asked (#406) · 895c5f15
  OlivierDehaene authored Jun 02, 2023
```
Close #288
```
  895c5f15
26 Apr, 2023 2 commits
- feat(router): new healthcheck that skips the queue (#244) · db2b4e07
  Nicolas Patry authored Apr 26, 2023
```
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
```
  db2b4e07
- feat(router): add tests to validation (#237) · c4fb09f2
  Nicolas Patry authored Apr 26, 2023
  
  c4fb09f2
24 Apr, 2023 1 commit
- feat(router): use number of tokens in batch as input for dynamic batching (#226) · ebc74d56
  OlivierDehaene authored Apr 24, 2023
```
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
```
  ebc74d56
20 Apr, 2023 1 commit
- feat(router): drop requests when client closes the channel (#202) · 709d8936
  OlivierDehaene authored Apr 20, 2023
  
  709d8936
09 Apr, 2023 1 commit
- feat(router): make router input validation optional (#164) · 99879600
  OlivierDehaene authored Apr 09, 2023
  
  99879600
30 Mar, 2023 1 commit
- feat(benchmark): tui based benchmarking tool (#149) · 610bb1f9
  OlivierDehaene authored Mar 30, 2023
  
  610bb1f9
16 Mar, 2023 1 commit
- fix(server): use server tokenizer as gt (#128) · b49dbf2d
  OlivierDehaene authored Mar 16, 2023
  
  b49dbf2d
09 Mar, 2023 1 commit
- feat: support typical sampling (#114) · 1a2d6825
  OlivierDehaene authored Mar 09, 2023
```
closes #112
```
  1a2d6825
06 Mar, 2023 1 commit
- feat: allow local models (#101) · cd5961b5
  OlivierDehaene authored Mar 06, 2023
```
closes #99
```
  cd5961b5
02 Mar, 2023 1 commit
- feat(server): add logits watermark (#90) · 9b8ea6a6
  OlivierDehaene authored Mar 02, 2023
  
  9b8ea6a6
16 Feb, 2023 1 commit
- feat(router): add prometheus metrics scrape endpoint (#71) · 439fcaf8
  OlivierDehaene authored Feb 16, 2023
  
  439fcaf8
13 Feb, 2023 1 commit
- feat: add distributed tracing (#62) · 9af45414
  OlivierDehaene authored Feb 13, 2023
  
  9af45414
02 Feb, 2023 1 commit
- feat(router): use background task to manage request queue (#52) · 7b870e1e
  OlivierDehaene authored Feb 02, 2023
```
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
```
  7b870e1e