- 25 Jun, 2024 1 commit
drbh authored
* feat: first draft load multiple lora
* feat: load weights within layer and refactor lora pass
* fix: refactor and reduce lora math
* feat: baseline impl single request multi lora support
* feat: prefer lorax implementation and port loading logic
* fix: prefer adapter_data and refactors
* feat: prefer lorax's custom punica kernels and add mlp loras
* fix: adjust batch for bgmv
* fix: adjust adapter_segments logic when in batch
* fix: refactor and move changes to v3 proto
* fix: pass model_id for all flash causal lms
* fix: pass model_id for all causal and seq2seq lms
* fix: add model_id to model test
* feat: add lora support to mistral and refactors
* feat: prefer model id in request
* fix: include rust code for adapter id
* feat: bump launcher and add new lora docs
* feat: support base model generation and refactors
* fix: rename doc to retry ci build
* feat: support if vlm models
* fix: add adapter_data param and avoid missing layers
* fix: add adapter_data param to phi and neox
* fix: update all models forwards to include adapter_data
* fix: add model_id to IdeficsCausalLM
* Update lora.md (fixed a typo)
* Update lora.md (fixing spam image)
* fix: add lora kernel to dockerfile, support running without kernels and refactors
* fix: avoid dockerfile conflict
* fix: refactors and adjust flash llama lora logic
* fix: skip llama test due to CI issue (temp)
* fix: skip llama test CI (temp) 2
* fix: revert skips and prefer updated ci token for tests
* fix: refactors and helpful comments
* fix: add noop in TensorParallelAdapterRowLinear too
* fix: refactor and move shard_lora_weights logic
* fix: exit early if no adapter_data

---------

Co-authored-by: Derek <datavistics@gmail.com>
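For background on what the "lora pass" above computes: a LoRA adapter adds a scaled low-rank update on top of a frozen base linear layer, so serving many adapters only means swapping small A/B weight pairs. A minimal sketch of the idea (class and parameter names are illustrative, not TGI's implementation):

```python
import torch
import torch.nn as nn


class LoraLinear(nn.Module):
    """Frozen base linear layer plus a low-rank LoRA update."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        # Low-rank factors: A projects down to `rank`, B projects back up.
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        self.scaling = alpha / rank
        nn.init.zeros_(self.lora_b.weight)  # the adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank delta: base(x) + s * B(A(x)).
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```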

- 07 Jun, 2024 1 commit
Daniël de Kok authored
The router now sends the input as chunks in addition to a single string. This change modifies the server to process the chunked input rather than the string, which also allows us to remove the image extraction code from the server.
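In practice this means the server iterates over typed chunks instead of regex-scanning one flat string. A rough sketch of that idea (the chunk classes and the `image_token_ids` helper are hypothetical stand-ins for the real proto types):

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    text: str


@dataclass
class ImageChunk:
    data: bytes
    mimetype: str


def tokenize_chunks(chunks, tokenizer, image_processor):
    """Tokenize text chunks directly; hand image chunks to the image
    processor instead of extracting images out of a flat string."""
    input_ids = []
    for chunk in chunks:
        if isinstance(chunk, TextChunk):
            input_ids.extend(tokenizer.encode(chunk.text))
        elif isinstance(chunk, ImageChunk):
            # Hypothetical helper: expands an image into its placeholder tokens.
            input_ids.extend(image_processor.image_token_ids(chunk))
        else:
            raise ValueError(f"unsupported chunk type: {type(chunk)}")
    return input_ids
```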

- 14 Dec, 2023 1 commit
OlivierDehaene authored

- 11 Dec, 2023 2 commits
OlivierDehaene authored
Nicolas Patry authored

- 08 Jun, 2023 1 commit
Nicolas Patry authored
# What does this PR do?

Reworked the loading logic. The idea is to use cleaner loading code:

- Remove the need for `no_init_weights`
- Remove all the weird `bnb_linear`, `load_weights` and `post_load_weights` helpers.

New code layout:

- A new class `Weights` is in charge of loading the weights from multiple files into the appropriate tensors (potentially sharded)
- TP layers are now "shells": they contain the code to know what kind of sharding we need plus the eventual `all_reduce`. They do not inherit from Linear but contain some kind of Linear instead; the contained linear can be FastLinear, BnbLinear or GPTQ Linear next.
- All modeling code is explicitly made for sharding; the process group is just a no-op for non-sharded code (which removes a lot of test cases)

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
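For context, the core idea is a single loader that knows which file holds each tensor and can hand back either the full tensor or the current rank's shard. A condensed sketch of that routing (names are illustrative, not the exact TGI API; for brevity it loads the full tensor before slicing, whereas a real loader would read only the needed slice):

```python
import torch
from safetensors import safe_open


class Weights:
    def __init__(self, filenames, device="cpu", dtype=torch.float16):
        # Map each tensor name to the safetensors file that contains it.
        self.routing = {}
        for filename in filenames:
            with safe_open(filename, framework="pt") as f:
                for name in f.keys():
                    self.routing[name] = filename
        self.device = device
        self.dtype = dtype

    def get_tensor(self, name):
        with safe_open(self.routing[name], framework="pt") as f:
            return f.get_tensor(name).to(self.dtype).to(self.device)

    def get_sharded(self, name, rank, world_size):
        # Row-shard: each tensor-parallel rank keeps only its slice.
        full = self.get_tensor(name)
        size = full.shape[0] // world_size
        return full[rank * size : (rank + 1) * size]
```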

- 02 Jun, 2023 1 commit
OlivierDehaene authored
Close #288

- 26 May, 2023 1 commit
OlivierDehaene authored
Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>

- 24 May, 2023 1 commit
OlivierDehaene authored
Closes #307 #308

- 16 May, 2023 1 commit
OlivierDehaene authored
Fixes #333

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

- 24 Apr, 2023 2 commits
OlivierDehaene authored
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Nick Hill authored

- 20 Apr, 2023 1 commit
OlivierDehaene authored

- 11 Apr, 2023 1 commit
OlivierDehaene authored

- 09 Apr, 2023 1 commit
OlivierDehaene authored

- 16 Mar, 2023 1 commit
OlivierDehaene authored

- 07 Mar, 2023 1 commit
OlivierDehaene authored

- 06 Mar, 2023 1 commit
OlivierDehaene authored

- 24 Feb, 2023 1 commit
OlivierDehaene authored

- 03 Feb, 2023 1 commit
OlivierDehaene authored

- 02 Feb, 2023 1 commit
OlivierDehaene authored
@njhill, @yk FYI: `generated_text` was concatenated to the user prompt for legacy reasons. We want to remove this behaviour, as we don't think it is useful and it can even be detrimental to usability. We also remove the unused Vec.
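A made-up before/after illustration of the response shape (values invented for clarity; not actual output):

```python
prompt = "The capital of France is"

# Before: generated_text echoed the prompt plus the completion.
old_response = {"generated_text": "The capital of France is Paris."}

# After: generated_text holds only the newly generated tokens.
new_response = {"generated_text": " Paris."}
```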

- 31 Jan, 2023 3 commits
OlivierDehaene authored
OlivierDehaene authored
Reverts huggingface/text-generation-inference#36
OlivierDehaene authored
Add token streaming using Server-Sent Events (SSE). The signature of the SSE events is:

```rust
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

struct StreamResponse {
    token: Token,
    generated_text: Option<String>,
    details: Option<Details>,
}

struct ErrorResponse {
    error: String,
}
```
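On the client side, the stream can be consumed by reading `data:` lines off the HTTP response. A minimal Python consumer sketch (the `/generate_stream` path and field names follow TGI's conventions, but treat the details here as illustrative):

```python
import json
import requests

resp = requests.post(
    "http://localhost:8080/generate_stream",
    json={"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}},
    stream=True,
)
for line in resp.iter_lines():
    # SSE frames look like `data: {...}`; skip blank keep-alive lines.
    if not line or not line.startswith(b"data:"):
        continue
    payload = json.loads(line[len(b"data:"):])
    print(payload["token"]["text"], end="", flush=True)
    if payload.get("generated_text") is not None:
        print()  # the final event carries the full generated_text and details
```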

- 20 Jan, 2023 2 commits
OlivierDehaene authored
OlivierDehaene authored

- 15 Dec, 2022 1 commit
OlivierDehaene authored

- 12 Dec, 2022 1 commit
OlivierDehaene authored

- 08 Dec, 2022 1 commit
OlivierDehaene authored