- 29 Aug, 2024 1 commit
-
Nicolas Patry authored
* Making prefix/flashinfer the default and testing the full release tests.
* Include flashinfer in the docker.
* Using prebuilt.
* Allowing window_left_size (dummy version).
* Disabling flashinfer/prefix caching on odd head_dim.
* Disable prefix caching for lora.
* More specific codes.
* Update lock.
* Updating integration tests with new values with FI/FD. Remove paged as a default too, and use FD everywhere.
* Update cargo lock?
* Upgrade to 1.80 because of bitstream...
* Everywhere 1.80.
* Forgot last default place.
* Apply suggestions from code review. Co-authored-by: drbh <david.richard.holtz@gmail.com>
* Updated flake lock.
* Tmp.
* Upgrade resolution system for fewer errors in resolution.
* Remove lambda for cleaner function.
* Handling debugger.
* Override the env in server tests.
* Is this enough to make it work?
* This seems to be working.
* Downgrade some logs.
* Fixing the default for vlm.
* Don't enable prefix caching on VLM just yet.
* Change `add_special_tokens` in order to have the correct tokens for chat input (since it's super important with the prefixing now).
* Fixing prefix caching for flashdecoding.
* Update all models.
* Fixed flashinfer version.
* add_special_tokens is internal only.
* Fixing seqlen with the new vlms.
* Fixing the issue with `add_special_tokens` not being passed around.
* Fixing the test.
* Removing encoder_decoder (seq2seq).
* Update the chat test.
* Fixing the batching tokenization in flash causal lm.
* Truncating left for radix purposes.
* Oops, this doesn't belong here.
* Put back default pure shell.
* Update server tests:
  - Default to throughput test in k6.
  - Use TGI_WIGGLE_ROOM to adjust wiggle room.
* Only n_heads / process_group.size() are necessary.
* Revert the integration tests change (seems linked to head_size modification).
* Adding error message when assert is violated.
* Fixing the free algorithm to handle times where the common prefix is smaller.
* Apply suggestions from code review. Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Update server/text_generation_server/layers/attention/common.py. Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Fix disabling prefix caching; fix windowing checks.
* Revert the Cohere tokenizer change (for now using a revision instead).
* Fmt.
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
-
- 26 Aug, 2024 1 commit
-
drbh authored
* Fix: don't apply post layernorm in SiglipVisionTransformer. This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see the original transformers implementation: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813). This also makes Siglip consistent with the existing Clip implementation: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613
* fix: adjust pali gemma for post layer norm and small refactors
---------
Co-authored-by: Travis Addair <tgaddair@gmail.com>
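For illustration, a minimal self-contained sketch (toy modules, not the real Siglip code) of where the post layernorm is skipped:

```python
import torch
from torch import nn


class TinySiglipVisionTransformer(nn.Module):
    """Toy stand-in for SiglipVisionTransformer to show where the change lands."""

    def __init__(self, hidden_size: int = 64):
        super().__init__()
        # Stand-ins for the real patch embeddings and transformer encoder.
        self.embeddings = nn.Linear(3, hidden_size)
        self.encoder = nn.TransformerEncoderLayer(hidden_size, nhead=4, batch_first=True)
        self.post_layernorm = nn.LayerNorm(hidden_size)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        hidden_states = self.embeddings(pixel_values)
        encoder_outputs = self.encoder(hidden_states)
        # Before the fix: return self.post_layernorm(encoder_outputs)
        # LLaVA Next expects the encoder outputs *before* the final layernorm,
        # so the post layernorm is no longer applied here (matching the Clip code).
        return encoder_outputs


if __name__ == "__main__":
    model = TinySiglipVisionTransformer()
    out = model(torch.randn(1, 16, 3))  # (batch, "patches", channels) toy input
    print(out.shape)
```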
-
- 20 Aug, 2024 1 commit
-
Nicolas Patry authored
* Prefix caching WIP
* Fixing prefix attention.
* Fixing flashinfer import.
* Fixing black.
* Fixing medusa (still wrong outputs, but functional).
* Just medusa values now.
* Fixing medusa without prefix caching.
* Fixing prefix caching.
* Medusa requires reshaping.
* Removing the logs.
* Remove router.nix
* Fixup:
  - Remove logs
  - Disable VLMs (they do not work)
  - Disable prefix caching when the user wants prefill logprobs.
* Update flake.lock
---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
-
- 08 Aug, 2024 4 commits
-
drbh authored
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
drbh authored
* Fix the bug
* fix: run lints
* fix: small syntax tweak
---------
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
-
drbh authored
* add gptj modeling. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix: update docs for model addition
* fix: adjust syntax typo
* fix: adjust syntax typo again
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 07 Aug, 2024 1 commit
-
almersawi authored
Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai>
-
- 06 Aug, 2024 2 commits
- 01 Aug, 2024 2 commits
-
Daniël de Kok authored
- Always return the hidden states.
- Create the output tensor inside the `attention` and `paged_attention` functions. This removes the difference in how the output is handled between `attention` (output parameter) and `paged_attention` (return value). It also removes the assumption that the attention implementation can write to an output tensor (in preparation for FlashInfer).
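A hedged sketch of the resulting calling convention; the signatures are simplified assumptions (the real functions live under `server/text_generation_server/layers/attention/` and take more arguments), with `scaled_dot_product_attention` standing in for the actual kernels:

```python
import torch
import torch.nn.functional as F


def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Prefill attention: the output tensor is created here and returned,
    instead of being passed in by the caller as an `out` parameter."""
    out = torch.empty_like(q)
    # Stand-in for the flash-attention kernel writing into `out`.
    out.copy_(F.scaled_dot_product_attention(q, k, v))
    return out


def paged_attention(q: torch.Tensor, k_cache: torch.Tensor, v_cache: torch.Tensor) -> torch.Tensor:
    """Decode attention: same convention, the result is allocated and returned,
    so callers no longer need to special-case the two paths."""
    out = torch.empty_like(q)
    # Stand-in for the paged-attention kernel writing into `out`.
    out.copy_(F.scaled_dot_product_attention(q, k_cache, v_cache))
    return out


if __name__ == "__main__":
    q = k = v = torch.randn(1, 4, 8, 16)  # (batch, heads, seq, head_dim)
    print(attention(q, k, v).shape, paged_attention(q, k, v).shape)
```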
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 26 Jul, 2024 2 commits
-
drbh authored
* feat: add ruff and resolve issue
* fix: update client exports and adjust after rebase
* fix: adjust syntax to avoid circular import
* fix: adjust client ruff settings
* fix: lint and refactor import check; avoid model enum as global names
* fix: improve fbgemm_gpu check and lints
* fix: update lints
* fix: prefer comparing model enum over str
* fix: adjust lints and ignore specific rules
* fix: avoid unneeded quantize check
-
Daniël de Kok authored
-
- 24 Jul, 2024 2 commits
-
Wang, Yi authored
Fix use of unquantized weights in Cohere GQA loading; also enable the model on the Intel platform.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Wang, Yi authored
* fix crash in multi-modal. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update according to review comment. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix llava_next regression in latest main. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 23 Jul, 2024 2 commits
-
shaltielshmid authored
* Support passing head_dim through config
* Using `head_dim` as a fallback is necessary since it's a non-standard key in MistralConfig (as defined in transformers).
* Shorter diff.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
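A minimal sketch of the fallback described above; `get_head_dim` is a hypothetical helper, and the attribute names follow the usual transformers config conventions:

```python
from typing import Any


def get_head_dim(config: Any) -> int:
    """Prefer an explicit `head_dim` from the config; fall back to the usual
    hidden_size / num_attention_heads when the key is absent (e.g. MistralConfig)."""
    head_dim = getattr(config, "head_dim", None)
    if head_dim is not None:
        return head_dim
    return config.hidden_size // config.num_attention_heads
```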
-
Nicolas Patry authored
-
- 22 Jul, 2024 2 commits
-
Nicolas Patry authored
* Softcapping for gemma2.
* Less clutter.
* No access to transformers config, only config_dict here.
* 0.0 is the null value in the C++ API.
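For context, Gemma 2-style softcapping squashes logits with a scaled tanh, and a cap of 0.0 means "disabled" (the null value mentioned above). A minimal sketch:

```python
import torch


def softcap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    """Gemma 2-style soft capping: cap * tanh(logits / cap).
    A cap of 0.0 is treated as "disabled"."""
    if cap == 0.0:
        return logits
    return cap * torch.tanh(logits / cap)


if __name__ == "__main__":
    x = torch.tensor([-100.0, 0.0, 100.0])
    print(softcap(x, 50.0))  # values are squashed into (-50, 50)
```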
-
icyboy™ authored
* Update idefics_causal_lm.py: fix syntax issues
* fix dbrx & opt model prefix bug
* Hotfix: fix use of unquantized weights in Mixtral GQA loading
-
- 21 Jul, 2024 1 commit
-
OlivierDehaene authored
-
- 20 Jul, 2024 1 commit
-
OlivierDehaene authored
* feat(fp8): add support for fbgemm
* allow loading fp8 weights directly
* update outlines
* fix makefile
* build fbgemm
* avoid circular import and fix dockerfile
* add default dtype
* refactored weights loader
* fix auto conversion
* fix quantization config parsing
* force new nccl on install
* missing get_weights implementation
* increase timeout
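For illustration only (this is not the fbgemm code path), a minimal sketch of per-tensor FP8 (e4m3) weight quantization, the kind of representation such a loader deals with; it assumes a PyTorch build that provides `torch.float8_e4m3fn`:

```python
import torch


def quantize_fp8_per_tensor(weight: torch.Tensor):
    """Scale the weight into the e4m3 representable range and cast to FP8.
    Returns the FP8 tensor plus the scale needed to dequantize at matmul time."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = weight.abs().max().clamp(min=1e-12) / finfo.max
    qweight = (weight / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return qweight, scale


if __name__ == "__main__":
    w = torch.randn(128, 256)
    qw, s = quantize_fp8_per_tensor(w)
    w_back = qw.to(torch.float32) * s
    print((w - w_back).abs().max())  # small quantization error
```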
-
- 19 Jul, 2024 5 commits
-
Daniël de Kok authored
Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models:

- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`), so we need weight loading that supports quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value, so we need to pad during attention.
- Heads with size 192 need an extension to our paged attention fork, and we need to ensure that the KV cache is allocated with the correct size.
- Shared experts.
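As a hedged illustration of the padding point above: the value head size is smaller than the query/key head size (e.g. 128 vs 192), so the values can be zero-padded to the query/key head size before calling a kernel that expects uniform head sizes, and the padding sliced off afterwards. Shapes are assumptions, and `scaled_dot_product_attention` stands in for the paged attention kernel:

```python
import torch
import torch.nn.functional as F


def attention_with_padded_values(q, k, v):
    """q, k: (batch, heads, seq, qk_head_dim); v: (batch, heads, seq, v_head_dim)
    with qk_head_dim > v_head_dim."""
    qk_head_dim, v_head_dim = q.shape[-1], v.shape[-1]
    # Zero-pad the value head dimension so the kernel sees uniform head sizes.
    v_padded = F.pad(v, (0, qk_head_dim - v_head_dim))
    out = F.scaled_dot_product_attention(q, k, v_padded)
    # Drop the padded part of the output again.
    return out[..., :v_head_dim]


if __name__ == "__main__":
    q = torch.randn(1, 4, 8, 192)
    k = torch.randn(1, 4, 8, 192)
    v = torch.randn(1, 4, 8, 128)
    print(attention_with_padded_values(q, k, v).shape)  # (1, 4, 8, 128)
```
-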
Daniël de Kok authored
-
Daniël de Kok authored
-
Daniël de Kok authored
-
Daniël de Kok authored
* Improve the handling of quantized weights

  Handling of quantized weights was split between two mechanisms:

  - For quantized checkpoints, we used the new weight loader infrastructure.
  - For quantization while loading (EETQ, FP8, bitsandbytes) we instead relied on conditionals in `get_linear`.

  Weight loaders support context managers to selectively load particular layers with different weight loaders, which is useful for models like Idefics2 AWQ, which uses a quantized text model but unquantized vision and connector models. However, the context manager would be overridden by `get_linear`, which string-checks `quantizer`. Also, the context manager would not work with EETQ, FP8, and bitsandbytes.

  This change migrates all quantizers to the weight loader infrastructure. This has several benefits:

  - We can use context managers with all quantizers.
  - All the implementation details move down to the quantizer layers; `get_linear` does not need to know how to handle quantizer linear layers.
  - All quantizer weights are strongly typed; we don't pass around raw tensors.
  - We don't have to pass around the `quantizer` string everywhere.
* Exclude non-MLP layers when using FP8 quantization with Llama
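A hypothetical sketch of the context-manager pattern referred to above; the class and method names are illustrative, not TGI's actual API:

```python
from contextlib import contextmanager


class Weights:
    """Toy stand-in: holds a current loader and lets callers swap it temporarily."""

    def __init__(self, loader):
        self.loader = loader

    @contextmanager
    def use_loader(self, loader):
        # Temporarily replace the active weight loader, then restore it.
        old, self.loader = self.loader, loader
        try:
            yield
        finally:
            self.loader = old


class AWQLoader:
    name = "awq"


class UnquantizedLoader:
    name = "unquantized"


weights = Weights(AWQLoader())
# e.g. Idefics2 AWQ: quantized text model, but unquantized vision/connector weights.
with weights.use_loader(UnquantizedLoader()):
    print(weights.loader.name)  # "unquantized" inside the block
print(weights.loader.name)      # back to "awq"
```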
-
- 18 Jul, 2024 1 commit
-
OlivierDehaene authored
-
- 16 Jul, 2024 1 commit
-
Daniël de Kok authored
Fixes #2036.
-
- 09 Jul, 2024 1 commit
-
Daniël de Kok authored
Quantized weights were loaded in the `Weights` class, but this was getting quite unwieldy: every higher-level method to load weights was a long conditional to cover all the different quantizers.

This change moves loading of quantized weights out of the `Weights` class. This is done by defining a simple `WeightsLoader` interface that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`, and `MarlinWeightsLoader`. These implementations are in the quantizers' respective modules. The `Weights` class provides the low-level load operations (such as loading tensors or sharded tensors), but delegates loads that need quantizer-specific weight processing to a loader. The loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights` would inherit from `Weights`. But it is not very flexible (e.g. it does not work well with the new weight storage mock used in tests) and the implicit indirections made the code harder to follow.
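A hedged sketch of what such a `WeightsLoader` interface can look like; the method names and signatures are simplified assumptions rather than the exact TGI definitions:

```python
from abc import ABC, abstractmethod


class WeightsLoader(ABC):
    """Quantizer-specific weight processing, kept out of the `Weights` class."""

    @abstractmethod
    def get_weights_col(self, weights, prefix: str):
        """Load a column-sharded weight for `prefix`."""

    @abstractmethod
    def get_weights_row(self, weights, prefix: str):
        """Load a row-sharded weight for `prefix`."""


class DefaultWeightsLoader(WeightsLoader):
    """Unquantized weights: just hand back the raw tensors from `weights`,
    which is assumed to expose a low-level sharded-tensor load."""

    def get_weights_col(self, weights, prefix):
        return weights.get_sharded(f"{prefix}.weight", dim=0)

    def get_weights_row(self, weights, prefix):
        return weights.get_sharded(f"{prefix}.weight", dim=1)
```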
-
- 08 Jul, 2024 2 commits
-
Daniël de Kok authored
-
icyboy™ authored
* Update idefics_causal_lm.py: fix syntax issues
* fix dbrx & opt model prefix bug
-
- 05 Jul, 2024 4 commits
-
Daniël de Kok authored
* Consistently take `prefix` in model constructors
* Release test check fix
* Misc refactor-related fixes
-
Daniël de Kok authored
-
Nicolas Patry authored
-
Nicolas Patry authored
* Refactor dead code.
* First working step.
* Remove a lot of duplicated code.
* More dead code.
* More cleanup.
* Fix Santacoder test.
* Fixing the simple tests.
* Fixing sharding.
* Fixes for VLM.
* Fixing santacoder (num_kv_heads hardcoded).
* Removing more dead code.
* Fixing `config.n_head`.
* Stopping earlier because of `<end_of_utterance>` in idefics2.
* Addresses comments.
* Removing the dead code.
* Fuse back mistral into FlashCausalLM.
* Finish removal.
* Fixing docs + causal_lm `batch_class`.
* Fixing docs + causal_lm.
* Add default to Gemma Causality.
* Default value for gemma/gemma2.
* Wrong default.
-
- 02 Jul, 2024 2 commits
-
Nicolas Patry authored
-
drbh authored
-
- 01 Jul, 2024 2 commits
-
Nicolas Patry authored
* Using flash decoding:
  - Conditional flashdecoding.
  - Fix max_q.
  - Working kvcache.
  - Working version with flash decoding.
  - Make it work for mistral.
  - Fix after rebase.
  - Less intrusive.
  - Revert changes in modeling.
  - Speedup flashdecoding.
  - Hack to make other models work.
  - Fixing non flash decoding llama path.
  - Router logic knows about page size.
  - Missing 2 models.
  - Missing cohere.
  - Fixing cohere flash decoding.
  - Revamped all this architecture.
  - Fix cohere.
  - Fixing falcon.
  - Enabling custom block size schedule.
  - Update router/src/infer.rs
  - Not sending preallocated output.
* Making it work on non flash decoding.
* Fix Cohere.
* Fix non decoding paths.
* Rebased.
* No need for cache_manager anymore.
* Update?
* "ipex" -> "cpu"
* These do not belong.
* Factoring cu_seqlen_qk for better abstracting over every model.
* Fixing non flash tests/imports.
* Changing return everywhere.
* Update mistral past.
* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).
* Fixup mistral clamping (had issues with cuda graphs).
* No need to recreate anything actually.
-
Nicolas Patry authored
-