1. 30 Oct, 2024 1 commit
    • Support qwen2 vl (#2689) · befd9f67
      drbh authored
      * feat: add support for qwen2 vl model
      
      * feat: fix token padding, enable warmup and process basic request
      
      * fix: improve get_position_ids and lift embed_tokens
      
      * fix: remove get_cos_sin_hack dev function
      
      * feat: add simple test chat with message and text
      
      * fix: lint test
      
      * fix: adjust positional embeddings for multi dimensional position ids
      
      * fix: update docs and lint unused vars
      
      * fix: include linted file
      
      * fix: add norm after text output
      
      * fix: format model file
      
      * fix: adjust for ruff lints
      
      * fix: remove unused rotate_half
      
      * feat: refactors and calc num features
      
      * fix: prefer position_ids passed from vlm causal lm and reset ids on batch
      
      * fix: adjust get_position_ids if not available and add required args to signatures
      
      * fix: adjust resize case for qwen2_vl warmup
      
      * fix: avoid qwen2 vl specific paths with qwen2
      befd9f67
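      A rough, hypothetical sketch of the multi-dimensional position ids mentioned above (the helper name and section sizes are illustrative, not the model's actual configuration): each token carries one position id per axis, and each axis drives its own slice of the rotary dimensions.

      ```python
      import torch

      def multi_dim_cos_sin(inv_freq, position_ids, sections=(16, 24, 24)):
          """Hypothetical helper: position_ids is [n_axes, seq_len] (one row per
          axis) and `sections` splits the rotary frequencies between axes."""
          freqs = position_ids[:, :, None].float() * inv_freq  # [n_axes, seq, dim/2]
          cos_parts = torch.split(freqs.cos(), list(sections), dim=-1)
          sin_parts = torch.split(freqs.sin(), list(sections), dim=-1)
          # Axis i contributes its own section of the rotary dimensions.
          cos = torch.cat([part[i] for i, part in enumerate(cos_parts)], dim=-1)
          sin = torch.cat([part[i] for i, part in enumerate(sin_parts)], dim=-1)
          return cos, sin  # [seq_len, dim/2] each

      # Example: 3 axes, 64 rotary frequencies, 5 tokens (text-only, so axes match).
      inv_freq = 1.0 / (10000 ** (torch.arange(0, 64).float() / 64))
      position_ids = torch.arange(5).repeat(3, 1)
      cos, sin = multi_dim_cos_sin(inv_freq, position_ids)
      ```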
  2. 28 Oct, 2024 2 commits
    • Fixing auto bloom test. (#2699) · 3a9cdc32
      Nicolas Patry authored
      3a9cdc32
    • We can have a tokenizer anywhere. (#2527) · 90b226db
      Nicolas Patry authored
      * We can have a tokenizer anywhere.
      
      * Handling potential lack of offsets (python tokenizer)
      
      * Remove redundancy.
      
      * Fixing the tests.
      
      * Flake.lock update ?
      
      * Fixing the GIL locking.
      
      * Fixing mamba by using the transformers version.
      
      * Adding the legacy handle.
      
      * Elide lifetime.
      
      * Lint.
      
      * Deprecation message.
      
      * Fixing bad rebase.
      90b226db
  3. 24 Oct, 2024 1 commit
    • Add support for FP8 KV cache scales (#2628) · eab07f74
      Daniël de Kok authored
      * Add support for FP8 KV cache scales
      
      Since FP8 only has limited dynamic range, we can scale keys/values
      before storing them into the cache (and unscale them in attention). To
      avoid rescaling the cache as the absmax values change, good scales are
      usually determined per layer using calibration data and stored in the
      checkpoint.
      
      This change adds support for using key-value scales and loading them
      from checkpoints in the two most common formats:
      
      - Separate per-layer `k_scale` and `v_scale` scalars.
      - Per-layer `kv_scale` scalar (older format).
      
      Currently, scales are only used with a `float8_e4m3fn` cache (a sketch of
      the scaling follows this entry).
      
      Besides adding support for key/value scales, the `fp8_quantize` function
      is also extended to support quantization with a kernel vendored from
      vLLM. This is slightly faster than the PyTorch implementation, but also
      scales in FP32, potentially improving accuracy.
      
      * Update FP8 KV cache test to use checkpoint with scales
      
      * `can_scale`: check that the attention is flashinfer
      eab07f74
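      A rough sketch of the scaling described above (function and argument names are illustrative, not the actual TGI internals): keys and values are divided by their per-layer scale before the cast to `float8_e4m3fn`, and multiplied back when read for attention.

      ```python
      import torch

      def store_kv_fp8(key, value, k_cache, v_cache, slots, k_scale, v_scale):
          # Compress into FP8's limited dynamic range before casting down.
          k_cache[slots] = (key / k_scale).to(torch.float8_e4m3fn)
          v_cache[slots] = (value / v_scale).to(torch.float8_e4m3fn)

      def load_kv_fp8(k_cache, v_cache, slots, k_scale, v_scale, dtype=torch.float16):
          # Undo the scaling when reading the cache back for attention.
          return k_cache[slots].to(dtype) * k_scale, v_cache[slots].to(dtype) * v_scale

      # Older checkpoints ship a single per-layer `kv_scale`; in that case the
      # same scalar would be used for both k_scale and v_scale.
      ```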
  4. 23 Oct, 2024 2 commits
  5. 17 Oct, 2024 1 commit
    • Simplify the `attention` function (#2609) · 59ea38cb
      Daniël de Kok authored
      * Simplify the `attention` function
      
      - Use one definition rather than multiple.
      - Add `key`/`value` arguments, so that we don't need the
        `PREFILL_IN_KVCACHE` constant.
      - Make it kwargs-only (to avoid mixing up the various `Tensor` args).
      
      * Fixup flashinfer support
      59ea38cb
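      A minimal illustration of the resulting shape of the API (parameter names are approximate, not the exact TGI signature): one definition, explicit `key`/`value` arguments instead of the `PREFILL_IN_KVCACHE` constant, and keyword-only parameters so the many tensors cannot be passed in the wrong order.

      ```python
      def attention(
          *,                 # keyword-only: callers must name every tensor
          query,
          key,               # passed explicitly, so no PREFILL_IN_KVCACHE flag
          value,
          kv_cache,
          seqlen,
          block_tables=None,
          softmax_scale=None,
      ):
          ...
      ```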
  6. 16 Oct, 2024 1 commit
    • feat: prefill chunking (#2600) · a6a0c97e
      OlivierDehaene authored
      
      
      * wip
      
      * rollback
      
      * refactor to use prefix/postfix naming + fix all_input_ids_tensor
      
      * maybe patching vlms?
      
      * fix filter and concat
      
      * wip, no filter, no concat
      
      * current
      
      * add prepare_for_prefill
      
      * working
      
      * load tested
      
      * re-create slots
      
      * re-create slots
      
      * fix slot_filtering_indices
      
      * feedback loop
      
      * remove log
      
      * fix benchmarker
      
      * fix vlm and seq2seq
      
      * rename to cache and input lengths
      
      * fix prefill logprobs
      
      * fix launcher
      
      * fix logprobs?
      
      * idk at this point
      
      * max input length
      
      * omfg
      
      * remove debugging lines
      
      * fix tests
      
      * fix mllama
      
      * fix cargo tests
      
      * remove support chunking for paged
      
      * Fixing non blocked attentions
      
      * Fixing dtype + AMD, Ipex targets.
      
      * lint fix.
      
      * rename
      
      * Fix prefix_caching variable, remove defaults in server (confusing a lot
      of the time).
      
      * Add simple resolution when user specifies ATTENTION=paged.
      
      * Put back non default simple tests.
      
      * Fix env name
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      a6a0c97e
  7. 07 Oct, 2024 1 commit
  8. 04 Oct, 2024 1 commit
    • Add basic FP8 KV cache support (#2603) · 2358c2bb
      Daniël de Kok authored
      * Add basic FP8 KV cache support
      
      This change adds rudimentary FP8 KV cache support. The support is
      enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
      uses this type for the KV cache. However, support is still limited:
      
      * Only the `fp8_e5m2` type is supported.
      * The KV cache layout is the same as `float16`/`bfloat16` (HND).
      * The FP8 KV cache is only supported for FlashInfer.
      * Loading of scales is not yet supported.
      
      * Fix Cargo.toml
      2358c2bb
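      In practical terms (a simplified sketch, assuming flashinfer's HND-style paged layout of `[num_blocks, num_heads, block_size, head_dim]`), the cache keeps the same layout as the `float16` path and only the element type changes; values are cast straight to `fp8_e5m2` with no scaling at this stage. Scale loading for the `e4m3` cache arrived later in #2628 (24 Oct entry above).

      ```python
      import torch

      num_blocks, num_heads, block_size, head_dim = 128, 8, 16, 64
      kv_cache = torch.empty(
          num_blocks, num_heads, block_size, head_dim,  # HND-style layout
          dtype=torch.float8_e5m2,
      )

      key = torch.randn(num_heads, block_size, head_dim, dtype=torch.float16)
      kv_cache[0] = key.to(torch.float8_e5m2)    # store: plain downcast, no scales
      restored = kv_cache[0].to(torch.float16)   # read back for attention (lossy)
      ```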
  9. 02 Oct, 2024 1 commit
    • Mllama flash version (#2585) · d18ed5cf
      Nicolas Patry authored
      * Working loading state.
      
      * Preprocessing.
      
      * Working state ? (Broke idefics1 temporarily).
      
      * Cleaner condition.
      
      * Fix idefics.
      
      * Updating config, removing TODO
      
      * Mllama
      
      * Upgrade transformers 4.45
      
      * Flashing mllama.
      
      * Starting to get there.
      
      * Working state.
      
      * Integration tests for mllama (cutting to 10 tokens because there seems
      to be instability afterwards, meaning the size of the batch matters).
      
      * Updating model link.
      
      * Earlier assert.
      
      * Fix vlm ?
      
      * remove log.
      
      * Force ignore all images but last.
      
      * Default dtype bfloat16.
      
      * Update integration test after switch to bf16.
      
      * Remove dead code.
      
      * Removed dead code.
      
      * Upgrade the flake to latest transformers/tokenizers
      
      * Move to hf tgi-nix
      
      * Upgrade to 0.5.0
      d18ed5cf
  10. 30 Sep, 2024 2 commits
    • feat: support phi3.5 moe (#2479) · 93a7042d
      drbh authored
      
      
      * feat: support phi3.5 moe model loading
      
      * fix: prefer llama base model and improve rotary logic
      
      * feat: return reasonable generation and add integration test
      
      * fix: run lint and update docs
      
      * fix: rerun lint for openapi docs
      
      * fix: prefer do_sample false unless temp is set by user, and update chat tests
      
      * fix: small typo adjustments
      
      * fix: consolidate long rope paths
      
      * fix: revert greedy by default and test changes
      
      * Vendor configuration so that we don't have to `trust_remote_code`
      
      * Use SparseMoELayer
      
      * Add support for dense MoE
      
      * Some type annotations
      
      * Add the usual model tests
      
      * Ruff.
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      93a7042d
    • Update ROCM libs and improvements (#2579) · f9e561ec
      Mohit Sharma authored
      * style
      
      * update torch
      
      * fix issues
      
      * fix clone
      
      * revert mkl
      
      * added custom PA
      
      * style
      
      * fix style
      
      * style
      
      * hide env var
      
      * fix mixtral model
      
      * add skinny kernel and merge fixes
      
      * fixed style
      
      * fix issue for sliding window models
      
      * addressed review comments
      
      * fix import
      
      * improved error message
      
      * updated default value
      
      * remove import
      
      * fix imports after rebase
      
      * float16 dep
      
      * improve dockerfile
      
      * cleaned dockerfile
      f9e561ec
  11. 27 Sep, 2024 1 commit
    • Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2
      Daniël de Kok authored
      * Improve support for GPUs with capability < 8
      
      - For models that cannot use flashinfer, use flash-attn v1 + paged
        attention for models with a compute capability older than 8.
      - Disable prefix caching when using paged attention.
      - When using flash-attn v1, pass the key/value, rather than the
        cache, since v1 cannot use block tables.
      
      * nix: add flash-attn-v1 to the server environment
      
      * Move disabling prefix caching into the block of exceptions
      
      * Capability as `usize`s
      5b6b74e2
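      Schematically (a simplified sketch; the real selection logic in the server is more involved), the backend choice described above boils down to a compute-capability check.

      ```python
      import torch

      major, _minor = torch.cuda.get_device_capability()

      if major >= 8:
          use_flashinfer = True
          prefix_caching = True
      else:
          # Older GPUs: flash-attn v1 + paged attention.
          use_flashinfer = False
          prefix_caching = False  # the paged-attention path cannot reuse cached prefixes
          # flash-attn v1 has no block-table support, so prefill passes the
          # key/value tensors directly instead of reading them back from the cache.
      ```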
  12. 26 Sep, 2024 1 commit
  13. 24 Sep, 2024 1 commit
  14. 20 Sep, 2024 1 commit
  15. 17 Sep, 2024 1 commit
    • Move to moe-kernels package and switch to common MoE layer (#2511) · ce85efa9
      Daniël de Kok authored
      * Move to moe-kernels package and switch to common MoE layer
      
      This change introduces the new `moe-kernels` package:
      
      - Add `moe-kernels` as a dependency.
      - Introduce a `SparseMoELayer` module that can be used by MoE
        models.
      - Port over Mixtral and Deepseek.
      
      * Make `cargo check` pass
      
      * Update runner
      ce85efa9
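      For context, this is roughly the computation a sparse MoE layer performs (a naive reference sketch; the actual `SparseMoELayer` dispatches to fused `moe-kernels` routines, and the gated activation is simplified to ReLU here).

      ```python
      import torch

      def sparse_moe(hidden, gate_w, w_up, w_down, topk=2):
          # hidden: [tokens, d_model]; gate_w: [d_model, n_experts]
          # w_up: [n_experts, d_model, d_ff]; w_down: [n_experts, d_ff, d_model]
          probs = torch.softmax(hidden @ gate_w, dim=-1)
          weights, experts = torch.topk(probs, topk, dim=-1)
          weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize top-k
          out = torch.zeros_like(hidden)
          for k in range(topk):
              for e in range(gate_w.shape[-1]):
                  mask = experts[:, k] == e
                  if mask.any():
                      x = hidden[mask]
                      out[mask] += weights[mask, k][:, None] * (torch.relu(x @ w_up[e]) @ w_down[e])
          return out
      ```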
  16. 05 Sep, 2024 1 commit
  17. 02 Sep, 2024 1 commit
  18. 29 Aug, 2024 1 commit
    • Lots of improvements (Still 2 allocators) (#2449) · e415b690
      Nicolas Patry authored
      
      
      * Making prefix/flashinfer the default and running the full release tests.
      
      * Include flashinfer in the docker.
      
      * Using prebuilt.
      
      * Allowing window_left_size (dummy version).
      
      * Disabling flashinfer/prefix caching on odd head_dim
      
      * Disable prefix caching for lora.
      
      * More specific codes.
      
      * Update lock
      
      * Updating integration tests with new values with FI/FD.
      
      Remove paged as a default too, and using FD everywhere.
      
      * Update cargo lock ?
      
      * Upgrade to 1.80 because of bitstream...
      
      * Everywhere 1.80
      
      * Forgot last default place.
      
      * Apply suggestions from code review
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      
      * Updated flake lock
      
      * Tmp
      
      * Upgrade resolution system for fewer errors in resolution.
      
      * Remove lambda for cleaner function.
      
      * Handling debugger.
      
      * Override the env in server tests.
      
      * Is this enough to make it work ?
      
      * This seems to be working.
      
      * Downgrade some logs.
      
      * Fixing the default for vlm.
      
      * Don't enable prefix caching on VLM just yet.
      
      * Change `add_special_tokens` in order to have the correct tokens for chat
      input (since it's super important with the prefixing now); see the sketch
      after this entry.
      
      * Fixing prefix caching for flashdecoding.
      
      * Update all models.
      
      * Fixed flashinfer version.
      
      * add_special_tokens is internal only
      
      * Fixing seqlen with the new vlms.
      
      * Fixing the issue with `add_special_tokens` not being passed around.
      
      * Fixing the test.
      
      * Removing encoder_decoder (seq2seq).
      
      * Update the chat test.
      
      * Fixing the batching tokenization in flash causal lm.
      
      * Truncating left for radix purposes.
      
      * Oops this doesn't belong here.
      
      * Put back default pure shell.
      
      * Update server tests
      
      - Default to throughput test in k6
      - Use TGI_WIGGLE_ROOM to adjust wiggle room
      
      * Only n_heads / process_group.size() are necessary.
      
      * Revert the integration tests change (seems linked to head_size
      modification).
      
      * Adding error message when assert is violated.
      
      * Fixing the free algorithm to handle times where the common prefix is
      smaller.
      
      * Apply suggestions from code review
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Update server/text_generation_server/layers/attention/common.py
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Fix disabling prefix caching - Fix windowing checks.
      
      * Revert the Cohere tokenizer change (for now using a revision instead).
      
      * Fmt.
      
      ---------
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      e415b690
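      Regarding the `add_special_tokens` change above, a toy illustration of why it matters (generic, not TGI-specific): chat templates typically already emit the BOS/special tokens, so tokenizing the rendered chat text with `add_special_tokens=True` would duplicate them and change the prompt prefix, which in turn breaks prefix/radix cache matching.

      ```python
      # Toy illustration: if the chat template already inserted the BOS token,
      # tokenizing with add_special_tokens=True would prepend a second one.
      BOS = "<s>"

      def render_chat(messages):
          # Stand-in for a chat template that already includes the BOS token.
          return BOS + " " + " ".join(f"[{m['role']}] {m['content']}" for m in messages)

      def tokenize(text, add_special_tokens):
          return ([BOS] if add_special_tokens else []) + text.split()

      prompt = render_chat([{"role": "user", "content": "Hello"}])
      assert tokenize(prompt, add_special_tokens=True)[:2] == [BOS, BOS]  # duplicated
      assert tokenize(prompt, add_special_tokens=False)[0] == BOS         # correct
      ```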
  19. 26 Aug, 2024 1 commit
  20. 20 Aug, 2024 1 commit
    • Prefix caching (#2402) · b70ae096
      Nicolas Patry authored
      
      
      * Prefix caching WIP
      
      * Fixing prefix attention.
      
      * Fixing flashinfer import.
      
      * Fixing black.
      
      * Fixing medusa (still wrong outputs, but functional).
      
      * Just medusa values now.
      
      * Fixing medusa without prefix caching.
      
      * Fixing prefix caching.
      
      * Medusa requires reshaping.
      
      * Removing the logs.
      
      * Remove router.nix
      
      * Fixup:
      
      - Remove logs
      - Disable VLMs (they do not work)
      - Disable prefix caching when user wants prefill logprobs.
      
      * Update flake.lock
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
      b70ae096
  21. 08 Aug, 2024 4 commits
  22. 07 Aug, 2024 1 commit
  23. 06 Aug, 2024 2 commits
  24. 01 Aug, 2024 2 commits
  25. 26 Jul, 2024 2 commits
    • feat: add ruff and resolve issue (#2262) · bab02ff2
      drbh authored
      * feat: add ruff and resolve issue
      
      * fix: update client exports and adjust after rebase
      
      * fix: adjust syntax to avoid circular import
      
      * fix: adjust client ruff settings
      
      * fix: lint and refactor import check and avoid model enum as global names
      
      * fix: improve fbgemm_gpu check and lints
      
      * fix: update lints
      
      * fix: prefer comparing model enum over str
      
      * fix: adjust lints and ignore specific rules
      
      * fix: avoid unneeded quantize check
      bab02ff2
  26. 24 Jul, 2024 2 commits
  27. 23 Jul, 2024 2 commits
  28. 22 Jul, 2024 2 commits