1. 08 Aug, 2024 3 commits
  2. 07 Aug, 2024 1 commit
  3. 06 Aug, 2024 3 commits
  4. 05 Aug, 2024 1 commit
    • fix: attempt forward on flash attn2 to check hardware support (#2335) · 215ed3ad
      drbh authored
      * fix: attempt forward on flash attn2 to check hardware support
      
      * fix: warn window_size_left when using flash attn 1
      
      * fix: prefer version check over test op and avoid window_size_left if not flash attn2
      
      * fix: improve conditional and error message
      
      * fix: update sliding window conditional
      
      * fix: simplify changes and revert model changes
      
      * fix: avoid changing conditional
      
      * fix: typo tweak
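      The probe described above can be sketched as: attempt a tiny forward pass through
      FlashAttention-2 and treat any failure as missing hardware support, falling back to
      flash attn 1 otherwise. A minimal sketch, assuming the `flash_attn` package's
      `flash_attn_func` entry point; the helper name is hypothetical and this is not the
      repository's actual check.

      import torch

      def supports_flash_attn_v2() -> bool:
          # Probe FlashAttention-2 with a tiny forward pass; any import or
          # kernel error (e.g. unsupported GPU architecture) is treated as
          # "not supported" so the caller can fall back.
          try:
              from flash_attn import flash_attn_func
          except ImportError:
              return False
          try:
              q = torch.randn(1, 1, 1, 64, dtype=torch.float16, device="cuda")
              flash_attn_func(q, q, q)
              return True
          except Exception:
              return False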
  5. 01 Aug, 2024 2 commits
  6. 31 Jul, 2024 2 commits
  7. 30 Jul, 2024 1 commit
  8. 29 Jul, 2024 2 commits
  9. 26 Jul, 2024 2 commits
    • feat: add ruff and resolve issue (#2262) · bab02ff2
      drbh authored
      * feat: add ruff and resolve issue
      
      * fix: update client exports and adjust after rebase
      
      * fix: adjust syntax to avoid circular import
      
      * fix: adjust client ruff settings
      
      * fix: lint and refactor import check and avoid model enum as global names
      
      * fix: improve fbgemm_gpu check and lints
      
      * fix: update lints
      
      * fix: prefer comparing model enum over str
      
      * fix: adjust lints and ignore specific rules
      
      * fix: avoid unneeded quantize check
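      One of the fixes above, avoiding a circular import, usually comes down to deferring an
      import from module scope to call time. A minimal sketch of that pattern; the module and
      class names are illustrative, not the repository's.

      def get_default_client():
          # Importing lazily breaks the cycle: loading this module no longer
          # pulls in the client package, which in turn imports this module.
          from my_package.client import Client  # illustrative import path

          return Client(base_url="http://127.0.0.1:8080")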
  10. 25 Jul, 2024 1 commit
  11. 24 Jul, 2024 4 commits
  12. 23 Jul, 2024 5 commits
  13. 22 Jul, 2024 3 commits
  14. 21 Jul, 2024 1 commit
  15. 20 Jul, 2024 1 commit
    • feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248) · 53ec0b79
      OlivierDehaene authored
      * feat(fp8): add support for fbgemm
      
      * allow loading fp8 weights directly
      
      * update outlines
      
      * fix makefile
      
      * build fbgemm
      
      * avoid circular import and fix dockerfile
      
      * add default dtype
      
      * refactored weights loader
      
      * fix auto conversion
      
      * fix quantization config parsing
      
      * force new nccl on install
      
      * missing get_weights implementation
      
      * increase timeout
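      To illustrate what loading fp8 weights involves, here is a minimal per-tensor FP8 (e4m3)
      quantization sketch in PyTorch (assuming a recent torch with `torch.float8_e4m3fn`):
      scale the tensor so its largest magnitude fits the FP8 range, cast, and keep the scale
      for dequantization at matmul time. This shows only the general idea, not the fbgemm
      kernel path added in this PR.

      import torch

      def fp8_quantize(weight: torch.Tensor):
          # Per-tensor scaling: map the largest magnitude onto the FP8 e4m3 range.
          finfo = torch.finfo(torch.float8_e4m3fn)
          scale = finfo.max / weight.abs().max().clamp(min=1e-12)
          qweight = (weight * scale).clamp(min=finfo.min, max=finfo.max)
          qweight = qweight.to(torch.float8_e4m3fn)
          # Return the inverse scale, needed to dequantize (or to fold into the matmul).
          return qweight, scale.reciprocal().float()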
  16. 19 Jul, 2024 6 commits
    • Add support for Deepseek V2 (#2224) · e52be9bb
      Daniël de Kok authored
      Deepseek V2 is a MoE model from Deepseek. Relevant differences
      compared to other models:
      
      - Grouped top-K in expert selection (a sketch follows below).
      - mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
        configuration options.
      - `mscale_all_dim` is also used in scaling attention softmax.
      - Permuting of the query/key representations before applying rotary
        embeddings.
      - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
        So, we need weight loaders that support quantized weights. To this
        end, `{Weights,WeightLoader}.get_weight` was added.
      - The query/key head dimensionality differs from that of the value,
        so we need to pad during attention.
      - Heads with size 192 need an extension to our paged attention
        fork, and we need to ensure that the KV cache is allocated with the
        correct size.
      - Shared experts.
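      A rough sketch of the grouped top-K routing mentioned in the list above, assuming router
      scores of shape `[num_tokens, num_experts]`; the parameter names (`n_groups`,
      `topk_groups`, `top_k`) are illustrative and this is not the repository's implementation.

      import torch

      def grouped_topk(scores: torch.Tensor, n_groups: int, topk_groups: int, top_k: int):
          # Experts are split into n_groups groups (num_experts must be divisible
          # by n_groups). Each group is ranked by its best expert score, only the
          # topk_groups best groups are kept, and the final top_k experts are
          # chosen among the surviving groups.
          num_tokens, num_experts = scores.shape
          group_scores = scores.view(num_tokens, n_groups, -1).max(dim=-1).values
          group_idx = group_scores.topk(topk_groups, dim=-1).indices
          group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
          score_mask = (
              group_mask.unsqueeze(-1)
              .expand(num_tokens, n_groups, num_experts // n_groups)
              .reshape(num_tokens, num_experts)
          )
          masked_scores = scores.masked_fill(score_mask == 0, float("-inf"))
          topk_weights, topk_ids = masked_scores.topk(top_k, dim=-1)
          return topk_weights, topk_ids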
    • Daniël de Kok authored · 3b41e93a
    • Daniël de Kok authored · 18db78f2
    • Improve the handling of quantized weights (#2250) · ba291dad
      Daniël de Kok authored
      * Improve the handling of quantized weights
      
      Handling of quantized weights was split between two mechanisms:
      
      - For quantized checkpoints, we used the new weight loader
        infrastructure.
      - For quantization while loading (EETQ, FP8, bitsandbytes) we
        instead relied on conditionals in `get_linear`.
      
      Weight loaders support context managers to selectively load
      particular layers with different weight loaders, which is useful
      for models like Idefics2 AWQ, which uses a quantized text model,
      but unquantized vision and connector models. However, the context
      manager would be overridden by `get_linear`, which string-checks
      `quantizer`. Also, the context manager would not work with
      EETQ, FP8, and bitsandbytes.
      
      This change migrates all quantizers to the weight loader infrastructure.
      This has several benefits:
      
      - We can use context managers with all quantizers.
      - All the implementation details move down to the quantizer layers;
        `get_linear` does not need to know how to handle quantized linear
        layers.
      - All quantizer weights are strongly typed; we don't pass around
        raw tensors.
      - We don't have to pass around the `quantizer` string everywhere.
      
      * Exclude non-MLP layers when using FP8 quantization with Llama
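      A minimal sketch of the context-manager pattern described above: a `Weights` object holds
      a default weights loader, and selected layers can temporarily be loaded with a different
      one (e.g. an unquantized vision tower next to an AWQ-quantized text model). The names are
      illustrative and simplified relative to the actual classes.

      from contextlib import contextmanager

      class Weights:
          def __init__(self, loader):
              # The loader encapsulates all quantizer-specific details, so callers
              # never branch on a quantization string or pass around raw tensors.
              self.loader = loader

          @contextmanager
          def use_loader(self, loader):
              # Temporarily swap the active loader for a subset of layers.
              previous = self.loader
              self.loader = loader
              try:
                  yield
              finally:
                  self.loader = previous

          def get_weights(self, prefix: str):
              return self.loader.load(self, prefix)

      # Usage sketch (loader classes are hypothetical):
      #   weights = Weights(AwqWeightsLoader(...))
      #   with weights.use_loader(UnquantizedWeightsLoader(...)):
      #       vision = weights.get_weights("vision_model")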
  17. 18 Jul, 2024 1 commit
  18. 16 Jul, 2024 1 commit