Commits · 5ad39dd3c35f751c8b553ebb802aa2d996d5c9c5 · OpenDAS / text-generation-inference

24 Jul, 2024 1 commit

fix crash in multi-modal (#2245) · 5ad39dd3

Wang, Yi authored Jul 24, 2024



* fix crash in multi-modal
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update according to review comment
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix llava_next regression in latest main
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

5ad39dd3

23 Jul, 2024 9 commits

hotfix: update nccl · a8950294
OlivierDehaene authored Jul 23, 2024

a8950294
chore: update to torch 2.4 (#2259) · e7e3aa6c
OlivierDehaene authored Jul 23, 2024
```
* chore: update to torch 2.4

* remove un-necessary patch

* fix
```
e7e3aa6c
hotfix: pin numpy (#2289) · bc9593a5
Daniël de Kok authored Jul 23, 2024

bc9593a5
Add support for Llama 3 rotary embeddings (#2286) · 4ab41737
Daniël de Kok authored Jul 23, 2024
```
* Add support for Llama 3 rotary embeddings

* Update transformers to 4.43
```
4ab41737

Preparing for release. (#2285) · 5d121a97

Nicolas Patry authored Jul 23, 2024

* Preparing for release.

* Updating docs.

* Fixing token within the docker image for the launcher.

5d121a97

[WIP] Add support for Mistral-Nemo by supporting head_dim through config (#2254) · 3961e323

shaltielshmid authored Jul 23, 2024



* Support passing head_dim through config

* Using `head_dim` as a fallback is necessary since it's a non standard
key in mistralConfig (as defined in transformers).

* Shorter diff.

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

3961e323

Add support for repacking AWQ weights for GPTQ-Marlin (#2278) · 9935720c

Daniël de Kok authored Jul 23, 2024

* Add support for repacking AWQ weights for GPTQ-Marlin

So far we couldn't support AWQ because virtually all AWQ models use
symmetric quantization, which GPTQ-Marlin did not suppors. GPTQ-Marlin
has recently added support AWQ repacking and AWQ asymmetric quantization
(zero_point=True).

This change updates all GPTQ-Marlin kernels from upstream and wires up
AWQ support. For now enabling AWQ using Marlin requires running TGI with
`--quantize gptq`.

* Enable Marlin for supported AWQ configurations by default

This makes the AWQ -> GPTQ repack test redundant, since we are now
testing this with the regular AWQ test.

9935720c

fix(l4): fix fp8 logic on l4 (#2277) · 5fca30ee

OlivierDehaene authored Jul 23, 2024

* fix(l4): fix fp8 logic on l4

* also quant weights with single scale

* use marlin even on 89

5fca30ee

Fixing mistral nemo. (#2276) · abc32537
Nicolas Patry authored Jul 23, 2024

abc32537

22 Jul, 2024 6 commits

use proper name for ci (#2274) · 47004651
Adrien authored Jul 22, 2024

47004651

Softcapping for gemma2. (#2273) · 6aeb6690

Nicolas Patry authored Jul 22, 2024

* Softcapping for gemma2.

* Less clutter.

* No access to transformers config, only config_dict here.

* 0.0 is the null value in the C++ API.

6aeb6690

fix(server): fix fp8 weight loading (#2268) · 4844ff79

OlivierDehaene authored Jul 22, 2024

* fix(server): fix fp8 weight loading

* fixed scales loading

* update snap

* revert default dtype

4844ff79

fix(ci): test new instances (#2272) · 6aebf44f

Adrien authored Jul 22, 2024



* test new instances
Signed-off-by: Adrien <adrien@huggingface.co>

* improve build ci
Signed-off-by: Adrien <adrien@huggingface.co>

---------
Signed-off-by: Adrien <adrien@huggingface.co>

6aebf44f

legacy warning on text_generation client (#2271) · 07441f5a
Erik Kaunismäki authored Jul 22, 2024
```
Update README.md

point to huggingface_hub inference clients instead
```
07441f5a

Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269) · 4e420722

icyboy™ authored Jul 22, 2024

* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug

* Hotfix: fix of use of unquantized weights in Mixtral GQA loading

4e420722

21 Jul, 2024 1 commit
- fix(server): fix deepseekv2 loading (#2266) · f3435bab
  OlivierDehaene authored Jul 21, 2024
  
  f3435bab
20 Jul, 2024 3 commits

feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248) · 53ec0b79

OlivierDehaene authored Jul 20, 2024

* feat(fp8): add support for fbgemm

* allow loading fp8 weights directly

* update outlines

* fix makefile

* build fbgemm

* avoid circular import and fix dockerfile

* add default dtype

* refactored weights loader

* fix auto conversion

* fix quantization config parsing

* force new nccl on install

* missing get_weights implementation

* increase timeout

53ec0b79

Add FP8 release test (#2261) · e5c1d6d6
Daniël de Kok authored Jul 20, 2024

e5c1d6d6

re-push to internal registry (#2242) · 11123a8e

Adrien authored Jul 20, 2024



* re-push to internal registry
Signed-off-by: Adrien <adrien@huggingface.co>

* fix name
Signed-off-by: Adrien <adrien@huggingface.co>

* debug
Signed-off-by: Adrien <adrien@huggingface.co>

* debug
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* wip debug
Signed-off-by: Adrien <adrien@huggingface.co>

* add debug
Signed-off-by: Adrien <adrien@huggingface.co>

* should
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* ww
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* ww
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* debug
Signed-off-by: Adrien <adrien@huggingface.co>

* w
Signed-off-by: Adrien <adrien@huggingface.co>

* revert tests
Signed-off-by: Adrien <adrien@huggingface.co>

* last reverts
Signed-off-by: Adrien <adrien@huggingface.co>

* another one
Signed-off-by: Adrien <adrien@huggingface.co>

---------
Signed-off-by: Adrien <adrien@huggingface.co>

11123a8e

19 Jul, 2024 9 commits

Add support for Deepseek V2 (#2224) · e52be9bb

Daniël de Kok authored Jul 19, 2024

Deepseek V2 is a MoE model from Deepseek. Relevant variations
compared to other models:

- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
  configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary
  embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
  So, we need weight loads that supports quantized weights. To this
  end `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value,
  so we need to pad during attention.
- Heads with size 192, needs an extension to our paged attention
  fork and we need to ensure that the KV cache is allocated with the
  correct size.
- Shared experts.

e52be9bb

fix: adjust default tool choice (#2244) · 68a9685f

drbh authored Jul 19, 2024

* fix: adjust default tool choice

* feat: improve tool choice syntax and response parsing/errors

* fix: remove dev tests

* feat: add ToolChoice to docs

68a9685f

add usage stats to toctree (#2260) · 40f5dc3e
Erik Kaunismäki authored Jul 19, 2024
```
quick fix
```
40f5dc3e

usage stats and crash reports (#2220) · 4c19593a

Erik Kaunismäki authored Jul 19, 2024



* draft of usage stats

* fix wrong link

* launcher doesn't need sysinfo dep

* only tokenizer class instead of hole struct

* unused import

* fix clippy errors

* update openAPI doc

* cargo fmt

* fix error in passing flags to router

* try again to update docs

* run pre-commit locally

* Update router/src/main.rs
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* Update router/src/main.rs
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* on crash use anonymous error event

* delete json_output and ngrok

* more robust way of checking if is in container

* more robust nvidia smi

* parse xpu more robustly

* fix errors

* add nvidia-smi details in docs

* cargo fmt

* fix clippy

* should make docs check pass

* Update router/src/usage_stats.rs
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* error reason can't be in nested json

* cargo fmt

---------
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>

4c19593a

Hotfix: pass through model revision in `VlmCausalLM` (#2258) · 3f37a667
Daniël de Kok authored Jul 19, 2024

3f37a667
Hotfix: fix MPT after recent refactor (#2257) · 3b41e93a
Daniël de Kok authored Jul 19, 2024

3b41e93a
Hotfix: various GPT-based model fixes (#2256) · 18db78f2
Daniël de Kok authored Jul 19, 2024

18db78f2
Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255) · 80adb5be
Daniël de Kok authored Jul 19, 2024

80adb5be

Improve the handling of quantized weights (#2250) · ba291dad

Daniël de Kok authored Jul 19, 2024

* Improve the handling of quantized weights

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader
  infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes) we
  instead relied on conditional in `get_linear`.

Weight loaders support context managers to selectively load
particular layers with different weight loaders, which is useful
for models like Idefics2 AWQ, which uses a quantized text model,
but unquantized vision and connector models. However, the context
manager would be overrided by `get_linear`, which string-checks
`quantizer`. Also, the context manager would not work with
EETQ, FP8, and bitsandbytes.

This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers,
  `get_linear` does not need to know how to handle quantizer linear
  layers.
- All quantizer weights are strongly typed, we don't pass around
  raw tensors.
- We don't have to pass around the `quantizer` string everywhere.

* Exclude non-MLP layers when using FP8 quantization with Llama

ba291dad

18 Jul, 2024 1 commit
- fix(server): fix cohere (#2249) · 1d1b1efa
  OlivierDehaene authored Jul 18, 2024
  
  1d1b1efa
16 Jul, 2024 3 commits
- Remove stray `quantize` argument in `get_weights_col_packed_qkv` (#2237) · da82c63a
  Daniël de Kok authored Jul 16, 2024
```
Fixes #2236.
```
  da82c63a
- `server quantize`: expose groupsize option (#2225) · 2cb18428
  Daniël de Kok authored Jul 16, 2024
  
  2cb18428
- Add support for AWQ-quantized Idefics2 (#2233) · 06d0e880
  Daniël de Kok authored Jul 16, 2024
```
Fixes #2036.
```
  06d0e880
15 Jul, 2024 3 commits

fix: Remove bitsandbytes installation when running cpu-only install (#2216) · 0ad7f6f8
Hugo Larcher authored Jul 15, 2024
```
Remove bitsandbytes installation when running cpu-only install
```
0ad7f6f8

fix custom cache dir (#2226) · 457fb0a1

Erik Kaunismäki authored Jul 15, 2024

* fix to not ignore HUGGINGFACE_HUB_CACHE in cache

* delete printlns

* delete newlines

* maybe fix trailing whitespace

457fb0a1

feat: simple mistral lora integration tests (#2180) · 5a650669

drbh authored Jul 15, 2024

* feat: simple mistral lora integration tests

* fix: include args in docker launcher

* fix: disable cuda graphs with lora and warn

* fix: adjust docs and precommit issues

* fix: re update docs

5a650669

12 Jul, 2024 2 commits

Use symmetric quantization in the `quantize` subcommand (#2120) · dbb23fbf

Daniël de Kok authored Jul 12, 2024

Packing of asymmetric quantization is broken, all (q)zeros values
of `0` get reset to `1`, resulting in a loss of accuracy. So instead
use symmetric quantization. To be able to distinguish models with
symmetric and asymmetric quantization, a new config tensor `gptq_sym` is
added. If this tensor is not present, we assume `sym=False`.

dbb23fbf

[fix] Modifying base in yarn embedding (#2212) · c46eaf70
SeongBeomLEE authored Jul 12, 2024

c46eaf70

11 Jul, 2024 2 commits

fix: append DONE message to chat stream (#2221) · d789de32
drbh authored Jul 11, 2024
```
* fix: append DONE message to chat stream

* fix: update completions endpoint
```
d789de32

Add support for FP8 on compute capability >=8.0, <8.9 (#2213) · cb150eb2

Daniël de Kok authored Jul 11, 2024



Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs
with compute capability >=8.0 and <8.9.
Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>

cb150eb2