Commits · 583d37a2f8aee624aa1b77dfc359e32469205c08 · OpenDAS / text-generation-inference

29 Jul, 2024 2 commits

Erik Kaunismäki authored Jul 29, 2024



* Add API_Key for Auth and conditionally add authorisation for non info/health endpoints.

* change name to info routes

* Fix comment

* convert strings to lowercase for case insensitive comparison

* convert header to string

* fixes and update docs

* update docs again

* revert wrong update

---------
Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>

583d37a2
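
The bullets above describe an API key check that is skipped for info and health routes and that compares header strings case-insensitively. A minimal sketch of that idea follows. The route set, the function name `is_authorized`, and the `Bearer` scheme are assumptions made for illustration, not taken from the commit, and only the header name and scheme are compared case-insensitively here.

```python
import hmac
from typing import Mapping, Optional

# Hypothetical set of routes that stay reachable without a key.
INFO_ROUTES = {"/", "/info", "/health"}


def is_authorized(path: str, headers: Mapping[str, str], api_key: Optional[str]) -> bool:
    """Return True if the request may proceed."""
    # No key configured, or an info/health route: nothing to check.
    if api_key is None or path in INFO_ROUTES:
        return True
    # Header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    scheme, _, token = normalized.get("authorization", "").partition(" ")
    # Compare the scheme case-insensitively and the token in constant time.
    return scheme.lower() == "bearer" and hmac.compare_digest(
        token.strip().encode(), api_key.encode()
    )
```

With a key configured, `is_authorized("/health", {}, "secret")` passes, while `is_authorized("/generate", {}, "secret")` is rejected until a matching `Authorization` header is supplied.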

fix: fix buildkit config in ci · fd2e0631
Adrien authored Jul 29, 2024
```
Signed-off-by: Adrien <adrien@huggingface.co>
```
fd2e0631

26 Jul, 2024 2 commits

feat: add ruff and resolve issue (#2262) · bab02ff2

drbh authored Jul 26, 2024

* feat: add ruff and resolve issue

* fix: update client exports and adjust after rebase

* fix: adjust syntax to avoid circular import

* fix: adjust client ruff settings

* fix: lint and refactor import check and avoid model enum as global names

* fix: improve fbgemm_gpu check and lints

* fix: update lints

* fix: prefer comparing model enum over str

* fix: adjust lints and ignore specific rules

* fix: avoid unneeded quantize check

bab02ff2

Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313) · 4b49c50f
Daniël de Kok authored Jul 26, 2024

4b49c50f

25 Jul, 2024 4 commits
- Fix registry name (#2307) · 3905f854
  Adrien authored Jul 25, 2024
  
  3905f854
- Fixing idefics on g6 tests. (#2306) · 17ed42be
  Nicolas Patry authored Jul 25, 2024
  
  17ed42be
- Some small fixes for the Torch 2.4.0 update (#2304) · 9256d7c3
  Daniël de Kok authored Jul 25, 2024
```
* Fix GPTQ autotune data type to be compatible with Torch 2.4.0

* Update poetry lock file

* Fix small PaliGemma logprob differences after the torch update
```
  9256d7c3
- Using g6 instead of g5. (#2281) · 26614057
  Nicolas Patry authored Jul 25, 2024
```
* Using g6 instead of g5.

* Update the idefics2 snapshot.
```
  26614057

24 Jul, 2024 4 commits

fix: refactor adapter weight loading and mapping (#2193) · 5d85a958

drbh authored Jul 24, 2024

* fix: refactor adapter weight loading and mapping

* feat: enable lora load from directory

* fix: adjust launcher for local lora adapters

* feat: improve weight loading and add tests

* fix: improve logging and rebase syntax issue

* fix: improve adapter merge comments and remove unused conditional

* fix: improve get_model_with_lora_adapters naming

* fix: comment typo

5d85a958

Split up `layers.marlin` into several files (#2292) · 93d2b9fe
Daniël de Kok authored Jul 24, 2024
```
The marlin.py file was getting large, split it up.
```
93d2b9fe

fix of use of unquantized weights in cohere GQA loading, also enable … (#2291) · 86422506

Wang, Yi authored Jul 24, 2024



fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

86422506

fix crash in multi-modal (#2245) · 5ad39dd3

Wang, Yi authored Jul 24, 2024



* fix crash in multi-modal
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update according to review comment
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix llava_next regression in latest main
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

5ad39dd3

23 Jul, 2024 9 commits

hotfix: update nccl · a8950294
OlivierDehaene authored Jul 23, 2024

a8950294

chore: update to torch 2.4 (#2259) · e7e3aa6c
OlivierDehaene authored Jul 23, 2024
```
* chore: update to torch 2.4

* remove un-necessary patch

* fix
```
e7e3aa6c

hotfix: pin numpy (#2289) · bc9593a5
Daniël de Kok authored Jul 23, 2024

bc9593a5

Add support for Llama 3 rotary embeddings (#2286) · 4ab41737
Daniël de Kok authored Jul 23, 2024
```
* Add support for Llama 3 rotary embeddings

* Update transformers to 4.43
```
4ab41737

Preparing for release. (#2285) · 5d121a97

Nicolas Patry authored Jul 23, 2024

* Preparing for release.

* Updating docs.

* Fixing token within the docker image for the launcher.

5d121a97

[WIP] Add support for Mistral-Nemo by supporting head_dim through config (#2254) · 3961e323

shaltielshmid authored Jul 23, 2024



* Support passing head_dim through config

* Using `head_dim` as a fallback is necessary since it's a non-standard
key in `MistralConfig` (as defined in transformers).

* Shorter diff.

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

3961e323
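
The commit above reads `head_dim` from the model config when it is present. A minimal sketch of such a lookup is shown below. The helper name `resolve_head_dim` is hypothetical, and the config values are chosen only for illustration.

```python
from types import SimpleNamespace


def resolve_head_dim(config) -> int:
    """Prefer an explicit `head_dim`; otherwise derive it from the hidden size."""
    head_dim = getattr(config, "head_dim", None)
    if head_dim is not None:
        return head_dim
    # Conventional default when the key is absent from the config.
    return config.hidden_size // config.num_attention_heads


# Illustrative configs: one with an explicit head_dim, one without.
explicit = SimpleNamespace(hidden_size=5120, num_attention_heads=32, head_dim=128)
derived = SimpleNamespace(hidden_size=5120, num_attention_heads=32)

print(resolve_head_dim(explicit))  # 128
print(resolve_head_dim(derived))   # 160
```

The two results differ, which is why a model whose head size is not `hidden_size // num_attention_heads` needs the explicit key to be honoured.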

Add support for repacking AWQ weights for GPTQ-Marlin (#2278) · 9935720c

Daniël de Kok authored Jul 23, 2024

* Add support for repacking AWQ weights for GPTQ-Marlin

So far we couldn't support AWQ because virtually all AWQ models use
asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin
has recently added support for AWQ repacking and AWQ asymmetric
quantization (zero_point=True).

This change updates all GPTQ-Marlin kernels from upstream and wires up
AWQ support. For now enabling AWQ using Marlin requires running TGI with
`--quantize gptq`.

* Enable Marlin for supported AWQ configurations by default

This makes the AWQ -> GPTQ repack test redundant, since we are now
testing this with the regular AWQ test.

9935720c
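
The commit message above distinguishes symmetric quantization from asymmetric quantization with a zero point. The sketch below is a generic pure-Python illustration of that difference on a handful of values. It is not the Marlin kernel code, and all function names are hypothetical.

```python
from typing import List


def quantize_symmetric(values: List[float], bits: int = 4) -> List[float]:
    """Quantize around zero with a single scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    q = [min(max(round(v / scale), qmin), qmax) for v in values]
    return [x * scale for x in q]


def quantize_asymmetric(values: List[float], bits: int = 4) -> List[float]:
    """Quantize with a scale and a zero point, then dequantize."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    if scale == 0.0:
        scale = 1.0  # constant input: any scale works
    # The zero point shifts the grid so that `lo` maps to 0.
    # It is left unclamped here for brevity.
    zero_point = round(-lo / scale)
    q = [min(max(round(v / scale) + zero_point, 0), qmax) for v in values]
    return [(x - zero_point) * scale for x in q]


def max_error(original: List[float], restored: List[float]) -> float:
    return max(abs(a - b) for a, b in zip(original, restored))


data = [0.0, 0.4, 0.8, 1.5]  # all non-negative, so half the symmetric grid is unused
sym_err = max_error(data, quantize_symmetric(data))
asym_err = max_error(data, quantize_asymmetric(data))
```

On this skewed input the asymmetric variant reconstructs the values more closely, because its 16 levels cover only the range the data actually occupies.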

fix(l4): fix fp8 logic on l4 (#2277) · 5fca30ee

OlivierDehaene authored Jul 23, 2024

* fix(l4): fix fp8 logic on l4

* also quant weights with single scale

* use marlin even on 89

5fca30ee

Fixing mistral nemo. (#2276) · abc32537
Nicolas Patry authored Jul 23, 2024

abc32537

22 Jul, 2024 6 commits

use proper name for ci (#2274) · 47004651
Adrien authored Jul 22, 2024

47004651

Softcapping for gemma2. (#2273) · 6aeb6690

Nicolas Patry authored Jul 22, 2024

* Softcapping for gemma2.

* Less clutter.

* No access to transformers config, only config_dict here.

* 0.0 is the null value in the C++ API.

6aeb6690
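
The bullets above mention softcapping and a `0.0` null value. A minimal sketch of logit softcapping follows, assuming the usual `cap * tanh(x / cap)` form and treating `0.0` (or `None`) as "disabled". The function name and the cap value of `50.0` are used only for illustration.

```python
import math
from typing import List, Optional


def softcap(scores: List[float], cap: Optional[float]) -> List[float]:
    """Squash scores smoothly into [-cap, cap]; a falsy cap disables it."""
    if not cap:  # None or 0.0 means no softcapping
        return list(scores)
    return [cap * math.tanh(s / cap) for s in scores]


capped = softcap([1.0, 1000.0, -1000.0], 50.0)
untouched = softcap([1.0, 1000.0, -1000.0], 0.0)
```

Small scores pass through almost unchanged, while large ones saturate near the cap instead of growing without bound.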

fix(server): fix fp8 weight loading (#2268) · 4844ff79

OlivierDehaene authored Jul 22, 2024

* fix(server): fix fp8 weight loading

* fixed scales loading

* update snap

* revert default dtype

4844ff79

fix(ci): test new instances (#2272) · 6aebf44f

Adrien authored Jul 22, 2024



* test new instances
Signed-off-by: Adrien <adrien@huggingface.co>

* improve build ci
Signed-off-by: Adrien <adrien@huggingface.co>

---------
Signed-off-by: Adrien <adrien@huggingface.co>

6aebf44f

legacy warning on text_generation client (#2271) · 07441f5a
Erik Kaunismäki authored Jul 22, 2024
```
Update README.md

point to huggingface_hub inference clients instead
```
07441f5a

Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269) · 4e420722

icyboy™ authored Jul 22, 2024

* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug

* Hotfix: fix of use of unquantized weights in Mixtral GQA loading

4e420722

21 Jul, 2024 1 commit
- fix(server): fix deepseekv2 loading (#2266) · f3435bab
  OlivierDehaene authored Jul 21, 2024
  
  f3435bab

20 Jul, 2024 3 commits

feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248) · 53ec0b79

OlivierDehaene authored Jul 20, 2024

* feat(fp8): add support for fbgemm

* allow loading fp8 weights directly

* update outlines

* fix makefile

* build fbgemm

* avoid circular import and fix dockerfile

* add default dtype

* refactored weights loader

* fix auto conversion

* fix quantization config parsing

* force new nccl on install

* missing get_weights implementation

* increase timeout

53ec0b79

Add FP8 release test (#2261) · e5c1d6d6
Daniël de Kok authored Jul 20, 2024

e5c1d6d6

re-push to internal registry (#2242) · 11123a8e

Adrien authored Jul 20, 2024



* re-push to internal registry
Signed-off-by: Adrien <adrien@huggingface.co>

* fix name
Signed-off-by: Adrien <adrien@huggingface.co>

* debug
Signed-off-by: Adrien <adrien@huggingface.co>

* debug
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* wip debug
Signed-off-by: Adrien <adrien@huggingface.co>

* add debug
Signed-off-by: Adrien <adrien@huggingface.co>

* should
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* ww
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* ww
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* wip
Signed-off-by: Adrien <adrien@huggingface.co>

* debug
Signed-off-by: Adrien <adrien@huggingface.co>

* w
Signed-off-by: Adrien <adrien@huggingface.co>

* revert tests
Signed-off-by: Adrien <adrien@huggingface.co>

* last reverts
Signed-off-by: Adrien <adrien@huggingface.co>

* another one
Signed-off-by: Adrien <adrien@huggingface.co>

---------
Signed-off-by: Adrien <adrien@huggingface.co>

11123a8e

19 Jul, 2024 9 commits

Add support for Deepseek V2 (#2224) · e52be9bb

Daniël de Kok authored Jul 19, 2024

Deepseek V2 is a MoE model from Deepseek. Relevant variations
compared to other models:

- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
  configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary
  embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`),
  so we need weight loading that supports quantized weights. To this
  end `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value,
  so we need to pad during attention.
- Heads of size 192 need an extension to our paged attention
  fork, and we need to ensure that the KV cache is allocated with the
  correct size.
- Shared experts.

e52be9bb
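
The first item in the list above is grouped top-K expert selection. A rough pure-Python sketch of the idea follows: experts are split into equal groups, the best groups are kept according to their strongest expert, and the final top-K is taken only from those groups. The function and parameter names are hypothetical, and a real implementation would operate on batched tensors.

```python
from typing import List


def grouped_top_k(scores: List[float], n_groups: int, top_groups: int, top_k: int) -> List[int]:
    """Return the indices of the top_k experts, restricted to the best groups."""
    group_size = len(scores) // n_groups
    # Score each group by its best expert.
    group_best = [
        max(scores[g * group_size:(g + 1) * group_size]) for g in range(n_groups)
    ]
    kept = sorted(range(n_groups), key=lambda g: group_best[g], reverse=True)[:top_groups]
    # Only experts in the kept groups remain candidates.
    candidates = [
        i for g in kept for i in range(g * group_size, (g + 1) * group_size)
    ]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_k]


scores = [0.9, 0.1, 0.2, 0.8, 0.7, 0.6, 0.05, 0.3]
print(grouped_top_k(scores, n_groups=4, top_groups=2, top_k=3))  # [0, 3, 2]
```

A plain top-3 over the same scores would pick experts 0, 3, and 4. The grouped variant picks expert 2 instead of 4 because expert 4's group was not among the two kept groups.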

fix: adjust default tool choice (#2244) · 68a9685f

drbh authored Jul 19, 2024

* fix: adjust default tool choice

* feat: improve tool choice syntax and response parsing/errors

* fix: remove dev tests

* feat: add ToolChoice to docs

68a9685f

add usage stats to toctree (#2260) · 40f5dc3e
Erik Kaunismäki authored Jul 19, 2024
```
quick fix
```
40f5dc3e

usage stats and crash reports (#2220) · 4c19593a

Erik Kaunismäki authored Jul 19, 2024



* draft of usage stats

* fix wrong link

* launcher doesn't need sysinfo dep

* only tokenizer class instead of whole struct

* unused import

* fix clippy errors

* update openAPI doc

* cargo fmt

* fix error in passing flags to router

* try again to update docs

* run pre-commit locally

* Update router/src/main.rs
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* Update router/src/main.rs
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* on crash use anonymous error event

* delete json_output and ngrok

* more robust way of checking if is in container

* more robust nvidia smi

* parse xpu more robustly

* fix errors

* add nvidia-smi details in docs

* cargo fmt

* fix clippy

* should make docs check pass

* Update router/src/usage_stats.rs
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* error reason can't be in nested json

* cargo fmt

---------
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>

4c19593a

Hotfix: pass through model revision in `VlmCausalLM` (#2258) · 3f37a667
Daniël de Kok authored Jul 19, 2024

3f37a667

Hotfix: fix MPT after recent refactor (#2257) · 3b41e93a
Daniël de Kok authored Jul 19, 2024

3b41e93a

Hotfix: various GPT-based model fixes (#2256) · 18db78f2
Daniël de Kok authored Jul 19, 2024

18db78f2

Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255) · 80adb5be
Daniël de Kok authored Jul 19, 2024

80adb5be

Improve the handling of quantized weights (#2250) · ba291dad

Daniël de Kok authored Jul 19, 2024

* Improve the handling of quantized weights

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader
  infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes) we
  instead relied on a conditional in `get_linear`.

Weight loaders support context managers to selectively load
particular layers with different weight loaders, which is useful
for models like Idefics2 AWQ, which uses a quantized text model,
but unquantized vision and connector models. However, the context
manager would be overridden by `get_linear`, which string-checks
`quantizer`. Also, the context manager would not work with
EETQ, FP8, and bitsandbytes.

This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers;
  `get_linear` does not need to know how to handle quantizer linear
  layers.
- All quantizer weights are strongly typed; we don't pass around
  raw tensors.
- We don't have to pass around the `quantizer` string everywhere.

* Exclude non-MLP layers when using FP8 quantization with Llama

ba291dad
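
The commit message above describes weight loaders that can be swapped for particular layers through a context manager. The sketch below illustrates that pattern in isolation. Every class and method name here is hypothetical and chosen for the example; it is not the project's actual interface.

```python
from contextlib import contextmanager


class DefaultLoader:
    """Stand-in loader that returns unquantized weights."""

    def get_weights(self, name: str) -> str:
        return f"unquantized:{name}"


class QuantizedLoader:
    """Stand-in loader that returns quantized weights."""

    def get_weights(self, name: str) -> str:
        return f"quantized:{name}"


class Weights:
    def __init__(self, loader):
        self.loader = loader

    def get_weights(self, name: str) -> str:
        # All loading goes through the currently active loader.
        return self.loader.get_weights(name)

    @contextmanager
    def use_loader(self, loader):
        """Temporarily switch loaders, restoring the previous one on exit."""
        previous = self.loader
        self.loader = loader
        try:
            yield
        finally:
            self.loader = previous


weights = Weights(QuantizedLoader())
with weights.use_loader(DefaultLoader()):
    vision = weights.get_weights("vision_model")  # loaded unquantized
text = weights.get_weights("text_model")          # back to the quantized loader
```

Because the loader is restored in a `finally` clause, an exception raised while loading the vision layers would not leave the unquantized loader active for the rest of the model.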