Commits · 68a9685f1b6beabe8a9c5893d54decc9d65aa299 · OpenDAS / text-generation-inference

19 Jul, 2024 8 commits

fix: adjust default tool choice (#2244) · 68a9685f

drbh authored Jul 19, 2024

* fix: adjust default tool choice

* feat: improve tool choice syntax and response parsing/errors

* fix: remove dev tests

* feat: add ToolChoice to docs

68a9685f

add usage stats to toctree (#2260) · 40f5dc3e
Erik Kaunismäki authored Jul 19, 2024
```
quick fix
```
40f5dc3e

usage stats and crash reports (#2220) · 4c19593a

Erik Kaunismäki authored Jul 19, 2024



* draft of usage stats

* fix wrong link

* launcher doesn't need sysinfo dep

* only tokenizer class instead of hole struct

* unused import

* fix clippy errors

* update openAPI doc

* cargo fmt

* fix error in passing flags to router

* try again to update docs

* run pre-commit locally

* Update router/src/main.rs
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* Update router/src/main.rs
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* on crash use anonymous error event

* delete json_output and ngrok

* more robust way of checking if is in container

* more robust nvidia smi

* parse xpu more robustly

* fix errors

* add nvidia-smi details in docs

* cargo fmt

* fix clippy

* should make docs check pass

* Update router/src/usage_stats.rs
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* error reason can't be in nested json

* cargo fmt

---------
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>

4c19593a

Hotfix: pass through model revision in `VlmCausalLM` (#2258) · 3f37a667
Daniël de Kok authored Jul 19, 2024

3f37a667
Hotfix: fix MPT after recent refactor (#2257) · 3b41e93a
Daniël de Kok authored Jul 19, 2024

3b41e93a
Hotfix: various GPT-based model fixes (#2256) · 18db78f2
Daniël de Kok authored Jul 19, 2024

18db78f2
Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255) · 80adb5be
Daniël de Kok authored Jul 19, 2024

80adb5be

Improve the handling of quantized weights (#2250) · ba291dad

Daniël de Kok authored Jul 19, 2024

* Improve the handling of quantized weights

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader
  infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes) we
  instead relied on conditional in `get_linear`.

Weight loaders support context managers to selectively load
particular layers with different weight loaders, which is useful
for models like Idefics2 AWQ, which uses a quantized text model,
but unquantized vision and connector models. However, the context
manager would be overrided by `get_linear`, which string-checks
`quantizer`. Also, the context manager would not work with
EETQ, FP8, and bitsandbytes.

This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers,
  `get_linear` does not need to know how to handle quantizer linear
  layers.
- All quantizer weights are strongly typed, we don't pass around
  raw tensors.
- We don't have to pass around the `quantizer` string everywhere.

* Exclude non-MLP layers when using FP8 quantization with Llama

ba291dad

18 Jul, 2024 1 commit
- fix(server): fix cohere (#2249) · 1d1b1efa
  OlivierDehaene authored Jul 18, 2024
  
  1d1b1efa
16 Jul, 2024 3 commits
- Remove stray `quantize` argument in `get_weights_col_packed_qkv` (#2237) · da82c63a
  Daniël de Kok authored Jul 16, 2024
```
Fixes #2236.
```
  da82c63a
- `server quantize`: expose groupsize option (#2225) · 2cb18428
  Daniël de Kok authored Jul 16, 2024
  
  2cb18428
- Add support for AWQ-quantized Idefics2 (#2233) · 06d0e880
  Daniël de Kok authored Jul 16, 2024
```
Fixes #2036.
```
  06d0e880
15 Jul, 2024 3 commits

fix: Remove bitsandbytes installation when running cpu-only install (#2216) · 0ad7f6f8
Hugo Larcher authored Jul 15, 2024
```
Remove bitsandbytes installation when running cpu-only install
```
0ad7f6f8

fix custom cache dir (#2226) · 457fb0a1

Erik Kaunismäki authored Jul 15, 2024

* fix to not ignore HUGGINGFACE_HUB_CACHE in cache

* delete printlns

* delete newlines

* maybe fix trailing whitespace

457fb0a1

feat: simple mistral lora integration tests (#2180) · 5a650669

drbh authored Jul 15, 2024

* feat: simple mistral lora integration tests

* fix: include args in docker launcher

* fix: disable cuda graphs with lora and warn

* fix: adjust docs and precommit issues

* fix: re update docs

5a650669

12 Jul, 2024 2 commits

Use symmetric quantization in the `quantize` subcommand (#2120) · dbb23fbf

Daniël de Kok authored Jul 12, 2024

Packing of asymmetric quantization is broken, all (q)zeros values
of `0` get reset to `1`, resulting in a loss of accuracy. So instead
use symmetric quantization. To be able to distinguish models with
symmetric and asymmetric quantization, a new config tensor `gptq_sym` is
added. If this tensor is not present, we assume `sym=False`.

dbb23fbf

[fix] Modifying base in yarn embedding (#2212) · c46eaf70
SeongBeomLEE authored Jul 12, 2024

c46eaf70

11 Jul, 2024 2 commits

fix: append DONE message to chat stream (#2221) · d789de32
drbh authored Jul 11, 2024
```
* fix: append DONE message to chat stream

* fix: update completions endpoint
```
d789de32

Add support for FP8 on compute capability >=8.0, <8.9 (#2213) · cb150eb2

Daniël de Kok authored Jul 11, 2024



Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs
with compute capability >=8.0 and <8.9.
Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>

cb150eb2

09 Jul, 2024 4 commits

Move quantized weight handling out of the `Weights` class (#2194) · 8511669c

Daniël de Kok authored Jul 09, 2024

Quantized weights were loaded in the `Weights` class, but this was
getting quite unwieldy, where every higher level method to load weights
was a long conditional to cover all the different quantizers.

This change moves loading of quantized weights out of the `Weights`
class. This is done by defining a simple `WeightsLoader` interface
that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
and `MarlinWeightsLoader`. These implementations are in the quantizers'
respective modules. The `Weights` class provides the low-level load
operations (such as loading tensors or sharded tensors), but delegates
loads that need quantizer-specific weight processing to a loader. The
loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights`
would inherit from `Weights`. But it is not very flexible (e.g. does
not work well with the new weight storage mock used in tests) and
the implicit indirections made the code harder to follow.

8511669c

Updating the self check (#2209) · 4c976fb4

Nicolas Patry authored Jul 09, 2024

* Updating the self check

* Fix.

* Revert the CLI .

* cli.

* Space.

* Revert cargo update.

4c976fb4

Fixed README ToC (#2196) · f5ba9bfd
vinkamath authored Jul 09, 2024
```
Co-authored-by: Vinayak Kamath <Vinayak.Kamath@target.com>
```
f5ba9bfd
Adding sanity check to openapi docs. · fe710af2
Nicolas Patry authored Jul 09, 2024

fe710af2

08 Jul, 2024 10 commits
- Fix buildx cache + change runner type (#2176) · 5e2a3058
  Guillaume LEGENDRE authored Jul 08, 2024
```
* Update build.yaml

* Update build.yaml

* change to S3 cache

* change to CPU Runners

* remove comments
```
  5e2a3058
- Fix nccl regression on PyTorch 2.3 upgrade (#2099) · 4c50b6d0
  fxmarty authored Jul 08, 2024
```
* fix nccl issue

* add note in dockerfile

* use v2.22.3 that also fixes @samsamoa's repro

* poetry actually can't handle the conflict between torch and nccl

* set LD_PRELOAD
```
  4c50b6d0
- feat: use model name as adapter id in chat endpoints (#2128) · 87ebb647
  drbh authored Jul 08, 2024
  
  87ebb647
- update to metrics 0.23.0 or could work with metrics-exporter-promethe… (#2190) · 58effe78
  Wang, Yi authored Jul 08, 2024
```
update to metrics 0.23.0 or could work with metrics-exporter-prometheus 0.15.1
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  58effe78
- fix: python deserialization (#2178) · 16d9e505
  Javier Martinez authored Jul 08, 2024
  
  16d9e505
- add doc for intel gpus (#2181) · 07e240ca
  Wang, Yi authored Jul 08, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  07e240ca
- Falcon/DBRX: get correct number of key-value heads (#2205) · 5c7c9f13
  Daniël de Kok authored Jul 08, 2024
  
  5c7c9f13
- Fix incorrect cache allocation with multi-query (#2203) · 153fcf77
  Daniël de Kok authored Jul 08, 2024
```
We wouldn't allocate any memory in multi-query (1 KV head). Fixes
Starcoder et al.
```
  153fcf77
- hotfix: Fix number of KV heads (#2202) · cce475a9
  Daniël de Kok authored Jul 08, 2024
```
Fix number of KV heads
```
  cce475a9
- fix dbrx & opt model prefix bug (#2201) · 521d0d99
  icyboy™ authored Jul 08, 2024
```
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug
```
  521d0d99
05 Jul, 2024 6 commits

Consistently take `prefix` in model constructors (#2191) · 05c094fc
Daniël de Kok authored Jul 05, 2024
```
* Consistently take `prefix` in model constructors

* Release test check fix

* Misc refactor-related fixes
```
05c094fc

GPTQ CI improvements (#2151) · 67ef0649

Daniël de Kok authored Jul 05, 2024

* Add more representative Llama GPTQ test

The Llama GPTQ test is updated to use a model with the commonly-used
quantizer config format and activation sorting. The old test is
kept around (but renamed) since it tests the format produced by
`text-generation-server quantize`.

* Add support for manually triggering a release build

67ef0649

Fix Starcoder2 after refactor (#2189) · b67d4633
Daniël de Kok authored Jul 05, 2024

b67d4633
Hotfixing after refactor. · 853d4eb9
Nicolas Patry authored Jul 05, 2024

853d4eb9

Refactor dead code - Removing all `flash_xxx.py` files. (#2166) · fb2f74e2

Nicolas Patry authored Jul 05, 2024

* Refactor dead code.

* First working step.

* Remove a lot of duplicated code.

* More dead code.

* More cleanup.

* Fix Santacoder test.

* Fixing the simple tests.

* Fixing sharding.

* Fixes for VLM.

* Fixing santacoder (num_kv_heads hardcoded).

* Removing more dead code.

* Fixing `config.n_head`.

* Stopping earlier because of `<end_of_utterance>` in idefics2.

* Addresses comments.

* Removing the dead code.

* Fuse back mistral into FlashCausalLM.

* Finish removal.

* Fixing docs + causal_lm `batch_class`.

* Fixing docs + causal.lm.

* Add default to Gemma Causality.

* Default value for gemma/gemma2.

* Wrong default.

fb2f74e2

Adding "longrope" for Phi-3 (#2172) (#2179) · c6bcadf8
Aaron Mihalik authored Jul 05, 2024
```
Adding "longrope" for phi-3
```
c6bcadf8

04 Jul, 2024 1 commit
- Preparing patch release. (#2186) · 245d3de9
  Nicolas Patry authored Jul 04, 2024
  
  245d3de9