  1. 15 Jul, 2024 1 commit
    • feat: simple mistral lora integration tests (#2180) · 5a650669
      drbh authored
      * feat: simple mistral lora integration tests
      
      * fix: include args in docker launcher
      
      * fix: disable cuda graphs with lora and warn
      
      * fix: adjust docs and precommit issues
      
      * fix: re update docs
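The "disable cuda graphs with lora and warn" fix above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name, argument names, and warning text are all hypothetical.

```python
import warnings

def resolve_cuda_graphs(cuda_graphs, lora_adapters):
    """Return the CUDA-graph batch sizes to capture (hypothetical sketch).

    CUDA graphs replay a fixed, pre-recorded kernel sequence, which does
    not mix well with dynamically selected LoRA adapter weights, so the
    launcher disables graph capture and warns instead of failing later.
    """
    if lora_adapters and cuda_graphs:
        warnings.warn("LoRA adapters are enabled: disabling CUDA graphs.")
        return []  # an empty list means: capture no graphs
    return cuda_graphs
```

Callers can then pass the resolved list onward unchanged, so the rest of the startup path needs no LoRA-specific branching.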
  2. 12 Jul, 2024 2 commits
  3. 11 Jul, 2024 2 commits
  4. 09 Jul, 2024 4 commits
    • Move quantized weight handling out of the `Weights` class (#2194) · 8511669c
      Daniël de Kok authored
      Quantized weights were loaded in the `Weights` class, but this was
      getting quite unwieldy: every higher-level weight-loading method had
      become a long conditional covering all the different quantizers.
      
      This change moves loading of quantized weights out of the `Weights`
      class. This is done by defining a simple `WeightsLoader` interface
      that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
      and `MarlinWeightsLoader`. These implementations are in the quantizers'
      respective modules. The `Weights` class provides the low-level load
      operations (such as loading tensors or sharded tensors), but delegates
      loads that need quantizer-specific weight processing to a loader. The
      loaders still use the low-level functionality provided by `Weights`.
      
      I initially tried making a hierarchy where a class like `GPTQWeights`
      would inherit from `Weights`. But it is not very flexible (e.g. does
      not work well with the new weight storage mock used in tests) and
      the implicit indirections made the code harder to follow.
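The `WeightsLoader` split described in this commit can be sketched as a small interface plus a default implementation that delegates to the low-level loads. Method names and signatures here are illustrative assumptions, not the project's actual API:

```python
from abc import ABC, abstractmethod

class WeightsLoader(ABC):
    """Quantizer-specific weight loading, kept outside `Weights` (sketch)."""

    @abstractmethod
    def get_multi_weights_row(self, weights, prefix):
        """Load a row-parallel weight, applying any quantizer processing."""

class DefaultWeightsLoader(WeightsLoader):
    """Unquantized path: just delegate to the low-level tensor load."""

    def get_multi_weights_row(self, weights, prefix):
        # `weights` provides only low-level operations such as
        # get_tensor / get_sharded; no quantizer conditionals here.
        return weights.get_tensor(f"{prefix}.weight")
```

Quantizer modules (Exl2, GPTQ, Marlin) would each provide their own `WeightsLoader` implementation, so `Weights` itself stays free of per-quantizer branches.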
    • Updating the self check (#2209) · 4c976fb4
      Nicolas Patry authored
      * Updating the self check
      
      * Fix.
      
      * Revert the CLI.
      
      * CLI.
      
      * Space.
      
      * Revert cargo update.
    • Fixed README ToC (#2196) · f5ba9bfd
      vinkamath authored
      
      Co-authored-by: Vinayak Kamath <Vinayak.Kamath@target.com>
    • Adding sanity check to openapi docs. · fe710af2
      Nicolas Patry authored
  5. 08 Jul, 2024 10 commits
  6. 05 Jul, 2024 6 commits
    • Consistently take `prefix` in model constructors (#2191) · 05c094fc
      Daniël de Kok authored
      * Consistently take `prefix` in model constructors
      
      * Release test check fix
      
      * Misc refactor-related fixes
    • GPTQ CI improvements (#2151) · 67ef0649
      Daniël de Kok authored
      * Add more representative Llama GPTQ test
      
      The Llama GPTQ test is updated to use a model with the commonly-used
      quantizer config format and activation sorting. The old test is
      kept around (but renamed) since it tests the format produced by
      `text-generation-server quantize`.
      
      * Add support for manually triggering a release build
    • Fix Starcoder2 after refactor (#2189) · b67d4633
      Daniël de Kok authored
    • Hotfixing after refactor. · 853d4eb9
      Nicolas Patry authored
    • Refactor dead code - Removing all `flash_xxx.py` files. (#2166) · fb2f74e2
      Nicolas Patry authored
      * Refactor dead code.
      
      * First working step.
      
      * Remove a lot of duplicated code.
      
      * More dead code.
      
      * More cleanup.
      
      * Fix Santacoder test.
      
      * Fixing the simple tests.
      
      * Fixing sharding.
      
      * Fixes for VLM.
      
      * Fixing santacoder (num_kv_heads hardcoded).
      
      * Removing more dead code.
      
      * Fixing `config.n_head`.
      
      * Stopping earlier because of `<end_of_utterance>` in idefics2.
      
      * Addresses comments.
      
      * Removing the dead code.
      
      * Fuse back mistral into FlashCausalLM.
      
      * Finish removal.
      
      * Fixing docs + causal_lm `batch_class`.
      
      * Fixing docs + causal_lm.
      
      * Add default to Gemma Causality.
      
      * Default value for gemma/gemma2.
      
      * Wrong default.
    • Adding "longrope" for Phi-3 (#2172) (#2179) · c6bcadf8
      Aaron Mihalik authored
  7. 04 Jul, 2024 1 commit
  8. 03 Jul, 2024 5 commits
  9. 02 Jul, 2024 6 commits
  10. 01 Jul, 2024 3 commits
    • [Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940) · 4327210e
      Nicolas Patry authored
      * Using flash decoding
      
      Conditional flashdecoding.
      
      Fix max_q.
      
      Working kvcache
      
      Working version with flash decoding.
      
      Make it work for mistral.
      
      Fix after rebase.
      
      Less intrusive.
      
      Revert changes in modeling.
      
      Speedup flashdecoding.
      
      Hack to make other models work.
      
      Fixing non flash decoding llama path.
      
      Router logic knows about page size.
      
      Missing 2 models.
      
      Missing cohere.
      
      Fixing cohere flash decoding.
      
      Revamped all this architecture.
      
      Fix cohere.
      
      Fixing falcon.
      
      Enabling custom block size schedule.
      
      Update router/src/infer.rs
      
      Not sending preallocated output.
      
      * Making it work on non flash decoding.
      
      * Fix Cohere.
      
      * Fix non decoding paths.
      
      * Rebased.
      
      * No need for cache_manager anymore.
      
      * Update?
      
      * "ipex" -> "cpu"
      
      * These do not belong.
      
      * Factoring cu_seqlen_qk for better abstracting over every model.
      
      * Fixing non flash tests/imports.
      
      * Changing return everywhere.
      
      * Update mistral past.
      
      * Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).
      
      * Fixup mistral clamping (had issues with cuda graphs).
      
      * No need to recreate anything actually.
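The "Router logic knows about page size" item above can be illustrated with a small sketch. The function name and the page size value are assumptions for illustration; the point is only that moving from PagedAttention's small blocks to FlashDecoding-style pages means the router must round each request's token budget up to whole pages:

```python
def pages_needed(prompt_tokens, max_new_tokens, page_size=256):
    """KV-cache pages to reserve for one request (hypothetical sketch).

    With a larger page size, the router rounds the total token budget
    up to whole pages rather than small fixed-size blocks.
    """
    total_tokens = prompt_tokens + max_new_tokens
    return -(-total_tokens // page_size)  # ceiling division
```

A scheduler using this would multiply the result by the per-page KV footprint to decide whether a request fits in the remaining cache.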
    • Fixing baichuan override. (#2158) · 4f55f158
      Nicolas Patry authored
    • GH router. (#2153) · d0225b10
      Nicolas Patry authored