Commits · 53ec0b790bb086be3772cb7da14c5f0f006105c4 · OpenDAS / text-generation-inference

20 Jul, 2024 1 commit

feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248) · 53ec0b79

OlivierDehaene authored Jul 20, 2024

* feat(fp8): add support for fbgemm

* allow loading fp8 weights directly

* update outlines

* fix makefile

* build fbgemm

* avoid circular import and fix dockerfile

* add default dtype

* refactored weights loader

* fix auto conversion

* fix quantization config parsing

* force new nccl on install

* missing get_weights implementation

* increase timeout

53ec0b79

19 Jul, 2024 6 commits

Add support for Deepseek V2 (#2224) · e52be9bb

Daniël de Kok authored Jul 19, 2024

Deepseek V2 is a MoE model from Deepseek. Relevant variations
compared to other models:

- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
  configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary
  embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
  So, we need weight loading that supports quantized weights. To this
  end, `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value,
  so we need to pad during attention.
- Heads of size 192 need an extension to our paged attention
  fork, and we need to ensure that the KV cache is allocated with the
  correct size.
- Shared experts.
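
To illustrate the grouped top-K point above, here is a minimal sketch in plain Python. It assumes that experts are split into contiguous, equally sized groups and that a group is ranked by its best expert score; the function name and parameters are hypothetical and are not taken from the actual implementation.

```python
from typing import List, Tuple


def grouped_top_k(
    scores: List[float], n_groups: int, top_k_groups: int, top_k: int
) -> List[Tuple[int, float]]:
    """Select `top_k` experts, restricted to the `top_k_groups` best groups.

    Hypothetical helper for illustration only. Groups are contiguous,
    equally sized slices of `scores` (an assumption).
    """
    if len(scores) % n_groups != 0:
        raise ValueError("number of experts must be divisible by n_groups")
    group_size = len(scores) // n_groups

    # Rank groups by the maximum routing score of any expert in the group.
    group_best = [
        max(scores[g * group_size : (g + 1) * group_size]) for g in range(n_groups)
    ]
    kept_groups = sorted(
        range(n_groups), key=lambda g: group_best[g], reverse=True
    )[:top_k_groups]

    # Only experts that belong to a kept group remain candidates.
    candidates = [
        (e, scores[e])
        for g in kept_groups
        for e in range(g * group_size, (g + 1) * group_size)
    ]
    # Ordinary top-K over the remaining candidates.
    return sorted(candidates, key=lambda item: item[1], reverse=True)[:top_k]


# 8 experts in 4 groups of 2; keep the 2 best groups, then the 3 best experts.
scores = [0.05, 0.30, 0.25, 0.04, 0.01, 0.02, 0.10, 0.07]
selected = grouped_top_k(scores, n_groups=4, top_k_groups=2, top_k=3)
```

In this example expert 6 (score 0.10) would be chosen by a plain top-3, but it is excluded because its group is not among the two best groups.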

e52be9bb

Hotfix: pass through model revision in `VlmCausalLM` (#2258) · 3f37a667
Daniël de Kok authored Jul 19, 2024

3f37a667
Hotfix: fix MPT after recent refactor (#2257) · 3b41e93a
Daniël de Kok authored Jul 19, 2024

3b41e93a
Hotfix: various GPT-based model fixes (#2256) · 18db78f2
Daniël de Kok authored Jul 19, 2024

18db78f2
Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255) · 80adb5be
Daniël de Kok authored Jul 19, 2024

80adb5be

Improve the handling of quantized weights (#2250) · ba291dad

Daniël de Kok authored Jul 19, 2024

* Improve the handling of quantized weights

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader
  infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes) we
  instead relied on a conditional in `get_linear`.

Weight loaders support context managers to selectively load
particular layers with different weight loaders, which is useful
for models like Idefics2 AWQ, which uses a quantized text model,
but unquantized vision and connector models. However, the context
manager would be overridden by `get_linear`, which string-checks
`quantizer`. Also, the context manager would not work with
EETQ, FP8, and bitsandbytes.

This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers;
  `get_linear` does not need to know how to handle quantizer linear
  layers.
- All quantizer weights are strongly typed; we don't pass around
  raw tensors.
- We don't have to pass around the `quantizer` string everywhere.
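
The context-manager mechanism described above can be sketched roughly as follows. This is a toy model, not the real code: the class names, the `use_loader` method name, and the use of strings in place of tensors are all assumptions made for illustration.

```python
from contextlib import contextmanager
from typing import Iterator


class DefaultWeightsLoader:
    """Hypothetical loader that returns unquantized weights."""

    def get_weights(self, prefix: str) -> str:
        return f"unquantized:{prefix}"


class AWQWeightsLoader:
    """Hypothetical loader standing in for a quantized-checkpoint loader."""

    def get_weights(self, prefix: str) -> str:
        return f"awq:{prefix}"


class Weights:
    """Toy stand-in for the real `Weights` class; strings replace tensors."""

    def __init__(self, loader) -> None:
        self.loader = loader

    def get_weights(self, prefix: str) -> str:
        # Quantizer-specific work is delegated to the active loader.
        return self.loader.get_weights(prefix)

    @contextmanager
    def use_loader(self, loader) -> Iterator[None]:
        """Temporarily swap the active loader, restoring it on exit."""
        previous = self.loader
        self.loader = loader
        try:
            yield
        finally:
            self.loader = previous


# Quantized text model, unquantized vision model (as in the Idefics2 AWQ case).
weights = Weights(AWQWeightsLoader())
text = weights.get_weights("model.text_model.layers.0.mlp")
with weights.use_loader(DefaultWeightsLoader()):
    vision = weights.get_weights("model.vision_model.encoder.layers.0.mlp")
after = weights.get_weights("model.text_model.layers.1.mlp")
```

Because the loader is the only place that knows about the quantizer, nothing downstream needs to inspect a `quantizer` string to decide how a layer was loaded.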

* Exclude non-MLP layers when using FP8 quantization with Llama

ba291dad

18 Jul, 2024 1 commit
- fix(server): fix cohere (#2249) · 1d1b1efa
  OlivierDehaene authored Jul 18, 2024
  
  1d1b1efa
16 Jul, 2024 1 commit
- Add support for AWQ-quantized Idefics2 (#2233) · 06d0e880
  Daniël de Kok authored Jul 16, 2024
```
Fixes #2036.
```
  06d0e880
09 Jul, 2024 1 commit

Move quantized weight handling out of the `Weights` class (#2194) · 8511669c

Daniël de Kok authored Jul 09, 2024

Quantized weights were loaded in the `Weights` class, but this was
getting quite unwieldy: every higher-level method to load weights
was a long conditional covering all the different quantizers.

This change moves loading of quantized weights out of the `Weights`
class. This is done by defining a simple `WeightsLoader` interface
that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
and `MarlinWeightsLoader`. These implementations are in the quantizers'
respective modules. The `Weights` class provides the low-level load
operations (such as loading tensors or sharded tensors), but delegates
loads that need quantizer-specific weight processing to a loader. The
loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights`
would inherit from `Weights`. But it is not very flexible (e.g. does
not work well with the new weight storage mock used in tests) and
the implicit indirections made the code harder to follow.
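
As a rough sketch of the delegation described in this commit message: `Weights` keeps the low-level operations and hands quantizer-specific loading to a loader object, which in turn calls back into `Weights`. The method names, the toy loader, and the use of float lists instead of tensors are assumptions for illustration and do not reproduce the real interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class WeightsLoader(ABC):
    """Interface for quantizer-specific loading (method name is illustrative)."""

    @abstractmethod
    def get_weights_row(
        self, weights: "Weights", prefix: str
    ) -> Dict[str, List[float]]:
        ...


class Weights:
    """Low-level access to stored tensors; float lists stand in for tensors."""

    def __init__(self, tensors: Dict[str, List[float]], loader: WeightsLoader) -> None:
        self._tensors = tensors
        self._loader = loader

    def get_tensor(self, name: str) -> List[float]:
        # Low-level operation that loaders build on.
        return self._tensors[name]

    def get_weights_row(self, prefix: str) -> Dict[str, List[float]]:
        # No per-quantizer conditional here: delegate to the loader.
        return self._loader.get_weights_row(self, prefix)


class ToyGPTQWeightsLoader(WeightsLoader):
    """Hypothetical loader that gathers the tensors a GPTQ layer would need."""

    def get_weights_row(
        self, weights: Weights, prefix: str
    ) -> Dict[str, List[float]]:
        return {
            "qweight": weights.get_tensor(f"{prefix}.qweight"),
            "scales": weights.get_tensor(f"{prefix}.scales"),
        }


store = {"layer.qweight": [1.0, 2.0], "layer.scales": [0.5]}
weights = Weights(store, ToyGPTQWeightsLoader())
loaded = weights.get_weights_row("layer")
```

Composition keeps `Weights` independent of any quantizer, so a test double for the storage can be combined with any loader.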

8511669c

08 Jul, 2024 4 commits
- Falcon/DBRX: get correct number of key-value heads (#2205) · 5c7c9f13
  Daniël de Kok authored Jul 08, 2024
  
  5c7c9f13
- Fix incorrect cache allocation with multi-query (#2203) · 153fcf77
  Daniël de Kok authored Jul 08, 2024
```
We wouldn't allocate any memory in multi-query (1 KV head). Fixes
Starcoder et al.
```
  153fcf77
- hotfix: Fix number of KV heads (#2202) · cce475a9
  Daniël de Kok authored Jul 08, 2024
```
Fix number of KV heads
```
  cce475a9
- fix dbrx & opt model prefix bug (#2201) · 521d0d99
  icyboy™ authored Jul 08, 2024
```
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug
```
  521d0d99
05 Jul, 2024 4 commits

Consistently take `prefix` in model constructors (#2191) · 05c094fc
Daniël de Kok authored Jul 05, 2024
```
* Consistently take `prefix` in model constructors

* Release test check fix

* Misc refactor-related fixes
```
05c094fc
Fix Starcoder2 after refactor (#2189) · b67d4633
Daniël de Kok authored Jul 05, 2024

b67d4633
Hotfixing after refactor. · 853d4eb9
Nicolas Patry authored Jul 05, 2024

853d4eb9

Refactor dead code - Removing all `flash_xxx.py` files. (#2166) · fb2f74e2

Nicolas Patry authored Jul 05, 2024

* Refactor dead code.

* First working step.

* Remove a lot of duplicated code.

* More dead code.

* More cleanup.

* Fix Santacoder test.

* Fixing the simple tests.

* Fixing sharding.

* Fixes for VLM.

* Fixing santacoder (num_kv_heads hardcoded).

* Removing more dead code.

* Fixing `config.n_head`.

* Stopping earlier because of `<end_of_utterance>` in idefics2.

* Addresses comments.

* Removing the dead code.

* Fuse back mistral into FlashCausalLM.

* Finish removal.

* Fixing docs + causal_lm `batch_class`.

* Fixing docs + causal.lm.

* Add default to Gemma Causality.

* Default value for gemma/gemma2.

* Wrong default.

fb2f74e2

02 Jul, 2024 3 commits
- Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167) · 0759ec49
  Nicolas Patry authored Jul 02, 2024
  
  0759ec49
- fix: use the base layers weight in mistral rocm (#2155) · b966bc0d
  drbh authored Jul 02, 2024
  
  b966bc0d
- Fixing graph capture for flash decoding. (#2163) · 022f6515
  Nicolas Patry authored Jul 02, 2024
  
  022f6515
01 Jul, 2024 5 commits

[Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940) · 4327210e

Nicolas Patry authored Jul 01, 2024

* Using flash decoding

Conditional flashdecoding.

Fix max_q.

Working kvcache

Working version with flash decoding.

Make it work for mistral.

Fix after rebase.

Less intrusive.

Revert changes in modeling.

Speedup flashdecoding.

HHachweew
Hack to make other models work.

Fixing non flash decoding llama path.

Router logic knows about page size.

Missing 2 models.

Missing cohere.

Fixing cohere flash decoding.

Revamped all this architecture.

Fix cohere.

Fixing falcon.

Enabling custom block size schedule.

Update router/src/infer.rs

Not sending preallocated output.

* Making it work on non flash decoding.

* Fix Cohere.

* Fix non decoding paths.

* Rebased.

* No need for cache_manager anymore.

* Update?

* "ipex" -> "cpu"

* These do not belong.

* Factoring cu_seqlen_qk for better abstracting over every model.

* Fixing non flash tests/imports.

* Changing return everywhere.

* Update mistral past.

* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).

* Fixup mistral clamping (had issues with cuda graphs).

* No need to recreate anything actually.

4327210e

Fixing baichuan override. (#2158) · 4f55f158
Nicolas Patry authored Jul 01, 2024

4f55f158

refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132) · 5da4cfab

Wang, Yi authored Jul 01, 2024



* refine get xpu free memory
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable qwen2 in xpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable gemma/gemma2/phi in intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

5da4cfab

fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' (#2123) · 9d0ca503
icyboy™ authored Jul 01, 2024
```
https://github.com/huggingface/text-generation-inference/issues/2122
```
9d0ca503
fix: use weights from base_layer (#2141) · 25f57e2e
drbh authored Jul 01, 2024

25f57e2e

27 Jun, 2024 2 commits

Fixing gemma2. (#2135) · 3ea8259a
Nicolas Patry authored Jun 27, 2024
```
* Fixing gemma2.

* Adding new model.
```
3ea8259a

Idefics2: sync added image tokens with transformers (#2080) · dd2d91b0

Daniël de Kok authored Jun 27, 2024

Before this change, the number of reserved image tokens was not the
same as the number of images. Fixes #2029.

While at it, also remove all the image token handling duplication
in `prepare_input`.

dd2d91b0

25 Jun, 2024 4 commits

Enable multiple LoRa adapters (#2010) · 04e1af94

drbh authored Jun 25, 2024



* feat: first draft load multiple lora

* feat: load weights within layer and refactor lora pass

* fix: refactor and reduce lora math

* feat: baseline impl single request multi lora support

* feat: prefer lorax implementation and port loading logic

* fix: prefer adapter_data and refactors

* feat: prefer lorax's custom punica kernels and add mlp loras

* fix: adjust batch for bgmv

* fix: adjust adapter_segments logic when in batch

* fix: refactor and move changes to v3 proto

* fix: pass model_id for all flash causal lms

* fix: pass model_id for all causal and seq2seq lms

* fix: add model_id to model test

* feat: add lora support to mistral and refactors

* feat: prefer model id in request

* fix: include rust code for adapter id

* feat: bump launcher and add new lora docs

* feat: support base model generation and refactors

* fix: rename doc to retry ci build

* feat: support if vlm models

* fix: add adapter_data param and avoid missing layers

* fix: add adapter_data param to phi and neox

* fix: update all models forwards to include adapter_data

* fix: add model_id to IdeficsCausalLM

* Update lora.md

Fixed a typo

* Update lora.md

Fixing spam image

* fix: add lora kernel to dockerfile, support running without kernels and refactors

* fix: avoid dockerfile conflict

* fix: refactors and adjust flash llama lora logic

* fix: skip llama test due to CI issue (temp)

* fix: skip llama test CI (temp) 2

* fix: revert skips and prefer updated ci token for tests

* fix: refactors and helpful comments

* fix: add noop in TensorParallelAdapterRowLinear too

* fix: refactor and move shard_lora_weights logic

* fix: exit early if no adapter_data

---------
Co-authored-by: Derek <datavistics@gmail.com>

04e1af94

fix cpu and xpu issue (#2116) · e563983d
Wang, Yi authored Jun 25, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
e563983d

Removing IPEX_AVAIL. (#2115) · 9e2fdf57

Nicolas Patry authored Jun 25, 2024

* Removing IPEX_AVAIL.

Chose to unify CPU and XPU under `ipex`. Most of the code is identical
except for a few spots.

Most of the differences are in the kv-cache layout and the flash_xxx.py
files.
Since those files should be removed soon and factored away, we should
not need them.

* Forgot a few places.

* Unrelated change.

* Fixing HF_TOKEN.

* HF_TOKEN

9e2fdf57

Cpu tgi (#1936) · b64c70c9

Wang, Yi authored Jun 25, 2024



* add CPU tgi support
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* ipex distributed ops support
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>

b64c70c9

21 Jun, 2024 1 commit
- feat: sort cuda graphs in descending order (#2104) · 811a9381
  drbh authored Jun 21, 2024
  
  811a9381
20 Jun, 2024 1 commit
- Support exl2-quantized Qwen2 models (#2085) · f5a98375
  Daniël de Kok authored Jun 20, 2024
```
Fixes #2081.
```
  f5a98375
17 Jun, 2024 1 commit

Support different image sizes in prefill in VLMs (#2065) · e9037708

Daniël de Kok authored Jun 17, 2024

When a batch contained images of different sizes during prefill, the
server would fail (see e.g. #2056). Images were processed separately and
then concatenated. However, this can fail for images with different sizes.

Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.
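
A minimal sketch of why processing the batch together helps: once all images are visible at the same time, they can be brought to a common shape before being stacked. Padding to the largest height and width is only one possible strategy and is an assumption here; the actual image processor may resize or pad differently. Nested lists stand in for image tensors, and `pad_batch` is a hypothetical helper.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values; stands in for an image tensor


def pad_batch(images: List[Image], fill: int = 0) -> List[Image]:
    """Pad every image to the largest height and width found in the batch."""
    max_h = max(len(img) for img in images)
    max_w = max(len(row) for img in images for row in img)
    padded = []
    for img in images:
        # Pad each row on the right, then append fully padded rows at the bottom.
        rows = [row + [fill] * (max_w - len(row)) for row in img]
        rows += [[fill] * max_w for _ in range(max_h - len(rows))]
        padded.append(rows)
    return padded


# A 2x2 image and a 1x3 image end up with the same 2x3 shape.
batch = pad_batch([[[1, 2], [3, 4]], [[5, 6, 7]]])
shapes = {(len(img), len(img[0])) for img in batch}
```

Processing each image on its own cannot do this, because the target shape depends on the other images in the batch.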

e9037708

14 Jun, 2024 2 commits

Update the link for qwen2 (#2068) · 96b7b40c

Tiezhen WANG authored Jun 14, 2024



* Update the link for qwen2

* Fix Qwen2 model URL in model table

* Fix too eager staging

---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>

96b7b40c

Add support for GPTQ Marlin (#2052) · 093a27c5

Daniël de Kok authored Jun 14, 2024

Add support for GPTQ Marlin kernels

GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:

- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false

Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.

The kernels were contributed by Neural Magic to vLLM. We vendor them
here for convenience.
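
The configuration list above could be checked with something like the following sketch. The helper name is hypothetical, and any further constraints the real implementation may impose (for example on hardware) are not modelled.

```python
SUPPORTED_BITS = {4, 8}
SUPPORTED_GROUPSIZES = {-1, 32, 64, 128}


def can_use_gptq_marlin(bits: int, groupsize: int, desc_act: bool) -> bool:
    """Return True if the configuration is in the list from the commit message.

    Hypothetical helper. Both values of `desc_act` are listed as supported,
    so it is only checked for being a boolean.
    """
    return (
        bits in SUPPORTED_BITS
        and groupsize in SUPPORTED_GROUPSIZES
        and isinstance(desc_act, bool)
    )


ok = can_use_gptq_marlin(bits=4, groupsize=128, desc_act=True)
unsupported_bits = can_use_gptq_marlin(bits=3, groupsize=128, desc_act=False)
unsupported_group = can_use_gptq_marlin(bits=4, groupsize=16, desc_act=False)
```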

093a27c5

12 Jun, 2024 2 commits
- fix(layers): fix SuRotaryEmbedding (#2060) · 90184df7
  OlivierDehaene authored Jun 12, 2024
```
* fix(layers): fix SuRotaryEmbedding

* change arange

* remove logs
```
  90184df7
- fix(server): fix OPT implementation (#2061) · 521de6ca
  OlivierDehaene authored Jun 12, 2024
  
  521de6ca
10 Jun, 2024 1 commit

Add Phi-3 medium support (#2039) · 85dfc392

Daniël de Kok authored Jun 10, 2024

Add support for Phi-3-medium

The main difference between the medium and mini models is that medium
uses grouped query attention with a packed QKV matrix. This change adds
support for GQA with packed matrices to `Weights.get_weights_col_packed`
and uses it for Phi-3. This also allows us to remove the custom
implementation of GQA from dbrx attention loading.
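
To make the packed-QKV case concrete, the sketch below computes which rows of a packed matrix one tensor-parallel shard would need when the query and key/value head counts differ. It assumes the packed matrix stacks Q, then K, then V along the rows and that both head counts divide evenly by the number of shards; the function is hypothetical and the numbers are chosen for illustration.

```python
from typing import List, Tuple


def packed_qkv_shard_rows(
    num_heads: int, num_kv_heads: int, head_size: int, rank: int, world_size: int
) -> List[Tuple[int, int]]:
    """Return [start, stop) row ranges of Q, K and V for one shard.

    Assumes Q, K, V are stacked along the rows in that order (an assumption).
    """
    if num_heads % world_size or num_kv_heads % world_size:
        raise ValueError("head counts must be divisible by world_size")
    block_sizes = [
        num_heads * head_size,     # Q block
        num_kv_heads * head_size,  # K block (smaller than Q with GQA)
        num_kv_heads * head_size,  # V block
    ]
    ranges = []
    offset = 0
    for size in block_sizes:
        shard = size // world_size
        start = offset + rank * shard
        ranges.append((start, start + shard))
        offset += size
    return ranges


# 40 query heads, 10 KV heads, head size 128, second of two shards.
rows = packed_qkv_shard_rows(40, 10, 128, rank=1, world_size=2)
```

With equal head counts the three blocks have the same size and a single split suffices; with GQA each block has to be sliced using its own size, which is what the packed loading path has to account for.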

85dfc392