- 03 Jul, 2024 (3 commits)
-
Nicolas Patry authored
This reverts commit 2bbb7fa4.
-
Nicolas Patry authored
-
drbh authored
* feat: add pre commit step to force schema update when router changes
* fix: prefer improved update_doc and start server and compare
* fix: adjust typo
* fix: adjust revert typo
* fix: update workflow to use update_doc md command
* feat: improve workflow to check openapi schema too
* fix: adjust timeout for CI
* fix: adjust raise condition and install server in ci
* fix: install protoc before server
* feat: improve update doc and add command to print router schema
* fix: adjust autodoc workflow
* fix: explicitly install protoc and python
* fix: allow trailing space in openapi schema diff
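The list above describes a schema-drift gate. A minimal sketch of the check such a workflow performs, assuming a hypothetical `print-schema` subcommand and schema path (neither is confirmed by this log):

```python
import json
import subprocess
import sys

# Regenerate the schema from the running code; subcommand name is assumed.
generated = subprocess.run(
    ["text-generation-router", "print-schema"],
    capture_output=True,
    text=True,
    check=True,
).stdout

# Path of the committed schema is also an assumption.
with open("docs/openapi.json") as f:
    committed = f.read()

# Compare parsed JSON rather than raw text, so trailing-whitespace
# differences (mentioned in the last item above) do not fail the check.
if json.loads(generated) != json.loads(committed):
    sys.exit("OpenAPI schema is out of date; regenerate it with update_doc.")
```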
-
- 02 Jul, 2024 (6 commits)
-
Nicolas Patry authored
-
Guillaume LEGENDRE authored
* first test with registry mirror
* change push registry
* remove comments
* Move cache to push registry
* fix registry url
* Update .github/workflows/ci_build.yaml
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Nicolas Patry authored
-
drbh authored
-
Wang, Yi authored
Install triton because GPTQParams needs it.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
-
- 01 Jul, 2024 (12 commits)
-
Nicolas Patry authored
* Using flash decoding

  Conditional flashdecoding. Fix max_q. Working kvcache. Working version with flash decoding. Make it work for mistral. Fix after rebase. Less intrusive. Revert changes in modeling. Speedup flashdecoding. Hack to make other models work. Fixing non flash decoding llama path. Router logic knows about page size. Missing 2 models. Missing cohere. Fixing cohere flash decoding. Revamped all this architecture. Fix cohere. Fixing falcon. Enabling custom block size schedule. Update router/src/infer.rs. Not sending preallocated output.
* Making it work on non flash decoding.
* Fix Cohere.
* Fix non decoding paths.
* Rebased.
* No need for cache_manager anymore.
* Update?
* "ipex" -> "cpu"
* These do not belong.
* Factoring cu_seqlen_qk for better abstracting over every model.
* Fixing non flash tests/imports.
* Changing return everywhere.
* Update mistral past.
* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).
* Fixup mistral clamping (had issues with cuda graphs).
* No need to recreate anything actually.
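The "router logic knows about page size" and "custom block size schedule" notes imply the KV-cache page size is chosen conditionally. A minimal sketch, assuming an env-var gate and block sizes that are not confirmed by this log:

```python
import os

# Assumed env-var gate; the message above only says "conditional flashdecoding".
FLASH_DECODING = os.getenv("FLASH_DECODING", "").lower() in ("1", "true")

# Flash decoding typically uses larger KV-cache pages than the classic paged
# path; the router's block-size schedule must agree with this value.
BLOCK_SIZE: int = 256 if FLASH_DECODING else 16
```
-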
Nicolas Patry authored
-
Nicolas Patry authored
-
Nicolas Patry authored
-
drbh authored
* fix: prefer enum for chat object
* fix: adjust typo
* fix: enum CompletionType not ObjectType
* fix: adjust typo
* feat: leverage serde for conditional deser
* fix: adjust HubTokenizerConfig after rebase
* fix: update create_post_processor logic for token type
* fix: adjust unwrap syntax in template
* Fixing the post processor.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Wang, Yi authored
* refine get xpu free memory
* enable qwen2 in xpu
* enable gemma/gemma2/phi in intel platform
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
GPTQ-Marlin is currently the best-performing kernel for GPTQ models, so use it by default when the kernels are installed, the GPU supports it, and the kernels support the configuration. For models generated by `text-generation-server quantize`, use `sym=False`: this subcommand has used asymmetric quantization since the beginning, and incorrectly reporting the model as symmetric would select GPTQ-Marlin (which does not support asymmetric quantization).
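A minimal sketch of that default-selection rule; the helper names, package name, and exact thresholds are assumptions, only the three preconditions come from the message above:

```python
from dataclasses import dataclass


@dataclass
class GPTQConfig:
    bits: int
    sym: bool  # symmetric quantization?


def marlin_kernels_installed() -> bool:
    try:
        import marlin_kernels  # noqa: F401  (assumed package name)
    except ImportError:
        return False
    return True


def use_gptq_marlin(config: GPTQConfig, compute_capability: tuple) -> bool:
    return (
        marlin_kernels_installed()          # kernels are installed
        and compute_capability >= (8, 0)    # assumption: Ampere or newer
        and config.bits in (4, 8)           # assumption: supported bit widths
        and config.sym                      # Marlin lacks asymmetric support
    )
```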
-
drbh authored
-
drbh authored
-
Nicolas Patry authored
-
Wang, Yi authored
* fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_indices]
* Apply suggestions from code review
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
- 28 Jun, 2024 (1 commit)
-
Nicolas Patry authored
-
- 27 Jun, 2024 (6 commits)
-
drbh authored
* fix: refactor post_processor logic and add test
* fix: remove dev comment
* fix: adjust when post_processor is overridden and improve create_post_processor
-
Nicolas Patry authored
* Fixing gemma2.
* Adding new model.
-
Nicolas Patry authored
* Fixing malformed rust tokenizers
* Fix for deepseek too.
-
Daniël de Kok authored
Before this change, the number of reserved image tokens was not the same as the number of images. Fixes #2029. While at it, also remove all the image token handling duplication in `prepare_input`.
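A minimal sketch of the invariant the fix restores, with a hypothetical placeholder token and helper (the real `prepare_input` logic differs):

```python
IMAGE_TOKEN = "<image>"  # hypothetical placeholder


def expand_image_tokens(prompt: str, image_seq_len: int) -> str:
    # Each placeholder expands to exactly image_seq_len copies, so n images
    # reserve n * image_seq_len tokens: the reserved token count always
    # matches the number of images.
    return prompt.replace(IMAGE_TOKEN, IMAGE_TOKEN * image_seq_len)
```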
-
Nicolas Patry authored
-
Nicolas Patry authored
-
- 25 Jun, 2024 (12 commits)
-
drbh authored
-
Daniël de Kok authored
This change adds support for 2:4 sparsity when using Marlin quantization. The 2:4 kernel is used when:

* the quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.

Fixes #2098.
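A minimal sketch of that dispatch condition; the function and parameter names are assumptions, the two conditions come straight from the message:

```python
from typing import Optional


def use_marlin_24(quantizer: str, checkpoint_format: Optional[str]) -> bool:
    # The 2:4 sparse kernel only applies to Marlin checkpoints exported
    # in the dedicated `marlin_24` format.
    return quantizer == "marlin" and checkpoint_format == "marlin_24"
```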
-
Daniël de Kok authored
When the AWQ quantizer was used with a layer that uses a bias, the bias tensor was not correctly passed/used. Instead, the value `true`/`1.0` was added to the linear transformation. Correctly pass through the bias when it is not `None`. Fixes #2106.
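A minimal sketch of the bug shape and the fix, with stand-in names (the real AWQ kernel call differs):

```python
import torch


def dequant_matmul(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real AWQ dequantize-and-matmul kernel.
    return x @ w


def awq_linear(x, w, bias):
    out = dequant_matmul(x, w)
    # The bug amounted to something like `out + (bias is not None)`,
    # which silently adds 1.0. Pass the tensor through, only when present:
    if bias is not None:
        out = out + bias
    return out
```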
-
drbh authored
* feat: first draft load multiple lora
* feat: load weights within layer and refactor lora pass
* fix: refactor and reduce lora math
* feat: baseline impl single request multi lora support
* feat: prefer lorax implementation and port loading logic
* fix: prefer adapter_data and refactors
* feat: prefer lorax's custom punica kernels and add mlp loras
* fix: adjust batch for bgmv
* fix: adjust adapter_segments logic when in batch
* fix: refactor and move changes to v3 proto
* fix: pass model_id for all flash causal lms
* fix: pass model_id for all causal and seq2seq lms
* fix: add model_id to model test
* feat: add lora support to mistral and refactors
* feat: prefer model id in request
* fix: include rust code for adapter id
* feat: bump launcher and add new lora docs
* feat: support base model generation and refactors
* fix: rename doc to retry ci build
* feat: support for vlm models
* fix: add adapter_data param and avoid missing layers
* fix: add adapter_data param to phi and neox
* fix: update all models forwards to include adapter_data
* fix: add model_id to IdeficsCausalLM
* Update lora.md: fixed a typo
* Update lora.md: fixing spam image
* fix: add lora kernel to dockerfile, support running without kernels and refactors
* fix: avoid dockerfile conflict
* fix: refactors and adjust flash llama lora logic
* fix: skip llama test due to CI issue (temp)
* fix: skip llama test CI (temp) 2
* fix: revert skips and prefer updated ci token for tests
* fix: refactors and helpful comments
* fix: add noop in TensorParallelAdapterRowLinear too
* fix: refactor and move shard_lora_weights logic
* fix: exit early if no adapter_data
---------
Co-authored-by: Derek <datavistics@gmail.com>
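The ported punica BGMV kernels compute, per batch row, a low-rank update selected by that row's adapter id. A naive dense reference sketch, with assumed tensor layouts:

```python
import torch


def lora_delta(
    x: torch.Tensor,            # (batch, hidden) activations
    lora_a: torch.Tensor,       # (n_adapters, hidden, rank) stacked A matrices
    lora_b: torch.Tensor,       # (n_adapters, rank, out) stacked B matrices
    adapter_ids: torch.Tensor,  # (batch,) per-request adapter index
    scaling: float = 1.0,
) -> torch.Tensor:
    # Each row gathers its own adapter pair, so one batch can mix requests
    # that target different LoRAs (the "multi lora" case above).
    a = lora_a[adapter_ids]
    b = lora_b[adapter_ids]
    delta = torch.bmm(torch.bmm(x.unsqueeze(1), a), b).squeeze(1)
    return scaling * delta  # added to the base layer's output
```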
-
Nicolas Patry authored
Fix clippy.
-
Daniël de Kok authored
* Add pytest release marker

  Annotate a test with `@pytest.mark.release` and it only gets run with `pytest integration-tests --release`.
* Mark many models as `release` to speed up CI
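A minimal `conftest.py` sketch of the standard pytest wiring for such a marker; the marker name and `--release` flag come from the message, the hook code is an assumption:

```python
import pytest


def pytest_addoption(parser):
    parser.addoption("--release", action="store_true", help="run release tests")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--release"):
        return  # run everything, including release-marked tests
    skip = pytest.mark.skip(reason="release test; pass --release to run")
    for item in items:
        if "release" in item.keywords:
            item.add_marker(skip)
```

A test then opts in with `@pytest.mark.release` above its definition.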
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
* Removing IPEX_AVAIL.

  Chose to unify CPU and XPU under `ipex`. Most code is nearly identical except for a very few spots, chiefly the kv-cache layout and the flash_xxx.py files. Since those files should be removed soon and factored away, we should not need them.
* Forgot a few places.
* Unrelated change.
* Fixing HF_TOKEN.
* HF_TOKEN
-
drbh authored
* feat: add simple tests for weights
* fix: adjust types and add tests
* fix: adjust so all tests pass
* feat: improve weight tests
* fix: add missing tests and renames
* fix: tweak shapes
-
Wang, Yi authored
* add CPU tgi support
* ipex distributed ops support
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
-
sunxichen authored
Fix `ChatCompletion` and `ChatCompletionChunk` object strings not being compatible with the standard OpenAI API (#2089).
Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>
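For reference, the `object` strings the OpenAI API uses for these payloads (the constant names here are illustrative, the values are what compatible clients check for):

```python
CHAT_COMPLETION_OBJECT = "chat.completion"
CHAT_COMPLETION_CHUNK_OBJECT = "chat.completion.chunk"
```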
-
Wang, Yi authored
* use xpu-smi to dump used memory

  XPU uses `ZE_AFFINITY_MASK` to select the card; usage is like `CUDA_VISIBLE_DEVICES`.
* Update server/text_generation_server/utils/import_utils.py
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
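A small sketch of that analogy, assuming the first card should be exposed (the value format mirrors `CUDA_VISIBLE_DEVICES=0` on NVIDIA systems):

```python
import os

# Restrict the process to XPU 0, the Level Zero analogue of
# CUDA_VISIBLE_DEVICES on NVIDIA GPUs.
os.environ.setdefault("ZE_AFFINITY_MASK", "0")
```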
-