"vscode:/vscode.git/clone" did not exist on "92e4e9c650fafb42294a80a42b6d394e10b5f3c4"
- 01 Jul, 2024 8 commits
-
-
drbh authored
* fix: prefer enum for chat object
* fix: adjust typo
* fix: enum CompletionType not ObjectType
* fix: adjust typo
* feat: leverage serde for conditional deser
* fix: adjust HubTokenizerConfig after rebase
* fix: update create_post_processor logic for token type
* fix: adjust unwrap syntax in template
* Fixing the post processor.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Wang, Yi authored
* refine get xpu free memory
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable qwen2 in xpu
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable gemma/gemma2/phi in intel platform
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
-
Daniël de Kok authored
GPTQ-Marlin is currently the best-performing kernel for GPTQ models, so let's use it by default if the kernels are installed, the GPU supports it, and the kernels support the configuration. For models generated by `text-generation-server quantize`, use `sym=False`: this subcommand has used asymmetric quantization since the beginning, and incorrectly reporting the model as symmetric would cause GPTQ-Marlin (which does not support asymmetric quantization) to be selected.
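A minimal sketch of the default-selection rule described above; the function and parameter names are illustrative, not the actual text-generation-inference code, and the supported bit-widths are an assumption.

```python
# Illustrative sketch only: names and the bit-width check are assumptions,
# not the actual text-generation-inference selection code.
def should_use_gptq_marlin(kernels_installed: bool, gpu_supported: bool,
                           bits: int, sym: bool) -> bool:
    """Prefer GPTQ-Marlin when the kernels exist, the GPU can run them,
    and the quantization config is one they support (symmetric only)."""
    config_supported = sym and bits in (4, 8)  # assumed supported bit-widths
    return kernels_installed and gpu_supported and config_supported
```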
-
drbh authored
-
drbh authored
-
Nicolas Patry authored
-
Wang, Yi authored
* fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_indices]
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
- 28 Jun, 2024 1 commit
-
-
Nicolas Patry authored
-
- 27 Jun, 2024 6 commits
-
-
drbh authored
* fix: refactor post_processor logic and add test
* fix: remove dev comment
* fix: adjust when post_processor is overridden and improve create_post_processor
-
Nicolas Patry authored
* Fixing gemma2.
* Adding new model.
-
Nicolas Patry authored
* Fixing malformed rust tokenizers
* Fix for deepseek too.
-
Daniël de Kok authored
Before this change, the number of reserved image tokens was not the same as the number of images. Fixes #2029. While at it, also remove all the image token handling duplication in `prepare_input`.
-
Nicolas Patry authored
-
Nicolas Patry authored
-
- 25 Jun, 2024 15 commits
-
-
drbh authored
-
Daniël de Kok authored
This change adds support for 2:4 sparsity when using Marlin quantization. The 2:4 kernel is used when:
* the quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.
Fixes #2098.
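The dispatch condition above amounts to a simple check; the sketch below restates it in Python (function and argument names are illustrative, not the actual code).

```python
# Illustrative restatement of the condition above; names are assumptions.
def use_marlin_24_kernel(quantize: str, checkpoint_format: str | None) -> bool:
    """Select the 2:4-sparse Marlin kernel only for `marlin` quantization
    with a `marlin_24` checkpoint format."""
    return quantize == "marlin" and checkpoint_format == "marlin_24"
```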
-
Daniël de Kok authored
When the AWQ quantizer was used with a layer that uses a bias, the bias tensor was not correctly passed/used. Instead, the value `true`/`1.0` was added to the linear transformation. Correctly pass through the bias when it is not `None`. Fixes #2106.
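A minimal sketch of the fix described above, using a plain linear layer as a stand-in for the AWQ kernel (the real code calls a quantized matmul); names are illustrative.

```python
import torch

# Minimal sketch of the bias fix; a plain matmul stands in for the AWQ kernel.
def linear_forward(x: torch.Tensor, weight: torch.Tensor,
                   bias: torch.Tensor | None) -> torch.Tensor:
    out = x @ weight.t()
    # The fix: add the actual bias tensor only when it exists, instead of
    # passing a boolean that ends up being broadcast into the output as 1.0.
    if bias is not None:
        out = out + bias
    return out
```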
-
drbh authored
* feat: first draft load multiple lora
* feat: load weights within layer and refactor lora pass
* fix: refactor and reduce lora math
* feat: baseline impl single request multi lora support
* feat: prefer lorax implementation and port loading logic
* fix: prefer adapter_data and refactors
* feat: prefer lorax's custom punica kernels and add mlp loras
* fix: adjust batch for bgmv
* fix: adjust adapter_segments logic when in batch
* fix: refactor and move changes to v3 proto
* fix: pass model_id for all flash causal lms
* fix: pass model_id for all causal and seq2seq lms
* fix: add model_id to model test
* feat: add lora support to mistral and refactors
* feat: prefer model id in request
* fix: include rust code for adapter id
* feat: bump launcher and add new lora docs
* feat: support base model generation and refactors
* fix: rename doc to retry ci build
* feat: support if vlm models
* fix: add adapter_data param and avoid missing layers
* fix: add adapter_data param to phi and neox
* fix: update all models forwards to include adapter_data
* fix: add model_id to IdeficsCausalLM
* Update lora.md: fixed a typo
* Update lora.md: fixing spam image
* fix: add lora kernel to dockerfile, support running without kernels and refactors
* fix: avoid dockerfile conflict
* fix: refactors and adjust flash llama lora logic
* fix: skip llama test due to CI issue (temp)
* fix: skip llama test CI (temp) 2
* fix: revert skips and prefer updated ci token for tests
* fix: refactors and helpful comments
* fix: add noop in TensorParallelAdapterRowLinear too
* fix: refactor and move shard_lora_weights logic
* fix: exit early if no adapter_data
---------
Co-authored-by: Derek <datavistics@gmail.com>
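A hypothetical usage sketch of the single-base-model, multi-adapter flow this change describes; the endpoint path, port, and the `adapter_id` parameter name are assumptions drawn from the bullet list ("prefer model id in request", new lora docs), not verified against the released API.

```python
import requests

# Hypothetical request sketch: the port, endpoint, and `adapter_id` parameter
# name are assumptions, not verified text-generation-inference API details.
response = requests.post(
    "http://localhost:3000/generate",
    json={
        "inputs": "What is deep learning?",
        "parameters": {
            "adapter_id": "my-org/my-lora-adapter",  # assumed parameter name
            "max_new_tokens": 64,
        },
    },
)
print(response.json())
```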
-
Nicolas Patry authored
Fix clippy.
-
Daniël de Kok authored
* Add pytest release marker
  Annotate a test with `@pytest.mark.release` and it only gets run with `pytest integration-tests --release`.
* Mark many models as `release` to speed up CI
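For illustration, a test annotated as the commit describes; the test name and body are placeholders.

```python
import pytest

# Placeholder test: with the marker below it is skipped by default and only
# runs when the integration tests are invoked with `--release`.
@pytest.mark.release
def test_generates_with_large_model():
    assert True  # stand-in for a real integration check
```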
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
* Removing IPEX_AVAIL.
  Chose to unify CPU and XPU under `ipex`. Most of the code is identical except for a very few spots, the biggest being the kv-cache layout and the flash_xxx.py files. Since those files should be removed soon and factored away, we should not need them.
* Forgot a few places.
* Unrelated change.
* Fixing HF_TOKEN.
* HF_TOKEN
-
drbh authored
* feat: add simple tests for weights
* fix: adjust types and add tests
* fix: adjust so all tests pass
* feat: improve weight tests
* fix: add missing tests and renames
* fix: tweak shapes
-
Wang, Yi authored
* add CPU tgi support
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* ipex distributed ops support
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
-
sunxichen authored
Fix the ChatCompletion and ChatCompletionChunk `object` strings not being compatible with the standard OpenAI API (#2089).
Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>
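For reference, the `object` values the standard OpenAI chat API uses (well-known constants shown here for context, not taken from the TGI code):

```python
# Standard OpenAI `object` strings for chat responses; the fix above aligns
# TGI's ChatCompletion/ChatCompletionChunk serialization with them.
CHAT_COMPLETION_OBJECT = "chat.completion"              # non-streaming responses
CHAT_COMPLETION_CHUNK_OBJECT = "chat.completion.chunk"  # streaming (SSE) chunks
```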
-
Wang, Yi authored
* use xpu-smi to dump used memory
  XPU uses "ZE_AFFINITY_MASK" to control card selection; usage is like CUDA_VISIBLE_DEVICES.
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Update server/text_generation_server/utils/import_utils.py
  Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
-
Jeff authored
* corrected Pydantic warning.
* Update clients/python/text_generation/types.py
  Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
-
KevinDuffy94 authored
* Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069
* Update Docs
* Update README.md
* Update Launcher Docs
* Update Launcher Docs Removing Option
-
Lucain authored
* Support HF_TOKEN environment variable
* Load test.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
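A minimal sketch of relying on the variable, assuming the server reads it at startup; the token value is a placeholder.

```python
import os

# Sketch only: set the standard Hugging Face token variable before launching
# the server so gated/private models can be downloaded. Placeholder value.
os.environ["HF_TOKEN"] = "hf_xxx"
```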
-
- 24 Jun, 2024 2 commits
-
-
ur4t authored
* Fix cargo-chef prepare
  In the prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly. If Cargo.lock is not present, cargo-chef will first generate a new one, which might vary a lot and invalidate Docker build caches.
* Fix Dockerfile_amd and Dockerfile_intel
-
Nicolas Patry authored
* New runner. Manual squash.
* Network host.
* Put back trufflehog with proper extension.
* No network host?
* Moving buildx install after tailscale?
* 1.79
-
- 21 Jun, 2024 2 commits
-
-
drbh authored
-
Daniël de Kok authored
The subcommand did not work due to some broken imports.
-
- 20 Jun, 2024 2 commits
-
-
Daniël de Kok authored
For Phi-3-Small I need to shard a packed QKV bias tensor, for which I implemented the `Weights.get_packed_sharded` method. However, this method can also replace the `Weights._get_qweight` method and the custom sharding code from `Weights.get_weights_col_packed`.
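An illustrative sketch of what sharding a packed tensor involves (function and argument names are assumptions, not the actual `Weights.get_packed_sharded` signature): split the packed dimension into its Q/K/V blocks, take this rank's slice of each block, and re-pack them.

```python
import torch

# Illustrative only; not the actual Weights.get_packed_sharded implementation.
def shard_packed_qkv(packed: torch.Tensor, block_sizes: list[int],
                     rank: int, world_size: int) -> torch.Tensor:
    """Shard a 1-D packed QKV tensor (e.g. a bias) across tensor-parallel ranks."""
    shards, start = [], 0
    for size in block_sizes:                 # e.g. [q_size, kv_size, kv_size]
        block = packed[start:start + size]
        shard = size // world_size
        shards.append(block[rank * shard:(rank + 1) * shard])
        start += size
    return torch.cat(shards, dim=0)
```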
-
Daniël de Kok authored
Fixes #2081.
-
- 19 Jun, 2024 1 commit
-
-
drbh authored
-
- 18 Jun, 2024 2 commits
-
-
Daniël de Kok authored
-
Guillaume LEGENDRE authored
* test local tailscale
* Update build.yaml
* Update build.yaml
* Update build.yaml
* Update build.yaml
* wait for ssh
* network host
* change step order
-
- 17 Jun, 2024 1 commit
-
-
Daniël de Kok authored
* Set maximum grpc message receive size to 2GiB
  The previous default was 4MiB, which doesn't really work well for multi-modal models.
* Update to Rust 1.79.0
* Fixup formatting to make PR pass
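For illustration, this is how such a receive limit is typically raised with the standard Python gRPC options; a generic sketch, not the actual TGI change.

```python
import grpc

# Generic sketch of raising gRPC message limits; not the actual TGI code.
MAX_MESSAGE_SIZE = 2 * 1024 * 1024 * 1024 - 1  # just under 2 GiB (int32 ceiling)

server = grpc.aio.server(
    options=[
        ("grpc.max_receive_message_length", MAX_MESSAGE_SIZE),
        ("grpc.max_send_message_length", MAX_MESSAGE_SIZE),
    ]
)
```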
-