Commits · b53b21c63a3fb0e57c23cb7e7b15d3c4588c66c0 · OpenDAS / text-generation-inference

27 Jun, 2024 2 commits
- Bumping to 2.1 (#2131) · b53b21c6
  Nicolas Patry authored Jun 27, 2024
  
  b53b21c6
- Fixing prom leak by upgrading. (#2129) · bcfcd474
  Nicolas Patry authored Jun 27, 2024
  
  bcfcd474
25 Jun, 2024 15 commits

fix: simplify kserve endpoint and fix imports (#2119) · be2d3803
drbh authored Jun 25, 2024

be2d3803

Add support for Marlin 2:4 sparsity (#2102) · f1f98e36

Daniël de Kok authored Jun 25, 2024

This change adds support for 2:4 sparsity when using Marlin
quantization. The 2:4 kernel is used when:

* The quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.

Fixes #2098.
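
The selection rule above can be sketched as a small predicate. This is an illustration only: the function name and its string arguments are hypothetical and are not taken from the repository.

```python
def use_marlin_24_kernel(quantize: str, checkpoint_format: str) -> bool:
    """Return True when both conditions from the commit message hold."""
    # Hypothetical helper: the 2:4 kernel requires the `marlin` quantizer
    # together with a checkpoint stored in the `marlin_24` format.
    return quantize == "marlin" and checkpoint_format == "marlin_24"
```

Any other combination, for example `marlin` with a different checkpoint format, would fall back to a non-sparse path in this sketch.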

f1f98e36

Support AWQ quantization with bias (#2117) · 14980df2

Daniël de Kok authored Jun 25, 2024

When the AWQ quantizer was used with a layer that uses a bias,
the bias tensor was not correctly passed/used. Instead, the
value `true`/`1.0` was added to the linear transformation.

Correctly pass through the bias when it is not `None`.

Fixes #2106.
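
A minimal sketch of the intended behaviour, using plain Python lists instead of tensors. The `linear` function is hypothetical and is not the AWQ layer itself; it only shows the bias being added when one is actually supplied, rather than a flag value being added in its place.

```python
from typing import List, Optional


def linear(x: List[float], weight: List[List[float]],
           bias: Optional[List[float]] = None) -> List[float]:
    """Compute y = W x, plus the bias vector when one is given."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weight]
    # Add the bias element-wise only when a bias vector was provided.
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out
```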

14980df2

Enable multiple LoRa adapters (#2010) · 04e1af94

drbh authored Jun 25, 2024



* feat: first draft load multiple lora

* feat: load weights within layer and refactor lora pass

* fix: refactor and reduce lora math

* feat: baseline impl single request multi lora support

* feat: prefer lorax implementation and port loading logic

* fix: prefer adapter_data and refactors

* feat: prefer lorax's custom punica kernels and add mlp loras

* fix: adjust batch for bgmv

* fix: adjust adapter_segments logic when in batch

* fix: refactor and move changes to v3 proto

* fix: pass model_id for all flash causal lms

* fix: pass model_id for all causal and seq2seq lms

* fix: add model_id to model test

* feat: add lora support to mistral and refactors

* feat: prefer model id in request

* fix: include rust code for adapter id

* feat: bump launcher and add new lora docs

* feat: support base model generation and refactors

* fix: rename doc to retry ci build

* feat: support if vlm models

* fix: add adapter_data param and avoid missing layers

* fix: add adapter_data param to phi and neox

* fix: update all models forwards to include adapter_data

* fix: add model_id to IdeficsCausalLM

* Update lora.md

Fixed a typo

* Update lora.md

Fixing spam image

* fix: add lora kernel to dockerfile, support running without kernels and refactors

* fix: avoid dockerfile conflict

* fix: refactors and adjust flash llama lora logic

* fix: skip llama test due to CI issue (temp)

* fix: skip llama test CI (temp) 2

* fix: revert skips and prefer updated ci token for tests

* fix: refactors and helpful comments

* fix: add noop in TensorParallelAdapterRowLinear too

* fix: refactor and move shard_lora_weights logic

* fix: exit early if no adapter_data

---------
Co-authored-by: Derek <datavistics@gmail.com>

04e1af94

Fix CI . (#2118) · a2a97b05
Nicolas Patry authored Jun 25, 2024
```
Fix clippy.
```
a2a97b05

Add pytest release marker (#2114) · fc9c3153

Daniël de Kok authored Jun 25, 2024

* Add pytest release marker

Annotate a test with `@pytest.mark.release` and it only gets run
with `pytest integration-tests --release`.

* Mark many models as `release` to speed up CI
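
One common way to wire up such a marker is a `conftest.py` along the following lines. This is a sketch based on pytest's documented hooks, not necessarily the code in this commit; the helper `should_skip` is a hypothetical name introduced for illustration.

```python
# conftest.py (sketch)
import pytest


def pytest_addoption(parser):
    # Register a `--release` command-line flag; it is off by default.
    parser.addoption(
        "--release", action="store_true", default=False,
        help="also run tests marked with @pytest.mark.release",
    )


def pytest_configure(config):
    # Declare the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "release: run only with --release")


def should_skip(has_release_marker: bool, release_flag: bool) -> bool:
    # A release-marked test is skipped unless --release was passed.
    return has_release_marker and not release_flag


def pytest_collection_modifyitems(config, items):
    release_flag = config.getoption("--release")
    skip = pytest.mark.skip(reason="need --release option to run")
    for item in items:
        if should_skip("release" in item.keywords, release_flag):
            item.add_marker(skip)
```

With this in place, a plain `pytest integration-tests` run skips the marked tests, and adding `--release` runs them as well.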

fc9c3153

fix cpu and xpu issue (#2116) · e563983d
Wang, Yi authored Jun 25, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
e563983d

Removing IPEX_AVAIL. (#2115) · 9e2fdf57

Nicolas Patry authored Jun 25, 2024

* Removing IPEX_AVAIL.

Chose to unify CPU and XPU under `ipex`. Most of the code is identical
except for a few spots.

Most of those spots are in the kv-cache layout and the flash_xxx.py
files.
Since those files should be removed soon and factored away, we should
not need them.

* Forgot a few places.

* Unrelated change.

* Fixing HF_TOKEN.

* HF_TOKEN

9e2fdf57

feat: add simple tests for weights (#2092) · 3f3b7ffd

drbh authored Jun 25, 2024

* feat: add simple tests for weights

* fix: adjust types and add tests

* fix: adjust so all tests pass

* feat: improve weight tests

* fix: add missing tests and renames

* fix: tweak shapes

3f3b7ffd

Cpu tgi (#1936) · b64c70c9

Wang, Yi authored Jun 25, 2024



* add CPU tgi support
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* ipex distributed ops support
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>

b64c70c9

fix ChatCompletion and ChatCompletionChunk object string not compatible with... · b69f0780

sunxichen authored Jun 25, 2024


fix ChatCompletion and ChatCompletionChunk object string not compatible with standard openai api (#2089)
Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>

b69f0780

use xpu-smi to dump used memory (#2047) · 83634dc1

Wang, Yi authored Jun 25, 2024



* use xpu-smi to dump used memory
XPU uses `ZE_AFFINITY_MASK` to select cards; its usage is like `CUDA_VISIBLE_DEVICES`.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Update server/text_generation_server/utils/import_utils.py
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
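
To illustrate the parallel between the two variables, here is a small sketch that builds an environment restricted to a set of devices. The function `device_env` and the `"xpu"` device-type string are assumptions made for this example and do not come from the repository.

```python
import os
from typing import Dict, List


def device_env(device_type: str, device_ids: List[int]) -> Dict[str, str]:
    """Return a copy of the environment restricted to the given devices."""
    # Assumption: "xpu" selects cards via ZE_AFFINITY_MASK and anything else
    # via CUDA_VISIBLE_DEVICES; both take a comma-separated index list.
    var = "ZE_AFFINITY_MASK" if device_type == "xpu" else "CUDA_VISIBLE_DEVICES"
    env = dict(os.environ)
    env[var] = ",".join(str(i) for i in device_ids)
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching a tool such as `xpu-smi`.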

83634dc1

corrected Pydantic warning. (#2095) · 5b2155b0

Jeff authored Jun 25, 2024



* corrected Pydantic warning.

* Update clients/python/text_generation/types.py
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

5b2155b0

Add OTLP Service Name Environment Variable (#2076) · 1869ee2f

KevinDuffy94 authored Jun 25, 2024

* Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069

* Update Docs

* Update README.md

* Update Launcher Docs

* Update Launcher Docs
Removing Option

1869ee2f

Support `HF_TOKEN` environment variable (#2066) · 3447c722

Lucain authored Jun 25, 2024



* Support HF_TOKEN environment variable

* Load test.

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
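
A possible lookup order is sketched below. The fallback to the older `HUGGING_FACE_HUB_TOKEN` variable is an assumption of this example and is not stated in the commit message; `resolve_hub_token` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def resolve_hub_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the Hub token, preferring HF_TOKEN when it is set."""
    # Assumption: the legacy HUGGING_FACE_HUB_TOKEN variable is kept as a
    # fallback when HF_TOKEN is missing or empty.
    return env.get("HF_TOKEN") or env.get("HUGGING_FACE_HUB_TOKEN")
```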

3447c722

24 Jun, 2024 2 commits

Fix cargo-chef prepare (#2101) · 405765b1

ur4t authored Jun 25, 2024

* Fix cargo-chef prepare

In prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly.
If Cargo.lock is not present, cargo-chef will generate a new one first, which
might vary a lot and invalidate docker build caches.

* Fix Dockerfile_amd and Dockerfile_intel

405765b1

New runner. Manual squash. (#2110) · 480d3b33

Nicolas Patry authored Jun 24, 2024

* New runner. Manual squash.

* Network host.

* Put back trufflehog with proper extension.

* No network host ?

* Moving buildx install after tailscale ?

* 1.79

480d3b33

21 Jun, 2024 2 commits
- feat: sort cuda graphs in descending order (#2104) · 811a9381
  drbh authored Jun 21, 2024
  
  811a9381
- Fix `text-generation-server quantize` (#2103) · 197c47a3
  Daniël de Kok authored Jun 21, 2024
```
The subcommand did not work due to some broken imports.
```
  197c47a3
20 Jun, 2024 2 commits

Factor out sharding of packed tensors (#2059) · bcb3faa1

Daniël de Kok authored Jun 20, 2024

For Phi-3-Small I need to shard a packed QKV bias tensor, for which
I implemented the `Weights.get_packed_sharded` method. However, this
method can also replace the `Weights._get_qweight` method and the
custom sharding code from `Weights.get_weights_col_packed`.
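
The idea of sharding a packed tensor can be sketched on a flat list: each packed block (for example Q, K and V) is split across ranks separately, and the slices belonging to one rank are packed back together. This is an illustration under that assumption, not the implementation of `Weights.get_packed_sharded`; `shard_packed` and its parameters are hypothetical.

```python
from typing import List, Sequence


def shard_packed(values: Sequence[float], block_sizes: Sequence[int],
                 rank: int, world_size: int) -> List[float]:
    """Shard each packed block separately, then re-pack this rank's slices."""
    assert sum(block_sizes) == len(values), "blocks must cover the tensor"
    shard: List[float] = []
    start = 0
    for size in block_sizes:
        assert size % world_size == 0, "block must divide evenly across ranks"
        per_rank = size // world_size
        # Offset of this rank's slice inside the current block.
        lo = start + rank * per_rank
        shard.extend(values[lo:lo + per_rank])
        start += size
    return shard
```

Splitting the whole packed tensor into contiguous halves instead would give one rank all of Q and part of K, which is why the blocks are handled one at a time.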

bcb3faa1

Support exl2-quantized Qwen2 models (#2085) · f5a98375
Daniël de Kok authored Jun 20, 2024
```
Fixes #2081.
```
f5a98375

19 Jun, 2024 1 commit
- feat: rotate tests ci token (#2091) · cdbf8028
  drbh authored Jun 19, 2024
  
  cdbf8028
18 Jun, 2024 2 commits
- CI: pass pre-commit hooks again (#2084) · 11ea9ce0
  Daniël de Kok authored Jun 18, 2024
  
  11ea9ce0
- CI: Tailscale improvements (#2079) · 4f25c67d
  Guillaume LEGENDRE authored Jun 18, 2024
```
* test local tailscale

* Update build.yaml

* Update build.yaml

* Update build.yaml

* Update build.yaml

* wait for ssh

* network host

* change step order
```
  4f25c67d
17 Jun, 2024 4 commits

Set maximum grpc message receive size to 2GiB (#2075) · c8c7ccd3

Daniël de Kok authored Jun 17, 2024

* Set maximum grpc message receive size to 2GiB

The previous default was 4MiB, which is too small for the payloads
of multi-modal models.

* Update to Rust 1.79.0

* Fixup formatting to make PR pass
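
For a sense of scale, the two limits can be written out as follows. The sketch assumes the limit has to fit in a signed 32-bit integer, so it uses 2^31 - 1 bytes (one byte under 2GiB); the constant names and the `fits` helper are hypothetical.

```python
MIB = 1024 ** 2

OLD_DEFAULT = 4 * MIB       # 4 MiB, the previous default
NEW_LIMIT = (1 << 31) - 1   # largest signed 32-bit value, just under 2 GiB

# Arguments in the (name, value) form accepted by grpc's `options=` parameter.
GRPC_OPTIONS = [("grpc.max_receive_message_length", NEW_LIMIT)]


def fits(message_bytes: int, limit: int) -> bool:
    """Return True when a message of the given size is within the limit."""
    return message_bytes <= limit
```

A 16 MiB image payload, for instance, exceeds the old default but is well within the new limit.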

c8c7ccd3

fix build.rs watch files (#2072) · 0f7d38e7
Ziru Niu authored Jun 17, 2024

0f7d38e7
Contributing guide & Code of Conduct (#2074) · 13183891
Lysandre Debut authored Jun 17, 2024
```
* Contributing guide & Code of Conduct

* Redirect to GitHub's tutorial on PRs
```
13183891

Support different image sizes in prefill in VLMs (#2065) · e9037708

Daniël de Kok authored Jun 17, 2024

When a batch contained images of different sizes during prefill, the
server would fail (see e.g. #2056). Images were processed separately and
then concatenated, which fails when the images have different sizes.

Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.
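
Why processing the batch together helps can be shown with a toy example. Padding to the largest size in the batch is an assumed strategy chosen for illustration; the commit message only says that the image processor makes the tensors compatible. `pad_batch` is a hypothetical helper operating on nested lists.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values (single channel for brevity)


def pad_batch(images: List[Image], fill: int = 0) -> List[Image]:
    """Pad every image to the largest height and width found in the batch."""
    height = max(len(img) for img in images)
    width = max(len(row) for img in images for row in img)
    padded = []
    for img in images:
        # Extend each row to the common width, then add missing rows.
        rows = [row + [fill] * (width - len(row)) for row in img]
        rows += [[fill] * width for _ in range(height - len(rows))]
        padded.append(rows)
    return padded
```

The common size is only known once all images are seen, which is not possible when each image is processed on its own.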

e9037708

14 Jun, 2024 3 commits

Adding architecture document (#2044) · 445f3135

Alvaro Moran authored Jun 14, 2024



* doc: adding architecture document

* doc: add architecture to toctree

* fix: avoid cargo lock changes

* fix: avoid cargo lock tweak

---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>

445f3135

Update the link for qwen2 (#2068) · 96b7b40c

Tiezhen WANG authored Jun 14, 2024



* Update the link for qwen2

* Fix Qwen2 model URL in model table

* Fix too eager staging

---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>

96b7b40c

Add support for GPTQ Marlin (#2052) · 093a27c5

Daniël de Kok authored Jun 14, 2024

Add support for GPTQ Marlin kernels

GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:

- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false

Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.

The kernels were contributed by Neural Magic to vLLM. We vendor them
here for convenience.
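
The supported configurations listed above can be expressed as a simple check. This is a sketch only; the constant and function names are hypothetical and do not come from the repository.

```python
SUPPORTED_BITS = {4, 8}
SUPPORTED_GROUPSIZES = {-1, 32, 64, 128}


def gptq_marlin_supported(bits: int, groupsize: int, desc_act: bool) -> bool:
    """Check a GPTQ configuration against the values listed above."""
    # desc_act may be either True or False, so it does not restrict the result.
    return bits in SUPPORTED_BITS and groupsize in SUPPORTED_GROUPSIZES
```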

093a27c5

13 Jun, 2024 2 commits

implement Open Inference Protocol endpoints (#1942) · f433f1f7

drbh authored Jun 13, 2024

* feat: add kserve feature and basic routes

* feat: implement infer endpoint wrapper around generate

* fix: refactor and improve types

* fix: improve infer and simplify

* fix: cleanup and improve api docs

* fix: refactor and encapsulate kserve feat in file

* fix: remove typos after rebase

f433f1f7

PR #2049 CI run (#2054) · 42aa8ee1

drbh authored Jun 13, 2024



* Use minijinja's pycompat mode for python methods

* fix: cargo fmt lint for pre commit

---------
Co-authored-by: Armin Ronacher <armin.ronacher@active-4.com>

42aa8ee1

12 Jun, 2024 2 commits
- fix(layers): fix SuRotaryEmbedding (#2060) · 90184df7
  OlivierDehaene authored Jun 12, 2024
```
* fix(layers): fix SuRotaryEmbedding

* change arange

* remove logs
```
  90184df7
- fix(server): fix OPT implementation (#2061) · 521de6ca
  OlivierDehaene authored Jun 12, 2024
  
  521de6ca
11 Jun, 2024 2 commits
- Support chat response format (#2046) · 376a0b7a
  drbh authored Jun 11, 2024
```
* feat: support response_format in chat

* fix: adjust typos

* fix: add trufflehog lint
```
  376a0b7a
- Update LLMM1 bound (#2050) · a6e4d63c
  fxmarty authored Jun 11, 2024
```
update commit
```
  a6e4d63c
10 Jun, 2024 1 commit
- fix(ci): remove unnecessary permissions (#2045) · dfca1dfc
  Luc Georges authored Jun 10, 2024
  
  dfca1dfc