- 25 Jun, 2024 10 commits
-
-
Daniël de Kok authored
* Add pytest release marker. Annotate a test with `@pytest.mark.release` and it only gets run with `pytest integration-tests --release`.
* Mark many models as `release` to speed up CI.
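The marker setup described in this commit might look roughly like the following `conftest.py` sketch. Only the `--release` flag and the `release` marker name come from the commit message; the hook wiring is an illustrative assumption.

```python
# Sketch of an integration-tests/conftest.py supporting a `--release` flag;
# illustrative, not the repository's actual conftest.
def pytest_addoption(parser):
    parser.addoption(
        "--release", action="store_true", default=False,
        help="also run tests marked with @pytest.mark.release",
    )

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "release: slow tests that only run for releases"
    )

def select_outcome(has_release_marker: bool, release_flag: bool) -> str:
    """Decide whether a collected test runs or is skipped."""
    return "skip" if has_release_marker and not release_flag else "run"

def pytest_collection_modifyitems(config, items):
    import pytest  # deferred so this sketch imports even without pytest
    if config.getoption("--release"):
        return
    skip = pytest.mark.skip(reason="release test: pass --release to run")
    for item in items:
        if select_outcome("release" in item.keywords, False) == "skip":
            item.add_marker(skip)
```

With this in place, `pytest integration-tests` skips every `release`-marked test, while `pytest integration-tests --release` runs the full suite.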
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
* Removing IPEX_AVAIL. Chose to unify CPU and XPU under `ipex`. Most of the code is identical except for a very few spots; most of those are the kv-cache layout and the flash_xxx.py files. Since those files should be removed soon and factored away, we should not need them.
* Forgot a few places.
* Unrelated change.
* Fixing HF_TOKEN.
* HF_TOKEN
-
drbh authored
* feat: add simple tests for weights
* fix: adjust types and add tests
* fix: adjust so all tests pass
* feat: improve weight tests
* fix: add missing tests and renames
* fix: tweak shapes
-
Wang, Yi authored
* add CPU tgi support
* ipex distributed ops support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
-
sunxichen authored
Fix ChatCompletion and ChatCompletionChunk object strings not being compatible with the standard OpenAI API (#2089).

Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>
-
Wang, Yi authored
* Use xpu-smi to dump used memory. XPU uses "ZE_AFFINITY_MASK" to control card selection; usage is like CUDA_VISIBLE_DEVICES.
* Update server/text_generation_server/utils/import_utils.py

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
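The CUDA/XPU parallel this commit draws can be captured in a tiny helper. The variable names are real; the helper itself is illustrative, not TGI code.

```python
def visible_devices_env(backend: str) -> str:
    """Return the environment variable that restricts which accelerator
    cards a process sees. As the commit notes, XPU's ZE_AFFINITY_MASK
    plays the same role as CUDA_VISIBLE_DEVICES."""
    names = {
        "cuda": "CUDA_VISIBLE_DEVICES",
        "xpu": "ZE_AFFINITY_MASK",
    }
    return names[backend]
```

For example, launching with `ZE_AFFINITY_MASK=0,1` restricts an XPU process to cards 0 and 1, just as `CUDA_VISIBLE_DEVICES=0,1` would on NVIDIA hardware.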
-
Jeff authored
* Corrected Pydantic warning.
* Update clients/python/text_generation/types.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
-
KevinDuffy94 authored
* Adding Service Name environment variable for https://github.com/huggingface/text-generation-inference/issues/2069
* Update Docs
* Update README.md
* Update Launcher Docs
* Update Launcher Docs: removing option
-
Lucain authored
* Support HF_TOKEN environment variable.
* Load test.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
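Token resolution along the lines of this commit could look like the sketch below. The two variable names are the real Hugging Face ones; the helper and its precedence order are assumptions, not TGI's actual code.

```python
import os

def resolve_hf_token(env=None):
    """Sketch: prefer the newer HF_TOKEN variable, falling back to the
    legacy HUGGING_FACE_HUB_TOKEN name; returns None if neither is set."""
    env = os.environ if env is None else env
    return env.get("HF_TOKEN") or env.get("HUGGING_FACE_HUB_TOKEN")
```

Supporting both names lets existing deployments keep their old configuration while new ones use the shorter `HF_TOKEN`.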
-
- 24 Jun, 2024 2 commits
-
-
ur4t authored
* Fix cargo-chef prepare. In the prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly. If Cargo.lock is not present, cargo-chef will generate a new one first, which might vary a lot and invalidate Docker build caches.
* Fix Dockerfile_amd and Dockerfile_intel
-
Nicolas Patry authored
* New runner. Manual squash.
* Network host.
* Put back trufflehog with proper extension.
* No network host?
* Moving buildx install after tailscale?
* 1.79
-
- 21 Jun, 2024 2 commits
-
-
drbh authored
-
Daniël de Kok authored
The subcommand did not work due to some broken imports.
-
- 20 Jun, 2024 2 commits
-
-
Daniël de Kok authored
For Phi-3-Small I need to shard a packed QKV bias tensor, for which I implemented the `Weights.get_packed_sharded` method. However, this method can also replace the `Weights._get_qweight` method and the custom sharding code from `Weights.get_weights_col_packed`.
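A toy version of that packed sharding, with a flat list of rows standing in for the tensor: each of the Q, K and V blocks is sharded independently so every tensor-parallel rank gets a contiguous slice of each block. The function name and signature are assumptions; the real `Weights.get_packed_sharded` operates on torch tensors.

```python
def shard_packed_qkv(packed, q_size, kv_size, rank, world_size):
    """Split a packed [Q; K; V] weight (here: a flat list of rows) into
    the rows belonging to one tensor-parallel shard. Q has q_size rows,
    K and V have kv_size rows each."""
    q = packed[:q_size]
    k = packed[q_size:q_size + kv_size]
    v = packed[q_size + kv_size:q_size + 2 * kv_size]

    def slice_block(block):
        # Each block is sharded evenly across ranks.
        assert len(block) % world_size == 0
        step = len(block) // world_size
        return block[rank * step:(rank + 1) * step]

    # Re-pack the per-rank slices in the same Q/K/V order.
    return slice_block(q) + slice_block(k) + slice_block(v)
```

Naively slicing the packed tensor as one block would mix Q rows from one shard with K/V rows from another, which is exactly what sharding each block separately avoids.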
-
Daniël de Kok authored
Fixes #2081.
-
- 19 Jun, 2024 1 commit
-
-
drbh authored
-
- 18 Jun, 2024 2 commits
-
-
Daniël de Kok authored
-
Guillaume LEGENDRE authored
* test local tailscale
* Update build.yaml
* Update build.yaml
* Update build.yaml
* Update build.yaml
* wait for ssh
* network host
* change step order
-
- 17 Jun, 2024 4 commits
-
-
Daniël de Kok authored
* Set maximum gRPC message receive size to 2 GiB. The previous default was 4 MiB, which doesn't really work well for multi-modal models.
* Update to Rust 1.79.0
* Fixup formatting to make PR pass
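In grpc-python terms, raising the limit looks like the sketch below. The option keys are real gRPC channel arguments; how TGI wires them into its server is simplified away, and since message lengths are 32-bit integers, "2 GiB" in practice means the int32 maximum.

```python
# Hedged sketch: lifting gRPC's receive limit from the 4 MiB default.
MAX_MESSAGE_SIZE = (1 << 31) - 1  # int32 max, i.e. just under 2 GiB

GRPC_OPTIONS = [
    ("grpc.max_receive_message_length", MAX_MESSAGE_SIZE),
    ("grpc.max_send_message_length", MAX_MESSAGE_SIZE),
]

# Usage (assumes grpcio is installed):
#   server = grpc.aio.server(options=GRPC_OPTIONS)
```

Large image tensors in multi-modal prefill requests easily exceed 4 MiB, which is why the default was too small.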
-
Ziru Niu authored
-
Lysandre Debut authored
* Contributing guide & Code of Conduct * Redirect to GitHub's tutorial on PRs
-
Daniël de Kok authored
When a batch contained images of different sizes during prefill, the server would fail (see e.g. #2056). Images were processed separately and then concatenated. However, this can fail for images with different sizes. Fix this by preprocessing all images in the batch together, so that the image processor can ensure that all image tensors have compatible sizes.
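The idea behind the fix can be sketched with nested lists standing in for pixel tensors: pad every image in the batch to the largest height and width so the results stack cleanly. This is illustrative only; the real code delegates the padding to the model's image processor.

```python
def batch_images(images, pad_value=0):
    """Pad a batch of 2-D "images" (lists of row lists) to a common
    height and width so they can be stacked into one batch tensor."""
    max_h = max(len(img) for img in images)
    max_w = max(len(row) for img in images for row in img)
    batch = []
    for img in images:
        # Pad each row to max_w, then add all-pad rows up to max_h.
        padded = [row + [pad_value] * (max_w - len(row)) for row in img]
        padded += [[pad_value] * max_w for _ in range(max_h - len(padded))]
        batch.append(padded)
    return batch
```

Processing images one at a time and concatenating afterwards has no opportunity to do this joint padding, which is why mixed-size batches failed.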
-
- 14 Jun, 2024 3 commits
-
-
Alvaro Moran authored
* doc: adding architecture document
* doc: add architecture to toctree
* fix: avoid cargo lock changes
* fix: avoid cargo lock tweak

Co-authored-by: drbh <david.richard.holtz@gmail.com>
-
Tiezhen WANG authored
* Update the link for qwen2
* Fix Qwen2 model URL in model table
* Fix too eager staging

Co-authored-by: Daniël de Kok <me@danieldk.eu>
-
Daniël de Kok authored
Add support for GPTQ Marlin kernels. GPTQ Marlin extends the Marlin kernels to support common GPTQ configurations:
- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false

Using the GPTQ Marlin kernels requires repacking the parameters in the Marlin quantizer format. The kernels were contributed by Neural Magic to vLLM. We vendor them here for convenience.
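The supported combinations can be captured in a tiny guard. This helper is illustrative, not TGI's actual gating code; only the listed values come from the commit message.

```python
def gptq_marlin_compatible(bits: int, groupsize: int, desc_act: bool) -> bool:
    """True if a GPTQ checkpoint config matches a combination the Marlin
    kernels support. desc_act is accepted either way, so it does not
    restrict anything here."""
    del desc_act  # both True and False are supported
    return bits in (4, 8) and groupsize in (-1, 32, 64, 128)
```

A check like this lets the loader fall back to the plain GPTQ path for configurations (e.g. 3-bit, or unusual group sizes) that Marlin cannot handle.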
-
- 13 Jun, 2024 2 commits
-
-
drbh authored
* feat: add kserve feature and basic routes
* feat: implement infer endpoint wrapper around generate
* fix: refactor and improve types
* fix: improve infer and simplify
* fix: cleanup and improve api docs
* fix: refactor and encapsulate kserve feat in file
* fix: remove typos after rebase
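A client request to an infer endpoint of this kind might look like the sketch below. The envelope shape (`inputs`, `name`, `shape`, `datatype`, `data`) follows the KServe v2 "Open Inference Protocol"; the input name `text_input` and how TGI maps it onto generate are assumptions.

```python
import json

def build_v2_infer_request(prompt: str) -> str:
    """Sketch of a KServe v2 infer request body wrapping a text prompt.
    Illustrative only; not taken from TGI's kserve module."""
    body = {
        "inputs": [
            {
                "name": "text_input",   # assumed input name
                "shape": [1],
                "datatype": "BYTES",    # v2 datatype for string payloads
                "data": [prompt],
            }
        ]
    }
    return json.dumps(body)
```

Such a body would typically be POSTed to a path like `/v2/models/{model}/infer`, per the v2 protocol.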
-
drbh authored
* Use minijinja's pycompat mode for python methods
* fix: cargo fmt lint for pre-commit

Co-authored-by: Armin Ronacher <armin.ronacher@active-4.com>
-
- 12 Jun, 2024 2 commits
-
-
OlivierDehaene authored
* fix(layers): fix SuRotaryEmbedding
* change arange
* remove logs
-
OlivierDehaene authored
-
- 11 Jun, 2024 2 commits
- 10 Jun, 2024 4 commits
-
-
Luc Georges authored
-
Luc Georges authored
-
Daniël de Kok authored
Add support for Phi-3-medium. The main difference between the medium and mini models is that medium uses grouped-query attention with a packed QKV matrix. This change adds support for GQA with packed matrices to `Weights.get_weights_col_packed` and uses it for Phi-3. This also allows us to remove the custom implementation of GQA from dbrx attention loading.
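The arithmetic behind sharding a GQA-packed QKV matrix is simple: query heads and the (fewer) key/value heads are divided across ranks independently. A sketch of that arithmetic, with an assumed function name, not the actual implementation:

```python
def gqa_packed_shard_rows(num_heads, num_kv_heads, head_dim, world_size):
    """Rows of the packed [Q; K; V] matrix that each tensor-parallel
    rank owns under grouped-query attention."""
    assert num_heads % world_size == 0
    assert num_kv_heads % world_size == 0
    q_rows = (num_heads // world_size) * head_dim
    kv_rows = (num_kv_heads // world_size) * head_dim
    # Each rank holds its Q slice plus one K and one V slice.
    return q_rows, kv_rows, q_rows + 2 * kv_rows
```

For a hypothetical model with 32 query heads, 8 KV heads, head dimension 128, and 4-way tensor parallelism, each rank would own 1024 Q rows and 256 rows each of K and V.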
-
fxmarty authored
* update vllm commit & fix models using sliding window
* update
* update commit
* fix bug where tunableop is bound to cuda graph even when cuda graphs are disabled
* enable tunableop by default
* fix sliding window
* address review
* dead code
* precise comment
* is it flaky?
-
- 07 Jun, 2024 1 commit
-
-
Daniël de Kok authored
The router now sends the input as chunks rather than as a single string. This change modifies the server to process chunked input rather than strings. This also allows us to remove the image extraction code from the server.
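The server-side handling can be sketched as follows. The chunk representation (`(kind, payload)` tuples) is an assumption for illustration; the real chunks are protobuf messages.

```python
def flatten_chunks(chunks):
    """Split typed input chunks into text (for tokenization) and images
    (for the image processor). Sketch only, not TGI's actual code."""
    text_parts, images = [], []
    for kind, payload in chunks:
        if kind == "text":
            text_parts.append(payload)
        elif kind == "image":
            images.append(payload)
        else:
            raise ValueError(f"unknown chunk type: {kind}")
    return "".join(text_parts), images
```

With the router already splitting text from images, the server no longer needs to parse image references out of a single prompt string, which is the extraction code this commit removes.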
-
- 06 Jun, 2024 3 commits
-
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
This reverts commit 101ac9a7.
-
Nicolas Patry authored
-