Commits · ab7ccf5bc3c84e07d0faf0d950421fcdc29743b5 · OpenDAS / text-generation-inference

21 Nov, 2024 7 commits
- feat: add payload limit (#2726) · ab7ccf5b
  OlivierDehaene authored Nov 21, 2024
```
* feat: add payload limit

* update launcher
```
  ab7ccf5b
- feat: Add automatic nightly benchmarks (#2591) · d5bc6a20
  Hugo Larcher authored Nov 21, 2024
```
* feat: Add automatic nightly benchmarks

* fix: Update runners group

* fix: add created_at field to results

* fix: Add variable results file location
```
  d5bc6a20
- Remove guideline from API (#2762) · d012f229
  Lucain authored Nov 21, 2024
  
  d012f229
- docs: Add a README section about using Nix (#2767) · c5b5b3a1
  Daniël de Kok authored Nov 21, 2024
  
  c5b5b3a1
- fix: tweak grammar test response (#2769) · faa10ad0
  drbh authored Nov 21, 2024
  
  faa10ad0
- fix: incomplete generations w/ single tokens generations and models that did... · 8e0c161d
  OlivierDehaene authored Nov 21, 2024
```
fix: incomplete generations w/ single tokens generations and models that did not support chunking (#2770)

* Incomplete generation stream fix (#2754)

entries.len() could > batch.size in prefill, so need to filter as well.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* entries was wrongly extended for model that did not support chunking

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
```
  8e0c161d
- nix: downgrade to outlines 0.1.3 (#2768) · 3c544886
  Daniël de Kok authored Nov 21, 2024
  
  3c544886
20 Nov, 2024 5 commits
- fix: set outlines version to 0.1.3 to avoid caching serialization issue (#2766) · 6ee8d6dd
  drbh authored Nov 20, 2024
```
fix: set outlines version to 0.1.3 to avoid bug
```
  6ee8d6dd
- nix: build and cache impure devshells (#2765) · 07bed530
  Daniël de Kok authored Nov 20, 2024
```
* nix: build and cache all devshells

* nix: add poetry to the impure shell

This shouldn't be used to manage dependencies in a Nix devshell, but can
be handy to update `poetry.lock`.

* Fix Nix build, disable pure shell (covered by Nix tests)
```
  07bed530
- Add support for wNa16 int 2:4 compressed-tensors checkpoints (#2758) · 46a5a7e7
  Daniël de Kok authored Nov 20, 2024
```
This change adds support for wNa16 int checkpoints with 2:4 sparsity
using Marlin 2:4 kernels.
```
  46a5a7e7
- nix: update for outlines 0.1.4 (#2764) · 2fda8845
  Daniël de Kok authored Nov 20, 2024
  
  2fda8845
- Install compressed-tensors in Docker CPU builds · 45013b60
  Daniël de Kok authored Nov 20, 2024
  
  45013b60
19 Nov, 2024 4 commits

fix: adjust llama MLP name from dense to mlp to correctly apply lora (#2760) · bd6e8b3c
drbh authored Nov 19, 2024

bd6e8b3c

PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme (#2645) · 5489406c

drbh authored Nov 19, 2024



* add OpenAI like tool_choice for named choice

* add tests

* fix: run linter and bump api docs

* fix: consolidate changes and remove old tool type

* feat: improve, simplify and rename tool choice struct add required support and refactor

* fix: simplify tool choice logic, improve tests, openapi and rust docs

* fix: refactor away prepare_chat_input and improve tool grammar apply control flow

* feat: update docs and add tool choice configuration section

* fix: simplify naming, tool choice default and improve test

* fix: adjust tool choice none logic, add test and small refactors

* fix: add missing snapshot file

* fix: adjust tool choice type in test

* fix: adjust default when json tool choice is

* fix: remove trailing space lint after rebase

* fix: remove mostly mocked unit test

---------
Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>

5489406c

Update to moe-kernels 0.7.0 (#2720) · 2007a947
Daniël de Kok authored Nov 19, 2024
```
This version syncs with the vLLM kernels and brings some performance
improvements.
```
2007a947
Simplify two ipex conditions (#2755) · b4ec427a
Daniël de Kok authored Nov 19, 2024

b4ec427a

18 Nov, 2024 4 commits

feat: support flash attention 2 in qwen2 vl vision blocks (#2721) · 38cff84a

drbh authored Nov 18, 2024

* feat: support flash attention 2 in qwen2 vl vision blocks

* fix: calc max_seqlen once and small refactors

38cff84a

Add support for compressed-tensors w8a8 int checkpoints (#2745) · 3c9df21f

Daniël de Kok authored Nov 18, 2024



* Add support for compressed-tensors w8a8 int checkpoints

This change adds a loader for w8a8 int checkpoints. One large benefit of
int8 support is that the corresponding cutlass matmul kernels also work on
compute capability 7.5.

Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:

|     Tasks     |Version|     Filter     |n-shot|        Metric         |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
|gsm8k_cot_llama|      3|flexible-extract|     8|exact_match            |↑  |0.8431|±  |0.0100|
|               |       |strict-match    |     8|exact_match            |↑  |0.8393|±  |0.0101|
|ifeval         |      4|none            |     0|inst_level_loose_acc   |↑  |0.8597|±  |   N/A|
|               |       |none            |     0|inst_level_strict_acc  |↑  |0.8201|±  |   N/A|
|               |       |none            |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|               |       |none            |     0|prompt_level_strict_acc|↑  |0.7468|±  |0.0187|

Which is the same ballpark as vLLM.

As usual, lots of thanks to Neural Magic/vLLM for the kernels.

* Always use dynamic input quantization for w8a8 int

It's far less flaky and gives better output.

* Use marlin-kernels 0.3.5

* Fix a typo
Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Small fixes

---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>

3c9df21f

add ipex moe implementation to support Mixtral and PhiMoe (#2707) · a5ecd6e5

Wang, Yi authored Nov 19, 2024



* add ipex moe implementation to support Mixtral and PhiMoe
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update to ipex xpu 2.5
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* torch has xpu support in 2.5
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix oneapi basekit version
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Apply suggestions from code review
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

a5ecd6e5

fix: improve find_segments via numpy diff (#2686) · fea62e92
drbh authored Nov 18, 2024

fea62e92

17 Nov, 2024 1 commit

Remove vLLM dependency for CUDA (#2751) · 52e48739

Daniël de Kok authored Nov 17, 2024

* Remove vLLM dependency for CUDA

This change adds `attention-kernels` as a dependency for paged
attention and cache reshaping. With that, we don't use vLLM
anywhere for CUDA.

Tested run (since we don't have paged attention in CI):

```
❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
[...]
5 snapshots passed.
```

* Fix clippy warning

52e48739

15 Nov, 2024 7 commits

feat: return streaming errors as an event formatted for openai's client (#2668) · 6489f852

drbh authored Nov 15, 2024



* feat: return streaming errors as an event formatted for openai's client

* fix: propagate completions error events to stream

* fix: improve stream api error format and add status code

* fix: improve streamin error to include error_type

* Revert "fix: improve streamin error to include error_type"

This reverts commit 2b1a360b1511d94ea9a24e5432e498e67939506a.

* Reworked the implementation.

* Revert "Reworked the implementation."

This reverts commit 7c3f29777f17411ae4ade57e2f88e73cde704ee5.

* Small lifting.

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

6489f852

Upgrading our deps. (#2750) · 34a3bded
Nicolas Patry authored Nov 15, 2024
```
* Upgrading our deps.

* fixup.

* Fixup.
```
34a3bded

Upgrade outlines to 0.1.1 (#2742) · 4580ced0

Alex Weston authored Nov 15, 2024



* Upgrade outlines to 0.1.1

* Update for new API

* Check if allowed tokens is None

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

4580ced0

fix response type of document for Text Generation Inference (#2743) · 003eaec0
jito authored Nov 15, 2024
```
Signed-off-by: jitokim <pigberger70@gmail.com>
```
003eaec0

Fix: Change embeddings to embedding (#2738) · 4f4857a4

Billel Mokeddem authored Nov 15, 2024



fix: change embeddings to embedding
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>

4f4857a4

Fix: Change model_type from ssm to mamba (#2740) · f9ee46f7
Billel Mokeddem authored Nov 15, 2024
```
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>
```
f9ee46f7
benchmark: fix prefill throughput (#2741) · 8442f1ac
Daniël de Kok authored Nov 15, 2024

8442f1ac

14 Nov, 2024 1 commit
- nix: update nixpkgs (#2746) · ca4f46dd
  Daniël de Kok authored Nov 14, 2024
```
Updates from Triton 2.1.0 to 3.1.0 (among other things).
```
  ca4f46dd
10 Nov, 2024 1 commit

Add initial support for compressed-tensors checkpoints (#2732) · a7850008

Daniël de Kok authored Nov 10, 2024

compressed-tensors is a safetensors extension for sparse, quantized
tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
quantization, because

- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight
  quantizers.
- Configurable exclusions for quantization.

This change adds a dependency on the `compressed-tensors` package for
its configuration parsing and layer matching functionality.

The following types of quantization are supported in this PR:

- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.

Support for other quantization types will be added in subsequent PRs.

a7850008

07 Nov, 2024 1 commit
- add trust_remote_code in tokenizer to fix baichuan issue (#2725) · 97f7a22f
  Wang, Yi authored Nov 07, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  97f7a22f
04 Nov, 2024 6 commits
- fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… (#2717) · b1f9044d
  Wang, Yi authored Nov 04, 2024
```
fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ
ipex kernel provide func like add_bias, so no need add it outside
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  b1f9044d
- nix: move to tgi-nix `main` (#2718) · 5eedb2ec
  Daniël de Kok authored Nov 04, 2024
  
  5eedb2ec
- Fixing linting on main. (#2719) · 9fde5666
  Nicolas Patry authored Nov 04, 2024
  
  9fde5666
- Fix prefix caching + speculative decoding (#2711) · aadc9cb4
  Travis Addair authored Nov 04, 2024
  
  aadc9cb4
- Hotfixing auto length (warmup max_s was wrong). (#2716) · a5593ba8
  Nicolas Patry authored Nov 04, 2024
  
  a5593ba8
- fix: add chat_tokenize endpoint to api docs (#2710) · 08c4184e
  drbh authored Nov 04, 2024
  
  08c4184e
02 Nov, 2024 1 commit

fix: create position ids for text only input (#2714) · 6e322052

drbh authored Nov 01, 2024

* fix: create position ids for text only input

* fix: prefer repeat over expand to avoid clone

6e322052

01 Nov, 2024 1 commit

fix cuda graphs for qwen2-vl (#2708) · 01dacf8e

drbh authored Oct 31, 2024



* feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl

* fix: only check model type if config exists

* fix: adjust sharding and lm head logic

* fix qwen2 failure in intel cpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix: return correct shape logits and add streaming test

* fix: remove unused import and refactor test

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

01dacf8e

30 Oct, 2024 1 commit

Support qwen2 vl (#2689) · befd9f67

drbh authored Oct 30, 2024

* feat: add support for qwen2 vl model

* feat: fix token padding, enable warmup and process basic request

* fix: improve get_position_ids, add lift embed_tokens

* fix: remove get_cos_sin_hack dev function

* feat: add simple test chat with meesage and text

* fix: lint test

* fix: adjust positional embeddings for multi dimensional position ids

* fix: update docs and lint unused vars

* fix: include linted file

* fix: add norm after text output

* fix: format model file

* fix: adjust for ruff lints

* fix: remove unused rotate_half

* feat: refactors and calc num features

* fix: prefer position_ids passed from vlm causal lm and reset ids on batch

* fix: adjust get_position_ids if not available and add required args to signatures

* fix: adjust resize case for qwen2_vl warmup

* fix: avoid qwen2 vl specific paths with qwen2

befd9f67