- 06 Dec, 2024 2 commits
-
-
OlivierDehaene authored
* feat: auto max_new_tokens
* update default
* Fixing the tests.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 04 Dec, 2024 1 commit
-
-
drbh authored
-
- 03 Dec, 2024 2 commits
-
-
Nicolas Patry authored
* Saving some VRAM.
  - 8B on 4xL4, attention=flashdecoding. Before 4.28GB left, after 4.32GB left, so 400MB saved.
  - Effect not as visible on attention=flashinfer and n_shard=1. I suspect it's linked to the torch allocator.
* Adding assertion.
-
Daniël de Kok authored
* Sync (most) server dependencies with Nix
  Skipped most grpcio packages because of a protobuf version incompatibility with the opentelemetry packages.
* Add a primitive script to generate Poetry commands to sync with Nix
  This is not fully automated, since getting the Nix versions may be unresolvable. However, it does take most of the work out of doing this manually.
* Upgrade eetq?
* Fmt.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
- 02 Dec, 2024 4 commits
-
-
Dmitry Rogozhkin authored
Llama 3 has a list of values as eos_token_id: "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']". This breaks the tokenizer, which expects a single value, so this commit uses tokenizer.eos_token_id instead in that case.
Fixes: #2440
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
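A minimal sketch of the fallback described above, assuming a transformers-style config and tokenizer; the helper name is hypothetical and this is not the actual TGI code:

```python
# Hedged sketch of the fallback described above; helper name and structure
# are illustrative, not the actual TGI implementation.
from transformers import AutoConfig, AutoTokenizer

def resolve_eos_token_id(model_id: str) -> int:
    config = AutoConfig.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    eos = config.eos_token_id
    if isinstance(eos, (list, tuple)):
        # Llama 3 style configs expose several end tokens; code paths that
        # expect a single id fall back to the tokenizer's eos_token_id.
        return tokenizer.eos_token_id
    return eos
```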
-
drbh authored
-
Torsten Raudssus authored
-
Nicolas Patry authored
-
- 28 Nov, 2024 1 commit
-
-
drbh authored
* feat: support continue_final_message param in chat request
* feat: add test for continue final message
* fix: bump openapi docs
* fix: remove continue_final_message chat request param
* fix: remove unneeded launcher args in continue test
* fix: bump test output
* fix: remove accidentally included guideline from rebase
* fix: remove guideline tests
* fix: adjust continuation tests expected text
* fix: replace expected output for continue test
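For illustration, a hedged sketch of the kind of request this feature targets: the conversation ends with a partial assistant message that should be continued rather than answered as a new turn. The endpoint URL, the placeholder model name, and whether continuation is triggered by an explicit flag or inferred from the trailing assistant role are all assumptions, not the final API.

```python
# Hedged illustration of continuing a final assistant message; the exact
# trigger (explicit flag vs. inferred from the trailing assistant role)
# and the local endpoint URL are assumptions.
import requests

payload = {
    "model": "tgi",
    "messages": [
        {"role": "user", "content": "List three prime numbers."},
        # The final message is an unfinished assistant turn to be continued,
        # not answered as a fresh user prompt.
        {"role": "assistant", "content": "The first three primes are 2, 3, and"},
    ],
    "max_tokens": 16,
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```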
-
- 26 Nov, 2024 3 commits
-
-
jp authored
Fix typo in model loading code
-
Wang, Yi authored
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageattention)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
The compressed-tensors configuration can also specify the KV cache configuration. Use an FP8 KV cache when the configuration tells us to do so (all other options and types are ignored for now).
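A minimal sketch of that decision, assuming the checkpoint's config.json carries a compressed-tensors quantization_config with a kv_cache_scheme entry; the field names are assumptions, not the TGI loader code:

```python
# Hedged sketch: decide whether to enable an FP8 KV cache from a
# compressed-tensors style config. Field names are assumptions.
import json

def wants_fp8_kv_cache(config_path: str) -> bool:
    with open(config_path) as f:
        config = json.load(f)
    quant = config.get("quantization_config") or {}
    kv_scheme = quant.get("kv_cache_scheme") or {}
    # Only an 8-bit float KV cache scheme is honoured; all other options
    # and types are ignored for now, as the commit notes.
    return kv_scheme.get("type") == "float" and kv_scheme.get("num_bits") == 8
```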
-
- 25 Nov, 2024 2 commits
-
-
Daniël de Kok authored
* Move JSON grammar -> regex grammar conversion to the router
  This change moves the JSON grammar -> regex grammar conversion to the router by adding a dependency on the `outlines-core` Rust crate. In contrast to the Python implementation, the conversions are not LRU-cached since they seem to be fast enough:

  simple schema    time: [5.8293 µs 5.8307 µs 5.8320 µs]
                   change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05)
                   Performance has improved.
  complex schema   time: [14.875 µs 14.881 µs 14.887 µs]
                   change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05)
                   Performance has improved.

  Using the schemas from: https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
-
drbh authored
* feat: concat the adapter id to the model id in chat response * fix: updated to include only the adapter id in chat response
-
- 22 Nov, 2024 2 commits
-
-
OlivierDehaene authored
* chore: prepare 2.4.1 release * fix tests * fmt
-
Daniël de Kok authored
This fixes a bug in 2:4 Marlin: https://github.com/vllm-project/vllm/pull/10464
-
- 21 Nov, 2024 7 commits
-
-
OlivierDehaene authored
* feat: add payload limit * update launcher
-
Hugo Larcher authored
* feat: Add automatic nightly benchmarks
* fix: Update runners group
* fix: add created_at field to results
* fix: Add variable results file location
-
Lucain authored
-
Daniël de Kok authored
-
drbh authored
-
OlivierDehaene authored
fix: incomplete generations w/ single-token generations and models that did not support chunking (#2770)
* Incomplete generation stream fix (#2754): entries.len() could be > batch.size in prefill, so need to filter as well.
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* entries was wrongly extended for models that did not support chunking
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
-
Daniël de Kok authored
-
- 20 Nov, 2024 5 commits
-
-
drbh authored
fix: set outlines version to 0.1.3 to avoid bug
-
Daniël de Kok authored
* nix: build and cache all devshells
* nix: add poetry to the impure shell
  This shouldn't be used to manage dependencies in a Nix devshell, but can be handy to update `poetry.lock`.
* Fix Nix build, disable pure shell (covered by Nix tests)
-
Daniël de Kok authored
This change adds support for wNa16 int checkpoints with 2:4 sparsity using Marlin 2:4 kernels.
-
Daniël de Kok authored
-
Daniël de Kok authored
-
- 19 Nov, 2024 4 commits
-
-
drbh authored
-
drbh authored
* add OpenAI like tool_choice for named choice
* add tests
* fix: run linter and bump api docs
* fix: consolidate changes and remove old tool type
* feat: improve, simplify and rename tool choice struct; add required support and refactor
* fix: simplify tool choice logic, improve tests, openapi and rust docs
* fix: refactor away prepare_chat_input and improve tool grammar apply control flow
* feat: update docs and add tool choice configuration section
* fix: simplify naming, tool choice default and improve test
* fix: adjust tool choice none logic, add test and small refactors
* fix: add missing snapshot file
* fix: adjust tool choice type in test
* fix: adjust default when json tool choice is
* fix: remove trailing space lint after rebase
* fix: remove mostly mocked unit test
---------
Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>
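Since the named choice follows OpenAI's tool_choice shape, a request pinning the model to one tool might look like the sketch below; the endpoint URL, placeholder model name, and the get_weather tool are illustrative assumptions:

```python
# Hedged example of an OpenAI-style named tool_choice against a local
# TGI chat endpoint; the URL and the tool itself are made up for illustration.
import requests

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    # Named choice: force the model to call this specific tool.
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"])
```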
-
Daniël de Kok authored
This version syncs with the vLLM kernels and brings some performance improvements.
-
Daniël de Kok authored
-
- 18 Nov, 2024 4 commits
-
-
drbh authored
* feat: support flash attention 2 in qwen2 vl vision blocks * fix: calc max_seqlen once and small refactors
-
Daniël de Kok authored
* Add support for compressed-tensors w8a8 int checkpoints
  This change adds a loader for w8a8 int checkpoints. One large benefit of int8 support is that the corresponding cutlass matmul kernels also work on compute capability 7.5.
  Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:

  | Tasks          |Version| Filter          |n-shot| Metric                 |   |Value |   |Stderr|
  |----------------|------:|-----------------|-----:|------------------------|---|-----:|---|------|
  |gsm8k_cot_llama |      3|flexible-extract |     8|exact_match             |↑  |0.8431|±  |0.0100|
  |                |       |strict-match     |     8|exact_match             |↑  |0.8393|±  |0.0101|
  |ifeval          |      4|none             |     0|inst_level_loose_acc    |↑  |0.8597|±  |   N/A|
  |                |       |none             |     0|inst_level_strict_acc   |↑  |0.8201|±  |   N/A|
  |                |       |none             |     0|prompt_level_loose_acc  |↑  |0.7967|±  |0.0173|
  |                |       |none             |     0|prompt_level_strict_acc |↑  |0.7468|±  |0.0187|

  Which is in the same ballpark as vLLM. As usual, lots of thanks to Neural Magic/vLLM for the kernels.
* Always use dynamic input quantization for w8a8 int: it's far less flaky and gives better output.
* Use marlin-kernels 0.3.5
* Fix a typo
  Co-authored-by: drbh <david.richard.holtz@gmail.com>
* Small fixes
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
-
Wang, Yi authored
* add ipex moe implementation to support Mixtral and PhiMoe
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update to ipex xpu 2.5
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* torch has xpu support in 2.5
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix oneapi basekit version
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
  Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
-
drbh authored
-
- 17 Nov, 2024 1 commit
-
-
Daniël de Kok authored
* Remove vLLM dependency for CUDA
  This change adds `attention-kernels` as a dependency for paged attention and cache reshaping. With that, we don't use vLLM anywhere for CUDA.
  Tested run (since we don't have paged attention in CI):
  ```
  ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
  [...]
  5 snapshots passed.
  ```
* Fix clippy warning
-
- 15 Nov, 2024 2 commits
-
-
drbh authored
* feat: return streaming errors as an event formatted for openai's client
* fix: propagate completions error events to stream
* fix: improve stream api error format and add status code
* fix: improve streaming error to include error_type
* Revert "fix: improve streaming error to include error_type"
  This reverts commit 2b1a360b1511d94ea9a24e5432e498e67939506a.
* Reworked the implementation.
* Revert "Reworked the implementation."
  This reverts commit 7c3f29777f17411ae4ade57e2f88e73cde704ee5.
* Small lifting.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
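With errors now surfaced as events on the stream instead of a dropped connection, a client can watch for them while iterating the SSE lines. The sketch below is a hedged illustration; the endpoint URL and the exact shape of the error payload (an "error" key per event) are assumptions:

```python
# Hedged sketch of consuming a streaming completion and surfacing error
# events; the endpoint URL and the "error" payload key are assumptions.
import json
import requests

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "stream": True,
}

with requests.post("http://localhost:8080/v1/chat/completions", json=payload, stream=True, timeout=60) as resp:
    for raw in resp.iter_lines():
        if not raw or not raw.startswith(b"data:"):
            continue
        data = raw[len(b"data:"):].strip()
        if data == b"[DONE]":
            break
        event = json.loads(data)
        if "error" in event:
            # Error events are formatted so OpenAI-style clients can parse them.
            print("stream error:", event["error"])
            break
        print(event["choices"][0]["delta"].get("content", ""), end="")
```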
-
Nicolas Patry authored
* Upgrading our deps. * fixup. * Fixup.
-