1. 09 Dec, 2024 1 commit
  2. 06 Dec, 2024 6 commits
  3. 04 Dec, 2024 1 commit
  4. 03 Dec, 2024 2 commits
    • Saving some VRAM. (#2790) · b57f3703
      Nicolas Patry authored
      * Saving some VRAM.
      
      - 8B on 4xL4 with attention=flashdecoding. Before: 4.28GB free; after:
        4.32GB free, so about 40MB saved.
      
      - The effect is less visible with attention=flashinfer and n_shard=1; I
        suspect this is linked to the torch allocator.
      
      * Adding assertion.
    • Sync (most) server dependencies with Nix (#2782) · 2003d8be
      Daniël de Kok authored

      * Sync (most) server dependencies with Nix
      
      Skipped most grpcio packages, because of protobuf version
      incompatibility with the opentelemetry packages.
      
      * Add a primitive script to generate Poetry commands to sync with Nix
      
      This is not fully automated, since the Nix versions cannot always be
      resolved automatically. However, it takes most of the manual work out
      of the process.
      
      * Upgrade eetq?
      
      * Fmt.
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
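      The sync script itself is not shown in this log; a hypothetical sketch of
      such a helper, assuming it receives a mapping of package names to the
      versions pinned in Nix and emits `poetry add` commands (all names and
      versions below are illustrative):

      ```python
      # Illustrative sketch: emit Poetry commands that pin server dependencies
      # to the versions already packaged in Nix. The nix_versions mapping is a
      # stand-in for whatever the real script extracts from the Nix expressions.

      def poetry_sync_commands(nix_versions, skip=()):
          """Build one `poetry add` command per package not listed in `skip`."""
          commands = []
          for name, version in sorted(nix_versions.items()):
              if name in skip:
                  continue  # e.g. grpcio packages with protobuf conflicts
              commands.append(f"poetry add {name}=={version}")
          return commands

      if __name__ == "__main__":
          versions = {"transformers": "4.46.3", "grpcio": "1.68.0"}
          for cmd in poetry_sync_commands(versions, skip={"grpcio"}):
              print(cmd)
      ```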
  5. 02 Dec, 2024 4 commits
  6. 28 Nov, 2024 1 commit
    • Support continue final message (#2733) · d4718051
      drbh authored
      * feat: support continue_final_message param in chat request
      
      * feat: add test for continue final message
      
      * fix: bump openapi docs
      
      * fix: remove continue_final_message chat request param
      
      * fix: remove unneeded launcher args in continue test
      
      * fix: bump test output
      
      * fix: remove accidentally included guideline from rebase
      
      * fix: remove guideline tests
      
      * fix: adjust continuation tests expected text
      
      * fix: replace expected output for continue test
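      The feature lets generation continue a partially written final assistant
      message instead of opening a new turn. A minimal sketch of what such a
      chat payload might look like (the endpoint shape and field names are
      assumptions based on the common OpenAI-style chat schema, not taken from
      this log):

      ```python
      # Sketch of a chat request whose final message is an incomplete assistant
      # turn, which the server is expected to continue rather than answer anew.
      # Treat the payload shape as illustrative, not as TGI's exact API.
      import json

      def build_continue_payload(user_text, partial_reply):
          return {
              "model": "tgi",
              "messages": [
                  {"role": "user", "content": user_text},
                  # The final message is from the assistant: the server should
                  # continue this text instead of starting a fresh reply.
                  {"role": "assistant", "content": partial_reply},
              ],
              "max_tokens": 64,
          }

      payload = build_continue_payload("List three primes.", "The first prime is")
      print(json.dumps(payload, indent=2))
      ```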
  7. 26 Nov, 2024 3 commits
  8. 25 Nov, 2024 2 commits
  9. 22 Nov, 2024 2 commits
  10. 21 Nov, 2024 7 commits
  11. 20 Nov, 2024 5 commits
  12. 19 Nov, 2024 4 commits
  13. 18 Nov, 2024 2 commits
    • feat: support flash attention 2 in qwen2 vl vision blocks (#2721) · 38cff84a
      drbh authored
      * feat: support flash attention 2 in qwen2 vl vision blocks
      
      * fix: calc max_seqlen once and small refactors
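      Flash attention's varlen kernels take packed sequences described by
      cumulative lengths plus a `max_seqlen`; the refactor above computes that
      maximum once instead of per block. A small pure-Python sketch of the
      computation (the `cu_seqlens` values are illustrative):

      ```python
      # Sketch: derive max_seqlen from cumulative sequence lengths, as used by
      # flash-attn's varlen interface. cu_seqlens has one more entry than there
      # are sequences; consecutive differences are the individual lengths.

      def max_seqlen_from_cu(cu_seqlens):
          lengths = [b - a for a, b in zip(cu_seqlens, cu_seqlens[1:])]
          return max(lengths)

      # Three packed sequences of lengths 5, 3 and 9.
      print(max_seqlen_from_cu([0, 5, 8, 17]))  # -> 9
      ```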
    • Add support for compressed-tensors w8a8 int checkpoints (#2745) · 3c9df21f
      Daniël de Kok authored

      * Add support for compressed-tensors w8a8 int checkpoints
      
      This change adds a loader for w8a8 int checkpoints. One large benefit of
      int8 support is that the corresponding cutlass matmul kernels also work on
      compute capability 7.5.
      
      Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:
      
      |     Tasks     |Version|     Filter     |n-shot|        Metric         |   |Value |   |Stderr|
      |---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
      |gsm8k_cot_llama|      3|flexible-extract|     8|exact_match            |↑  |0.8431|±  |0.0100|
      |               |       |strict-match    |     8|exact_match            |↑  |0.8393|±  |0.0101|
      |ifeval         |      4|none            |     0|inst_level_loose_acc   |↑  |0.8597|±  |   N/A|
      |               |       |none            |     0|inst_level_strict_acc  |↑  |0.8201|±  |   N/A|
      |               |       |none            |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
      |               |       |none            |     0|prompt_level_strict_acc|↑  |0.7468|±  |0.0187|
      
      This is in the same ballpark as vLLM.
      
      As usual, lots of thanks to Neural Magic/vLLM for the kernels.
      
      * Always use dynamic input quantization for w8a8 int
      
      It's far less flaky and gives better output.
      
      * Use marlin-kernels 0.3.5
      
      * Fix a typo
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      
      * Small fixes
      
      ---------
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
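      "Dynamic input quantization" here means the int8 activation scales are
      computed at runtime, per row of input, rather than calibrated ahead of
      time. A minimal pure-Python sketch of symmetric per-row int8 quantization
      (illustrative only; the real kernels come from marlin-kernels and the
      Neural Magic/vLLM cutlass kernels):

      ```python
      # Sketch of symmetric per-row dynamic int8 quantization: each row of
      # activations gets its own scale, computed on the fly from its max |x|.

      def quantize_per_row(rows):
          quantized, scales = [], []
          for row in rows:
              scale = max(abs(x) for x in row) / 127.0 or 1.0
              quantized.append([round(x / scale) for x in row])
              scales.append(scale)
          return quantized, scales

      def dequantize(quantized, scales):
          return [[q * s for q in row] for row, s in zip(quantized, scales)]

      x = [[0.5, -1.0, 0.25], [2.0, 0.0, -4.0]]
      q, s = quantize_per_row(x)
      x_hat = dequantize(q, s)  # close to x, up to rounding error
      ```

      Computing the scale per input row at runtime is what makes it "dynamic";
      a static scheme would ship fixed activation scales with the checkpoint,
      which the commit notes was flakier.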