- 24 Jan, 2025 (1 commit)
-
xuxzh1 authored
-
- 20 Jan, 2025 (1 commit)
-
xuxzh1 authored
-
- 27 Dec, 2024 (1 commit)
-
xuxzh1 authored
-
- 24 Dec, 2024 (1 commit)
-
xuxzh1 authored
-
- 23 Dec, 2024 (1 commit)
-
xuxzh1 authored
-
- 09 Dec, 2024 (6 commits)
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Nicolas Patry authored
* New version.
* Link fixup.
* Update docs.
* Fixup.
-
Nicolas Patry authored
* V3 document.
* Updating asset.
-
Nicolas Patry authored
* Attempt for cleverer auto batch_prefill values (some simplifications).
* Less flaky tests.
* Fixing typo insertion.
* Update launcher/src/main.rs
* Adding small comment for source of calculation.
* Adding L40.
* Adding L40s.
Co-authored-by: Daniël de Kok <me@danieldk.eu>
-
- 06 Dec, 2024 (6 commits)
-
drbh authored
* feat: support loading gemma2 as vlm text model
* feat: add test for paligemma2
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Nicolas Patry authored
* Attempt at automatic max batch prefill.
* Taking into account number of shards.
* Adding more cards.
* Adding A100 + H100
* Adding a few more cards.
* Logprobs cost too much.
* h100 better name, and keep factor of 2
* Damn inflated sparse tflops.
* Typo in h100.
* Updated the flops calculation (checked with fvcore).
* chunking by default.
* Fix prefix caching for chat completion since we removed logprobs.
* More tests.
* Dropping all the prefill logprobs.
* Add a flag that enables users to get logprobs back.
* Repairing prompt token counting.
* Fixing a few tests.
* Remove some scaffolding.
* Attempting to reduce the issues (workarounds for now).
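For illustration, a rough sketch of how such an automatic prefill budget could be derived from card flops, shard count, and model size. The card names, peak-TFLOPS figures, and latency/efficiency constants below are assumptions made for this sketch, not the values the launcher actually uses.

```python
# Assumed dense (non-sparse) half-precision peak TFLOPS per card; illustrative only.
PEAK_TFLOPS = {
    "nvidia-l4": 121,
    "nvidia-l40": 181,
    "nvidia-a100": 312,
    "nvidia-h100": 989,
}

def auto_max_batch_prefill_tokens(
    card: str,
    num_shards: int,
    model_params: float,          # parameter count, e.g. 8e9 for an 8B model
    target_seconds: float = 1.0,  # assumed latency budget for one prefill chunk
    efficiency: float = 0.5,      # assumed fraction of peak flops actually reached
) -> int:
    """Pick a prefill token budget so one chunk roughly fits the latency target."""
    flops_per_token = 2.0 * model_params  # forward pass costs ~2 * params FLOPs per token
    available = PEAK_TFLOPS[card] * 1e12 * num_shards * efficiency
    return int(available * target_seconds / flops_per_token)

# Example: an 8B model sharded over 4 L4s.
print(auto_max_batch_prefill_tokens("nvidia-l4", num_shards=4, model_params=8e9))
```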
-
OlivierDehaene authored
* feat: auto max_new_tokens
* update default
* Fixing the tests.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
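A minimal sketch of the idea, assuming the default is simply "whatever room is left in the context window"; the field names are illustrative, not the router's actual ones.

```python
def resolve_max_new_tokens(requested: int | None, input_length: int, max_total_tokens: int) -> int:
    if requested is not None:
        return requested
    # Default: generate until the total token budget is exhausted.
    return max(1, max_total_tokens - input_length)

print(resolve_max_new_tokens(None, input_length=1200, max_total_tokens=4096))  # 2896
```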
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 04 Dec, 2024 (1 commit)
-
drbh authored
-
- 03 Dec, 2024 (2 commits)
-
Nicolas Patry authored
* Saving some VRAM.
  - 8B on 4xL4 attention=flashdecoding. Before 4.28GB left, after 4.32GB left, so 400MB saved.
  - Effect not as visible on attention=flashinfer and n_shard=1. I suspect it's linked to the torch allocator.
* Adding assertion.
-
Daniël de Kok authored
* Sync (most) server dependencies with Nix. Skipped most grpcio packages because of a protobuf version incompatibility with the opentelemetry packages.
* Add a primitive script to generate Poetry commands to sync with Nix. This is not fully automated, since getting the Nix versions may be unresolvable. However, it does take most of the work out of doing this manually.
* Upgrade eetq?
* Fmt.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
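A hedged sketch of what such a helper could look like; the input file name and format are hypothetical, and obtaining the Nix-pinned versions remains the manual step the message mentions.

```python
import json

with open("nix-pins.json") as f:  # hypothetical {"package": "version"} export of the Nix pins
    pins: dict[str, str] = json.load(f)

for package, version in sorted(pins.items()):
    # Print the Poetry commands instead of running them, so they can be reviewed first.
    print(f"poetry add {package}=={version}")
```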
-
- 02 Dec, 2024 (4 commits)
-
Dmitry Rogozhkin authored
Llama 3 has a list of values as eos_token_id: "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']". This breaks the tokenizer since it expects a single value. This commit uses tokenizer.eos_token_id instead in such a case. Fixes: #2440
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
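A sketch of the described fallback using the transformers API; the model id is only an example and the surrounding plumbing is simplified.

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model id
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

eos_token_id = config.eos_token_id
if isinstance(eos_token_id, (list, tuple)):
    # The config carries several stop ids; code that expects a single value
    # falls back to the tokenizer's canonical eos_token_id instead.
    eos_token_id = tokenizer.eos_token_id
```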
-
drbh authored
-
Torsten Raudssus authored
-
Nicolas Patry authored
-
- 28 Nov, 2024 (1 commit)
-
drbh authored
* feat: support continue_final_message param in chat request
* feat: add test for continue final message
* fix: bump openapi docs
* fix: remove continue_final_message chat request param
* fix: remove unneeded launcher args in continue test
* fix: bump test output
* fix: remove accidentally included guideline from rebase
* fix: remove guideline tests
* fix: adjust continuation tests expected text
* fix: replace expected output for continue test
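For illustration, a hedged sketch of a continuation request against the OpenAI-compatible /v1/chat/completions endpoint. Since the list above both adds and later removes an explicit continue_final_message parameter, the exact request shape here is an assumption: the sketch simply ends the conversation with a partial assistant message to be continued.

```python
import requests

payload = {
    "model": "tgi",
    "messages": [
        {"role": "user", "content": "Write a haiku about rust."},
        # Final message comes from the assistant and is intentionally unfinished.
        {"role": "assistant", "content": "Metal quietly"},
    ],
    "max_tokens": 64,
}

resp = requests.post("http://localhost:3000/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```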
-
- 26 Nov, 2024 (3 commits)
-
jp authored
Fix: typo in model loading code
-
Wang, Yi authored
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageattention)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
The compressed-tensors configuration can also specify the KV cache configuration. Use an FP8 KV cache when the configuration tells us to do so (all other options and types are ignored for now).
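A sketch of that decision; the key names assumed for the compressed-tensors quantization config (kv_cache_scheme, type, num_bits) are illustrative, and only the rule matters: use an fp8 KV cache when the checkpoint asks for an 8-bit float scheme, otherwise keep the default dtype.

```python
import torch

def kv_cache_dtype(quantization_config: dict | None, default: torch.dtype) -> torch.dtype:
    scheme = (quantization_config or {}).get("kv_cache_scheme")  # assumed key name
    if scheme and scheme.get("type") == "float" and scheme.get("num_bits") == 8:
        return torch.float8_e4m3fn
    # All other options and types are ignored for now, as the commit notes.
    return default

print(kv_cache_dtype({"kv_cache_scheme": {"type": "float", "num_bits": 8}}, torch.bfloat16))
```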
-
- 25 Nov, 2024 (2 commits)
-
Daniël de Kok authored
* Move JSON grammar -> regex grammar conversion to the router
This change moves the JSON grammar -> regex grammar conversion to the router by adding a dependency on the `outlines-core` Rust crate. In contrast to the Python implementation, the conversions are not LRU-cached since they seem to be fast enough:
simple schema: time [5.8293 µs 5.8307 µs 5.8320 µs], change [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05), performance has improved.
complex schema: time [14.875 µs 14.881 µs 14.887 µs], change [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05), performance has improved.
Using the schemas from: https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
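For illustration, a toy version of the conversion (JSON Schema to a regular expression). The real work is done by outlines-core and covers far more of the spec; this sketch only handles flat objects with required primitive properties, and the primitive patterns are assumptions.

```python
import re

# Assumed primitive patterns for the toy converter.
PRIMITIVES = {
    "integer": r"-?\d+",
    "number": r"-?\d+(\.\d+)?",
    "boolean": r"(true|false)",
    "string": r'"[^"]*"',
}

def schema_to_regex(schema: dict) -> str:
    """Build a regex matching a flat JSON object described by `schema`."""
    parts = [
        rf'"{re.escape(name)}"\s*:\s*{PRIMITIVES[prop["type"]]}'
        for name, prop in schema["properties"].items()
    ]
    return r"\{\s*" + r"\s*,\s*".join(parts) + r"\s*\}"

schema = {"properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}
pattern = re.compile(schema_to_regex(schema))
print(bool(pattern.fullmatch('{"name": "Ada", "age": 36}')))  # True
```
-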
drbh authored
* feat: concat the adapter id to the model id in chat response
* fix: updated to include only the adapter id in chat response
-
- 22 Nov, 2024 (2 commits)
-
OlivierDehaene authored
* chore: prepare 2.4.1 release
* fix tests
* fmt
-
Daniël de Kok authored
This fixes a bug in 2:4 Marlin: https://github.com/vllm-project/vllm/pull/10464
-
- 21 Nov, 2024 (7 commits)
-
OlivierDehaene authored
* feat: add payload limit
* update launcher
-
Hugo Larcher authored
* feat: Add automatic nightly benchmarks
* fix: Update runners group
* fix: add created_at field to results
* fix: Add variable results file location
-
Lucain authored
-
Daniël de Kok authored
-
drbh authored
-
OlivierDehaene authored
fix: incomplete generations w/ single-token generations and models that did not support chunking (#2770)
* Incomplete generation stream fix (#2754): entries.len() could be > batch.size in prefill, so need to filter as well.
* entries was wrongly extended for models that did not support chunking
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
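A minimal, language-agnostic sketch (written in Python) of the filtering described in the first bullet above: keep only the in-flight entries whose request ids actually appear in the returned batch, instead of assuming the two sets always match.

```python
def filter_entries(entries: dict[int, object], batch_request_ids: set[int]) -> dict[int, object]:
    # entries may hold more requests than the batch covers, so intersect explicitly.
    return {rid: entry for rid, entry in entries.items() if rid in batch_request_ids}

entries = {1: "req-1", 2: "req-2", 3: "req-3"}
print(filter_entries(entries, {1, 3}))  # {1: 'req-1', 3: 'req-3'}
```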
-
Daniël de Kok authored
-
- 20 Nov, 2024 (1 commit)
-
drbh authored
fix: set outlines version to 0.1.3 to avoid bug
-