Commits · ce85efa968b54d40bb9546b5acfb3e30e236f8b5 · OpenDAS / text-generation-inference

16 Sep, 2024 1 commit

Adding a test for FD. (#2516) · 38fcafcf

Nicolas Patry authored Sep 16, 2024

* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.


11 Sep, 2024 2 commits

Fix truffle (#2514) · 69e3be20

Nicolas Patry authored Sep 11, 2024

* Attempting to discard the trufflehog warning.

* Attempt to fix trufflehog.


Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) · a4e3e8c6

Nicolas Patry authored Sep 11, 2024



* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really flashinfer version ?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet ?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------
Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>


02 Sep, 2024 1 commit

nix: add punica-kernels (#2477) · de2cdeca

Daniël de Kok authored Sep 02, 2024

Enables LoRA support.

29 Aug, 2024 2 commits

Lots of improvements (Still 2 allocators) (#2449) · e415b690

Nicolas Patry authored Aug 29, 2024



* Making prefix/flashinfer the default and testing the full release tests.

* Include flashinfer in the docker.

* Using prebuilt.

* Allowing window_left_size (dummy version).

* Disabling flashinfer/prefix caching on odd head_dim

* Disable prefix caching for lora.

* More specific codes.

* Update lock

* Updating integration tests with new values with FI/FD.

Remove paged as a default too, and using FD everywhere.

* Update cargo lock ?

* Upgrade to 1.80 because of bitstream...

* Everywhere 1.80

* Forgot last default place.

* Apply suggestions from code review
Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Updated flake lock

* Tmp

* Upgrade resolution system for fewer errors in resolution.

* Remove lambda for cleaner function.

* Handling debugger.

* Override the env in server tests.

* Is this enough to make it work ?

* This seems to be working.

* Downgrade some logs.

* Fixing the default for vlm.

* Don't enable prefix caching on VLM just yet.

* Change `add_special_tokens` in order to have the correct tokens for chat
input and non-chat input (since it's super important with the prefixing now)

* Fixing prefix caching for flashdecoding.

* Update all models.

* Fixed flashinfer version.

* add_special_tokens is internal only

* Fixing seqlen with the new vlms.

* Fixing the issue with `add_special_tokens` not being passed around.

* Fixing the test.

* Removing encoder_decoder (seq2seq).

* Update the chat test.

* Fixing the batching tokenization in flash causal lm.

* Truncating left for radix purposes.

* Oops this doesn't belong here.

* Put back default pure shell.

* Update server tests

- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room

* Only n_heads / process_group.size() are necessary.

* Revert the integration tests change (seems linked to head_size
modification).

* Adding error message when assert is violated.

* Fixing the free algorithm to handle times where the common prefix is
smaller.

* Apply suggestions from code review
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Update server/text_generation_server/layers/attention/common.py
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Fix disabling prefix caching - Fix windowing checks.

* Revert the Cohere tokenizer change (for now using a revision instead).

* Fmt.

---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>


nix: build Torch against MKL and various other improvements (#2469) · 4e821c00

Daniël de Kok authored Aug 29, 2024

Updates tgi-nix input:

- Move Torch closer to upstream by building against MKL.
- Remove compute capability 8.7 from Torch (Jetson).
- Sync nixpkgs compute capabilities with Torch (avoids
  compiling too many capabilities for MAGMA).
- Use nixpkgs configuration passed through by `tgi-nix`.


21 Aug, 2024 2 commits

nix: add awq-inference-engine as server dependency (#2442) · 358ceb67

Daniël de Kok authored Aug 21, 2024

Adding eetq to flake. (#2438) · 310778e0

Nicolas Patry authored Aug 21, 2024

20 Aug, 2024 2 commits

nix: add pure server to flake, add both pure and impure devshells (#2430) · f5f11b79

Daniël de Kok authored Aug 20, 2024

* nix: pure server and support both pure and impure devShells

* nix: remove unused poetry2nix input

It is not wired up and we now have a pure server.

* nix: add ipdb to impure devshell


Prefix caching (#2402) · b70ae096

Nicolas Patry authored Aug 20, 2024



* Prefix caching WIP

* Fixing prefix attention.

* Fixing flashinfer import.

* Fixing black.

* Fixing medusa (still wrong outputs, but functional).

* Just medusa values now.

* Fixing medusa without prefix caching.

* Fixing prefix caching.

* Medusa requires reshaping.

* Removing the logs.

* Remove router.nix

* Fixup:

- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.

* Update flake.lock

---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>


19 Aug, 2024 1 commit

nix: update to CUDA 12.4 (#2429) · 38773453

Daniël de Kok authored Aug 19, 2024

* Update to CUDA 12.4

* poetry2nix: follow tgi-nix nixpkgs

16 Aug, 2024 1 commit

nix: try to reduce the number of Rust rebuilds (#2424) · 1411bfb9

Daniël de Kok authored Aug 16, 2024

Try to reduce the number of router/launcher rebuilds by filtering
sources. In this way, recompiles should only be triggered by changes
in Cargo or Rust files.


15 Aug, 2024 1 commit

nix: build router incrementally (#2422) · 9aaa12e7

Daniël de Kok authored Aug 15, 2024

14 Aug, 2024 1 commit

nix: partial incremental build of the router (#2416) · c5fff92b

Daniël de Kok authored Aug 14, 2024

This is less incremental than crate2nix, but does build all dependencies
separately, so avoids full rebuilds.


13 Aug, 2024 2 commits

Adding more kernels to flake. (#2411) · cd9b15d1

Nicolas Patry authored Aug 13, 2024

nix: incremental build of the launcher (#2410) · 6f4bb4f2

Daniël de Kok authored Aug 13, 2024

12 Aug, 2024 1 commit

Updating the flake. (#2404) · 19ea85f8

Nicolas Patry authored Aug 12, 2024

09 Aug, 2024 3 commits

Update flake for 9.0a capability in Torch (#2394) · 8dcc7d3f

Daniël de Kok authored Aug 09, 2024

flake: use rust-overlay (#2390) · 6e127dcc

Daniël de Kok authored Aug 09, 2024

Add experimental flake (#2384) · c6d5039c

Daniël de Kok authored Aug 09, 2024

Add flake.nix