- 06 Sep, 2024 1 commit
-
-
Daniël de Kok authored
We need this to ensure that pyright/ruff are part of the same interpreter/venv.
-
- 05 Sep, 2024 4 commits
-
-
Wang, Yi authored
Fix a regression caused by the attention API change: ipex.varlen_attention does not currently support paged-cache format KV input. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
-
Nicolas Patry authored
-
Daniël de Kok authored
The minimum batch size logic could cause prefix blocks to be deallocated without prefill. The next allocation of the same prefix would then use garbage blocks.
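For illustration, a minimal Python sketch of the failure mode (all names hypothetical, not TGI's actual data structures): if a prefix's blocks are freed before they were ever prefilled, the cache entry for that prefix must be dropped too, or the next request with the same prefix is handed blocks with garbage contents.
```
# Toy allocator state; names are hypothetical.
blocks_for_prefix: dict[str, list[int]] = {}  # prefix -> cached block ids
free_pool: list[int] = []

def deallocate(prefix: str, prefilled: bool) -> None:
    """Return a prefix's blocks to the free pool."""
    blocks = blocks_for_prefix.get(prefix, [])
    if not prefilled:
        # The fix, in spirit: blocks that were never prefilled hold no valid
        # KV data, so the cache entry must go away with them; otherwise the
        # next allocation of the same prefix matches garbage blocks.
        blocks_for_prefix.pop(prefix, None)
    free_pool.extend(blocks)
```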
-
- 02 Sep, 2024 4 commits
-
-
drbh authored
* feat: support lora revisions and qkv_proj weights * fix: add qkv_proj weights to weight test
-
drbh authored
* fix: enable chat requests in vertex endpoint * feat: avoid unwrap and pre allocate future vec
-
Daniël de Kok authored
Enables LoRA support.
-
Daniël de Kok authored
- Add some test dependencies. - Install server in venv. - Install Python client in venv.
-
- 29 Aug, 2024 5 commits
-
-
Nicolas Patry authored
* Tied embeddings in MLP speculator. * Fixing the scale_weight when users decide to not use the speculation as much as defined in the config. * Adding scaling support + optimize some ops.
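As a rough illustration of the tied-embeddings part (a PyTorch sketch with made-up sizes, not the speculator's real layout): tying means the output head and the input embedding share a single weight tensor.
```
import torch.nn as nn

vocab_size, hidden = 32000, 4096  # made-up sizes
emb = nn.Embedding(vocab_size, hidden)
head = nn.Linear(hidden, vocab_size, bias=False)

# Tie the weights: both modules now reference the same (vocab, hidden)
# tensor, so the speculator head adds no extra parameters.
head.weight = emb.weight
```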
-
Wang, Yi authored
* Update doc with the Intel CPU part.
* Apply suggestions from code review: we never use `latest` in documentation, as it causes too many issues for users. Release numbers get updated on every release.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
drbh authored
* feat: add /v1/models endpoint * feat: add /v1/models endpoint * fix: remove unused type import * fix: revert route typo * fix: update docs with new endpoint * fix: add to redocly ignore and lint
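A quick way to exercise the new endpoint (assuming a TGI instance on localhost:8080 and the usual OpenAI-style list payload):
```
import requests

resp = requests.get("http://localhost:8080/v1/models")
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])  # each entry describes one served model
```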
-
Nicolas Patry authored
* Making prefix/flashinfer the default and testing the full release tests.
* Include flashinfer in the docker.
* Using prebuilt.
* Allowing window_left_size (dummy version).
* Disabling flashinfer/prefix caching on odd head_dim.
* Disable prefix caching for lora.
* More specific codes.
* Update lock.
* Updating integration tests with new values with FI/FD. Remove paged as a default too, and use FD everywhere.
* Update cargo lock?
* Upgrade to 1.80 because of bitstream...
* Everywhere 1.80.
* Forgot last default place.
* Apply suggestions from code review.
* Updated flake lock.
* Tmp.
* Upgrade resolution system for fewer errors in resolution.
* Remove lambda for cleaner function.
* Handling debugger.
* Override the env in server tests.
* Is this enough to make it work?
* This seems to be working.
* Downgrade some logs.
* Fixing the default for vlm.
* Don't enable prefix caching on VLM just yet.
* Change `add_special_tokens` in order to have the correct tokens for chat input and not (since it's super important with the prefixing now).
* Fixing prefix caching for flashdecoding.
* Update all models.
* Fixed flashinfer version.
* add_special_tokens is internal only.
* Fixing seqlen with the new vlms.
* Fixing the issue with `add_special_tokens` not being passed around.
* Fixing the test.
* Removing encoder_decoder (seq2seq).
* Update the chat test.
* Fixing the batching tokenization in flash causal lm.
* Truncating left for radix purposes.
* Oops, this doesn't belong here.
* Put back default pure shell.
* Update server tests: default to throughput test in k6; use TGI_WIGGLE_ROOM to adjust wiggle room.
* Only n_heads / process_group.size() are necessary.
* Revert the integration tests change (seems linked to head_size modification).
* Adding error message when assert is violated.
* Fixing the free algorithm to handle times where the common prefix is smaller.
* Update server/text_generation_server/layers/attention/common.py.
* Fix disabling prefix caching - fix windowing checks.
* Revert the Cohere tokenizer change (for now using a revision instead).
* Fmt.

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
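The `add_special_tokens` change in the list above matters because chat templates typically emit BOS and role tokens themselves; a hedged sketch of the intended behavior, using the Hugging Face tokenizer API with a placeholder model id:
```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/some-chat-model")  # placeholder id
messages = [{"role": "user", "content": "Hello!"}]

# The rendered template already contains the special tokens, so encoding it
# with add_special_tokens=True would prepend BOS a second time and change
# the token ids, which breaks prefix matching against cached requests.
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
ids = tok(text, add_special_tokens=False).input_ids
```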
-
Daniël de Kok authored
Updates tgi-nix input:

- Move Torch closer to upstream by building against MKL.
- Remove compute capability 8.7 from Torch (Jetson).
- Sync nixpkgs compute capabilities with Torch (avoids compiling too many capabilities for MAGMA).
- Use nixpkgs configuration passed through by `tgi-nix`.
-
- 28 Aug, 2024 1 commit
-
-
drbh authored
-
- 27 Aug, 2024 3 commits
-
-
drbh authored
* fix: support tojson and avoid message indexing issue in template * fix: prefer minijinja native methods and prefer workspace level dependency * fix: adjust comment typo
-
Nicolas Patry authored
-
drbh authored
* fix[router]: Fix tools not passed in chat template
* feat: improve default tool serialization and lints
* feat: refactor tool logic to include notify_error in prompt and adjust typing
* fix: adjust non-tool template apply
* fix: simplify tool grammar logic and improve schema
* feat: avoid skipping tool test and avoid empty tool prompts
* fix: increase test client timeout for grammar compilation tests

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
-
- 26 Aug, 2024 1 commit
-
-
drbh authored
* Fix: don't apply post layernorm in SiglipVisionTransformer

  This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see original transformers implementation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813). This also makes Siglip consistent with the existing Clip implementation: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613

* fix: adjust pali gemma for post layer norm and small refactors

Co-authored-by: Travis Addair <tgaddair@gmail.com>
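In outline, the fix looks like the following (a toy module, not the real SiglipVisionTransformer): the encoder output is returned before `post_layernorm`, matching what LLaVA Next and the existing Clip implementation expect.
```
import torch
import torch.nn as nn

class VisionTowerSketch(nn.Module):
    def __init__(self, hidden: int = 768) -> None:
        super().__init__()
        self.encoder = nn.Identity()              # stand-in for the encoder stack
        self.post_layernorm = nn.LayerNorm(hidden)

    def forward(self, pixel_embeds: torch.Tensor) -> torch.Tensor:
        hidden_states = self.encoder(pixel_embeds)
        # Deliberately *not* applying self.post_layernorm here: LLaVA Next
        # consumes the raw encoder output, and callers that want normalized
        # features can apply the layernorm themselves.
        return hidden_states
```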
-
- 23 Aug, 2024 1 commit
-
-
Daniël de Kok authored
The default package wraps the launcher and puts the server/router in the path. As a result, TGI can be started using something like:

```
nix run .# -- \
  --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --port 8080
```
-
- 21 Aug, 2024 3 commits
-
-
Daniël de Kok authored
-
Nicolas Patry authored
-
Daniël de Kok authored
nix: add text-generation-benchmark to pure devshell
-
- 20 Aug, 2024 2 commits
-
-
Daniël de Kok authored
* nix: pure server and support both pure and impure devShells * nix: remove unused poetry2nix input. It is not wired up and we now have a pure server. * nix: add ipdb to impure devshell
-
Nicolas Patry authored
* Prefix caching WIP
* Fixing prefix attention.
* Fixing flashinfer import.
* Fixing black.
* Fixing medusa (still wrong outputs, but functional).
* Just medusa values now.
* Fixing medusa without prefix caching.
* Fixing prefix caching.
* Medusa requires reshaping.
* Removing the logs.
* Remove router.nix
* Fixup:
  - Remove logs
  - Disable VLMs (they do not work)
  - Disable prefix caching when user wants prefill logprobs.
* Update flake.lock

Co-authored-by: Daniël de Kok <me@danieldk.eu>
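As context, prefix caching in this spirit can be sketched as keying completed KV blocks by a hash of the token prefix they cover; this toy version (hypothetical names, not TGI's actual radix structure) reuses blocks for the longest cached prefix and prefills only the remainder.
```
BLOCK_SIZE = 16                 # tokens per KV block (illustrative)
cache: dict[int, int] = {}      # prefix hash -> block id

def match_prefix(tokens: list[int]) -> tuple[list[int], int]:
    """Return the cached block ids and how many prompt tokens they cover."""
    blocks, covered = [], 0
    for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
        key = hash(tuple(tokens[:end]))
        if key not in cache:
            break               # longest cached prefix found
        blocks.append(cache[key])
        covered = end
    return blocks, covered      # prefill only tokens[covered:]
```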
-
- 19 Aug, 2024 1 commit
-
-
Daniël de Kok authored
* Update to CUDA 12.4 * poetry2nix: follow tgi-nix nixpkgs
-
- 16 Aug, 2024 6 commits
-
-
Nicolas Patry authored
* All integration tests back everywhere (too many failed CI). * Upgrade integration tests after 12.4 * Attempt to remove the specified compute cap. * Common arch list. * Punica uses raw ASM which is not valid on 9.0 apparently.
-
Hugo Larcher authored
* doc: Add metrics documentation and add a 'Reference' section * doc: Add API reference * doc: Refactor API reference * fix: Message API link * Bad rebase * Moving the docs. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
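TGI exposes Prometheus-format metrics on `/metrics`; a minimal scrape, assuming a local instance on port 8080 (the metric name shown is one example from the `tgi_`-prefixed family):
```
import requests

body = requests.get("http://localhost:8080/metrics").text
for line in body.splitlines():
    if line.startswith("tgi_request_count"):
        print(line)  # e.g. counter of requests handled so far
```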
-
Nicolas Patry authored
-
Nicolas Patry authored
* Further fixes. * Update the conftest to allow NaN (first logprob). * Fix the condition.
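The NaN allowance can be pictured as a comparison helper along these lines (a sketch, not the actual conftest code): the first logprob has no preceding context, so NaN on both sides should count as a match.
```
import math

def logprobs_match(expected: list[float], got: list[float], rel_tol: float = 1e-3) -> bool:
    """Compare logprob sequences, treating NaN == NaN (the first logprob
    may legitimately be NaN)."""
    if len(expected) != len(got):
        return False
    return all(
        (math.isnan(e) and math.isnan(g)) or math.isclose(e, g, rel_tol=rel_tol)
        for e, g in zip(expected, got)
    )
```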
-
Vaibhav Srivastav authored
* Improve the Consuming TGI docs.
* Fix erroneous update to .
* Add info about the OpenAI client.
* More updates.
* Apply suggestions from code review.
* Suggestions from Lucain.
* Update Gradio snippet.
* Up.
* Apply suggestions from code review.
* Update docs/source/basic_tutorials/consuming_tgi.md.
* Up.
* Apply suggestions from code review.
* Up.
* Up.
* Doc review from Nico.
* Doc review from Nico. x2
* Last nit

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
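Since the docs now mention the OpenAI client: TGI serves an OpenAI-compatible API under `/v1`, so a local deployment can be queried like this (the base URL and placeholder model name are assumptions for a default local setup):
```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")  # key unused locally
chat = client.chat.completions.create(
    model="tgi",  # placeholder; TGI serves a single model per instance
    messages=[{"role": "user", "content": "Say hello."}],
)
print(chat.choices[0].message.content)
```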
-
Daniël de Kok authored
Try to reduce the number of router/launcher rebuilds by filtering sources. In this way, recompiles should only be triggered by changes in Cargo or Rust files.
-
- 15 Aug, 2024 3 commits
-
-
Nicolas Patry authored
-
Nicolas Patry authored
* Fixing exl2 and other quantize tests again. * Mark exl2 as non-release (so CI tests them; needs to be removed later). * Fixing exl2 (by disabling cuda graphs) * Fix quantization defaults without cuda graphs on exl2 (linked to new issues with it). * Removing serde override. * Go back to released exl2 and remove log. * Adding warnings for deprecated bitsandbytes + upgrade info to warn.
-
Daniël de Kok authored
-
- 14 Aug, 2024 3 commits
-
-
Funtowicz Morgan authored
* (backend) use parking_lot crate for RwLock fairness * (docker) let's put rust in the TRTLLM folder when building * (docker) build ompi with SLURM support * (launcher) default new server::run parameters to false for now * (chore) fmt ... why?
-
Nicolas Patry authored
* Upgrading exl2. * Fixing the other pathways. * Fix idefics.
-
Daniël de Kok authored
This is less incremental than crate2nix, but does build all dependencies separately, so avoids full rebuilds.
-
- 13 Aug, 2024 2 commits
-
-
drbh authored
fix: add `causal` to the attention params so it can be checked when using flash attn v1
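The shape of the check, roughly (a simplified sketch, not the actual kernel wrapper): flash attention v1 only implements causal masking in this code path, so a non-causal call should fail loudly instead of computing the wrong mask.
```
FLASH_ATTN_V1 = True  # stand-in for the runtime version probe

def attention(q, k, v, causal: bool = True):
    if FLASH_ATTN_V1 and not causal:
        raise NotImplementedError("causal=False is not supported with flash attention v1")
    ...  # dispatch to the real kernel here
```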
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-