- 19 Sep, 2024 1 commit
-
Nicolas Patry authored
* Stream options.
* Fetch stuff from nix integration test for easier testing.
* Adding the assert.
* Only send the usage when asked for.
* Update the docs.
* Impure test because we need network.
* develop.
* Optional usage.
* Fixes.
* Workflow
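For context, a hedged sketch of how a client would opt into the usage payload described above, assuming the OpenAI-compatible streaming route; the host, port, and model name are placeholders:

```
import json
import requests

# Ask the server to append a final usage chunk to the stream
# (hypothetical local endpoint; adjust host/port for your deployment).
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
        "stream_options": {"include_usage": True},
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data:"):
        continue
    payload = line[len(b"data:"):].strip()
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    # When include_usage is set, the last chunk carries token counts.
    if chunk.get("usage"):
        print("usage:", chunk["usage"])
```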
-
- 17 Sep, 2024 3 commits
-
Daniël de Kok authored
* Move to moe-kernels package and switch to common MoE layer

  This change introduces the new `moe-kernels` package:
  - Add `moe-kernels` as a dependency.
  - Introduce a `SparseMoELayer` module that can be used by MoE models.
  - Port over Mixtral and Deepseek.

* Make `cargo check` pass
* Update runner
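The commit doesn't show `SparseMoELayer`'s actual interface, so the following is only an illustrative sketch of what a common top-k-routed sparse MoE layer does; every name in it is hypothetical and not the `moe-kernels` API:

```
import torch
import torch.nn as nn

class SparseMoESketch(nn.Module):
    """Toy top-k routed MoE layer (illustrative only, not moe-kernels)."""

    def __init__(self, hidden: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route each token to its top-k experts, mix outputs by softmax weight.
        logits = self.router(x)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```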
-
OlivierDehaene authored
-
Daniël de Kok authored
Runs the tests in a Nix build sandbox.
-
- 16 Sep, 2024 2 commits
-
Nicolas Patry authored
* Adding a test for FD.
* Fixing flashdecoding (empty batch doesn't work).
* Fixing the invalid popping.
* Fixing radix with block_size > 1.
* Last reference.
* Use an actual hash.
* Update hash for slice.len() == 1.
* Update the locks.
* Increasing docker timeout.
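As a rough mental model (assumed, not the real `radix.rs` logic) of why "use an actual hash" and the `slice.len() == 1` case matter: each cached block's key is a hash chained from its parent, so equal prefixes collide deliberately, regardless of slice length or `block_size`:

```
import hashlib

BLOCK_SIZE = 2  # block_size > 1 was one of the broken cases

def block_hashes(token_ids: list[int], parent: str = "root") -> list[str]:
    """Chain-hash fixed-size token blocks so equal prefixes share keys."""
    hashes = []
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = token_ids[i : i + BLOCK_SIZE]
        h = hashlib.sha256(f"{parent}:{block}".encode()).hexdigest()
        hashes.append(h)
        parent = h  # the next block's key depends on everything before it
    return hashes

# Two prompts sharing a 4-token prefix share their first two block keys.
a = block_hashes([1, 2, 3, 4, 5, 6])
b = block_hashes([1, 2, 3, 4, 9, 9])
assert a[:2] == b[:2] and a[2] != b[2]
```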
-
Daniël de Kok authored
Disable by default because CI runners do not have enough GPUs.
-
- 13 Sep, 2024 1 commit
-
Alex Strick van Linschoten authored
* Use ratatui, not the archived tui
* Bump ratatui all the way, with options
-
- 12 Sep, 2024 4 commits
-
Wang, Yi authored
Enable Intel IPEX CPU and XPU in Python 3.11.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
drbh authored
fix: pass missing revision arg for lora adapter when loading multiple adapters
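For context, a hedged sketch of the kind of call the revision argument needs to reach when several adapters are loaded; it uses PEFT directly rather than TGI's internal loader, and the model and adapter IDs are placeholders:

```
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model-id")

# Each adapter can pin its own revision; dropping it silently falls back
# to the default branch, which is the bug being fixed here.
model = PeftModel.from_pretrained(base, "org/adapter-one", revision="v1")
model.load_adapter("org/adapter-two", adapter_name="two", revision="v2")
```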
-
Nicolas Patry authored
* Add nix test.
* Modifying yourself means you need to rerun.
* Fixing the test + adding click (needed for pre-commit hooks).
* Try this.
* Our runner + pure test (not written).
* Remove server.
* Root user.
* Different user?
* Add the actual test target.
* Forgot this modification.
* Add a formatter.
* Add the secrets.
* Fixed the auth token?
* Adding the other tests.
* Missing pre-commit.
* Test requires cargo for cargo fmt.
* Update it a bit.
* Up.
* Attempting to use a cache location for the models.
* Ignore the cache for now.
-
Daniël de Kok authored
Ideally we wouldn't have the router wrapper that this change adds, but when I give PyO3 a Python interpreter with packages, it ends up linking libpython from the Python interpreter rather than the constructed environment and cannot pick up the Python modules as a result.
-
- 11 Sep, 2024 3 commits
-
Nicolas Patry authored
* Attempting to discard the trufflehog warning.
* Attempt to fix trufflehog.
-
Nicolas Patry authored
* Fixing odd tokenization self-modifications on the Rust side (load and resave in Python).
* Fixing the builds?
* Fix the gh action?
* Fixing the location?
* Validation is odd.
* Try a faster runner.
* Upgrade python version.
* Remove sccache.
* No sccache.
* Getting libpython maybe?
* List stuff.
* Monkey it up.
* Have no idea at this point.
* Tmp.
* Shot in the dark.
* Tmate the hell out of this.
* Desperation.
* WTF.
* -y.
* Apparently 3.10 is not available anymore.
* Updating the dockerfile to make libpython discoverable at runtime too.
* Put back rust tests.
* Why do we want mkl on AMD?
* Forcing 3.11?
-
Nicolas Patry authored
* Adding prefix test.
* [WIP] tmp dump of integration load tests.
* Remove other tensor creation.
* Fixed the radix tree. Used a slice everywhere in radix.rs to keep the cheap Arc cloning instead of recomputing the input_ids.
* Fix parsing.
* Is it really the flashinfer version?
* Remove some comments.
* Revert the max prefix hit.
* Adding numpy to diff.
* Upgraded flashinfer.
* Upgrading some stuff.
* Are we done yet?
* Minor fixup.
* Remove 1 log and put back the other.
* Add comment for why slot 0 is OK.
* Mounting on the job.
* Get me a debug branch.
* Debugging CIs is fun.
* Attempt #28.
* wip.
* Tmate.
* Praying.
* Updating VLM causal model with updated context.
* Important line got squashed.
* Tmate again.
* Fingers crossed.
* We want only 1 run of integration tests.

Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
-
- 07 Sep, 2024 1 commit
-
Vallepu Vamsi Krishna authored
Update Makefile-fbgemm: add a directory check before cloning the FBGEMM repository.
-
- 06 Sep, 2024 6 commits
-
Nicolas Patry authored
-
Martin Iglesias Goyanes authored
* Add links to Adyen blogpost
* Adding to toctree.
* Update external.md
* Update _toctree.yml

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Daniël de Kok authored
-
Daniël de Kok authored
These should all be cheap assertions. Also:

* Fixup some comments.
* Delete a `remove` that was done unnecessarily twice.
-
Daniël de Kok authored
-
Daniël de Kok authored
We need this to ensure that pyright/ruff are part of the same interpreter/venv.
-
- 05 Sep, 2024 4 commits
-
Wang, Yi authored
Fix a regression caused by the attention API change: `ipex.varlen_attention` does not currently support paged-cache-format KV input.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
-
Nicolas Patry authored
-
Daniël de Kok authored
The minimum batch size logic could cause prefix blocks to be deallocated without prefill. The next allocation of the same prefix would then use garbage blocks.
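A toy model of that failure mode (assumed, not the actual allocator code): if a block goes back on the free list while its prefix stays in the cache index, the next hit on that prefix serves whatever was written to the block in the meantime:

```
cache = {}           # prefix -> physical block index
blocks = [None] * 4  # physical block contents
free = [0, 1, 2, 3]

def allocate(prefix, tokens):
    if prefix in cache:        # "cache hit": reuse without prefill
        return cache[prefix]
    i = free.pop()
    blocks[i] = tokens         # prefill writes the block
    cache[prefix] = i
    return i

def deallocate(prefix):
    # Bug shape: the block returns to the free list but the cache entry
    # survives, so a later hit serves whatever was written there since.
    free.append(cache[prefix])

i = allocate("hello", ["h", "e"])
deallocate("hello")
j = allocate("other", ["x", "y"])  # reuses physical block i
k = allocate("hello", None)        # stale hit -> garbage block
assert k == i and blocks[k] == ["x", "y"]
```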
-
- 02 Sep, 2024 4 commits
-
drbh authored
* feat: support lora revisions and qkv_proj weights
* fix: add qkv_proj weights to weight test
-
drbh authored
* fix: enable chat requests in vertex endpoint
* feat: avoid unwrap and pre-allocate future vec
-
Daniël de Kok authored
Enables LoRA support.
-
Daniël de Kok authored
- Add some test dependencies.
- Install server in venv.
- Install Python client in venv.
-
- 29 Aug, 2024 5 commits
-
Nicolas Patry authored
* Tied embeddings in MLP speculator.
* Fixing the scale_weight when users decide not to use the speculation as much as defined in the config.
* Adding scaling support + optimize some ops.
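Weight tying itself is a standard trick; a minimal sketch of what "tied embeddings" means in general (not the MLP speculator's actual code):

```
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Share one weight matrix between the input embedding and the output head."""

    def __init__(self, vocab: int, hidden: int):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.head = nn.Linear(hidden, vocab, bias=False)
        self.head.weight = self.embed.weight  # tying: one tensor, two uses

    def forward(self, token_ids):
        return self.head(self.embed(token_ids))
```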
-
Wang, Yi authored
* Update doc with Intel CPU part.
* Apply suggestions from code review: we never use `latest` in documentation since it causes too many issues for users; the release number gets updated on every release.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
drbh authored
* feat: add /v1/models endpoint
* fix: remove unused type import
* fix: revert route typo
* fix: update docs with new endpoint
* fix: add to redocly ignore and lint
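A quick hedged example of querying the new endpoint; the local URL is a placeholder and the response is assumed to follow the OpenAI models-list shape:

```
import requests

resp = requests.get("http://localhost:8080/v1/models")
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```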
-
Nicolas Patry authored
* Making prefix/flashinfer the default and testing the full release tests.
* Include flashinfer in the docker.
* Using prebuilt.
* Allowing window_left_size (dummy version).
* Disabling flashinfer/prefix caching on odd head_dim.
* Disable prefix caching for lora.
* More specific codes.
* Update lock.
* Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere.
* Update cargo lock?
* Upgrade to 1.80 because of bitstream...
* Everywhere 1.80.
* Forgot last default place.
* Apply suggestions from code review.
* Updated flake lock.
* Tmp.
* Upgrade resolution system for fewer errors in resolution.
* Remove lambda for cleaner function.
* Handling debugger.
* Override the env in server tests.
* Is this enough to make it work?
* This seems to be working.
* Downgrade some logs.
* Fixing the default for vlm.
* Don't enable prefix caching on VLM just yet.
* Change `add_special_tokens` in order to have the correct tokens for chat input (super important with the prefixing now).
* Fixing prefix caching for flashdecoding.
* Update all models.
* Fixed flashinfer version.
* add_special_tokens is internal only.
* Fixing seqlen with the new vlms.
* Fixing the issue with `add_special_tokens` not being passed around.
* Fixing the test.
* Removing encoder_decoder (seq2seq).
* Update the chat test.
* Fixing the batching tokenization in flash causal lm.
* Truncating left for radix purposes.
* Oops, this doesn't belong here.
* Put back default pure shell.
* Update server tests:
  - Default to throughput test in k6.
  - Use TGI_WIGGLE_ROOM to adjust wiggle room.
* Only n_heads / process_group.size() are necessary.
* Revert the integration tests change (seems linked to the head_size modification).
* Adding error message when the assert is violated.
* Fixing the free algorithm to handle times where the common prefix is smaller.
* Apply suggestions from code review.
* Update server/text_generation_server/layers/attention/common.py.
* Fix disabling prefix caching - fix windowing checks.
* Revert the Cohere tokenizer change (for now using a revision instead).
* Fmt.

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
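The `add_special_tokens` point above is easy to see with a plain tokenizer call; a hedged illustration (the tokenizer repo is just a convenient Llama-style example):

```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

text = "[INST] Hello [/INST]"
# With add_special_tokens=True the tokenizer prepends its own BOS,
# duplicating markers a chat template may already have rendered into text.
with_specials = tok(text, add_special_tokens=True)["input_ids"]
without = tok(text, add_special_tokens=False)["input_ids"]
print(len(with_specials) - len(without))  # the extra special token(s)
```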
-
Daniël de Kok authored
Updates tgi-nix input:

- Move Torch closer to upstream by building against MKL.
- Remove compute capability 8.7 from Torch (Jetson).
- Sync nixpkgs compute capabilities with Torch (avoids compiling too many capabilities for MAGMA).
- Use nixpkgs configuration passed through by `tgi-nix`.
-
- 28 Aug, 2024 1 commit
-
drbh authored
-
- 27 Aug, 2024 3 commits
-
drbh authored
* fix: support tojson and avoid message indexing issue in template
* fix: prefer minijinja native methods and prefer workspace level dependency
* fix: adjust comment typo
-
Nicolas Patry authored
-
drbh authored
* fix[router]: Fix tools not passed in chat template
* feat: improve default tool serialization and lints
* feat: refactor tool logic to include notify_error in prompt and adjust typing
* fix: adjust non tool template apply
* fix: simplify tool grammar logic and improve schema
* feat: avoid skip tool test and avoid empty tool prompts
* fix: increase test client timeout for grammar compilation tests

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
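For reference, the request shape whose tools now reach the chat template; a minimal sketch against the OpenAI-compatible route (URL, model name, and the weather function are placeholders):

```
import requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "Weather in Paris?"}],
        "tools": tools,
        "tool_choice": "auto",
    },
)
print(resp.json()["choices"][0]["message"])
```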
-
- 26 Aug, 2024 1 commit
-
drbh authored
* Fix: don't apply post layernorm in SiglipVisionTransformer

  This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see the original transformers implementation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813). This also makes Siglip consistent with the existing Clip implementation: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613

* fix: adjust pali gemma for post layer norm and small refactors

Co-authored-by: Travis Addair <tgaddair@gmail.com>
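In code terms the Siglip fix amounts to returning the encoder output without applying the final layernorm; a hedged sketch of the shape of the change (names are illustrative, not the file's exact code):

```
import torch.nn as nn

class VisionTransformerSketch(nn.Module):
    def __init__(self, encoder: nn.Module, hidden: int):
        super().__init__()
        self.encoder = encoder
        self.post_layernorm = nn.LayerNorm(hidden)

    def forward(self, pixel_values):
        hidden_states = self.encoder(pixel_values)
        # LLaVA Next consumes the pre-layernorm encoder states, so the
        # fix is to return them as-is instead of self.post_layernorm(...).
        return hidden_states
```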
-
- 23 Aug, 2024 1 commit
-
Daniël de Kok authored
The default package wraps the launcher and puts the server/router in the path. As a result, TGI can be started using something like:

```
nix run .# -- \
  --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --port 8080
```
-