Commits · 2358c2bb54cf60a1596267192451a46a24f03e06 · OpenDAS / text-generation-inference

04 Oct, 2024 1 commit

Add basic FP8 KV cache support (#2603) · 2358c2bb

Daniël de Kok authored Oct 04, 2024

* Add basic FP8 KV cache support

This change adds rudimentary FP8 KV cache support. The support is
enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
uses this type for the KV cache. However support is still limited:

* Only the `fp8_e5m2` type is supported.
* The KV cache layout is the same as `float16`/`bfloat16` (HND).
* The FP8 KV cache is only supported for FlashInfer.
* Loading of scales is not yet supported.

* Fix Cargo.toml

2358c2bb

02 Oct, 2024 1 commit

Mllama flash version (#2585) · d18ed5cf

Nicolas Patry authored Oct 02, 2024

* Working loading state.

* Preprocessing.

* Working state ? (Broke idefics1 temporarily).

* Cleaner condition.

* Fix idefics.

* Updating config, removing TODO

* Mllama

* Ugrade transformers 4.45

* Flashing mllama.

* Starting to get there.

* Working state.

* Integrations tests for mllama (cutting to 10 tokens because there seems'
to be instability after (meaning size of the batch matters.

* Updating model link.

* Earlier assert.

* Fix vlm ?

* remove log.

* Force ignore all images but last.

* Default dtype bfloat16.

* Update integration test after switch to bf16.

* Remove dead code.

* Removed dead code.

* Upgrade the flake to latest transformers/tokenizers

* Move to hf tgi-nix

* Upgrade to 0.5.0

d18ed5cf

30 Sep, 2024 4 commits

MoE Marlin: support `desc_act` for `groupsize != -1` (#2590) · 1c84a30f
Daniël de Kok authored Sep 30, 2024
```
This change uses the updated Marlin MoE kernel from vLLM to support
MoE with activation sorting and groups.
```
1c84a30f

feat: support phi3.5 moe (#2479) · 93a7042d

drbh authored Sep 30, 2024



* feat: support phi3.5 moe model loading

* fix: prefer llama base model and improve rotary logic

* feat: return reasonable generation and add integration test

* fix: run lint and update docs

* fix: rerun lint for openapi docs

* fix: prefer do_sample false unless temp is set by user, and update chat tests

* fix: small typo adjustments

* fix: consolidate long rope paths

* fix: revert greedy by default and test changes

* Vendor configuration so that we don't have to `trust_remote_code`

* Use SparseMoELayer

* Add support for dense MoE

* Some type annotations

* Add the usual model tests

* Ruff.

---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

93a7042d

Add support for GPTQ-quantized MoE models using MoE Marlin (#2557) · 90a1d04a

Daniël de Kok authored Sep 30, 2024

This change add support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:

- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.

90a1d04a

Update ROCM libs and improvements (#2579) · f9e561ec

Mohit Sharma authored Sep 30, 2024

* style

* update torch

* ix issues

* fix clone

* revert mkl

* added custom PA

* style

* fix style

* style

* hide env vart

* fix mixtral model

* add skinny kernel and merge fixes

* fixed style

* fix issue for sliding window models

* addressed review comments

* fix import

* improved error messag

* updated default value

* remove import

* fix imports after rebase

* float16 dep

* improve dockerfile

* cleaned dockerfile

f9e561ec

28 Sep, 2024 1 commit
- flashinfer: pass window size and dtype (#2574) · 1028996f
  Daniël de Kok authored Sep 28, 2024
  
  1028996f
27 Sep, 2024 1 commit

Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2

Daniël de Kok authored Sep 27, 2024

* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention for models with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s

5b6b74e2

26 Sep, 2024 1 commit
- Add LoRA adapters support for Gemma2 (#2567) · 0b7df771
  Alvaro Bartolome authored Sep 26, 2024
```
* Add LoRA adapters support for Gemma2

* Make `black` formatting happy
```
  0b7df771
24 Sep, 2024 4 commits

More tensor cores. (#2558) · dd8691b7
Nicolas Patry authored Sep 24, 2024
```
* More tensor cores.

* Fixing the logic.

* Gemma is modified by this.
```
dd8691b7
Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 (#2537) · 3f14cd14
Daniël de Kok authored Sep 24, 2024
```
This replaces the custom layers in both models.
```
3f14cd14

Add support for scalar FP8 weight scales (#2550) · c29dc89c

Daniël de Kok authored Sep 24, 2024

* Add support for scalar FP8 weight scales

* Support LLM compressor FP8 checkpoints on H100

On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
However, we wouldn't pick up fp8 quantization for models quantized with
LLM compressor. This change adds enough parsing to detect if models have
FP8-quantized weights.

* Remove stray debug print

c29dc89c

Micro cleanup. (#2555) · 74d3ce10
Nicolas Patry authored Sep 24, 2024

74d3ce10

20 Sep, 2024 1 commit
- hotfix: ipex fails since cuda moe kernel is not supported (#2532) · f478aa77
  Wang, Yi authored Sep 20, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  f478aa77
17 Sep, 2024 1 commit

Move to moe-kernels package and switch to common MoE layer (#2511) · ce85efa9

Daniël de Kok authored Sep 17, 2024

* Move to moe-kernels package and switch to common MoE layer

This change introduces the new `moe-kernels` package:

- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
  models.
- Port over Mixtral and Deepseek.

* Make `cargo check` pass

* Update runner

ce85efa9

12 Sep, 2024 2 commits
- hotfix : enable intel ipex cpu and xpu in python3.11 (#2517) · 3ac7df2b
  Wang, Yi authored Sep 12, 2024
```
enable intel ipex cpu and xpu in python3.11
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  3ac7df2b
- fix: pass missing revision arg for lora adapter when loading multiple… (#2510) · 628334d3
  drbh authored Sep 12, 2024
```
fix: pass missing revision arg for lora adapter when loading multiple adapters
```
  628334d3
11 Sep, 2024 2 commits

Fix tokenization yi (#2507) · dae3bf1d

Nicolas Patry authored Sep 11, 2024

* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?

dae3bf1d

Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) · a4e3e8c6

Nicolas Patry authored Sep 11, 2024



* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really flashinfer version ?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet ?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------
Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>

a4e3e8c6

05 Sep, 2024 1 commit

hotfix: fix regression of attention api change in intel platform (#2439) · 5cd8025f

Wang, Yi authored Sep 05, 2024



fix regression caused by attention api change. ipex.varlen_attention does not support paged-cache
format kv input now.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

5cd8025f

02 Sep, 2024 1 commit
- feat: support lora revisions and qkv_proj weights (#2482) · 6cb42f49
  drbh authored Sep 02, 2024
```
* feat: support lora revisions and qkv_proj weights

* fix: add qkv_proj weights to weight test
```
  6cb42f49
29 Aug, 2024 2 commits

Tied embeddings in MLP speculator. (#2473) · d9fbbaaf

Nicolas Patry authored Aug 29, 2024

* Tied embeddings in MLP speculator.

* Fixing the scale_weight when users decide to not use the speculation as
much as defined in the config.

* Adding scaling support + optimize some ops.

d9fbbaaf

Lots of improvements (Still 2 allocators) (#2449) · e415b690

Nicolas Patry authored Aug 29, 2024



* Making prefix/flashinfer the default and testing the full release tests.

* Include flashinfer in the docker.

* Using prebuilt.

* Allowing window_left_size (dummy version).

* Disabling flashinfer/prefix caching on odd head_dim

* Disable prefix caching for lora.

* More specific codes.

* Update lock

* Updating integration tests with new values with FI/FD.

Remove paged as a default too, and using FD everywhere.

* Update cargo lock ?

* Upgrade to 1.80 because of bitstream...

* Everywhere 1.80

* Forgot last default place.

* Apply suggestions from code review
Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Updated flake lock

* Tmp

* Upgrade resolution system for less errors in resolution.

* Remove lambda for cleaner function.

* Handling debugger.

* OVerride the env in server tests.

* Is this enough to make it work ?

* This seems to be working.

* Downgrade some logs.

* Fixing the default for vlm.

* Don't enable prefix caching on VLM just yet.

* Change `add_special_tokens` in order to have the correct tokens for chat
input and not (since it's super important with the prefixing now)

* Fixing prefix caching for flashdecoding.

* Update all models.

* Fixed flashinfer version.

* add_special_tokens is internal only

* Fixing seqlen with the new vlms.

* Fixing the issue with `add_special_tokens` not being passed around.

* Fixing the test.

* Removing encoder_decoder (seq2seq).

* Update the chat test.

* Fixing the batching tokenization in flash causal lm.

* Truncating left for radix purposes.

* Oops this doesn't belong here.

* Put back default pure shell.

* Update server tests

- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room

* Only n_heads / process_group.size() are necessary.

* Revert the integrationt tests change (seem linked to head_size
modification).

* Adding error message when assert is violated.

* Fixing the free algorithm to handle times where the common prefix is
smaller.

* Apply suggestions from code review
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Update server/text_generation_server/layers/attention/common.py
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Fix disabling prefix caching - Fix windowing checks.

* Revert the Cohere tokenizer change (for now using a revision instead).

* Fmt.

---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

e415b690

26 Aug, 2024 1 commit

Fix: don't apply post layernorm in SiglipVisionTransformer (#2459) · 30be1884

drbh authored Aug 26, 2024

* Fix: don't apply post layernorm in SiglipVisionTransformer

This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see original transformers implementation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813).

This also makes Siglip consistent with the existing Clip implementation:

https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613



* fix: adjust pali gemma for post layer norm and small refactors

---------
Co-authored-by: Travis Addair <tgaddair@gmail.com>

30be1884

20 Aug, 2024 1 commit

Prefix caching (#2402) · b70ae096

Nicolas Patry authored Aug 20, 2024



* Prefix caching WIP

* Fixing prefix attention.

* Fixing flashinfer import.

* Fixing black.

* Fixing medusa (still wrong outputs, but functional).

* Just medusa values now.

* Fixing medusa without prefix caching.

* Fixing prefix caching.

* Medusa requires reshaping.

* Removing the logs.

* Remove router.nix

* Fixup:

- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.

* Update flake.lock

---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>

b70ae096

15 Aug, 2024 1 commit

Fixing exl2 and other quanize tests again. (#2419) · 57b34958

Nicolas Patry authored Aug 15, 2024

* Fixing exl2 and other quanize tests again.

* Mark exl2 as non release (so CI tests them, needs to be removed latet).

* Fixing exl2 (by disabling cuda graphs)

* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).

* Removing serde override.

* Go back to released exl2 and remove log.

* Adding warnings for deprecated bitsandbytes + upgrade info to warn.

57b34958

14 Aug, 2024 1 commit
- Upgrading exl2. (#2415) · f3b5c694
  Nicolas Patry authored Aug 14, 2024
```
* Upgrading exl2.

* Fixing the other pathways.

* Fix idefics.
```
  f3b5c694
13 Aug, 2024 2 commits
- fix: adds causal to attention params (#2408) · 1cebccc7
  drbh authored Aug 13, 2024
```
fix: adds causal to attention params to check when using flash attn v1
```
  1cebccc7
- add numa to improve cpu inference perf (#2330) · 59922f9b
  Wang, Yi authored Aug 13, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  59922f9b
12 Aug, 2024 5 commits

fix: include create_exllama_buffers and set_device for exllama (#2407) · 8a7749b8
drbh authored Aug 12, 2024

8a7749b8

fix: allocate tmp based on sgmv kernel if available (#2345) · 4c3f8a70

drbh authored Aug 12, 2024

* fix: allocate tmp based on sgmv kernel if available

* fix: re add copy build artifacts step for punica kernels

4c3f8a70

feat: validate template variables before apply and improve sliding wi… (#2403) · 155f9c98

drbh authored Aug 12, 2024

* feat: validate template variables before apply and improve sliding window check

* fix: improve missing template var test

155f9c98

Add support for prefix caching to the v3 router (#2392) · 8deeaca4

Daniël de Kok authored Aug 12, 2024

This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.

For now prefix caching is only enabled with `USE_PREFIX_CACHING=1`
in this case, the router will switch to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen prior. If a new prefill is a prefix of a previously-seen
prefil, the router will send a request with `prefix_len>0`, which
can be used by the backend to decide to reuse KV blocks from the
cache, rather than recomputing them.

Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are just ignored and not used.

8deeaca4

Fixing import exl2 (#2399) · 84bc3d7b
Nicolas Patry authored Aug 12, 2024

84bc3d7b

09 Aug, 2024 3 commits

Using an enum for flash backens (paged/flashdecoding/flashinfer) (#2385) · 7a48a847

Nicolas Patry authored Aug 09, 2024

* Using an enum for flash backens (paged/flashdecoding/flashinfer)

* Early exit on server too.

* Clippy.

* Fix clippy and fmt.

7a48a847

Update documentation for Supported models (#2386) · b2b9c427
Vaibhav Srivastav authored Aug 09, 2024
```
* Minor doc fixes

* up.

* Other minor updates.
```
b2b9c427

Add FlashInfer support (#2354) · 7830de15

Daniël de Kok authored Aug 09, 2024

This change adds support for FlashInfer. FlashInfer can be enabled using
`FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`.
Since this functionality is currently only for testing, FlashInfer is
not installed anywhere yet.

The FlashInfer API is quite different from FlashAttention/vLLM in that
it requires more global bookkeeping:

* A wrapper class needs to be contstructed (which we just call *state*).
  Since this is fairly expensive (due to pinned host memory allocation),
  we only do this once in a FlashCausalLM instance or for each CUDA
  Graph size.
* Each model forward call needs to be wrapped in `begin_forward` and
  `end_forward`. This sets up data structures that can be reused for all
  calls to attention for that forward call.

When calling attention, we need access to the state object. To avoid
passing an argument down the call chain (which would require changes to
all models), we use a context variable.

Each model forward call is wrapped using a context manager that does all
the bookkeeping for such a call:

* Set the context variable to the forward call's state.
* Call `begin_forward` on the state.
* Yield.
* Call `end_forward` on the state.
* Reset the context variable.

We cannot use a single shared global variable for this, since e.g. CUDA
Graphs of different sizes each have their own state.

7830de15

08 Aug, 2024 3 commits

fix: prefer hidden_activation over hidden_act in gemma2 (#2381) · f8521900
drbh authored Aug 08, 2024

f8521900

Pr 2337 ci branch (#2379) · 2ca59806

drbh authored Aug 08, 2024



* hotfix: fix xpu crash brought by code refine. torch.xpu rely on import ipex
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* reable gemma2 in xpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix in regression in ipex flashattention
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>

2ca59806

fix EleutherAI/gpt-neox-20b does not work in tgi (#2346) · 689b1abb
Wang, Yi authored Aug 09, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
689b1abb