Commits · f58eb70ebfe210a4813858a28d5d8b1221559cb8 · OpenDAS / text-generation-inference

23 Oct, 2024 1 commit
- Make moe-kernels and marlin-kernels mandatory in CUDA installs (#2632) · f58eb70e
  Daniël de Kok authored Oct 23, 2024
  
  f58eb70e
19 Oct, 2024 1 commit

Make handling of FP8 scales more consistent (#2666) · 5e0fb468

Daniël de Kok authored Oct 19, 2024

Change `fp8_quantize` so that we can pass around reciprocals everywhere,
so scales are always passed around in the checkpoint format.

I also noticed that we ignore any input scales that we might have when
fbgemm is available. Skip this path if we already have a scale.
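
As a rough illustration of the convention described above, the sketch below keeps the scale in the dequantization ("checkpoint") form, where `w ≈ q * scale`, and takes the reciprocal only at the point of quantizing. It is plain Python on floats and does not round to the FP8 grid. The name `fp8_quantize_sketch`, its signature, and the reading of "checkpoint format" are assumptions, not the TGI function.

```python
from typing import List, Optional, Tuple

E4M3FN_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def fp8_quantize_sketch(
    weights: List[float], scale: Optional[float] = None
) -> Tuple[List[float], float]:
    """Scale values into the e4m3fn range and return them with the scale.

    `scale` follows the assumed checkpoint convention: original ~= quantized * scale.
    Rounding to the FP8 grid is intentionally left out of this sketch.
    """
    if scale is None:
        # No scale supplied: derive one from the largest magnitude in the data.
        amax = max(max(abs(w) for w in weights), 1e-12)
        scale = amax / E4M3FN_MAX
    # A supplied scale is reused as-is; only its reciprocal is used to quantize.
    inv_scale = 1.0 / scale
    quantized = [max(-E4M3FN_MAX, min(E4M3FN_MAX, w * inv_scale)) for w in weights]
    return quantized, scale


q, s = fp8_quantize_sketch([0.5, -2.0, 1.0])
restored = [v * s for v in q]  # dequantize with the checkpoint-style scale
```

Because the returned scale is in the same form as the one accepted as input, a scale loaded from a checkpoint can be passed straight through without converting it first.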

5e0fb468

18 Oct, 2024 1 commit

CI job. Gpt awq 4 (#2665) · 153ff374

Nicolas Patry authored Oct 18, 2024



* add gptq and awq int4 support in intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix ci failure
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* set kv cache dtype
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* refine the code according to the review comments
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Simplifying conditionals + reverting integration tests values.

* Unused import

* Fix redundant import.

* Revert change after rebase.

* Upgrading the tests (TP>1 fix changes to use different kernels.)

* Update server/text_generation_server/layers/gptq/__init__.py

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>

153ff374

17 Oct, 2024 4 commits

Break cycle between the attention implementations and KV cache (#2627) · 8ec57558
Daniël de Kok authored Oct 17, 2024

8ec57558

fix: prefer inplace softmax to avoid copy (#2661) · 5f32dea1

drbh authored Oct 17, 2024



* fix: prefer inplace softmax to avoid copy

* Update server/text_generation_server/models/flash_causal_lm.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

5f32dea1

Simplify the `attention` function (#2609) · 59ea38cb

Daniël de Kok authored Oct 17, 2024

* Simplify the `attention` function

- Use one definition rather than multiple.
- Add `key`/`value` arguments, so that we don't need the
  `PREFILL_IN_KVCACHE` constant.
- Make it kwargs-only (to avoid mixing up the various `Tensor` args).

* Fixup flashinfer support
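
The keyword-only point above can be sketched with a toy, pure-Python attention over lists. This is not the TGI `attention` function; the parameter names and the single-query shape are assumptions chosen for illustration. The bare `*` in the signature is what forces callers to name each argument, so similarly shaped inputs cannot be swapped silently.

```python
import math
from typing import List, Sequence


def attention(
    *,
    query: Sequence[float],
    key: Sequence[Sequence[float]],
    value: Sequence[Sequence[float]],
    softmax_scale: float,
) -> List[float]:
    """Toy single-query attention; the bare `*` makes every argument keyword-only."""
    scores = [softmax_scale * sum(q * k for q, k in zip(query, row)) for row in key]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(value[0])
    return [
        sum(w * row[i] for w, row in zip(weights, value)) / total for i in range(dim)
    ]


out = attention(
    query=[1.0, 0.0],
    key=[[1.0, 0.0], [0.0, 1.0]],
    value=[[1.0, 2.0], [3.0, 4.0]],
    softmax_scale=1.0,
)

try:
    attention([1.0, 0.0], [[1.0, 0.0]], [[1.0, 2.0]], 1.0)  # positional call
except TypeError:
    positional_rejected = True
else:
    positional_rejected = False
```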

59ea38cb

Support `e4m3fn` KV cache (#2655) · 5bbe1ce0
Daniël de Kok authored Oct 17, 2024
```
* Support `e4m3fn` KV cache

* Make check more obvious
```
5bbe1ce0

16 Oct, 2024 2 commits

feat: prefill chunking (#2600) · a6a0c97e

OlivierDehaene authored Oct 16, 2024



* wip

* rollback

* refactor to use prefix/postfix naming + fix all_input_ids_tensor

* maybe patching vlms?

* fix filter and concat

* wip, no filter, no concat

* current

* add prepare_for_prefill

* working

* load tested

* re-create slots

* re-create slots

* fix slot_filtering_indices

* feedback loop

* remove log

* fix benchmarker

* fix vlm and seq2seq

* rename to cache and input lengths

* fix prefill logprobs

* fix launcher

* fix logprobs?

* idk at this point

* max input length

* omfg

* remove debugging lines

* fix tests

* fix mllama

* fix cargo tests

* remove support chunking for paged

* Fixing non blocked attentions

* Fixing dtype + AMD, Ipex targets.

* lint fix.

* rename

* Fix prefix_caching variable, remove defaults in server (confusing a lot
of the time).

* Add simple resolution when user specifies ATTENTION=paged.

* Put back non default simple tests.

* Fix env name

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

a6a0c97e

Fp8 e4m3_fnuz support for rocm (#2588) · 704a58c8

Mohit Sharma authored Oct 16, 2024

* (feat) fp8 fnuz support for rocm

* (review comments) Fix compression_config load, type hints

* (bug) update all has_tensor

* (review_comments) fix typo and added comments

* (nit) improved comment

704a58c8

15 Oct, 2024 1 commit
- Fixing linters. (#2650) · cf04a43f
  Nicolas Patry authored Oct 15, 2024
  
  cf04a43f
14 Oct, 2024 1 commit

feat: enable pytorch xpu support for non-attention models (#2561) · 58848cb4

Dmitry Rogozhkin authored Oct 14, 2024



The XPU backend is available natively (without IPEX) in pytorch starting
from pytorch 2.4. This commit extends TGI to cover the case where the user
has XPU support through pytorch 2.4 but does not have IPEX installed.
Models which don't require attention can work. For models that require
attention, more work is needed to provide an attention implementation.
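
The support matrix described above can be written as a small decision function. This is a sketch only: the function and parameter names are hypothetical, and treating an installed IPEX as covering both kinds of model is an assumption, not something this commit message states.

```python
def can_serve_on_xpu(
    *,
    torch_xpu_available: bool,
    ipex_installed: bool,
    needs_attention_kernels: bool,
) -> bool:
    """Return True if a model could be served on XPU under the stated rules."""
    if ipex_installed:
        # Assumption: the IPEX path provides the attention implementation.
        return True
    if torch_xpu_available:
        # Native pytorch XPU without IPEX: only non-attention models work.
        return not needs_attention_kernels
    return False


native_non_attention = can_serve_on_xpu(
    torch_xpu_available=True, ipex_installed=False, needs_attention_kernels=False
)
native_attention = can_serve_on_xpu(
    torch_xpu_available=True, ipex_installed=False, needs_attention_kernels=True
)
```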

Tested with the following models:
* teknium/OpenHermes-2.5-Mistral-7B
* bigscience/bloom-560m
* google/gemma-7b
* google/flan-t5-xxl
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>

58848cb4

11 Oct, 2024 1 commit
- Fixing intel Supports windowing. (#2637) · 0c478846
  Nicolas Patry authored Oct 11, 2024
  
  0c478846
09 Oct, 2024 1 commit

nix: add black and isort to the closure (#2619) · 9ed0c85f

Daniël de Kok authored Oct 09, 2024

To make sure that everything is formatted with the same black version
as CI.

I sometimes use isort for new files to get nicely ordered imports,
so add it as well. Also set the isort configuration to format in a
way that is compatible with black.

9ed0c85f

08 Oct, 2024 2 commits
- Add support for fused MoE Marlin for AWQ (#2616) · 64142489
  Daniël de Kok authored Oct 08, 2024
```
* Add support for fused MoE Marlin for AWQ

This uses the updated MoE Marlin kernels from vLLM.

* Add integration test for AWQ MoE
```
  64142489
- Upgrade minor rust version (Fixes rust build compilation cache) (#2617) · 8b295aa4
  Nicolas Patry authored Oct 08, 2024
```
* Upgrade minor rust version (Fixes rust build compilation cache)

* Black
```
  8b295aa4
07 Oct, 2024 2 commits
- enable mllama in intel platform (#2610) · 57f9685d
  Wang, Yi authored Oct 08, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  57f9685d
- Fix FP8 KV-cache condition (#2611) · 0da4df4b
  Florian Zimmermeister authored Oct 07, 2024
```
Update kv_cache.py
```
  0da4df4b
04 Oct, 2024 1 commit

Add basic FP8 KV cache support (#2603) · 2358c2bb

Daniël de Kok authored Oct 04, 2024

* Add basic FP8 KV cache support

This change adds rudimentary FP8 KV cache support. The support is
enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
uses this type for the KV cache. However support is still limited:

* Only the `fp8_e5m2` type is supported.
* The KV cache layout is the same as `float16`/`bfloat16` (HND).
* The FP8 KV cache is only supported for FlashInfer.
* Loading of scales is not yet supported.

* Fix Cargo.toml
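
To give a sense of what an FP8 KV cache saves, the arithmetic below compares bytes per cached token for `float16` and `fp8_e5m2`. The model shape (32 layers, 8 KV heads, head dimension 128) is a made-up example, and the helper name is hypothetical; only the element sizes (2 bytes versus 1 byte) are fixed facts.

```python
BYTES_PER_ELEMENT = {"float16": 2, "bfloat16": 2, "fp8_e5m2": 1}


def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype: str
) -> int:
    """Bytes needed to cache one token's keys and values across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * BYTES_PER_ELEMENT[dtype]


# Hypothetical model shape, used only to make the numbers concrete.
fp16_bytes = kv_cache_bytes_per_token(32, 8, 128, "float16")
fp8_bytes = kv_cache_bytes_per_token(32, 8, 128, "fp8_e5m2")
```

For this example shape the cache shrinks from 131072 to 65536 bytes per token, so the same memory budget holds roughly twice as many tokens.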

2358c2bb

02 Oct, 2024 1 commit

Mllama flash version (#2585) · d18ed5cf

Nicolas Patry authored Oct 02, 2024

* Working loading state.

* Preprocessing.

* Working state ? (Broke idefics1 temporarily).

* Cleaner condition.

* Fix idefics.

* Updating config, removing TODO

* Mllama

* Upgrade transformers 4.45

* Flashing mllama.

* Starting to get there.

* Working state.

* Integration tests for mllama (cutting to 10 tokens because there seems
to be instability after that, meaning the size of the batch matters).

* Updating model link.

* Earlier assert.

* Fix vlm ?

* remove log.

* Force ignore all images but last.

* Default dtype bfloat16.

* Update integration test after switch to bf16.

* Remove dead code.

* Removed dead code.

* Upgrade the flake to latest transformers/tokenizers

* Move to hf tgi-nix

* Upgrade to 0.5.0

d18ed5cf

30 Sep, 2024 4 commits

MoE Marlin: support `desc_act` for `groupsize != -1` (#2590) · 1c84a30f
Daniël de Kok authored Sep 30, 2024
```
This change uses the updated Marlin MoE kernel from vLLM to support
MoE with activation sorting and groups.
```
1c84a30f

feat: support phi3.5 moe (#2479) · 93a7042d

drbh authored Sep 30, 2024



* feat: support phi3.5 moe model loading

* fix: prefer llama base model and improve rotary logic

* feat: return reasonable generation and add integration test

* fix: run lint and update docs

* fix: rerun lint for openapi docs

* fix: prefer do_sample false unless temp is set by user, and update chat tests

* fix: small typo adjustments

* fix: consolidate long rope paths

* fix: revert greedy by default and test changes

* Vendor configuration so that we don't have to `trust_remote_code`

* Use SparseMoELayer

* Add support for dense MoE

* Some type annotations

* Add the usual model tests

* Ruff.

---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

93a7042d

Add support for GPTQ-quantized MoE models using MoE Marlin (#2557) · 90a1d04a

Daniël de Kok authored Sep 30, 2024

This change adds support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:

- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
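
The three restrictions above, as they stood at this commit, can be read as a single predicate. The sketch below is illustrative: the function name and the parameter names (`quant_method`, `sym`, `tensor_parallel_size`) are assumptions rather than TGI's own identifiers.

```python
def gptq_moe_supported(
    *,
    quant_method: str,
    sym: bool,
    desc_act: bool,
    group_size: int,
    tensor_parallel_size: int,
) -> bool:
    """Check a quantized MoE checkpoint against the restrictions listed above."""
    if quant_method != "gptq":
        return False  # no AWQ
    if not sym:
        return False  # no asymmetric quantization
    if desc_act and tensor_parallel_size > 1 and group_size != -1:
        return False  # desc_act with tensor parallelism needs group_size=-1
    return True


ok = gptq_moe_supported(
    quant_method="gptq", sym=True, desc_act=True, group_size=-1, tensor_parallel_size=2
)
```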

90a1d04a

Update ROCM libs and improvements (#2579) · f9e561ec

Mohit Sharma authored Sep 30, 2024

* style

* update torch

* fix issues

* fix clone

* revert mkl

* added custom PA

* style

* fix style

* style

* hide env var

* fix mixtral model

* add skinny kernel and merge fixes

* fixed style

* fix issue for sliding window models

* addressed review comments

* fix import

* improved error message

* updated default value

* remove import

* fix imports after rebase

* float16 dep

* improve dockerfile

* cleaned dockerfile

f9e561ec

28 Sep, 2024 1 commit
- flashinfer: pass window size and dtype (#2574) · 1028996f
  Daniël de Kok authored Sep 28, 2024
  
  1028996f
27 Sep, 2024 1 commit

Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2

Daniël de Kok authored Sep 27, 2024

* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention on GPUs with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s
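
The first two rules in the list above can be sketched as a small adjustment step. This is a hedged illustration, not the launcher's logic: the function name, the parameters, and the use of the strings `"paged"` and `"flashinfer"` as backend labels are assumptions, and cases the commit message does not describe are left unchanged.

```python
from typing import Tuple


def adjust_for_old_gpu(
    *,
    capability_major: int,
    can_use_flashinfer: bool,
    attention: str,
    prefix_caching: bool,
) -> Tuple[str, bool]:
    """Apply the two stated rules and return (attention, prefix_caching)."""
    if not can_use_flashinfer and capability_major < 8:
        # Older GPU and no flashinfer: fall back to flash-attn v1 + paged attention.
        attention = "paged"
    if attention == "paged":
        # Prefix caching is disabled whenever paged attention is in use.
        prefix_caching = False
    return attention, prefix_caching


old_gpu = adjust_for_old_gpu(
    capability_major=7,
    can_use_flashinfer=False,
    attention="flashinfer",
    prefix_caching=True,
)
```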

5b6b74e2

26 Sep, 2024 1 commit
- Add LoRA adapters support for Gemma2 (#2567) · 0b7df771
  Alvaro Bartolome authored Sep 26, 2024
```
* Add LoRA adapters support for Gemma2

* Make `black` formatting happy
```
  0b7df771
24 Sep, 2024 4 commits

More tensor cores. (#2558) · dd8691b7
Nicolas Patry authored Sep 24, 2024
```
* More tensor cores.

* Fixing the logic.

* Gemma is modified by this.
```
dd8691b7
Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 (#2537) · 3f14cd14
Daniël de Kok authored Sep 24, 2024
```
This replaces the custom layers in both models.
```
3f14cd14

Add support for scalar FP8 weight scales (#2550) · c29dc89c

Daniël de Kok authored Sep 24, 2024

* Add support for scalar FP8 weight scales

* Support LLM compressor FP8 checkpoints on H100

On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
However, we wouldn't pick up fp8 quantization for models quantized with
LLM compressor. This change adds enough parsing to detect if models have
FP8-quantized weights.

* Remove stray debug print

c29dc89c

Micro cleanup. (#2555) · 74d3ce10
Nicolas Patry authored Sep 24, 2024

74d3ce10

20 Sep, 2024 1 commit
- hotfix: ipex fails since cuda moe kernel is not supported (#2532) · f478aa77
  Wang, Yi authored Sep 20, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  f478aa77
19 Sep, 2024 1 commit
- Update to moe-kernels 0.3.1 (#2535) · c1037601
  Daniël de Kok authored Sep 19, 2024
```
* Update to moe-kernels 0.3.1

* Attempt to fix apt failure
```
  c1037601
17 Sep, 2024 1 commit

Move to moe-kernels package and switch to common MoE layer (#2511) · ce85efa9

Daniël de Kok authored Sep 17, 2024

* Move to moe-kernels package and switch to common MoE layer

This change introduces the new `moe-kernels` package:

- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
  models.
- Port over Mixtral and Deepseek.

* Make `cargo check` pass

* Update runner

ce85efa9

12 Sep, 2024 2 commits
- hotfix : enable intel ipex cpu and xpu in python3.11 (#2517) · 3ac7df2b
  Wang, Yi authored Sep 12, 2024
```
enable intel ipex cpu and xpu in python3.11
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  3ac7df2b
- fix: pass missing revision arg for lora adapter when loading multiple… (#2510) · 628334d3
  drbh authored Sep 12, 2024
```
fix: pass missing revision arg for lora adapter when loading multiple adapters
```
  628334d3
11 Sep, 2024 2 commits

Fix tokenization yi (#2507) · dae3bf1d

Nicolas Patry authored Sep 11, 2024

* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?

dae3bf1d

Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) · a4e3e8c6

Nicolas Patry authored Sep 11, 2024



* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really flashinfer version ?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet ?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------
Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>

a4e3e8c6

07 Sep, 2024 1 commit
- Add Directory Check to Prevent Redundant Cloning in Build Process (#2486) · eabbbbda
  Vallepu Vamsi Krishna authored Sep 07, 2024
```
Update Makefile-fbgemm

Added Directory check for FBGEMM repository cloning.
```
  eabbbbda
06 Sep, 2024 2 commits
- hotfix: add syrupy to the right subproject (#2499) · a3c9c62d
  Daniël de Kok authored Sep 06, 2024
  
  a3c9c62d
- Fix incompatibility with latest `syrupy` and update in Poetry (#2497) · 2eb57a15
  Daniël de Kok authored Sep 06, 2024
  
  2eb57a15