Commits · 0c9b6cdd768558652afdf5e5053aeb49bf4bc21f · OpenDAS / text-generation-inference

28 Oct, 2024 1 commit

Choosing input/total tokens automatically based on available VRAM? (#2673) · 0c9b6cdd

Nicolas Patry authored Oct 28, 2024

* Choosing input/total tokens automatically based on available VRAM?

* Update doc.

* Remove generated files.

* Trying to fix non chunking targets.

* Attempt #2

* fix.

* QuantLinear is rocm compatible.

* Much simpler logic after the overhead.

* Updating logic + non flash.

* Revert doc text.

* Simple updates.

* Fix integration mt0 (transformers update).

0c9b6cdd

25 Oct, 2024 3 commits

feat: add triton kernels to decrease latency of large batches (#2687) · 6f88bd93

OlivierDehaene authored Oct 25, 2024

* feat: add triton kernels to decrease latency of large batches

* cast to int32

* fix kernel

* fix kernel

* disable triton on rocm

* fix speculation

* add slots filtering kernel

6f88bd93

Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels (#2688) · 0f346a32

Daniël de Kok authored Oct 25, 2024

* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels

Performance and accuracy of these kernels are on par (tested with Llama
70B and 405B). Removes a dependency and resolves some stability issues
we have been seeing.

* Update test snapshots

0f346a32

Fixing rocm gptq by using triton code too (renamed cuda into triton). (#2691) · cece8635
Nicolas Patry authored Oct 25, 2024

cece8635

24 Oct, 2024 2 commits

Add support for FP8 KV cache scales (#2628) · eab07f74

Daniël de Kok authored Oct 24, 2024

* Add support for FP8 KV cache scales

Since FP8 only has limited dynamic range, we can scale keys/values
before storing them into the cache (and unscale them in attention). To
avoid rescaling the cache as the absmax values change, good scales are
usually determined per layer using calibration data and stored
in the checkpoint.

This change adds support for using key-value scales and loading them
from checkpoints in the two most common formats:

- Separate per-layer `k_scale` and `v_scale` scalars.
- Per-layer `kv_scale` scalar (older format).

Currently, scales are only used with a `float8_e4m3fn` cache.

Besides adding support for key/value scales, the `fp8_quantize` function
is also extended to support quantization with a kernel vendored from
vLLM. This is slightly faster than the PyTorch implementation, but also
scales in FP32, potentially improving accuracy.
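
As a rough illustration of the two checkpoint layouts listed above, the sketch below resolves per-layer key/value scales from a flat mapping of scalars. The names `resolve_kv_scales`, `scale_for_cache` and `unscale_from_cache`, the `prefix` naming, the fallback of 1.0, and the direction of the scaling (divide on store, multiply on read) are assumptions for illustration, not TGI's actual loader.

```python
from typing import Mapping, Tuple


def resolve_kv_scales(scalars: Mapping[str, float], prefix: str) -> Tuple[float, float]:
    """Return (k_scale, v_scale) for one layer.

    `scalars` stands in for checkpoint scalars already read as floats.
    """
    k_name, v_name = f"{prefix}.k_scale", f"{prefix}.v_scale"
    if k_name in scalars and v_name in scalars:
        # Newer layout: separate key and value scales.
        return scalars[k_name], scalars[v_name]
    kv_name = f"{prefix}.kv_scale"
    if kv_name in scalars:
        # Older layout: one scalar shared by keys and values.
        shared = scalars[kv_name]
        return shared, shared
    # Assumed fallback when the checkpoint has no scales: leave values unscaled.
    return 1.0, 1.0


def scale_for_cache(value: float, scale: float) -> float:
    """Scale a key/value element before it is stored in the cache (assumed direction)."""
    return value / scale


def unscale_from_cache(stored: float, scale: float) -> float:
    """Undo the scaling when the element is read back in attention."""
    return stored * scale
```

For example, `resolve_kv_scales({"layer0.kv_scale": 0.5}, "layer0")` yields the same scale for keys and values.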

* Update FP8 KV cache test to use checkpoint with scales

* `can_scale`: check that the attention is flashinfer

eab07f74

flashinfer: reminder to remove contiguous call in the future (#2685) · 1b914f37
Daniël de Kok authored Oct 24, 2024

1b914f37

23 Oct, 2024 2 commits
- hotfix: fix flashllama · 27ff1871
  OlivierDehaene authored Oct 23, 2024
  
  27ff1871
- feat: natively support Granite models (#2682) · 03c9388b
  OlivierDehaene authored Oct 23, 2024
```
* feat: natively support Granite models

* Update doc
```
  03c9388b
19 Oct, 2024 1 commit

Make handling of FP8 scales more consistent (#2666) · 5e0fb468

Daniël de Kok authored Oct 19, 2024

Change `fp8_quantize` so that we can pass around reciprocals everywhere,
so scales are always passed around in the checkpoint format.

I also noticed that we ignore any input scales that we might have when
fbgemm is available. Skip this path if we already have a scale.
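
A toy sketch of the idea described above, assuming that "reciprocal" means the multiplier that dequantizes a value (the convention checkpoints typically store). The function names are hypothetical, and only scaling and clamping are modelled, not real FP8 rounding.

```python
from typing import List, Optional, Tuple

FP8_E4M3FN_MAX = 448.0  # largest finite float8_e4m3fn value


def quantize(values: List[float], reciprocal_scale: Optional[float] = None) -> Tuple[List[float], float]:
    """Return (quantized, reciprocal_scale).

    If a scale is already provided (e.g. loaded from a checkpoint), it is
    reused instead of being recomputed from the data.
    """
    if reciprocal_scale is None:
        absmax = max(abs(v) for v in values) or 1.0  # avoid a zero scale
        reciprocal_scale = absmax / FP8_E4M3FN_MAX
    quantized = [
        max(-FP8_E4M3FN_MAX, min(FP8_E4M3FN_MAX, v / reciprocal_scale))
        for v in values
    ]
    return quantized, reciprocal_scale


def dequantize(quantized: List[float], reciprocal_scale: float) -> List[float]:
    """Multiply by the reciprocal scale to recover approximate values."""
    return [q * reciprocal_scale for q in quantized]
```

Because the same reciprocal is returned and accepted, callers never need to invert a scale when moving between quantization and checkpoint loading.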

5e0fb468

18 Oct, 2024 1 commit

CI job. Gpt awq 4 (#2665) · 153ff374

Nicolas Patry authored Oct 18, 2024



* add gptq and awq int4 support in intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix ci failure
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* set kv cache dtype
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* refine the code according to the review comments
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Simplifying conditionals + reverting integration tests values.

* Unused import

* Fix redundant import.

* Revert change after rebase.

* Upgrading the tests (TP>1 fix changes to use different kernels.)

* Update server/text_generation_server/layers/gptq/__init__.py

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>

153ff374

17 Oct, 2024 4 commits

Break cycle between the attention implementations and KV cache (#2627) · 8ec57558
Daniël de Kok authored Oct 17, 2024

8ec57558

fix: prefer inplace softmax to avoid copy (#2661) · 5f32dea1

drbh authored Oct 17, 2024



* fix: prefer inplace softmax to avoid copy

* Update server/text_generation_server/models/flash_causal_lm.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

5f32dea1

Simplify the `attention` function (#2609) · 59ea38cb

Daniël de Kok authored Oct 17, 2024

* Simplify the `attention` function

- Use one definition rather than multiple.
- Add `key`/`value` arguments, so that we don't need the
  `PREFILL_IN_KVCACHE` constant.
- Make it kwargs-only (to avoid mixing up the various `Tensor` args).

* Fixup flashinfer support

59ea38cb

Support `e4m3fn` KV cache (#2655) · 5bbe1ce0
Daniël de Kok authored Oct 17, 2024
```
* Support `e4m3fn` KV cache

* Make check more obvious
```
5bbe1ce0

16 Oct, 2024 2 commits

feat: prefill chunking (#2600) · a6a0c97e

OlivierDehaene authored Oct 16, 2024



* wip

* rollback

* refactor to use prefix/postfix naming + fix all_input_ids_tensor

* maybe patching vlms?

* fix filter and concat

* wip, no filter, no concat

* current

* add prepare_for_prefill

* working

* load tested

* re-create slots

* re-create slots

* fix slot_filtering_indices

* feedback loop

* remove log

* fix benchmarker

* fix vlm and seq2seq

* rename to cache and input lengths

* fix prefill logprobs

* fix launcher

* fix logprobs?

* idk at this point

* max input length

* omfg

* remove debugging lines

* fix tests

* fix mllama

* fix cargo tests

* remove support chunking for paged

* Fixing non blocked attentions

* Fixing dtype + AMD, Ipex targets.

* lint fix.

* rename

* Fix prefix_caching variable, remove defaults in server (they are often
confusing).

* Add simple resolution when user specifies ATTENTION=paged.

* Put back non default simple tests.

* Fix env name

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

a6a0c97e

Fp8 e4m3_fnuz support for rocm (#2588) · 704a58c8

Mohit Sharma authored Oct 16, 2024

* (feat) fp8 fnuz support for rocm

* (review comments) Fix compression_config load, type hints

* (bug) update all has_tensor

* (review_comments) fix typo and added comments

* (nit) improved comment

704a58c8

15 Oct, 2024 1 commit
- Fixing linters. (#2650) · cf04a43f
  Nicolas Patry authored Oct 15, 2024
  
  cf04a43f
14 Oct, 2024 1 commit

feat: enable pytorch xpu support for non-attention models (#2561) · 58848cb4

Dmitry Rogozhkin authored Oct 14, 2024



The XPU backend is available natively (without IPEX) in pytorch starting
from pytorch 2.4. This commit extends TGI to cover the case when the user
has XPU support through pytorch 2.4, but does not have IPEX installed.
Models which don't require attention can work. Models that require
attention need more work to provide an attention implementation.

Tested with the following models:
* teknium/OpenHermes-2.5-Mistral-7B
* bigscience/bloom-560m
* google/gemma-7b
* google/flan-t5-xxl
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
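
The case this commit message describes (XPU available through PyTorch, IPEX not installed) can be detected roughly as follows. This is a sketch under assumptions: the function name and the returned labels are hypothetical, and the check only relies on `torch.xpu.is_available()` and on whether the `intel_extension_for_pytorch` package can be found.

```python
import importlib.util


def detect_xpu_mode() -> str:
    """Classify the local environment; the labels are illustrative only."""
    if importlib.util.find_spec("torch") is None:
        return "no-torch"
    import torch

    # Guard with hasattr so older PyTorch builds without torch.xpu still work.
    has_xpu = hasattr(torch, "xpu") and torch.xpu.is_available()
    if not has_xpu:
        return "no-xpu"
    has_ipex = importlib.util.find_spec("intel_extension_for_pytorch") is not None
    return "xpu-ipex" if has_ipex else "xpu-native"
```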

58848cb4

11 Oct, 2024 1 commit
- Fixing intel Supports windowing. (#2637) · 0c478846
  Nicolas Patry authored Oct 11, 2024
  
  0c478846
08 Oct, 2024 2 commits
- Add support for fused MoE Marlin for AWQ (#2616) · 64142489
  Daniël de Kok authored Oct 08, 2024
```
* Add support for fused MoE Marlin for AWQ

This uses the updated MoE Marlin kernels from vLLM.

* Add integration test for AWQ MoE
```
  64142489
- Upgrade minor rust version (Fixes rust build compilation cache) (#2617) · 8b295aa4
  Nicolas Patry authored Oct 08, 2024
```
* Upgrade minor rust version (Fixes rust build compilation cache)

* Black
```
  8b295aa4
07 Oct, 2024 2 commits
- enable mllama in intel platform (#2610) · 57f9685d
  Wang, Yi authored Oct 08, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  57f9685d
- Fix FP8 KV-cache condition (#2611) · 0da4df4b
  Florian Zimmermeister authored Oct 07, 2024
```
Update kv_cache.py
```
  0da4df4b
04 Oct, 2024 1 commit

Add basic FP8 KV cache support (#2603) · 2358c2bb

Daniël de Kok authored Oct 04, 2024

* Add basic FP8 KV cache support

This change adds rudimentary FP8 KV cache support. The support is
enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
uses this type for the KV cache. However, support is still limited:

* Only the `fp8_e5m2` type is supported.
* The KV cache layout is the same as `float16`/`bfloat16` (HND).
* The FP8 KV cache is only supported for FlashInfer.
* Loading of scales is not yet supported.
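
For a sense of why an FP8 KV cache is attractive, the arithmetic below estimates cache size per token. The helper name and the model shape (32 layers, 8 KV heads, head dimension 128) are assumptions chosen for illustration, not values taken from this commit.

```python
BYTES_PER_ELEMENT = {"float16": 2, "bfloat16": 2, "fp8_e5m2": 1}


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype: str) -> int:
    """Bytes needed to cache one token: a key and a value tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * BYTES_PER_ELEMENT[dtype]


# Hypothetical model shape, used only to make the numbers concrete.
half = kv_cache_bytes_per_token(32, 8, 128, "float16")
fp8 = kv_cache_bytes_per_token(32, 8, 128, "fp8_e5m2")
print(half, fp8)  # 131072 65536
```

With the same layout, one byte per element instead of two halves the cache footprint, so roughly twice as many tokens fit in the same memory.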

* Fix Cargo.toml

2358c2bb

02 Oct, 2024 1 commit

Mllama flash version (#2585) · d18ed5cf

Nicolas Patry authored Oct 02, 2024

* Working loading state.

* Preprocessing.

* Working state ? (Broke idefics1 temporarily).

* Cleaner condition.

* Fix idefics.

* Updating config, removing TODO

* Mllama

* Upgrade transformers 4.45

* Flashing mllama.

* Starting to get there.

* Working state.

* Integration tests for mllama (cutting to 10 tokens because there seems
to be instability after that, meaning the size of the batch matters).

* Updating model link.

* Earlier assert.

* Fix vlm ?

* remove log.

* Force ignore all images but last.

* Default dtype bfloat16.

* Update integration test after switch to bf16.

* Remove dead code.

* Removed dead code.

* Upgrade the flake to latest transformers/tokenizers

* Move to hf tgi-nix

* Upgrade to 0.5.0

d18ed5cf

30 Sep, 2024 4 commits

MoE Marlin: support `desc_act` for `groupsize != -1` (#2590) · 1c84a30f
Daniël de Kok authored Sep 30, 2024
```
This change uses the updated Marlin MoE kernel from vLLM to support
MoE with activation sorting and groups.
```
1c84a30f

feat: support phi3.5 moe (#2479) · 93a7042d

drbh authored Sep 30, 2024



* feat: support phi3.5 moe model loading

* fix: prefer llama base model and improve rotary logic

* feat: return reasonable generation and add integration test

* fix: run lint and update docs

* fix: rerun lint for openapi docs

* fix: prefer do_sample false unless temp is set by user, and update chat tests

* fix: small typo adjustments

* fix: consolidate long rope paths

* fix: revert greedy by default and test changes

* Vendor configuration so that we don't have to `trust_remote_code`

* Use SparseMoELayer

* Add support for dense MoE

* Some type annotations

* Add the usual model tests

* Ruff.

---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

93a7042d

Add support for GPTQ-quantized MoE models using MoE Marlin (#2557) · 90a1d04a

Daniël de Kok authored Sep 30, 2024

This change adds support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:

- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
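
The restrictions above can be read as a simple predicate. The sketch below is one possible encoding as of this commit; the function name and the `tensor_parallel_size` parameter are hypothetical, while `desc_act`, `group_size` and `sym` follow the usual GPTQ configuration field names.

```python
def gptq_moe_supported(
    *,
    quant_method: str,
    desc_act: bool,
    group_size: int,
    sym: bool,
    tensor_parallel_size: int,
) -> bool:
    """Return True if a MoE checkpoint fits the restrictions listed above."""
    if quant_method != "gptq":
        return False  # AWQ (and anything else) is not covered
    if not sym:
        return False  # asymmetric quantization is not covered
    if desc_act and tensor_parallel_size > 1 and group_size != -1:
        return False  # desc_act with tensor parallelism needs group_size == -1
    return True
```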

90a1d04a

Update ROCM libs and improvements (#2579) · f9e561ec

Mohit Sharma authored Sep 30, 2024

* style

* update torch

* fix issues

* fix clone

* revert mkl

* added custom PA

* style

* fix style

* style

* hide env var

* fix mixtral model

* add skinny kernel and merge fixes

* fixed style

* fix issue for sliding window models

* addressed review comments

* fix import

* improved error message

* updated default value

* remove import

* fix imports after rebase

* float16 dep

* improve dockerfile

* cleaned dockerfile

f9e561ec

28 Sep, 2024 1 commit
- flashinfer: pass window size and dtype (#2574) · 1028996f
  Daniël de Kok authored Sep 28, 2024
  
  1028996f
27 Sep, 2024 1 commit

Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2

Daniël de Kok authored Sep 27, 2024

* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention on GPUs with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.
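
The first two rules above could be sketched as a small resolution step. This is an assumption-laden illustration: the function name, its parameters and the backend labels are hypothetical and do not mirror the launcher's actual code.

```python
from typing import Tuple


def resolve_attention(
    *,
    attention: str,
    prefix_caching: bool,
    can_use_flashinfer: bool,
    capability_major: int,
) -> Tuple[str, bool]:
    """Adjust the requested attention backend and prefix caching setting."""
    if not can_use_flashinfer and capability_major < 8:
        # Older GPUs fall back to flash-attn v1 with paged attention.
        attention = "paged"
    if attention == "paged":
        # Prefix caching is disabled whenever paged attention is used.
        prefix_caching = False
    return attention, prefix_caching
```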

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s

5b6b74e2

26 Sep, 2024 1 commit
- Add LoRA adapters support for Gemma2 (#2567) · 0b7df771
  Alvaro Bartolome authored Sep 26, 2024
```
* Add LoRA adapters support for Gemma2

* Make `black` formatting happy
```
  0b7df771
24 Sep, 2024 4 commits

More tensor cores. (#2558) · dd8691b7
Nicolas Patry authored Sep 24, 2024
```
* More tensor cores.

* Fixing the logic.

* Gemma is modified by this.
```
dd8691b7
Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 (#2537) · 3f14cd14
Daniël de Kok authored Sep 24, 2024
```
This replaces the custom layers in both models.
```
3f14cd14

Add support for scalar FP8 weight scales (#2550) · c29dc89c

Daniël de Kok authored Sep 24, 2024

* Add support for scalar FP8 weight scales

* Support LLM compressor FP8 checkpoints on H100

On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
However, we wouldn't pick up fp8 quantization for models quantized with
LLM compressor. This change adds enough parsing to detect if models have
FP8-quantized weights.

* Remove stray debug print

c29dc89c

Micro cleanup. (#2555) · 74d3ce10
Nicolas Patry authored Sep 24, 2024

74d3ce10

20 Sep, 2024 1 commit
- hotfix: ipex fails since cuda moe kernel is not supported (#2532) · f478aa77
  Wang, Yi authored Sep 20, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  f478aa77
17 Sep, 2024 1 commit

Move to moe-kernels package and switch to common MoE layer (#2511) · ce85efa9

Daniël de Kok authored Sep 17, 2024

* Move to moe-kernels package and switch to common MoE layer

This change introduces the new `moe-kernels` package:

- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
  models.
- Port over Mixtral and Deepseek.

* Make `cargo check` pass

* Update runner

ce85efa9

12 Sep, 2024 2 commits
- hotfix : enable intel ipex cpu and xpu in python3.11 (#2517) · 3ac7df2b
  Wang, Yi authored Sep 12, 2024
```
enable intel ipex cpu and xpu in python3.11
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  3ac7df2b
- fix: pass missing revision arg for lora adapter when loading multiple… (#2510) · 628334d3
  drbh authored Sep 12, 2024
```
fix: pass missing revision arg for lora adapter when loading multiple adapters
```
  628334d3