1. 26 Nov, 2024 1 commit
  2. 25 Nov, 2024 1 commit
  3. 21 Nov, 2024 1 commit
  4. 20 Nov, 2024 1 commit
  5. 19 Nov, 2024 2 commits
  6. 18 Nov, 2024 4 commits
  7. 17 Nov, 2024 1 commit
    • Remove vLLM dependency for CUDA (#2751) · 52e48739
      Daniël de Kok authored
      * Remove vLLM dependency for CUDA
      
      This change adds `attention-kernels` as a dependency for paged
      attention and cache reshaping. With that, we don't use vLLM
      anywhere for CUDA.
      
      Tested run (since we don't have paged attention in CI):
      
      ```
      ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
      [...]
      5 snapshots passed.
      ```
      
      * Fix clippy warning
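
      To make the "cache reshaping" part concrete, the operation can be pictured as the scatter below. This is a plain-PyTorch sketch of the concept only, not the `attention-kernels` API, and the tensor layout is an assumption:

      ```
      import torch

      def reshape_and_cache_sketch(
          key: torch.Tensor,          # [num_tokens, num_kv_heads, head_dim]
          value: torch.Tensor,        # [num_tokens, num_kv_heads, head_dim]
          key_cache: torch.Tensor,    # flat view of the paged cache:
          value_cache: torch.Tensor,  #   [num_blocks * block_size, num_kv_heads, head_dim]
          slot_mapping: torch.Tensor, # [num_tokens] cache slot assigned to each token
      ) -> None:
          # Scatter the new keys/values into their assigned slots of the paged cache.
          key_cache[slot_mapping] = key
          value_cache[slot_mapping] = value
      ```

      A fused kernel performs this scatter (plus any dtype conversion) in a single pass; the sketch only shows the data movement.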
  8. 15 Nov, 2024 4 commits
  9. 10 Nov, 2024 1 commit
    • Add initial support for compressed-tensors checkpoints (#2732) · a7850008
      Daniël de Kok authored
      compressed-tensors is a safetensors extension for sparse, quantized
      tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
      quantization, because
      
      - Different quantizer configurations can be used for different targets.
      - The format can specify input/output quantizers in addition to weight
        quantizers.
      - Exclusions from quantization can be configured.
      
      This change adds a dependency on the `compressed-tensors` package for
      its configuration parsing and layer matching functionality.
      
      The following types of quantization are supported in this PR:
      
      - W8A16 and W4A16 INT using GPTQ-Marlin kernels.
      - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
      
      Support for other quantization types will be added in subsequent PRs.
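
      As a rough illustration of the points above (not taken from this PR, and with field names that only approximate the compressed-tensors schema), a checkpoint-level quantization config can express per-target quantizer settings and exclusions roughly like this:

      ```
      # Hypothetical, illustrative config in the spirit of compressed-tensors;
      # the exact schema and field names are assumptions.
      quantization_config = {
          "config_groups": {
              "group_0": {
                  # W4A16: 4-bit integer weights, activations left in 16-bit.
                  "targets": ["Linear"],
                  "weights": {
                      "num_bits": 4,
                      "type": "int",
                      "symmetric": True,
                      "strategy": "group",
                      "group_size": 128,
                  },
                  "input_activations": None,
              },
          },
          # Layers excluded from quantization entirely.
          "ignore": ["lm_head"],
      }
      ```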
  10. 04 Nov, 2024 4 commits
  11. 02 Nov, 2024 1 commit
  12. 01 Nov, 2024 1 commit
    • fix cuda graphs for qwen2-vl (#2708) · 01dacf8e
      drbh authored
      
      
      * feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl
      
      * fix: only check model type if config exists
      
      * fix: adjust sharding and lm head logic
      
      * fix qwen2 failure in intel cpu
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * fix: return correct shape logits and add streaming test
      
      * fix: remove unused import and refactor test
      
      ---------
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
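
      For intuition about the multidimensional position ids mentioned above: Qwen2-VL's rotary embedding tracks temporal/height/width positions per token, so batches carry position ids with an extra leading dimension. A minimal sketch, with shapes as illustrative assumptions:

      ```
      import torch

      seq_len = 8

      # Text-only models: one flat position per token.
      flat_position_ids = torch.arange(seq_len)                # shape [seq_len]

      # Qwen2-VL style: temporal/height/width position components per token.
      # For pure-text tokens the three rows coincide; image patches get distinct
      # height/width indices. CUDA graphs must be captured with this extra dimension.
      mrope_position_ids = torch.arange(seq_len).repeat(3, 1)  # shape [3, seq_len]
      ```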
  13. 30 Oct, 2024 1 commit
    • Support qwen2 vl (#2689) · befd9f67
      drbh authored
      * feat: add support for qwen2 vl model
      
      * feat: fix token padding, enable warmup and process basic request
      
      * fix: improve get_position_ids, add lift embed_tokens
      
      * fix: remove get_cos_sin_hack dev function
      
      * feat: add simple test chat with message and text
      
      * fix: lint test
      
      * fix: adjust positional embeddings for multi dimensional position ids
      
      * fix: update docs and lint unused vars
      
      * fix: include linted file
      
      * fix: add norm after text output
      
      * fix: format model file
      
      * fix: adjust for ruff lints
      
      * fix: remove unused rotate_half
      
      * feat: refactors and calc num features
      
      * fix: prefer position_ids passed from vlm causal lm and reset ids on batch
      
      * fix: adjust get_position_ids if not available and add required args to signatures
      
      * fix: adjust resize case for qwen2_vl warmup
      
      * fix: avoid qwen2 vl specific paths with qwen2
  14. 28 Oct, 2024 3 commits
    • Fixing auto bloom test. (#2699) · 3a9cdc32
      Nicolas Patry authored
    • We can have a tokenizer anywhere. (#2527) · 90b226db
      Nicolas Patry authored
      * We can have a tokenizer anywhere.
      
      * Handling potential lack of offsets (python tokenizer)
      
      * Remove redundancy.
      
      * Fixing the tests.
      
      * Flake.lock update?

      * Fixing the GIL locking.
      
      * Fixing mamba by using the transformers version.
      
      * Adding the legacy handle.
      
      * Elide lifetime.
      
      * Lint.
      
      * Deprecation message.
      
      * Fixing bad rebase.
    • Choosing input/total tokens automatically based on available VRAM? (#2673) · 0c9b6cdd
      Nicolas Patry authored
      * Choosing input/total tokens automatically based on available VRAM?
      
      * Update doc.
      
      * Remove generated files.
      
      * Trying to fix non chunking targets.
      
      * Attempt #2
      
      * fix.
      
      * QuantLinear is rocm compatible.
      
      * Much simpler logic after the overhead.
      
      * Updating logic + non flash.
      
      * Revert doc text.
      
      * Simple updates.
      
      * Fix integration mt0 (transformers update).
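
      A back-of-the-envelope version of that automatic sizing, assuming a paged KV cache with a 2-byte dtype and hypothetical parameter names (this is not the launcher's actual code):

      ```
      def kv_cache_token_budget(
          free_vram_bytes: int,
          num_layers: int,
          num_kv_heads: int,
          head_dim: int,
          dtype_bytes: int = 2,   # fp16/bf16 cache
          block_size: int = 16,   # paged-attention block size
      ) -> int:
          # Each cached token stores one key and one value vector per layer.
          bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
          num_blocks = free_vram_bytes // (block_size * bytes_per_token)
          # Total tokens that fit in the cache; max input/total tokens must stay below this.
          return num_blocks * block_size

      # e.g. ~20 GiB free with a Llama-8B-like config (32 layers, 8 KV heads, head dim 128)
      # gives roughly 160K cacheable tokens:
      # kv_cache_token_budget(20 * 1024**3, num_layers=32, num_kv_heads=8, head_dim=128)
      ```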
  15. 25 Oct, 2024 3 commits
  16. 24 Oct, 2024 2 commits
    • Add support for FP8 KV cache scales (#2628) · eab07f74
      Daniël de Kok authored
      * Add support for FP8 KV cache scales
      
      Since FP8 only has limited dynamic range, we can scale keys/values
      before storing them into the cache (and unscale them in attention). To
      avoid rescaling the cache as the absmax values change, good scales are
      usually determined per layer using calibration data and stored
      in the checkpoint.
      
      This change adds support for using key-value scales and loading them
      from checkpoints in the two most common formats:
      
      - Separate per-layer `k_scale` and `v_scale` scalars.
      - Per-layer `kv_scale` scalar (older format).
      
      Currently, scales are only used with a `float8_e4m3fn` cache.
      
      Besides adding support for key/value scales, the `fp8_quantize` function
      is also extended to support quantization with a kernel vendored from
      vLLM. This is slightly faster than the PyTorch implementation, but also
      scales in FP32, potentially improving accuracy.
      
      * Update FP8 KV cache test to use checkpoint with scales
      
      * `can_scale`: check that the attention is flashinfer
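
      A condensed sketch of the two scale layouts and of how a key would be scaled into a `float8_e4m3fn` cache; the key-name patterns and helper names are assumptions for illustration, not the repository's implementation:

      ```
      import torch

      def load_kv_scales(state_dict: dict, layer_prefix: str):
          # Newer checkpoints: separate per-layer `k_scale` / `v_scale` scalars.
          k_key, v_key = f"{layer_prefix}.k_scale", f"{layer_prefix}.v_scale"
          if k_key in state_dict:
              return state_dict[k_key].float(), state_dict[v_key].float()
          # Older format: a single per-layer `kv_scale` used for both keys and values.
          kv = state_dict.get(f"{layer_prefix}.kv_scale", torch.tensor(1.0))
          return kv.float(), kv.float()

      def store_fp8(key: torch.Tensor, k_scale: torch.Tensor) -> torch.Tensor:
          # Scale down into FP8's narrow dynamic range before caching;
          # attention later multiplies by k_scale again to unscale.
          finfo = torch.finfo(torch.float8_e4m3fn)
          return (key / k_scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
      ```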
  17. 23 Oct, 2024 2 commits
  18. 19 Oct, 2024 1 commit
    • Make handling of FP8 scales more consistent (#2666) · 5e0fb468
      Daniël de Kok authored
      Change `fp8_quantize` so that we can pass reciprocals around everywhere,
      so that scales are always handled in the checkpoint format.
      
      I also noticed that we ignore any input scales that we might have when
      fbgemm is available. Skip this path if we already have a scale.
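
      For intuition, assuming the checkpoint convention where `scale` is the dequantization factor (`x ≈ x_q.float() * scale`), quantizing multiplies by the reciprocal. A minimal sketch, not the actual `fp8_quantize` signature:

      ```
      import torch

      def fp8_quantize_sketch(x: torch.Tensor, scale: torch.Tensor | None = None):
          finfo = torch.finfo(torch.float8_e4m3fn)
          if scale is None:
              # Dynamic scale in checkpoint format: dequantization multiplies by it.
              scale = x.abs().max().clamp(min=1e-12).float() / finfo.max
          # Quantization uses the reciprocal of the checkpoint-format scale.
          x_q = (x * scale.reciprocal()).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
          return x_q, scale
      ```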
  19. 18 Oct, 2024 1 commit
  20. 17 Oct, 2024 4 commits
  21. 16 Oct, 2024 1 commit
    • feat: prefill chunking (#2600) · a6a0c97e
      OlivierDehaene authored
      
      
      * wip
      
      * rollback
      
      * refactor to use prefix/postfix naming + fix all_input_ids_tensor
      
      * maybe patching vlms?
      
      * fix filter and concat
      
      * wip, no filter, no concat
      
      * current
      
      * add prepare_for_prefill
      
      * working
      
      * load tested
      
      * re-create slots
      
      * re-create slots
      
      * fix slot_filtering_indices
      
      * feedback loop
      
      * remove log
      
      * fix benchmarker
      
      * fix vlm and seq2seq
      
      * rename to cache and input lengths
      
      * fix prefill logprobs
      
      * fix launcher
      
      * fix logprobs?
      
      * idk at this point
      
      * max input length
      
      * omfg
      
      * remove debugging lines
      
      * fix tests
      
      * fix mllama
      
      * fix cargo tests
      
      * remove support chunking for paged
      
      * Fixing non blocked attentions
      
      * Fixing dtype + AMD, Ipex targets.
      
      * lint fix.
      
      * rename
      
      * Fix prefix_caching variable, remove defaults in server (confusing a lot
      of the time).
      
      * Add simple resolution when user specifies ATTENTION=paged.
      
      * Put back non default simple tests.
      
      * Fix env name
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>