- 20 Jan, 2025 1 commit
xuxzh1 authored
-
- 23 Dec, 2024 1 commit
xuxzh1 authored
-
- 09 Dec, 2024 1 commit
Nicolas Patry authored
* Attempt at cleverer auto batch_prefill values (some simplifications).
* Less flaky tests.
* Fixing typo insertion.
* Update launcher/src/main.rs
* Adding a small comment for the source of the calculation.
* Adding L40.
* Adding L40s.

Co-authored-by: Daniël de Kok <me@danieldk.eu>
-
- 06 Dec, 2024 3 commits
drbh authored
* feat: support loading gemma2 as vlm text model
* feat: add test for paligemma2
-
Nicolas Patry authored
-
Nicolas Patry authored
* Attempt at automatic max batch prefill.
* Taking into account the number of shards.
* Adding more cards.
* Adding A100 + H100.
* Adding a few more cards.
* Logprobs cost too much.
* H100: better name, and keep the factor of 2.
* Damn inflated sparse tflops.
* Typo in H100.
* Updated the flops calculation (checked with fvcore); a rough sketch of the arithmetic follows this entry.
* Chunking by default.
* Fix prefix caching for chat completion since we removed logprobs.
* More tests.
* Dropping all the prefill logprobs.
* Add a flag that enables users to get logprobs back.
* Repairing prompt token counting.
* Fixing a few tests.
* Remove some scaffolding.
* Attempting to reduce the issues (workarounds for now).
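For context, a rough sketch of the flops-based arithmetic behind such a heuristic. The 2-FLOPs-per-parameter rule of thumb, the TFLOPS figure, and the target latency are illustrative assumptions, not the launcher's exact formula.

```python
# Hedged sketch: derive a prefill token budget from per-card compute.
# The constants below are assumptions for illustration only.

def estimate_max_batch_prefill_tokens(
    num_params: float,       # model parameters, e.g. 8e9 for an 8B model
    dense_tflops: float,     # per-card dense (non-sparse) TFLOPS for the dtype
    num_shards: int = 1,     # tensor-parallel shards share the prefill work
    target_seconds: float = 1.0,
) -> int:
    flops_per_token = 2.0 * num_params          # ~2 FLOPs per parameter per token
    budget = dense_tflops * 1e12 * num_shards * target_seconds
    return int(budget // flops_per_token)

# Example: an 8B model on one card with ~60 dense TFLOPS.
print(estimate_max_batch_prefill_tokens(8e9, 60.0))  # -> 3750
```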
-
- 04 Dec, 2024 1 commit
drbh authored
-
- 03 Dec, 2024 2 commits
Nicolas Patry authored
* Saving some VRAM.
  - 8B on 4x L4 with attention=flashdecoding: before 4.28 GB left, after 4.32 GB left, so roughly 40 MB saved.
  - Effect not as visible with attention=flashinfer and n_shard=1; I suspect it's linked to the torch allocator.
* Adding an assertion.
-
Daniël de Kok authored
* Sync (most) server dependencies with Nix. Skipped most grpcio packages because of a protobuf version incompatibility with the opentelemetry packages.
* Add a primitive script to generate Poetry commands to sync with Nix. This is not fully automated, since getting the Nix versions may be unresolvable; however, it does take most of the work out of doing this manually.
* Upgrade eetq?
* Fmt.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
- 02 Dec, 2024 1 commit
Dmitry Rogozhkin authored
Llama 3 has a list of values as `eos_token_id`: "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']". This breaks the tokenizer, since it expects a single value. This commit uses `tokenizer.eos_token_id` instead in such a case. Fixes: #2440

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
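A minimal sketch of that fallback; `config` and `tokenizer` stand in for the model config and Hugging Face tokenizer, and the helper name is hypothetical rather than the actual TGI code path.

```python
def resolve_eos_token_id(config, tokenizer):
    # Hypothetical helper: when the config exposes a list of stop ids
    # (as Llama 3 does), fall back to the tokenizer's single eos_token_id.
    eos = getattr(config, "eos_token_id", None)
    if isinstance(eos, (list, tuple)):
        return tokenizer.eos_token_id
    return eos if eos is not None else tokenizer.eos_token_id
```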
-
- 26 Nov, 2024 1 commit
Daniël de Kok authored
The compressed-tensors configuration can specify the configuration of the KV cache as well. Use an FP8 KV cache when the configuration tells us to do so (all other options and types are ignored for now).
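A hedged sketch of that decision; the `kv_cache_scheme` field name and its layout are assumptions about the checkpoint configuration, not a guaranteed schema.

```python
import torch

def kv_cache_dtype(quantization_config: dict, default=torch.float16):
    # Assumed config shape: {"kv_cache_scheme": {"type": "float", "num_bits": 8}, ...}
    scheme = (quantization_config or {}).get("kv_cache_scheme")
    if scheme and scheme.get("type") == "float" and scheme.get("num_bits") == 8:
        return torch.float8_e4m3fn  # the checkpoint asks for an FP8 KV cache
    return default                  # all other options/types are ignored for now
```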
-
- 25 Nov, 2024 1 commit
Daniël de Kok authored
* Move JSON grammar -> regex grammar conversion to the router

  This change moves the JSON grammar -> regex grammar conversion to the router by adding a dependency on the `outlines-core` Rust crate. In contrast to the Python implementation, the conversions are not LRU-cached, since they seem to be fast enough:

  simple schema   time: [5.8293 µs 5.8307 µs 5.8320 µs]
                  change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05)
                  Performance has improved.

  complex schema  time: [14.875 µs 14.881 µs 14.887 µs]
                  change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05)
                  Performance has improved.

  Using the schemas from: https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
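To illustrate the idea of compiling a JSON schema into a regular expression that constrains generation, here is a toy, hand-simplified example; the regexes produced by `outlines-core` are far more elaborate.

```python
import re

# Toy schema: an object with one required enum field.
schema = {
    "type": "object",
    "properties": {"unit": {"enum": ["celsius", "fahrenheit"]}},
    "required": ["unit"],
}

# Hand-simplified regex standing in for the compiled grammar.
pattern = re.compile(r'\{\s*"unit"\s*:\s*"(celsius|fahrenheit)"\s*\}')

assert pattern.fullmatch('{"unit": "celsius"}')
assert pattern.fullmatch('{ "unit" : "fahrenheit" }')
assert not pattern.fullmatch('{"unit": "kelvin"}')
```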
-
- 21 Nov, 2024 1 commit
OlivierDehaene authored
* feat: add payload limit
* update launcher
-
- 20 Nov, 2024 1 commit
Daniël de Kok authored
This change adds support for wNa16 int checkpoints with 2:4 sparsity using Marlin 2:4 kernels.
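For reference, 2:4 (semi-structured) sparsity means at most two non-zero values in every contiguous group of four along the reduction dimension; a small illustrative check (not the loader's actual validation):

```python
import torch

def satisfies_2_4_sparsity(weight: torch.Tensor) -> bool:
    # At most 2 non-zeros per contiguous group of 4 along the last dimension.
    # Assumes the last dimension is a multiple of 4.
    groups = weight.reshape(*weight.shape[:-1], -1, 4)
    return bool((groups.ne(0).sum(dim=-1) <= 2).all())

w = torch.tensor([[1.0, 0.0, 0.0, 2.0, 0.0, 3.0, 4.0, 0.0]])
print(satisfies_2_4_sparsity(w))  # True
```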
-
- 19 Nov, 2024 2 commits
drbh authored
-
Daniël de Kok authored
-
- 18 Nov, 2024 4 commits
drbh authored
* feat: support flash attention 2 in qwen2 vl vision blocks
* fix: calc max_seqlen once and small refactors
-
Daniël de Kok authored
* Add support for compressed-tensors w8a8 int checkpoints

  This change adds a loader for w8a8 int checkpoints. One large benefit of int8 support is that the corresponding cutlass matmul kernels also work on compute capability 7.5.

  Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:

  | Tasks           | Version | Filter           | n-shot | Metric                  |   | Value  |   | Stderr |
  |-----------------|--------:|------------------|-------:|-------------------------|---|-------:|---|--------|
  | gsm8k_cot_llama |       3 | flexible-extract |      8 | exact_match             | ↑ | 0.8431 | ± | 0.0100 |
  |                 |         | strict-match     |      8 | exact_match             | ↑ | 0.8393 | ± | 0.0101 |
  | ifeval          |       4 | none             |      0 | inst_level_loose_acc    | ↑ | 0.8597 | ± | N/A    |
  |                 |         | none             |      0 | inst_level_strict_acc   | ↑ | 0.8201 | ± | N/A    |
  |                 |         | none             |      0 | prompt_level_loose_acc  | ↑ | 0.7967 | ± | 0.0173 |
  |                 |         | none             |      0 | prompt_level_strict_acc | ↑ | 0.7468 | ± | 0.0187 |

  This is in the same ballpark as vLLM. As usual, lots of thanks to Neural Magic/vLLM for the kernels.

* Always use dynamic input quantization for w8a8 int. It's far less flaky and gives better output (see the sketch after this entry).
* Use marlin-kernels 0.3.5
* Fix a typo
* Small fixes

Co-authored-by: drbh <david.richard.holtz@gmail.com>
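A minimal sketch of per-token dynamic int8 activation quantization, the scheme preferred in the bullet above; the rounding and clamping details are assumptions, and the real path runs inside the cutlass/marlin kernels.

```python
import torch

def dynamic_per_token_int8(x: torch.Tensor):
    # Per-row (per-token) scales are computed from the live activations,
    # so no static input scale from the checkpoint is required.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale  # the int8 matmul result is rescaled by `scale` afterwards

x = torch.randn(4, 16)
q, s = dynamic_per_token_int8(x)
print((q.float() * s - x).abs().max())  # small quantization error
```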
-
Wang, Yi authored
* add ipex moe implementation to support Mixtral and PhiMoe
* update to ipex xpu 2.5
* torch has xpu support in 2.5
* fix oneapi basekit version
* Apply suggestions from code review

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
-
drbh authored
-
- 17 Nov, 2024 1 commit
Daniël de Kok authored
* Remove vLLM dependency for CUDA

  This change adds `attention-kernels` as a dependency for paged attention and cache reshaping. With that, we don't use vLLM anywhere for CUDA. Tested run (since we don't have paged attention in CI):

  ```
  ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
  [...]
  5 snapshots passed.
  ```

* Fix clippy warning
-
- 15 Nov, 2024 4 commits
Nicolas Patry authored
* Upgrading our deps.
* fixup.
* Fixup.
-
Alex Weston authored
* Upgrade outlines to 0.1.1
* Update for new API
* Check if allowed tokens is None

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Billel Mokeddem authored
fix: change embeddings to embedding

Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>
-
Billel Mokeddem authored
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>
-
- 10 Nov, 2024 1 commit
Daniël de Kok authored
compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization because:

- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight quantizers.
- Exclusions from quantization are configurable.

This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR (an illustrative dispatch sketch follows this entry):

- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.

Support for other quantization types will be added in subsequent PRs.
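As an illustration only, a toy dispatch from a simplified scheme description to the kernel families named above; the parameter names are assumptions, not the compressed-tensors schema.

```python
def pick_kernels(weight_bits: int, weight_type: str, activation_bits=None) -> str:
    # Toy mapping; real layer matching is done by the compressed-tensors package.
    if weight_type == "int" and weight_bits in (4, 8) and activation_bits is None:
        return "GPTQ-Marlin"             # W4A16 / W8A16 INT
    if weight_type == "float" and weight_bits == 8:
        return "FP8-Marlin / cutlass"    # W8A8 / W8A16 FP
    raise ValueError("quantization scheme not supported in this PR")

print(pick_kernels(4, "int"))        # GPTQ-Marlin
print(pick_kernels(8, "float", 8))   # FP8-Marlin / cutlass
```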
-
- 04 Nov, 2024 4 commits
Wang, Yi authored
Fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ: the ipex kernel provides functions like add_bias, so there is no need to add the bias outside the kernel.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
-
Travis Addair authored
-
Nicolas Patry authored
-
- 02 Nov, 2024 1 commit
drbh authored
* fix: create position ids for text only input
* fix: prefer repeat over expand to avoid a clone (see the demonstration below)
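On the second bullet: `expand` only broadcasts strides and returns a non-contiguous view that typically needs a later clone, while `repeat` materializes the copy up front. A quick illustrative demonstration (not the model code):

```python
import torch

position_ids = torch.arange(6).unsqueeze(0)  # shape (1, seq_len)

expanded = position_ids.expand(4, -1)  # broadcast view, non-contiguous
repeated = position_ids.repeat(4, 1)   # materialized copy, contiguous

print(expanded.is_contiguous(), repeated.is_contiguous())  # False True
```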
-
- 01 Nov, 2024 1 commit
drbh authored
* feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl
* fix: only check model type if config exists
* fix: adjust sharding and lm head logic
* fix qwen2 failure in intel cpu
* fix: return correct shape logits and add streaming test
* fix: remove unused import and refactor test

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 30 Oct, 2024 1 commit
drbh authored
* feat: add support for qwen2 vl model
* feat: fix token padding, enable warmup and process basic request
* fix: improve get_position_ids, add lift embed_tokens
* fix: remove get_cos_sin_hack dev function
* feat: add simple test chat with message and text
* fix: lint test
* fix: adjust positional embeddings for multi dimensional position ids
* fix: update docs and lint unused vars
* fix: include linted file
* fix: add norm after text output
* fix: format model file
* fix: adjust for ruff lints
* fix: remove unused rotate_half
* feat: refactors and calc num features
* fix: prefer position_ids passed from vlm causal lm and reset ids on batch
* fix: adjust get_position_ids if not available and add required args to signatures
* fix: adjust resize case for qwen2_vl warmup
* fix: avoid qwen2 vl specific paths with qwen2
-
- 28 Oct, 2024 3 commits
Nicolas Patry authored
-
Nicolas Patry authored
* We can have a tokenizer anywhere.
* Handling potential lack of offsets (python tokenizer).
* Remove redundancy.
* Fixing the tests.
* Flake.lock update?
* Fixing the GIL locking.
* Fixing mamba by using the transformers version.
* Adding the legacy handle.
* Elide lifetime.
* Lint.
* Deprecation message.
* Fixing bad rebase.
-
Nicolas Patry authored
* Choosing input/total tokens automatically based on available VRAM (a simplified sketch of the arithmetic follows this entry).
* Update doc.
* Remove generated files.
* Trying to fix non chunking targets.
* Attempt #2
* fix.
* QuantLinear is rocm compatible.
* Much simpler logic after the overhead.
* Updating logic + non flash.
* Revert doc text.
* Simple updates.
* Fix integration mt0 (transformers update).
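On the first bullet, the core of such a heuristic is KV-cache arithmetic; a simplified sketch follows, where the dtype size and the "free VRAM" figure are assumptions rather than the launcher's exact logic.

```python
def max_total_tokens(free_vram_bytes: int, num_layers: int,
                     num_kv_heads: int, head_dim: int,
                     dtype_bytes: int = 2) -> int:
    # Each token stores one key and one value vector per layer in the cache.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return free_vram_bytes // bytes_per_token

# Example: ~10 GiB free with Llama-3-8B-like shapes (32 layers, 8 KV heads, head dim 128).
print(max_total_tokens(10 * 1024**3, 32, 8, 128))  # -> 81920
```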
-
- 25 Oct, 2024 3 commits
OlivierDehaene authored
* feat: add triton kernels to decrease latency of large batches
* cast to int32
* fix kernel
* fix kernel
* disable triton on rocm
* fix speculation
* add slots filtering kernel
-
Daniël de Kok authored
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels

  Performance and accuracy of these kernels are on par (tested with Llama 70B and 405B). Removes a dependency and resolves some stability issues we have been seeing.

* Update test snapshots
-
Nicolas Patry authored
-
- 24 Oct, 2024 1 commit
Daniël de Kok authored
* Add support for FP8 KV cache scales

  Since FP8 only has limited dynamic range, we can scale keys/values before storing them into the cache (and unscale them in attention); a minimal sketch follows this entry. To avoid rescaling the cache as the absmax values change, good scales are usually determined per layer using calibration data and stored in the checkpoint. This change adds support for using key-value scales and loading them from checkpoints in the two most common formats:

  - Separate per-layer `k_scale` and `v_scale` scalars.
  - A per-layer `kv_scale` scalar (older format).

  Currently, scales are only used with a `float8_e4m3fn` cache.

  Besides adding support for key/value scales, the `fp8_quantize` function is also extended to support quantization with a kernel vendored from vLLM. This is slightly faster than the PyTorch implementation, but it also scales in FP32, potentially improving accuracy.

* Update FP8 KV cache test to use checkpoint with scales
* `can_scale`: check that the attention is flashinfer
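A minimal sketch of the scale/unscale step described above; the divide-by-scale convention and dtypes are assumptions, and the real implementation uses the kernel vendored from vLLM.

```python
import torch

def quantize_kv(key: torch.Tensor, k_scale: float) -> torch.Tensor:
    # Divide by the per-layer scale before casting so the values fit
    # float8_e4m3fn's limited dynamic range.
    return (key / k_scale).to(torch.float8_e4m3fn)

def dequantize_kv(key_fp8: torch.Tensor, k_scale: float) -> torch.Tensor:
    # Attention multiplies the same per-layer scale back in.
    return key_fp8.to(torch.float16) * k_scale

k = torch.randn(4, 128, dtype=torch.float16)
k_scale = (k.abs().max() / torch.finfo(torch.float8_e4m3fn).max).item()
print((dequantize_kv(quantize_kv(k, k_scale), k_scale) - k).abs().max())
```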
-