Commits · 3c9df21ff8f0627988728388e95f097bb1f89217 · OpenDAS / text-generation-inference

18 Nov, 2024 3 commits

Add support for compressed-tensors w8a8 int checkpoints (#2745) · 3c9df21f

Daniël de Kok authored Nov 18, 2024



* Add support for compressed-tensors w8a8 int checkpoints

This change adds a loader for w8a8 int checkpoints. One large benefit of
int8 support is that the corresponding cutlass matmul kernels also work on
compute capability 7.5.

Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:

|     Tasks     |Version|     Filter     |n-shot|        Metric         |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
|gsm8k_cot_llama|      3|flexible-extract|     8|exact_match            |↑  |0.8431|±  |0.0100|
|               |       |strict-match    |     8|exact_match            |↑  |0.8393|±  |0.0101|
|ifeval         |      4|none            |     0|inst_level_loose_acc   |↑  |0.8597|±  |   N/A|
|               |       |none            |     0|inst_level_strict_acc  |↑  |0.8201|±  |   N/A|
|               |       |none            |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|               |       |none            |     0|prompt_level_strict_acc|↑  |0.7468|±  |0.0187|

These scores are in the same ballpark as vLLM.

As usual, lots of thanks to Neural Magic/vLLM for the kernels.

* Always use dynamic input quantization for w8a8 int

It's far less flaky and gives better output.
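
Dynamic input quantization means the activation scale is computed from each incoming tensor at run time rather than read from the checkpoint. The following is a minimal pure-Python sketch of that idea, assuming symmetric per-row scaling into the int8 range. The function names are hypothetical, and this is not the cutlass or marlin-kernels code path.

```python
from typing import List, Tuple


def quantize_row_dynamic(row: List[float]) -> Tuple[List[int], float]:
    """Symmetric int8 quantization with a scale derived from the row itself."""
    max_abs = max((abs(x) for x in row), default=0.0)
    # Guard against an all-zero row so that the division below is defined.
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    quantized = [max(-127, min(127, round(x / scale))) for x in row]
    return quantized, scale


def dequantize_row(quantized: List[int], scale: float) -> List[float]:
    """Map int8 values back to floats using the row's scale."""
    return [q * scale for q in quantized]


activations = [0.5, -1.0, 0.25, 0.0]
q, s = quantize_row_dynamic(activations)
restored = dequantize_row(q, s)
```

Because the scale follows the largest magnitude in each row, no calibration data is needed for the activations; the trade-off is one extra reduction per row at inference time.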

* Use marlin-kernels 0.3.5

* Fix a typo
Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Small fixes

---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>


add ipex moe implementation to support Mixtral and PhiMoe (#2707) · a5ecd6e5

Wang, Yi authored Nov 19, 2024



* add ipex moe implementation to support Mixtral and PhiMoe
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update to ipex xpu 2.5
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* torch has xpu support in 2.5
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix oneapi basekit version
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Apply suggestions from code review
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>


fix: improve find_segments via numpy diff (#2686) · fea62e92
drbh authored Nov 18, 2024
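
The commit carries no description, so the following is only a sketch of the general technique its title names: locating runs of equal values in an index array with `numpy.diff` instead of a Python loop. The function name `find_segments_sketch` and its return shape are assumptions, not the project's actual `find_segments`.

```python
import numpy as np
from typing import List, Tuple


def find_segments_sketch(indices: np.ndarray) -> Tuple[List[int], List[int]]:
    """Return (segment boundaries, value of each segment) for runs of equal values."""
    if indices.size == 0:
        return [], []
    # Positions where the value differs from its predecessor start a new run.
    change_points = np.flatnonzero(np.diff(indices)) + 1
    starts = np.concatenate(([0], change_points))
    boundaries = starts.tolist() + [int(indices.size)]
    values = indices[starts].tolist()
    return boundaries, values


boundaries, values = find_segments_sketch(np.array([0, 0, 1, 1, 1, 0, 2]))
```

Segment `i` covers positions `boundaries[i]` up to but not including `boundaries[i + 1]`, and `values[i]` is the value shared by that run.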


17 Nov, 2024 1 commit

Remove vLLM dependency for CUDA (#2751) · 52e48739

Daniël de Kok authored Nov 17, 2024

* Remove vLLM dependency for CUDA

This change adds `attention-kernels` as a dependency for paged
attention and cache reshaping. With that, we don't use vLLM
anywhere for CUDA.

Tested run (since we don't have paged attention in CI):

```
❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
[...]
5 snapshots passed.
```

* Fix clippy warning


15 Nov, 2024 7 commits

feat: return streaming errors as an event formatted for openai's client (#2668) · 6489f852

drbh authored Nov 15, 2024



* feat: return streaming errors as an event formatted for openai's client

* fix: propagate completions error events to stream

* fix: improve stream api error format and add status code

* fix: improve streaming error to include error_type

* Revert "fix: improve streaming error to include error_type"

This reverts commit 2b1a360b1511d94ea9a24e5432e498e67939506a.

* Reworked the implementation.

* Revert "Reworked the implementation."

This reverts commit 7c3f29777f17411ae4ade57e2f88e73cde704ee5.

* Small lifting.

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
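
As an illustration of returning a streaming error as an event, the sketch below serializes an error into a single server-sent event frame. The payload field names (`message`, `http_status_code`) and the helper name are assumptions made for this example; the commit log does not spell out the exact schema.

```python
import json


def format_stream_error(message: str, status_code: int) -> str:
    """Serialize an error as one server-sent event (SSE) frame."""
    # Field names here are illustrative assumptions, not a confirmed schema.
    payload = {"error": {"message": message, "http_status_code": status_code}}
    # An SSE frame is a `data:` line terminated by a blank line.
    return f"data: {json.dumps(payload)}\n\n"


frame = format_stream_error("Input validation error", 422)
```

Sending the error inside the stream lets a client that is already consuming events surface the failure instead of seeing the connection simply end.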


Upgrading our deps. (#2750) · 34a3bded
Nicolas Patry authored Nov 15, 2024
```
* Upgrading our deps.

* fixup.

* Fixup.
```

Upgrade outlines to 0.1.1 (#2742) · 4580ced0

Alex Weston authored Nov 15, 2024



* Upgrade outlines to 0.1.1

* Update for new API

* Check if allowed tokens is None

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>


fix response type of document for Text Generation Inference (#2743) · 003eaec0
jito authored Nov 15, 2024
```
Signed-off-by: jitokim <pigberger70@gmail.com>
```

Fix: Change embeddings to embedding (#2738) · 4f4857a4

Billel Mokeddem authored Nov 15, 2024



fix: change embeddings to embedding
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>


Fix: Change model_type from ssm to mamba (#2740) · f9ee46f7
Billel Mokeddem authored Nov 15, 2024
```
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>
```
benchmark: fix prefill throughput (#2741) · 8442f1ac
Daniël de Kok authored Nov 15, 2024


14 Nov, 2024 1 commit
- nix: update nixpkgs (#2746) · ca4f46dd
  Daniël de Kok authored Nov 14, 2024
```
Updates from Triton 2.1.0 to 3.1.0 (among other things).
```
10 Nov, 2024 1 commit

Add initial support for compressed-tensors checkpoints (#2732) · a7850008

Daniël de Kok authored Nov 10, 2024

compressed-tensors is a safetensors extension for sparse, quantized
tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
quantization, because

- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight
  quantizers.
- Configurable exclusions for quantization.
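
The three properties above can be sketched with a small lookup. The configuration below is a hypothetical, simplified stand-in (the real schema and layer matching are defined by the `compressed-tensors` package), and `resolve_scheme` is an illustrative helper, not part of that package.

```python
from typing import Any, Dict, Optional

# Hypothetical, simplified config; the real schema is defined by compressed-tensors.
CONFIG: Dict[str, Any] = {
    "config_groups": {
        "group_0": {
            "targets": ["Linear"],
            "weights": {"num_bits": 8, "type": "int"},
            "input_activations": {"num_bits": 8, "type": "int", "dynamic": True},
        },
    },
    "ignore": ["lm_head"],
}


def resolve_scheme(
    layer_name: str, layer_type: str, config: Dict[str, Any]
) -> Optional[Dict[str, Any]]:
    """Return the quantization group for a layer, or None if it stays unquantized."""
    # Exclusions win over any target match.
    if layer_name in config["ignore"]:
        return None
    for group in config["config_groups"].values():
        if layer_type in group["targets"] or layer_name in group["targets"]:
            return group
    return None


attn_scheme = resolve_scheme("model.layers.0.self_attn.q_proj", "Linear", CONFIG)
head_scheme = resolve_scheme("lm_head", "Linear", CONFIG)
```

Each group pairs a set of targets with its own weight and input quantizers, which is what allows one checkpoint to mix schemes while leaving excluded layers untouched.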

This change adds a dependency on the `compressed-tensors` package for
its configuration parsing and layer matching functionality.

The following types of quantization are supported in this PR:

- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.

Support for other quantization types will be added in subsequent PRs.


07 Nov, 2024 1 commit
- add trust_remote_code in tokenizer to fix baichuan issue (#2725) · 97f7a22f
  Wang, Yi authored Nov 07, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
04 Nov, 2024 6 commits
- fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… (#2717) · b1f9044d
  Wang, Yi authored Nov 04, 2024
```
fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ
The ipex kernel already provides functionality such as add_bias, so there is no need to add the bias outside the kernel.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
- nix: move to tgi-nix `main` (#2718) · 5eedb2ec
  Daniël de Kok authored Nov 04, 2024
  
- Fixing linting on main. (#2719) · 9fde5666
  Nicolas Patry authored Nov 04, 2024
  
- Fix prefix caching + speculative decoding (#2711) · aadc9cb4
  Travis Addair authored Nov 04, 2024
  
- Hotfixing auto length (warmup max_s was wrong). (#2716) · a5593ba8
  Nicolas Patry authored Nov 04, 2024
  
- fix: add chat_tokenize endpoint to api docs (#2710) · 08c4184e
  drbh authored Nov 04, 2024
  
02 Nov, 2024 1 commit

fix: create position ids for text only input (#2714) · 6e322052

drbh authored Nov 01, 2024

* fix: create position ids for text only input

* fix: prefer repeat over expand to avoid clone
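
A rough NumPy analogue of the two points above, assuming that text-only input uses the same 0..n-1 range on every position axis. The axis count of 3 and the helper name are assumptions for illustration; `np.tile` and `np.broadcast_to` stand in for the tensor library's `repeat` and `expand`.

```python
import numpy as np

NUM_AXES = 3  # assumed number of position axes, chosen for illustration


def text_only_position_ids(seq_len: int) -> np.ndarray:
    """Build (NUM_AXES, seq_len) position ids where every axis counts 0..seq_len-1."""
    base = np.arange(seq_len)
    # np.tile materializes an owned, writable array (analogous to `repeat`).
    return np.tile(base, (NUM_AXES, 1))


ids = text_only_position_ids(4)
# np.broadcast_to yields a read-only view instead (analogous to `expand`).
view = np.broadcast_to(np.arange(4), (NUM_AXES, 4))
```

The repeated array can be written to directly, whereas the broadcast view would first need a separate copy, which is one plausible reading of "to avoid clone".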


01 Nov, 2024 1 commit

fix cuda graphs for qwen2-vl (#2708) · 01dacf8e

drbh authored Oct 31, 2024



* feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl

* fix: only check model type if config exists

* fix: adjust sharding and lm head logic

* fix qwen2 failure in intel cpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix: return correct shape logits and add streaming test

* fix: remove unused import and refactor test

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>


30 Oct, 2024 2 commits

Support qwen2 vl (#2689) · befd9f67

drbh authored Oct 30, 2024

* feat: add support for qwen2 vl model

* feat: fix token padding, enable warmup and process basic request

* fix: improve get_position_ids, add lift embed_tokens

* fix: remove get_cos_sin_hack dev function

* feat: add simple test chat with message and text

* fix: lint test

* fix: adjust positional embeddings for multi dimensional position ids

* fix: update docs and lint unused vars

* fix: include linted file

* fix: add norm after text output

* fix: format model file

* fix: adjust for ruff lints

* fix: remove unused rotate_half

* feat: refactors and calc num features

* fix: prefer position_ids passed from vlm causal lm and reset ids on batch

* fix: adjust get_position_ids if not available and add required args to signatures

* fix: adjust resize case for qwen2_vl warmup

* fix: avoid qwen2 vl specific paths with qwen2


add xpu triton in dockerfile, or will show "Could not import Flash At… (#2702) · 46aeb086

Wang, Yi authored Oct 30, 2024



add xpu triton to the Dockerfile; otherwise the server shows "Could not import Flash Attention enabled models: No module named 'triton'"
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>


28 Oct, 2024 7 commits

Monkey patching as a desperate measure. (#2704) · 98330df6
Nicolas Patry authored Oct 28, 2024
```
* Monkey patching as a desperate measure.

* New snapshot ?
```
More timeout on docker start ? (#2701) · 513d19b9
Nicolas Patry authored Oct 28, 2024
```
* More timeout on docker start ?

* Latest upgrade.
```
Fixing auto bloom test. (#2699) · 3a9cdc32
Nicolas Patry authored Oct 28, 2024

Update poetry lock. (#2698) · 78ce618c
Nicolas Patry authored Oct 28, 2024


We can have a tokenizer anywhere. (#2527) · 90b226db

Nicolas Patry authored Oct 28, 2024

* We can have a tokenizer anywhere.

* Handling potential lack of offsets (python tokenizer)

* Remove redundancy.

* Fixing the tests.

* Flake.lock update ?

* Fixing the GIL locking.

* Fixing mamba by using the transformers version.

* Adding the legacy handle.

* Elide lifetime.

* Lint.

* Deprecation message.

* Fixing bad rebase.


Choosing input/total tokens automatically based on available VRAM? (#2673) · 0c9b6cdd

Nicolas Patry authored Oct 28, 2024

* Choosing input/total tokens automatically based on available VRAM?

* Update doc.

* Remove generated files.

* Trying to fix non chunking targets.

* Attempt #2

* fix.

* QuantLinear is rocm compatible.

* Much simpler logic after the overhead.

* Updating logic + non flash.

* Revert doc text.

* Simple updates.

* Fix integration mt0 (transformers update).
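
One way to picture deriving a token budget from available VRAM is to divide the free memory by the KV cache cost of a single token. The sketch below uses the standard per-token estimate (keys plus values, across all layers); the function names, the safety fraction, and the model shape are illustrative assumptions, not the logic this commit implements.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int
) -> int:
    """Bytes of KV cache consumed by one token (keys and values across all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_total_tokens(
    free_bytes: int, per_token_bytes: int, safety_fraction: float = 0.9
) -> int:
    """Largest token budget that fits in a fraction of the free memory."""
    return int(free_bytes * safety_fraction) // per_token_bytes


# Illustrative shape: 32 layers, 8 KV heads, head_dim 128, 2-byte elements.
per_token = kv_bytes_per_token(32, 8, 128, 2)
budget = max_total_tokens(free_bytes=8 * 1024**3, per_token_bytes=per_token)
```

With these example numbers each token costs 128 KiB of cache, so 8 GiB of free memory at a 0.9 safety fraction leaves room for roughly 59 thousand tokens.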


Green main (#2697) · 2e4f4ba1
Nicolas Patry authored Oct 28, 2024


26 Oct, 2024 1 commit

Avoiding timeout for bloom tests. (#2693) · 8a8794a6

Nicolas Patry authored Oct 26, 2024

* Avoiding timeout for bloom tests.

* Skip the test; let's see if it's always the first test that fails.

* Fail early.

* Pulling ?

* No early exit.


25 Oct, 2024 8 commits

chore: prepare 2.4.0 release (#2695) · a6b02da9
OlivierDehaene authored Oct 25, 2024


feat: add triton kernels to decrease latency of large batches (#2687) · 6f88bd93

OlivierDehaene authored Oct 25, 2024

* feat: add triton kernels to decrease latency of large batches

* cast to int32

* fix kernel

* fix kernel

* disable triton on rocm

* fix speculation

* add slots filtering kernel


Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels (#2688) · 0f346a32

Daniël de Kok authored Oct 25, 2024

* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels

Performance and accuracy of these kernels are on par (tested with Llama
70B and 405B). Removes a dependency and resolves some stability issues
we have been seeing.

* Update test snapshots


Add support for stop words in TRTLLM (#2678) · ba5fc7d9

Funtowicz Morgan authored Oct 25, 2024

* feat(trtllm): rewrite health to not account for current state

* chore(looper): cleanup a bit more

* feat(post_processing): max_new_tokens is const evaluated now

* chore(ffi):formatting

* feat(trtllm): add stop words handling

# Conflicts:
#	backends/trtllm/lib/backend.cpp

* chore(trtllm): create specific parallelconfig factory and logging init methods

* chore(trtllm): define a macro for SizeType cast

* chore(trtllm): use GetParallelConfig

* chore(trtllm): minor refactoring

* chore(trtllm): validate there are enough GPUs on the system for the desired model

* chore(trtllm): ensure max throughput scheduling policy is selected

* chore(trtllm): minor fix

* chore(router): minor refactorings

* feat(docker): build with-slurm ompi

* feat(docker): add python3.10 dev to runtime deps

* chore(docker): add mpi to ld_library_path

* chore(docker): install transformers

* feat(trtllm): detect stop_words from generation_config.json


Fixing mt0 test. (#2692) · db68bd05
Nicolas Patry authored Oct 25, 2024

Fixing rocm gptq by using triton code too (renamed cuda into triton). (#2691) · cece8635
Nicolas Patry authored Oct 25, 2024


[TENSORRT-LLM] - Implement new looper thread based backend (#2357) · 43df056e

Funtowicz Morgan authored Oct 25, 2024



* (backend) use parking_lot crate for RwLock fairness

# Conflicts:
#	backends/trtllm/src/backend.rs

* (launcher) default new server::run parameters to false for now

* (chore) fmt ... why?

* (ffi) use const for GetSamplingConfig

* (server) expose new SchedulingError

* (trt)

* (build) setup ccache if available

* (ffi) add max_new_tokens parameters

* (backend) cleanup a bit

* (backend) expose PullNewTokens

* (ffi) cleanup again

* (ffi) add missing headers imports

* (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>

* (looper) new looper initial implementation

* (ffi) remove narrowing type warning

* (ffi) encode the provided user prompt within each request thread

* (misc) change scope identifiers

* (backend) implement the post_processor background thread

* (misc) missing Result types for Rust

* use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step

* (server) forward auth_token to server::run

* (build) fetchcontent use archives instead of git

* (ffi) fix usage of wrong vector constructor making a capacity fill call

* (ffi) missing namespace for tle::Response

* (ffi) do not use reference capture in lambda as we are not capturing anything

* (backend) refactor & cleanup

* (Dockerfile.trtllm) delete for now

* (misc) simplify [make_]move_iterator by using c++20 type inference

* (misc) no need to move for uint32_t items

* (scheduler) rework submit/pull logic

* (post) impl postprocessing

* (misc) delete backend.rs

* (misc) rerun-if-changed all the cmake modules

* (misc) move to latest trtllm

* (fix): HOPPER_SM_MAJOR is 9 not 8

* (misc): build for sm_{75,80,86,89,90} by default

* (misc): build with trtllm 0.13.0

* (misc): increase verbosity of spdlog

* (fix): do not recreate the stateful hashmap at every iteration

* (misc): update dependency in trtllm dockerfile

* (misc): update dependency in trtllm dockerfile

* (misc): disable logging in release mode

* (misc): improve trtllm download script robustness

* (fix): more fixes for Dockerfile

* misc(cuda): require 12.6

* chore(cmake): use correct policy for download_timestamp

* feat(looper): check engine and executorWorker paths exist before creating the backend

* chore(cmake): download timestamp should be before URL

* feat(looper): minor optimizations to avoid growing the containers too much

* chore(trtllm): move dockerfile to right place

* chore(trtllm): disable tokenizer parallelism by default

* chore(trtllm): fmt

* chore(trtllm): post-rebase commit

* chore(trtllm): remove unused method

* feat(trtllm): cache maxNumTokens to avoid calling JSON every time

* misc(router): remove SchedulingError

* feat(trtllm): do not tokenize twice

* Revert "chore(trtllm): remove unused method"

This reverts commit 31747163

* chore(rebase): fix invalid references

* chore(router): add python dependency

* Lint.

* Fix bad rebase

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>


Fixing "deadlock" when python prompts for trust_remote_code by always (#2664) · ed87b464
Nicolas Patry authored Oct 25, 2024
```
specifying a value.
```