- 28 Oct, 2024 2 commits
-
-
Nicolas Patry authored
* Choosing input/total tokens automatically based on available VRAM?
* Update doc.
* Remove generated files.
* Trying to fix non chunking targets.
* Attempt #2
* fix.
* QuantLinear is rocm compatible.
* Much simpler logic after the overhead.
* Updating logic + non flash.
* Revert doc text.
* Simple updates.
* Fix integration mt0 (transformers update).
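The first item above (choosing the token budget from available VRAM) amounts to dividing free memory by the per-token KV-cache footprint. A minimal, hypothetical sketch of that arithmetic; the function name and model shapes are illustrative, not TGI's actual code:

```python
def max_total_tokens(free_vram_bytes: int,
                     num_layers: int,
                     num_kv_heads: int,
                     head_dim: int,
                     dtype_bytes: int = 2) -> int:
    """Estimate how many tokens fit in the KV cache given free VRAM.

    Each token stores one key and one value vector per layer:
    2 * num_layers * num_kv_heads * head_dim * dtype_bytes.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return free_vram_bytes // bytes_per_token

# Example: 8 GiB free, Llama-7B-like shapes (32 layers, 32 KV heads, head dim 128)
budget = max_total_tokens(8 * 1024**3, 32, 32, 128)
print(budget)  # 16384
```

In practice the launcher also has to reserve room for weights, activations, and CUDA graph overhead before applying this division.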
-
Nicolas Patry authored
-
- 26 Oct, 2024 1 commit
-
-
Nicolas Patry authored
* Avoiding timeout for bloom tests.
* Skip the test, let's see if it's always the first test that fails.
* Fail early.
* Pulling?
* No early exit.
-
- 25 Oct, 2024 8 commits
-
-
OlivierDehaene authored
-
OlivierDehaene authored
* feat: add triton kernels to decrease latency of large batches
* cast to int32
* fix kernel
* fix kernel
* disable triton on rocm
* fix speculation
* add slots filtering kernel
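The slot-filtering kernel mentioned above gathers the cache slots of surviving requests when a batch is filtered. A pure-Python sketch of the operation it performs (the real version is a Triton kernel operating on tensors; names here are illustrative):

```python
def filter_slots(slots, cu_slots, kept_indices):
    """Keep only the slot ranges belonging to requests in kept_indices.

    slots        -- flat list of cache slots for every request in the batch
    cu_slots     -- cumulative offsets: request i owns slots[cu_slots[i]:cu_slots[i+1]]
    kept_indices -- indices of requests that survive the filter
    """
    new_slots, new_cu = [], [0]
    for i in kept_indices:
        new_slots.extend(slots[cu_slots[i]:cu_slots[i + 1]])
        new_cu.append(len(new_slots))
    return new_slots, new_cu

slots = [10, 11, 12, 20, 21, 30]
cu_slots = [0, 3, 5, 6]  # request 0 owns 3 slots, request 1 owns 2, request 2 owns 1
print(filter_slots(slots, cu_slots, [0, 2]))  # ([10, 11, 12, 30], [0, 3, 4])
```

Doing this gather on-device avoids a host round-trip per filtered batch, which is where the latency win for large batches comes from.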
-
Daniël de Kok authored
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels. Performance and accuracy of these kernels are on par (tested with Llama 70B and 405B). Removes a dependency and resolves some stability issues we have been seeing.
* Update test snapshots
-
Funtowicz Morgan authored
* feat(trtllm): rewrite health to not account for current state
* chore(looper): cleanup a bit more
* feat(post_processing): max_new_tokens is const evaluated now
* chore(ffi): formatting
* feat(trtllm): add stop words handling

  # Conflicts:
  # backends/trtllm/lib/backend.cpp
* chore(trtllm): create specific parallelconfig factory and logging init methods
* chore(trtllm): define a macro for SizeType cast
* chore(trtllm): use GetParallelConfig
* chore(trtllm): minor refactoring
* chore(trtllm): validate there are enough GPUs on the system for the desired model
* chore(trtllm): ensure max throughput scheduling policy is selected
* chore(trtllm): minor fix
* chore(router): minor refactorings
* feat(docker): build with-slurm ompi
* feat(docker): add python3.10 dev to runtime deps
* chore(docker): add mpi to ld_library_path
* chore(docker): install transformers
* feat(trtllm): detect stop_words from generation_config.json
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Funtowicz Morgan authored
* (backend) use parking_lot crate for RwLock fairness

  # Conflicts:
  # backends/trtllm/src/backend.rs
* (launcher) default new server::run parameters to false for now
* (chore) fmt ... why?
* (ffi) use const for GetSamplingConfig
* (server) expose new SchedulingError
* (trt)
* (build) setup ccache if available
* (ffi) add max_new_tokens parameters
* (backend) cleanup a bit
* (backend) expose PullNewTokens
* (ffi) cleanup again
* (ffi) add missing headers imports
* (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>
* (looper) new looper initial implementation
* (ffi) remove narrowing type warning
* (ffi) encode the provided user prompt within each request thread
* (misc) change scope identifiers
* (backend) implement the post_processor background thread
* (misc) missing Result types for Rust
* use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step
* (server) forward auth_token to server::run
* (build) fetchcontent use archives instead of git
* (ffi) fix usage of wrong vector constructor making a capacity fill call
* (ffi) missing namespace for tle::Response
* (ffi) do not use reference capture in lambda as we are not capturing anything
* (backend) refactor & cleanup
* (Dockerfile.trtllm) delete for now
* (misc) simplify [make_]move_iterator by using c++20 type inference
* (misc) no need to move for uint32_t items
* (scheduler) rework submit/pull logic
* (post) impl postprocessing
* (misc) delete backend.rs
* (misc) rerun-if-changed all the cmake modules
* (misc) move to latest trtllm
* (fix): HOPPER_SM_MAJOR is 9 not 8
* (misc): build for sm_{75,80,86,89,90} by default
* (misc): build with trtllm 0.13.0
* (misc): increase verbosity of spdlog
* (fix): do not recreate the stateful hashmap at every it
* (misc): update dependency in trtllm dockerfile
* (misc): update dependency in trtllm dockerfile
* (misc): disable logging in release mode
* (misc): improve trtllm download script robustness
* (fix): more fixes for Dockerfile
* misc(cuda): require 12.6
* chore(cmake): use correct policy for download_timestamp
* feat(looper): check engine and executorWorker paths exist before creating the backend
* chore(cmake): download timestamp should be before URL
* feat(looper): minor optimizations to avoid growing the containers too much
* chore(trtllm): move dockerfile to right place
* chore(trtllm): disable tokenizer parallelism by default
* chore(trtllm): fmt
* chore(trtllm): post-rebase commit
* chore(trtllm): remove unused method
* feat(trtllm): cache maxNumTokens to avoid calling JSON every time
* misc(router): remove SchedulingError
* feat(trtllm): do not tokenize twice
* Revert "chore(trtllm): remove unused method" This reverts commit 31747163
* chore(rebase): fix invalid references
* chore(router): add python dependency
* Lint.
* Fix bad rebase

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Nicolas Patry authored
specifying a value.
-
- 24 Oct, 2024 3 commits
-
-
Daniël de Kok authored
* Add support for FP8 KV cache scales

  Since FP8 only has limited dynamic range, we can scale keys/values before storing them into the cache (and unscale them in attention). To avoid rescaling the cache as the absmax values change, good scales are usually determined per layer using calibration data and stored in the checkpoint.

  This change adds support for using key-value scales and loading them from checkpoints in the two most common formats:

  - Separate per-layer `k_scale` and `v_scale` scalars.
  - Per-layer `kv_scale` scalar (older format).

  Currently, scales are only used with a `float8_e4m3fn` cache.

  Besides adding support for key/value scales, the `fp8_quantize` function is also extended to support quantization with a kernel vendored from vLLM. This is slightly faster than the PyTorch implementation, but also scales in FP32, potentially improving accuracy.
* Update FP8 KV cache test to use checkpoint with scales
* `can_scale`: check that the attention is flashinfer
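Conceptually, a per-layer scale maps values into FP8's representable range before the cast and back after the cache read. A pure-Python sketch of the idea, using e4m3fn's maximum finite value of 448; the function names are illustrative, not the actual TGI kernels:

```python
E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_kv(values, scale):
    """Scale values into FP8 range, clamping to the format's limits."""
    return [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in values]

def dequantize_kv(quantized, scale):
    """Undo the scaling when the cache is read back in attention."""
    return [q * scale for q in quantized]

# Calibration found this layer's absmax to be ~896, so scale = 896 / 448 = 2.0
keys = [100.0, -896.0, 448.0]
q = quantize_kv(keys, scale=2.0)
print(dequantize_kv(q, scale=2.0))  # values survive the round trip
```

Without a good calibrated scale, large activations clamp at the format boundary and are lost, which is why the scales live in the checkpoint rather than being recomputed online.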
-
Daniël de Kok authored
PR #2682 also fixed an issue in Phi MoE, but it changes the test outputs a bit. Fix this.
-
Daniël de Kok authored
-
- 23 Oct, 2024 4 commits
-
-
OlivierDehaene authored
* feat: allow any supported payload on /invocations
* update openAPI
* update doc
-
OlivierDehaene authored
-
OlivierDehaene authored
* feat: natively support Granite models
* Update doc
-
Daniël de Kok authored
-
- 22 Oct, 2024 1 commit
-
-
Daniël de Kok authored
* Add `impureWithCuda` dev shell

  This shell is handy when developing some kernels jointly with TGI - it adds nvcc and a bunch of commonly-used CUDA libraries to the environment. We don't add this to the normal impure shell to keep the development environment as clean as possible (avoid accidental dependencies, etc.).
* Add cuDNN
-
- 21 Oct, 2024 2 commits
-
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
Update the Mixtral GPTQ test to use a model with `desc_act=true` and `group_size!=-1` to ensure that we are checking activation sorting/non-full K (with tensor parallelism). The `desc_act=false` case is already checked by the Mixtral AWQ test.
-
- 19 Oct, 2024 1 commit
-
-
Daniël de Kok authored
Change `fp8_quantize` so that we can pass around reciprocals everywhere, so scales are always passed around in the checkpoint format. I also noticed that we ignore any input scales that we might have when fbgemm is available. Skip this path if we already have a scale.
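The distinction here is whether code passes around the checkpoint-format scale s (where dequantization multiplies by s) or its reciprocal 1/s; keeping one convention everywhere avoids double-inversion bugs. A hypothetical illustration of what goes wrong when conventions are mixed:

```python
def quantize(x, scale):
    """Quantization divides by the checkpoint-format scale."""
    return x / scale

def dequantize(q, scale):
    """Checkpoint convention: dequantized = quantized * scale."""
    return q * scale

# If one layer hands the reciprocal to code expecting the checkpoint
# format, values come back off by a factor of scale**2:
scale = 4.0
x = 8.0
wrong = dequantize(quantize(x, scale), 1.0 / scale)  # 0.5, not 8.0
right = dequantize(quantize(x, scale), scale)        # 8.0
print(wrong, right)
```

Standardizing on the checkpoint format at every function boundary makes this class of bug impossible to introduce silently.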
-
- 18 Oct, 2024 1 commit
-
-
Nicolas Patry authored
* add gptq and awq int4 support in intel platform
* fix ci failure
* set kv cache dtype
* refine the code according to the review comments
* Simplifying conditionals + reverting integration tests values.
* Unused import
* Fix redundant import.
* Revert change after rebase.
* Upgrading the tests (TP>1 fix changes to use different kernels.)
* Update server/text_generation_server/layers/gptq/__init__.py

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 17 Oct, 2024 5 commits
-
-
Daniël de Kok authored
-
drbh authored
* fix: prefer inplace softmax to avoid copy
* Update server/text_generation_server/models/flash_causal_lm.py

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
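An in-place softmax overwrites its input buffer instead of allocating a second one, which matters for large logit tensors. A pure-Python sketch of the idea (the actual change uses torch's in-place operations on tensors):

```python
import math

def softmax_inplace(xs):
    """Compute softmax over xs, reusing the input list as the output buffer."""
    m = max(xs)                  # subtract the max for numerical stability
    total = 0.0
    for i, x in enumerate(xs):
        xs[i] = math.exp(x - m)  # overwrite in place: no second buffer
        total += xs[i]
    for i in range(len(xs)):
        xs[i] /= total
    return xs

logits = [1.0, 2.0, 3.0]
softmax_inplace(logits)
print(round(sum(logits), 6))  # probabilities sum to 1
```

In torch the equivalent pattern is chaining in-place ops (e.g. `sub_`, `exp_`, `div_`) rather than building intermediate tensors.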
-
oOraph authored
tgi-entrypoint: exec instead of spawning a child process

Reason: otherwise the parent receives the signals when we'd like tgi to receive them. Keeping the parent/child split is not necessary, and it would require the parent to handle signals just to forward them properly to the child.

Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com>
Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>
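`exec` replaces the entrypoint's process image instead of forking, so the server keeps the entrypoint's PID and receives SIGTERM/SIGINT directly (in a shell entrypoint this is `exec "$@"`). A small Python demonstration of the same mechanism, assuming a POSIX system; all names here are illustrative:

```python
import subprocess
import sys

# os.execvp replaces the calling process with the target program:
# same PID, so signals aimed at the "entrypoint" reach the server.
entrypoint = (
    "import os, sys\n"
    "os.execvp(sys.executable,"
    " [sys.executable, '-c', 'print(\"server running\")'])\n"
    "print('never reached')  # exec never returns on success\n"
)
out = subprocess.run([sys.executable, "-c", entrypoint],
                     capture_output=True, text=True)
print(out.stdout.strip())
```

Because exec never returns, nothing after it runs; the "parent" logic simply ceases to exist, which is exactly why no signal forwarding is needed.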
-
Daniël de Kok authored
* Simplify the `attention` function

  - Use one definition rather than multiple.
  - Add `key`/`value` arguments, so that we don't need the `PREFILL_IN_KVCACHE` constant.
  - Make it kwargs-only (to avoid mixing up the various `Tensor` args).
* Fixup flashinfer support
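Making the function kwargs-only means callers can never silently swap the tensor arguments. A minimal sketch of the pattern; this signature is illustrative, not TGI's exact one:

```python
def attention(*, query, key, value, scale=1.0):
    """The bare `*` forces every argument to be passed by keyword."""
    return {"q": query, "k": key, "v": value, "scale": scale}

# Keyword calls work; a positional call raises TypeError immediately,
# instead of quietly treating `key` as `query`.
ok = attention(query="q0", key="k0", value="v0")
try:
    attention("q0", "k0", "v0")
except TypeError:
    print("positional call rejected")
```

Since every argument to the real function is a `Tensor` with compatible shapes, a positional mix-up would type-check fine and only show up as wrong model output; the `TypeError` turns that into an immediate, obvious failure.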
-
Daniël de Kok authored
* Support `e4m3fn` KV cache
* Make check more obvious
-
- 16 Oct, 2024 2 commits
-
-
OlivierDehaene authored
* wip
* rollback
* refactor to use prefix/postfix naming + fix all_input_ids_tensor
* maybe patching vlms?
* fix filter and concat
* wip, no filter, no concat
* current
* add prepare_for_prefill
* working
* load tested
* re-create slots
* re-create slots
* fix slot_filtering_indices
* feedback loop
* remove log
* fix benchmarker
* fix vlm and seq2seq
* rename to cache and input lengths
* fix prefill logprobs
* fix launcher
* fix logprobs?
* idk at this point
* max input length
* omfg
* remove debugging lines
* fix tests
* fix mllama
* fix cargo tests
* remove support chunking for paged
* Fixing non blocked attentions
* Fixing dtype + AMD, Ipex targets.
* lint fix.
* rename
* Fix prefix_caching variable, remove defaults in server (confusing a lot of the time).
* Add simple resolution when user specifies ATTENTION=paged.
* Put back non default simple tests.
* Fix env name

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Mohit Sharma authored
* (feat) fp8 fnuz support for rocm
* (review comments) Fix compression_config load, type hints
* (bug) update all has_tensor
* (review_comments) fix typo and added comments
* (nit) improved comment
-
- 15 Oct, 2024 3 commits
-
-
Alvaro Bartolome authored
As spotted by @philschmid, the payload was only partially compliant with Vertex AI: the most compliant version has the generation kwargs flattened to the same level as the `messages`. Vertex AI still expects a list of instances, but each instance is then an OpenAI-compatible payload, which is clearer and more aligned with the SageMaker integration too. Kudos to him for spotting that, and sorry from my end for any inconvenience @Narsil.
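A hypothetical before/after of the instance shape described above; the field names are illustrative, not the exact TGI schema:

```python
import json

# Before: generation kwargs nested under a separate key inside each instance.
before = {
    "instances": [
        {"messages": [{"role": "user", "content": "Hi"}],
         "parameters": {"max_tokens": 32, "temperature": 0.7}}
    ]
}

# After: kwargs flattened to the same level as `messages`, so each
# instance is itself an OpenAI-compatible chat payload.
after = {
    "instances": [
        {"messages": [{"role": "user", "content": "Hi"}],
         "max_tokens": 32, "temperature": 0.7}
    ]
}
print(json.dumps(after["instances"][0], sort_keys=True))
```

The flattened form means each instance can be handed to an OpenAI-style chat handler unchanged, which is what makes it consistent with the SageMaker path.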
-
Daniël de Kok authored
-
Nicolas Patry authored
-
- 14 Oct, 2024 5 commits
-
-
Dmitry Rogozhkin authored
The XPU backend is available natively (without IPEX) in PyTorch starting from version 2.4. This commit extends TGI to cover the case when the user has XPU support through PyTorch 2.4 but does not have IPEX installed. Models which don't require attention can work; for models that require attention, more work is needed to provide an attention implementation.

Tested with the following models:

* teknium/OpenHermes-2.5-Mistral-7B
* bigscience/bloom-560m
* google/gemma-7b
* google/flan-t5-xxl

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Omar Sanseviero authored
Update quicktour.md
-
Nicolas Patry authored
* break when there's nothing to read
* Different approach, only listen on stdin when `LOG_LEVEL=debug` (which is where dropping to a debugger is important).

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
-
Omar Sanseviero authored
* Small improvements for docs
* Update _toctree.yml
* Updating the doc (we keep the list actually).

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
- 11 Oct, 2024 1 commit
-
-
Nicolas Patry authored
-
- 10 Oct, 2024 1 commit
-
-
Nicolas Patry authored
* Intel CI ?
* Let's try non sharded gemma.
* Snapshot rename
* Apparently container can be gone already.
-