Commits · 43df056eee06eb71e2762ec3aa6cb22c5646054e · OpenDAS / text-generation-inference

25 Oct, 2024 2 commits

[TENSORRT-LLM] - Implement new looper thread based backend (#2357) · 43df056e

Funtowicz Morgan authored Oct 25, 2024



* (backend) use parking_lot crate for RwLock fairness

# Conflicts:
#	backends/trtllm/src/backend.rs

* (launcher) default new server::run parameters to false for now

* (chore) fmt ... why?

* (ffi) use const for GetSamplingConfig

* (server) expose new SchedulingError

* (trt)

* (build) setup ccache if available

* (ffi) add max_new_tokens parameters

* (backend) cleanup a bit

* (backend) expose PullNewTokens

* (ffi) cleanup again

* (ffi) add missing headers imports

* (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>

* (looper) new looper initial implementation

* (ffi) remove narrowing type warning

* (ffi) encode the provided user prompt within each request thread

* (misc) change scope identifiers

* (backend) implement the post_processor background thread

* (misc) missing Result types for Rust

* use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step

* (server) forward auth_token to server::run

* (build) fetchcontent use archives instead of git

* (ffi) fix usage of wrong vector constructor making a capacity fill call

* (ffi) missing namespace for tle::Response

* (ffi) do not use reference capture in lambda as we are not capturing anything

* (backend) refactor & cleanup

* (Dockerfile.trtllm) delete for now

* (misc) simplify [make_]move_iterator by using c++20 type inference

* (misc) no need to move for uint32_t items

* (scheduler) rework submit/pull logic

* (post) impl postprocessing

* (misc) delete backend.rs

* (misc) rerun-if-changed all the cmake modules

* (misc) move to latest trtllm

* (fix): HOPPER_SM_MAJOR is 9 not 8

* (misc: build for sm_{75,80,86,89,90} by default

* (misc): build with trtllm 0.13.0

* (misc): increase verbosity of spdlog

* (fix): do not recreate the stateful hashmap at every it

* (misc): update dependency in trtllm dockerfile

* (misc): update dependency in trtllm dockerfile

* (misc): disable logging in release mode

* (misc): improve trtllm download script robustness

* (fix): ore fixes for Dockerfile

* misc(cuda): require 12.6

* chore(cmake): use correct policy for download_timestamp

* feat(looper): check engine and executorWorker paths exist before creating the backend

* chore(cmake): download timestamp should be before URL

* feat(looper): minor optimizations to avoid growing too much the containers

* chore(trtllm): move dockerfile to right place

* chore(trtllm): disable tokenizer parallelism by default

* chore(trtllm): fmt

* chore(trtllm): post-rebase commit

* chore(trtllm): remove unused method

* feat(trtllm): cache maxNumTokens to avoid calling JSON everytime

* misc(router): remove SchedulingError

* feat(trtllm): do not tokenize twice

* Revert "chore(trtllm): remove unused method"

This reverts commit 31747163

* chore(rebase): fix invalid references

* chore(router): add python dependency

* Lint.

* Fix bad rebase

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

43df056e

Fixing "deadlock" when python prompts for trust_remote_code by always (#2664) · ed87b464
Nicolas Patry authored Oct 25, 2024
```
specifiying a value.
```
ed87b464

24 Oct, 2024 3 commits

Add support for FP8 KV cache scales (#2628) · eab07f74

Daniël de Kok authored Oct 24, 2024

* Add support for FP8 KV cache scales

Since FP8 only has limited dynamic range, we can scale keys/values
before storing them into the cache (and unscale them in attention). To
avoid rescaling the cache as the absmax values change, good scales are
usually determined per layer using calibration calibration data and stored
in the checkpoint.

This change adds support for for using key-value scales and loading them
from checkpoints in the two most common formats:

- Separate per-layer `k_scale` and `v_scale` scalars.
- Per-layer `kv_scale` scalar (older format).

Currently, scales are only used with an `float8_e4m3fn` cache.

Besides adding support for key/value scales, the `fp8_quantize` function
is also extended to support quantization with a kernel vendored from
vLLM. This is slightly faster than the PyTorch implementation, but also
scales in FP32, potentially improving accuracy.

* Update FP8 KV cache test to use checkpoint with scales

* `can_scale`: check that the attention is flashinfer

eab07f74

Fix Phi 3.5 MoE tests (#2684) · 14a0df3a

Daniël de Kok authored Oct 24, 2024

PR #2682 also fixed in issue in Phi MoE, but it changes the test
outputs a bit. Fix this.

14a0df3a

flashinfer: reminder to remove contiguous call in the future (#2685) · 1b914f37
Daniël de Kok authored Oct 24, 2024

1b914f37

23 Oct, 2024 4 commits
- feat: allow any supported payload on /invocations (#2683) · 41c26237
  OlivierDehaene authored Oct 23, 2024
```
* feat: allow any supported payload on /invocations

* update openAPI

* update doc
```
  41c26237
- hotfix: fix flashllama · 27ff1871
  OlivierDehaene authored Oct 23, 2024
  
  27ff1871
- feat: natively support Granite models (#2682) · 03c9388b
  OlivierDehaene authored Oct 23, 2024
```
* feat: natively support Granite models

* Update doc
```
  03c9388b
- Make moe-kernels and marlin-kernels mandatory in CUDA installs (#2632) · f58eb70e
  Daniël de Kok authored Oct 23, 2024
  
  f58eb70e
22 Oct, 2024 1 commit

Add `impureWithCuda` dev shell (#2677) · 9c9ef37c

Daniël de Kok authored Oct 22, 2024

* Add `impureWithCuda` dev shell

This shell is handy when developing some kernels jointly with TGI - it
adds nvcc and a bunch of commonly-used CUDA libraries to the environment.

We don't add this to the normal impure shell to keep the development
environment as clean as possible (avoid accidental dependencies, etc.).

* Add cuDNN

9c9ef37c

21 Oct, 2024 2 commits

break when there's nothing to read (#2582) · 058d3061
Wang, Yi authored Oct 21, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
058d3061

Test Marlin MoE with `desc_act=true` (#2622) · 7f54b733

Daniël de Kok authored Oct 21, 2024

Update the Mixtral GPTQ test to use a model with `desc_act=true` and
`group_size!=-1` to ensure that we are checking activation
sorting/non-full K (with tensor parallelism). The `desc_act=false` case
is already checked by the Mixtral AWQ test.

7f54b733

19 Oct, 2024 1 commit

Make handling of FP8 scales more consisent (#2666) · 5e0fb468

Daniël de Kok authored Oct 19, 2024

Change `fp8_quantize` so that we can pass around reciprocals everywhere,
so scales are always passed around in the checkpoint format.

I also noticed that we ignore any input scales that we might have when
fbgemm is available. Skip this path if we already have a scale.

5e0fb468

18 Oct, 2024 1 commit

CI job. Gpt awq 4 (#2665) · 153ff374

Nicolas Patry authored Oct 18, 2024



* add gptq and awq int4 support in intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix ci failure
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* set kv cache dtype
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* refine the code according to the review command
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Simplifying conditionals + reverting integration tests values.

* Unused import

* Fix redundant import.

* Revert change after rebase.

* Upgrading the tests (TP>1 fix changes to use different kernels.)

* Update server/text_generation_server/layers/gptq/__init__.py

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>

153ff374

17 Oct, 2024 5 commits

Break cycle between the attention implementations and KV cache (#2627) · 8ec57558
Daniël de Kok authored Oct 17, 2024

8ec57558

fix: prefer inplace softmax to avoid copy (#2661) · 5f32dea1

drbh authored Oct 17, 2024



* fix: prefer inplace softmax to avoid copy

* Update server/text_generation_server/models/flash_causal_lm.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

5f32dea1

fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process (#2663) · 1b97e084

oOraph authored Oct 17, 2024

tgi-entrypoint: exec instead of spawning a child process

reason: otherwise parent will receive the signals when we'd like tgi to receive them
keeping the parent/child is not necessary and would require the parent to handle signals to forward them properly to the child
Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com>
Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>

1b97e084

Simplify the `attention` function (#2609) · 59ea38cb

Daniël de Kok authored Oct 17, 2024

* Simplify the `attention` function

- Use one definition rather than multiple.
- Add `key`/`value` arguments, so that we don't need the
  `PREFILL_IN_KVCACHE` constant.
- Make it kwargs-only (to avoid mixing up the various `Tensor` args).

* Fixup flashinfer support

59ea38cb

Support `e4m3fn` KV cache (#2655) · 5bbe1ce0
Daniël de Kok authored Oct 17, 2024
```
* Support `e4m3fn` KV cache

* Make check more obvious
```
5bbe1ce0

16 Oct, 2024 2 commits

feat: prefill chunking (#2600) · a6a0c97e

OlivierDehaene authored Oct 16, 2024



* wip

* rollback

* refactor to use prefix/postfix namming + fix all_input_ids_tensor

* maybe patching vlms?

* fix filter and concat

* wip, no filter, no concat

* current

* add prepare_for_prefill

* working

* load tested

* re-create slots

* re-create slots

* fix slot_filtering_indices

* feedback loop

* remove log

* fix benchmarker

* fix vlm and seq2seq

* rename to cache and input lengths

* fix prefill logprobs

* fix launcher

* fix logprobs?

* idk at this point

* max input length

* omfg

* remove debugging lines

* fix tests

* fix mllama

* fix cargo tests

* remove support chunking for paged

* Fixing non blocked attentions

* Fixing dtype + AMD, Ipex targets.

* lint fix.

* rename

* Fix prefix_caching variable, remove defaults in server (confusing a lot
of the times).

* Add simple resolution when user specifies ATTENTION=paged.

* Put back non default simple tests.

* Fix env name

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

a6a0c97e

Fp8 e4m3_fnuz support for rocm (#2588) · 704a58c8

Mohit Sharma authored Oct 16, 2024

* (feat) fp8 fnuz support for rocm

* (review comments) Fix compression_config load, type hints

* (bug) update all has_tensor

* (review_comments) fix typo and added comments

* (nit) improved comment

704a58c8

15 Oct, 2024 3 commits

Rollback to `ChatRequest` for Vertex AI Chat instead of `VertexChat` (#2651) · ffe05ccd

Alvaro Bartolome authored Oct 15, 2024

As spotted by @philschmid, the payload was compliant with Vertex AI, but
just partially, since ideally the most compliant version would be with
the generation kwargs flattened to be on the same level as the
`messages`; meaning that Vertex AI would still expect a list of
instances, but each instance would be an OpenAI-compatible instance,
which is more clear; and more aligned with the SageMaker integration
too, so kudos to him for spotting that; and sorry from my end for any
inconvenience @Narsil.

ffe05ccd

Use flashinfer for Gemma 2. · ce7e3565
Daniël de Kok authored Oct 15, 2024

ce7e3565
Fixing linters. (#2650) · cf04a43f
Nicolas Patry authored Oct 15, 2024

cf04a43f

14 Oct, 2024 5 commits

feat: enable pytorch xpu support for non-attention models (#2561) · 58848cb4

Dmitry Rogozhkin authored Oct 14, 2024



XPU backend is available natively (without IPEX) in pytorch starting
from pytorch 2.4. This commit extends TGI to cover the case when user
has XPU support thru pytorch 2.4, but does not have IPEX installed.
Models which don't require attention can work. For attention required
models more work is needed to provide attention implementation.

Tested with the following models:
* teknium/OpenHermes-2.5-Mistral-7B
* bigscience/bloom-560m
* google/gemma-7b
* google/flan-t5-xxl
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>

58848cb4

update ipex to fix incorrect output of mllama in cpu (#2640) · 7a82ddcb
Wang, Yi authored Oct 14, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
7a82ddcb
Clarify gated description and quicktour (#2631) · 51f54018
Omar Sanseviero authored Oct 14, 2024
```
Update quicktour.md
```
51f54018

Cpu perf (#2596) · 3ea82d00

Nicolas Patry authored Oct 14, 2024



* break when there's nothing to read
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Different approach, only listen on stdin when `LOG_LEVEL=debug` (which
is where dropping to a debugger is important).

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>

3ea82d00

Small fixes for supported models (#2471) · ce28ee88

Omar Sanseviero authored Oct 14, 2024



* Small improvements for docs

* Update _toctree.yml

* Updating the doc (we keep the list actually).

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

ce28ee88

11 Oct, 2024 1 commit
- Fixing intel Supports windowing. (#2637) · 0c478846
  Nicolas Patry authored Oct 11, 2024
  
  0c478846
10 Oct, 2024 3 commits

Intel ci (#2630) · 3dbdf63e

Nicolas Patry authored Oct 10, 2024

* Intel CI ?

* Let's try non sharded gemma.

* Snapshot rename

* Apparently container can be gone already.

3dbdf63e

Update documentation to most recent stable version of TGI. (#2625) · d912f0bf
vb authored Oct 10, 2024
```
Update to most recent stable version of TGI.
```
d912f0bf

feat: allow tool calling to respond without a tool (#2614) · e36dfaa8

drbh authored Oct 10, 2024



* feat: process token stream before returning to client

* fix: expect content in test

* fix: improve comparison via ruff lint

* fix: return event in all cases

* fix: always send event on error, avoid unwraps, refactor and improve tests

* fix: prefer no_tool over notify_error to improve reponse

* fix: adjust chat input test for no_tool

* fix: adjust test expected content

---------
Co-authored-by: System administrator <root@ip-10-90-0-186.ec2.internal>

e36dfaa8

09 Oct, 2024 2 commits

AMD CI (#2589) · 43f39f68

Nicolas Patry authored Oct 09, 2024

* Only run 1 valid test.

* TRying the tailscale action quickly.

* ?

* bash spaces.

* Remove tailscale.

* More quotes.

* mnt2 ?

* Othername to avoid recursive directories.

* Good old tmate.

* Remove tmate.

* Trying a few things.

* Remove some stuff.

* Sleep ?

* Tmp

* busybox

* Launcher tgi

* Starting hello

* Busybox in python

* No device.

* Removing all variables ?

* A un moment donné.

* Tmp

* Tmp2

* DEvice request, no container name

* No device requests

* Without pytest.

* No pytest.

* from env

* Start with devices

* Attemp #1

* Remove stdin messing

* Only 1 test, no container name

* Raw tgi

* Sending args.

* Show pip freeze.

* Start downloading with token

* Giving HIP devices.

* Mount volume + port forward

* Without pytest.

* No token

* Repeated arguments

* Wrong kwarg.

* On 2 GPUs

* Fallback to single shard CI test.

* Testing

* yaml

* Common cache ?

* Trailing slash ?

* Docker volume split.

* Fix docker volume

* Fixing ?

* ?

* Try no devices ?

* Flash llama on intel CPU ?

* Fix nvidia ?

* Temp deactivate intel, activate nvidia ?

43f39f68

nix: add black and isort to the closure (#2619) · 9ed0c85f

Daniël de Kok authored Oct 09, 2024

To make sure that everything is formatted with the same black version
as CI.

I sometimes use isort for new files to get nicely ordered imports,
so add it as well. Also set the isort configuration to format in a
way that is compatible with black.

9ed0c85f

08 Oct, 2024 4 commits
- CI (2599): Update ToolType input schema (#2601) · 8ad20daf
  drbh authored Oct 08, 2024
```
* Update ToolType input schema

* lint

* fix: run formatter

* fix: allow tool choide to be null

---------
Co-authored-by: Wauplin <lucainp@gmail.com>
```
  8ad20daf
- nix: move back to the tgi-nix main branch (#2620) · 6db3bcb7
  Daniël de Kok authored Oct 08, 2024
  
  6db3bcb7
- Add support for fused MoE Marlin for AWQ (#2616) · 64142489
  Daniël de Kok authored Oct 08, 2024
```
* Add support for fused MoE Marlin for AWQ

This uses the updated MoE Marlin kernels from vLLM.

* Add integration test for AWQ MoE
```
  64142489
- Upgrade minor rust version (Fixes rust build compilation cache) (#2617) · 8b295aa4
  Nicolas Patry authored Oct 08, 2024
```
* Upgrade minor rust version (Fixes rust build compilation cache)

* Black
```
  8b295aa4
07 Oct, 2024 1 commit
- enable mllama in intel platform (#2610) · 57f9685d
  Wang, Yi authored Oct 08, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
  57f9685d