- 31 Jul, 2024 4 commits
-
-
Erik Kaunismäki authored
* refactor usage stats
* Update docs/source/usage_statistics.md
* Update router/src/server.rs
* changes based on feedback
* run python3 update_doc.py
* fix pre-commit
* Update router/src/server.rs
* delete option around usage stats arg

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
drbh authored
* MODEL_ID propagation fix
* fix: remove global model id

Co-authored-by: root <root@tw031.pit.tensorwave.lan>
-
Daniël de Kok authored
The `GPTQWeightsLoader` was structured like this in pseudocode:

    if marlin:
        Set up tensors in a way that GPTQ-Marlin expects
    else:
        Set up tensors in a way that ExLlama/GPTQ/AWQ expect

However, the GPTQ-Marlin implementation details should really live in the `marlin` module. So, move the former branch out to a separate `GPTQMarlinWeightsLoader`.
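A minimal sketch of what the split looks like. Only `GPTQMarlinWeightsLoader` and the `marlin` module are named in the change itself; the stand-in class bodies and the dispatch function are illustrative:

```python
class GPTQWeightsLoader:
    """Sets up tensors the way ExLlama/GPTQ/AWQ kernels expect."""

class GPTQMarlinWeightsLoader:
    """Sets up tensors the way the GPTQ-Marlin kernel expects."""

def get_weights_loader(quantize: str, marlin_supported: bool):
    # After the split, the Marlin-specific layout lives in its own loader
    # (in the `marlin` module) instead of an if/else inside the GPTQ loader.
    if quantize == "gptq" and marlin_supported:
        return GPTQMarlinWeightsLoader()
    return GPTQWeightsLoader()
```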
-
Nicolas Patry authored
* wip

  wip
  refacto
  refacto
  Initial setup for CXX binding to TRTLLM
  Working FFI call for TGI and TRTLLM backend
  Remove unused parameters and force tokenizer name to be set
  Overall build TRTLLM and deps through CMake build system
  Enable end to end CMake build
  First version loading engines and making it ready for inference
  Remembering to check how we can detect support for chunked context
  Move to latest TensorRT-LLM version
  Specify which default log level to use depending on CMake build type
  make leader executor mode working
  unconditionally call InitializeBackend on the FFI layer
  bind to CUDA::nvml to retrieve compute capabilities at runtime
  updated logic and comment to detect cuda compute capabilities
  implement the Stream method to send new tokens through a callback
  use spdlog release 1.14.1 moving forward
  update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c
  correctly tell cmake to build dependent tensorrt-llm required libraries
  create cmake install target to put everything relevant in installation folder
  add auth_token CLI argument to provide hf hub authentication token
  allow converting huggingface::tokenizers error to TensorRtLlmBackendError
  use correct include for spdlog
  include guard to build example in cmakelists
  working setup of the ffi layer
  remove fmt import
  use external fmt lib
  end to end ffi flow working
  make sure to track include/ffi.h to trigger rebuild from cargo
  impl the rust backend which currently cannot move the actual computation in background thread
  expose shutdown function at ffi layer
  impl RwLock scenario for TensorRtLllmBackend
  oops missing c++ backend definitions
  compute the number of maximum new tokens for each request independently
  make sure the context is not dropped in the middle of the async decoding
  remove unnecessary log
  add all the necessary plumbing to return the generated content
  update invalid doc in cpp file
  correctly forward back the log probabilities
  remove unneeded scope variable for now
  refactor Stream impl for Generation to factorise code
  expose the internal missing start/queue timestamp
  forward tgi parameters rep/freq penalty
  add some more validation about grammar not supported
  define a shared struct to hold the result of a decoding step
  expose information about potential error happening while decoding
  remove logging
  add logging in case of decoding error
  make sure executor_worker is provided
  add initial Dockerfile for TRTLLM backend
  add some more information in CMakeLists.txt to correctly install executorWorker
  add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper
  simplify prebuilt trtllm libraries name definition
  do the same name definition stuff for tensorrt_llm_executor_static
  leverage pkg-config to probe libraries paths and reuse new install structure from cmake
  fix bad copy/paste
  missing nvinfer linkage direction
  align all the linker search dependency
  add missing pkgconfig folder for MPI in Dockerfile
  correctly setup linking search path for runtime layer
  fix missing / before tgi lib path
  adding missing ld_library_path for cuda stubs in Dockerfile
  update tgi entrypoint
  commenting out Python part for TensorRT installation
  refactored docker image
  move to TensorRT-LLM v0.11.0
  make docker linter happy with same capitalization rule
  fix typo
  refactor the compute capabilities detection along with num gpus
  update TensorRT-LLM to latest version
  update TensorRT install script to latest
  update build.rs to link to cuda 12.5
  add missing dependent libraries for linking
  clean up a bit
  install to decoder_attention target
  add some custom stuff for nccl linkage
  fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time
  use std::env::const::ARCH
  make sure variable live long enough...
  look for cuda 12.5
  add some more basic info in README.md

* Rebase.
* Fix autodocs.
* Let's try to enable trtllm backend.
* Ignore backends/v3 by default.
* Fixing client.
* Fix makefile + autodocs.
* Updating the schema thing + redocly.
* Fix trtllm lint.
* Adding pb files ?
* Remove cargo fmt temporarily.
* ?
* Tmp.
* Remove both check + clippy ?
* Backporting telemetry.
* Backporting 457fb0a1
* Remove PB from git.
* Fixing PB with default member backends/client
* update TensorRT-LLM to latest version
* provided None for api_key
* link against libtensorrt_llm and not libtensorrt-llm

Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>
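One recurring thread above is detecting CUDA compute capabilities at runtime via NVML. The backend does this in C++ through `CUDA::nvml`; here is a Python sketch of the same idea using `pynvml`, offered only to illustrate the mechanism:

```python
import pynvml

# Query each GPU's CUDA compute capability at runtime through NVML,
# mirroring what the TRTLLM backend does in C++ via CUDA::nvml.
pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        print(f"GPU {i}: compute capability {major}.{minor}")
finally:
    pynvml.nvmlShutdown()
```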
-
- 30 Jul, 2024 1 commit
-
-
Daniël de Kok authored
- Create `quantization_config` option in the model config.
- Don't store the quantizer config in tensors anymore.
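As a rough sketch, the quantizer settings now come from the model's config.json rather than tensor metadata; field names below follow the standard `quantization_config` block written by GPTQ/AWQ tooling:

```python
import json

# Read quantizer settings from config.json instead of tensor metadata.
with open("config.json") as f:
    config = json.load(f)

quant = config.get("quantization_config", {})
bits = quant.get("bits")                  # e.g. 4
group_size = quant.get("group_size")      # e.g. 128
quant_method = quant.get("quant_method")  # e.g. "gptq" or "awq"
```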
-
- 29 Jul, 2024 6 commits
-
-
drbh authored
* fix: adjust test snapshots and small refactors
* fix: revert non-snapshot changes
-
Erik Kaunismäki authored
* quick fix
* allow silent failure
* explicit todo that this is only short term
-
drbh authored
-
Daniël de Kok authored
-
Erik Kaunismäki authored
* Add API_Key for Auth and conditionally add authorisation for non info/health endpoints
* change name to info routes
* Fix comment
* convert strings to lowercase for case insensitive comparison
* convert header to string
* fixes and update docs
* update docs again
* revert wrong update

Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>
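A hedged sketch of the idea: TGI's router is written in Rust, so this Python/FastAPI version only illustrates exempting info/health-style routes and comparing the auth scheme case-insensitively. The route names and key source here are assumptions, not TGI's actual configuration:

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
API_KEY = "secret"                  # illustrative; TGI takes this from a CLI arg
OPEN_ROUTES = {"/info", "/health"}  # info/health endpoints stay unauthenticated

@app.middleware("http")
async def require_api_key(request: Request, call_next):
    if request.url.path not in OPEN_ROUTES:
        auth = request.headers.get("authorization", "")
        # Compare the scheme case-insensitively: "Bearer" vs "bearer".
        scheme, _, token = auth.partition(" ")
        if scheme.lower() != "bearer" or token != API_KEY:
            return JSONResponse({"error": "unauthorized"}, status_code=401)
    return await call_next(request)
```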
-
Adrien authored
Signed-off-by: Adrien <adrien@huggingface.co>
-
- 26 Jul, 2024 2 commits
-
-
drbh authored
* feat: add ruff and resolve issue
* fix: update client exports and adjust after rebase
* fix: adjust syntax to avoid circular import
* fix: adjust client ruff settings
* fix: lint and refactor import check and avoid model enum as global names
* fix: improve fbgemm_gpu check and lints
* fix: update lints
* fix: prefer comparing model enum over str
* fix: adjust lints and ignore specific rules
* fix: avoid unneeded quantize check
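One of the bullets above, "prefer comparing model enum over str", boils down to the following pattern (illustrative names, not TGI's actual enum):

```python
from enum import Enum

class ModelType(Enum):
    LLAMA = "llama"
    MISTRAL = "mistral"

model_type = ModelType.LLAMA

# Preferred: compare enum members directly.
if model_type == ModelType.LLAMA:
    print("llama path")

# Discouraged: comparing raw strings bypasses the type checker and
# silently breaks if the enum values ever change.
if model_type.value == "llama":
    print("stringly-typed llama path")
```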
-
Daniël de Kok authored
-
- 25 Jul, 2024 4 commits
-
-
Adrien authored
-
Nicolas Patry authored
-
Daniël de Kok authored
* Fix GPTQ autotune data type to be compatible with Torch 2.4.0
* Update poetry lock file
* Fix small PaliGemma logprob differences after the torch update
-
Nicolas Patry authored
* Using g6 instead of g5.
* Update the idefics2 snapshot.
-
- 24 Jul, 2024 4 commits
-
-
drbh authored
* fix: refactor adapter weight loading and mapping
* feat: enable lora load from directory
* fix: adjust launcher for local lora adapters
* feat: improve weight loading and add tests
* fix: improve logging and rebase syntax issue
* fix: improve adapter merge comments and remove unused conditional
* fix: improve get_model_with_lora_adapters naming
* fix: comment typo
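The "lora load from directory" bullet amounts to roughly this resolution step; a sketch where `resolve_adapter_source` is an illustrative name, not TGI's API:

```python
import os

def resolve_adapter_source(adapter_id: str) -> str:
    # A LoRA adapter may now be given as a local directory
    # instead of a Hugging Face Hub repo id.
    if os.path.isdir(adapter_id):
        return adapter_id  # read adapter_config.json / safetensors from disk
    # Otherwise treat it as a Hub repo id and download it.
    from huggingface_hub import snapshot_download
    return snapshot_download(adapter_id)
```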
-
Daniël de Kok authored
The `marlin.py` file was getting large, so split it up.
-
Wang, Yi authored
Fix the use of unquantized weights in Cohere GQA loading; also enable the model on the Intel platform. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Wang, Yi authored
* fix crash in multi-modal
* update according to review comment
* fix llava_next regression in latest main

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 23 Jul, 2024 9 commits
-
-
OlivierDehaene authored
-
OlivierDehaene authored
* chore: update to torch 2.4
* remove unnecessary patch
* fix
-
Daniël de Kok authored
-
Daniël de Kok authored
* Add support for Llama 3 rotary embeddings
* Update transformers to 4.43
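For reference, Llama 3-style rotary scaling rescales each inverse frequency according to its wavelength. A sketch using the published Llama 3.1 defaults (scaling factor 8, low/high frequency factors 1 and 4, original context length 8192); this follows Meta's reference formula, not TGI's exact code:

```python
import math
import torch

def llama3_scale_freqs(inv_freqs: torch.Tensor,
                       factor: float = 8.0,
                       low_freq_factor: float = 1.0,
                       high_freq_factor: float = 4.0,
                       original_max_position: int = 8192) -> torch.Tensor:
    low_freq_wavelen = original_max_position / low_freq_factor
    high_freq_wavelen = original_max_position / high_freq_factor
    scaled = []
    for freq in inv_freqs.tolist():
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            scaled.append(freq)            # high-frequency band: keep as-is
        elif wavelen > low_freq_wavelen:
            scaled.append(freq / factor)   # low-frequency band: scale down
        else:
            # Smoothly interpolate between the two bands.
            smooth = (original_max_position / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return torch.tensor(scaled, dtype=inv_freqs.dtype)
```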
-
Nicolas Patry authored
* Preparing for release.
* Updating docs.
* Fixing token within the docker image for the launcher.
-
shaltielshmid authored
* Support passing head_dim through config
* Using `head_dim` as a fallback is necessary since it's a non-standard key in MistralConfig (as defined in transformers).
* Shorter diff.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
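The fallback amounts to something like this, where `config` stands for a transformers `MistralConfig`-style object:

```python
# Use head_dim from the config when present; otherwise derive it,
# since head_dim is a non-standard key in MistralConfig.
head_dim = getattr(config, "head_dim", None)
if head_dim is None:
    head_dim = config.hidden_size // config.num_attention_heads
```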
-
Daniël de Kok authored
* Add support for repacking AWQ weights for GPTQ-Marlin

  So far we couldn't support AWQ because virtually all AWQ models use asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin has recently added support for AWQ repacking and AWQ asymmetric quantization (zero_point=True). This change updates all GPTQ-Marlin kernels from upstream and wires up AWQ support. For now, enabling AWQ using Marlin requires running TGI with `--quantize gptq`.

* Enable Marlin for supported AWQ configurations by default

  This makes the AWQ -> GPTQ repack test redundant, since we are now testing this with the regular AWQ test.
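A rough sketch of the kind of compatibility gate involved in "supported AWQ configurations"; the exact bit widths and group sizes are kernel-specific, so treat the values below as assumptions:

```python
def can_repack_awq_for_marlin(bits: int, group_size: int) -> bool:
    # GPTQ-Marlin kernels handle 4- and 8-bit weights and a limited
    # set of group sizes (-1 meaning per-column / no grouping).
    return bits in (4, 8) and group_size in (-1, 32, 64, 128)
```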
-
OlivierDehaene authored
* fix(l4): fix fp8 logic on l4
* also quant weights with single scale
* use marlin even on 89
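"Quant weights with single scale" refers to per-tensor fp8 quantization: one scale for the whole weight tensor. A minimal sketch, assuming `float8_e4m3fn` weights (not necessarily TGI's exact routine):

```python
import torch

def fp8_quantize_single_scale(weight: torch.Tensor):
    # One scale for the whole tensor, chosen so the largest magnitude
    # maps onto the fp8 range limit.
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = weight.abs().max().clamp(min=1e-12) / finfo.max
    qweight = (weight / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return qweight, scale  # dequantize with qweight.to(weight.dtype) * scale
```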
-
Nicolas Patry authored
-
- 22 Jul, 2024 6 commits
-
-
Adrien authored
-
Nicolas Patry authored
* Softcapping for gemma2.
* Less clutter.
* No access to transformers config, only config_dict here.
* 0.0 is the null value in the C++ API.
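Softcapping squashes logits through a scaled tanh. A minimal sketch, where a cap of 0.0 means "disabled", matching the C++ API convention noted above:

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # cap == 0.0 is treated as "no softcapping", mirroring the C++ API's
    # use of 0.0 as the null value.
    if cap == 0.0:
        return logits
    return cap * torch.tanh(logits / cap)
```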
-
OlivierDehaene authored
* fix(server): fix fp8 weight loading
* fixed scales loading
* update snap
* revert default dtype
-
Adrien authored
* test new instances
* improve build ci

Signed-off-by: Adrien <adrien@huggingface.co>
-
Erik Kaunismäki authored
Update README.md to point to the huggingface_hub inference clients instead.
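For reference, a quick usage sketch of the huggingface_hub client against a TGI endpoint (the URL is illustrative):

```python
from huggingface_hub import InferenceClient

# Point the client at a running TGI server.
client = InferenceClient("http://localhost:8080")

# Simple text generation against the TGI endpoint.
output = client.text_generation("What is deep learning?", max_new_tokens=64)
print(output)
```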
-
icyboy™ authored
* Update idefics_causal_lm.py to fix syntax issues
* fix dbrx & opt model prefix bug
* Hotfix: fix the use of unquantized weights in Mixtral GQA loading
-
- 21 Jul, 2024 1 commit
-
-
OlivierDehaene authored
-
- 20 Jul, 2024 3 commits
-
-
OlivierDehaene authored
* feat(fp8): add support for fbgemm
* allow loading fp8 weights directly
* update outlines
* fix makefile
* build fbgemm
* avoid circular import and fix dockerfile
* add default dtype
* refactored weights loader
* fix auto conversion
* fix quantization config parsing
* force new nccl on install
* missing get_weights implementation
* increase timeout
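A hedged sketch of the kind of availability check involved: only take the fbgemm fp8 path when `fbgemm_gpu` imports cleanly and the GPU is new enough. The sm90 cutoff and module path are assumptions here, not confirmed by the commit:

```python
import torch

def fbgemm_fp8_available() -> bool:
    try:
        # Module path per fbgemm-gpu's GenAI ops; treat as an assumption.
        import fbgemm_gpu.experimental.gen_ai  # noqa: F401
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Assumed cutoff: H100-class GPUs (compute capability 9.x).
    major, _ = torch.cuda.get_device_capability()
    return major >= 9
```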
-
Daniël de Kok authored
-
Adrien authored
* re-push to internal registry
* fix name
* debug
* debug
* wip
* wip
* wip debug
* add debug
* should
* wip
* ww
* wip
* wip
* ww
* wip
* wip
* debug
* w
* revert tests
* last reverts
* another one

Signed-off-by: Adrien <adrien@huggingface.co>
-