Commits · 4e821c003a7cb055a358cf142dbf01a2f4c1916f · OpenDAS / text-generation-inference

12 Aug, 2024 1 commit

Add support for prefix caching to the v3 router (#2392) · 8deeaca4

Daniël de Kok authored Aug 12, 2024

This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.

For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`.
In this case, the router will switch to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen prior. If a new prefill is a prefix of a previously seen
prefill, the router will send a request with `prefix_len>0`, which
the backend can use to decide to reuse KV blocks from the
cache, rather than recomputing them.

Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are simply ignored.
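
The lookup described above can be illustrated with a minimal sketch. It
is not the `RadixAllocator` from this PR: it uses a plain, uncompressed
trie keyed by token id instead of a radix trie, it does not manage KV
blocks, and the names `PrefixTracker` and `insert_and_match` are
hypothetical. It only shows how a trie of earlier prefills yields the
length of the longest previously seen prefix.

```rust
use std::collections::HashMap;

/// One trie node. Hypothetical, simplified stand-in: the real allocator
/// uses a radix trie and also tracks cache blocks.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u32, TrieNode>,
}

/// Remembers the token sequences of earlier prefills.
#[derive(Default)]
pub struct PrefixTracker {
    root: TrieNode,
}

impl PrefixTracker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns how many leading tokens of `tokens` were already seen in an
    /// earlier prefill, then records `tokens` for future lookups. The return
    /// value plays the role of `prefix_len` in the description above.
    pub fn insert_and_match(&mut self, tokens: &[u32]) -> usize {
        let mut node = &mut self.root;
        let mut prefix_len = 0;
        let mut matching = true;
        for &tok in tokens {
            if matching && node.children.contains_key(&tok) {
                prefix_len += 1;
            } else {
                // First unseen token: everything after it is new.
                matching = false;
            }
            // Walk down, creating the node if it does not exist yet.
            node = node.children.entry(tok).or_default();
        }
        prefix_len
    }
}
```

With this sketch, the first call for a token sequence returns 0, and a
later call for a sequence that starts with the same tokens returns the
number of shared leading tokens.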

8deeaca4

31 Jul, 2024 1 commit

Rebase TRT-llm (#2331) · 2b19d671

Nicolas Patry authored Jul 31, 2024

* wip

wip

refacto

refacto

Initial setup for CXX binding to TRTLLM

Working FFI call for TGI and TRTLLM backend

Remove unused parameters and force tokenizer name to be set

Overall build TRTLLM and deps through CMake build system

Enable end to end CMake build

First version loading engines and making it ready for inference

Remembering to check how we can detect support for chunked context

Move to latest TensorRT-LLM version

Specify which default log level to use depending on CMake build type

make leader executor mode working

unconditionally call InitializeBackend on the FFI layer

bind to CUDA::nvml to retrieve compute capabilities at runtime

updated logic and comment to detect cuda compute capabilities

implement the Stream method to send new tokens through a callback

use spdlog release 1.14.1 moving forward

update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c

correctly tell cmake to build dependent tensorrt-llm required libraries

create cmake install target to put everything relevant in installation folder

add auth_token CLI argument to provide hf hub authentication token

allow converting huggingface::tokenizers error to TensorRtLlmBackendError

use correct include for spdlog

include guard to build example in cmakelists

working setup of the ffi layer

remove fmt import

use external fmt lib

end to end ffi flow working

make sure to track include/ffi.h to trigger rebuild from cargo

impl the rust backend which currently cannot move the actual computation in background thread

expose shutdown function at ffi layer

impl RwLock scenario for TensorRtLlmBackend

oops missing c++ backend definitions

compute the number of maximum new tokens for each request independently

make sure the context is not dropped in the middle of the async decoding.

remove unnecessary log

add all the necessary plumbing to return the generated content

update invalid doc in cpp file

correctly forward back the log probabilities

remove unneeded scope variable for now

refactor Stream impl for Generation to factorise code

expose the internal missing start/queue timestamp

forward tgi parameters rep/freq penalty

add some more validation about grammar not supported

define a shared struct to hold the result of a decoding step

expose information about potential error happening while decoding

remove logging

add logging in case of decoding error

make sure executor_worker is provided

add initial Dockerfile for TRTLLM backend

add some more information in CMakeLists.txt to correctly install executorWorker

add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper

simplify prebuilt trtllm libraries name definition

do the same name definition stuff for tensorrt_llm_executor_static

leverage pkg-config to probe libraries paths and reuse new install structure from cmake

fix bad copy/paste missing nvinfer linkage direction

align all the linker search dependency

add missing pkgconfig folder for MPI in Dockerfile

correctly setup linking search path for runtime layer

fix missing / before tgi lib path

adding missing ld_library_path for cuda stubs in Dockerfile

update tgi entrypoint

commenting out Python part for TensorRT installation

refactored docker image

move to TensorRT-LLM v0.11.0

make docker linter happy with same capitalization rule

fix typo

refactor the compute capabilities detection along with num gpus

update TensorRT-LLM to latest version

update TensorRT install script to latest

update build.rs to link to cuda 12.5

add missing dependent libraries for linking

clean up a bit

install to decoder_attention target

add some custom stuff for nccl linkage

fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time

use std::env::consts::ARCH

make sure variable live long enough...

look for cuda 12.5

add some more basic info in README.md

* Rebase.

* Fix autodocs.

* Let's try to enable trtllm backend.

* Ignore backends/v3 by default.

* Fixing client.

* Fix makefile + autodocs.

* Updating the schema thing + redocly.

* Fix trtllm lint.

* Adding pb files ?

* Remove cargo fmt temporarily.

* ?

* Tmp.

* Remove both check + clippy  ?

* Backporting telemetry.

* Backporting 457fb0a1



* Remove PB from git.

* Fixing PB with default member backends/client

* update TensorRT-LLM to latest version

* provided None for api_key

* link against libtensorrt_llm and not libtensorrt-llm

---------
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>

2b19d671

25 Jun, 2024 1 commit

Enable multiple LoRa adapters (#2010) · 04e1af94

drbh authored Jun 25, 2024



* feat: first draft load multiple lora

* feat: load weights within layer and refactor lora pass

* fix: refactor and reduce lora math

* feat: baseline impl single request multi lora support

* feat: prefer lorax implementation and port loading logic

* fix: prefer adapter_data and refactors

* feat: prefer lorax's custom punica kernels and add mlp loras

* fix: adjust batch for bgmv

* fix: adjust adapter_segments logic when in batch

* fix: refactor and move changes to v3 proto

* fix: pass model_id for all flash causal lms

* fix: pass model_id for all causal and seq2seq lms

* fix: add model_id to model test

* feat: add lora support to mistral and refactors

* feat: prefer model id in request

* fix: include rust code for adapter id

* feat: bump launcher and add new lora docs

* feat: support base model generation and refactors

* fix: rename doc to retry ci build

* feat: support if vlm models

* fix: add adapter_data param and avoid missing layers

* fix: add adapter_data param to phi and neox

* fix: update all models forwards to include adapter_data

* fix: add model_id to IdeficsCausalLM

* Update lora.md

Fixed a typo

* Update lora.md

Fixing spam image

* fix: add lora kernel to dockerfile, support running without kernels and refactors

* fix: avoid dockerfile conflict

* fix: refactors and adjust flash llama lora logic

* fix: skip llama test due to CI issue (temp)

* fix: skip llama test CI (temp) 2

* fix: revert skips and prefer updated ci token for tests

* fix: refactors and helpful comments

* fix: add noop in TensorParallelAdapterRowLinear too

* fix: refactor and move shard_lora_weights logic

* fix: exit early if no adapter_data

---------
Co-authored-by: Derek <datavistics@gmail.com>

04e1af94

05 Jun, 2024 1 commit
- feat: move allocation logic to rust (#1835) · 8aece3bd
  OlivierDehaene authored Jun 05, 2024
```
Close #2007
```
  8aece3bd
04 Jun, 2024 1 commit

feat: add SchedulerV3 (#1996) · 757223b3

OlivierDehaene authored Jun 04, 2024

- Refactor code to allow supporting multiple versions of the
generate.proto at the same time
- Add v3/generate.proto (identical to generate.proto for now, but allows
future changes without impacting v2 backends)
- Add Schedule trait to abstract queuing and batching mechanisms that
will be different in the future
- Add SchedulerV2/V3 impl
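
As a rough illustration of the trait-based split listed above, the
sketch below puts two schedulers behind one trait so that calling code
does not depend on the protocol version. Only the names `Schedule`,
`SchedulerV2` and `SchedulerV3` come from this commit message. The
`Request` type, the method names, the `run_once` helper and both
batching policies are hypothetical and are not taken from the router.

```rust
use std::collections::VecDeque;

/// Hypothetical request type, reduced to what the sketch needs.
pub struct Request {
    pub id: u64,
    pub input_tokens: u32,
}

/// Abstracts queuing and batching. Method names are assumptions.
pub trait Schedule {
    fn enqueue(&mut self, request: Request);
    fn next_batch(&mut self) -> Vec<Request>;
}

/// Illustrative policy: cap the batch by number of requests.
pub struct SchedulerV2 {
    queue: VecDeque<Request>,
    max_batch_size: usize,
}

/// Illustrative policy: cap the batch by a token budget.
pub struct SchedulerV3 {
    queue: VecDeque<Request>,
    max_batch_tokens: u32,
}

impl SchedulerV2 {
    pub fn new(max_batch_size: usize) -> Self {
        Self { queue: VecDeque::new(), max_batch_size }
    }
}

impl SchedulerV3 {
    pub fn new(max_batch_tokens: u32) -> Self {
        Self { queue: VecDeque::new(), max_batch_tokens }
    }
}

impl Schedule for SchedulerV2 {
    fn enqueue(&mut self, request: Request) {
        self.queue.push_back(request);
    }

    fn next_batch(&mut self) -> Vec<Request> {
        let n = self.max_batch_size.min(self.queue.len());
        self.queue.drain(..n).collect()
    }
}

impl Schedule for SchedulerV3 {
    fn enqueue(&mut self, request: Request) {
        self.queue.push_back(request);
    }

    fn next_batch(&mut self) -> Vec<Request> {
        let mut batch = Vec::new();
        let mut budget = self.max_batch_tokens;
        // Take requests in FIFO order while they fit in the token budget.
        while let Some(front) = self.queue.front() {
            let tokens = front.input_tokens;
            if tokens > budget {
                break;
            }
            budget -= tokens;
            batch.push(self.queue.pop_front().unwrap());
        }
        batch
    }
}

/// Version-agnostic caller: enqueue everything, return the ids of one batch.
pub fn run_once(scheduler: &mut dyn Schedule, requests: Vec<Request>) -> Vec<u64> {
    for request in requests {
        scheduler.enqueue(request);
    }
    scheduler.next_batch().into_iter().map(|r| r.id).collect()
}
```

The point of the design is that `run_once` accepts `&mut dyn Schedule`,
so a new scheduler version can be added without touching the caller.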

757223b3

09 Feb, 2024 1 commit
- feat(router): add max_batch_size (#1542) · 53214633
  OlivierDehaene authored Feb 09, 2024
```
Some hardware requires a maximum batch size.
```
  53214633
22 Jan, 2024 1 commit

chore: bump rust version and annotate/fix all clippy warnings (#1455) · becd0997

drbh authored Jan 22, 2024

This PR bumps Rust to the latest version and fixes or annotates all clippy warnings.

```bash
cargo clippy --all -- -D warnings
#    Finished dev [unoptimized + debuginfo] target(s) in 0.10s
```

becd0997

14 Dec, 2023 1 commit
- feat: add more latency metrics in forward (#1346) · 50b495f3
  OlivierDehaene authored Dec 14, 2023
  
  50b495f3
20 Oct, 2023 1 commit

#1049 CI (#1178) · 5e28f44a

OlivierDehaene authored Oct 20, 2023



See #1049

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>

5e28f44a

24 Jul, 2023 1 commit
- feat: add cuda memory fraction (#659) · 73a4d65d
  OlivierDehaene authored Jul 24, 2023
```
Close #673
```
  73a4d65d
19 Jul, 2023 1 commit
- feat(server): auto max_batch_total_tokens for flash att models (#630) · fe80f536
  OlivierDehaene authored Jul 19, 2023
  
  fe80f536
30 Jun, 2023 1 commit
- feat(server): add paged attention to flash models (#516) · e74bd41e
  OlivierDehaene authored Jun 30, 2023
```
Closes #478
```
  e74bd41e
24 May, 2023 1 commit
- feat: decrease IPC proto size (#367) · 218c9ada
  OlivierDehaene authored May 24, 2023
```
Closes #307 #308
```
  218c9ada
10 May, 2023 1 commit
- feat(server): shard token decode (#303) · 68e9d6ab
  OlivierDehaene authored May 10, 2023
  
  68e9d6ab
27 Apr, 2023 1 commit
- feat(server): add watermarking tests (#248) · f092ba9b
  Ehsan M. Kermani authored Apr 27, 2023
  
  f092ba9b
26 Apr, 2023 1 commit

feat(router): new healthcheck that skips the queue (#244) · db2b4e07

Nicolas Patry authored Apr 26, 2023


Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

db2b4e07

24 Apr, 2023 1 commit
- feat(router): use number of tokens in batch as input for dynamic batching (#226) · ebc74d56
  OlivierDehaene authored Apr 24, 2023
```
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
```
  ebc74d56
21 Apr, 2023 1 commit
- feat(router): add device and dtype info (#215) · 343437c7
  OlivierDehaene authored Apr 21, 2023
  
  343437c7
09 Apr, 2023 1 commit
- fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162) · 5cddc055
  OlivierDehaene authored Apr 09, 2023
  
  5cddc055
28 Mar, 2023 1 commit
- feat(server): clear cache on error (#143) · f0000689
  OlivierDehaene authored Mar 28, 2023
  
  f0000689
13 Feb, 2023 1 commit
- feat: add distributed tracing (#62) · 9af45414
  OlivierDehaene authored Feb 13, 2023
  
  9af45414
31 Jan, 2023 3 commits

feat: Add token streaming using ServerSideEvents support (#41) · 017a2a8c
OlivierDehaene authored Jan 31, 2023

017a2a8c
Revert "feat: Add token streaming using ServerSideEvents support" (#40) · 4f9ac67c
OlivierDehaene authored Jan 31, 2023
```
Reverts huggingface/text-generation-inference#36
```
4f9ac67c

feat: Add token streaming using ServerSideEvents support (#36) · 7fbfbb0d

OlivierDehaene authored Jan 31, 2023

Add token streaming using Server-Sent Events (SSE).

The signature of the SSE events is:

```rust
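// Summary of a generation.
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

// Payload of a streamed token event. `Token` is defined elsewhere.
struct StreamResponse {
    token: Token,
    generated_text: Option<String>,
    details: Option<Details>,
}

// Payload of an error event.
struct ErrorResponse {
    error: String,
}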
```
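
For readers unfamiliar with the wire format, the sketch below shows how
a payload is framed as one SSE message under the standard
`text/event-stream` rules: each payload line gets a `data: ` prefix and
a blank line terminates the message. It assumes the structs above are
serialized to JSON before framing, which this commit message does not
state. The function name `sse_frame` and the optional event name are
hypothetical.

```rust
/// Frames `payload` as a single Server-Sent Events message.
///
/// `event` is an optional event name. `payload` is assumed to be an
/// already serialized body, for example a JSON string.
pub fn sse_frame(event: Option<&str>, payload: &str) -> String {
    let mut frame = String::new();
    if let Some(name) = event {
        frame.push_str("event: ");
        frame.push_str(name);
        frame.push('\n');
    }
    // Every line of the payload needs its own `data:` field.
    for line in payload.lines() {
        frame.push_str("data: ");
        frame.push_str(line);
        frame.push('\n');
    }
    // An empty line tells the client to dispatch the message.
    frame.push('\n');
    frame
}
```

For example, calling `sse_frame(None, "{\"error\":\"boom\"}")` produces
a `data: {"error":"boom"}` line followed by an empty line.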

7fbfbb0d

27 Oct, 2022 1 commit
- feat(server): Support bitsandbytes · 09674e6d
  OlivierDehaene authored Oct 27, 2022
  
  09674e6d
22 Oct, 2022 1 commit
- feat(client): Simplify sharded logic · beb55212
  OlivierDehaene authored Oct 22, 2022
  
  beb55212
20 Oct, 2022 1 commit
- v0.1.0 · f16f2f5a
  Olivier Dehaene authored Oct 18, 2022
  
  f16f2f5a
17 Oct, 2022 1 commit
- feat: Improve error handling · 5e5d8766
  Olivier Dehaene authored Oct 17, 2022
  
  5e5d8766
11 Oct, 2022 1 commit
- Refactored gRPC interface · 4c693e65
  Olivier Dehaene authored Oct 11, 2022
```
Added validation logic
```
  4c693e65
08 Oct, 2022 1 commit
- Init · 295831a4
  Olivier Dehaene authored Oct 08, 2022
  
  295831a4