Commits · 3011639ff7a6db7e6aaa5506ff516b9df8bc443e · OpenDAS / text-generation-inference

03 Oct, 2024 1 commit
- Revert "Unroll notify error into generate response" (#2605) · 3011639f
  drbh authored Oct 03, 2024
```
Revert "Unroll notify error into generate response (#2597)"

This reverts commit d22b0c1f.
```
  3011639f
02 Oct, 2024 3 commits

Unroll notify error into generate response (#2597) · d22b0c1f

drbh authored Oct 02, 2024

* feat: unroll notify_error if no tool is chosen

* fix: expect simple message when no tool is selected

* fix: improve test to avoid notify_error

* fix: improve docs and indicate change in expected response

* fix: adjust linting in test file

d22b0c1f

Max token capacity metric (#2595) · 0204946d

Nicolas Patry authored Oct 02, 2024



* adding max_token_capacity_metric

* added tgi to name of metric

* Adding max capacity metric.

* Add description for the metrics

---------
Co-authored-by: Edwinhr716 <Edandres249@gmail.com>

0204946d

Mllama flash version (#2585) · d18ed5cf

Nicolas Patry authored Oct 02, 2024

* Working loading state.

* Preprocessing.

* Working state ? (Broke idefics1 temporarily).

* Cleaner condition.

* Fix idefics.

* Updating config, removing TODO

* Mllama

* Upgrade transformers 4.45

* Flashing mllama.

* Starting to get there.

* Working state.

* Integration tests for mllama (cutting to 10 tokens because there seems
to be instability after, meaning the size of the batch matters).

* Updating model link.

* Earlier assert.

* Fix vlm ?

* remove log.

* Force ignore all images but last.

* Default dtype bfloat16.

* Update integration test after switch to bf16.

* Remove dead code.

* Removed dead code.

* Upgrade the flake to latest transformers/tokenizers

* Move to hf tgi-nix

* Upgrade to 0.5.0

d18ed5cf

30 Sep, 2024 1 commit

feat: support phi3.5 moe (#2479) · 93a7042d

drbh authored Sep 30, 2024



* feat: support phi3.5 moe model loading

* fix: prefer llama base model and improve rotary logic

* feat: return reasonable generation and add integration test

* fix: run lint and update docs

* fix: rerun lint for openapi docs

* fix: prefer do_sample false unless temp is set by user, and update chat tests

* fix: small typo adjustments

* fix: consolidate long rope paths

* fix: revert greedy by default and test changes

* Vendor configuration so that we don't have to `trust_remote_code`

* Use SparseMoELayer

* Add support for dense MoE

* Some type annotations

* Add the usual model tests

* Ruff.

---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

93a7042d

27 Sep, 2024 1 commit

Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2

Daniël de Kok authored Sep 27, 2024

* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention for models with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s

5b6b74e2
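
The rule listed in #2575 can be written down as a small decision function. This is only a sketch of the bullets above, not the server's code: `AttentionPlan` and `plan_for_old_gpu` are hypothetical names, and the function returns `None` for every case the commit message does not describe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttentionPlan:
    attention: str            # attention backend to use
    flash_attn_version: int   # flash-attn major version
    prefix_caching: bool      # whether prefix caching stays enabled
    pass_kv_directly: bool    # pass key/value tensors instead of the cache


def plan_for_old_gpu(compute_major: int, can_use_flashinfer: bool) -> Optional[AttentionPlan]:
    """Return the override described in the commit, or None if it does not apply."""
    if can_use_flashinfer or compute_major >= 8:
        # Not covered by the commit message; left undecided in this sketch.
        return None
    # Compute capability < 8 without flashinfer: flash-attn v1 + paged attention,
    # no prefix caching, and key/value passed directly (v1 has no block tables).
    return AttentionPlan(
        attention="paged",
        flash_attn_version=1,
        prefix_caching=False,
        pass_kv_directly=True,
    )


old_gpu_plan = plan_for_old_gpu(compute_major=7, can_use_flashinfer=False)
new_gpu_plan = plan_for_old_gpu(compute_major=8, can_use_flashinfer=False)
```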

26 Sep, 2024 1 commit
- Fix build with `--features google` (#2566) · 0aa66d69
  Alvaro Bartolome authored Sep 26, 2024
```
* Fix `cargo build --features google`

* Add `cargo test --features google`
```
  0aa66d69
24 Sep, 2024 2 commits

Cleanup Vertex + Chat (#2553) · c032280b

Nicolas Patry authored Sep 24, 2024

* Cleanup Vertex + Chat

* logprobs defaults to false.

* Parameters are optional

* Fix docs.

* Changing back this logprobs default.

* Fixup doc.

* Let's debug that.

* Not unstable.

* Updating Cargo ?

* Wat?

* Dummy change.

* Trying some other install.

* Trying something.

* Revert everything.

* Update Cargo lock.

* Fixing the pre-commit after rebase.

c032280b

chore: Add old V2 backend (#2551) · 10e6f292
OlivierDehaene authored Sep 24, 2024
```
* wip

* added v2
```
10e6f292

19 Sep, 2024 1 commit

Stream options. (#2533) · f512021e

Nicolas Patry authored Sep 19, 2024

* Stream options.

* Fetch stuff from nix integration test for easier testing.

* Adding the assert.

* Only send the usage when asked for.

* Update the docs.

* Impure test because we need network.

* develop.

* Optional usage.

* Fixes.

* Workflow

f512021e
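
"Only send the usage when asked for" can be pictured with a small generator. The sketch assumes the OpenAI-style request field `stream_options` with an `include_usage` flag; `stream_chunks` and the chunk layout are simplified, hypothetical stand-ins rather than the router's actual types.

```python
from typing import Dict, Iterator, List, Optional


def stream_chunks(
    tokens: List[str],
    prompt_tokens: int,
    stream_options: Optional[Dict[str, bool]] = None,
) -> Iterator[dict]:
    """Yield one chunk per token, then a usage chunk only if it was requested."""
    for tok in tokens:
        yield {"choices": [{"delta": {"content": tok}}]}

    include_usage = bool(stream_options and stream_options.get("include_usage"))
    if include_usage:
        completion_tokens = len(tokens)
        # Final chunk carries no choices, only the token accounting.
        yield {
            "choices": [],
            "usage": {
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_tokens + completion_tokens,
            },
        }


with_usage = list(stream_chunks(["Hi", "!"], 5, {"include_usage": True}))
without_usage = list(stream_chunks(["Hi", "!"], 5))
```

Clients that do not send the option see exactly the same stream as before, which is presumably why the usage chunk is opt-in.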

17 Sep, 2024 1 commit
- fix: metrics unbounded memory (#2528) · 86984e32
  OlivierDehaene authored Sep 17, 2024
  
  86984e32
11 Sep, 2024 2 commits

Fix tokenization yi (#2507) · dae3bf1d

Nicolas Patry authored Sep 11, 2024

* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?

dae3bf1d

Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) · a4e3e8c6

Nicolas Patry authored Sep 11, 2024



* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really flashinfer version ?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet ?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------
Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>

a4e3e8c6

02 Sep, 2024 1 commit

fix: enable chat requests in vertex endpoint (#2481) · 47d7e344

drbh authored Sep 02, 2024

* fix: enable chat requests in vertex endpoint

* feat: avoid unwrap and pre allocate future vec

47d7e344

29 Aug, 2024 2 commits

feat: add /v1/models endpoint (#2433) · d5202c46

drbh authored Aug 29, 2024

* feat: add /v1/models endpoint

* feat: add /v1/models endpoint

* fix: remove unused type import

* fix: revert route typo

* fix: update docs with new endpoint

* fix: add to redocly ignore and lint

d5202c46

Lots of improvements (Still 2 allocators) (#2449) · e415b690

Nicolas Patry authored Aug 29, 2024



* Making prefix/flashinfer the default and testing the full release tests.

* Include flashinfer in the docker.

* Using prebuilt.

* Allowing window_left_size (dummy version).

* Disabling flashinfer/prefix caching on odd head_dim

* Disable prefix caching for lora.

* More specific codes.

* Update lock

* Updating integration tests with new values with FI/FD.

Remove paged as a default too, and using FD everywhere.

* Update cargo lock ?

* Upgrade to 1.80 because of bitstream...

* Everywhere 1.80

* Forgot last default place.

* Apply suggestions from code review
Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Updated flake lock

* Tmp

* Upgrade resolution system for less errors in resolution.

* Remove lambda for cleaner function.

* Handling debugger.

* Override the env in server tests.

* Is this enough to make it work ?

* This seems to be working.

* Downgrade some logs.

* Fixing the default for vlm.

* Don't enable prefix caching on VLM just yet.

* Change `add_special_tokens` in order to have the correct tokens for chat
input and not (since it's super important with the prefixing now)

* Fixing prefix caching for flashdecoding.

* Update all models.

* Fixed flashinfer version.

* add_special_tokens is internal only

* Fixing seqlen with the new vlms.

* Fixing the issue with `add_special_tokens` not being passed around.

* Fixing the test.

* Removing encoder_decoder (seq2seq).

* Update the chat test.

* Fixing the batching tokenization in flash causal lm.

* Truncating left for radix purposes.

* Oops this doesn't belong here.

* Put back default pure shell.

* Update server tests

- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room

* Only n_heads / process_group.size() are necessary.

* Revert the integration tests change (seems linked to head_size
modification).

* Adding error message when assert is violated.

* Fixing the free algorithm to handle times where the common prefix is
smaller.

* Apply suggestions from code review
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Update server/text_generation_server/layers/attention/common.py
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Fix disabling prefix caching - Fix windowing checks.

* Revert the Cohere tokenizer change (for now using a revision instead).

* Fmt.

---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

e415b690

27 Aug, 2024 2 commits

fix: bump minijinja version and add test for llama 3.1 tools (#2463) · 21187c27

drbh authored Aug 27, 2024

* fix: support tojson and avoid message indexing issue in template

* fix: prefer minijinja native methods and prefer workspace level dependency

* fix: adjust comment typo

21187c27

Pr 2451 ci branch (#2454) · cfa73b5c

drbh authored Aug 26, 2024



* fix[router]: Fix tools not passed in chat template
Signed-off-by: GitHub <noreply@github.com>

* feat: improve default tool serialization and lints

* feat: refactor tool logic to include notify_error in prompt and adjust typing

* fix: adjust non tool template apply

* fix: simplify tool grammar logic and improve schema

* feat: avoid skip tool test and avoid empty tool prompts

* fix: increase test client timeout for grammar compilation tests

---------
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>

cfa73b5c

16 Aug, 2024 1 commit

doc: Add metrics documentation and add a 'Reference' section (#2230) · 53729b74

Hugo Larcher authored Aug 16, 2024



* doc: Add metrics documentation and add a 'Reference' section

* doc: Add API reference

* doc: Refactor API reference

* fix: Message API link

* Bad rebase

* Moving the docs.

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

53729b74

12 Aug, 2024 5 commits

Pr 2395 ci run (#2406) · 9a7830bd

drbh authored Aug 12, 2024



* fix(router): Fix appending to message content

* feat: add message and chat template test

---------
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>

9a7830bd

fix: improve completions to send a final chunk with usage details (#2336) · 30395b09

drbh authored Aug 12, 2024

* fix: improve completions to send a final chunk with usage details

* fix: include finish reason string

* fix: remove dev debug trait and unneeded mut

* fix: update openapi schema

30395b09

feat: validate template variables before apply and improve sliding wi… (#2403) · 155f9c98

drbh authored Aug 12, 2024

* feat: validate template variables before apply and improve sliding window check

* fix: improve missing template var test

155f9c98

Keeping the benchmark somewhere (#2401) · 136bcc81
Nicolas Patry authored Aug 12, 2024
```
Co-authored-by: Daniël de Kok <me@danieldk.eu>
```
136bcc81

Add support for prefix caching to the v3 router (#2392) · 8deeaca4

Daniël de Kok authored Aug 12, 2024

This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.

For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`.
In this case, the router will switch to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen prior. If a new prefill shares a prefix with a previously-seen
prefill, the router will send a request with `prefix_len>0`, which
can be used by the backend to decide to reuse KV blocks from the
cache, rather than recomputing them.

Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are just ignored and not used.

8deeaca4
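
To make the prefix lookup in #2392 concrete, the sketch below records previously seen prefills in a plain token-level trie and reports how many leading tokens of a new prefill are already known. It is not the router's `RadixAllocator` (which also manages KV blocks and compresses edges); `PrefixTrie`, `insert`, and `longest_prefix_len` are hypothetical names used only for illustration.

```python
from typing import Dict, List


class PrefixTrie:
    """Token-level trie; a simplified stand-in for a radix trie."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTrie"] = {}

    def insert(self, tokens: List[int]) -> None:
        """Record a prefill so later requests can match against it."""
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixTrie())

    def longest_prefix_len(self, tokens: List[int]) -> int:
        """Count how many leading tokens were already seen in some prefill."""
        node = self
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


trie = PrefixTrie()
trie.insert([1, 2, 3, 4])                               # first request's prefill
prefix_len = trie.longest_prefix_len([1, 2, 3, 9, 9])   # second request
print(prefix_len)
```

A non-zero `prefix_len` is the signal a backend could use to reuse cached KV blocks for the shared tokens; a backend that ignores it simply recomputes everything.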

09 Aug, 2024 3 commits

feat: add guideline to chat request and template (#2391) · 0d06aed0
drbh authored Aug 09, 2024
```
* feat: add guideline to chat request and template

* fix: add template test and update docs
```
0d06aed0

Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385) · 7a48a847

Nicolas Patry authored Aug 09, 2024

* Using an enum for flash backends (paged/flashdecoding/flashinfer)

* Early exit on server too.

* Clippy.

* Fix clippy and fmt.

7a48a847

Pr 2352 ci branch (#2382) · 6d06473c

drbh authored Aug 09, 2024



* Fix unsigned integer underflow

Passing --max-batch-size to the launcher actually had no effect
because after a few requests the max_size passed to State::next_batch
would underflow, becoming a large positive number.

In the scheduler, as soon as the cached batch size reached the
max_batch_size the max_size passed to next_batch becomes 0.
Since the only check in that function is
```
if Some(batch_requests.len()) == max_size {
    break;
}
```
and it's called after the `batch_requests.len()` has
become 1, it doesn't do anything to prevent more than 0
requests from being batched.

Now we have a cached batch in the server that is larger than
max_batch_size and `max_size - batch_size as usize`
underflows.
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

* fix: update v3 scheduler and ensure max_batch_size > 0

---------
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>

6d06473c
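
The two effects described in #2382 can be reproduced in a few lines. Python integers never wrap, so the sketch emulates an unsigned 64-bit subtraction explicitly (the 64-bit width is an assumption), and `batch_len` is a hypothetical miniature of the quoted equality check, not the scheduler itself. Clamping at zero is shown as one common guard; the commit message does not say which guard the fix uses.

```python
USIZE_MAX = 2**64 - 1  # assumption: 64-bit usize


def wrapping_sub(a: int, b: int) -> int:
    """Emulate unsigned wrapping subtraction (Python ints never wrap on their own)."""
    return (a - b) & USIZE_MAX


def saturating_sub(a: int, b: int) -> int:
    """Clamp at zero instead of wrapping."""
    return max(a - b, 0)


def batch_len(queued: int, max_size: int) -> int:
    """Mirror the quoted check: compare for equality after each push."""
    batch = 0
    for _ in range(queued):
        batch += 1
        if batch == max_size:
            break
    return batch


max_batch_size, cached_batch_size = 4, 5
wrapped = wrapping_sub(max_batch_size, cached_batch_size)    # huge positive number
clamped = saturating_sub(max_batch_size, cached_batch_size)  # stays at zero
overfilled = batch_len(queued=3, max_size=0)                 # equality never matches
```

With `max_size` equal to 0 the equality test is first evaluated when the batch already holds one request, so it never fires; that is how the cached batch grows past the limit and the later subtraction wraps.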

08 Aug, 2024 1 commit

add gptj modeling in TGI #2366 (CI RUN) (#2372) · 21267f3c

drbh authored Aug 07, 2024



* add gptj modeling
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix: update docs for model addition

* fix: adjust syntax typo

* fix: adjust syntax typo again

---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>

21267f3c

06 Aug, 2024 3 commits
- feat: return the generated text when parsing fails (#2353) · 1768c00b
  drbh authored Aug 06, 2024
  
  1768c00b
- feat: prefer stop over eos_token to align with openai finish_reason (#2344) · f8a5b381
  drbh authored Aug 06, 2024
  
  f8a5b381
- feat: implement a templated endpoint for visibility into chat requests (#2333) · e11f5f1c
  drbh authored Aug 06, 2024
```
* feat: implement a templated endpoint for visibility into chat requests

* feat: improve to tokenize too

* fix: adjust return type

* feat: simplify prepare_chat_input logic and adjust start stop chars
```
  e11f5f1c
31 Jul, 2024 2 commits

refactor usage stats (#2339) · 7451041e

Erik Kaunismäki authored Jul 31, 2024



* refactor usage stats

* Update docs/source/usage_statistics.md
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* Update router/src/server.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* changes based on feedback

* run python3 update_doc.py

* fix pre-commit

* Update router/src/server.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* delete option around usage stats arg

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

7451041e

Rebase TRT-llm (#2331) · 2b19d671

Nicolas Patry authored Jul 31, 2024

* wip

wip

refacto

refacto

Initial setup for CXX binding to TRTLLM

Working FFI call for TGI and TRTLLM backend

Remove unused parameters and force tokenizer name to be set

Overall build TRTLLM and deps through CMake build system

Enable end to end CMake build

First version loading engines and making it ready for inference

Remembering to check how we can detect support for chunked context

Move to latest TensorRT-LLM version

Specify which default log level to use depending on CMake build type

make leader executor mode working

unconditionally call InitializeBackend on the FFI layer

bind to CUDA::nvml to retrieve compute capabilities at runtime

updated logic and comment to detect cuda compute capabilities

implement the Stream method to send new tokens through a callback

use spdlog release 1.14.1 moving forward

update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c

correctly tell cmake to build dependent tensorrt-llm required libraries

create cmake install target to put everything relevant in installation folder

add auth_token CLI argument to provide hf hub authentication token

allow converting huggingface::tokenizers error to TensorRtLlmBackendError

use correct include for spdlog

include guard to build example in cmakelists

working setup of the ffi layer

remove fmt import

use external fmt lib

end to end ffi flow working

make sure to track include/ffi.h to trigger rebuild from cargo

impl the rust backend which currently cannot move the actual computation in background thread

expose shutdown function at ffi layer

impl RwLock scenario for TensorRtLlmBackend

oops missing c++ backend definitions

compute the number of maximum new tokens for each request independently

make sure the context is not dropped in the middle of the async decoding.

remove unnecessary log

add all the necessary plumbing to return the generated content

update invalid doc in cpp file

correctly forward back the log probabilities

remove unneeded scope variable for now

refactor Stream impl for Generation to factorise code

expose the internal missing start/queue timestamp

forward tgi parameters rep/freq penalty

add some more validation about grammar not supported

define a shared struct to hold the result of a decoding step

expose information about potential error happening while decoding

remove logging

add logging in case of decoding error

make sure executor_worker is provided

add initial Dockerfile for TRTLLM backend

add some more information in CMakeLists.txt to correctly install executorWorker

add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper

simplify prebuilt trtllm libraries name definition

do the same name definition stuff for tensorrt_llm_executor_static

leverage pkg-config to probe libraries paths and reuse new install structure from cmake

fix bad copy/paste missing nvinfer linkage direction

align all the linker search dependency

add missing pkgconfig folder for MPI in Dockerfile

correctly setup linking search path for runtime layer

fix missing / before tgi lib path

adding missing ld_library_path for cuda stubs in Dockerfile

update tgi entrypoint

commenting out Python part for TensorRT installation

refactored docker image

move to TensorRT-LLM v0.11.0

make docker linter happy with same capitalization rule

fix typo

refactor the compute capabilities detection along with num gpus

update TensorRT-LLM to latest version

update TensorRT install script to latest

update build.rs to link to cuda 12.5

add missing dependent libraries for linking

clean up a bit

install to decoder_attention target

add some custom stuff for nccl linkage

fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time

use std::env::consts::ARCH

make sure variable live long enough...

look for cuda 12.5

add some more basic info in README.md

* Rebase.

* Fix autodocs.

* Let's try to enable trtllm backend.

* Ignore backends/v3 by default.

* Fixing client.

* Fix makefile + autodocs.

* Updating the schema thing + redocly.

* Fix trtllm lint.

* Adding pb files ?

* Remove cargo fmt temporarily.

* ?

* Tmp.

* Remove both check + clippy ?

* Backporting telemetry.

* Backporting 457fb0a1



* Remove PB from git.

* Fixing PB with default member backends/client

* update TensorRT-LLM to latest version

* provided None for api_key

* link against libtensorrt_llm and not libtensorrt-llm

---------
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>

2b19d671

29 Jul, 2024 2 commits

fix: reject grammars without properties (#2309) · f15e808d
drbh authored Jul 29, 2024

f15e808d

Run ci api key (#2315) · 583d37a2

Erik Kaunismäki authored Jul 29, 2024



* Add API_Key for Auth and conditionally add authorisation for non info/health endpoints.

* change name to info routes

* Fix comment

* convert strings to lowercase for case insensitive comparison

* convert header to string

* fixes and update docs

* update docs again

* revert wrong update

---------
Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>

583d37a2
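
The bullets of #2315 describe an API key check that skips info/health routes and compares strings case-insensitively. A minimal sketch of that behaviour follows; the route set, the `Bearer` header format, and the name `is_authorized` are assumptions for illustration, not the router's actual code.

```python
from typing import Mapping, Optional

# Assumption: which paths count as "info routes" is not spelled out in the commit.
INFO_ROUTES = {"/", "/health", "/info"}


def is_authorized(path: str, headers: Mapping[str, str], api_key: Optional[str]) -> bool:
    """Allow info routes and key-less servers; otherwise require a matching header."""
    if api_key is None or path in INFO_ROUTES:
        return True
    header = headers.get("authorization")  # assumes header names are lowercased
    if header is None:
        return False
    # Lowercase both sides, following the "case insensitive comparison" bullet.
    return header.lower() == f"bearer {api_key}".lower()


health_ok = is_authorized("/health", {}, "secret")
missing_header = is_authorized("/generate", {}, "secret")
good_header = is_authorized("/generate", {"authorization": "Bearer secret"}, "secret")
```

Lowercasing the whole header also makes the key itself case-insensitive; the sketch follows the bullet literally, and a stricter service might lowercase only the scheme.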

19 Jul, 2024 2 commits

fix: adjust default tool choice (#2244) · 68a9685f

drbh authored Jul 19, 2024

* fix: adjust default tool choice

* feat: improve tool choice syntax and response parsing/errors

* fix: remove dev tests

* feat: add ToolChoice to docs

68a9685f

usage stats and crash reports (#2220) · 4c19593a

Erik Kaunismäki authored Jul 19, 2024



* draft of usage stats

* fix wrong link

* launcher doesn't need sysinfo dep

* only tokenizer class instead of whole struct

* unused import

* fix clippy errors

* update openAPI doc

* cargo fmt

* fix error in passing flags to router

* try again to update docs

* run pre-commit locally

* Update router/src/main.rs
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* Update router/src/main.rs
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* on crash use anonymous error event

* delete json_output and ngrok

* more robust way of checking if is in container

* more robust nvidia smi

* parse xpu more robustly

* fix errors

* add nvidia-smi details in docs

* cargo fmt

* fix clippy

* should make docs check pass

* Update router/src/usage_stats.rs
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* error reason can't be in nested json

* cargo fmt

---------
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>

4c19593a

15 Jul, 2024 1 commit

fix custom cache dir (#2226) · 457fb0a1

Erik Kaunismäki authored Jul 15, 2024

* fix to not ignore HUGGINGFACE_HUB_CACHE in cache

* delete printlns

* delete newlines

* maybe fix trailing whitespace

457fb0a1

11 Jul, 2024 1 commit
- fix: append DONE message to chat stream (#2221) · d789de32
  drbh authored Jul 11, 2024
```
* fix: append DONE message to chat stream

* fix: update completions endpoint
```
  d789de32
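
For readers unfamiliar with the convention behind #2221: OpenAI-style streams are server-sent events that end with a literal `data: [DONE]` message. The sketch below shows that framing only; `to_sse` is a hypothetical helper, not the router's implementation.

```python
import json
from typing import Iterable, Iterator


def to_sse(chunks: Iterable[dict]) -> Iterator[str]:
    """Frame each chunk as a server-sent event, then append the DONE sentinel."""
    for chunk in chunks:
        yield f"data: {json.dumps(chunk)}\n\n"
    # Sentinel that tells clients the stream has finished.
    yield "data: [DONE]\n\n"


events = list(to_sse([{"choices": [{"delta": {"content": "Hi"}}]}]))
```
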
09 Jul, 2024 1 commit

Updating the self check (#2209) · 4c976fb4

Nicolas Patry authored Jul 09, 2024

* Updating the self check

* Fix.

* Revert the CLI.

* cli.

* Space.

* Revert cargo update.

4c976fb4