- 16 Aug, 2024 3 commits
-
-
Nicolas Patry authored
* Further fixes.
* Update the conftest to allow NaN (first logprob); see the sketch below.
* Fix the condition.
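The conftest change itself is not reproduced here; as a hedged illustration only, a comparison helper along these lines could accept a NaN first logprob while still checking the rest (names and tolerances below are hypothetical):

```python
import math

def logprobs_close(expected, actual, rel_tol=1e-3):
    """Hypothetical helper: compare two logprob sequences, allowing the
    first entry to be NaN (the first token has no meaningful logprob)."""
    assert len(expected) == len(actual)
    for i, (e, a) in enumerate(zip(expected, actual)):
        if i == 0 and (math.isnan(e) or math.isnan(a)):
            continue  # NaN is only tolerated for the first logprob
        assert math.isclose(e, a, rel_tol=rel_tol), f"logprob mismatch at index {i}"
```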
-
Vaibhav Srivastav authored
* Improve the Consuming TGI docs.
* Fix erroneous update to .
* Add info about the OpenAI client (see the sketch after this entry).
* More updates.
* Apply suggestions from code review. Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
* Suggestions from Lucain.
* Update Gradio snippet.
* Up.
* Apply suggestions from code review. Co-authored-by: Lucain <lucainp@gmail.com>
* Update docs/source/basic_tutorials/consuming_tgi.md. Co-authored-by: Lucain <lucainp@gmail.com>
* Up.
* Apply suggestions from code review. Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Up.
* Up.
* Doc review from Nico.
* Doc review from Nico. x2
* Last nit

---------

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
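For the OpenAI-client item above, a rough sketch of the kind of usage the consuming-TGI docs describe (the endpoint URL and model name are placeholders, not taken from the docs):

```python
# TGI exposes an OpenAI-compatible chat completions endpoint, so the
# openai Python client can be pointed at a local TGI server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # placeholder TGI address
    api_key="-",                          # placeholder; no key configured
)

response = client.chat.completions.create(
    model="tgi",  # placeholder; TGI serves a single model
    messages=[{"role": "user", "content": "What is deep learning?"}],
)
print(response.choices[0].message.content)
```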
-
Daniël de Kok authored
Try to reduce the number of router/launcher rebuilds by filtering sources. In this way, recompiles should only be triggered by changes in Cargo or Rust files.
-
- 15 Aug, 2024 3 commits
-
-
Nicolas Patry authored
-
Nicolas Patry authored
* Fixing exl2 and other quantize tests again.
* Mark exl2 as non-release (so CI tests them; needs to be removed later).
* Fixing exl2 (by disabling cuda graphs).
* Fix quantization defaults without cuda graphs on exl2 (linked to new issues with it).
* Removing serde override.
* Go back to released exl2 and remove log.
* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
-
Daniël de Kok authored
-
- 14 Aug, 2024 3 commits
-
-
Funtowicz Morgan authored
* (backend) use parking_lot crate for RwLock fairness
* (docker) let's put rust in the TRTLLM folder when building
* (docker) build ompi with SLURM support
* (launcher) default new server::run parameters to false for now
* (chore) fmt ... why?
-
Nicolas Patry authored
* Upgrading exl2.
* Fixing the other pathways.
* Fix idefics.
-
Daniël de Kok authored
This is less incremental than crate2nix, but it does build all dependencies separately, so it avoids full rebuilds.
-
- 13 Aug, 2024 4 commits
-
-
drbh authored
fix: add causal to the attention params checked when using flash attn v1
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
-
Daniël de Kok authored
-
- 12 Aug, 2024 13 commits
-
-
drbh authored
-
drbh authored
* fix(router): Fix appending to message content
* feat: add message and chat template test

---------

Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
-
Nicolas Patry authored
-
drbh authored
* fix: improve completions to send a final chunk with usage details (see the sketch below)
* fix: include finish reason string
* fix: remove dev debug trait and unneeded mut
* fix: update openapi schema
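As a hedged illustration of what such a final streamed chunk can carry (field names follow the OpenAI-style completions schema; the values are made up, not taken from TGI's code):

```python
# Hypothetical shape of the last streaming chunk: a finish reason plus
# token accounting, so clients get usage details without a second request.
final_chunk = {
    "id": "cmpl-placeholder",
    "object": "text_completion",
    "choices": [
        {"index": 0, "text": "", "finish_reason": "length"},
    ],
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 64,
        "total_tokens": 76,
    },
}
```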
-
drbh authored
* fix: allocate tmp based on sgmv kernel if available
* fix: re-add copy build artifacts step for punica kernels
-
drbh authored
* feat: validate template variables before apply and improve sliding window check (see the sketch below)
* fix: improve missing template var test
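A minimal sketch of validating chat-template variables before rendering, assuming a jinja2-style template; the helper name and error handling are illustrative, not TGI's actual code:

```python
from jinja2 import meta
from jinja2.sandbox import ImmutableSandboxedEnvironment

def check_template_variables(template_source: str, provided: dict) -> None:
    """Illustrative pre-flight check: fail early if the chat template
    references variables that the request did not provide."""
    env = ImmutableSandboxedEnvironment()
    undeclared = meta.find_undeclared_variables(env.parse(template_source))
    missing = undeclared - provided.keys()
    if missing:
        raise ValueError(f"Missing template variables: {sorted(missing)}")
```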
-
Nicolas Patry authored
Co-authored-by: Daniël de Kok <me@danieldk.eu>
-
Daniël de Kok authored
This change adds support for prefix caching to the v3 router. It is split off from the backend support to ease reviewing. For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`; in this case, the router switches to `RadixAllocator`. This allocator uses a radix trie to keep track of prefills that were seen previously. If a new prefill is a prefix of a previously-seen prefill, the router sends a request with `prefix_len>0`, which the backend can use to reuse KV blocks from the cache rather than recomputing them. Even though backend support is not added in this PR, the backend will still work with prefix caching enabled; the prefix lengths are simply ignored and not used.
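The `RadixAllocator` itself is Rust and is not shown here; as a toy Python sketch of the underlying idea, a trie over token ids can answer the longest previously-seen prefix of a new prefill, which is what `prefix_len` reports:

```python
class PrefixTrie:
    """Toy prefix cache over token-id sequences (illustrative only; the real
    allocator is a Rust radix trie that also tracks KV blocks)."""

    def __init__(self):
        self.children = {}

    def insert(self, tokens):
        node = self
        for t in tokens:
            node = node.children.setdefault(t, PrefixTrie())

    def longest_prefix_len(self, tokens):
        node, length = self, 0
        for t in tokens:
            node = node.children.get(t)
            if node is None:
                break
            length += 1
        return length

cache = PrefixTrie()
cache.insert([1, 2, 3, 4, 5])                  # a previously-seen prefill
print(cache.longest_prefix_len([1, 2, 3, 9]))  # -> 3, sent as prefix_len
```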
-
Wang, Yi authored
add intel-cpu docker image

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Nicolas Patry authored
* Upgrade fbgemm
* Fix fbgemm version
-
Daniël de Kok authored
-
- 09 Aug, 2024 10 commits
-
-
Daniël de Kok authored
-
drbh authored
* feat: add guideline to chat request and template
* fix: add template test and update docs
-
Nicolas Patry authored
* Using an enum for flash backends (paged/flashdecoding/flashinfer); see the sketch below.
* Early exit on server too.
* Clippy.
* Fix clippy and fmt.
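A hedged sketch of the backend-enum idea; the class, variant, and selection logic below are illustrative (only `FLASH_INFER=1` is mentioned elsewhere in this log), not the server's actual symbols:

```python
import os
from enum import Enum

class AttentionBackend(Enum):
    """Illustrative enum mirroring the paged / flashdecoding / flashinfer choice."""
    PAGED = "paged"
    FLASHDECODING = "flashdecoding"
    FLASHINFER = "flashinfer"

def select_backend() -> AttentionBackend:
    # Hypothetical selection logic: opt into FlashInfer via FLASH_INFER=1,
    # otherwise fall back to paged attention.
    if os.getenv("FLASH_INFER") == "1":
        return AttentionBackend.FLASHINFER
    return AttentionBackend.PAGED
```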
-
Daniël de Kok authored
-
Vaibhav Srivastav authored
* Minor doc fixes
* Up.
* Other minor updates.
-
Daniël de Kok authored
-
Nicolas Patry authored
-
Daniël de Kok authored
Add flake.nix
-
Daniël de Kok authored
This change adds support for FlashInfer. FlashInfer can be enabled using `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`. Since this functionality is currently only for testing, FlashInfer is not installed anywhere yet.

The FlashInfer API is quite different from FlashAttention/vLLM in that it requires more global bookkeeping:

* A wrapper class needs to be constructed (which we just call *state*). Since this is fairly expensive (due to pinned host memory allocation), we only do this once in a FlashCausalLM instance or for each CUDA Graph size.
* Each model forward call needs to be wrapped in `begin_forward` and `end_forward`. This sets up data structures that can be reused for all calls to attention for that forward call.

When calling attention, we need access to the state object. To avoid passing an argument down the call chain (which would require changes to all models), we use a context variable. Each model forward call is wrapped using a context manager that does all the bookkeeping for such a call:

* Set the context variable to the forward call's state.
* Call `begin_forward` on the state.
* Yield.
* Call `end_forward` on the state.
* Reset the context variable.

We cannot use a single shared global variable for this, since e.g. CUDA Graphs of different sizes each have their own state.
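A minimal Python sketch of that context-manager pattern, using `contextvars`; the names are illustrative rather than the actual TGI symbols:

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Holds the FlashInfer-style state for the current forward call.
_forward_state: ContextVar = ContextVar("forward_state", default=None)

@contextmanager
def use_forward_state(state):
    """Wrap one model forward call: publish the state and run the hooks."""
    token = _forward_state.set(state)   # set the context variable
    state.begin_forward()               # prepare attention data structures
    try:
        yield
    finally:
        state.end_forward()             # tear down after the forward call
        _forward_state.reset(token)     # reset the context variable

def current_forward_state():
    """Called from attention code instead of threading the state through."""
    return _forward_state.get()
```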
-
drbh authored
* Fix unsigned integer underflow

Passing --max-batch-size to the launcher actually had no effect, because after a few requests the max_size passed to State::next_batch would underflow, becoming a large positive number. In the scheduler, as soon as the cached batch size reaches max_batch_size, the max_size passed to next_batch becomes 0. Since the only check in that function is

```
if Some(batch_requests.len()) == max_size {
    break;
}
```

and it is called after `batch_requests.len()` has become 1, it does nothing to prevent more than 0 requests from being batched. We then end up with a cached batch in the server that is larger than max_batch_size, and `max_size - batch_size as usize` underflows.

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

* fix: update v3 scheduler and ensure max_batch_size > 0 (see the clamping sketch below)

---------

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
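Purely as an illustration of the clamping that avoids this underflow (the real fix is in the Rust scheduler; this Python analogue only shows the saturating-subtraction idea):

```python
def remaining_batch_slots(max_batch_size, cached_batch_size):
    """Illustrative only: how many new requests may still join the batch.

    With unsigned arithmetic, max_batch_size - cached_batch_size wraps to a
    huge value once the cached batch exceeds the limit; clamping at zero
    (Rust's saturating_sub) keeps the scheduler from over-admitting requests.
    """
    if max_batch_size is None:
        return None  # no limit configured
    return max(max_batch_size - cached_batch_size, 0)
```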
-
- 08 Aug, 2024 4 commits
-
-
Vaibhav Srivastav authored
* Update Quantization docs and minor doc fix.
* Update readme with latest quants info.
* Apply suggestions from code review. Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Up.

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
-
drbh authored
-
drbh authored
* hotfix: fix xpu crash caused by the code refactor; torch.xpu relies on importing ipex. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Re-enable gemma2 in xpu. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Fix regression in ipex flashattention. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-