- 16 Aug, 2024 3 commits
-
-
Nicolas Patry authored
* Further fixes.
* Update the conftest to allow NaN (first logprob); see the sketch below.
* Fix the condition.
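The conftest change itself is not reproduced here; as a hedged illustration only, a comparison helper along these lines could accept a NaN first logprob while still checking the rest (names and tolerances below are hypothetical):

```python
import math

def logprobs_close(expected, actual, rel_tol=1e-3):
    """Hypothetical helper: compare two logprob sequences, allowing the
    first entry to be NaN (the first token has no meaningful logprob)."""
    assert len(expected) == len(actual)
    for i, (e, a) in enumerate(zip(expected, actual)):
        if i == 0 and (math.isnan(e) or math.isnan(a)):
            continue  # NaN is only tolerated for the first logprob
        assert math.isclose(e, a, rel_tol=rel_tol), f"logprob mismatch at index {i}"
```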
-
Vaibhav Srivastav authored
* Improve the Consuming TGI docs.
* Fix erroneous update to .
* Add info about the OpenAI client (see the sketch after this entry).
* More updates.
* Apply suggestions from code review. Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
* Suggestions from Lucain.
* Update Gradio snippet.
* Up.
* Apply suggestions from code review. Co-authored-by: Lucain <lucainp@gmail.com>
* Update docs/source/basic_tutorials/consuming_tgi.md. Co-authored-by: Lucain <lucainp@gmail.com>
* Up.
* Apply suggestions from code review. Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Up.
* Up.
* Doc review from Nico.
* Doc review from Nico. x2
* Last nit

---------

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
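For the OpenAI-client item above, a rough sketch of the kind of usage the consuming-TGI docs describe (the endpoint URL and model name are placeholders, not taken from the docs):

```python
# TGI exposes an OpenAI-compatible chat completions endpoint, so the
# openai Python client can be pointed at a local TGI server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # placeholder TGI address
    api_key="-",                          # placeholder; no key configured
)

response = client.chat.completions.create(
    model="tgi",  # placeholder; TGI serves a single model
    messages=[{"role": "user", "content": "What is deep learning?"}],
)
print(response.choices[0].message.content)
```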
-
Daniël de Kok authored
Try to reduce the number of router/launcher rebuilds by filtering sources. In this way, recompiles should only be triggered by changes in Cargo or Rust files.
-
- 15 Aug, 2024 3 commits
-
-
Nicolas Patry authored
-
Nicolas Patry authored
* Fixing exl2 and other quantize tests again.
* Mark exl2 as non-release (so CI tests them; needs to be removed later).
* Fixing exl2 (by disabling cuda graphs).
* Fix quantization defaults without cuda graphs on exl2 (linked to new issues with it).
* Removing serde override.
* Go back to released exl2 and remove log.
* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
-
Daniël de Kok authored
-
- 14 Aug, 2024 3 commits
-
-
Funtowicz Morgan authored
* (backend) use parking_lot crate for RwLock fairness
* (docker) let's put rust in the TRTLLM folder when building
* (docker) build ompi with SLURM support
* (launcher) default new server::run parameters to false for now
* (chore) fmt ... why?
-
Nicolas Patry authored
* Upgrading exl2.
* Fixing the other pathways.
* Fix idefics.
-
Daniël de Kok authored
This is less incremental than crate2nix, but it does build all dependencies separately, so it avoids full rebuilds.
-
- 13 Aug, 2024 4 commits
-
-
drbh authored
fix: add causal to the attention params checked when using flash attn v1
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
-
Daniël de Kok authored
-
- 12 Aug, 2024 13 commits
-
-
drbh authored
-
drbh authored
* fix(router): Fix appending to message content
* feat: add message and chat template test

---------

Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
-
Nicolas Patry authored
-
drbh authored
* fix: improve completions to send a final chunk with usage details (see the sketch below)
* fix: include finish reason string
* fix: remove dev debug trait and unneeded mut
* fix: update openapi schema
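As a hedged illustration of what such a final streamed chunk can carry (field names follow the OpenAI-style completions schema; the values are made up, not taken from TGI's code):

```python
# Hypothetical shape of the last streaming chunk: a finish reason plus
# token accounting, so clients get usage details without a second request.
final_chunk = {
    "id": "cmpl-placeholder",
    "object": "text_completion",
    "choices": [
        {"index": 0, "text": "", "finish_reason": "length"},
    ],
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 64,
        "total_tokens": 76,
    },
}
```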
-
drbh authored
* fix: allocate tmp based on sgmv kernel if available
* fix: re-add copy build artifacts step for punica kernels
-
drbh authored
* feat: validate template variables before apply and improve sliding window check (see the sketch below)
* fix: improve missing template var test
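A minimal sketch of validating chat-template variables before rendering, assuming a jinja2-style template; the helper name and error handling are illustrative, not TGI's actual code:

```python
from jinja2 import meta
from jinja2.sandbox import ImmutableSandboxedEnvironment

def check_template_variables(template_source: str, provided: dict) -> None:
    """Illustrative pre-flight check: fail early if the chat template
    references variables that the request did not provide."""
    env = ImmutableSandboxedEnvironment()
    undeclared = meta.find_undeclared_variables(env.parse(template_source))
    missing = undeclared - provided.keys()
    if missing:
        raise ValueError(f"Missing template variables: {sorted(missing)}")
```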
-
Nicolas Patry authored
Co-authored-by: Daniël de Kok <me@danieldk.eu>
-
Daniël de Kok authored
This change adds support for prefix caching to the v3 router. It is split off from the backend support to ease reviewing. For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`; in this case, the router switches to `RadixAllocator`. This allocator uses a radix trie to keep track of prefills that were seen previously. If a new prefill is a prefix of a previously-seen prefill, the router sends a request with `prefix_len>0`, which the backend can use to reuse KV blocks from the cache rather than recomputing them. Even though backend support is not added in this PR, the backend will still work with prefix caching enabled; the prefix lengths are simply ignored and not used.
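The `RadixAllocator` itself is Rust and is not shown here; as a toy Python sketch of the underlying idea, a trie over token ids can answer the longest previously-seen prefix of a new prefill, which is what `prefix_len` reports:

```python
class PrefixTrie:
    """Toy prefix cache over token-id sequences (illustrative only; the real
    allocator is a Rust radix trie that also tracks KV blocks)."""

    def __init__(self):
        self.children = {}

    def insert(self, tokens):
        node = self
        for t in tokens:
            node = node.children.setdefault(t, PrefixTrie())

    def longest_prefix_len(self, tokens):
        node, length = self, 0
        for t in tokens:
            node = node.children.get(t)
            if node is None:
                break
            length += 1
        return length

cache = PrefixTrie()
cache.insert([1, 2, 3, 4, 5])                  # a previously-seen prefill
print(cache.longest_prefix_len([1, 2, 3, 9]))  # -> 3, sent as prefix_len
```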
-
Wang, Yi authored
add intel-cpu docker image

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Nicolas Patry authored
* Upgrade fbgemm
* Fix fbgemm version
-
Daniël de Kok authored
-
- 09 Aug, 2024 10 commits
-
-
Daniël de Kok authored
-
drbh authored
* feat: add guideline to chat request and template
* fix: add template test and update docs
-
Nicolas Patry authored
* Using an enum for flash backends (paged/flashdecoding/flashinfer); see the sketch below.
* Early exit on server too.
* Clippy.
* Fix clippy and fmt.
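A hedged sketch of the backend-enum idea; the class, variant, and selection logic below are illustrative (only `FLASH_INFER=1` is mentioned elsewhere in this log), not the server's actual symbols:

```python
import os
from enum import Enum

class AttentionBackend(Enum):
    """Illustrative enum mirroring the paged / flashdecoding / flashinfer choice."""
    PAGED = "paged"
    FLASHDECODING = "flashdecoding"
    FLASHINFER = "flashinfer"

def select_backend() -> AttentionBackend:
    # Hypothetical selection logic: opt into FlashInfer via FLASH_INFER=1,
    # otherwise fall back to paged attention.
    if os.getenv("FLASH_INFER") == "1":
        return AttentionBackend.FLASHINFER
    return AttentionBackend.PAGED
```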
-
Daniël de Kok authored
-
Vaibhav Srivastav authored
* Minor doc fixes
* Up.
* Other minor updates.
-
Daniël de Kok authored
-
Nicolas Patry authored
-
Daniël de Kok authored
Add flake.nix
-
Daniël de Kok authored
This change adds support for FlashInfer. FlashInfer can be enabled using `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`. Since this functionality is currently only for testing, FlashInfer is not installed anywhere yet.

The FlashInfer API is quite different from FlashAttention/vLLM in that it requires more global bookkeeping:

* A wrapper class needs to be constructed (which we just call *state*). Since this is fairly expensive (due to pinned host memory allocation), we only do this once in a FlashCausalLM instance or for each CUDA Graph size.
* Each model forward call needs to be wrapped in `begin_forward` and `end_forward`. This sets up data structures that can be reused for all calls to attention for that forward call.

When calling attention, we need access to the state object. To avoid passing an argument down the call chain (which would require changes to all models), we use a context variable. Each model forward call is wrapped using a context manager that does all the bookkeeping for such a call:

* Set the context variable to the forward call's state.
* Call `begin_forward` on the state.
* Yield.
* Call `end_forward` on the state.
* Reset the context variable.

We cannot use a single shared global variable for this, since e.g. CUDA Graphs of different sizes each have their own state.
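A minimal Python sketch of that context-manager pattern, using `contextvars`; the names are illustrative rather than the actual TGI symbols:

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Holds the FlashInfer-style state for the current forward call.
_forward_state: ContextVar = ContextVar("forward_state", default=None)

@contextmanager
def use_forward_state(state):
    """Wrap one model forward call: publish the state and run the hooks."""
    token = _forward_state.set(state)   # set the context variable
    state.begin_forward()               # prepare attention data structures
    try:
        yield
    finally:
        state.end_forward()             # tear down after the forward call
        _forward_state.reset(token)     # reset the context variable

def current_forward_state():
    """Called from attention code instead of threading the state through."""
    return _forward_state.get()
```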
-
drbh authored
* Fix unsigned integer underflow

Passing --max-batch-size to the launcher actually had no effect, because after a few requests the max_size passed to State::next_batch would underflow, becoming a large positive number. In the scheduler, as soon as the cached batch size reaches max_batch_size, the max_size passed to next_batch becomes 0. Since the only check in that function is

```
if Some(batch_requests.len()) == max_size {
    break;
}
```

and it is called after `batch_requests.len()` has become 1, it does nothing to prevent more than 0 requests from being batched. We then end up with a cached batch in the server that is larger than max_batch_size, and `max_size - batch_size as usize` underflows.

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

* fix: update v3 scheduler and ensure max_batch_size > 0 (see the clamping sketch below)

---------

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
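Purely as an illustration of the clamping that avoids this underflow (the real fix is in the Rust scheduler; this Python analogue only shows the saturating-subtraction idea):

```python
def remaining_batch_slots(max_batch_size, cached_batch_size):
    """Illustrative only: how many new requests may still join the batch.

    With unsigned arithmetic, max_batch_size - cached_batch_size wraps to a
    huge value once the cached batch exceeds the limit; clamping at zero
    (Rust's saturating_sub) keeps the scheduler from over-admitting requests.
    """
    if max_batch_size is None:
        return None  # no limit configured
    return max(max_batch_size - cached_batch_size, 0)
```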
-
- 08 Aug, 2024 4 commits
-
-
Vaibhav Srivastav authored
* Update Quantization docs and minor doc fix.
* Update readme with latest quants info.
* Apply suggestions from code review. Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Up.

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
-
drbh authored
-
drbh authored
* hotfix: fix xpu crash caused by the code refactor; torch.xpu relies on importing ipex. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Re-enable gemma2 in xpu. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Fix regression in ipex flashattention. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-