- 28 Aug, 2024 1 commit
-
-
drbh authored
-
- 27 Aug, 2024 3 commits
-
-
drbh authored
* fix: support tojson and avoid message indexing issue in template
* fix: prefer minijinja native methods and prefer workspace level dependency
* fix: adjust comment typo
-
Nicolas Patry authored
-
drbh authored
* fix[router]: Fix tools not passed in chat template

  Signed-off-by: GitHub <noreply@github.com>
* feat: improve default tool serialization and lints
* feat: refactor tool logic to include notify_error in prompt and adjust typing
* fix: adjust non tool template apply
* fix: simplify tool grammar logic and improve schema
* feat: avoid skip tool test and avoid empty tool prompts
* fix: increase test client timeout for grammar compilation tests

---------

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
-
- 26 Aug, 2024 1 commit
-
-
drbh authored
* Fix: don't apply post layernorm in SiglipVisionTransformer

  This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see original transformers implementation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813).

  This also makes Siglip consistent with the existing Clip implementation: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613
* fix: adjust pali gemma for post layer norm and small refactors

---------

Co-authored-by: Travis Addair <tgaddair@gmail.com>
-
- 23 Aug, 2024 1 commit
-
-
Daniël de Kok authored
The default package wraps the launcher and puts the server/router in the path. As a result, TGI can be started using something like:

```
nix run .# -- \
  --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --port 8080
```
-
- 21 Aug, 2024 3 commits
-
-
Daniël de Kok authored
-
Nicolas Patry authored
-
Daniël de Kok authored
nix: add text-generation-benchmark to pure devshell
-
- 20 Aug, 2024 2 commits
-
-
Daniël de Kok authored
* nix: pure server and support both pure and impure devShells
* nix: remove unused poetry2nix input

  It is not wired up and we now have a pure server.
* nix: add ipdb to impure devshell
-
Nicolas Patry authored
* Prefix caching WIP
* Fixing prefix attention.
* Fixing flashinfer import.
* Fixing black.
* Fixing medusa (still wrong outputs, but functional).
* Just medusa values now.
* Fixing medusa without prefix caching.
* Fixing prefix caching.
* Medusa requires reshaping.
* Removing the logs.
* Remove router.nix
* Fixup:
  - Remove logs
  - Disable VLMs (they do not work)
  - Disable prefix caching when user wants prefill logprobs.
* Update flake.lock

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
-
- 19 Aug, 2024 1 commit
-
-
Daniël de Kok authored
* Update to CUDA 12.4
* poetry2nix: follow tgi-nix nixpkgs
-
- 16 Aug, 2024 6 commits
-
-
Nicolas Patry authored
* All integration tests back everywhere (too many failed CI).
* Upgrade integration tests after 12.4
* Attempt to remove the specified compute cap.
* Common arch list.
* Punica uses raw ASM which is not valid on 9.0 apparently.
-
Hugo Larcher authored
* doc: Add metrics documentation and add a 'Reference' section
* doc: Add API reference
* doc: Refactor API reference
* fix: Message API link
* Bad rebase
* Moving the docs.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Nicolas Patry authored
-
Nicolas Patry authored
* Further fixes.
* Update the conftest to allow NaN (first logprob).
* Fix the condition.
-
Vaibhav Srivastav authored
* Improve the Consuming TGI docs.
* Fix erroneous update to .
* add info about OpenAI client.
* More updates.
* Apply suggestions from code review

  Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
* Suggestions from Lucain.
* Update Gradio snippet.
* Up.
* Apply suggestions from code review

  Co-authored-by: Lucain <lucainp@gmail.com>
* Update docs/source/basic_tutorials/consuming_tgi.md

  Co-authored-by: Lucain <lucainp@gmail.com>
* Up.
* Apply suggestions from code review

  Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Up.
* Up.
* Doc review from Nico.
* Doc review from Nico. x2
* Last nit

---------

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
-
Daniël de Kok authored
Try to reduce the number of router/launcher rebuilds by filtering sources. In this way, recompiles should only be triggered by changes in Cargo or Rust files.
-
- 15 Aug, 2024 3 commits
-
-
Nicolas Patry authored
-
Nicolas Patry authored
* Fixing exl2 and other quantize tests again.
* Mark exl2 as non release (so CI tests them, needs to be removed later).
* Fixing exl2 (by disabling cuda graphs)
* Fix quantization defaults without cuda graphs on exl2 (linked to new issues with it).
* Removing serde override.
* Go back to released exl2 and remove log.
* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
-
Daniël de Kok authored
-
- 14 Aug, 2024 3 commits
-
-
Funtowicz Morgan authored
* (backend) use parking_lot crate for RwLock fairness
* (docker) let's put rust in the TRTLLM folder when building
* (docker) build ompi with SLURM support
* (launcher) default new server::run parameters to false for now
* (chore) fmt ... why?
-
Nicolas Patry authored
* Upgrading exl2.
* Fixing the other pathways.
* Fix idefics.
-
Daniël de Kok authored
This is less incremental than crate2nix, but does build all dependencies separately, so avoids full rebuilds.
-
- 13 Aug, 2024 4 commits
-
-
drbh authored
fix: adds causal to attention params to check when using flash attn v1
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
-
Daniël de Kok authored
-
- 12 Aug, 2024 12 commits
-
-
drbh authored
-
drbh authored
* fix(router): Fix appending to message content
* feat: add message and chat template test

---------

Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
-
Nicolas Patry authored
-
drbh authored
* fix: improve completions to send a final chunk with usage details
* fix: include finish reason string
* fix: remove dev debug trait and unneeded mut
* fix: update openapi schema
-
drbh authored
* fix: allocate tmp based on sgmv kernel if available
* fix: re add copy build artifacts step for punica kernels
-
drbh authored
* feat: validate template variables before apply and improve sliding window check
* fix: improve missing template var test
-
Nicolas Patry authored
Co-authored-by: Daniël de Kok <me@danieldk.eu>
-
Daniël de Kok authored
This change adds support for prefix caching to the v3 router. It is split out from the backend support to ease reviewing.

For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`. In this case, the router will switch to `RadixAllocator`. This allocator uses a radix trie to keep track of prefills that were seen previously. If a new prefill is a prefix of a previously seen prefill, the router will send a request with `prefix_len>0`, which the backend can use to decide to reuse KV blocks from the cache rather than recomputing them.

Even though backend support is not added in this PR, the backend will still work with prefix caching enabled: the prefix lengths are simply ignored.
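
The lookup idea behind a prefix-tracking allocator can be illustrated with a small sketch. This is not the router's `RadixAllocator`: it uses a plain per-token trie instead of a compressed radix trie, is written in Python rather than Rust, and does not model KV block allocation or eviction. The names `PrefixTrie`, `longest_prefix_len`, and `schedule` are hypothetical, and reporting the longest run of shared leading tokens is an assumption about the matching rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrieNode:
    # Children keyed by token id. A radix trie would merge chains of
    # single-child nodes into one edge; this sketch keeps one node per token.
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class PrefixTrie:
    """Hypothetical, simplified stand-in for a prefix-tracking allocator."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: List[int]) -> None:
        # Record a prefill so that later requests can match against it.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def longest_prefix_len(self, tokens: List[int]) -> int:
        # Walk the trie until the first token that was never seen
        # at this position; the depth reached is the shared prefix length.
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


def schedule(trie: PrefixTrie, tokens: List[int]) -> int:
    # Compute the prefix length that would accompany the request,
    # then remember this prefill for future requests.
    prefix_len = trie.longest_prefix_len(tokens)
    trie.insert(tokens)
    return prefix_len


if __name__ == "__main__":
    trie = PrefixTrie()
    print(schedule(trie, [1, 2, 3, 4]))  # nothing cached yet
    print(schedule(trie, [1, 2, 3, 9]))  # shares the first three tokens
```

In this sketch a backend that ignores the returned length behaves exactly as it would without caching, which matches the note above that prefix lengths can be safely ignored.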
-
Wang, Yi authored
add intel-cpu docker image

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Nicolas Patry authored
* Upgrade fbgemm
* Fix fbgemm version
-