- 08 Oct, 2024 4 commits
-
-
drbh authored
* Update ToolType input schema
* lint
* fix: run formatter
* fix: allow tool choice to be null

---------

Co-authored-by: Wauplin <lucainp@gmail.com>
-
Daniël de Kok authored
-
Daniël de Kok authored
* Add support for fused MoE Marlin for AWQ

  This uses the updated MoE Marlin kernels from vLLM.

* Add integration test for AWQ MoE
-
Nicolas Patry authored
* Upgrade minor rust version (Fixes rust build compilation cache)
* Black
-
- 07 Oct, 2024 2 commits
-
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Florian Zimmermeister authored
Update kv_cache.py
-
- 04 Oct, 2024 2 commits
-
-
Daniël de Kok authored
* Add basic FP8 KV cache support

  This change adds rudimentary FP8 KV cache support. The support is enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so uses this type for the KV cache. However support is still limited:

  * Only the `fp8_e5m2` type is supported.
  * The KV cache layout is the same as `float16`/`bfloat16` (HND).
  * The FP8 KV cache is only supported for FlashInfer.
  * Loading of scales is not yet supported.

* Fix Cargo.toml
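For readers unfamiliar with the `fp8_e5m2` type named above, the following is a minimal sketch of the number format only: 1 sign bit, 5 exponent bits, and 2 mantissa bits, which is the same layout as the high byte of an IEEE float16. The helper names `float_to_e5m2` and `e5m2_to_float` are hypothetical and are not part of the launcher or of FlashInfer; the encoding here truncates rather than rounds, which real kernels may not do.

```python
import struct


def float_to_e5m2(x: float) -> int:
    """Encode x as an E5M2 byte by truncating its float16 encoding.

    Hypothetical helper for illustration: this drops the low 8 mantissa
    bits without rounding. Values outside the float16 range raise
    OverflowError from struct.pack.
    """
    # "<e" is little-endian IEEE float16: byte 0 is the low byte,
    # byte 1 holds the sign, 5 exponent bits and top 2 mantissa bits.
    _low, high = struct.pack("<e", x)
    return high


def e5m2_to_float(b: int) -> float:
    """Decode an E5M2 byte by padding it back to a float16 bit pattern."""
    return struct.unpack("<e", bytes([0, b]))[0]


if __name__ == "__main__":
    for value in (1.0, 1.125, 1.25, -2.0):
        byte = float_to_e5m2(value)
        print(f"{value!r} -> 0x{byte:02X} -> {e5m2_to_float(byte)!r}")
```

With only 2 mantissa bits, values between representable steps collapse to a neighbour (here, toward zero), which is the precision trade-off an FP8 KV cache accepts in exchange for halving memory relative to `float16`.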
-
Daniël de Kok authored
-
- 03 Oct, 2024 2 commits
-
-
Nicolas Patry authored
* New release 2.3.1
* Update doc number
- 02 Oct, 2024 4 commits
-
-
drbh authored
* feat: unroll notify_error if no tool is chosen
* fix: expect simple message when no tool is selected
* fix: improve test to avoid notify_error
* fix: improve docs and indicate change in expected response
* fix: adjust linting in test file
-
drbh authored
allow revision for lora adapters from launcher

Co-authored-by: Sida <sida@kulamind.com>
Co-authored-by: teamclouday <teamclouday@gmail.com>
-
Nicolas Patry authored
* adding max_token_capacity_metric
* added tgi to name of metric
* Adding max capacity metric.
* Add description for the metrics

---------

Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
-
Nicolas Patry authored
* Working loading state.
* Preprocessing.
* Working state ? (Broke idefics1 temporarily).
* Cleaner condition.
* Fix idefics.
* Updating config, removing TODO
* Mllama
* Upgrade transformers 4.45
* Flashing mllama.
* Starting to get there.
* Working state.
* Integration tests for mllama (cutting to 10 tokens because there seems to be instability after, meaning size of the batch matters).
* Updating model link.
* Earlier assert.
* Fix vlm ?
* remove log.
* Force ignore all images but last.
* Default dtype bfloat16.
* Update integration test after switch to bf16.
* Remove dead code.
* Removed dead code.
* Upgrade the flake to latest transformers/tokenizers
* Move to hf tgi-nix
* Upgrade to 0.5.0
-
- 01 Oct, 2024 1 commit
-
-
Daniël de Kok authored
* nix: experimental support for building a Docker image

Run using something like:

```
docker run \
  --device nvidia.com/gpu=all \
  -it --rm -p 8080:80 \
  -v $PWD/data:/data \
  -v $PWD/tmp:/tmp \
  tgi-docker:latest \
  --model-id <model_id>
```

* Example of building the Docker image using Nix inside Docker
* Stream to make the builder image smaller

  This avoids storing a Docker image tarball in the image. Instead, stream the layers while doing `docker run`.

* Don't spam journalctl on Linux
* Other dockerfile.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
- 30 Sep, 2024 7 commits
-
-
Daniël de Kok authored
This change uses the updated Marlin MoE kernel from vLLM to support MoE with activation sorting and groups.
-
Daniël de Kok authored
-
drbh authored
* feat: support phi3.5 moe model loading
* fix: prefer llama base model and improve rotary logic
* feat: return reasonable generation and add integration test
* fix: run lint and update docs
* fix: rerun lint for openapi docs
* fix: prefer do_sample false unless temp is set by user, and update chat tests
* fix: small typo adjustments
* fix: consolidate long rope paths
* fix: revert greedy by default and test changes
* Vendor configuration so that we don't have to `trust_remote_code`
* Use SparseMoELayer
* Add support for dense MoE
* Some type annotations
* Add the usual model tests
* Ruff.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Daniël de Kok authored
This change adds support for MoE models that use GPTQ quantization. Currently only models with the following properties are supported:

- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
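The three restrictions in this commit message can be read as a simple predicate. The sketch below is an illustration only, not the repository's code: `QuantConfig`, `unsupported_reasons`, and the `world_size` parameter (standing in for the tensor-parallel degree) are hypothetical names chosen here, and the field names merely mirror the `desc_act` and `group_size` settings mentioned above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QuantConfig:
    """Hypothetical container for the settings named in the commit message."""

    method: str      # "gptq" or "awq"
    desc_act: bool   # activation-order quantization
    group_size: int  # -1 is the value the message exempts from the desc_act rule
    sym: bool        # True for symmetric quantization


def unsupported_reasons(cfg: QuantConfig, world_size: int) -> List[str]:
    """Return the restrictions from the commit message that cfg violates."""
    reasons: List[str] = []
    if cfg.method == "awq":
        reasons.append("AWQ is not supported")
    if not cfg.sym:
        reasons.append("asymmetric quantization is not supported")
    # desc_act is only a problem when sharding across more than one device,
    # and only when group_size is something other than -1.
    if cfg.desc_act and world_size > 1 and cfg.group_size != -1:
        reasons.append("desc_act with tensor parallelism requires group_size=-1")
    return reasons


if __name__ == "__main__":
    cfg = QuantConfig(method="gptq", desc_act=True, group_size=128, sym=True)
    print(unsupported_reasons(cfg, world_size=2))
```

Returning a list of reasons rather than a boolean makes it easy to report every violated restriction at once; an empty list means the configuration falls within what this commit describes as supported.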
-
Mohit Sharma authored
* style
* update torch
* fix issues
* fix clone
* revert mkl
* added custom PA
* style
* fix style
* style
* hide env var
* fix mixtral model
* add skinny kernel and merge fixes
* fixed style
* fix issue for sliding window models
* addressed review comments
* fix import
* improved error message
* updated default value
* remove import
* fix imports after rebase
* float16 dep
* improve dockerfile
* cleaned dockerfile
-
Ikram Ul Haq authored
-
Daniël de Kok authored
Remove compute capability lock

We are only calling the `get_cuda_capability` function once, so avoiding the cost of multiple calls is not really necessary yet.
-
- 28 Sep, 2024 1 commit
-
-
Daniël de Kok authored
-
- 27 Sep, 2024 1 commit
-
-
Daniël de Kok authored
* Improve support for GPUs with capability < 8

  - For models that cannot use flashinfer, use flash-attn v1 + paged attention on GPUs with a compute capability older than 8.
  - Disable prefix caching when using paged attention.
  - When using flash-attn v1, pass the key/value, rather than the cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment
* Move disabling prefix caching into the block of exceptions
* Capability as `usize`s
-
- 26 Sep, 2024 2 commits
-
-
Alvaro Bartolome authored
* Fix `cargo build --features google`
* Add `cargo test --features google`
-
Alvaro Bartolome authored
* Add LoRA adapters support for Gemma2
* Make `black` formatting happy
-
- 24 Sep, 2024 13 commits
-
-
Nicholas Broad authored
specify how to call local adapters
-
Nicolas Patry authored
* More tensor cores.
* Fixing the logic.
* Gemma is modified by this.
-
Nicolas Patry authored
* Cleanup Vertex + Chat
* logprobs defaults to false.
* Parameters are optional
* Fix docs.
* Changing back this logprobs default.
* Fixup doc.
* Let's debug that.
* Not unstable.
* Updating Cargo ?
* Wat?
* Dummy change.
* Trying some other install.
* Trying something.
* Revert everything.
* Update Cargo lock.
* Fixing the pre-commit after rebase.
-
Nicolas Patry authored
-
Aritra Roy Gosthipaty authored
* chore: adding note for private models in quicktour doc
* Update docs/source/quicktour.md

  Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Update docs/source/quicktour.md

  Co-authored-by: vb <vaibhavs10@gmail.com>

* Update docs/source/quicktour.md

  Co-authored-by: vb <vaibhavs10@gmail.com>

---------

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: vb <vaibhavs10@gmail.com>
-
Orhun Parmaksız authored
-
Orhun Parmaksız authored
-
Daniël de Kok authored
This replaces the custom layers in both models.
-
Daniël de Kok authored
* Add support for scalar FP8 weight scales
* Support LLM compressor FP8 checkpoints on H100

  On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype. However, we wouldn't pick up fp8 quantization for models quantized with LLM compressor. This change adds enough parsing to detect if models have FP8-quantized weights.

* Remove stray debug print
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Alvaro Bartolome authored
-
OlivierDehaene authored
* wip
* added v2
-
- 23 Sep, 2024 1 commit
-
-
Daniël de Kok authored
-