- 17 Oct, 2024 1 commit
-
-
Daniël de Kok authored
* Support `e4m3fn` KV cache * Make check more obvious
-
- 16 Oct, 2024 2 commits
-
-
OlivierDehaene authored
* wip * rollback * refactor to use prefix/postfix namming + fix all_input_ids_tensor * maybe patching vlms? * fix filter and concat * wip, no filter, no concat * current * add prepare_for_prefill * working * load tested * re-create slots * re-create slots * fix slot_filtering_indices * feedback loop * remove log * fix benchmarker * fix vlm and seq2seq * rename to cache and input lengths * fix prefill logprobs * fix launcher * fix logprobs? * idk at this point * max input length * omfg * remove debugging lines * fix tests * fix mllama * fix cargo tests * remove support chunking for paged * Fixing non blocked attentions * Fixing dtype + AMD, Ipex targets. * lint fix. * rename * Fix prefix_caching variable, remove defaults in server (confusing a lot of the times). * Add simple resolution when user specifies ATTENTION=paged. * Put back non default simple tests. * Fix env name --------- Co-authored-by:Nicolas Patry <patry.nicolas@protonmail.com>
-
Mohit Sharma authored
* (feat) fp8 fnuz support for rocm * (review comments) Fix compression_config load, type hints * (bug) update all has_tensor * (review_comments) fix typo and added comments * (nit) improved comment
-
- 15 Oct, 2024 3 commits
-
-
Alvaro Bartolome authored
As spotted by @philschmid, the payload was compliant with Vertex AI, but just partially, since ideally the most compliant version would be with the generation kwargs flattened to be on the same level as the `messages`; meaning that Vertex AI would still expect a list of instances, but each instance would be an OpenAI-compatible instance, which is more clear; and more aligned with the SageMaker integration too, so kudos to him for spotting that; and sorry from my end for any inconvenience @Narsil.
-
Daniël de Kok authored
-
Nicolas Patry authored
-
- 14 Oct, 2024 5 commits
-
-
Dmitry Rogozhkin authored
XPU backend is available natively (without IPEX) in pytorch starting from pytorch 2.4. This commit extends TGI to cover the case when user has XPU support thru pytorch 2.4, but does not have IPEX installed. Models which don't require attention can work. For attention required models more work is needed to provide attention implementation. Tested with the following models: * teknium/OpenHermes-2.5-Mistral-7B * bigscience/bloom-560m * google/gemma-7b * google/flan-t5-xxl Signed-off-by:Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
-
Wang, Yi authored
Signed-off-by:Wang, Yi A <yi.a.wang@intel.com>
-
Omar Sanseviero authored
Update quicktour.md
-
Nicolas Patry authored
* break when there's nothing to read Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * Different approach, only listen on stdin when `LOG_LEVEL=debug` (which is where dropping to a debugger is important). --------- Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> Co-authored-by:
Wang, Yi A <yi.a.wang@intel.com>
-
Omar Sanseviero authored
* Small improvements for docs * Update _toctree.yml * Updating the doc (we keep the list actually). --------- Co-authored-by:Nicolas Patry <patry.nicolas@protonmail.com>
-
- 11 Oct, 2024 1 commit
-
-
Nicolas Patry authored
-
- 10 Oct, 2024 3 commits
-
-
Nicolas Patry authored
* Intel CI ? * Let's try non sharded gemma. * Snapshot rename * Apparently container can be gone already.
-
vb authored
Update to most recent stable version of TGI.
-
drbh authored
* feat: process token stream before returning to client * fix: expect content in test * fix: improve comparison via ruff lint * fix: return event in all cases * fix: always send event on error, avoid unwraps, refactor and improve tests * fix: prefer no_tool over notify_error to improve reponse * fix: adjust chat input test for no_tool * fix: adjust test expected content --------- Co-authored-by:System administrator <root@ip-10-90-0-186.ec2.internal>
-
- 09 Oct, 2024 2 commits
-
-
Nicolas Patry authored
* Only run 1 valid test. * TRying the tailscale action quickly. * ? * bash spaces. * Remove tailscale. * More quotes. * mnt2 ? * Othername to avoid recursive directories. * Good old tmate. * Remove tmate. * Trying a few things. * Remove some stuff. * Sleep ? * Tmp * busybox * Launcher tgi * Starting hello * Busybox in python * No device. * Removing all variables ? * A un moment donné. * Tmp * Tmp2 * DEvice request, no container name * No device requests * Without pytest. * No pytest. * from env * Start with devices * Attemp #1 * Remove stdin messing * Only 1 test, no container name * Raw tgi * Sending args. * Show pip freeze. * Start downloading with token * Giving HIP devices. * Mount volume + port forward * Without pytest. * No token * Repeated arguments * Wrong kwarg. * On 2 GPUs * Fallback to single shard CI test. * Testing * yaml * Common cache ? * Trailing slash ? * Docker volume split. * Fix docker volume * Fixing ? * ? * Try no devices ? * Flash llama on intel CPU ? * Fix nvidia ? * Temp deactivate intel, activate nvidia ?
-
Daniël de Kok authored
To make sure that everything is formatted with the same black version as CI. I sometimes use isort for new files to get nicely ordered imports, so add it as well. Also set the isort configuration to format in a way that is compatible with black.
-
- 08 Oct, 2024 4 commits
-
-
drbh authored
* Update ToolType input schema * lint * fix: run formatter * fix: allow tool choide to be null --------- Co-authored-by:Wauplin <lucainp@gmail.com>
-
Daniël de Kok authored
-
Daniël de Kok authored
* Add support for fused MoE Marlin for AWQ This uses the updated MoE Marlin kernels from vLLM. * Add integration test for AWQ MoE
-
Nicolas Patry authored
* Upgrade minor rust version (Fixes rust build compilation cache) * Black
-
- 07 Oct, 2024 2 commits
-
-
Wang, Yi authored
Signed-off-by:Wang, Yi A <yi.a.wang@intel.com>
-
Florian Zimmermeister authored
Update kv_cache.py
-
- 04 Oct, 2024 2 commits
-
-
Daniël de Kok authored
* Add basic FP8 KV cache support This change adds rudimentary FP8 KV cache support. The support is enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so uses this type for the KV cache. However support is still limited: * Only the `fp8_e5m2` type is supported. * The KV cache layout is the same as `float16`/`bfloat16` (HND). * The FP8 KV cache is only supported for FlashInfer. * Loading of scales is not yet supported. * Fix Cargo.toml
-
Daniël de Kok authored
-
- 03 Oct, 2024 2 commits
-
-
Nicolas Patry authored
* New release 2.3.1 * Update doc number
- 02 Oct, 2024 4 commits
-
-
drbh authored
* feat: unroll notify_error if no tool is choosen * fix: expect simple message when no tool is selected * fix: improve test to avoid notify_error * fix: improve docs and indicate change in expected response * fix: adjust linting in test file
-
drbh authored
allow revision for lora adapters from launcher Co-authored-by:
Sida <sida@kulamind.com> Co-authored-by:
teamclouday <teamclouday@gmail.com>
-
Nicolas Patry authored
* adding max_token_capacity_metric * added tgi to name of metric * Adding max capacity metric. * Add description for the metrics --------- Co-authored-by:Edwinhr716 <Edandres249@gmail.com>
-
Nicolas Patry authored
* Working loading state. * Preprocessing. * Working state ? (Broke idefics1 temporarily). * Cleaner condition. * Fix idefics. * Updating config, removing TODO * Mllama * Ugrade transformers 4.45 * Flashing mllama. * Starting to get there. * Working state. * Integrations tests for mllama (cutting to 10 tokens because there seems' to be instability after (meaning size of the batch matters. * Updating model link. * Earlier assert. * Fix vlm ? * remove log. * Force ignore all images but last. * Default dtype bfloat16. * Update integration test after switch to bf16. * Remove dead code. * Removed dead code. * Upgrade the flake to latest transformers/tokenizers * Move to hf tgi-nix * Upgrade to 0.5.0
-
- 01 Oct, 2024 1 commit
-
-
Daniël de Kok authored
* nix: experimental support for building a Docker image Run using something like: ``` docker run \ --device nvidia.com/gpu=all \ -it --rm -p 8080:80 \ -v $PWD/data:/data \ -v $PWD/tmp:/tmp \ tgi-docker:latest \ --model-id <model_id> ``` * Example of building the Docker image using Nix inside Docker * Stream to make the builder image smaller This avoids storing a Docker image tarball in the image. Instead, stream the layers while doing `docker run`. * Don't spam journalctl on Linux * Other dockerfile. --------- Co-authored-by:Nicolas Patry <patry.nicolas@protonmail.com>
-
- 30 Sep, 2024 7 commits
-
-
Daniël de Kok authored
This change uses the updated Marlin MoE kernel from vLLM to support MoE with activation sorting and groups.
-
Daniël de Kok authored
-
drbh authored
* feat: support phi3.5 moe model loading * fix: prefer llama base model and improve rotary logic * feat: return reasonable generation and add integration test * fix: run lint and update docs * fix: rerun lint for openapi docs * fix: prefer do_sample false unless temp is set by user, and update chat tests * fix: small typo adjustments * fix: consolidate long rope paths * fix: revert greedy by default and test changes * Vendor configuration so that we don't have to `trust_remote_code` * Use SparseMoELayer * Add support for dense MoE * Some type annotations * Add the usual model tests * Ruff. --------- Co-authored-by:
Daniël de Kok <me@danieldk.eu> Co-authored-by:
Nicolas Patry <patry.nicolas@protonmail.com>
-
Daniël de Kok authored
This change add support for MoE models that use GPTQ quantization. Currently only models with the following properties are supported: - No `desc_act` with tensor parallelism, unless `group_size=-1`. - No asymmetric quantization. - No AWQ.
-
Mohit Sharma authored
* style * update torch * ix issues * fix clone * revert mkl * added custom PA * style * fix style * style * hide env vart * fix mixtral model * add skinny kernel and merge fixes * fixed style * fix issue for sliding window models * addressed review comments * fix import * improved error messag * updated default value * remove import * fix imports after rebase * float16 dep * improve dockerfile * cleaned dockerfile
-
Ikram Ul Haq authored
-
Daniël de Kok authored
Remove compute capability lock We are only calling the `get_cuda_capability` function once, so avoiding the cost of multiple calls is not really necessary yet.
-
- 28 Sep, 2024 1 commit
-
-
Daniël de Kok authored
-