- 23 Oct, 2024 2 commits
-
-
OlivierDehaene authored
* feat: natively support Granite models
* Update doc
-
Daniël de Kok authored
-
- 22 Oct, 2024 1 commit
-
-
Daniël de Kok authored
* Add `impureWithCuda` dev shell

  This shell is handy when developing some kernels jointly with TGI - it adds nvcc and a bunch of commonly-used CUDA libraries to the environment. We don't add this to the normal impure shell to keep the development environment as clean as possible (avoid accidental dependencies, etc.).

* Add cuDNN
-
- 21 Oct, 2024 2 commits
-
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
Update the Mixtral GPTQ test to use a model with `desc_act=true` and `group_size!=-1` to ensure that we are checking activation sorting/non-full K (with tensor parallelism). The `desc_act=false` case is already checked by the Mixtral AWQ test.
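For reference, a hypothetical `quantization_config` of the kind such a test model would carry (illustrative values, not the exact checkpoint used): with `desc_act=true` and a finite `group_size`, GPTQ weights are stored in activation order, which is what exercises the non-full-K path under tensor parallelism.

```python
# Hypothetical GPTQ quantization_config: desc_act=True plus a finite group_size
# forces activation-order sorting, so loading the model with tensor parallelism
# exercises the non-full-K kernel path.
quantization_config = {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "desc_act": True,
    "sym": True,
}
```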
-
- 19 Oct, 2024 1 commit
-
-
Daniël de Kok authored
Change `fp8_quantize` so that we can pass reciprocals around everywhere, meaning scales are always passed around in the checkpoint format. I also noticed that we ignore any input scales we might have when fbgemm is available; skip this path if we already have a scale.
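A minimal sketch of what a function along these lines could look like, assuming torch with `float8_e4m3fn`; the function name matches the commit, the body and the scale handling are illustrative only.

```python
import torch

def fp8_quantize(weight: torch.Tensor, scale: torch.Tensor | None = None):
    """Illustrative sketch: quantize to FP8 and return the scale in checkpoint
    format (weight ~= qweight * scale); callers that need the reciprocal can
    derive it from the returned scale."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    if scale is None:
        # No input scale available: derive one instead of ignoring the case.
        scale = weight.abs().amax() / finfo.max
    qweight = (weight / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return qweight, scale
```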
-
- 18 Oct, 2024 1 commit
-
-
Nicolas Patry authored
* add gptq and awq int4 support in intel platform
* fix ci failure
* set kv cache dtype
* refine the code according to the review comments
* Simplifying conditionals + reverting integration tests values.
* Unused import
* Fix redundant import.
* Revert change after rebase.
* Upgrading the tests (TP>1 fix changes to use different kernels.)
* Update server/text_generation_server/layers/gptq/__init__.py

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 17 Oct, 2024 5 commits
-
-
Daniël de Kok authored
-
drbh authored
* fix: prefer inplace softmax to avoid copy
* Update server/text_generation_server/models/flash_causal_lm.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
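A minimal illustration of the idea (not the actual change in `flash_causal_lm.py`): computing softmax in place reuses the logits buffer instead of allocating a copy.

```python
import torch

def softmax_inplace(logits: torch.Tensor) -> torch.Tensor:
    # Illustrative in-place softmax over the last dimension: mutate the input
    # buffer step by step rather than materializing an extra tensor.
    logits.sub_(logits.amax(dim=-1, keepdim=True))
    logits.exp_()
    logits.div_(logits.sum(dim=-1, keepdim=True))
    return logits
```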
-
oOraph authored
tgi-entrypoint: exec instead of spawning a child process.

Reason: otherwise the parent receives the signals when we'd like TGI to receive them. Keeping the parent/child split is not necessary and would require the parent to handle signals in order to forward them properly to the child.

Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com>
Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>
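The actual entrypoint is a shell script, but the same idea expressed as a hypothetical Python sketch: replacing the current process instead of spawning a child means signals like SIGTERM reach the launcher directly, with no forwarding logic needed.

```python
import os
import sys

# Hypothetical illustration: os.execvp replaces this process with the launcher,
# so there is no parent left that would intercept SIGTERM/SIGINT and have to
# forward them to a child.
os.execvp("text-generation-launcher", ["text-generation-launcher", *sys.argv[1:]])
```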
-
Daniël de Kok authored
* Simplify the `attention` function

  - Use one definition rather than multiple.
  - Add `key`/`value` arguments, so that we don't need the `PREFILL_IN_KVCACHE` constant.
  - Make it kwargs-only (to avoid mixing up the various `Tensor` args).

* Fixup flashinfer support
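A rough, naive sketch of what a kwargs-only `attention` signature looks like; the real function dispatches to flash/paged kernels, this toy version only shows why keyword-only `query`/`key`/`value` arguments prevent mixed-up tensor arguments.

```python
import torch

def attention(*, query: torch.Tensor, key: torch.Tensor, value: torch.Tensor,
              softmax_scale: float) -> torch.Tensor:
    # Keyword-only arguments: callers must write attention(query=..., key=..., value=...),
    # so the tensors cannot be passed in the wrong positional order.
    scores = torch.einsum("qhd,khd->hqk", query, key) * softmax_scale
    return torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), value)
```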
-
Daniël de Kok authored
* Support `e4m3fn` KV cache
* Make check more obvious
-
- 16 Oct, 2024 2 commits
-
-
OlivierDehaene authored
* wip
* rollback
* refactor to use prefix/postfix naming + fix all_input_ids_tensor
* maybe patching vlms?
* fix filter and concat
* wip, no filter, no concat
* current
* add prepare_for_prefill
* working
* load tested
* re-create slots
* re-create slots
* fix slot_filtering_indices
* feedback loop
* remove log
* fix benchmarker
* fix vlm and seq2seq
* rename to cache and input lengths
* fix prefill logprobs
* fix launcher
* fix logprobs?
* idk at this point
* max input length
* omfg
* remove debugging lines
* fix tests
* fix mllama
* fix cargo tests
* remove support chunking for paged
* Fixing non blocked attentions
* Fixing dtype + AMD, Ipex targets.
* lint fix.
* rename
* Fix prefix_caching variable, remove defaults in server (confusing a lot of the time).
* Add simple resolution when user specifies ATTENTION=paged.
* Put back non default simple tests.
* Fix env name

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Mohit Sharma authored
* (feat) fp8 fnuz support for rocm
* (review comments) Fix compression_config load, type hints
* (bug) update all has_tensor
* (review_comments) fix typo and added comments
* (nit) improved comment
-
- 15 Oct, 2024 3 commits
-
-
Alvaro Bartolome authored
As spotted by @philschmid, the payload was only partially compliant with Vertex AI: ideally the generation kwargs would be flattened to the same level as the `messages`, meaning that Vertex AI would still expect a list of instances, but each instance would be an OpenAI-compatible instance. This is clearer and more aligned with the SageMaker integration too, so kudos to him for spotting that; and sorry from my end for any inconvenience @Narsil.
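Roughly, the flattened shape being described (hypothetical values): each instance in the Vertex AI payload is itself an OpenAI-compatible chat request, with generation kwargs at the same level as `messages`.

```python
# Hypothetical Vertex AI payload after the change: a list of instances, each one
# an OpenAI-compatible chat request with generation kwargs next to `messages`.
payload = {
    "instances": [
        {
            "messages": [{"role": "user", "content": "What is Deep Learning?"}],
            "max_tokens": 128,
            "temperature": 0.7,
        }
    ]
}
```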
-
Daniël de Kok authored
-
Nicolas Patry authored
-
- 14 Oct, 2024 5 commits
-
-
Dmitry Rogozhkin authored
XPU backend is available natively (without IPEX) in PyTorch starting from PyTorch 2.4. This commit extends TGI to cover the case when the user has XPU support through PyTorch 2.4 but does not have IPEX installed. Models which don't require attention can work; for models that require attention, more work is needed to provide an attention implementation.

Tested with the following models:
* teknium/OpenHermes-2.5-Mistral-7B
* bigscience/bloom-560m
* google/gemma-7b
* google/flan-t5-xxl

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
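A small sketch of the kind of check this enables, assuming PyTorch >= 2.4: the XPU device can be detected through `torch.xpu` without importing IPEX.

```python
import torch

# Sketch: with PyTorch 2.4+ the native XPU backend is queryable directly,
# so no intel_extension_for_pytorch import is required just to pick a device.
if hasattr(torch, "xpu") and torch.xpu.is_available():
    device = torch.device("xpu")
else:
    device = torch.device("cpu")
```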
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Omar Sanseviero authored
Update quicktour.md
-
Nicolas Patry authored
* break when there's nothing to read
* Different approach, only listen on stdin when `LOG_LEVEL=debug` (which is where dropping to a debugger is important).

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
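The launcher itself is written in Rust, but as a hypothetical Python sketch of the behaviour: only poll stdin when `LOG_LEVEL=debug`, and stop as soon as there is nothing left to read.

```python
import os
import sys

# Hypothetical sketch of the launcher behaviour: stdin is only watched in debug
# mode (where dropping into a debugger matters), and the loop breaks on EOF
# instead of spinning when there is nothing to read.
if os.environ.get("LOG_LEVEL") == "debug":
    while True:
        line = sys.stdin.readline()
        if not line:  # EOF: nothing left to read
            break
```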
-
Omar Sanseviero authored
* Small improvements for docs
* Update _toctree.yml
* Updating the doc (we keep the list actually).

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
- 11 Oct, 2024 1 commit
-
-
Nicolas Patry authored
-
- 10 Oct, 2024 3 commits
-
-
Nicolas Patry authored
* Intel CI ?
* Let's try non sharded gemma.
* Snapshot rename
* Apparently container can be gone already.
-
vb authored
Update to most recent stable version of TGI.
-
drbh authored
* feat: process token stream before returning to client
* fix: expect content in test
* fix: improve comparison via ruff lint
* fix: return event in all cases
* fix: always send event on error, avoid unwraps, refactor and improve tests
* fix: prefer no_tool over notify_error to improve response
* fix: adjust chat input test for no_tool
* fix: adjust test expected content

Co-authored-by: System administrator <root@ip-10-90-0-186.ec2.internal>
-
- 09 Oct, 2024 2 commits
-
-
Nicolas Patry authored
* Only run 1 valid test.
* Trying the tailscale action quickly.
* ?
* bash spaces.
* Remove tailscale.
* More quotes.
* mnt2 ?
* Other name to avoid recursive directories.
* Good old tmate.
* Remove tmate.
* Trying a few things.
* Remove some stuff.
* Sleep ?
* Tmp
* busybox
* Launcher tgi
* Starting hello
* Busybox in python
* No device.
* Removing all variables ?
* At some point.
* Tmp
* Tmp2
* Device request, no container name
* No device requests
* Without pytest.
* No pytest.
* from env
* Start with devices
* Attempt #1
* Remove stdin messing
* Only 1 test, no container name
* Raw tgi
* Sending args.
* Show pip freeze.
* Start downloading with token
* Giving HIP devices.
* Mount volume + port forward
* Without pytest.
* No token
* Repeated arguments
* Wrong kwarg.
* On 2 GPUs
* Fallback to single shard CI test.
* Testing
* yaml
* Common cache ?
* Trailing slash ?
* Docker volume split.
* Fix docker volume
* Fixing ?
* ?
* Try no devices ?
* Flash llama on intel CPU ?
* Fix nvidia ?
* Temp deactivate intel, activate nvidia ?
-
Daniël de Kok authored
To make sure that everything is formatted with the same black version as CI. I sometimes use isort for new files to get nicely ordered imports, so add it as well. Also set the isort configuration to format in a way that is compatible with black.
-
- 08 Oct, 2024 4 commits
-
-
drbh authored
* Update ToolType input schema
* lint
* fix: run formatter
* fix: allow tool choice to be null

Co-authored-by: Wauplin <lucainp@gmail.com>
-
Daniël de Kok authored
-
Daniël de Kok authored
* Add support for fused MoE Marlin for AWQ

  This uses the updated MoE Marlin kernels from vLLM.

* Add integration test for AWQ MoE
-
Nicolas Patry authored
* Upgrade minor rust version (Fixes rust build compilation cache)
* Black
-
- 07 Oct, 2024 2 commits
-
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Florian Zimmermeister authored
Update kv_cache.py
-
- 04 Oct, 2024 2 commits
-
-
Daniël de Kok authored
* Add basic FP8 KV cache support

  This change adds rudimentary FP8 KV cache support. The support is enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher; doing so uses this type for the KV cache. However, support is still limited:

  * Only the `fp8_e5m2` type is supported.
  * The KV cache layout is the same as `float16`/`bfloat16` (HND).
  * The FP8 KV cache is only supported for FlashInfer.
  * Loading of scales is not yet supported.

* Fix Cargo.toml
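As a usage note: `--kv-cache-dtype fp8_e5m2` is the launcher flag; on the Python side the value maps to a torch FP8 dtype along these lines (an illustrative mapping, not the exact server code; `fp8_e4m3fn` is included because support for it lands in the 17 Oct commit above).

```python
import torch

# Illustrative mapping from the launcher flag value to a torch dtype.
KV_CACHE_DTYPES = {
    "fp8_e5m2": torch.float8_e5m2,
    "fp8_e4m3fn": torch.float8_e4m3fn,
}

def resolve_kv_cache_dtype(name: str, default: torch.dtype = torch.bfloat16) -> torch.dtype:
    # Fall back to the model dtype when no FP8 KV cache type is requested.
    return KV_CACHE_DTYPES.get(name, default)
```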
-
Daniël de Kok authored
-
- 03 Oct, 2024 2 commits
-
-
Nicolas Patry authored
* New release 2.3.1
* Update doc number
-
- 02 Oct, 2024 2 commits
-
-
drbh authored
* feat: unroll notify_error if no tool is chosen
* fix: expect simple message when no tool is selected
* fix: improve test to avoid notify_error
* fix: improve docs and indicate change in expected response
* fix: adjust linting in test file
-
drbh authored
allow revision for lora adapters from launcher

Co-authored-by: Sida <sida@kulamind.com>
Co-authored-by: teamclouday <teamclouday@gmail.com>
-