- 31 Jul, 2024 4 commits
-
-
Erik Kaunismäki authored
* refactor usage stats
* Update docs/source/usage_statistics.md
* Update router/src/server.rs
* changes based on feedback
* run python3 update_doc.py
* fix pre-commit
* Update router/src/server.rs
* delete option around usage stats arg

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
drbh authored
* MODEL_ID propagation fix
* fix: remove global model id

Co-authored-by: root <root@tw031.pit.tensorwave.lan>
-
Daniël de Kok authored
The `GPTQWeightsLoader` was structured like this in pseudocode:

    if marlin:
        Set up tensors in a way that GPTQ-Marlin expects
    else:
        Set up tensors in a way that ExLlama/GPTQ/AWQ expect

However, the GPTQ-Marlin implementation details should really live in the `marlin` module. So, move the former branch out to a separate `GPTQMarlinWeightsLoader`.
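A minimal sketch of what the split looks like. Only `GPTQMarlinWeightsLoader` and the `marlin` module are named in the change itself; the stand-in class bodies and the dispatch function are illustrative:

```python
class GPTQWeightsLoader:
    """Sets up tensors the way ExLlama/GPTQ/AWQ kernels expect."""

class GPTQMarlinWeightsLoader:
    """Sets up tensors the way the GPTQ-Marlin kernel expects."""

def get_weights_loader(quantize: str, marlin_supported: bool):
    # After the split, the Marlin-specific layout lives in its own loader
    # (in the `marlin` module) instead of an if/else inside the GPTQ loader.
    if quantize == "gptq" and marlin_supported:
        return GPTQMarlinWeightsLoader()
    return GPTQWeightsLoader()
```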
-
Nicolas Patry authored
* wip

  wip
  refacto
  refacto
  Initial setup for CXX binding to TRTLLM
  Working FFI call for TGI and TRTLLM backend
  Remove unused parameters and force tokenizer name to be set
  Overall build TRTLLM and deps through CMake build system
  Enable end to end CMake build
  First version loading engines and making it ready for inference
  Remembering to check how we can detect support for chunked context
  Move to latest TensorRT-LLM version
  Specify which default log level to use depending on CMake build type
  make leader executor mode working
  unconditionally call InitializeBackend on the FFI layer
  bind to CUDA::nvml to retrieve compute capabilities at runtime
  updated logic and comment to detect cuda compute capabilities
  implement the Stream method to send new tokens through a callback
  use spdlog release 1.14.1 moving forward
  update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c
  correctly tell cmake to build dependent tensorrt-llm required libraries
  create cmake install target to put everything relevant in installation folder
  add auth_token CLI argument to provide hf hub authentication token
  allow converting huggingface::tokenizers error to TensorRtLlmBackendError
  use correct include for spdlog
  include guard to build example in cmakelists
  working setup of the ffi layer
  remove fmt import
  use external fmt lib
  end to end ffi flow working
  make sure to track include/ffi.h to trigger rebuild from cargo
  impl the rust backend which currently cannot move the actual computation in background thread
  expose shutdown function at ffi layer
  impl RwLock scenario for TensorRtLllmBackend
  oops missing c++ backend definitions
  compute the number of maximum new tokens for each request independently
  make sure the context is not dropped in the middle of the async decoding
  remove unnecessary log
  add all the necessary plumbing to return the generated content
  update invalid doc in cpp file
  correctly forward back the log probabilities
  remove unneeded scope variable for now
  refactor Stream impl for Generation to factorise code
  expose the internal missing start/queue timestamp
  forward tgi parameters rep/freq penalty
  add some more validation about grammar not supported
  define a shared struct to hold the result of a decoding step
  expose information about potential error happening while decoding
  remove logging
  add logging in case of decoding error
  make sure executor_worker is provided
  add initial Dockerfile for TRTLLM backend
  add some more information in CMakeLists.txt to correctly install executorWorker
  add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper
  simplify prebuilt trtllm libraries name definition
  do the same name definition stuff for tensorrt_llm_executor_static
  leverage pkg-config to probe libraries paths and reuse new install structure from cmake
  fix bad copy/paste
  missing nvinfer linkage direction
  align all the linker search dependency
  add missing pkgconfig folder for MPI in Dockerfile
  correctly setup linking search path for runtime layer
  fix missing / before tgi lib path
  adding missing ld_library_path for cuda stubs in Dockerfile
  update tgi entrypoint
  commenting out Python part for TensorRT installation
  refactored docker image
  move to TensorRT-LLM v0.11.0
  make docker linter happy with same capitalization rule
  fix typo
  refactor the compute capabilities detection along with num gpus
  update TensorRT-LLM to latest version
  update TensorRT install script to latest
  update build.rs to link to cuda 12.5
  add missing dependent libraries for linking
  clean up a bit
  install to decoder_attention target
  add some custom stuff for nccl linkage
  fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time
  use std::env::const::ARCH
  make sure variable live long enough...
  look for cuda 12.5
  add some more basic info in README.md

* Rebase.
* Fix autodocs.
* Let's try to enable trtllm backend.
* Ignore backends/v3 by default.
* Fixing client.
* Fix makefile + autodocs.
* Updating the schema thing + redocly.
* Fix trtllm lint.
* Adding pb files ?
* Remove cargo fmt temporarily.
* ?
* Tmp.
* Remove both check + clippy ?
* Backporting telemetry.
* Backporting 457fb0a1
* Remove PB from git.
* Fixing PB with default member backends/client
* update TensorRT-LLM to latest version
* provided None for api_key
* link against libtensorrt_llm and not libtensorrt-llm

Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>
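One recurring thread above is detecting CUDA compute capabilities at runtime via NVML. The backend does this in C++ through `CUDA::nvml`; here is a Python sketch of the same idea using `pynvml`, offered only to illustrate the mechanism:

```python
import pynvml

# Query each GPU's CUDA compute capability at runtime through NVML,
# mirroring what the TRTLLM backend does in C++ via CUDA::nvml.
pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        print(f"GPU {i}: compute capability {major}.{minor}")
finally:
    pynvml.nvmlShutdown()
```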
-
- 30 Jul, 2024 1 commit
-
-
Daniël de Kok authored
- Create `quantization_config` option in the model config.
- Don't store the quantizer config in tensors anymore.
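As a rough sketch, the quantizer settings now come from the model's config.json rather than tensor metadata; field names below follow the standard `quantization_config` block written by GPTQ/AWQ tooling:

```python
import json

# Read quantizer settings from config.json instead of tensor metadata.
with open("config.json") as f:
    config = json.load(f)

quant = config.get("quantization_config", {})
bits = quant.get("bits")                  # e.g. 4
group_size = quant.get("group_size")      # e.g. 128
quant_method = quant.get("quant_method")  # e.g. "gptq" or "awq"
```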
-
- 29 Jul, 2024 6 commits
-
-
drbh authored
* fix: adjust test snapshots and small refactors
* fix: revert non-snapshot changes
-
Erik Kaunismäki authored
* quick fix
* allow silent failure
* explicit todo that this is only short term
-
drbh authored
-
Daniël de Kok authored
-
Erik Kaunismäki authored
* Add API_Key for Auth and conditionally add authorisation for non info/health endpoints
* change name to info routes
* Fix comment
* convert strings to lowercase for case insensitive comparison
* convert header to string
* fixes and update docs
* update docs again
* revert wrong update

Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>
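A hedged sketch of the idea: TGI's router is written in Rust, so this Python/FastAPI version only illustrates exempting info/health-style routes and comparing the auth scheme case-insensitively. The route names and key source here are assumptions, not TGI's actual configuration:

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
API_KEY = "secret"                  # illustrative; TGI takes this from a CLI arg
OPEN_ROUTES = {"/info", "/health"}  # info/health endpoints stay unauthenticated

@app.middleware("http")
async def require_api_key(request: Request, call_next):
    if request.url.path not in OPEN_ROUTES:
        auth = request.headers.get("authorization", "")
        # Compare the scheme case-insensitively: "Bearer" vs "bearer".
        scheme, _, token = auth.partition(" ")
        if scheme.lower() != "bearer" or token != API_KEY:
            return JSONResponse({"error": "unauthorized"}, status_code=401)
    return await call_next(request)
```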
-
Adrien authored
Signed-off-by: Adrien <adrien@huggingface.co>
-
- 26 Jul, 2024 2 commits
-
-
drbh authored
* feat: add ruff and resolve issue
* fix: update client exports and adjust after rebase
* fix: adjust syntax to avoid circular import
* fix: adjust client ruff settings
* fix: lint and refactor import check and avoid model enum as global names
* fix: improve fbgemm_gpu check and lints
* fix: update lints
* fix: prefer comparing model enum over str
* fix: adjust lints and ignore specific rules
* fix: avoid unneeded quantize check
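One of the bullets above, "prefer comparing model enum over str", boils down to the following pattern (illustrative names, not TGI's actual enum):

```python
from enum import Enum

class ModelType(Enum):
    LLAMA = "llama"
    MISTRAL = "mistral"

model_type = ModelType.LLAMA

# Preferred: compare enum members directly.
if model_type == ModelType.LLAMA:
    print("llama path")

# Discouraged: comparing raw strings bypasses the type checker and
# silently breaks if the enum values ever change.
if model_type.value == "llama":
    print("stringly-typed llama path")
```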
-
Daniël de Kok authored
-
- 25 Jul, 2024 4 commits
-
-
Adrien authored
-
Nicolas Patry authored
-
Daniël de Kok authored
* Fix GPTQ autotune data type to be compatible with Torch 2.4.0
* Update poetry lock file
* Fix small PaliGemma logprob differences after the torch update
-
Nicolas Patry authored
* Using g6 instead of g5.
* Update the idefics2 snapshot.
-
- 24 Jul, 2024 4 commits
-
-
drbh authored
* fix: refactor adapter weight loading and mapping
* feat: enable lora load from directory
* fix: adjust launcher for local lora adapters
* feat: improve weight loading and add tests
* fix: improve logging and rebase syntax issue
* fix: improve adapter merge comments and remove unused conditional
* fix: improve get_model_with_lora_adapters naming
* fix: comment typo
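The "lora load from directory" bullet amounts to roughly this resolution step; a sketch where `resolve_adapter_source` is an illustrative name, not TGI's API:

```python
import os

def resolve_adapter_source(adapter_id: str) -> str:
    # A LoRA adapter may now be given as a local directory
    # instead of a Hugging Face Hub repo id.
    if os.path.isdir(adapter_id):
        return adapter_id  # read adapter_config.json / safetensors from disk
    # Otherwise treat it as a Hub repo id and download it.
    from huggingface_hub import snapshot_download
    return snapshot_download(adapter_id)
```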
-
Daniël de Kok authored
The `marlin.py` file was getting large, so split it up.
-
Wang, Yi authored
Fix the use of unquantized weights in Cohere GQA loading; also enable the model on the Intel platform. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Wang, Yi authored
* fix crash in multi-modal
* update according to review comment
* fix llava_next regression in latest main

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 23 Jul, 2024 9 commits
-
-
OlivierDehaene authored
-
OlivierDehaene authored
* chore: update to torch 2.4
* remove unnecessary patch
* fix
-
Daniël de Kok authored
-
Daniël de Kok authored
* Add support for Llama 3 rotary embeddings
* Update transformers to 4.43
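For reference, Llama 3-style rotary scaling rescales each inverse frequency according to its wavelength. A sketch using the published Llama 3.1 defaults (scaling factor 8, low/high frequency factors 1 and 4, original context length 8192); this follows Meta's reference formula, not TGI's exact code:

```python
import math
import torch

def llama3_scale_freqs(inv_freqs: torch.Tensor,
                       factor: float = 8.0,
                       low_freq_factor: float = 1.0,
                       high_freq_factor: float = 4.0,
                       original_max_position: int = 8192) -> torch.Tensor:
    low_freq_wavelen = original_max_position / low_freq_factor
    high_freq_wavelen = original_max_position / high_freq_factor
    scaled = []
    for freq in inv_freqs.tolist():
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            scaled.append(freq)            # high-frequency band: keep as-is
        elif wavelen > low_freq_wavelen:
            scaled.append(freq / factor)   # low-frequency band: scale down
        else:
            # Smoothly interpolate between the two bands.
            smooth = (original_max_position / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return torch.tensor(scaled, dtype=inv_freqs.dtype)
```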
-
Nicolas Patry authored
* Preparing for release.
* Updating docs.
* Fixing token within the docker image for the launcher.
-
shaltielshmid authored
* Support passing head_dim through config
* Using `head_dim` as a fallback is necessary since it's a non-standard key in MistralConfig (as defined in transformers).
* Shorter diff.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
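The fallback amounts to something like this, where `config` stands for a transformers `MistralConfig`-style object:

```python
# Use head_dim from the config when present; otherwise derive it,
# since head_dim is a non-standard key in MistralConfig.
head_dim = getattr(config, "head_dim", None)
if head_dim is None:
    head_dim = config.hidden_size // config.num_attention_heads
```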
-
Daniël de Kok authored
* Add support for repacking AWQ weights for GPTQ-Marlin

  So far we couldn't support AWQ because virtually all AWQ models use asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin has recently added support for AWQ repacking and AWQ asymmetric quantization (zero_point=True). This change updates all GPTQ-Marlin kernels from upstream and wires up AWQ support. For now, enabling AWQ using Marlin requires running TGI with `--quantize gptq`.

* Enable Marlin for supported AWQ configurations by default

  This makes the AWQ -> GPTQ repack test redundant, since we are now testing this with the regular AWQ test.
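A rough sketch of the kind of compatibility gate involved in "supported AWQ configurations"; the exact bit widths and group sizes are kernel-specific, so treat the values below as assumptions:

```python
def can_repack_awq_for_marlin(bits: int, group_size: int) -> bool:
    # GPTQ-Marlin kernels handle 4- and 8-bit weights and a limited
    # set of group sizes (-1 meaning per-column / no grouping).
    return bits in (4, 8) and group_size in (-1, 32, 64, 128)
```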
-
OlivierDehaene authored
* fix(l4): fix fp8 logic on l4
* also quant weights with single scale
* use marlin even on 89
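"Quant weights with single scale" refers to per-tensor fp8 quantization: one scale for the whole weight tensor. A minimal sketch, assuming `float8_e4m3fn` weights (not necessarily TGI's exact routine):

```python
import torch

def fp8_quantize_single_scale(weight: torch.Tensor):
    # One scale for the whole tensor, chosen so the largest magnitude
    # maps onto the fp8 range limit.
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = weight.abs().max().clamp(min=1e-12) / finfo.max
    qweight = (weight / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return qweight, scale  # dequantize with qweight.to(weight.dtype) * scale
```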
-
Nicolas Patry authored
-
- 22 Jul, 2024 6 commits
-
-
Adrien authored
-
Nicolas Patry authored
* Softcapping for gemma2.
* Less clutter.
* No access to transformers config, only config_dict here.
* 0.0 is the null value in the C++ API.
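Softcapping squashes logits through a scaled tanh. A minimal sketch, where a cap of 0.0 means "disabled", matching the C++ API convention noted above:

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # cap == 0.0 is treated as "no softcapping", mirroring the C++ API's
    # use of 0.0 as the null value.
    if cap == 0.0:
        return logits
    return cap * torch.tanh(logits / cap)
```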
-
OlivierDehaene authored
* fix(server): fix fp8 weight loading
* fixed scales loading
* update snap
* revert default dtype
-
Adrien authored
* test new instances
* improve build ci

Signed-off-by: Adrien <adrien@huggingface.co>
-
Erik Kaunismäki authored
Update README.md to point to the huggingface_hub inference clients instead.
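For reference, a quick usage sketch of the huggingface_hub client against a TGI endpoint (the URL is illustrative):

```python
from huggingface_hub import InferenceClient

# Point the client at a running TGI server.
client = InferenceClient("http://localhost:8080")

# Simple text generation against the TGI endpoint.
output = client.text_generation("What is deep learning?", max_new_tokens=64)
print(output)
```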
-
icyboy™ authored
* Update idefics_causal_lm.py to fix syntax issues
* fix dbrx & opt model prefix bug
* Hotfix: fix the use of unquantized weights in Mixtral GQA loading
-
- 21 Jul, 2024 1 commit
-
-
OlivierDehaene authored
-
- 20 Jul, 2024 3 commits
-
-
OlivierDehaene authored
* feat(fp8): add support for fbgemm
* allow loading fp8 weights directly
* update outlines
* fix makefile
* build fbgemm
* avoid circular import and fix dockerfile
* add default dtype
* refactored weights loader
* fix auto conversion
* fix quantization config parsing
* force new nccl on install
* missing get_weights implementation
* increase timeout
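A hedged sketch of the kind of availability check involved: only take the fbgemm fp8 path when `fbgemm_gpu` imports cleanly and the GPU is new enough. The sm90 cutoff and module path are assumptions here, not confirmed by the commit:

```python
import torch

def fbgemm_fp8_available() -> bool:
    try:
        # Module path per fbgemm-gpu's GenAI ops; treat as an assumption.
        import fbgemm_gpu.experimental.gen_ai  # noqa: F401
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Assumed cutoff: H100-class GPUs (compute capability 9.x).
    major, _ = torch.cuda.get_device_capability()
    return major >= 9
```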
-
Daniël de Kok authored
-
Adrien authored
* re-push to internal registry
* fix name
* debug
* debug
* wip
* wip
* wip debug
* add debug
* should
* wip
* ww
* wip
* wip
* ww
* wip
* wip
* debug
* w
* revert tests
* last reverts
* another one

Signed-off-by: Adrien <adrien@huggingface.co>
-