- 29 Jul, 2024 2 commits
-
-
Erik Kaunismäki authored
* Add API_Key for Auth and conditionally add authorisation for non info/health endpoints. * change name to info routes * Fix comment * convert strings to lowercase for case insensitive comparison * convert header to string * fixes and update docs * update docs again * revert wrong update --------- Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>
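The behaviour this commit describes (an API key checked on every route except the info/health ones, with a case-insensitive comparison) could be sketched roughly as follows. This is a minimal Python illustration, not the router's actual code: the route set `INFO_ROUTES`, the function name `is_authorized`, and the `Bearer` header format are all assumptions made for the example.

```python
# Hypothetical set of routes that stay open even when an API key is configured.
INFO_ROUTES = frozenset({"/", "/info", "/health"})


def is_authorized(path, headers, api_key):
    """Return True if a request to `path` may proceed.

    Assumptions for this sketch: the key arrives as `Authorization: Bearer <key>`,
    and both sides are lowercased before comparing, as the commit message suggests.
    """
    # No key configured, or an info/health route: nothing to check.
    if api_key is None or path in INFO_ROUTES:
        return True
    # Header names are case-insensitive; values are converted to strings first.
    normalized = {name.lower(): str(value) for name, value in headers.items()}
    supplied = normalized.get("authorization", "")
    expected = f"Bearer {api_key}"
    return supplied.lower() == expected.lower()
```

Lowercasing both sides also makes the key itself case-insensitive, which is a trade-off of this approach rather than a requirement of bearer authentication.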
-
Adrien authored
Signed-off-by: Adrien <adrien@huggingface.co>
-
- 26 Jul, 2024 2 commits
-
-
drbh authored
* feat: add ruff and resolve issue * fix: update client exports and adjust after rebase * fix: adjust syntax to avoid circular import * fix: adjust client ruff settings * fix: lint and refactor import check and avoid model enum as global names * fix: improve fbgemm_gpu check and lints * fix: update lints * fix: prefer comparing model enum over str * fix: adjust lints and ignore specific rules * fix: avoid unneeded quantize check
-
Daniël de Kok authored
-
- 25 Jul, 2024 4 commits
-
-
Adrien authored
-
Nicolas Patry authored
-
Daniël de Kok authored
* Fix GPTQ autotune data type to be compatible with Torch 2.4.0 * Update poetry lock file * Fix small PaliGemma logprob differences after the torch update
-
Nicolas Patry authored
* Using g6 instead of g5. * Update the idefics2 snapshot.
-
- 24 Jul, 2024 4 commits
-
-
drbh authored
* fix: refactor adapter weight loading and mapping * feat: enable lora load from directory * fix: adjust launcher for local lora adapters * feat: improve weight loading and add tests * fix: improve logging and rebase syntax issue * fix: improve adapter merge comments and remove unused conditional * fix: improve get_model_with_lora_adapters naming * fix: comment typo
-
Daniël de Kok authored
The marlin.py file was getting large, split it up.
-
Wang, Yi authored
fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Wang, Yi authored
* fix crash in multi-modal * update according to review comment * fix llava_next regression in latest main --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 23 Jul, 2024 9 commits
-
-
OlivierDehaene authored
-
OlivierDehaene authored
* chore: update to torch 2.4 * remove un-necessary patch * fix
-
Daniël de Kok authored
-
Daniël de Kok authored
* Add support for Llama 3 rotary embeddings * Update transformers to 4.43
-
Nicolas Patry authored
* Preparing for release. * Updating docs. * Fixing token within the docker image for the launcher.
-
shaltielshmid authored
* Support passing head_dim through config * Using `head_dim` as a fallback is necessary since it's a non standard key in mistralConfig (as defined in transformers). * Shorter diff. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
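One way to read this commit: take `head_dim` from the config when it is present, and otherwise derive it from the hidden size and the number of attention heads. The Python sketch below assumes that reading. The helper name `resolve_head_dim` is hypothetical, and the attribute names follow common transformers config conventions.

```python
from types import SimpleNamespace


def resolve_head_dim(config):
    """Prefer an explicit `head_dim`; otherwise derive it from the config.

    `head_dim` is not a standard key on every config class, so it is read
    with `getattr` and a `None` default instead of direct attribute access.
    """
    head_dim = getattr(config, "head_dim", None)
    if head_dim is not None:
        return head_dim
    return config.hidden_size // config.num_attention_heads


# Example configs (values chosen for illustration only).
derived = resolve_head_dim(SimpleNamespace(hidden_size=4096, num_attention_heads=32))
explicit = resolve_head_dim(
    SimpleNamespace(hidden_size=4096, num_attention_heads=32, head_dim=256)
)
```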
-
Daniël de Kok authored
* Add support for repacking AWQ weights for GPTQ-Marlin. So far we couldn't support AWQ because virtually all AWQ models use asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin has recently added support for AWQ repacking and AWQ asymmetric quantization (zero_point=True). This change updates all GPTQ-Marlin kernels from upstream and wires up AWQ support. For now enabling AWQ using Marlin requires running TGI with `--quantize gptq`. * Enable Marlin for supported AWQ configurations by default. This makes the AWQ -> GPTQ repack test redundant, since we are now testing this with the regular AWQ test.
-
OlivierDehaene authored
* fix(l4): fix fp8 logic on l4 * also quant weights with single scale * use marlin even on 89
-
Nicolas Patry authored
-
- 22 Jul, 2024 6 commits
-
-
Adrien authored
-
Nicolas Patry authored
* Softcapping for gemma2. * Less clutter. * No access to transformers config, only config_dict here. * 0.0 is the null value in the C++ API.
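Soft-capping squashes attention scores (or logits) into the range (-cap, cap) with a scaled tanh. The Python sketch below illustrates the formula only, not the kernel code. It assumes, following the commit message, that a cap of 0.0 means "disabled". The function name `softcap` is hypothetical.

```python
import math


def softcap(scores, cap=0.0):
    """Apply cap * tanh(score / cap) to each score.

    A cap of 0.0 is treated as "no soft-capping" (assumed null value),
    in which case the scores are returned unchanged.
    """
    if cap == 0.0:
        return list(scores)
    return [cap * math.tanh(score / cap) for score in scores]


capped = softcap([1000.0, -1000.0, 0.1], cap=50.0)
unchanged = softcap([1000.0, -1000.0, 0.1])
```

Small scores pass through almost untouched, while large ones saturate near the cap. This keeps the transform smooth, unlike hard clipping.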
-
OlivierDehaene authored
* fix(server): fix fp8 weight loading * fixed scales loading * update snap * revert default dtype
-
Adrien authored
* test new instances * improve build ci --------- Signed-off-by: Adrien <adrien@huggingface.co>
-
Erik Kaunismäki authored
Update README.md point to huggingface_hub inference clients instead
-
icyboy™ authored
* Update idefics_causal_lm.py Fix syntax issues * fix dbrx & opt model prefix bug * Hotfix: fix of use of unquantized weights in Mixtral GQA loading
-
- 21 Jul, 2024 1 commit
-
-
OlivierDehaene authored
-
- 20 Jul, 2024 3 commits
-
-
OlivierDehaene authored
* feat(fp8): add support for fbgemm * allow loading fp8 weights directly * update outlines * fix makefile * build fbgemm * avoid circular import and fix dockerfile * add default dtype * refactored weights loader * fix auto conversion * fix quantization config parsing * force new nccl on install * missing get_weights implementation * increase timeout
-
Daniël de Kok authored
-
Adrien authored
* re-push to internal registry * fix name * debug * debug * wip * wip * wip debug * add debug * should * wip * ww * wip * wip * ww * wip * wip * debug * w * revert tests * last reverts * another one --------- Signed-off-by: Adrien <adrien@huggingface.co>
-
- 19 Jul, 2024 9 commits
-
-
Daniël de Kok authored
Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models:
- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`), so we need weight loading that supports quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value, so we need to pad during attention.
- Heads of size 192 need an extension to our paged attention fork, and we need to ensure that the KV cache is allocated with the correct size.
- Shared experts.
-
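The grouped top-K expert selection listed in the Deepseek V2 commit can be pictured as a two-stage pick: first keep the best few groups of experts, then take the top-K experts from those groups only. The sketch below is a plain-Python illustration under assumptions. It ranks each group by its best expert score, and the names `grouped_top_k`, `n_group`, and `topk_group` are chosen for the example rather than taken from the server code.

```python
def grouped_top_k(scores, n_group, topk_group, top_k):
    """Select `top_k` expert indices, restricted to the `topk_group` best groups.

    Experts are split into `n_group` contiguous, equally sized groups, and each
    group is ranked by the highest score among its experts (an assumption).
    """
    if len(scores) % n_group != 0:
        raise ValueError("number of experts must be divisible by n_group")
    size = len(scores) // n_group
    # Stage 1: rank groups by their best expert and keep the top `topk_group`.
    group_best = [max(scores[g * size:(g + 1) * size]) for g in range(n_group)]
    kept = sorted(range(n_group), key=lambda g: group_best[g], reverse=True)[:topk_group]
    # Stage 2: ordinary top-K, but only over experts in the kept groups.
    candidates = [i for g in kept for i in range(g * size, (g + 1) * size)]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_k]


router_scores = [0.9, 0.1, 0.8, 0.7, 0.2, 0.3, 0.6, 0.5]
selected = grouped_top_k(router_scores, n_group=4, topk_group=2, top_k=4)
```

With these scores a plain top-4 would pick expert 6 (score 0.6), but its group is not among the two best groups, so expert 1 is selected instead.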
drbh authored
* fix: adjust default tool choice * feat: improve tool choice syntax and response parsing/errors * fix: remove dev tests * feat: add ToolChoice to docs
-
Erik Kaunismäki authored
quick fix
-
Erik Kaunismäki authored
* draft of usage stats * fix wrong link * launcher doesn't need sysinfo dep * only tokenizer class instead of whole struct * unused import * fix clippy errors * update openAPI doc * cargo fmt * fix error in passing flags to router * try again to update docs * run pre-commit locally * Update router/src/main.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * Update router/src/main.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * on crash use anonymous error event * delete json_output and ngrok * more robust way of checking if is in container * more robust nvidia smi * parse xpu more robustly * fix errors * add nvidia-smi details in docs * cargo fmt * fix clippy * should make docs check pass * Update router/src/usage_stats.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * error reason can't be in nested json * cargo fmt --------- Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>
-
Daniël de Kok authored
-
Daniël de Kok authored
-
Daniël de Kok authored
-
Daniël de Kok authored
-
Daniël de Kok authored
* Improve the handling of quantized weights. Handling of quantized weights was split between two mechanisms:
- For quantized checkpoints, we used the new weight loader infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes), we instead relied on conditionals in `get_linear`.
Weight loaders support context managers to selectively load particular layers with different weight loaders, which is useful for models like Idefics2 AWQ, which uses a quantized text model but unquantized vision and connector models. However, the context manager would be overridden by `get_linear`, which string-checks `quantizer`. Also, the context manager would not work with EETQ, FP8, and bitsandbytes. This change migrates all quantizers to the weight loader infrastructure. This has several benefits:
- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers; `get_linear` does not need to know how to handle quantizer linear layers.
- All quantizer weights are strongly typed; we don't pass around raw tensors.
- We don't have to pass around the `quantizer` string everywhere.
* Exclude non-MLP layers when using FP8 quantization with Llama
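The context-manager pattern described in this commit, where the active weight loader is swapped temporarily for a subset of layers, might look roughly like the sketch below. Every class and method name here (`Weights`, `use_loader`, `QuantizedLoader`, `UnquantizedLoader`) is a stand-in written for illustration and is not claimed to match the server's API. The loaders return tagged strings rather than tensors to keep the example self-contained.

```python
from contextlib import contextmanager


class QuantizedLoader:
    """Stand-in loader for a quantized checkpoint (illustrative only)."""

    def get_weights(self, prefix):
        return f"quantized:{prefix}"


class UnquantizedLoader:
    """Stand-in loader for plain, unquantized weights (illustrative only)."""

    def get_weights(self, prefix):
        return f"unquantized:{prefix}"


class Weights:
    def __init__(self, loader):
        self.loader = loader

    @contextmanager
    def use_loader(self, loader):
        """Temporarily replace the active loader, restoring it on exit."""
        previous = self.loader
        self.loader = loader
        try:
            yield
        finally:
            self.loader = previous

    def get_weights(self, prefix):
        # All loading goes through the active loader, so callers never
        # need to inspect a `quantizer` string.
        return self.loader.get_weights(prefix)


weights = Weights(QuantizedLoader())
text = weights.get_weights("text_model.layers.0")
with weights.use_loader(UnquantizedLoader()):
    vision = weights.get_weights("vision_model.layers.0")
after = weights.get_weights("text_model.layers.1")
```

Restoring the previous loader in a `finally` block means an exception raised while loading the vision layers cannot leave the wrong loader active for the rest of the model.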
-