"vscode:/vscode.git/clone" did not exist on "1afc21855eb1f5575bd61037a7ee44522ccf401e"
- 16 Jul, 2024 3 commits
-
-
Daniël de Kok authored
Fixes #2236.
-
Daniël de Kok authored
-
Daniël de Kok authored
Fixes #2036.
-
- 15 Jul, 2024 3 commits
-
-
Hugo Larcher authored
Remove bitsandbytes installation when running cpu-only install
-
Erik Kaunismäki authored
* fix to not ignore HUGGINGFACE_HUB_CACHE in cache
* delete printlns
* delete newlines
* maybe fix trailing whitespace
-
drbh authored
* feat: simple mistral lora integration tests
* fix: include args in docker launcher
* fix: disable cuda graphs with lora and warn
* fix: adjust docs and precommit issues
* fix: re update docs
-
- 12 Jul, 2024 2 commits
-
-
Daniël de Kok authored
Packing of asymmetric quantization is broken: all (q)zeros values of `0` get reset to `1`, resulting in a loss of accuracy. So use symmetric quantization instead. To be able to distinguish models with symmetric and asymmetric quantization, a new config tensor `gptq_sym` is added. If this tensor is not present, we assume `sym=False`.
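
The `gptq_sym` fallback described in this commit message could be sketched roughly as below. This is a minimal illustration, not the project's code: the function name `read_gptq_sym` and the use of a plain dictionary in place of the checkpoint are assumptions made for the example, as is the idea that the tensor holds a single 0/1 flag.

```python
from typing import Mapping, Sequence


def read_gptq_sym(tensors: Mapping[str, Sequence[int]]) -> bool:
    """Return whether a checkpoint was quantized symmetrically.

    `tensors` is a hypothetical name-to-values mapping standing in for the
    checkpoint. A missing `gptq_sym` entry is treated as `sym=False`.
    """
    value = tensors.get("gptq_sym")
    if value is None:
        # Tensor not present: assume asymmetric quantization.
        return False
    # Assumption for this sketch: the tensor stores a single 0/1 flag.
    return bool(value[0])
```

Defaulting to `False` keeps checkpoints that predate the new tensor readable without changing how they are interpreted.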
-
SeongBeomLEE authored
-
- 11 Jul, 2024 2 commits
-
-
drbh authored
* fix: append DONE message to chat stream
* fix: update completions endpoint
-
Daniël de Kok authored
Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs with compute capability >=8.0 and <8.9.
Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>
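
The compute capability range named in this commit message can be expressed as a simple tuple comparison. The sketch below is only an illustration: the function name `in_fp8_marlin_range` is hypothetical, and the `(major, minor)` tuple is assumed to have the shape that `torch.cuda.get_device_capability()` returns.

```python
from typing import Tuple


def in_fp8_marlin_range(capability: Tuple[int, int]) -> bool:
    """Check for compute capability >= 8.0 and < 8.9.

    `capability` is a (major, minor) pair. Python compares tuples
    element by element, so (8, 6) falls between (8, 0) and (8, 9).
    """
    return (8, 0) <= capability < (8, 9)
```

For example, `in_fp8_marlin_range((8, 6))` is true, while `(7, 5)`, `(8, 9)`, and `(9, 0)` all fall outside the range.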
-
- 09 Jul, 2024 4 commits
-
-
Daniël de Kok authored
Quantized weights were loaded in the `Weights` class, but this was getting quite unwieldy: every higher-level method to load weights was a long conditional covering all the different quantizers.

This change moves loading of quantized weights out of the `Weights` class. This is done by defining a simple `WeightsLoader` interface that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`, and `MarlinWeightsLoader`. These implementations are in the quantizers' respective modules. The `Weights` class provides the low-level load operations (such as loading tensors or sharded tensors), but delegates loads that need quantizer-specific weight processing to a loader. The loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights` would inherit from `Weights`. But that was not very flexible (e.g. it does not work well with the new weight storage mock used in tests), and the implicit indirections made the code harder to follow.
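
The delegation described in this commit message might look roughly like the sketch below. Only the names `Weights` and `WeightsLoader` come from the message; the method names, the dictionary-backed storage, and the two toy loaders (`DefaultLoader`, `ScaledLoader`) are assumptions made for illustration and do not reflect the actual signatures in the repository.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class WeightsLoader(ABC):
    """Hypothetical interface for quantizer-specific weight processing."""

    @abstractmethod
    def get_weights(self, weights: "Weights", prefix: str) -> List[float]:
        ...


class Weights:
    """Provides low-level loads and delegates the rest to a loader."""

    def __init__(self, tensors: Dict[str, List[float]], loader: WeightsLoader):
        self._tensors = tensors
        self._loader = loader

    def get_tensor(self, name: str) -> List[float]:
        # Low-level operation; loaders call back into this.
        return self._tensors[name]

    def get_weights(self, prefix: str) -> List[float]:
        # No per-quantizer conditional here: the loader decides.
        return self._loader.get_weights(self, prefix)


class DefaultLoader(WeightsLoader):
    """Toy loader for unquantized weights."""

    def get_weights(self, weights: Weights, prefix: str) -> List[float]:
        return weights.get_tensor(f"{prefix}.weight")


class ScaledLoader(WeightsLoader):
    """Toy stand-in for a quantized loader: multiplies by a stored scale."""

    def get_weights(self, weights: Weights, prefix: str) -> List[float]:
        scale = weights.get_tensor(f"{prefix}.scale")[0]
        return [q * scale for q in weights.get_tensor(f"{prefix}.qweight")]
```

Because the loader is passed in rather than inherited from, a test can swap in a mock storage or a different loader without subclassing `Weights`, which is the flexibility the message says the inheritance approach lacked.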
-
Nicolas Patry authored
* Updating the self check
* Fix.
* Revert the CLI.
* cli.
* Space.
* Revert cargo update.
-
vinkamath authored
Co-authored-by: Vinayak Kamath <Vinayak.Kamath@target.com>
-
Nicolas Patry authored
-
- 08 Jul, 2024 10 commits
-
-
Guillaume LEGENDRE authored
* Update build.yaml
* Update build.yaml
* change to S3 cache
* change to CPU Runners
* remove comments
-
fxmarty authored
* fix nccl issue
* add note in dockerfile
* use v2.22.3 that also fixes @samsamoa's repro
* poetry actually can't handle the conflict between torch and nccl
* set LD_PRELOAD
-
drbh authored
-
Wang, Yi authored
update to metrics 0.23.0 or could work with metrics-exporter-prometheus 0.15.1
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Javier Martinez authored
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
-
Daniël de Kok authored
We wouldn't allocate any memory for multi-query attention models (1 KV head). This fixes Starcoder et al.
-
Daniël de Kok authored
Fix number of KV heads
-
icyboy™ authored
* Update idefics_causal_lm.py: fix syntax issues
* fix dbrx & opt model prefix bug
-
- 05 Jul, 2024 6 commits
-
-
Daniël de Kok authored
* Consistently take `prefix` in model constructors
* Release test check fix
* Misc refactor-related fixes
-
Daniël de Kok authored
* Add more representative Llama GPTQ test
  The Llama GPTQ test is updated to use a model with the commonly-used quantizer config format and activation sorting. The old test is kept around (but renamed) since it tests the format produced by `text-generation-server quantize`.
* Add support for manually triggering a release build
-
Daniël de Kok authored
-
Nicolas Patry authored
-
Nicolas Patry authored
* Refactor dead code.
* First working step.
* Remove a lot of duplicated code.
* More dead code.
* More cleanup.
* Fix Santacoder test.
* Fixing the simple tests.
* Fixing sharding.
* Fixes for VLM.
* Fixing santacoder (num_kv_heads hardcoded).
* Removing more dead code.
* Fixing `config.n_head`.
* Stopping earlier because of `<end_of_utterance>` in idefics2.
* Addresses comments.
* Removing the dead code.
* Fuse back mistral into FlashCausalLM.
* Finish removal.
* Fixing docs + causal_lm `batch_class`.
* Fixing docs + causal.lm.
* Add default to Gemma Causality.
* Default value for gemma/gemma2.
* Wrong default.
-
Aaron Mihalik authored
Adding "longrope" for phi-3
-
- 04 Jul, 2024 1 commit
-
-
Nicolas Patry authored
-
- 03 Jul, 2024 5 commits
-
-
Nicolas Patry authored
* Fixing missing `object` field for regular completions.
* Fixing docs by re-adding missing `Prompt`.
-
Nicolas Patry authored
-
Nicolas Patry authored
This reverts commit 2bbb7fa4.
-
Nicolas Patry authored
-
drbh authored
* feat: add pre commit step to force schema update when router changes
* fix: prefer improved update_doc and start server and compare
* fix: adjust typo
* fix: adjust revert typo
* fix: update workflow to use update_doc md command
* feat: improve workflow to check openapi schema too
* fix: adjust timeout for CI
* fix: adjust raise condition and install server in ci
* fix: install protoc before server
* feat: improve update doc and add command to print router schema
* fix: adjust autodoc workflow
* fix: explicitly install protoc and python
* fix: allow trailing space in openapi schema diff
-
- 02 Jul, 2024 4 commits
-
-
Nicolas Patry authored
-
Guillaume LEGENDRE authored
* first test with registry mirror
* change push registry
* remove comments
* Move cache to push registry
* fix registry url
* Update .github/workflows/ci_build.yaml
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Nicolas Patry authored
-
drbh authored
-