- 09 Jul, 2024 2 commits
-
vinkamath authored
Co-authored-by: Vinayak Kamath <Vinayak.Kamath@target.com>
-
Nicolas Patry authored
-
- 08 Jul, 2024 10 commits
-
Guillaume LEGENDRE authored
* Update build.yaml
* Update build.yaml
* change to S3 cache
* change to CPU Runners
* remove comments
-
fxmarty authored
* fix nccl issue
* add note in dockerfile
* use v2.22.3 that also fixes @samsamoa's repro
* poetry actually can't handle the conflict between torch and nccl
* set LD_PRELOAD
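A minimal sketch of the idea behind the `LD_PRELOAD` fix: torch vendors its own NCCL, so forcing a pinned NCCL build requires the dynamic linker to resolve the pinned library first. The path and check below are illustrative assumptions, not the actual Dockerfile change.

```python
# Hypothetical sketch: verify that the pinned NCCL library is preloaded.
# PINNED_NCCL is an assumed install path, not taken from the repository.
PINNED_NCCL = "/usr/local/lib/libnccl.so.2.22.3"

def nccl_preload_ok(env: dict) -> bool:
    """Check that the pinned NCCL library appears in LD_PRELOAD."""
    # LD_PRELOAD is a colon-separated list of libraries loaded first.
    return PINNED_NCCL in env.get("LD_PRELOAD", "").split(":")

print(nccl_preload_ok({"LD_PRELOAD": PINNED_NCCL}))  # True
print(nccl_preload_ok({}))                           # False
```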
-
drbh authored
-
Wang, Yi authored
Update to metrics 0.23.0 so it can work with metrics-exporter-prometheus 0.15.1.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Javier Martinez authored
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
-
Daniël de Kok authored
We wouldn't allocate any memory in multi-query mode (1 KV head). Fixes StarCoder et al.
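A sketch of the bug class described above, under assumed names: integer-dividing the KV head count by the tensor-parallel world size floors a multi-query model's single KV head to zero, so no cache memory gets allocated. This is illustrative, not the repository's actual allocation code.

```python
def kv_heads_per_shard(num_kv_heads: int, tp_world_size: int) -> int:
    # Buggy variant would be plain `num_kv_heads // tp_world_size`,
    # which yields 0 for multi-query (1 KV head) sharded over >1 GPUs.
    return max(1, num_kv_heads // tp_world_size)

def kv_cache_bytes(blocks: int, block_size: int, num_kv_heads: int,
                   head_dim: int, dtype_size: int = 2) -> int:
    # Factor of 2 covers the separate key and value tensors.
    return 2 * blocks * block_size * num_kv_heads * head_dim * dtype_size

# A multi-query model sharded over 2 GPUs still allocates a non-zero cache:
heads = kv_heads_per_shard(1, 2)
print(kv_cache_bytes(64, 16, heads, 128))  # 524288
```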
-
Daniël de Kok authored
Fix number of KV heads
-
icyboy™ authored
* Update idefics_causal_lm.py: fix syntax issues
* fix dbrx & opt model prefix bug
-
- 05 Jul, 2024 6 commits
-
Daniël de Kok authored
* Consistently take `prefix` in model constructors
* Release test check fix
* Misc refactor-related fixes
-
Daniël de Kok authored
* Add more representative Llama GPTQ test

  The Llama GPTQ test is updated to use a model with the commonly-used quantizer config format and activation sorting. The old test is kept around (but renamed) since it tests the format produced by `text-generation-server quantize`.

* Add support for manually triggering a release build
-
Daniël de Kok authored
-
Nicolas Patry authored
-
Nicolas Patry authored
* Refactor dead code.
* First working step.
* Remove a lot of duplicated code.
* More dead code.
* More cleanup.
* Fix Santacoder test.
* Fixing the simple tests.
* Fixing sharding.
* Fixes for VLM.
* Fixing santacoder (num_kv_heads hardcoded).
* Removing more dead code.
* Fixing `config.n_head`.
* Stopping earlier because of `<end_of_utterance>` in idefics2.
* Addresses comments.
* Removing the dead code.
* Fuse back mistral into FlashCausalLM.
* Finish removal.
* Fixing docs + causal_lm `batch_class`.
* Fixing docs + causal_lm.
* Add default to Gemma Causality.
* Default value for gemma/gemma2.
* Wrong default.
-
Aaron Mihalik authored
Adding "longrope" for Phi-3
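"longrope" is a rope-scaling variant; a common way such a change lands is adding the new type to the accepted values read from the model config's `rope_scaling` dict. The sketch below is an assumption about the dispatch shape, not the repository's actual code.

```python
# Hypothetical dispatch: the set of supported types is assumed,
# with "longrope" (used by Phi-3) newly added.
SUPPORTED_ROPE_TYPES = {"linear", "dynamic", "yarn", "su", "longrope"}

def rope_type(config: dict) -> str:
    """Read the rope scaling type from a model config dict."""
    scaling = config.get("rope_scaling") or {}
    t = scaling.get("type", "default")
    if t != "default" and t not in SUPPORTED_ROPE_TYPES:
        raise ValueError(f"Unsupported rope scaling type: {t}")
    return t

print(rope_type({"rope_scaling": {"type": "longrope"}}))  # longrope
print(rope_type({}))                                      # default
```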
-
- 04 Jul, 2024 1 commit
-
Nicolas Patry authored
-
- 03 Jul, 2024 5 commits
-
Nicolas Patry authored
* Fixing missing `object` field for regular completions.
* Fixing docs by re-adding missing `Prompt`.
-
Nicolas Patry authored
-
Nicolas Patry authored
This reverts commit 2bbb7fa4.
-
Nicolas Patry authored
-
drbh authored
* feat: add pre commit step to force schema update when router changes
* fix: prefer improved update_doc and start server and compare
* fix: adjust typo
* fix: adjust revert typo
* fix: update workflow to use update_doc md command
* feat: improve workflow to check openapi schema too
* fix: adjust timeout for CI
* fix: adjust raise condition and install server in ci
* fix: install protoc before server
* feat: improve update doc and add command to print router schema
* fix: adjust autodoc workflow
* fix: explicitly install protoc and python
* fix: allow trailing space in openapi schema diff
-
- 02 Jul, 2024 6 commits
-
Nicolas Patry authored
-
Guillaume LEGENDRE authored
* first test with registry mirror
* change push registry
* remove comments
* Move cache to push registry
* fix registry url
* Update .github/workflows/ci_build.yaml

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Nicolas Patry authored
-
drbh authored
-
Wang, Yi authored
Install triton because GPTQParams needs it.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
-
- 01 Jul, 2024 10 commits
-
Nicolas Patry authored
* Using flash decoding

  Conditional flashdecoding. Fix max_q. Working kvcache. Working version with flash decoding. Make it work for mistral. Fix after rebase. Less intrusive. Revert changes in modeling. Speedup flashdecoding. Hack to make other models work. Fixing non flash decoding llama path. Router logic knows about page size. Missing 2 models. Missing cohere. Fixing cohere flash decoding. Revamped all this architecture. Fix cohere. Fixing falcon. Enabling custom block size schedule. Update router/src/infer.rs. Not sending preallocated output.

* Making it work on non flash decoding.
* Fix Cohere.
* Fix non decoding paths.
* Rebased.
* No need for cache_manager anymore.
* Update?
* "ipex" -> "cpu"
* These do not belong.
* Factoring cu_seqlen_qk for better abstracting over every model.
* Fixing non flash tests/imports.
* Changing return everywhere.
* Update mistral past.
* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).
* Fixup mistral clamping (had issues with cuda graphs).
* No need to recreate anything actually.
-
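"Router logic knows about page size" suggests the scheduler must account for KV-cache blocks in units of the attention kernel's page size. A minimal sketch of that accounting, with an assumed page size rather than the repository's actual constant:

```python
import math

# BLOCK_SIZE is an assumed flash-decoding page size, not the repo's value.
BLOCK_SIZE = 256

def blocks_needed(prompt_tokens: int, max_new_tokens: int) -> int:
    """KV-cache blocks a request can occupy at its maximum length."""
    total = prompt_tokens + max_new_tokens
    return math.ceil(total / BLOCK_SIZE)

print(blocks_needed(1000, 300))  # ceil(1300 / 256) = 6
```

With a larger page size, the router reserves coarser-grained chunks, so over-estimating per-request block needs matters more for batch admission.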
Nicolas Patry authored
-
Nicolas Patry authored
-
Nicolas Patry authored
-
drbh authored
* fix: prefer enum for chat object
* fix: adjust typo
* fix: enum CompletionType not ObjectType
* fix: adjust typo
* feat: leverage serde for conditional deser
* fix: adjust HubTokenizerConfig after rebase
* fix: update create_post_processor logic for token type
* fix: adjust unwrap syntax in template
* Fixing the post processor.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Wang, Yi authored
* refine get xpu free memory
* enable qwen2 in xpu
* enable gemma/gemma2/phi in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
GPTQ-Marlin is currently the best-performing kernel for GPTQ models, so use it by default when the kernels are installed, the GPU supports it, and the kernels support the configuration. For models generated by `text-generation-server quantize`, use `sym=False`: this subcommand has used asymmetric quantization since the beginning, and incorrectly reporting the model as symmetric would select GPTQ-Marlin (which does not support asymmetric quantization).
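The selection rule above can be sketched as a predicate; the function name, minimum compute capability, and supported bit widths are assumptions for illustration, not the repository's actual code.

```python
def use_gptq_marlin(kernels_installed: bool, gpu_capability: tuple,
                    bits: int, sym: bool) -> bool:
    """Prefer the GPTQ-Marlin kernel only when every precondition holds."""
    if not kernels_installed:
        return False
    if gpu_capability < (8, 0):   # assumed minimum compute capability
        return False
    if bits not in (4, 8):        # assumed supported bit widths
        return False
    return sym                    # Marlin lacks asymmetric support

# A `text-generation-server quantize` checkpoint reports sym=False,
# so it falls back to the regular GPTQ kernel:
print(use_gptq_marlin(True, (8, 0), 4, sym=False))  # False
print(use_gptq_marlin(True, (9, 0), 4, sym=True))   # True
```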
-
drbh authored
-
drbh authored
-