"vscode:/vscode.git/clone" did not exist on "92e4e9c650fafb42294a80a42b6d394e10b5f3c4"
- 01 Jul, 2024 8 commits
-
-
drbh authored
* fix: prefer enum for chat object
* fix: adjust typo
* fix: enum CompletionType not ObjectType
* fix: adjust typo
* feat: leverage serde for conditional deser
* fix: adjust HubTokenizerConfig after rebase
* fix: update create_post_processor logic for token type
* fix: adjust unwrap syntax in template
* Fixing the post processor.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Wang, Yi authored
* refine get xpu free memory
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable qwen2 in xpu
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable gemma/gemma2/phi in intel platform
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
-
Daniël de Kok authored
GPTQ-Marlin is currently the best-performing kernel for GPTQ models, so let's use it by default if the kernels are installed, the GPU supports it, and the kernels support the configuration. For models generated by `text-generation-server quantize`, use `sym=False`: this subcommand has used asymmetric quantization since the beginning, and incorrectly reporting the model as symmetric would cause GPTQ-Marlin (which does not support asymmetric quantization) to be selected.
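A minimal sketch of the default-selection rule described above; the function and parameter names are illustrative, not the actual text-generation-inference code, and the supported bit-widths are an assumption.

```python
# Illustrative sketch only: names and the bit-width check are assumptions,
# not the actual text-generation-inference selection code.
def should_use_gptq_marlin(kernels_installed: bool, gpu_supported: bool,
                           bits: int, sym: bool) -> bool:
    """Prefer GPTQ-Marlin when the kernels exist, the GPU can run them,
    and the quantization config is one they support (symmetric only)."""
    config_supported = sym and bits in (4, 8)  # assumed supported bit-widths
    return kernels_installed and gpu_supported and config_supported
```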
-
drbh authored
-
drbh authored
-
Nicolas Patry authored
-
Wang, Yi authored
* fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_indices]
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
- 28 Jun, 2024 1 commit
-
-
Nicolas Patry authored
-
- 27 Jun, 2024 6 commits
-
-
drbh authored
* fix: refactor post_processor logic and add test
* fix: remove dev comment
* fix: adjust when post_processor is overridden and improve create_post_processor
-
Nicolas Patry authored
* Fixing gemma2.
* Adding new model.
-
Nicolas Patry authored
* Fixing malformed rust tokenizers
* Fix for deepseek too.
-
Daniël de Kok authored
Before this change, the number of reserved image tokens was not the same as the number of images. Fixes #2029. While at it, also remove all the image token handling duplication in `prepare_input`.
-
Nicolas Patry authored
-
Nicolas Patry authored
-
- 25 Jun, 2024 15 commits
-
-
drbh authored
-
Daniël de Kok authored
This change adds support for 2:4 sparsity when using Marlin quantization. The 2:4 kernel is used when:
* the quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.
Fixes #2098.
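The dispatch condition above amounts to a simple check; the sketch below restates it in Python (function and argument names are illustrative, not the actual code).

```python
# Illustrative restatement of the condition above; names are assumptions.
def use_marlin_24_kernel(quantize: str, checkpoint_format: str | None) -> bool:
    """Select the 2:4-sparse Marlin kernel only for `marlin` quantization
    with a `marlin_24` checkpoint format."""
    return quantize == "marlin" and checkpoint_format == "marlin_24"
```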
-
Daniël de Kok authored
When the AWQ quantizer was used with a layer that uses a bias, the bias tensor was not correctly passed/used. Instead, the value `true`/`1.0` was added to the linear transformation. Correctly pass through the bias when it is not `None`. Fixes #2106.
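A minimal sketch of the fix described above, using a plain linear layer as a stand-in for the AWQ kernel (the real code calls a quantized matmul); names are illustrative.

```python
import torch

# Minimal sketch of the bias fix; a plain matmul stands in for the AWQ kernel.
def linear_forward(x: torch.Tensor, weight: torch.Tensor,
                   bias: torch.Tensor | None) -> torch.Tensor:
    out = x @ weight.t()
    # The fix: add the actual bias tensor only when it exists, instead of
    # passing a boolean that ends up being broadcast into the output as 1.0.
    if bias is not None:
        out = out + bias
    return out
```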
-
drbh authored
* feat: first draft load multiple lora
* feat: load weights within layer and refactor lora pass
* fix: refactor and reduce lora math
* feat: baseline impl single request multi lora support
* feat: prefer lorax implementation and port loading logic
* fix: prefer adapter_data and refactors
* feat: prefer lorax's custom punica kernels and add mlp loras
* fix: adjust batch for bgmv
* fix: adjust adapter_segments logic when in batch
* fix: refactor and move changes to v3 proto
* fix: pass model_id for all flash causal lms
* fix: pass model_id for all causal and seq2seq lms
* fix: add model_id to model test
* feat: add lora support to mistral and refactors
* feat: prefer model id in request
* fix: include rust code for adapter id
* feat: bump launcher and add new lora docs
* feat: support base model generation and refactors
* fix: rename doc to retry ci build
* feat: support if vlm models
* fix: add adapter_data param and avoid missing layers
* fix: add adapter_data param to phi and neox
* fix: update all models forwards to include adapter_data
* fix: add model_id to IdeficsCausalLM
* Update lora.md: fixed a typo
* Update lora.md: fixing spam image
* fix: add lora kernel to dockerfile, support running without kernels and refactors
* fix: avoid dockerfile conflict
* fix: refactors and adjust flash llama lora logic
* fix: skip llama test due to CI issue (temp)
* fix: skip llama test CI (temp) 2
* fix: revert skips and prefer updated ci token for tests
* fix: refactors and helpful comments
* fix: add noop in TensorParallelAdapterRowLinear too
* fix: refactor and move shard_lora_weights logic
* fix: exit early if no adapter_data
---------
Co-authored-by: Derek <datavistics@gmail.com>
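A hypothetical usage sketch of the single-base-model, multi-adapter flow this change describes; the endpoint path, port, and the `adapter_id` parameter name are assumptions drawn from the bullet list ("prefer model id in request", new lora docs), not verified against the released API.

```python
import requests

# Hypothetical request sketch: the port, endpoint, and `adapter_id` parameter
# name are assumptions, not verified text-generation-inference API details.
response = requests.post(
    "http://localhost:3000/generate",
    json={
        "inputs": "What is deep learning?",
        "parameters": {
            "adapter_id": "my-org/my-lora-adapter",  # assumed parameter name
            "max_new_tokens": 64,
        },
    },
)
print(response.json())
```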
-
Nicolas Patry authored
Fix clippy.
-
Daniël de Kok authored
* Add pytest release marker
  Annotate a test with `@pytest.mark.release` and it only gets run with `pytest integration-tests --release`.
* Mark many models as `release` to speed up CI
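For illustration, a test annotated as the commit describes; the test name and body are placeholders.

```python
import pytest

# Placeholder test: with the marker below it is skipped by default and only
# runs when the integration tests are invoked with `--release`.
@pytest.mark.release
def test_generates_with_large_model():
    assert True  # stand-in for a real integration check
```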
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
* Removing IPEX_AVAIL.
  Chose to unify CPU and XPU under `ipex`. Most of the code is identical except for a very few spots, the biggest being the kv-cache layout and the flash_xxx.py files. Since those files should be removed soon and factored away, we should not need them.
* Forgot a few places.
* Unrelated change.
* Fixing HF_TOKEN.
* HF_TOKEN
-
drbh authored
* feat: add simple tests for weights
* fix: adjust types and add tests
* fix: adjust so all tests pass
* feat: improve weight tests
* fix: add missing tests and renames
* fix: tweak shapes
-
Wang, Yi authored
* add CPU tgi support
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* ipex distributed ops support
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
-
sunxichen authored
Fix the ChatCompletion and ChatCompletionChunk `object` strings not being compatible with the standard OpenAI API (#2089).
Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>
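For reference, the `object` values the standard OpenAI chat API uses (well-known constants shown here for context, not taken from the TGI code):

```python
# Standard OpenAI `object` strings for chat responses; the fix above aligns
# TGI's ChatCompletion/ChatCompletionChunk serialization with them.
CHAT_COMPLETION_OBJECT = "chat.completion"              # non-streaming responses
CHAT_COMPLETION_CHUNK_OBJECT = "chat.completion.chunk"  # streaming (SSE) chunks
```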
-
Wang, Yi authored
* use xpu-smi to dump used memory
  XPU uses "ZE_AFFINITY_MASK" to control card selection; usage is like CUDA_VISIBLE_DEVICES.
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Update server/text_generation_server/utils/import_utils.py
  Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
-
Jeff authored
* corrected Pydantic warning.
* Update clients/python/text_generation/types.py
  Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
-
KevinDuffy94 authored
* Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069
* Update Docs
* Update README.md
* Update Launcher Docs
* Update Launcher Docs Removing Option
-
Lucain authored
* Support HF_TOKEN environment variable
* Load test.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
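A minimal sketch of relying on the variable, assuming the server reads it at startup; the token value is a placeholder.

```python
import os

# Sketch only: set the standard Hugging Face token variable before launching
# the server so gated/private models can be downloaded. Placeholder value.
os.environ["HF_TOKEN"] = "hf_xxx"
```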
-
- 24 Jun, 2024 2 commits
-
-
ur4t authored
* Fix cargo-chef prepare
  In the prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly. If Cargo.lock is not present, cargo-chef will first generate a new one, which might vary a lot and invalidate Docker build caches.
* Fix Dockerfile_amd and Dockerfile_intel
-
Nicolas Patry authored
* New runner. Manual squash.
* Network host.
* Put back trufflehog with proper extension.
* No network host?
* Moving buildx install after tailscale?
* 1.79
-
- 21 Jun, 2024 2 commits
-
-
drbh authored
-
Daniël de Kok authored
The subcommand did not work due to some broken imports.
-
- 20 Jun, 2024 2 commits
-
-
Daniël de Kok authored
For Phi-3-Small I need to shard a packed QKV bias tensor, for which I implemented the `Weights.get_packed_sharded` method. However, this method can also replace the `Weights._get_qweight` method and the custom sharding code from `Weights.get_weights_col_packed`.
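An illustrative sketch of what sharding a packed tensor involves (function and argument names are assumptions, not the actual `Weights.get_packed_sharded` signature): split the packed dimension into its Q/K/V blocks, take this rank's slice of each block, and re-pack them.

```python
import torch

# Illustrative only; not the actual Weights.get_packed_sharded implementation.
def shard_packed_qkv(packed: torch.Tensor, block_sizes: list[int],
                     rank: int, world_size: int) -> torch.Tensor:
    """Shard a 1-D packed QKV tensor (e.g. a bias) across tensor-parallel ranks."""
    shards, start = [], 0
    for size in block_sizes:                 # e.g. [q_size, kv_size, kv_size]
        block = packed[start:start + size]
        shard = size // world_size
        shards.append(block[rank * shard:(rank + 1) * shard])
        start += size
    return torch.cat(shards, dim=0)
```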
-
Daniël de Kok authored
Fixes #2081.
-
- 19 Jun, 2024 1 commit
-
-
drbh authored
-
- 18 Jun, 2024 2 commits
-
-
Daniël de Kok authored
-
Guillaume LEGENDRE authored
* test local tailscale
* Update build.yaml
* Update build.yaml
* Update build.yaml
* Update build.yaml
* wait for ssh
* network host
* change step order
-
- 17 Jun, 2024 1 commit
-
-
Daniël de Kok authored
* Set maximum grpc message receive size to 2GiB
  The previous default was 4MiB, which doesn't really work well for multi-modal models.
* Update to Rust 1.79.0
* Fixup formatting to make PR pass
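For illustration, this is how such a receive limit is typically raised with the standard Python gRPC options; a generic sketch, not the actual TGI change.

```python
import grpc

# Generic sketch of raising gRPC message limits; not the actual TGI code.
MAX_MESSAGE_SIZE = 2 * 1024 * 1024 * 1024 - 1  # just under 2 GiB (int32 ceiling)

server = grpc.aio.server(
    options=[
        ("grpc.max_receive_message_length", MAX_MESSAGE_SIZE),
        ("grpc.max_send_message_length", MAX_MESSAGE_SIZE),
    ]
)
```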
-