"vscode:/vscode.git/clone" did not exist on "0de1de3dcd1f092e0ee268b84561d5dc93767144"
- 15 Nov, 2024 2 commits
-
-
Nicolas Patry authored
* Upgrading our deps. * fixup. * Fixup.
-
Alex Weston authored
* Upgrade outlines to 0.1.1 * Update for new API * Check if allowed tokens is None --------- Co-authored-by:Nicolas Patry <patry.nicolas@protonmail.com>
-
- 10 Nov, 2024 1 commit
-
-
Daniël de Kok authored
compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because - Different quantizer configurations can be used for different targets. - The format can specify input/output quantizers in addition to weight quantizers. - Configurable exclusions for quantization. This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR: - W8A16 and W4A16 INT using GPTQ-Marlin kernels. - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels. Support for other quantization types will be added in subsequent PRs.
-
- 24 Oct, 2024 1 commit
-
-
Daniël de Kok authored
* Add support for FP8 KV cache scales Since FP8 only has limited dynamic range, we can scale keys/values before storing them into the cache (and unscale them in attention). To avoid rescaling the cache as the absmax values change, good scales are usually determined per layer using calibration calibration data and stored in the checkpoint. This change adds support for for using key-value scales and loading them from checkpoints in the two most common formats: - Separate per-layer `k_scale` and `v_scale` scalars. - Per-layer `kv_scale` scalar (older format). Currently, scales are only used with an `float8_e4m3fn` cache. Besides adding support for key/value scales, the `fp8_quantize` function is also extended to support quantization with a kernel vendored from vLLM. This is slightly faster than the PyTorch implementation, but also scales in FP32, potentially improving accuracy. * Update FP8 KV cache test to use checkpoint with scales * `can_scale`: check that the attention is flashinfer
-
- 16 Oct, 2024 2 commits
-
-
OlivierDehaene authored
* wip * rollback * refactor to use prefix/postfix namming + fix all_input_ids_tensor * maybe patching vlms? * fix filter and concat * wip, no filter, no concat * current * add prepare_for_prefill * working * load tested * re-create slots * re-create slots * fix slot_filtering_indices * feedback loop * remove log * fix benchmarker * fix vlm and seq2seq * rename to cache and input lengths * fix prefill logprobs * fix launcher * fix logprobs? * idk at this point * max input length * omfg * remove debugging lines * fix tests * fix mllama * fix cargo tests * remove support chunking for paged * Fixing non blocked attentions * Fixing dtype + AMD, Ipex targets. * lint fix. * rename * Fix prefix_caching variable, remove defaults in server (confusing a lot of the times). * Add simple resolution when user specifies ATTENTION=paged. * Put back non default simple tests. * Fix env name --------- Co-authored-by:Nicolas Patry <patry.nicolas@protonmail.com>
-
Mohit Sharma authored
* (feat) fp8 fnuz support for rocm * (review comments) Fix compression_config load, type hints * (bug) update all has_tensor * (review_comments) fix typo and added comments * (nit) improved comment
-
- 14 Oct, 2024 1 commit
-
-
Dmitry Rogozhkin authored
XPU backend is available natively (without IPEX) in pytorch starting from pytorch 2.4. This commit extends TGI to cover the case when user has XPU support thru pytorch 2.4, but does not have IPEX installed. Models which don't require attention can work. For attention required models more work is needed to provide attention implementation. Tested with the following models: * teknium/OpenHermes-2.5-Mistral-7B * bigscience/bloom-560m * google/gemma-7b * google/flan-t5-xxl Signed-off-by:Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
-
- 24 Sep, 2024 1 commit
-
-
Nicolas Patry authored
-
- 17 Sep, 2024 1 commit
-
-
Daniël de Kok authored
* Move to moe-kernels package and switch to common MoE layer This change introduces the new `moe-kernels` package: - Add `moe-kernels` as a dependency. - Introduce a `SparseMoELayer` module that can be used by MoE models. - Port over Mixtral and Deepseek. * Make `cargo check` pass * Update runner
-
- 12 Sep, 2024 1 commit
-
-
drbh authored
fix: pass missing revision arg for lora adapter when loading multiple adapters
-
- 02 Sep, 2024 1 commit
-
-
drbh authored
* feat: support lora revisions and qkv_proj weights * fix: add qkv_proj weights to weight test
-
- 12 Aug, 2024 1 commit
-
-
drbh authored
* fix: allocate tmp based on sgmv kernel if available * fix: re add copy build artifacts step for punica kernels
-
- 08 Aug, 2024 1 commit
-
-
drbh authored
* hotfix: fix xpu crash brought by code refine. torch.xpu rely on import ipex Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * reable gemma2 in xpu Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * fix in regression in ipex flashattention Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> Co-authored-by:
Wang, Yi A <yi.a.wang@intel.com>
-
- 31 Jul, 2024 1 commit
-
-
Daniël de Kok authored
The `GPTWeightLoader` was structured like this in pseudocode: if marlin: Set up tensors in a way that GPTQ-Marlin expects else: Set up tensors in a way that ExLlama/GPTQ/AWQ expect However, the GPT-Marlin implementation details should really be in the `marlin` module. So move the former part out to a separate `GPTQMarlinWeightsLoader`.
-
- 29 Jul, 2024 1 commit
-
-
Erik Kaunismäki authored
* quick fix * allow silent failure * explicit todo that this is only short term
-
- 26 Jul, 2024 1 commit
-
-
drbh authored
* feat: add ruff and resolve issue * fix: update client exports and adjust after rebase * fix: adjust syntax to avoid circular import * fix: adjust client ruff settings * fix: lint and refactor import check and avoid model enum as global names * fix: improve fbgemm_gpu check and lints * fix: update lints * fix: prefer comparing model enum over str * fix: adjust lints and ignore specific rules * fix: avoid unneeded quantize check
-
- 24 Jul, 2024 1 commit
-
-
drbh authored
* fix: refactor adapter weight loading and mapping * feat: enable lora load from directory * fix: adjust launcher for local lora adapters * feat: improve weight loading and add tests * fix: improve logging and rebase syntax issue * fix: impove adapter merge comments and remove unused conditional * fix: improve get_model_with_lora_adapters naming * fix: comment typo
-
- 23 Jul, 2024 1 commit
-
-
Daniël de Kok authored
* Add support for repacking AWQ weights for GPTQ-Marlin So far we couldn't support AWQ because virtually all AWQ models use symmetric quantization, which GPTQ-Marlin did not suppors. GPTQ-Marlin has recently added support AWQ repacking and AWQ asymmetric quantization (zero_point=True). This change updates all GPTQ-Marlin kernels from upstream and wires up AWQ support. For now enabling AWQ using Marlin requires running TGI with `--quantize gptq`. * Enable Marlin for supported AWQ configurations by default This makes the AWQ -> GPTQ repack test redundant, since we are now testing this with the regular AWQ test.
-
- 22 Jul, 2024 1 commit
-
-
OlivierDehaene authored
* fix(server): fix fp8 weight loading * fixed scales loading * update snap * revert default dtype
-
- 20 Jul, 2024 1 commit
-
-
OlivierDehaene authored
* feat(fp8): add support for fbgemm * allow loading fp8 weights directly * update outlines * fix makefile * build fbgemm * avoid circular import and fix dockerfile * add default dtype * refactored weights loader * fix auto conversion * fix quantization config parsing * force new nccl on install * missing get_weights implementation * increase timeout
-
- 19 Jul, 2024 2 commits
-
-
Daniël de Kok authored
Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models: - Grouped top-K in expert selection. - mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options. - `mscale_all_dim` is also used in scaling attention softmax. - Permuting of the query/key representations before applying rotary embeddings. - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`). So, we need weight loads that supports quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added. - The query/key head dimensionality differs from that of the value, so we need to pad during attention. - Heads with size 192, needs an extension to our paged attention fork and we need to ensure that the KV cache is allocated with the correct size. - Shared experts. -
Daniël de Kok authored
* Improve the handling of quantized weights Handling of quantized weights was split between two mechanisms: - For quantized checkpoints, we used the new weight loader infrastructure. - For quantization while loading (EETQ, FP8, bitsandbytes) we instead relied on conditional in `get_linear`. Weight loaders support context managers to selectively load particular layers with different weight loaders, which is useful for models like Idefics2 AWQ, which uses a quantized text model, but unquantized vision and connector models. However, the context manager would be overrided by `get_linear`, which string-checks `quantizer`. Also, the context manager would not work with EETQ, FP8, and bitsandbytes. This change migrates all quantizers to the weight loader infrastructure. This has several benefits: - We can use context managers with all quantizers. - All the implementation details move down to the quantizer layers, `get_linear` does not need to know how to handle quantizer linear layers. - All quantizer weights are strongly typed, we don't pass around raw tensors. - We don't have to pass around the `quantizer` string everywhere. * Exclude non-MLP layers when using FP8 quantization with Llama
-
- 16 Jul, 2024 2 commits
-
-
Daniël de Kok authored
Fixes #2236.
-
Daniël de Kok authored
Fixes #2036.
-
- 12 Jul, 2024 1 commit
-
-
Daniël de Kok authored
Packing of asymmetric quantization is broken, all (q)zeros values of `0` get reset to `1`, resulting in a loss of accuracy. So instead use symmetric quantization. To be able to distinguish models with symmetric and asymmetric quantization, a new config tensor `gptq_sym` is added. If this tensor is not present, we assume `sym=False`.
-
- 09 Jul, 2024 1 commit
-
-
Daniël de Kok authored
Quantized weights were loaded in the `Weights` class, but this was getting quite unwieldy, where every higher level method to load weights was a long conditional to cover all the different quantizers. This change moves loading of quantized weights out of the `Weights` class. This is done by defining a simple `WeightsLoader` interface that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`, and `MarlinWeightsLoader`. These implementations are in the quantizers' respective modules. The `Weights` class provides the low-level load operations (such as loading tensors or sharded tensors), but delegates loads that need quantizer-specific weight processing to a loader. The loaders still use the low-level functionality provided by `Weights`. I initially tried making a hierarchy where a class like `GPTQWeights` would inherit from `Weights`. But it is not very flexible (e.g. does not work well with the new weight storage mock used in tests) and the implicit indirections made the code harder to follow.
-
- 01 Jul, 2024 2 commits
-
-
Wang, Yi authored
* refine get xpu free memory Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * enable qwen2 in xpu Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * enable gemma/gemma2/phi in intel platform Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So let's use it by default if the kernels are installed, the GPU supports it, and the kernels support the configuration. For models generated by `text-generation-server quantize`, use `sym=False`. This subcommand symmetric quantization since the beginning and incorrectly reporting the model to be symmetric will use GPTQ-Marlin (which does not support asymmetric quantization).
-
- 25 Jun, 2024 5 commits
-
-
Daniël de Kok authored
This change adds support for 2:4 sparsity when using Marlin quantization. The 2:4 kernel is used when: * The quantizer is `marlin`; * the quantizer checkpoint format is `marlin_24`. Fixes #2098.
-
drbh authored
* feat: first draft load multiple lora * feat: load weights within layer and refactor lora pass * fix: refactor and reduce lora math * feat: baseline impl single request multi lora support * feat: prefer lorax implementation and port loading logic * fix: prefer adapter_data and refactors * feat: perfer loraxs custom punica kernels and add mlp loras * fix: adjust batch for bgmv * fix: adjust adapter_segments logic when in batch * fix: refactor and move changes to v3 proto * fix: pass model_id for all flash causal lms * fix: pass model_id for all causal and seq2seq lms * fix: add model_id to model test * feat: add lora support to mistral and refactors * feat: prefer model id in request * fix: include rust code for adapter id * feat: bump launcher and add new lora docs * feat: support base model generation and refactors * fix: rename doc to retry ci build * feat: support if vlm models * fix: add adapter_data param and avoid missing layers * fix: add adapter_data param to phi and neox * fix: update all models forwards to include adapter_data * fix: add model_id to IdeficsCausalLM * Update lora.md Fixed a typo * Update lora.md Fixing spam image * fix: add lora kernel to dockerfile, support running without kernels and refactors * fix: avoid dockerfile conflict * fix: refactors and adjust flash llama lora logic * fix: skip llama test due to CI issue (temp) * fix: skip llama test CI (temp) 2 * fix: revert skips and prefer updated ci token for tests * fix: refactors and helpful comments * fix: add noop in TensorParallelAdapterRowLinear too * fix: refactor and move shard_lora_weights logic * fix: exit early if no adapter_data --------- Co-authored-by:Derek <datavistics@gmail.com>
-
Nicolas Patry authored
* Removing IPEX_AVAIL. Chose to unify CPU and XPU under `ipex`. Most code is exactly similar except for a very few spots. The biggest number of spots is the kv-cache layout and the flash_xxx.py files. Since those files should be removed soon and factored away, we should not need them. * Forgot a few places. * Unrelated change. * Fixing HF_TOKEN. * HF_TOKEN
-
Wang, Yi authored
* add CPU tgi support Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * ipex distributed ops support Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> Co-authored-by:
Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
-
Wang, Yi authored
* use xpu-smi to dump used memory xpu use "ZE_AFFINITY_MASK" to control card, usage is like CUDA_VISIBLE_DEVICES Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * Update server/text_generation_server/utils/import_utils.py Co-authored-by:
Daniël de Kok <me@github.danieldk.eu> --------- Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> Co-authored-by:
Daniël de Kok <me@github.danieldk.eu>
-
- 20 Jun, 2024 1 commit
-
-
Daniël de Kok authored
For Phi-3-Small I need to shard a packed QKV bias tensor, for which I implemented the `Weights.get_packed_sharded` method. However, this method can also replace the `Weights._get_qweight` method and the custom sharding code from `Weights.get_weights_col_packed`.
-
- 14 Jun, 2024 1 commit
-
-
Daniël de Kok authored
Add support for GPTQ Marlin kernels GPTQ Marlin extends the Marlin kernels to support common GPTQ configurations: - bits: 4 or 8 - groupsize: -1, 32, 64, or 128 - desc_act: true/false Using the GPTQ Marlin kernels requires repacking the parameters in the Marlin quantizer format. The kernels were contributed by Neural Magic to VLLM. We vendor them here for convenience.
-
- 10 Jun, 2024 1 commit
-
-
Daniël de Kok authored
Add support for Phi-3-medium The main difference between the medium and mini models is that medium uses grouped query attention with a packed QKV matrix. This change adds support for GQA with packed matrixes to `Weights.get_weights_col_packed` and uses it for Phi-3. This also allows us to remove the custom implementation of GQA from dbrx attention loading.
-
- 07 Jun, 2024 1 commit
-
-
Daniël de Kok authored
The router will now send the input as chunks besides as a single string. This change modifies the server to process chunked input rather than strings. This also allows us to remove the image extraction code from the server.
-
- 06 Jun, 2024 2 commits
-
-
Daniël de Kok authored
-
Daniël de Kok authored
This change adds support for Marlin-quantized models. Marlin is an FP16xINT4 matmul kernel, which provides good speedups decoding batches of 16-32 tokens. It supports quantized models with symmetric quantization, groupsize -1 or 128, and 4-bit. Tested with: - Llama 2 - Llama 3 - Phi 3
-
- 04 Jun, 2024 1 commit
-
-
Daniël de Kok authored
# What does this PR do? The GPTQ code path for column-packed packed tensors assumed that this is always a QKV matrix. However, models (e.g. Phi-3) can also have column-packed MLP up/gate matrices. <!-- Remove if not applicable --> Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil -->
-