- 09 Aug, 2024 2 commits
-
-
Vaibhav Srivastav authored
* Minor doc fixes * up. * Other minor updates.
-
Daniël de Kok authored
This change adds support for FlashInfer. FlashInfer can be enabled using `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`. Since this functionality is currently only for testing, FlashInfer is not installed anywhere yet. The FlashInfer API is quite different from FlashAttention/vLLM in that it requires more global bookkeeping: * A wrapper class needs to be contstructed (which we just call *state*). Since this is fairly expensive (due to pinned host memory allocation), we only do this once in a FlashCausalLM instance or for each CUDA Graph size. * Each model forward call needs to be wrapped in `begin_forward` and `end_forward`. This sets up data structures that can be reused for all calls to attention for that forward call. When calling attention, we need access to the state object. To avoid passing an argument down the call chain (which would require changes to all models), we use a context variable. Each model forward call is wrapped using a context manager that does all the bookkeeping for such a call: * Set the context variable to the forward call's state. * Call `begin_forward` on the state. * Yield. * Call `end_forward` on the state. * Reset the context variable. We cannot use a single shared global variable for this, since e.g. CUDA Graphs of different sizes each have their own state.
-
- 08 Aug, 2024 5 commits
-
-
drbh authored
-
Wang, Yi authored
Signed-off-by:Wang, Yi A <yi.a.wang@intel.com>
-
drbh authored
* Update __init__.py Fix issue with NoneType comparison for max_input_tokens and sliding_window - Add default values for max_input_tokens and sliding_window to handle None cases. - Ensure the comparison between max_input_tokens and sliding_window is handled correctly to prevent TypeError. - This change addresses the error: TypeError: '<=' not supported between instances of 'int' and 'NoneType'. * Update __init__.py Handle NoneType in sliding_window comparison to fix TypeError in __init__.py by ensuring the comparison logic accounts for NoneType values, preventing errors and improving code robustness. * fix: syntax/style tweak --------- Co-authored-by:Praz <prazanth2006@gmail.com>
-
drbh authored
* Fix the bug * fix: run lints * fix: small syntax tweak --------- Co-authored-by:Sadra Barikbin <sadraqazvin1@yahoo.com>
-
drbh authored
* add gptj modeling Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * fix: update docs for model addition * fix: adjust syntax typo * fix: adjust syntax typo again --------- Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> Co-authored-by:
Wang, Yi A <yi.a.wang@intel.com>
-
- 07 Aug, 2024 1 commit
-
-
almersawi authored
Co-authored-by:Islam Almersawi <islam.almersawi@openinnovation.ai>
-
- 06 Aug, 2024 2 commits
- 05 Aug, 2024 1 commit
-
-
drbh authored
* fix: attempt forward on flash attn2 to check hardware support * fix: warn window_size_left when using flash attn 1 * fix: prefer version check over test op and avoid window_size_left if not flash attn2 * fix: improve condtional and error message * fix: update sliding window conditional * fix: simplify changes and revert model changes * fix: avoid changing conditional * fix: typo tweak
-
- 01 Aug, 2024 2 commits
-
-
Daniël de Kok authored
- Always return the hidden states. - Create the output tensor inside the `attention` and `paged_attention` functions. This removes the difference between how the output is handled between attention (output parameter) and paged attention (return value). This also removes the assumption that the attention implementation can write to an output tensor (in preparation of FlashInfer).
-
Wang, Yi authored
Signed-off-by:Wang, Yi A <yi.a.wang@intel.com>
-
- 31 Jul, 2024 1 commit
-
-
drbh authored
* MODEL_ID propagation fix * fix: remove global model id --------- Co-authored-by:root <root@tw031.pit.tensorwave.lan>
-
- 26 Jul, 2024 2 commits
-
-
drbh authored
* feat: add ruff and resolve issue * fix: update client exports and adjust after rebase * fix: adjust syntax to avoid circular import * fix: adjust client ruff settings * fix: lint and refactor import check and avoid model enum as global names * fix: improve fbgemm_gpu check and lints * fix: update lints * fix: prefer comparing model enum over str * fix: adjust lints and ignore specific rules * fix: avoid unneeded quantize check
-
Daniël de Kok authored
-
- 24 Jul, 2024 3 commits
-
-
drbh authored
* fix: refactor adapter weight loading and mapping * feat: enable lora load from directory * fix: adjust launcher for local lora adapters * feat: improve weight loading and add tests * fix: improve logging and rebase syntax issue * fix: impove adapter merge comments and remove unused conditional * fix: improve get_model_with_lora_adapters naming * fix: comment typo
-
Wang, Yi authored
fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform Signed-off-by:Wang, Yi A <yi.a.wang@intel.com>
-
Wang, Yi authored
* fix crash in multi-modal Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * update according to review comment Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * fix llava_next regression in latest main Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com>
-
- 23 Jul, 2024 2 commits
-
-
shaltielshmid authored
* Support passing head_dim through config * Using `head_dim` as a fallback is necessary since it's a non standard key in mistralConfig (as defined in transformers). * Shorter diff. --------- Co-authored-by:Nicolas Patry <patry.nicolas@protonmail.com>
-
Nicolas Patry authored
-
- 22 Jul, 2024 3 commits
-
-
Nicolas Patry authored
* Softcapping for gemma2. * Less clutter. * No access to transformers config, only config_dict here. * 0.0 is the null value in the C++ API.
-
OlivierDehaene authored
* fix(server): fix fp8 weight loading * fixed scales loading * update snap * revert default dtype
-
icyboy™ authored
* Update idefics_causal_lm.py Fix syntax issues * fix dbrx & opt model prefix bug * Hotfix: fix of use of unquantized weights in Mixtral GQA loading
-
- 21 Jul, 2024 1 commit
-
-
OlivierDehaene authored
-
- 20 Jul, 2024 1 commit
-
-
OlivierDehaene authored
* feat(fp8): add support for fbgemm * allow loading fp8 weights directly * update outlines * fix makefile * build fbgemm * avoid circular import and fix dockerfile * add default dtype * refactored weights loader * fix auto conversion * fix quantization config parsing * force new nccl on install * missing get_weights implementation * increase timeout
-
- 19 Jul, 2024 6 commits
-
-
Daniël de Kok authored
Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models: - Grouped top-K in expert selection. - mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options. - `mscale_all_dim` is also used in scaling attention softmax. - Permuting of the query/key representations before applying rotary embeddings. - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`). So, we need weight loads that supports quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added. - The query/key head dimensionality differs from that of the value, so we need to pad during attention. - Heads with size 192, needs an extension to our paged attention fork and we need to ensure that the KV cache is allocated with the correct size. - Shared experts. -
Daniël de Kok authored
-
Daniël de Kok authored
-
Daniël de Kok authored
-
Daniël de Kok authored
-
Daniël de Kok authored
* Improve the handling of quantized weights Handling of quantized weights was split between two mechanisms: - For quantized checkpoints, we used the new weight loader infrastructure. - For quantization while loading (EETQ, FP8, bitsandbytes) we instead relied on conditional in `get_linear`. Weight loaders support context managers to selectively load particular layers with different weight loaders, which is useful for models like Idefics2 AWQ, which uses a quantized text model, but unquantized vision and connector models. However, the context manager would be overrided by `get_linear`, which string-checks `quantizer`. Also, the context manager would not work with EETQ, FP8, and bitsandbytes. This change migrates all quantizers to the weight loader infrastructure. This has several benefits: - We can use context managers with all quantizers. - All the implementation details move down to the quantizer layers, `get_linear` does not need to know how to handle quantizer linear layers. - All quantizer weights are strongly typed, we don't pass around raw tensors. - We don't have to pass around the `quantizer` string everywhere. * Exclude non-MLP layers when using FP8 quantization with Llama
-
- 18 Jul, 2024 1 commit
-
-
OlivierDehaene authored
-
- 16 Jul, 2024 1 commit
-
-
Daniël de Kok authored
Fixes #2036.
-
- 09 Jul, 2024 1 commit
-
-
Daniël de Kok authored
Quantized weights were loaded in the `Weights` class, but this was getting quite unwieldy, where every higher level method to load weights was a long conditional to cover all the different quantizers. This change moves loading of quantized weights out of the `Weights` class. This is done by defining a simple `WeightsLoader` interface that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`, and `MarlinWeightsLoader`. These implementations are in the quantizers' respective modules. The `Weights` class provides the low-level load operations (such as loading tensors or sharded tensors), but delegates loads that need quantizer-specific weight processing to a loader. The loaders still use the low-level functionality provided by `Weights`. I initially tried making a hierarchy where a class like `GPTQWeights` would inherit from `Weights`. But it is not very flexible (e.g. does not work well with the new weight storage mock used in tests) and the implicit indirections made the code harder to follow.
-
- 08 Jul, 2024 4 commits
-
-
Daniël de Kok authored
-
Daniël de Kok authored
We wouldn't allocate any memory in multi-query (1 KV head). Fixes Starcoder et al.
-
Daniël de Kok authored
Fix number of KV heads
-
icyboy™ authored
* Update idefics_causal_lm.py Fix syntax issues * fix dbrx & opt model prefix bug
-
- 05 Jul, 2024 1 commit
-
-
Daniël de Kok authored
* Consistently take `prefix` in model constructors * Release test check fix * Misc refactor-related fixes
-