"docs/vscode:/vscode.git/clone" did not exist on "d38c804320192c3844ff0bc7deed83e8b8cb7856"
- 26 Aug, 2024 1 commit
-
-
drbh authored
* Fix: don't apply post layernorm in SiglipVisionTransformer. This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see the original transformers implementation: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813). This also makes Siglip consistent with the existing Clip implementation: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613
* fix: adjust pali gemma for post layer norm and small refactors

Co-authored-by: Travis Addair <tgaddair@gmail.com>
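The fix above can be sketched as follows. The class and attribute names are simplified stand-ins for the real TGI module (not the actual implementation), but they show the change: the vision tower now returns the encoder outputs without applying the post layernorm, which is what LLaVA Next expects.

```python
class SiglipVisionTransformerSketch:
    """Hypothetical sketch of the fix: return encoder outputs *before*
    the post layernorm, matching the existing Clip code path."""

    def __init__(self, embeddings, encoder, post_layernorm):
        # embeddings/encoder/post_layernorm stand in for the real submodules.
        self.embeddings = embeddings
        self.encoder = encoder
        self.post_layernorm = post_layernorm

    def forward(self, pixel_values):
        hidden_states = self.embeddings(pixel_values)
        encoder_outputs = self.encoder(hidden_states)
        # Fix: do NOT apply self.post_layernorm here. LLaVA Next consumes
        # the pre-layernorm encoder outputs.
        return encoder_outputs
```

With simple callables standing in for the submodules, the post layernorm is visibly skipped: the output equals the raw encoder output.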
-
- 20 Aug, 2024 1 commit
-
-
Nicolas Patry authored
* Prefix caching WIP
* Fixing prefix attention.
* Fixing flashinfer import.
* Fixing black.
* Fixing medusa (still wrong outputs, but functional).
* Just medusa values now.
* Fixing medusa without prefix caching.
* Fixing prefix caching.
* Medusa requires reshaping.
* Removing the logs.
* Remove router.nix
* Fixup:
  - Remove logs
  - Disable VLMs (they do not work)
  - Disable prefix caching when user wants prefill logprobs.
* Update flake.lock

Co-authored-by: Daniël de Kok <me@danieldk.eu>
-
- 15 Aug, 2024 1 commit
-
-
Nicolas Patry authored
* Fixing exl2 and other quantize tests again.
* Mark exl2 as non-release (so CI tests them; needs to be removed later).
* Fixing exl2 (by disabling cuda graphs).
* Fix quantization defaults without cuda graphs on exl2 (linked to new issues with it).
* Removing serde override.
* Go back to released exl2 and remove log.
* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
-
- 14 Aug, 2024 1 commit
-
-
Nicolas Patry authored
* Upgrading exl2.
* Fixing the other pathways.
* Fix idefics.
-
- 13 Aug, 2024 2 commits
-
-
drbh authored
fix: adds causal to attention params to check when using flash attn v1
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 12 Aug, 2024 5 commits
-
-
drbh authored
-
drbh authored
* fix: allocate tmp based on sgmv kernel if available
* fix: re-add copy build artifacts step for punica kernels
-
drbh authored
* feat: validate template variables before apply and improve sliding window check
* fix: improve missing template var test
-
Daniël de Kok authored
This change adds support for prefix caching to the v3 router. This is split out from the backend support to ease reviewing. For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`; in this case, the router switches to the `RadixAllocator`. This allocator uses a radix trie to keep track of previously seen prefills. If a new prefill is a prefix of a previously seen prefill, the router sends a request with `prefix_len>0`, which the backend can use to reuse KV blocks from the cache rather than recomputing them. Even though backend support is not added in this PR, the backend still works with prefix caching enabled; the prefix lengths are simply ignored.
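The idea behind the `prefix_len>0` signal can be illustrated with a toy prefix cache. This is a plain token-level trie rather than the compressed radix trie the actual `RadixAllocator` uses (and the real allocator also manages block reference counts and eviction), but the lookup semantics are the same: report how many leading tokens of a new prefill were already seen.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> TrieNode

class ToyPrefixCache:
    """Toy sketch (not the actual RadixAllocator): track seen prefills
    and report the longest previously seen prefix of a new request."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Record a completed prefill's token sequence."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def longest_prefix(self, tokens):
        """Return prefix_len: how many leading tokens were seen before.
        The backend could reuse that many cached KV entries."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n
```

For example, after inserting the prefill `[1, 2, 3, 4]`, a new prefill `[1, 2, 3, 9]` would be sent with `prefix_len=3`.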
-
Nicolas Patry authored
-
- 09 Aug, 2024 3 commits
-
-
Nicolas Patry authored
* Using an enum for flash backends (paged/flashdecoding/flashinfer)
* Early exit on server too.
* Clippy.
* Fix clippy and fmt.
-
Vaibhav Srivastav authored
* Minor doc fixes
* up.
* Other minor updates.
-
Daniël de Kok authored
This change adds support for FlashInfer. FlashInfer can be enabled using `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`. Since this functionality is currently only for testing, FlashInfer is not installed anywhere yet. The FlashInfer API is quite different from FlashAttention/vLLM in that it requires more global bookkeeping:

* A wrapper class needs to be constructed (which we just call *state*). Since this is fairly expensive (due to pinned host memory allocation), we only do this once in a FlashCausalLM instance or for each CUDA Graph size.
* Each model forward call needs to be wrapped in `begin_forward` and `end_forward`. This sets up data structures that can be reused for all calls to attention for that forward call.

When calling attention, we need access to the state object. To avoid passing an argument down the call chain (which would require changes to all models), we use a context variable. Each model forward call is wrapped using a context manager that does all the bookkeeping for such a call:

* Set the context variable to the forward call's state.
* Call `begin_forward` on the state.
* Yield.
* Call `end_forward` on the state.
* Reset the context variable.

We cannot use a single shared global variable for this, since e.g. CUDA Graphs of different sizes each have their own state.
-
- 08 Aug, 2024 6 commits
-
-
drbh authored
-
drbh authored
* hotfix: fix xpu crash introduced by the code refactor; torch.xpu relies on importing ipex
* re-enable gemma2 in xpu
* fix regression in ipex flashattention

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
drbh authored
* Update __init__.py: fix issue with NoneType comparison for max_input_tokens and sliding_window
  - Add default values for max_input_tokens and sliding_window to handle None cases.
  - Ensure the comparison between max_input_tokens and sliding_window is handled correctly to prevent TypeError.
  - This change addresses the error: TypeError: '<=' not supported between instances of 'int' and 'NoneType'.
* Update __init__.py: handle NoneType in sliding_window comparison, ensuring the comparison logic accounts for None values.
* fix: syntax/style tweak

Co-authored-by: Praz <prazanth2006@gmail.com>
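The guarded comparison this commit describes can be sketched as below. The function name and the "no constraint" semantics are illustrative assumptions, not the exact TGI code; the point is that `<=` only runs when both operands are real ints.

```python
def fits_in_sliding_window(max_input_tokens, sliding_window):
    """Hypothetical sketch: compare max_input_tokens against the sliding
    window only when both are set, avoiding
    TypeError: '<=' not supported between instances of 'int' and 'NoneType'.
    """
    if max_input_tokens is None or sliding_window is None:
        # No window (or no limit) configured: nothing to enforce.
        return True
    return max_input_tokens <= sliding_window
```

A disabled sliding window is often represented as `None` (or `-1`), which is exactly the case the original code tripped over.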
-
drbh authored
* Fix the bug
* fix: run lints
* fix: small syntax tweak

Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
-
drbh authored
* add gptj modeling
* fix: update docs for model addition
* fix: adjust syntax typo
* fix: adjust syntax typo again

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 07 Aug, 2024 1 commit
-
-
almersawi authored
Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai>
-
- 06 Aug, 2024 3 commits
- 05 Aug, 2024 1 commit
-
-
drbh authored
* fix: attempt forward on flash attn2 to check hardware support
* fix: warn about window_size_left when using flash attn 1
* fix: prefer version check over test op and avoid window_size_left if not flash attn2
* fix: improve conditional and error message
* fix: update sliding window conditional
* fix: simplify changes and revert model changes
* fix: avoid changing conditional
* fix: typo tweak
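The approach this series converges on (a version check rather than a trial forward pass) can be sketched like this. The function names and the "major.minor" version string are illustrative assumptions, not the actual TGI helpers.

```python
def supports_sliding_window(flash_attn_version: str) -> bool:
    """Hypothetical sketch: sliding-window attention (window_size_left)
    is only supported on flash attention v2+, so gate on the major
    version instead of probing with a test op."""
    major = int(flash_attn_version.split(".")[0])
    return major >= 2


def attention_kwargs(flash_attn_version: str, window_size_left):
    """Build kernel kwargs, dropping window_size_left on flash attn v1
    (with a warning) rather than crashing."""
    kwargs = {}
    if window_size_left is not None:
        if supports_sliding_window(flash_attn_version):
            kwargs["window_size_left"] = window_size_left
        else:
            print("warning: window_size_left is ignored on flash attn v1")
    return kwargs
```

The earlier "attempt forward" probe worked, but a version check is cheaper and avoids allocating tensors just to detect support.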
-
- 01 Aug, 2024 2 commits
-
-
Daniël de Kok authored
- Always return the hidden states.
- Create the output tensor inside the `attention` and `paged_attention` functions.

This removes the difference in how the output is handled between attention (output parameter) and paged attention (return value). It also removes the assumption that the attention implementation can write to an output tensor (in preparation for FlashInfer).
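The calling-convention change can be shown with a deliberately trivial stand-in kernel (the arithmetic is a placeholder, not real attention): the output buffer is allocated inside the function and returned, so callers of `attention` and `paged_attention` look identical.

```python
def attention(q, k, v):
    """Sketch of the unified convention: allocate the output here and
    return it, instead of writing into a caller-provided buffer."""
    # Output created inside the function, not passed in by the caller.
    out = [0.0] * len(q)
    for i in range(len(q)):
        # Placeholder for the real kernel computation.
        out[i] = q[i] * k[i] * v[i]
    return out
```

Returning the output (rather than filling a preallocated tensor) is what lets a backend like FlashInfer, which produces its own output tensor, slot in without special-casing.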
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 31 Jul, 2024 2 commits
-
-
drbh authored
* MODEL_ID propagation fix
* fix: remove global model id

Co-authored-by: root <root@tw031.pit.tensorwave.lan>
-
Daniël de Kok authored
The `GPTWeightLoader` was structured like this in pseudocode:

    if marlin:
        # set up tensors in a way that GPTQ-Marlin expects
    else:
        # set up tensors in a way that ExLlama/GPTQ/AWQ expect

However, the GPTQ-Marlin implementation details should really live in the `marlin` module. So, move the former branch out to a separate `GPTQMarlinWeightsLoader`.
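The shape of the refactor can be sketched as below. The class bodies are placeholders (the real loaders repack quantized tensors), but the structure is the point: each layout lives in its own loader class, and the selection happens once up front instead of branching inside a single loader.

```python
class GPTQWeightsLoaderSketch:
    """After the refactor: handles only the ExLlama/GPTQ/AWQ layout."""

    def load(self, name: str) -> str:
        return f"gptq-layout:{name}"  # placeholder for real tensor setup


class GPTQMarlinWeightsLoaderSketch:
    """Marlin-specific layout details, conceptually in the marlin module."""

    def load(self, name: str) -> str:
        return f"marlin-layout:{name}"  # placeholder for real tensor setup


def get_loader(use_marlin: bool):
    # The if/else moves out of the loader and into construction time.
    if use_marlin:
        return GPTQMarlinWeightsLoaderSketch()
    return GPTQWeightsLoaderSketch()
```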
-
- 30 Jul, 2024 1 commit
-
-
Daniël de Kok authored
- Create `quantization_config` option in the model config.
- Don't store the quantizer config in tensors anymore.
-
- 29 Jul, 2024 2 commits
-
-
Erik Kaunismäki authored
* quick fix
* allow silent failure
* explicit todo that this is only short term
-
Daniël de Kok authored
-
- 26 Jul, 2024 2 commits
-
-
drbh authored
* feat: add ruff and resolve issue
* fix: update client exports and adjust after rebase
* fix: adjust syntax to avoid circular import
* fix: adjust client ruff settings
* fix: lint and refactor import check and avoid model enum as global names
* fix: improve fbgemm_gpu check and lints
* fix: update lints
* fix: prefer comparing model enum over str
* fix: adjust lints and ignore specific rules
* fix: avoid unneeded quantize check
-
Daniël de Kok authored
-
- 25 Jul, 2024 1 commit
-
-
Daniël de Kok authored
* Fix GPTQ autotune data type to be compatible with Torch 2.4.0
* Update poetry lock file
* Fix small PaliGemma logprob differences after the torch update
-
- 24 Jul, 2024 4 commits
-
-
drbh authored
* fix: refactor adapter weight loading and mapping
* feat: enable lora load from directory
* fix: adjust launcher for local lora adapters
* feat: improve weight loading and add tests
* fix: improve logging and rebase syntax issue
* fix: improve adapter merge comments and remove unused conditional
* fix: improve get_model_with_lora_adapters naming
* fix: comment typo
-
Daniël de Kok authored
The marlin.py file was getting large, split it up.
-
Wang, Yi authored
Fix use of unquantized weights in Cohere GQA loading; also enable the model on the Intel platform.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Wang, Yi authored
* fix crash in multi-modal
* update according to review comment
* fix llava_next regression in latest main

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 23 Jul, 2024 1 commit
-
-
Daniël de Kok authored
* Add support for Llama 3 rotary embeddings
* Update transformers to 4.43
-