1. 05 Jun, 2024 1 commit
    • Reduce by 2 the memory requirement in `generate()` 🔥🔥🔥 (#30536) · bd5091df
      Cyril Vallez authored
      * Fix contrastive_search for the new cache structure, and improve performance by removing the inefficient torch.stack(torch.split(x, top_k, dim=0)) pattern (see the reshape sketch after this commit entry)
      
      * Fix _contrastive_search for non-standard cache using ellipsis slicing
      
      * Fix all outputs.logits memory leaks for all decoding strategies!
      
      * Fix small error in _contrastive_search()
      
      * Make all necessary changes and reverts for the new class
      
      * Apply coding style
      
      * Remove pipes in type hints for compatibility
      
      * correct type hint
      
      * apply style
      
      * Use DynamicCache by default and solve conflicts
      
      * Fix rebase issues
      
      * Add `_supports_dynamic_cache_class` to models that support DynamicCache but not other caches, making DynamicCache the default for more models
      
      * Create a generation config option to return the legacy cache format by default, or to opt out of it (see the cache-format sketch after this commit entry)
      
      * style
      
      * Fix case when use_cache is False
      
      * Remove default DynamicCache in assisted_decoding if assistant_model does not support it + fix _seen_tokens when cropping cache
      
      * Update prepare_inputs_for_generation() for case with empty DynamicCache
      
      * Correct return of args in _assisted_decoding
      
      * Remove EfficientDynamicCache as it is no longer needed
      
      * Correct mistake in generation config
      
      * Move cache logic of assisted decoding to AssistedCandidateGenerator.__init__
      
      * change DynamicCache function names from "split" to "batch_split" for readability + apply coding style
      
      * Remove `_supports_dynamic_cache_class` attribute after rebase
      
      * Correct missing line lost in conflict resolution during rebasing
      
      * Add special case for Jamba
      
      * Fix jamba test
      
      * Coding style
      
      * coding style
      
      * Correct missing import in rebasing
      
      * Simplify _validate_model_kwargs based on removal of _supports_dynamic_cache attribute
      
      * Simplify code paths in _contrastive_search
      
      * coding style
      
      * Update docstrings of cache methods
      
      * Update prepare_inputs_for_generation() -> past_key_values are always Cache objects
      bd5091df
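      A minimal sketch of the stack/split pattern mentioned above and its cheaper equivalent; shapes and variable names here are illustrative, not taken from the commit:

          import torch

          # Tensor whose leading dimension is batch_size * top_k, as produced when
          # scoring top_k candidates per batch element in contrastive search.
          batch_size, top_k, hidden = 2, 4, 8
          x = torch.randn(batch_size * top_k, hidden)

          # The removed pattern: split into batch_size chunks of top_k rows, then
          # stack them along a new leading dimension (allocates a fresh tensor).
          stacked = torch.stack(torch.split(x, top_k, dim=0))

          # For a contiguous tensor the same result is a plain view, with no copy.
          reshaped = x.view(batch_size, top_k, hidden)

          assert torch.equal(stacked, reshaped)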
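      A short sketch of the legacy tuple cache format versus DynamicCache, assuming the from_legacy_cache/to_legacy_cache helpers from transformers.cache_utils; tensor shapes are placeholders:

          import torch
          from transformers.cache_utils import DynamicCache

          # Legacy format: one (key, value) pair of tensors per layer,
          # each of shape (batch, num_heads, seq_len, head_dim).
          legacy = tuple(
              (torch.zeros(1, 2, 5, 8), torch.zeros(1, 2, 5, 8)) for _ in range(4)
          )

          # The new default cache object, and the round trip back to tuples
          # for callers that still expect the legacy format.
          cache = DynamicCache.from_legacy_cache(legacy)
          print(cache.get_seq_length())  # 5
          legacy_again = cache.to_legacy_cache()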
  2. 13 May, 2024 1 commit
  3. 01 May, 2024 1 commit
  4. 30 Apr, 2024 1 commit
  5. 18 Apr, 2024 1 commit
    • Add jamba (#29943) · 3f20877d
      tomeras91 authored
      * Add jamba arch
      
      * apply "make fix-copies" changes
      
      * fix link to model in JambaConfig docstring
      
      * Add n_ctx in modeling file because repo-consistency wants that
      
      * Add jamba to flash attention and sdpa documentation
      
      * mamba dt_proj quant fix now works for LoRA as well
      
      * override test_left_padding_compatibility and use a more permissive tolerance. left padding numerical differences are accentuated by mamba layers
      
      * add jamba to tokenization auto
      
      * fix shape comments (PR #24 on the model page: https://huggingface.co/ai21labs/Jamba-v0.1/discussions/24)
      
      * simple PR fixes
      
      * remove unnecessary kwargs from JambaAttentionDecoderLayer and JambaMambaDecoderLayer
      
      * remove the LoRA hack for the mamba dt_proj bias. It was solved in huggingface/peft#1530 (https://github.com/huggingface/peft/pull/1530)
      
      * Add copied comment on JambaMLP (it's the same as MixtralMLP)
      
      * remove padding_mask warnings. It's not supported anymore
      
      * fix docstring. Float instead of int
      
      * A few more minor PR fixes
      
      * (1) lowercase names for mamba layernorms (2) remove _apply_inner_layernorms and do it directly in the forward pass
      
      * Return None attention weights from mamba layers. Append to all attentions only if not None.
      
      * remove some leftover jamba archive lists
      
      * Better separation between expert vs non-expert layers. non-expert layers return None as router_logits, and it is not concatenated to all_router_logits returned from JambaModel
      
      * no need to take router_logits at config.expert_layer_offset anymore. result.router_logits now holds results only for expert layers
      
      * Add Jamba paper on READMEs
      
      * (1) rename n_ctx -> max_position_embeddings (2) don't use it in the modeling file since it's not needed (set it as an exception to check_config_attributes)
      
      * Add copied from comment
      
      * remove the code path for apply_inner_layernorms=False. Jamba always has the inner mamba layernorms
      
      * clearer docstring for _convert_to_standard_cache
      
      * style fixes
      
      * Change calc_logits_for_entire_prompt (bool) to num_logits_to_keep (int). Adapt assisted decoding code to use it. Also a small change in the low-memory beam search decoding path to support this new int value in model_inputs (see the logits-slicing sketch after this commit entry)
      
      * rename test so it still overrides what it's meant to override
      
      * draft
      
      * oops
      
      * nit
      
      * remove more complex logic
      
      * fix names used in config
      
      * fix fix fix
      
      * style
      
      * fix some more failing tests
      
      * generate did not init the cache 🙃
      
      * more small nits
      
      * typo
      
      * config.mamba_expand * config.hidden_size for the intermediate size of the mamba shapes
      
      * fix init of pkv with torch.tensor()
      
      * empty tensor
      
      * fix some init issues
      
      * stupid changes required by generate because it does not even support its own DynamicCache class
      
      * more fixes
      
      * fix general assisted gen cache_position bug
      
      * tests passing
      
      * Add offsets and periods as SPECIAL_CASES_TO_ALLOW in check_config_attributes.py
      
      * fix reorder_cache to reorder mamba states and override some more functions in HybridMambaAttentionDynamicCache
      
      * no need to override test_past_key_values_format() and _check_past_key_values_for_generate() in tests anymore
      
      * fix docstrings and typehints for past_key_values
      
      * style fixes
      
      * fix docs
      
      * change typehint due to copy from Mixtral
      
      * forgot import
      
      * import order
      
      * Add configuration_jamba and modeling_jamba to not_doctested because the model is too big to download (in docstring of JambaForCausalLM.forward)
      
      * Add integration test with tiny random Jamba model on the hub
      
      * fix flash attention cache shapes
      
      * bring back forgotten hidden states
      
      * rename HybridMambaAttentionDynamicCache.seqlen_offset to has_previous_state (and make bool) and bugfix - it should be set to True after a finished forward pass of the entire model
      
      * align integration test after modeling fixes
      
      * bugfix - mamba can use precomputed states only if the forward pass is on a single token
      
      * bugfix - mamba can use precomputed states only if they match the batch size
      
      * typo
      
      * remove making _prepare_4d_causal_attention_mask a leaf function
      
      * stop using past_seq_len.get_seq_length(). Use cache positions instead. Adjust test (test_decoder_model_past_with_large_inputs) accordingly
      
      ---------
      Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
      Co-authored-by: Joao Gante <joao@huggingface.co>
      3f20877d
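      An illustrative sketch (not Jamba's actual forward) of what an integer num_logits_to_keep buys over the old boolean flag: during prefill only the last position's logits are needed for sampling, so the LM head runs on a slice of the hidden states instead of the whole prompt:

          import torch
          import torch.nn as nn

          hidden_size, vocab_size = 16, 1000
          lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

          # Hidden states for a long prompt: (batch, seq_len, hidden_size).
          hidden_states = torch.randn(1, 4096, hidden_size)

          # Keep logits only for the last position instead of materializing
          # a (1, 4096, vocab_size) tensor.
          num_logits_to_keep = 1
          logits = lm_head(hidden_states[:, -num_logits_to_keep:, :])
          print(logits.shape)  # torch.Size([1, 1, 1000])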
  6. 10 Apr, 2024 1 commit
  7. 27 Mar, 2024 1 commit
  8. 22 Mar, 2024 1 commit
  9. 06 Mar, 2024 2 commits
  10. 26 Feb, 2024 1 commit
  11. 16 Feb, 2024 1 commit
  12. 19 Jan, 2024 1 commit
  13. 13 Jan, 2024 2 commits
  14. 11 Jan, 2024 1 commit
  15. 20 Dec, 2023 1 commit
  16. 14 Dec, 2023 1 commit
  17. 12 Dec, 2023 1 commit