1. 14 Jul, 2024 1 commit
  2. 02 Jul, 2024 1 commit
  3. 26 Jun, 2024 1 commit
  4. 06 Mar, 2024 1 commit
  5. 20 Feb, 2024 1 commit
  6. 15 Feb, 2024 1 commit
• Fix static generation when compiling! (#28937) · f3788b09
  Arthur authored
      
      
      * wow I was scared!
      
      * fix everything
      
      * nits
      
      * make it BC?
      
      * add todo
      
      * nits
      
      * is_tracing should still be used to pass tracing tests
      
      * nits
      
* some nits to make sure generation works with static cache uncompiled
      
      * fix sdpa
      
      * fix FA2 for both static and dynamic in a better way?
      
      * style
      
      * fix-copies
      
      * fix fix copies
      
* fix sequential beam search
      
      * style
      
      * use `keys_to_ignore`
      
      * nit
      
* correct dtype inference at init
      
* :( the fix for FA2 is still not optimal, to investigate!
      
      * styling
      
      * nits
      
      * nit
      
      * this might work better
      
      * add comment
      
      * Update src/transformers/models/llama/modeling_llama.py
      
      * "position_ids" -> "cache_position"
      
      * style
      
      * nit
      
* Remove changes that should not be propagated just yet
      
      * Apply suggestions from code review
      
      * Styling
      
* make sure we raise an error for static cache with FA2 enabled
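
A hedged sketch of the kind of guard this refers to (the helper name is hypothetical; `StaticCache` is the real class in `transformers.cache_utils`):

```python
from transformers.cache_utils import StaticCache

def _check_cache_compat(attn_implementation: str, past_key_values) -> None:
    # Hypothetical helper: static caches pre-allocate fixed-shape key/value
    # tensors, which FA2's variable-length kernels can't consume, so fail early.
    if attn_implementation == "flash_attention_2" and isinstance(past_key_values, StaticCache):
        raise ValueError(
            "StaticCache is not compatible with flash_attention_2; "
            "use the `sdpa` or `eager` attention implementation instead."
        )
```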
      
* move to the bottom of the signature
      
      * style
      
      * Update src/transformers/models/llama/modeling_llama.py
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
      
      * Update src/transformers/models/llama/modeling_llama.py
      
      * nit in the name
      
      ---------
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
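
A usage sketch of what this fix enables (not part of the commit; assumes transformers >= 4.38 and access to the checkpoint shown):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)

# A static cache keeps KV shapes fixed, so torch.compile captures one graph
# instead of recompiling as the sequence grows.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tok("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```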
  7. 13 Feb, 2024 1 commit
  8. 08 Feb, 2024 1 commit
  9. 08 Dec, 2023 2 commits
• ce0bbd51
  Joao Gante authored
• Generate: New `Cache` abstraction and Attention Sinks support (#26681) · 633215ba
  Tom Aarsen authored
      * Draft version of new KV Caching
      
This should allow Attention Sinks (https://github.com/tomaarsen/attention_sinks)
/ StreamingLLM (https://arxiv.org/abs/2309.17453) to be easily implemented
in third-party code or in transformers directly
      
      * Address numerous PR suggestions
      
      1. Move layer_idx from cache to ...Attention. Removes confusing set_layer_idx magic.
      2. Always convert past_key_values to Cache instance at the start of ...Attention, removes all other isinstance calls.
      3. Remove __bool__ and __getitem__ magic as they're confusing.
      4. past_key_values.update(key, value, idx) now returns key, value.
5. Add use_legacy_cache flag, defaults to None, i.e. falsy. This breaks generate for now, until 1) the cache is used in generate() or 2) use_legacy_cache is defaulted to True in generate() until we change it in another PR.
      6. Separate key_cache and value_cache.
      
      Some work is still needed to see if the SinkCache can conveniently be implemented with just one update method.
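
A toy sketch of the interface points 1, 4, and 6 above describe (illustrative only, not the shipped class):

```python
from typing import List, Tuple
import torch

class MinimalDynamicCache:
    """Separate key/value lists per layer; update() is called by each
    attention layer with its own layer_idx and returns the full key/value
    tensors for that layer."""

    def __init__(self) -> None:
        self.key_cache: List[torch.Tensor] = []
        self.value_cache: List[torch.Tensor] = []

    def update(
        self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx: int
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        if layer_idx == len(self.key_cache):  # first write for this layer
            self.key_cache.append(key_states)
            self.value_cache.append(value_states)
        else:  # append along the sequence axis of [batch, heads, seq, head_dim]
            self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
            self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2)
        return self.key_cache[layer_idx], self.value_cache[layer_idx]
```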
      
      * Implement the SinkCache through backward+forward rotations
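
A sketch of the rotation trick, assuming Llama-style rotate_half RoPE (names are illustrative, not the shipped SinkCache code): because RoPE rotations compose additively, rotating a key backward by its old position and forward by its new one collapses into a single rotation by the shift, so evicted-window keys can be re-indexed without recomputation.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Standard Llama-style helper: pairs up dimensions for the 2-D rotations.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rerotate_keys(keys: torch.Tensor, cos_shift: torch.Tensor, sin_shift: torch.Tensor) -> torch.Tensor:
    # Apply one extra RoPE rotation to already-rotated keys. With cos/sin
    # evaluated at -shift (the number of evicted tokens), cached keys behave
    # as if they sit at their new, smaller positions.
    return keys * cos_shift + rotate_half(keys) * sin_shift
```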
      
      * Integrate (Sink)Cache with Llama FA2
      
      * Set use_legacy_cache=True as default, allows for test passes
      
      * Move from/to_legacy_cache to ...Model class
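
For illustration, the legacy format is a per-layer tuple of (key, value) pairs, so the conversion amounts to zipping and replaying updates (reusing the hypothetical MinimalDynamicCache sketch above, with `cache` an existing instance):

```python
# to_legacy_cache: flatten per-layer tensors back into the old tuple format
legacy = tuple(zip(cache.key_cache, cache.value_cache))

# from_legacy_cache: rebuild a Cache by replaying each layer's (key, value)
rebuilt = MinimalDynamicCache()
for layer_idx, (k, v) in enumerate(legacy):
    rebuilt.update(k, v, layer_idx)
```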
      
      * Undo unnecessary newline change
      
      * Remove copy utility from deprecated OpenLlama
      
      * Match import style
      
      * manual rebase with main
      
      * Cache class working with generate (#1)
      
      * working generate
      
      * Add tests; Simplify code; Apply changes to Mistral and Persimmon
      
      * fix rebase mess
      
      * a few more manual fixes
      
      * last manual fix
      
      * propagate changes to phi
      
      * upgrade test
      
      * add use_legacy_cache docstring; beef up tests
      
* restore unintentionally deleted code
      
      ---------
Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
      
      * move import
      
      * add default to model_kwargs.get('use_legacy_cache')
      
      * correct failing test
      
      * Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * apply PR suggestions
      
      * fix failing test
      
      * Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
      
      * PR comments
      
      * tmp commit
      
      * add docstrings
      
      * more tests, more docstrings, add to docs
      
      * derp
      
      * tmp commit
      
      * tmp dbg
      
      * more dbg
      
      * fix beam search bug
      
      * cache can be a list of tuples in some models
      
      * fix group beam search
      
      * all but sinkcache integration tests
      
      * fix sink cache and add hard integration test
      
* now also compatible with inputs_embeds input
      
      * PR comments
      
      * add Cache support to Phi+FA2
      
      * make fixup
      
      ---------
Co-authored-by: Joao Gante <joao@huggingface.co>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>