    Add Mega: Moving Average Equipped Gated Attention (#21766) · 57f25f4b
    Mitch Naylor authored
    
    
    * add mega file structure and plain pytorch version of mega source code
    
    * added config class with old naming conventions
    
    * filled in mega documentation
    
    * added config class and embeddings with optional token types
    
    * updated notes
    
    * starting the conversion process, deleted intermediate and added use_cache back to config
    
    * renamed config attributes in modeling_mega.py
    
    * checkpointing before refactoring incremental decoding functions
    
    * removed stateful incremental key/values for EMA and self-attention
    
    * refactored MovingAverageGatedAttention to remove stateful k/v history and use unified attention mask
    
    * MovingAverageGatedAttention works with incremental decoding + past values, added sequence length enforcement
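    (For context, a minimal sketch of what "incremental decoding + past values" refers to here: cached key/value states from earlier decoding steps are concatenated with the current step's projections before attention. The function and tensor names below are illustrative, not the module's actual API.)

```python
from typing import Optional

import torch

def append_cache(past: Optional[torch.Tensor], new: torch.Tensor) -> torch.Tensor:
    # Concatenate cached states (batch, past_len, dim) with the current
    # step's projections (batch, new_len, dim) along the sequence dimension.
    return new if past is None else torch.cat([past, new], dim=1)

past_k = torch.randn(1, 5, 16)   # keys cached from the 5 tokens decoded so far
new_k = torch.randn(1, 1, 16)    # key projection of the single new token
k = append_cache(past_k, new_k)  # (1, 6, 16) -- attended over at this step
```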
    
    * more comments in MovingAverageGatedAttention + checkpointing before GatedCrossAttention
    
    * bug fix in attention mask handling in MovingAverageGatedAttention
    
    * removed incremental state from GatedCrossAttention and removed IncrementalState class
    
    * finished gated cross attention and got MegaLayer working
    
    * fixed causal masking in mega decoder
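    (A generic illustration of decoder causal masking, not Mega's exact implementation: each position may attend only to itself and earlier positions, which a lower-triangular boolean mask expresses.)

```python
import torch

seq_len = 4
# True where attention is allowed: position i can see positions j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
```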
    
    * fixed how padding and causal masks are passed through MegaLayer with and without k/v caching
    
    * finished MegaModel; tested with encoder, decoder-only, and cross-attention type inputs; started work on downstream classes; removed mentions of position_ids
    
    * added optional dense hidden layer for masked and causal LM classes
    
    * docstring updates in MultiHeadEMA and GatedCrossAttention, removed unnecessary inputs in cross-attention
    
    * removed before_attn_fn in Mega class and updated docstrings and comments up to there
    
    * bug fix in MovingAverageGatedAttention masking
    
    * working conversion of MLM checkpoint in scratchpad script -- perfect matches
    
    * moved arg for hidden dense layer in LM head to config; discovered issue where from_pretrained is renaming gamma and beta parameters
    
    * renamed gamma and beta parameters to avoid HF renaming when loading from checkpoint
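    (A minimal illustration, not the library's exact code, of the loading behavior that motivated the rename: `from_pretrained` historically remaps state-dict keys containing "gamma"/"beta" to "weight"/"bias", so parameters with those names get silently renamed on load. The key below is hypothetical.)

```python
def fix_key(key: str) -> str:
    # Mirror the remapping applied while loading a checkpoint:
    # "gamma" -> "weight" and "beta" -> "bias" anywhere in the key.
    if "gamma" in key:
        key = key.replace("gamma", "weight")
    if "beta" in key:
        key = key.replace("beta", "bias")
    return key

print(fix_key("mega.layers.0.ema_gate.gamma"))  # mega.layers.0.ema_gate.weight
```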
    
    * finished checkpoint conversion script
    
    * cleaned up old class in mega config script
    
    * removed 'copied from' statements; integration tests now passing
    
    * added num_attention_heads=1 to config for integration compatibility, decoder tests working, generation tests failing
    
    * fixed tuple output of MegaModel
    
    * all common tests passing after fixing issues in decoder, gradient retention, and initialization
    
    * added mega-specific tests, ready for more documentation and style checks
    
    * updated docstrings; checkpoint before style fixes
    
    * style and quality checks, fixed initialization problem in float_tensor, ready for PR
    
    * added mega to toctree
    
    * removed unnecessary arg in MegaConfig
    
    * removed unused arg and fixed code samples with leftover roberta models
    
    * Apply suggestions from code review
    
    Applied all suggestions except the one renaming a class, as I'll need to update that throughout.
    Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
    
    * fixed issue where .view breaks batch dimension, conversion script fixed with absolute imports, updated readme with Mega->MEGA
    
    * removed asserts in Mega code, renamed sequencenorm, gatedcrossattention, and NFFN, replaced get_activation_fn with ACTFN, and added sequencenorm to layer norms
    
    * reformatted .forward() docstrings to match style and removed unused mask input in cross-attention
    
    * removed all reset_parameters() methods and rolled into MegaPreTrainedModel._init_weights()
    
    * renamed all single-letter variables and improved readability in tensor size comments, Mega->MEGA in 2 documentation files
    
    * variable names in NFFN
    
    * manual Mega->MEGA changes in docs
    
    * Mega->MEGA in config auto
    
    * style and quality fixes
    
    * Apply suggestions from code review
    Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
    
    * renamed parameters and variables with confusing names, added copied from statements, moved fft conv to its own method, other cleanup from PR comments
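    (A quick sketch of the FFT-convolution idea behind the new method; the function name, shapes, and kernel are assumptions for illustration, not the model's actual code. The damped-EMA recurrence can be applied to a whole sequence as a causal convolution computed with zero-padded FFTs.)

```python
import torch

def fft_causal_conv(x: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """Causal convolution of x (batch, seq_len) with kernel (seq_len,) via FFT."""
    seq_len = x.shape[-1]
    fft_len = 2 * seq_len  # zero-pad so the circular convolution becomes linear
    x_f = torch.fft.rfft(x.float(), n=fft_len)
    k_f = torch.fft.rfft(kernel.float(), n=fft_len)
    return torch.fft.irfft(x_f * k_f, n=fft_len)[..., :seq_len].type_as(x)

# e.g. an exponentially damped kernel, k[t] = alpha * (1 - alpha) ** t
alpha, seq_len = 0.3, 8
kernel = alpha * (1 - alpha) ** torch.arange(seq_len, dtype=torch.float32)
out = fft_causal_conv(torch.randn(2, seq_len), kernel)
```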
    
    * commit before dealing with merge conflicts
    
    * made new attention activation functions available in ACT2FN and added generation test from OPT
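    (A minimal usage sketch of the `ACT2FN` registry in `transformers.activations`, which modules index by the activation name taken from the config; "gelu" below is just a stand-in for the attention activations this commit registers.)

```python
import torch
from transformers.activations import ACT2FN

# Look the activation up by its string name, as a config value would provide it.
activation = ACT2FN["gelu"]
print(activation(torch.tensor([-1.0, 0.0, 1.0])))
```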
    
    * style and quality in activations and tests
    
    * documentation fixes, renaming variables in dropout and rotary positions, used built-in causal masking, encoders->layers in MegaModel, moved comments into docstrings
    
    * style and quality fixes after latest updates, before rotary position ids
    
    * causal mask in MegaBlock docstring + added missing device passing
    
    * Apply suggestions from code review
    Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
    
    * Update README.md
    Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
    
    * added Mega prefixes where missing, reverted MegaSequenceNorm to if-else, other module renaming requested in PR
    
    * style and quality fixes + readme updates pointing to main
    
    ---------
    Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
    Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>