    Exclude the load balancing loss of padding tokens in Mixtral-8x7B (#28517) · c5c69096
    Khai Mai authored
    * fix the function load_balancing_loss_func in Mixtral_Moe to include attention_mask
    
    * format code using black and ruff
    
    * skip computing mask if attention_mask=None
    
* add tests for the load balancing loss in Mixtral-MoE
    
* fix the assertion that the losses differ in mixtral_test
    
    * fix pad_leng
    
    * use assertNotAlmostEqual and print to debug
    
    * remove print for debug
    
    * minor updates
    
    * reduce rtol and atol
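
The change routes attention_mask into the auxiliary load balancing loss so that padding tokens no longer contribute to the expert-usage statistics. Below is a minimal sketch of that idea; the function name, signature, and tensor shapes are illustrative assumptions, not the exact code merged into transformers.

```python
import torch


def load_balancing_loss_with_mask(gate_logits, num_experts, top_k, attention_mask=None):
    """Sketch of an auxiliary load balancing loss that ignores padding tokens.

    gate_logits: (batch * seq_len, num_experts) router logits for one MoE layer.
    attention_mask: optional (batch, seq_len) mask with 1 for real tokens, 0 for padding.
    """
    routing_weights = torch.softmax(gate_logits, dim=-1)
    # indices of the top-k experts selected for each token
    _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
    # one-hot expert assignment, shape (tokens, top_k, num_experts)
    expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts).float()

    if attention_mask is None:
        # no padding information: average over all tokens
        tokens_per_expert = expert_mask.mean(dim=0)
        router_prob_per_expert = routing_weights.mean(dim=0)
    else:
        # flatten the (batch, seq_len) mask to align with the token axis
        pad_mask = attention_mask.reshape(-1).float()
        denom = pad_mask.sum()
        # weight each token by its mask so padding tokens contribute nothing
        tokens_per_expert = (expert_mask * pad_mask[:, None, None]).sum(dim=0) / denom
        router_prob_per_expert = (routing_weights * pad_mask[:, None]).sum(dim=0) / denom

    # encourage a balanced load: product of usage fraction and router probability per expert
    loss = torch.sum(tokens_per_expert.mean(dim=0) * router_prob_per_expert)
    return loss * num_experts
```

With this weighting, two batches that contain the same real tokens but different amounts of padding yield the same auxiliary loss, which is what the added tests check (the losses with and without padding should differ only within rtol/atol when padding is correctly excluded).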