* adapt model weight initialization for methods in PyTorch nn.init
* remove hybrid Adam in test_moe_zero_optim
* fix activation checkpointing and its unit test
* add CPU Adam counter for all CPU Adam optimizers
* fix updating error in Adam kernel