Fix non-deterministic Megatron-LM checkpoint name (#24674)
Fix non-deterministic checkpoint name

The order of the entries returned by `os.listdir` is arbitrary, which is a problem when the code takes the first listed file (`os.listdir(...)[0]`). Depending on that order, the call can return a checkpoint name such as `distrib_optim.pt`, a file that does not contain the desired information, such as the saved arguments originally given to Megatron-LM.
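A minimal sketch of one way to make the selection deterministic, not necessarily the change made in this commit. The helper name `pick_checkpoint_name` and its `excluded` parameter are hypothetical. The sketch assumes that checkpoint files end in `.pt` and that `distrib_optim.pt` is the only file to skip.

```python
import os


def pick_checkpoint_name(directory, excluded=("distrib_optim.pt",)):
    """Return a checkpoint file name from `directory` in a reproducible way.

    Hypothetical helper: `excluded` lists file names that are known not to
    carry the saved arguments (assumption: only `distrib_optim.pt`).
    """
    # os.listdir returns entries in arbitrary order, so sort them first
    # and drop the files we never want to pick.
    candidates = sorted(
        name
        for name in os.listdir(directory)
        if name.endswith(".pt") and name not in excluded
    )
    if not candidates:
        raise FileNotFoundError(f"no usable checkpoint file in {directory!r}")
    # After sorting, index 0 is the same on every run and every filesystem.
    return candidates[0]
```

Sorting removes the dependence on filesystem order, and the explicit exclusion list ensures that `distrib_optim.pt` is never selected even when it would sort first.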