    [Examples] TPU-based training of a language model using TensorFlow (#21657) · 390e121f
    Sayak Paul authored
    
    
    * add: tokenizer training script for TF TPU LM training.
    
    * add: script for preparing the TFRecord shards.
    
    * add: sequence of execution to readme.
    
    * remove limit from the tfrecord shard name.
    
    * Add initial train_model.py
    
    * Add basic training arguments and model init
    
    * Get up to the point of writing the data collator
    
    * Pushing progress so far!
    
    * Complete first draft of model training code
    
    * feat: grouping of texts efficiently.
    Co-authored-by: Matt <rocketknight1@gmail.com>
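    The text-grouping step above can be sketched framework-free. This is a minimal illustration of the common LM-pretraining trick of concatenating tokenized sequences and re-splitting them into fixed-size blocks; the function name `group_texts` and the block size are illustrative, not necessarily what the script uses:

    ```python
    def group_texts(examples, block_size=128):
        """Concatenate tokenized sequences and re-split them into
        fixed-size blocks, dropping the ragged remainder."""
        concatenated = [tok for seq in examples for tok in seq]
        total_length = (len(concatenated) // block_size) * block_size
        return [
            concatenated[i : i + block_size]
            for i in range(0, total_length, block_size)
        ]
    ```

    Doing this once over a whole batch of documents (rather than padding each document separately) keeps every training block fully packed with real tokens, which matters on TPUs where static shapes are required.
    
    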
    
    * Add proper masking collator and get training loop working
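    The masking collator referenced above operates on tensors in the actual script; the core idea can be sketched in plain Python. The special-token IDs and `MASK_ID` below are hypothetical placeholders, and this simplified version always replaces a selected token with the mask token (BERT-style collators additionally leave some selections unchanged or random):

    ```python
    import random

    MASK_ID = 4           # hypothetical [MASK] token id
    SPECIAL_IDS = {0, 1}  # hypothetical [CLS]/[SEP] ids, never masked

    def mask_tokens(input_ids, mlm_probability=0.15, rng=None):
        """Return (masked_inputs, labels) for masked language modelling.
        Selected positions are replaced by MASK_ID and keep their
        original id as the label; unselected positions get label -100
        so the loss function ignores them."""
        rng = rng or random.Random()
        masked, labels = [], []
        for tok in input_ids:
            if tok not in SPECIAL_IDS and rng.random() < mlm_probability:
                masked.append(MASK_ID)
                labels.append(tok)
            else:
                masked.append(tok)
                labels.append(-100)
        return masked, labels
    ```
    
    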
    
    * fix: things.
    
    * Read sample counts from filenames
    
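    Encoding the per-shard sample count in the shard's filename lets the training script size its epochs without a full pass over the TFRecords. The log does not show the exact naming scheme, so the pattern below (`<prefix>-<count>.tfrecord`) is a hypothetical sketch:

    ```python
    import re

    def samples_in_shard(filename):
        """Extract a sample count encoded at the end of a shard name,
        e.g. 'wikitext-train-5000.tfrecord' -> 5000 (pattern is
        hypothetical, not necessarily the script's real scheme)."""
        match = re.search(r"-(\d+)\.tfrecord$", filename)
        if match is None:
            raise ValueError(f"no sample count in shard name: {filename}")
        return int(match.group(1))
    ```
    
    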
    * Draft README
    
    * Improve TPU warning
    
    * Use distribute instead of distribute.experimental
    
    * Apply suggestions from code review
    Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
    
    * Modularize loading and add MLM probability as arg
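    Exposing the MLM probability as a command-line argument can be sketched with `argparse`; the flag names and defaults here are illustrative, not necessarily those of run_mlm.py:

    ```python
    import argparse

    def parse_args(argv=None):
        """Build the CLI for the training script. Only a subset of
        options is shown and names are illustrative."""
        parser = argparse.ArgumentParser(description="TPU MLM training")
        parser.add_argument(
            "--mlm_probability", type=float, default=0.15,
            help="Fraction of tokens to mask for the MLM objective.")
        parser.add_argument(
            "--per_replica_batch_size", type=int, default=8,
            help="Batch size per TPU core.")
        return parser.parse_args(argv)
    ```

    Threading the parsed value into the data collator (rather than hard-coding 0.15) is what lets the masking rate be tuned per run.
    
    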
    
    * minor refactoring to better use the cli args.
    
    * readme fillup.
    
    * include tpu and inference sections in the readme.
    
    * table of contents.
    
    * parallelize maps.
    
    * polish readme.
    
    * change script name to run_mlm.py
    
    * address PR feedback (round I).
    
    ---------
    Co-authored-by: Matt <rocketknight1@gmail.com>
    Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>