1. 17 Apr, 2023 1 commit
    • [chatgpt] Detached PPO Training (#3195) · e3551443
      csric authored
      
      
      * run the base
      
      * working on dist ppo
      
      * sync
      
      * detached trainer
      
      * update detached trainer. no maker update function
      
      * facing init problem
      
      * 1 maker 1 trainer detached run. but no model update
      
      * facing cuda problem
      
      * fix save functions
      
      * verified maker update
      
      * nothing
      
      * add ignore
      
      * analyze loss issue
      
      * remove some debug codes
      
      * facing 2m1t stuck issue
      
      * 2m1t verified
      
      * do not use torchrun
      
      * working on 2m2t
      
      * working on 2m2t
      
      * initialize strategy in ray actor env
      
      * facing actor's init order issue
      
      * facing ddp model update issue (need to unwrap ddp)
      
      * unwrap ddp actor
      
      * checking 1m2t stuck problem
      
      * nothing
      
      * set timeout for trainer choosing. It solves the stuck problem!
      
      * delete some debug output
      
      * rename to sync with upstream
      
      * rename to sync with upstream
      
      * coati rename
      
      * nothing
      
      * I am going to detach the replay buffer from the trainer and make it a Ray Actor. Two benefits: 1. support for TP trainers; 2. asynchronous buffer operations
      
      * experience_maker_holder performs target-revolving _send_experience() instead of length comparison.
      
      * move code to ray subfolder
      
      * working on pipeline inference
      
      * apply comments
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
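The "target-revolving `_send_experience()`" and "timeout for trainer choosing" entries above describe how an experience maker picks which trainer receives each batch. Below is a minimal sketch, assuming "target-revolving" means round-robin over the target trainers (instead of comparing their buffer lengths) and assuming the timeout applies to handing a batch to a busy trainer. `ExperienceSender` and `send_experience` are hypothetical names, and each trainer is modelled as a bounded `queue.Queue` rather than a Ray actor.

```python
import itertools
import queue
from typing import Any, List


class ExperienceSender:
    """Hypothetical stand-in for an experience maker holder.

    Each target trainer is modelled as a bounded queue.Queue (an assumption;
    the commit itself uses Ray actors).
    """

    def __init__(self, targets: List[queue.Queue], timeout: float = 1.0) -> None:
        self.targets = targets
        self.timeout = timeout
        # Revolving (round-robin) iterator over target indices.
        self._order = itertools.cycle(range(len(targets)))

    def send_experience(self, experience: Any) -> int:
        """Send to the next target in turn and return the index that accepted it.

        A target whose queue stays full for `timeout` seconds is skipped, so a
        busy trainer cannot block the sender forever. Raises RuntimeError if
        every target refuses the batch.
        """
        for _ in range(len(self.targets)):
            idx = next(self._order)
            try:
                self.targets[idx].put(experience, timeout=self.timeout)
                return idx
            except queue.Full:
                continue  # this trainer is busy; try the next one
        raise RuntimeError("all target trainers are full")


# Usage: five batches spread over two trainers.
trainers = [queue.Queue(maxsize=8) for _ in range(2)]
sender = ExperienceSender(trainers, timeout=0.1)
chosen = [sender.send_experience(i) for i in range(5)]
print(chosen)  # [0, 1, 0, 1, 0]
```

Round-robin needs no per-trainer query before each send, which in a distributed setting would presumably mean one remote call per trainer just to read a buffer length.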
  2. 28 Mar, 2023 1 commit
  3. 14 Feb, 2023 1 commit
  4. 10 Jan, 2023 1 commit
  5. 09 Jan, 2023 1 commit
    • [workflow] added coverage test (#2399) · 53bb8682
      Frank Lee authored
      * [workflow] added coverage test
      
      * polish code
      
      * polish code
      
      * polish code
      
      * polish code
      
      * polish code
      
      * polish code
      
      * polish code
      
      * polish code
  6. 06 Jan, 2023 1 commit
  7. 30 Nov, 2022 1 commit
  8. 08 Nov, 2022 1 commit
  9. 01 Apr, 2022 1 commit
  10. 15 Feb, 2022 1 commit
    • moved env variables to global variables; (#215) · 9ee197d0
      アマデウス authored
      added branch context;
      added vocab parallel layers;
      moved split_batch from load_batch to tensor parallel embedding layers;
      updated gpt model;
      updated unit test cases;
      fixed a few collective communicator bugs
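"Vocab parallel layers" commonly refers to splitting an embedding table by vocabulary rows across tensor-parallel ranks. Each rank looks up only the token ids that fall in its own slice and writes zeros for the rest, and an all-reduce then sums the partial outputs. Below is a minimal single-process sketch of that general technique, not the code from #215. Ranks are simulated in a loop instead of a real all-reduce, and `shard_rows`, `local_lookup` and `vocab_parallel_embedding` are hypothetical names.

```python
from typing import List

Matrix = List[List[float]]


def shard_rows(table: Matrix, world_size: int) -> List[Matrix]:
    """Split the vocabulary (rows) into contiguous, equal-sized shards."""
    vocab = len(table)
    assert vocab % world_size == 0, "vocab size must divide evenly"
    per_rank = vocab // world_size
    return [table[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def local_lookup(shard: Matrix, rank: int, token_ids: List[int]) -> Matrix:
    """Lookup on one rank: ids outside this rank's vocab range yield zero rows."""
    per_rank = len(shard)
    start, end = rank * per_rank, (rank + 1) * per_rank
    dim = len(shard[0])
    out = []
    for t in token_ids:
        if start <= t < end:
            out.append(list(shard[t - start]))
        else:
            out.append([0.0] * dim)  # masked: another rank owns this token
    return out


def vocab_parallel_embedding(table: Matrix, world_size: int,
                             token_ids: List[int]) -> Matrix:
    """Run every simulated rank, then sum the partials (the all-reduce step)."""
    shards = shard_rows(table, world_size)
    partials = [local_lookup(s, r, token_ids) for r, s in enumerate(shards)]
    # For each token, sum the rows from all ranks element-wise; exactly one
    # rank contributes a non-zero row.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


# Usage: a 4-token vocabulary with 2-dim embeddings, split over 2 ranks.
table = [[float(i), float(10 * i)] for i in range(4)]
ids = [3, 0, 2]
print(vocab_parallel_embedding(table, 2, ids))  # [[3.0, 30.0], [0.0, 0.0], [2.0, 20.0]]
```

The result matches a plain lookup on the full table, while each rank only stores `vocab / world_size` rows.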
  11. 28 Oct, 2021 1 commit