Commits · 7bc5a8e338a1240070ca82f96721c451060f1a70 · OpenDAS / ColossalAI

17 Apr, 2023 1 commit

[chatgpt] Detached PPO Training (#3195) · e3551443

csric authored Apr 17, 2023



* run the base

* working on dist ppo

* sync

* detached trainer

* update detached trainer. no maker update function

* facing init problem

* 1 maker 1 trainer detached run. but no model update

* facing cuda problem

* fix save functions

* verified maker update

* nothing

* add ignore

* analyze loss issue

* remove some debug codes

* facing 2m1t stuck issue

* 2m1t verified

* do not use torchrun

* working on 2m2t

* working on 2m2t

* initialize strategy in ray actor env

* facing actor's init order issue

* facing ddp model update issue (need to unwrap ddp)

* unwrap ddp actor

* checking 1m2t stuck problem

* nothing

* set timeout for trainer choosing. It solves the stuck problem!

* delete some debug output

* rename to sync with upstream

* rename to sync with upstream

* coati rename

* nothing

* I am going to detach the replay buffer from the trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations

* experience_maker_holder performs target-revolving _send_experience() instead of length comparison.

* move code to ray subfolder

* working on pipeline inference

* apply comments

---------
Co-authored-by: csric <richcsr256@gmail.com>


28 Mar, 2023 1 commit
- [Coati] first commit (#3283) · b0ce5a10
  Fazzie-Maqianli authored Mar 28, 2023
  
14 Feb, 2023 1 commit
- [app] add chatgpt application (#2698) · 1b347010
  ver217 authored Feb 14, 2023
  
10 Jan, 2023 1 commit
- [workflow]auto comment with test coverage report (#2419) · b3472d32
  Frank Lee authored Jan 10, 2023
```
* [workflow]auto comment with test coverage report

* polish code

* polish yaml
```
09 Jan, 2023 1 commit

[worfklow] added coverage test (#2399) · 53bb8682

Frank Lee authored Jan 09, 2023

* [worfklow] added coverage test

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code


06 Jan, 2023 1 commit

[setup] support pre-build and jit-build of cuda kernels (#2374) · 40d376c5

Frank Lee authored Jan 06, 2023

* [setup] support pre-build and jit-build of cuda kernels

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code


30 Nov, 2022 1 commit
- [setup] supported conda-installed torch (#2048) · 81e0da7f
  Frank Lee authored Nov 30, 2022
```
* [setup] supported conda-installed torch

* polish code
```
08 Nov, 2022 1 commit
- [NFC] update gitignore remove DS_Store (#1830) · f86a703b
  Jiarui Fang authored Nov 08, 2022
  
01 Apr, 2022 1 commit
- [model checkpoint] added unit tests for checkpoint save/load (#599) · 354b7954
  アマデウス authored Apr 01, 2022
  
15 Feb, 2022 1 commit

moved env variables to global variables; (#215) · 9ee197d0

アマデウス authored Feb 14, 2022

added branch context;
added vocab parallel layers;
moved split_batch from load_batch to tensor parallel embedding layers;
updated gpt model;
updated unit test cases;
fixed few collective communicator bugs


28 Oct, 2021 1 commit
- Migrated project · 404ecbdc
  zbian authored Oct 28, 2021
  