- 17 Apr, 2023 1 commit
-
-
csric authored
* run the base
* working on dist ppo
* sync
* detached trainer
* update detached trainer. no maker update function
* facing init problem
* 1 maker 1 trainer detached run. but no model update
* facing cuda problem
* fix save functions
* verified maker update
* nothing
* add ignore
* analyze loss issue
* remove some debug codes
* facing 2m1t stuck issue
* 2m1t verified
* do not use torchrun
* working on 2m2t
* working on 2m2t
* initialize strategy in ray actor env
* facing actor's init order issue
* facing ddp model update issue (need unwrap ddp)
* unwrap ddp actor
* checking 1m2t stuck problem
* nothing
* set timeout for trainer choosing. It solves the stuck problem!
* delete some debug output
* rename to sync with upstream
* rename to sync with upstream
* coati rename
* nothing
* I am going to detach the replaybuffer from trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations
* experience_maker_holder performs target-revolving _send_experience() instead of length comparison.
* move code to ray subfolder
* working on pipeline inference
* apply comments
---------
Co-authored-by: csric <richcsr256@gmail.com>
-
- 28 Mar, 2023 1 commit
-
-
Fazzie-Maqianli authored
-
- 14 Feb, 2023 1 commit
-
-
ver217 authored
-
- 10 Jan, 2023 1 commit
-
-
Frank Lee authored
* [workflow] auto comment with test coverage report * polish code * polish yaml
-
- 09 Jan, 2023 1 commit
-
-
Frank Lee authored
* [workflow] added coverage test * polish code * polish code * polish code * polish code * polish code * polish code * polish code * polish code
-
- 06 Jan, 2023 1 commit
-
-
Frank Lee authored
* [setup] support pre-build and jit-build of cuda kernels * polish code * polish code * polish code * polish code * polish code * polish code
-
- 30 Nov, 2022 1 commit
-
-
Frank Lee authored
* [setup] supported conda-installed torch * polish code
-
- 08 Nov, 2022 1 commit
-
-
Jiarui Fang authored
-
- 01 Apr, 2022 1 commit
-
-
アマデウス authored
-
- 15 Feb, 2022 1 commit
-
-
アマデウス authored
added branch context; added vocab parallel layers; moved split_batch from load_batch to tensor parallel embedding layers; updated gpt model; updated unit test cases; fixed a few collective communicator bugs
-
- 28 Oct, 2021 1 commit
-
-
zbian authored
-