- 29 Mar, 2024 1 commit
YeAnbang authored

* Add dpo. Fix sft, ppo, lora. Refactor all
* fix and tested ppo
* 2nd round refactor
* add ci tests
* fix ci
* fix ci
* fix readme, style
* fix readme style
* fix style, fix benchmark
* reproduce benchmark result, remove useless files
* rename to ColossalChat
* use new image
* fix ci workflow
* fix ci
* use local model/tokenizer for ci tests
* fix ci
* fix ci
* fix ci
* fix ci timeout
* fix rm progress bar. fix ci timeout
* fix ci
* fix ci typo
* remove 3d plugin from ci temporarily
* test environment
* cannot save optimizer
* support chat template
* fix readme
* fix path
* test ci locally
* restore build_or_pr
* fix ci data path
* fix benchmark
* fix ci, move ci tests to 3080, disable fast tokenizer
* move ci to 85
* support flash attention 2
* add all-in-one data preparation script. Fix colossal-llama2-chat chat template
* add hardware requirements
* move ci test data
* fix save_model, add unwrap
* fix missing bos
* fix missing bos; support grad accumulation with gemini
* fix ci
* fix ci
* fix ci
* fix llama2 chat template config
* debug sft
* debug sft
* fix colossalai version requirement
* fix ci
* add sanity check to prevent NaN loss
* fix requirements
* add dummy data generation script
* add dummy data generation script
* add dummy data generation script
* add dummy data generation script
* update readme
* update readme
* update readme and ignore
* fix logger bug
* support parallel_output
* modify data preparation logic
* fix tokenization
* update lr
* fix inference
* run pre-commit

Co-authored-by: Tong Li <tong.li352711588@gmail.com>

- 26 Jul, 2023 1 commit
Ziheng Qin authored

Co-authored-by: henryqin1997 <henryqin1997@gamil.com>

- 17 Apr, 2023 1 commit
csric authored

* run the base
* working on dist ppo
* sync
* detached trainer
* update detached trainer. no maker update function
* facing init problem
* 1 maker 1 trainer detached run, but no model update
* facing cuda problem
* fix save functions
* verified maker update
* nothing
* add ignore
* analyze loss issue
* remove some debug codes
* facing 2m1t stuck issue
* 2m1t verified
* do not use torchrun
* working on 2m2t
* working on 2m2t
* initialize strategy in ray actor env
* facing actor's init order issue
* facing ddp model update issue (need to unwrap ddp)
* unwrap ddp actor
* checking 1m2t stuck problem
* nothing
* set timeout for trainer choosing. It solves the stuck problem!
* delete some debug output
* rename to sync with upstream
* rename to sync with upstream
* coati rename
* nothing
* I am going to detach the replay buffer from the trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations
* experience_maker_holder performs target-revolving _send_experience() instead of length comparison
* move code to ray subfolder
* working on pipeline inference
* apply comments

Co-authored-by: csric <richcsr256@gmail.com>

- 28 Mar, 2023 1 commit
Fazzie-Maqianli authored

- 14 Feb, 2023 1 commit
ver217 authored

- 10 Jan, 2023 1 commit
Frank Lee authored

* [workflow] auto comment with test coverage report
* polish code
* polish yaml

- 09 Jan, 2023 1 commit
Frank Lee authored

* [workflow] added coverage test
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code

- 06 Jan, 2023 1 commit
Frank Lee authored

* [setup] support pre-build and jit-build of cuda kernels
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code

- 30 Nov, 2022 1 commit
Frank Lee authored

* [setup] supported conda-installed torch
* polish code

- 08 Nov, 2022 1 commit
Jiarui Fang authored

- 01 Apr, 2022 1 commit
アマデウス authored

- 15 Feb, 2022 1 commit
アマデウス authored

added branch context; added vocab parallel layers; moved split_batch from load_batch to tensor parallel embedding layers; updated gpt model; updated unit test cases; fixed a few collective communicator bugs

- 28 Oct, 2021 1 commit
zbian authored