1. 29 Mar, 2024 1 commit
    • [ColossalChat] Update RLHF V2 (#5286) · df5e9c53
      YeAnbang authored
      
      
      * Add dpo. Fix sft, ppo, lora. Refactor all
      
      * fix and tested ppo
      
      * 2nd round refactor
      
      * add ci tests
      
      * fix ci
      
      * fix ci
      
      * fix readme, style
      
      * fix readme style
      
      * fix style, fix benchmark
      
      * reproduce benchmark result, remove useless files
      
      * rename to ColossalChat
      
      * use new image
      
      * fix ci workflow
      
      * fix ci
      
      * use local model/tokenizer for ci tests
      
      * fix ci
      
      * fix ci
      
      * fix ci
      
      * fix ci timeout
      
      * fix rm progress bar. fix ci timeout
      
      * fix ci
      
      * fix ci typo
      
      * remove 3d plugin from ci temporarily
      
      * test environment
      
      * cannot save optimizer
      
      * support chat template
      
      * fix readme
      
      * fix path
      
      * test ci locally
      
      * restore build_or_pr
      
      * fix ci data path
      
      * fix benchmark
      
      * fix ci, move ci tests to 3080, disable fast tokenizer
      
      * move ci to 85
      
      * support flash attention 2
      
      * add all-in-one data preparation script. Fix colossal-llama2-chat chat template
      
      * add hardware requirements
      
      * move ci test data
      
      * fix save_model, add unwrap
      
      * fix missing bos
      
      * fix missing bos; support grad accumulation with gemini
      
      * fix ci
      
      * fix ci
      
      * fix ci
      
      * fix llama2 chat template config
      
      * debug sft
      
      * debug sft
      
      * fix colossalai version requirement
      
      * fix ci
      
      * add sanity check to prevent NaN loss
      
      * fix requirements
      
      * add dummy data generation script
      
      * add dummy data generation script
      
      * add dummy data generation script
      
      * add dummy data generation script
      
      * update readme
      
      * update readme
      
      * update readme and ignore
      
      * fix logger bug
      
      * support parallel_output
      
      * modify data preparation logic
      
      * fix tokenization
      
      * update lr
      
      * fix inference
      
      * run pre-commit
      
      ---------
      Co-authored-by: Tong Li <tong.li352711588@gmail.com>
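The "support chat template", "fix llama2 chat template config", and "fix missing bos" bullets above all concern the same step: rendering a list of role-tagged messages into a single prompt string, with the BOS token included. A minimal sketch of that idea in plain Python follows; the role tags and token strings are illustrative placeholders, not ColossalChat's actual template.

```python
# Sketch of a chat-template renderer: turn a list of
# {"role", "content"} messages into one prompt string.
# The tags and bos/eos strings are illustrative placeholders,
# not the real ColossalChat or Llama-2 template.

def render_chat(messages, bos="<s>", eos="</s>"):
    """Render messages into a single training/inference prompt."""
    parts = [bos]  # a missing BOS token was one of the bugs fixed above
    for msg in messages:
        parts.append(f"[{msg['role'].upper()}] {msg['content']}")
    parts.append(eos)
    return "\n".join(parts)

prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Real frameworks typically drive this from a per-model template config (e.g. a Jinja template shipped with the tokenizer), which is why a wrong template config surfaces as subtle bugs like a missing BOS rather than an outright crash.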
  2. 26 Jul, 2023 1 commit
  3. 17 Apr, 2023 1 commit
    • [chatgpt] Detached PPO Training (#3195) · e3551443
      csric authored
      
      
      * run the base
      
      * working on dist ppo
      
      * sync
      
      * detached trainer
      
      * update detached trainer. no maker update function
      
      * facing init problem
      
      * 1 maker 1 trainer detached run. but no model update
      
      * facing cuda problem
      
      * fix save functions
      
      * verified maker update
      
      * nothing
      
      * add ignore
      
      * analyze loss issue
      
      * remove some debug codes
      
      * facing 2m1t stuck issue
      
      * 2m1t verified
      
      * do not use torchrun
      
      * working on 2m2t
      
      * working on 2m2t
      
      * initialize strategy in ray actor env
      
      * facing actor's init order issue
      
      * facing ddp model update issue (need to unwrap ddp)
      
      * unwrap ddp actor
      
      * checking 1m2t stuck problem
      
      * nothing
      
      * set timeout for trainer choosing. It solves the stuck problem!
      
      * delete some debug output
      
      * rename to sync with upstream
      
      * rename to sync with upstream
      
      * coati rename
      
      * nothing
      
      * I am going to detach the replay buffer from the trainer and make it a Ray Actor. Two benefits: 1. support TP trainer; 2. asynchronous buffer operations
      
      * experience_maker_holder performs target-revolving _send_experience() instead of length comparison.
      
      * move code to ray subfolder
      
      * working on pipeline inference
      
      * apply comments
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
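The detached design this commit works toward — an experience maker producing rollouts, a trainer consuming them through a shared buffer, and a timeout on the trainer side so it cannot hang forever (cf. the "set timeout for trainer choosing. It solves the stuck problem!" bullet) — can be sketched with stdlib threads and a queue. Ray actors are replaced by threads here purely for illustration; this is not the actual coati implementation.

```python
import queue
import threading

# Stdlib sketch of the detached maker/trainer idea: a shared replay
# buffer sits between an experience maker and a trainer, and the
# trainer polls with a timeout so a stalled maker cannot deadlock it.

buffer = queue.Queue(maxsize=8)  # stands in for the detached replay buffer

def maker(n_items):
    """Produce fake 'experience' rollouts and push them to the buffer."""
    for i in range(n_items):
        buffer.put({"rollout_id": i, "reward": float(i)})

def trainer(results, n_items):
    """Consume experiences; the timeout avoids hanging on a stuck maker."""
    while len(results) < n_items:
        try:
            exp = buffer.get(timeout=1.0)
        except queue.Empty:
            break  # maker is gone or stuck; give up instead of blocking
        results.append(exp["reward"])

results = []
m = threading.Thread(target=maker, args=(4,))
t = threading.Thread(target=trainer, args=(results, 4))
m.start(); t.start()
m.join(); t.join()
print(sum(results))
```

Detaching the buffer this way is what makes the "2 makers, 1 trainer" and "2 makers, 2 trainers" topologies mentioned above possible: producers and consumers only share the buffer handle, not each other's lifecycles.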
  4. 28 Mar, 2023 1 commit
  5. 14 Feb, 2023 1 commit
  6. 10 Jan, 2023 1 commit
  7. 09 Jan, 2023 1 commit
    • [worfklow] added coverage test (#2399) · 53bb8682
      Frank Lee authored
      * [worfklow] added coverage test
      
      * polish code
      
      * polish code
      
      * polish code
      
      * polish code
      
      * polish code
      
      * polish code
      
      * polish code
      
      * polish code
  8. 06 Jan, 2023 1 commit
  9. 30 Nov, 2022 1 commit
  10. 08 Nov, 2022 1 commit
  11. 01 Apr, 2022 1 commit
  12. 15 Feb, 2022 1 commit
    • moved env variables to global variables; (#215) · 9ee197d0
      アマデウス authored
      added branch context;
      added vocab parallel layers;
      moved split_batch from load_batch to tensor parallel embedding layers;
      updated gpt model;
      updated unit test cases;
      fixed a few collective communicator bugs
  13. 28 Oct, 2021 1 commit