Commits · 806477121d960a11c45d37c48247249201f97e97 · OpenDAS / ColossalAI

01 Aug, 2023 1 commit

[release] update version (#4332) · 80647712

Hongxin Liu authored Aug 01, 2023

* [release] update version

* [devops] hotfix cuda extension building

* [devops] pytest ignore useless folders

80647712

21 Jul, 2023 1 commit
- [ci] support testmon core pkg change detection (#4305) · 02192a63
  Hongxin Liu authored Jul 21, 2023
  
  02192a63
04 Jul, 2023 2 commits

[workflow] show test duration (#4159) · cc3cbe9f
Frank Lee authored Jul 04, 2023

cc3cbe9f

[chat] use official transformers and fix some issues (#4117) · 3d8d5d0d

Wenhao Chen authored Jul 04, 2023

* feat: remove on_learn_epoch fn as not used

* revert: add _on_learn_epoch fn

* feat: remove NaiveStrategy

* test: update train_prompts tests

* fix: remove prepare_llama_tokenizer_and_embedding

* test: add lora arg

* feat: remove roberta support in train_prompts due to runtime errs

* feat: remove deberta & roberta in rm as not used

* test: remove deberta and roberta tests

* feat: remove deberta and roberta models as not used

* fix: remove calls to roberta

* fix: remove prepare_llama_tokenizer_and_embedding

* chore: update transformers version

* docs: update transformers version

* fix: fix actor inference

* fix: fix ci

* feat: change llama pad token to unk

* revert: revert ddp setup_distributed

* fix: change llama pad token to unk

* revert: undo unnecessary changes

* fix: use pip to install transformers

3d8d5d0d

28 Jun, 2023 1 commit
- [workflow] added status check for test coverage workflow (#4106) · 1ee947f6
  Frank Lee authored Jun 28, 2023
  
  1ee947f6
22 Jun, 2023 1 commit
- [workflow] cover all public repositories in weekly report (#4069) · b463651f
  Frank Lee authored Jun 22, 2023
  
  b463651f
19 Jun, 2023 1 commit
- [devops] fix build on pr ci (#4043) · 4a81faa5
  Hongxin Liu authored Jun 19, 2023
```
* [devops] fix build on pr ci

* [devops] fix build on pr ci
```
  4a81faa5
13 Jun, 2023 1 commit
- [workflow] fixed the directory check in build (#3980) · 8bcad736
  Frank Lee authored Jun 13, 2023
  
  8bcad736
12 Jun, 2023 2 commits
- [workflow] cancel duplicated workflow jobs (#3960) · 6718a2f2
  Frank Lee authored Jun 12, 2023
  
  6718a2f2
- [workflow] cancel duplicated workflow jobs (#3960) · 4110d1f0
  Frank Lee authored Jun 12, 2023
  
  4110d1f0
09 Jun, 2023 1 commit
- fix typo .github/workflows/scripts/ (#3946) · 1aadeede
  digger yu authored Jun 09, 2023
  
  1aadeede
07 Jun, 2023 3 commits

[workflow] added docker latest tag for release (#3920) · 5e2132dc
Frank Lee authored Jun 07, 2023

5e2132dc

[devops] hotfix testmon cache clean logic (#3917) · c25d421f
Hongxin Liu authored Jun 07, 2023

c25d421f

[chat] add distributed PPO trainer (#3740) · b5f05663

Hongxin Liu authored Jun 07, 2023

* Detached ppo (#9)

* run the base

* working on dist ppo

* sync

* detached trainer

* update detached trainer. no maker update function

* facing init problem

* 1 maker 1 trainer detached run. but no model update

* facing cuda problem

* fix save functions

* verified maker update

* nothing

* add ignore

* analyze loss issue

* remove some debug codes

* facing 2m1t stuck issue

* 2m1t verified

* do not use torchrun

* working on 2m2t

* working on 2m2t

* initialize strategy in ray actor env

* facing actor's init order issue

* facing ddp model update issue (need to unwrap ddp)

* unwrap ddp actor

* checking 1m2t stuck problem

* nothing

* set timeout for trainer choosing. It solves the stuck problem!

* delete some debug output

* rename to sync with upstream

* rename to sync with upstream

* coati rename

* nothing

* I am going to detach the replay buffer from the trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations

* experience_maker_holder performs target-revolving _send_experience() instead of length comparison.

* move code to ray subfolder

* working on pipeline inference

* apply comments

* working on pipeline strategy. in progress.

* remove pipeline code. clean this branch

* update remote parameters by state_dict. no test

* nothing

* state_dict sharding transfer

* merge debug branch

* gemini _unwrap_model fix

* simplify code

* simplify code & fix LoRALinear AttributeError

* critic unwrapped state_dict

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] add performance evaluator and fix bugs (#10)

* [chat] add performance evaluator for ray

* [chat] refactor debug arg

* [chat] support hf config

* [chat] fix generation

* [chat] add 1mmt dummy example

* [chat] fix gemini ckpt

* split experience to send (#11)
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] refactor trainer and maker (#12)

* [chat] refactor experience maker holder

* [chat] refactor model init

* [chat] refactor trainer args

* [chat] refactor model init

* [chat] refactor trainer

* [chat] refactor experience sending logic and training loop args (#13)

* [chat] refactor experience send logic

* [chat] refactor trainer

* [chat] refactor trainer

* [chat] refactor experience maker

* [chat] refactor pbar

* [chat] refactor example folder (#14)

* [chat] support quant (#15)

* [chat] add quant

* [chat] add quant example

* prompt example (#16)

* prompt example

* prompt load csv data

* remove legacy try

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] add mmmt dummy example and refactor experience sending (#17)

* [chat] add mmmt dummy example

* [chat] refactor naive strategy

* [chat] fix stuck problem

* [chat] fix naive strategy

* [chat] optimize experience maker sending logic

* [chat] refactor sending assignment

* [chat] refactor performance evaluator (#18)

* Prompt Example & requires_grad state_dict & sharding state_dict (#19)

* prompt example

* prompt load csv data

* remove legacy try

* maker models require_grad set to False

* working on zero redundancy update

* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.

* remove legacy examples

* remove legacy examples

* remove replay buffer tp state. bad design

---------
Co-authored-by: csric <richcsr256@gmail.com>

* state_dict sending adapts to new unwrap function (#20)

* prompt example

* prompt load csv data

* remove legacy try

* maker models require_grad set to False

* working on zero redundancy update

* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.

* remove legacy examples

* remove legacy examples

* remove replay buffer tp state. bad design

* opt benchmark

* better script

* nothing

* [chat] strategy refactor unwrap model

* [chat] strategy refactor save model

* [chat] add docstr

* [chat] refactor trainer save model

* [chat] fix strategy typing

* [chat] refactor trainer save model

* [chat] update readme

* [chat] fix unit test

* working on lora reconstruction

* state_dict sending adapts to new unwrap function

* remove comments

---------
Co-authored-by: csric <richcsr256@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* [chat-ray] add readme (#21)

* add readme

* transparent graph

* add note background

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] get images from url (#22)

* Refactor/chat ray (#23)

* [chat] lora add todo

* [chat] remove unused pipeline strategy

* [chat] refactor example structure

* [chat] setup ci for ray

* [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24)

* lora support prototype

* lora support

* 1mmt lora & remove useless code

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] fix test ci for ray

* [chat] fix test ci requirements for ray

* [chat] fix ray runtime env

* [chat] fix ray runtime env

* [chat] fix example ci docker args

* [chat] add debug info in trainer

* [chat] add nccl debug info

* [chat] skip ray test

* [doc] fix typo

---------
Co-authored-by: csric <59389055+CsRic@users.noreply.github.com>
Co-authored-by: csric <richcsr256@gmail.com>

b5f05663

06 Jun, 2023 2 commits

[devops] hotfix CI about testmon cache (#3910) · 41fb7236
Hongxin Liu authored Jun 06, 2023
```
* [devops] hotfix CI about testmon cache

* [devops] fix testmon cache on pr
```
41fb7236

[devops] improving testmon cache (#3902) · ec9bbc00

Hongxin Liu authored Jun 06, 2023

* [devops] improving testmon cache

* [devops] fix branch name with slash

* [devops] fix branch name with slash

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] update readme

ec9bbc00

25 May, 2023 2 commits
- [workflow] fixed workflow check for docker build (#3849) · ae959a72
  Frank Lee authored May 25, 2023
  
  ae959a72
- [workflow] supported test on CUDA 10.2 (#3841) · 54e97ed7
  Frank Lee authored May 25, 2023
  
  54e97ed7
24 May, 2023 3 commits
- [workflow] fixed testmon cache in build CI (#3806) · 84500b77
  Frank Lee authored May 24, 2023
```
* [workflow] fixed testmon cache in build CI

* polish code
```
  84500b77
- [workflow] changed to doc build to be on schedule and release (#3825) · 05b8a8de
  Frank Lee authored May 24, 2023
```
* [workflow] changed to doc build to be on schedule and release

* polish code
```
  05b8a8de
- fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808) · 7f8203af
  digger yu authored May 24, 2023
  
  7f8203af
23 May, 2023 2 commits
- [workflow] enabled doc build from a forked repo (#3815) · 1e3b64f2
  Frank Lee authored May 23, 2023
  
  1e3b64f2
- [workflow] enable testing for develop & feature branch (#3801) · ad93c736
  Frank Lee authored May 23, 2023
  
  ad93c736
22 May, 2023 2 commits

[workflow] fixed the docker build workflow (#3794) · 788e07db
Frank Lee authored May 22, 2023
```
* [workflow] fixed the docker build workflow

* polish code
```
788e07db

Fix/docker action (#3266) · 4d29c0f8

liuzeming authored May 22, 2023

* [docker] Add ARG VERSION to determine the Tag

* [workflow] fixed the version in the release docker workflow

---------
Co-authored-by: liuzeming <liuzeming@4paradigm.com>

4d29c0f8

19 May, 2023 1 commit
- [devops] fix doc test on pr (#3782) · b4788d63
  Hongxin Liu authored May 19, 2023
  
  b4788d63
17 May, 2023 2 commits

[devops] fix ci for document check (#3751) · 5dd573c6

Hongxin Liu authored May 17, 2023

* [doc] add test info

* [devops] update doc check ci

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] remove debug info and update invalid doc

* [devops] add essential comments

5dd573c6

[devops] make build on PR run automatically (#3748) · c03bd7c6
Hongxin Liu authored May 17, 2023
```
* [devops] make build on PR run automatically

* [devops] update build on pr condition
```
c03bd7c6

15 May, 2023 1 commit

[devops] update torch version of CI (#3725) · afb239bb

Hongxin Liu authored May 15, 2023

* [test] fix flop tensor test

* [test] fix autochunk test

* [test] fix lazyinit test

* [devops] update torch version of CI

* [devops] enable testmon

* [devops] fix ci

* [devops] fix ci

* [test] fix checkpoint io test

* [test] fix cluster test

* [test] fix timm test

* [devops] fix ci

* [devops] fix ci

* [devops] fix ci

* [devops] fix ci

* [devops] force sync to test ci

* [test] skip fsdp test

afb239bb

26 Apr, 2023 1 commit

[gemini] accelerate inference (#3641) · 50793b35

Hongxin Liu authored Apr 26, 2023

* [gemini] support don't scatter after inference

* [chat] update colossalai strategy

* [chat] fix opt benchmark

* [chat] update opt benchmark

* [gemini] optimize inference

* [test] add gemini inference test

* [chat] fix unit test ci

* [chat] fix ci

* [chat] fix ci

* [chat] skip checkpoint test

50793b35

24 Apr, 2023 1 commit
- [devops] fix chat ci (#3628) · 179558a8
  Hongxin Liu authored Apr 24, 2023
  
  179558a8
20 Apr, 2023 1 commit
- [doc] .github/workflows/README.md (#3605) · 633bac2f
  digger-yu authored Apr 20, 2023
```
Fixed several word spelling errors
change "compatiblity" to "compatibility" etc.
```
  633bac2f
18 Apr, 2023 1 commit

Update test_ci.sh · 36a519b4

Camille Zhong authored Mar 22, 2023

update

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update test_ci.sh

Update test_ci.sh

update

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

update ci

Update test_ci.sh

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update test_ci.sh

Update test_ci.sh

Update run_chatgpt_examples.yml

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

update test ci

RoBERTa for RLHF Stage 2 & 3 (still in testing)

Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)"

This reverts commit 06741d894dcbe958acd4e10d771f22275e20e368.

Add RoBERTa for RLHF stage 2 & 3

1. add roberta folder under model folder
2. add roberta option in train_reward_model.py
3. add some test in testci

Update test_ci.sh

Revert "Update test_ci.sh"

This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a.

Add RoBERTa for RLHF Stage 2 & 3 (test)

RoBERTa for RLHF Stage 2 & 3 (still in testing)

Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)"

This reverts commit 06741d894dcbe958acd4e10d771f22275e20e368.

Add RoBERTa for RLHF stage 2 & 3

1. add roberta folder under model folder
2. add roberta option in train_reward_model.py
3. add some test in testci

Update test_ci.sh

Revert "Update test_ci.sh"

This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a.

update roberta with coati

chat ci update

Revert "chat ci update"

This reverts commit 17ae7ae01fa752bd3289fc39069868fde99cf846.

[test]chat_update_ci

Update test_ci.sh

Update test_ci.sh

test

Update gpt_critic.py

Update gpt_critic.py

Update run_chatgpt_unit_tests.yml

update test ci

update

update

update

update

Update test_ci.sh

update

Update test_ci.sh

Update test_ci.sh

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

36a519b4

17 Apr, 2023 1 commit
- [doc] Update .github/workflows/README.md (#3577) · 6e7e43c6
  digger-yu authored Apr 17, 2023
```
Optimization Code
I think there were two extra $ entered here, which have been deleted
```
  6e7e43c6
06 Apr, 2023 1 commit

[test] refactor tests with spawn (#3452) · 80eba05b

Frank Lee authored Apr 06, 2023

* [test] added spawn decorator

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

80eba05b

27 Mar, 2023 1 commit
- [CI] Fix pre-commit workflow (#3238) · 1653063f
  Hakjin Lee authored Mar 27, 2023
  
  1653063f
14 Mar, 2023 1 commit
- [workflow] purged extension cache before GPT test (#3128) · 169ed4d2
  Frank Lee authored Mar 14, 2023
  
  169ed4d2
09 Mar, 2023 1 commit
- [workflow] fixed doc build trigger condition (#3072) · 91ccf975
  Frank Lee authored Mar 09, 2023
  
  91ccf975
07 Mar, 2023 2 commits

[workflow] supported conda package installation in doc test (#3028) · 8fedc876

Frank Lee authored Mar 07, 2023

* [workflow] supported conda package installation in doc test

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

8fedc876

[workflow] fixed the post-commit failure when no formatting needed (#3020) · 2cd6ba30
Frank Lee authored Mar 07, 2023
```
* [workflow] fixed the post-commit failure when no formatting needed

* polish code

* polish code

* polish code
```
2cd6ba30