Commits · 9e768b598a69f1d7e2955418b30cb5897dff800f · OpenDAS / ColossalAI

26 Sep, 2023 1 commit

[lazy] support from_pretrained (#4801) · 4965c0da

Hongxin Liu authored Sep 26, 2023

* [lazy] patch from pretrained

* [lazy] fix from pretrained and add tests

* [devops] update ci

4965c0da

20 Sep, 2023 1 commit

[chat]: update rm, add wandb and fix bugs (#4471) · 7b9b8644

Wenhao Chen authored Sep 20, 2023



* feat: modify forward fn of critic and reward model

* feat: modify calc_action_log_probs

* to: add wandb in sft and rm trainer

* feat: update train_sft

* feat: update train_rm

* style: modify type annotation and add warning

* feat: pass tokenizer to ppo trainer

* to: modify trainer base and maker base

* feat: add wandb in ppo trainer

* feat: pass tokenizer to generate

* test: update generate fn tests

* test: update train tests

* fix: remove action_mask

* feat: remove unused code

* fix: fix wrong ignore_index

* fix: fix mock tokenizer

* chore: update requirements

* revert: modify make_experience

* fix: fix inference

* fix: add padding side

* style: modify _on_learn_batch_end

* test: use mock tokenizer

* fix: use bf16 to avoid overflow

* fix: fix workflow

* [chat] fix gemini strategy

* [chat] fix

* sync: update colossalai strategy

* fix: fix args and model dtype

* fix: fix checkpoint test

* fix: fix requirements

* fix: fix missing import and wrong arg

* fix: temporarily skip gemini test in stage 3

* style: apply pre-commit

* fix: temporarily skip gemini test in stage 1&2

---------
Co-authored-by: Mingyan Jiang <1829166702@qq.com>

7b9b8644

19 Sep, 2023 1 commit

[misc] update pre-commit and run all files (#4752) · 079bf3cb

Hongxin Liu authored Sep 19, 2023

* [misc] update pre-commit

* [misc] run pre-commit

* [misc] remove useless configuration files

* [misc] ignore cuda for clang-format

079bf3cb

18 Sep, 2023 1 commit

[legacy] clean up legacy code (#4743) · b5f9e37c

Hongxin Liu authored Sep 18, 2023

* [legacy] remove outdated codes of pipeline (#4692)

* [legacy] remove cli of benchmark and update optim (#4690)

* [legacy] remove cli of benchmark and update optim

* [doc] fix cli doc test

* [legacy] fix engine clip grad norm

* [legacy] remove outdated colo tensor (#4694)

* [legacy] remove outdated colo tensor

* [test] fix test import

* [legacy] move outdated zero to legacy (#4696)

* [legacy] clean up utils (#4700)

* [legacy] clean up utils

* [example] update examples

* [legacy] clean up amp

* [legacy] fix amp module

* [legacy] clean up gpc (#4742)

* [legacy] clean up context

* [legacy] clean core, constants and global vars

* [legacy] refactor initialize

* [example] fix examples ci

* [example] fix examples ci

* [legacy] fix tests

* [example] fix gpt example

* [example] fix examples ci

* [devops] fix ci installation

* [example] fix examples ci

b5f9e37c

11 Sep, 2023 1 commit
- [devops] fix concurrency group (#4667) · 536397cc
  Hongxin Liu authored Sep 11, 2023
  
  536397cc
08 Sep, 2023 1 commit

[devops] fix concurrency group and compatibility test (#4665) · a686f9dd

Hongxin Liu authored Sep 08, 2023

* [devops] fix concurrency group

* [devops] fix compatibility test

* [devops] fix tensornvme install

* [devops] fix tensornvme install

* [devops] fix colossalai install

a686f9dd

01 Sep, 2023 1 commit
- [shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575) · 38ccb8b1
  Baizhou Zhang authored Sep 01, 2023
  * hybrid plugin support huggingface from_pretrained

  * add huggingface compatibility tests

  * add folder cleaning

  * fix bugs
  38ccb8b1
30 Aug, 2023 2 commits
- [devops] cancel previous runs in the PR (#4546) · c7b60f75
  Hongxin Liu authored Aug 30, 2023
  
  c7b60f75
- [coati] update ci · 1c43bfd5
  ver217 authored Aug 30, 2023
  
  1c43bfd5
16 Aug, 2023 1 commit

[devops] add large-scale distributed test marker (#4452) · 26e29d58

Hongxin Liu authored Aug 16, 2023

* [test] remove cpu marker

* [test] remove gpu marker

* [test] update pytest markers

* [ci] update unit test ci

26e29d58

02 Aug, 2023 1 commit

[chat] fix bugs and add unit tests (#4213) · da4f7b85

Wenhao Chen authored Aug 02, 2023

* style: rename replay buffer

Experience replay is typically for off-policy algorithms.
Using this name in PPO may be misleading.

* fix: fix wrong zero2 default arg

* test: update experience tests

* style: rename zero_pad fn

* fix: defer init in CycledDataLoader

* test: add benchmark test

* style: rename internal fn of generation

* style: rename internal fn of lora

* fix: remove unused loss fn

* fix: remove unused utils fn

* refactor: remove generate_with_actor fn

* fix: fix type annotation

* test: add models tests

* fix: skip llama due to long execution time

* style: modify dataset

* style: apply formatter

* perf: update reward dataset

* fix: fix wrong IGNORE_INDEX in sft dataset

* fix: remove DataCollatorForSupervisedDataset

* test: add dataset tests

* style: apply formatter

* style: rename test_ci to test_train

* feat: add llama in inference

* test: add inference tests

* test: change test scripts directory

* fix: update ci

* fix: fix typo

* fix: skip llama due to oom

* fix: fix file mod

* style: apply formatter

* refactor: remove duplicated llama_gptq

* style: apply formatter

* to: update rm test

* feat: add tokenizer arg

* feat: add download model script

* test: update train tests

* fix: modify gemini load and save pretrained

* test: update checkpoint io test

* to: modify nproc_per_node

* fix: do not remove existing dir

* fix: modify save path

* test: add random choice

* fix: fix sft path

* fix: enlarge nproc_per_node to avoid oom

* fix: add num_retry

* fix: make lora config of rm and critic consistent

* fix: add warning about lora weights

* fix: skip some gpt2 tests

* fix: remove grad ckpt in rm and critic due to errors

* refactor: directly use Actor in train_sft

* test: add more arguments

* fix: disable grad ckpt when using lora

* fix: fix save_pretrained and related tests

* test: enable zero2 tests

* revert: remove useless fn

* style: polish code

* test: modify test args

da4f7b85

01 Aug, 2023 1 commit

[release] update version (#4332) · 80647712

Hongxin Liu authored Aug 01, 2023

* [release] update version

* [devops] hotfix cuda extension building

* [devops] pytest ignore useless folders

80647712

21 Jul, 2023 1 commit
- [ci] support testmon core pkg change detection (#4305) · 02192a63
  Hongxin Liu authored Jul 21, 2023
  
  02192a63
04 Jul, 2023 2 commits

[workflow] show test duration (#4159) · cc3cbe9f

Frank Lee authored Jul 04, 2023

cc3cbe9f

[chat] use official transformers and fix some issues (#4117) · 3d8d5d0d

Wenhao Chen authored Jul 04, 2023

* feat: remove on_learn_epoch fn as not used

* revert: add _on_learn_epoch fn

* feat: remove NaiveStrategy

* test: update train_prompts tests

* fix: remove prepare_llama_tokenizer_and_embedding

* test: add lora arg

* feat: remove roberta support in train_prompts due to runtime errs

* feat: remove deberta & roberta in rm as not used

* test: remove deberta and roberta tests

* feat: remove deberta and roberta models as not used

* fix: remove calls to roberta

* fix: remove prepare_llama_tokenizer_and_embedding

* chore: update transformers version

* docs: update transformers version

* fix: fix actor inference

* fix: fix ci

* feat: change llama pad token to unk

* revert: revert ddp setup_distributed

* fix: change llama pad token to unk

* revert: undo unnecessary changes

* fix: use pip to install transformers

3d8d5d0d

28 Jun, 2023 1 commit
- [workflow] added status check for test coverage workflow (#4106) · 1ee947f6
  Frank Lee authored Jun 28, 2023
  
  1ee947f6
22 Jun, 2023 1 commit
- [workflow] cover all public repositories in weekly report (#4069) · b463651f
  Frank Lee authored Jun 22, 2023
  
  b463651f
19 Jun, 2023 1 commit
- [devops] fix build on pr ci (#4043) · 4a81faa5
  Hongxin Liu authored Jun 19, 2023
  * [devops] fix build on pr ci

  * [devops] fix build on pr ci
  4a81faa5
13 Jun, 2023 1 commit
- [workflow] fixed the directory check in build (#3980) · 8bcad736
  Frank Lee authored Jun 13, 2023
  
  8bcad736
12 Jun, 2023 2 commits
- [workflow] cancel duplicated workflow jobs (#3960) · 6718a2f2
  Frank Lee authored Jun 12, 2023
  
  6718a2f2
- [workflow] cancel duplicated workflow jobs (#3960) · 4110d1f0
  Frank Lee authored Jun 12, 2023
  
  4110d1f0
09 Jun, 2023 1 commit
- fix typo .github/workflows/scripts/ (#3946) · 1aadeede
  digger yu authored Jun 09, 2023
  
  1aadeede
07 Jun, 2023 3 commits

[workflow] added docker latest tag for release (#3920) · 5e2132dc

Frank Lee authored Jun 07, 2023

5e2132dc

[devops] hotfix testmon cache clean logic (#3917) · c25d421f

Hongxin Liu authored Jun 07, 2023

c25d421f

[chat] add distributed PPO trainer (#3740) · b5f05663

Hongxin Liu authored Jun 07, 2023



* Detached ppo (#9)

* run the base

* working on dist ppo

* sync

* detached trainer

* update detached trainer. no maker update function

* facing init problem

* 1 maker 1 trainer detached run. but no model update

* facing cuda problem

* fix save functions

* verified maker update

* nothing

* add ignore

* analyze loss issue

* remove some debug codes

* facing 2m1t stuck issue

* 2m1t verified

* do not use torchrun

* working on 2m2t

* working on 2m2t

* initialize strategy in ray actor env

* facing actor's init order issue

* facing ddp model update issue (need to unwrap ddp)

* unwrap ddp actor

* checking 1m2t stuck problem

* nothing

* set timeout for trainer choosing. It solves the stuck problem!

* delete some debug output

* rename to sync with upstream

* rename to sync with upstream

* coati rename

* nothing

* I am going to detach the replay buffer from the trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations

* experience_maker_holder performs target-revolving _send_experience() instead of length comparison.

* move code to ray subfolder

* working on pipeline inference

* apply comments

* working on pipeline strategy. in progress.

* remove pipeline code. clean this branch

* update remote parameters by state_dict. no test

* nothing

* state_dict sharding transfer

* merge debug branch

* gemini _unwrap_model fix

* simplify code

* simplify code & fix LoRALinear AttributeError

* critic unwrapped state_dict

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] add performance evaluator and fix bugs (#10)

* [chat] add performance evaluator for ray

* [chat] refactor debug arg

* [chat] support hf config

* [chat] fix generation

* [chat] add 1mmt dummy example

* [chat] fix gemini ckpt

* split experience to send (#11)
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] refactor trainer and maker (#12)

* [chat] refactor experience maker holder

* [chat] refactor model init

* [chat] refactor trainer args

* [chat] refactor model init

* [chat] refactor trainer

* [chat] refactor experience sending logic and training loop args (#13)

* [chat] refactor experience send logic

* [chat] refactor trainer

* [chat] refactor trainer

* [chat] refactor experience maker

* [chat] refactor pbar

* [chat] refactor example folder (#14)

* [chat] support quant (#15)

* [chat] add quant

* [chat] add quant example

* prompt example (#16)

* prompt example

* prompt load csv data

* remove legacy try

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] add mmmt dummy example and refactor experience sending (#17)

* [chat] add mmmt dummy example

* [chat] refactor naive strategy

* [chat] fix stuck problem

* [chat] fix naive strategy

* [chat] optimize experience maker sending logic

* [chat] refactor sending assignment

* [chat] refactor performance evaluator (#18)

* Prompt Example & requires_grad state_dict & sharding state_dict (#19)

* prompt example

* prompt load csv data

* remove legacy try

* maker models require_grad set to False

* working on zero redundancy update

* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.

* remove legacy examples

* remove legacy examples

* remove replay buffer tp state. bad design

---------
Co-authored-by: csric <richcsr256@gmail.com>

* state_dict sending adapts to new unwrap function (#20)

* prompt example

* prompt load csv data

* remove legacy try

* maker models require_grad set to False

* working on zero redundancy update

* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.

* remove legacy examples

* remove legacy examples

* remove replay buffer tp state. bad design

* opt benchmark

* better script

* nothing

* [chat] strategy refactor unwrap model

* [chat] strategy refactor save model

* [chat] add docstr

* [chat] refactor trainer save model

* [chat] fix strategy typing

* [chat] refactor trainer save model

* [chat] update readme

* [chat] fix unit test

* working on lora reconstruction

* state_dict sending adapts to new unwrap function

* remove comments

---------
Co-authored-by: csric <richcsr256@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* [chat-ray] add readme (#21)

* add readme

* transparent graph

* add note background

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] get images from url (#22)

* Refactor/chat ray (#23)

* [chat] lora add todo

* [chat] remove unused pipeline strategy

* [chat] refactor example structure

* [chat] setup ci for ray

* [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24)

* lora support prototype

* lora support

* 1mmt lora & remove useless code

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] fix test ci for ray

* [chat] fix test ci requirements for ray

* [chat] fix ray runtime env

* [chat] fix ray runtime env

* [chat] fix example ci docker args

* [chat] add debug info in trainer

* [chat] add nccl debug info

* [chat] skip ray test

* [doc] fix typo

---------
Co-authored-by: csric <59389055+CsRic@users.noreply.github.com>
Co-authored-by: csric <richcsr256@gmail.com>

b5f05663

06 Jun, 2023 2 commits

[devops] hotfix CI about testmon cache (#3910) · 41fb7236

Hongxin Liu authored Jun 06, 2023

* [devops] hotfix CI about testmon cache

* [devops] fix testmon cache on pr

41fb7236

[devops] improving testmon cache (#3902) · ec9bbc00

Hongxin Liu authored Jun 06, 2023

* [devops] improving testmon cache

* [devops] fix branch name with slash

* [devops] fix branch name with slash

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] update readme

ec9bbc00

25 May, 2023 2 commits
- [workflow] fixed workflow check for docker build (#3849) · ae959a72
  Frank Lee authored May 25, 2023
  
  ae959a72
- [workflow] supported test on CUDA 10.2 (#3841) · 54e97ed7
  Frank Lee authored May 25, 2023
  
  54e97ed7
24 May, 2023 3 commits
- [workflow] fixed testmon cache in build CI (#3806) · 84500b77
  Frank Lee authored May 24, 2023
  * [workflow] fixed testmon cache in build CI

  * polish code
  84500b77
- [workflow] changed doc build to be on schedule and release (#3825) · 05b8a8de
  Frank Lee authored May 24, 2023
  * [workflow] changed doc build to be on schedule and release

  * polish code
  05b8a8de
- fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808) · 7f8203af
  digger yu authored May 24, 2023
  
  7f8203af
23 May, 2023 2 commits
- [workflow] enabled doc build from a forked repo (#3815) · 1e3b64f2
  Frank Lee authored May 23, 2023
  
  1e3b64f2
- [workflow] enable testing for develop & feature branch (#3801) · ad93c736
  Frank Lee authored May 23, 2023
  
  ad93c736
22 May, 2023 2 commits

[workflow] fixed the docker build workflow (#3794) · 788e07db

Frank Lee authored May 22, 2023

* [workflow] fixed the docker build workflow

* polish code

788e07db

Fix/docker action (#3266) · 4d29c0f8

liuzeming authored May 22, 2023



* [docker] Add ARG VERSION to determine the Tag

* [workflow] fixed the version in the release docker workflow

---------
Co-authored-by: liuzeming <liuzeming@4paradigm.com>

4d29c0f8

19 May, 2023 1 commit
- [devops] fix doc test on pr (#3782) · b4788d63
  Hongxin Liu authored May 19, 2023
  
  b4788d63
17 May, 2023 2 commits

[devops] fix ci for document check (#3751) · 5dd573c6

Hongxin Liu authored May 17, 2023

* [doc] add test info

* [devops] update doc check ci

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] remove debug info and update invalid doc

* [devops] add essential comments

5dd573c6

[devops] make build on PR run automatically (#3748) · c03bd7c6

Hongxin Liu authored May 17, 2023

* [devops] make build on PR run automatically

* [devops] update build on pr condition

c03bd7c6

15 May, 2023 1 commit

[devops] update torch version of CI (#3725) · afb239bb

Hongxin Liu authored May 15, 2023

* [test] fix flop tensor test

* [test] fix autochunk test

* [test] fix lazyinit test

* [devops] update torch version of CI

* [devops] enable testmon

* [devops] fix ci

* [devops] fix ci

* [test] fix checkpoint io test

* [test] fix cluster test

* [test] fix timm test

* [devops] fix ci

* [devops] fix ci

* [devops] fix ci

* [devops] fix ci

* [devops] force sync to test ci

* [test] skip fsdp test

afb239bb