1. 08 Jun, 2023 4 commits
  2. 07 Jun, 2023 8 commits
    • [nfc] fix typo colossalai/zero (#3923) · de0d7df3
      digger yu authored
    • digger yu · a9d1cadc
    • [example] Modify palm example with the new booster API (#3913) · b306cecf
      Liu Ziming authored
      * Modify torch version requirement to adapt to torch 2.0
      
      * modify palm example using new booster API
      
      * roll back
      
      * fix port
      
      * polish
      
      * polish
    • wukong1992 · a55fb00c
    • Frank Lee · 5e2132dc
    • Hongxin Liu · c25d421f
    • Hongxin Liu · 9c88b6cb
    • [chat] add distributed PPO trainer (#3740) · b5f05663
      Hongxin Liu authored

      * Detached ppo (#9)
      
      * run the base
      
      * working on dist ppo
      
      * sync
      
      * detached trainer
      
      * update detached trainer. no maker update function
      
      * facing init problem
      
      * 1 maker 1 trainer detached run. but no model update
      
      * facing cuda problem
      
      * fix save functions
      
      * verified maker update
      
      * nothing
      
      * add ignore
      
      * analyze loss issue
      
      * remove some debug codes
      
      * facing 2m1t stuck issue
      
      * 2m1t verified
      
      * do not use torchrun
      
      * working on 2m2t
      
      * working on 2m2t
      
      * initialize strategy in ray actor env
      
      * facing actor's init order issue
      
      * facing ddp model update issue (need to unwrap ddp)
      
      * unwrap ddp actor
      
      * checking 1m2t stuck problem
      
      * nothing
      
      * set timeout for trainer choosing. It solves the stuck problem!
      
      * delete some debug output
      
      * rename to sync with upstream
      
      * rename to sync with upstream
      
      * coati rename
      
      * nothing
      
      * I am going to detach the replay buffer from the trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations
      
      * experience_maker_holder performs target-revolving _send_experience() instead of length comparison.
      
      * move code to ray subfolder
      
      * working on pipeline inference
      
      * apply comments
      
      * working on pipeline strategy. in progress.
      
      * remove pipeline code. clean this branch
      
      * update remote parameters by state_dict. no test
      
      * nothing
      
      * state_dict sharding transfer
      
      * merge debug branch
      
      * gemini _unwrap_model fix
      
      * simplify code
      
      * simplify code & fix LoRALinear AttributeError
      
      * critic unwrapped state_dict
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] add performance evaluator and fix bugs (#10)
      
      * [chat] add performance evaluator for ray
      
      * [chat] refactor debug arg
      
      * [chat] support hf config
      
      * [chat] fix generation
      
      * [chat] add 1mmt dummy example
      
      * [chat] fix gemini ckpt
      
      * split experience to send (#11)
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] refactor trainer and maker (#12)
      
      * [chat] refactor experience maker holder
      
      * [chat] refactor model init
      
      * [chat] refactor trainer args
      
      * [chat] refactor model init
      
      * [chat] refactor trainer
      
      * [chat] refactor experience sending logic and training loop args (#13)
      
      * [chat] refactor experience send logic
      
      * [chat] refactor trainer
      
      * [chat] refactor trainer
      
      * [chat] refactor experience maker
      
      * [chat] refactor pbar
      
      * [chat] refactor example folder (#14)
      
      * [chat] support quant (#15)
      
      * [chat] add quant
      
      * [chat] add quant example
      
      * prompt example (#16)
      
      * prompt example
      
      * prompt load csv data
      
      * remove legacy try
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] add mmmt dummy example and refactor experience sending (#17)
      
      * [chat] add mmmt dummy example
      
      * [chat] refactor naive strategy
      
      * [chat] fix stuck problem
      
      * [chat] fix naive strategy
      
      * [chat] optimize experience maker sending logic
      
      * [chat] refactor sending assignment
      
      * [chat] refactor performance evaluator (#18)
      
      * Prompt Example & requires_grad state_dict & sharding state_dict (#19)
      
      * prompt example
      
      * prompt load csv data
      
      * remove legacy try
      
      * maker models require_grad set to False
      
      * working on zero redundancy update
      
      * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.
      
      * remove legacy examples
      
      * remove legacy examples
      
      * remove replay buffer tp state. bad design
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * state_dict sending adapts to new unwrap function (#20)
      
      * prompt example
      
      * prompt load csv data
      
      * remove legacy try
      
      * maker models require_grad set to False
      
      * working on zero redundancy update
      
      * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.
      
      * remove legacy examples
      
      * remove legacy examples
      
      * remove replay buffer tp state. bad design
      
      * opt benchmark
      
      * better script
      
      * nothing
      
      * [chat] strategy refactor unwrap model
      
      * [chat] strategy refactor save model
      
      * [chat] add docstr
      
      * [chat] refactor trainer save model
      
      * [chat] fix strategy typing
      
      * [chat] refactor trainer save model
      
      * [chat] update readme
      
      * [chat] fix unit test
      
      * working on lora reconstruction
      
      * state_dict sending adapts to new unwrap function
      
      * remove comments
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      Co-authored-by: ver217 <lhx0217@gmail.com>
      
      * [chat-ray] add readme (#21)
      
      * add readme
      
      * transparent graph
      
      * add note background
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] get images from url (#22)
      
      * Refactor/chat ray (#23)
      
      * [chat] lora add todo
      
      * [chat] remove unused pipeline strategy
      
      * [chat] refactor example structure
      
      * [chat] setup ci for ray
      
      * [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24)
      
      * lora support prototype
      
      * lora support
      
      * 1mmt lora & remove useless code
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] fix test ci for ray
      
      * [chat] fix test ci requirements for ray
      
      * [chat] fix ray runtime env
      
      * [chat] fix ray runtime env
      
      * [chat] fix example ci docker args
      
      * [chat] add debug info in trainer
      
      * [chat] add nccl debug info
      
      * [chat] skip ray test
      
      * [doc] fix typo
      
      ---------
      Co-authored-by: csric <59389055+CsRic@users.noreply.github.com>
      Co-authored-by: csric <richcsr256@gmail.com>
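Several bullets in this PR ("unwrap ddp actor", "update remote parameters by state_dict", "critic unwrapped state_dict") revolve around one detail: a DDP-wrapped actor keeps the real model under a `.module` attribute, so its state_dict keys carry a `module.` prefix that a remote experience maker would not expect. A minimal pure-Python sketch of that unwrapping step — `DDPWrapper` and `Actor` here are hypothetical stand-ins, not ColossalAI's actual classes:

```python
class DDPWrapper:
    """Hypothetical stand-in for DistributedDataParallel: like DDP,
    it holds the wrapped model in a `.module` attribute."""
    def __init__(self, module):
        self.module = module

def unwrap_model(model):
    # Peel off wrapper layers until the innermost model is reached, so
    # the state_dict sent to remote makers has no "module." key prefix.
    while hasattr(model, "module"):
        model = model.module
    return model

class Actor:
    """Hypothetical inner model."""

inner = Actor()
print(unwrap_model(DDPWrapper(DDPWrapper(inner))) is inner)  # True
```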
  3. 06 Jun, 2023 4 commits
    • [devops] hotfix CI about testmon cache (#3910) · 41fb7236
      Hongxin Liu authored
      * [devops] hotfix CI about testmon cache
      
      * [devops] fix testmon cache on pr
    • [nfc] fix typo colossalai/pipeline tensor nn (#3899) · 0e484e62
      digger yu authored
      * fix typo colossalai/autochunk auto_parallel amp
      
      * fix typo colossalai/auto_parallel nn utils etc.
      
      * fix typo colossalai/auto_parallel autochunk fx/passes etc.
      
      * fix typo docs/
      
      * change placememt_policy to placement_policy in docs/ and examples/
      
      * fix typo colossalai/ applications/
      
      * fix typo colossalai/cli fx kernel
      
      * fix typo colossalai/nn
      
      * revert change warmuped
      
      * fix typo colossalai/pipeline tensor nn
    • Baizhou Zhang · c1535ccb
    • [devops] improving testmon cache (#3902) · ec9bbc00
      Hongxin Liu authored
      * [devops] improving testmon cache
      
      * [devops] fix branch name with slash
      
      * [devops] fix branch name with slash
      
      * [devops] fix edit action
      
      * [devops] fix edit action
      
      * [devops] fix edit action
      
      * [devops] fix edit action
      
      * [devops] fix edit action
      
      * [devops] fix edit action
      
      * [devops] update readme
  4. 05 Jun, 2023 6 commits
    • support evaluation for english (#3880) · 57a6d768
      Yuanchen authored
      
      Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
    • [nfc] fix typo colossalai/nn (#3887) · 18787497
      digger yu authored
      * fix typo colossalai/autochunk auto_parallel amp
      
      * fix typo colossalai/auto_parallel nn utils etc.
      
      * fix typo colossalai/auto_parallel autochunk fx/passes etc.
      
      * fix typo docs/
      
      * change placememt_policy to placement_policy in docs/ and examples/
      
      * fix typo colossalai/ applications/
      
      * fix typo colossalai/cli fx kernel
      
      * fix typo colossalai/nn
      
      * revert change warmuped
    • [bf16] add bf16 support (#3882) · ae02d4e4
      Hongxin Liu authored
      * [bf16] add bf16 support for fused adam (#3844)
      
      * [bf16] fused adam kernel support bf16
      
      * [test] update fused adam kernel test
      
      * [test] update fused adam test
      
      * [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860)
      
      * [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869)
      
      * [bf16] add mixed precision mixin
      
      * [bf16] low level zero optim support bf16
      
      * [test] update low level zero test
      
      * [test] fix low level zero grad acc test
      
      * [bf16] add bf16 support for gemini (#3872)
      
      * [bf16] gemini support bf16
      
      * [test] update gemini bf16 test
      
      * [doc] update gemini docstring
      
      * [bf16] add bf16 support for plugins (#3877)
      
      * [bf16] add bf16 support for legacy zero (#3879)
      
      * [zero] init context support bf16
      
      * [zero] legacy zero support bf16
      
      * [test] add zero bf16 test
      
      * [doc] add bf16 related docstring for legacy zero
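For context on why bf16 support spans optimizers, zero, and gemini: bfloat16 keeps fp32's 8-bit exponent (so, unlike fp16, no loss scaling is needed for range) and truncates the mantissa from 23 to 7 bits. A pure-Python illustration of that rounding, assuming round-to-nearest-even on the dropped bits — this is not the PR's code, which does the conversion inside fused CUDA kernels:

```python
import struct

def to_bf16(x: float) -> float:
    """Round a Python float to bfloat16 precision: keep only the top 16
    bits of the fp32 pattern (sign, 8-bit exponent, 7-bit mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # round-to-nearest-even on the 16 low bits being dropped
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_bf16(3.141592653589793))  # 3.140625: only ~3 decimal digits survive
print(to_bf16(1e30))               # still finite: bf16 shares fp32's range
```

The 7-bit mantissa is why mixed-precision setups keep fp32 master weights in the optimizer and cast to bf16 only for forward/backward.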
    • [doc]update moe chinese document. (#3890) · 07cb2114
      jiangmingyan authored
      * [doc]update-moe
      
      * [doc]update-moe
      
      * [doc]update-moe
      
      * [doc]update-moe
      
      * [doc]update-moe
    • Liu Ziming · 8065cc5f
    • [lazy] refactor lazy init (#3891) · dbb32692
      Hongxin Liu authored
      * [lazy] remove old lazy init
      
      * [lazy] refactor lazy init folder structure
      
      * [lazy] fix lazy tensor deepcopy
      
      * [test] update lazy init test
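The idea behind lazy init (and the deepcopy fix listed above) can be sketched without ColossalAI: a lazy tensor records its construction recipe and defers allocation until it is materialized (e.g. after the model has been sharded), and deepcopy must clone the recipe rather than force materialization. An illustrative placeholder, not the PR's actual `LazyTensor`:

```python
import copy

class LazyTensor:
    """Hypothetical sketch: record shape/fill, allocate nothing until
    materialize() is called (a real impl would build a device tensor)."""
    def __init__(self, shape, fill=0.0):
        self.shape, self.fill = tuple(shape), fill
        self._data = None          # no storage allocated yet

    def materialize(self):
        if self._data is None:
            n = 1
            for d in self.shape:
                n *= d
            self._data = [self.fill] * n
        return self._data

    def __deepcopy__(self, memo):
        # Clone the recipe; do NOT materialize just to copy.
        clone = LazyTensor(self.shape, self.fill)
        clone._data = copy.deepcopy(self._data, memo)
        return clone

t = LazyTensor((2, 3))
u = copy.deepcopy(t)
print(t._data is None and u._data is None)  # True: still lazy after deepcopy
print(len(t.materialize()))                 # 6: allocated on first use
```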
  5. 02 Jun, 2023 1 commit
    • [nfc] fix typo colossalai/cli fx kernel (#3847) · 70c8cdec
      digger yu authored
      * fix typo colossalai/autochunk auto_parallel amp
      
      * fix typo colossalai/auto_parallel nn utils etc.
      
      * fix typo colossalai/auto_parallel autochunk fx/passes etc.
      
      * fix typo docs/
      
      * change placememt_policy to placement_policy in docs/ and examples/
      
      * fix typo colossalai/ applications/
      
      * fix typo colossalai/cli fx kernel
  6. 30 May, 2023 3 commits
    • [doc] update document of zero with chunk. (#3855) · 281b33f3
      jiangmingyan authored
      * [doc] fix title of mixed precision
      
      * [doc]update document of zero with chunk
      
      * [doc] update document of zero with chunk, fix
      
      * [doc] update document of zero with chunk, fix
      
      * [doc] update document of zero with chunk, fix
      
      * [doc] update document of zero with chunk, add doc test
      
      * [doc] update document of zero with chunk, add doc test
      
      * [doc] update document of zero with chunk, fix installation
      
      * [doc] update document of zero with chunk, fix zero with chunk doc
      
      * [doc] update document of zero with chunk, fix zero with chunk doc
    • [example] update gemini examples (#3868) · 5f79008c
      jiangmingyan authored
      * [example]update gemini examples
      
      * [example]update gemini examples
    • [evaluation] improvement on evaluation (#3862) · 2506e275
      Yuanchen authored

      * fix a bug when the config file contains one category but the answer file doesn't contain that category
      
      * fix Chinese prompt file
      
      * support gpt-3.5-turbo and gpt-4 evaluation
      
      * polish and update README
      
      * resolve pr comments
      
      ---------
      Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
  7. 25 May, 2023 7 commits
  8. 24 May, 2023 7 commits