Commits · b3ab7fbabf3b72805403b82eeb79a6155d72004f · OpenDAS / ColossalAI

12 Jun, 2023 1 commit
- [example] update ViT example using booster api (#3940) · b3ab7fba
  Baizhou Zhang authored Jun 12, 2023
  
  b3ab7fba
09 Jun, 2023 6 commits
- fix typo .github/workflows/scripts/ (#3946) · 1aadeede
  digger yu authored Jun 09, 2023
  
  1aadeede
- fix typo tests/ (#3936) · e61ffc77
  digger yu authored Jun 09, 2023
  
  e61ffc77
- Merge pull request #3942 from hpcaitech/revert-3931-sync/develop-to-shardformer · bd2c7c32
  FoolPlayer authored Jun 09, 2023
```
Revert "[sync] sync feature/shardformer with develop"
```
  bd2c7c32
- Revert "[sync] sync feature/shardformer with develop" · ddcf58ca
  Frank Lee authored Jun 09, 2023
  
  ddcf58ca
- Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer · 24651fdd
  FoolPlayer authored Jun 09, 2023
```
[sync] sync feature/shardformer with develop
```
  24651fdd
- Merge pull request #3905 from MaruyamaAya/dreambooth · e277534a
  Liu Ziming authored Jun 09, 2023
```
[example] Adding an example of training dreambooth with the new booster API
```
  e277534a
08 Jun, 2023 20 commits
- support UniEval and add CHRF metric (#3924) · 21c4c0b1
  Yuanchen authored Jun 08, 2023
```
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
```
  21c4c0b1
- fix typo examples and docs (#3932) · 33eef714
  digger yu authored Jun 08, 2023
  
  33eef714
- [shardformer] add gpt2 policy and modify shard and slicer to support (#3883) · ef153775
  FoolPlayer authored Jun 07, 2023
```
* add gpt2 policy and modify shard and slicer to support

* remove unused code

* polish code
```
  ef153775
- update README (#3909) · 6370a935
  FoolPlayer authored Jun 06, 2023
  
  6370a935
- [shardformer] add Dropout layer support different dropout pattern (#3856) · 21a3915c
  FoolPlayer authored Jun 01, 2023
```
* add dropout layer, add dropout test

* modify seed manager as context manager

* add a copy of col_nn.layer

* add dist_crossentropy loss; separate module test

* polish the code

* fix dist crossentropy loss
```
  21a3915c
- [shardformer] update readme with modules implement doc (#3834) · 997544c1
  FoolPlayer authored May 24, 2023
```
* update readme with modules content

* remove img
```
  997544c1
- [shardformer] refactored the user api (#3828) · 537a52b7
  Frank Lee authored May 24, 2023
```
* [shardformer] refactored the user api

* polish code
```
  537a52b7
- [shardformer] updated readme (#3827) · bc19024b
  Frank Lee authored May 24, 2023
  
  bc19024b
- [shardformer]: Feature/shardformer, add some docstring and readme (#3816) · 58f64324
  FoolPlayer authored May 24, 2023
```
* init shardformer code structure

* add implement of sharder (inject and replace)

* add implement of replace layer to colossal layer

* separate different layer policy, add some notion

* implement 1d and 2d slicer, can tell col or row

* fix bug when slicing and inject model

* fix some bug; add inference test example

* add share weight and train example

* add train

* add docstring and readme

* add docstring for other files

* pre-commit
```
  58f64324
- [shardformer] init shardformer code structure (#3731) · 6a69b44d
  FoolPlayer authored May 22, 2023
```
* init shardformer code structure

* add implement of sharder (inject and replace)

* add implement of replace layer to colossal layer

* separate different layer policy, add some notion

* implement 1d and 2d slicer, can tell col or row

* fix bug when slicing and inject model

* fix some bug; add inference test example
```
  6a69b44d
- modify shell for check · 9b5e7ce2
  Maruyama_Aya authored Jun 08, 2023
  
  9b5e7ce2
- Merge pull request #3926 from hpcaitech/feature/dtensor · a98e16ed
  Frank Lee authored Jun 08, 2023
```
[feature] updated device mesh and dtensor
```
  a98e16ed
- fix typo examples/community/roberta (#3925) · 407aa484
  digger yu authored Jun 08, 2023
  
  407aa484
- modify shell for check · 730a092b
  Maruyama_Aya authored Jun 08, 2023
  
  730a092b
- modify shell for check · 49567d56
  Maruyama_Aya authored Jun 08, 2023
  
  49567d56
- modify shell for check · 039854b3
  Maruyama_Aya authored Jun 08, 2023
  
  039854b3
- [example] update opt example using booster api (#3918) · e417dd00
  Baizhou Zhang authored Jun 08, 2023
  
  e417dd00
- modify shell for check · cf4792c9
  Maruyama_Aya authored Jun 08, 2023
  
  cf4792c9
- [dtensor] updated api and doc (#3845) · eb39154d
  Frank Lee authored Jun 08, 2023
  
  eb39154d
- [devops] update torch version in compability test (#3919) · 9166988d
  Hongxin Liu authored Jun 08, 2023
  
  9166988d
07 Jun, 2023 13 commits

- [nfc] fix typo colossalai/zero (#3923) · de0d7df3
  digger yu authored Jun 08, 2023
  
  de0d7df3
- [doc] add lazy init tutorial (#3922) · 12c90db3
  Hongxin Liu authored Jun 07, 2023

* [doc] add lazy init en doc

* [doc] add lazy init zh doc

* [doc] add lazy init doc in sidebar

* [doc] add lazy init doc test

* [doc] fix lazy init doc link

12c90db3

- modify shell for check · c94a3357
  Maruyama_Aya authored Jun 07, 2023
  
  c94a3357
- fix typo with colossalai/trainer utils zero (#3908) · a9d1cadc
  digger yu authored Jun 07, 2023
  
  a9d1cadc
- [example] Modify palm example with the new booster API (#3913) · b306cecf
  Liu Ziming authored Jun 07, 2023

* Modify torch version requirement to adapt torch 2.0

* modify palm example using new booster API

* roll back

* fix port

* polish

* polish

b306cecf

- [booster] update bert example, using booster api (#3885) · a55fb00c
  wukong1992 authored Jun 07, 2023
  
  a55fb00c
- [workflow] added docker latest tag for release (#3920) · 5e2132dc
  Frank Lee authored Jun 07, 2023
  
  5e2132dc
- [devops] hotfix testmon cache clean logic (#3917) · c25d421f
  Hongxin Liu authored Jun 07, 2023
  
  c25d421f
- Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop · d51e83d6
  Frank Lee authored Jun 07, 2023
```
[sync] sync feature/dtensor with develop
```
  d51e83d6
- Merge pull request #3915 from FrankLeeeee/update/develop · c622bb36
  Frank Lee authored Jun 07, 2023
```
[sync] update develop with main
```
  c622bb36
- [lazy] fix compatibility problem on torch 1.13 (#3911) · 9c88b6cb
  Hongxin Liu authored Jun 07, 2023
  
  9c88b6cb
- modify file path · 4fc8bc68
  Maruyama_Aya authored Jun 07, 2023
  
  4fc8bc68
- [chat] add distributed PPO trainer (#3740) · b5f05663
  Hongxin Liu authored Jun 07, 2023

* Detached ppo (#9)

* run the base

* working on dist ppo

* sync

* detached trainer

* update detached trainer. no maker update function

* facing init problem

* 1 maker 1 trainer detached run. but no model update

* facing cuda problem

* fix save functions

* verified maker update

* nothing

* add ignore

* analyize loss issue

* remove some debug codes

* facing 2m1t stuck issue

* 2m1t verified

* do not use torchrun

* working on 2m2t

* working on 2m2t

* initialize strategy in ray actor env

* facing actor's init order issue

* facing ddp model update issue (need unwarp ddp)

* unwrap ddp actor

* checking 1m2t stuck problem

* nothing

* set timeout for trainer choosing. It solves the stuck problem!

* delete some debug output

* rename to sync with upstream

* rename to sync with upstream

* coati rename

* nothing

* I am going to detach the replaybuffer from trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronized buffer operations

* experience_maker_holder performs target-revolving _send_experience() instead of length comparison.

* move code to ray subfolder

* working on pipeline inference

* apply comments

* working on pipeline strategy. in progress.

* remove pipeline code. clean this branch

* update remote parameters by state_dict. no test

* nothing

* state_dict sharding transfer

* merge debug branch

* gemini _unwrap_model fix

* simplify code

* simplify code & fix LoRALinear AttributeError

* critic unwrapped state_dict

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] add perfomance evaluator and fix bugs (#10)

* [chat] add performance evaluator for ray

* [chat] refactor debug arg

* [chat] support hf config

* [chat] fix generation

* [chat] add 1mmt dummy example

* [chat] fix gemini ckpt

* split experience to send (#11)
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] refactor trainer and maker (#12)

* [chat] refactor experience maker holder

* [chat] refactor model init

* [chat] refactor trainer args

* [chat] refactor model init

* [chat] refactor trainer

* [chat] refactor experience sending logic and training loop args (#13)

* [chat] refactor experience send logic

* [chat] refactor trainer

* [chat] refactor trainer

* [chat] refactor experience maker

* [chat] refactor pbar

* [chat] refactor example folder (#14)

* [chat] support quant (#15)

* [chat] add quant

* [chat] add quant example

* prompt example (#16)

* prompt example

* prompt load csv data

* remove legacy try

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] add mmmt dummy example and refactor experience sending (#17)

* [chat] add mmmt dummy example

* [chat] refactor naive strategy

* [chat] fix struck problem

* [chat] fix naive strategy

* [chat] optimize experience maker sending logic

* [chat] refactor sending assignment

* [chat] refactor performance evaluator (#18)

* Prompt Example & requires_grad state_dict & sharding state_dict (#19)

* prompt example

* prompt load csv data

* remove legacy try

* maker models require_grad set to False

* working on zero redundancy update

* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.

* remove legacy examples

* remove legacy examples

* remove replay buffer tp state. bad design

---------
Co-authored-by: csric <richcsr256@gmail.com>

* state_dict sending adapts to new unwrap function (#20)

* prompt example

* prompt load csv data

* remove legacy try

* maker models require_grad set to False

* working on zero redundancy update

* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.

* remove legacy examples

* remove legacy examples

* remove replay buffer tp state. bad design

* opt benchmark

* better script

* nothing

* [chat] strategy refactor unwrap model

* [chat] strategy refactor save model

* [chat] add docstr

* [chat] refactor trainer save model

* [chat] fix strategy typing

* [chat] refactor trainer save model

* [chat] update readme

* [chat] fix unit test

* working on lora reconstruction

* state_dict sending adapts to new unwrap function

* remove comments

---------
Co-authored-by: csric <richcsr256@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* [chat-ray] add readme (#21)

* add readme

* transparent graph

* add note background

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] get images from url (#22)

* Refactor/chat ray (#23)

* [chat] lora add todo

* [chat] remove unused pipeline strategy

* [chat] refactor example structure

* [chat] setup ci for ray

* [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24)

* lora support prototype

* lora support

* 1mmt lora & remove useless code

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] fix test ci for ray

* [chat] fix test ci requirements for ray

* [chat] fix ray runtime env

* [chat] fix ray runtime env

* [chat] fix example ci docker args

* [chat] add debug info in trainer

* [chat] add nccl debug info

* [chat] skip ray test

* [doc] fix typo

---------
Co-authored-by: csric <59389055+CsRic@users.noreply.github.com>
Co-authored-by: csric <richcsr256@gmail.com>

b5f05663