- 15 Jan, 2024 1 commit

Wenhao Chen authored
* fix: fix misleading mbs arg
* feat: add pp sanity check
* fix: fix 1f1b sanity check

- 08 Jan, 2024 1 commit

Elsa Granger authored
* A more general _communicate
* feat: finish tree_flatten version p2p
* fix: update p2p api calls
Co-authored-by: Wenhao Chen <cwher@outlook.com>

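The tree_flatten version of `_communicate` rests on a simple idea: any nested structure of tensors can be split into a flat list of leaves plus a tree spec, so only the leaves need to cross the p2p channel while the spec travels once as metadata. A minimal pure-Python sketch of that idea (the real code uses `torch.utils._pytree` and `torch.distributed`; the helper names here are illustrative):

```python
# Sketch of the tree_flatten p2p idea: flatten a nested structure into
# (leaves, spec), "send" only the leaves, and rebuild on the receiver side.
# Helper names are hypothetical; the real code uses torch.utils._pytree.

def tree_flatten(obj):
    """Flatten nested dicts/lists/tuples into a leaf list plus a spec."""
    if isinstance(obj, dict):
        keys = sorted(obj)
        leaves, specs = [], []
        for k in keys:
            sub_leaves, sub_spec = tree_flatten(obj[k])
            leaves += sub_leaves
            specs.append(sub_spec)
        return leaves, ("dict", keys, specs)
    if isinstance(obj, (list, tuple)):
        leaves, specs = [], []
        for item in obj:
            sub_leaves, sub_spec = tree_flatten(item)
            leaves += sub_leaves
            specs.append(sub_spec)
        return leaves, (type(obj).__name__, None, specs)
    return [obj], ("leaf", None, None)

def tree_unflatten(leaves, spec):
    """Rebuild the original structure from leaves and a spec."""
    kind, keys, specs = spec
    if kind == "leaf":
        return leaves.pop(0)
    children = [tree_unflatten(leaves, s) for s in specs]
    if kind == "dict":
        return dict(zip(keys, children))
    return tuple(children) if kind == "tuple" else children

# The sender flattens once; only the flat leaf list crosses the p2p channel.
payload = {"hidden_states": [1, 2], "attention_mask": (3,)}
leaves, spec = tree_flatten(payload)
rebuilt = tree_unflatten(list(leaves), spec)
assert rebuilt == payload
```

This makes `_communicate` agnostic to whether a stage passes a single tensor, a tuple, or a dict of outputs, which is what makes the API "more general".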
- 03 Jan, 2024 1 commit

Wenhao Chen authored
* fix: add fallback order option and update 1f1b
* fix: fix deadlock comm in interleaved pp
* test: modify p2p test

- 22 Dec, 2023 1 commit

Wenhao Chen authored
* test: add more p2p tests
* fix: remove send_forward_recv_forward as p2p op list needs to use the same group
* fix: make send and receive atomic
* feat: update P2PComm fn
* feat: add metadata cache in 1f1b
* feat: add metadata cache in interleaved pp
* feat: modify is_xx_stage fn
* revert: add _broadcast_object_list
* feat: add interleaved pp in llama policy
* feat: set NCCL_BUFFSIZE in HybridParallelPlugin

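`NCCL_BUFFSIZE` is a standard NCCL environment variable that sets the size in bytes of the buffers NCCL uses to stage data between pairs of GPUs, and NCCL reads it only when communicators are created. A hedged sketch of raising it before distributed initialization (the 128 MiB value is illustrative, not necessarily what HybridParallelPlugin sets):

```python
import os

# NCCL reads NCCL_BUFFSIZE (in bytes) when communicators are created, so it
# must be exported before torch.distributed initializes the process group.
# 128 MiB here is an illustrative value, not the plugin's actual choice.
if "NCCL_BUFFSIZE" not in os.environ:
    os.environ["NCCL_BUFFSIZE"] = str(128 * 1024 * 1024)

assert int(os.environ["NCCL_BUFFSIZE"]) > 0
```

A larger buffer can help batched p2p sends in interleaved pipelines at the cost of extra GPU memory per communicator.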
- 28 Nov, 2023 1 commit

Wenhao Chen authored
* [shardformer] implement policy for all GPT-J models and test
* [shardformer] support interleaved pipeline parallel for bert finetune
* [shardformer] shardformer support falcon (#4883)
* [shardformer] fix interleaved pipeline for bert model (#5048)
* [hotfix] disable seq parallel for gptj and falcon, and polish code (#5093)
* Add Mistral support for Shardformer (#5103)
* [shardformer] add tests to mistral (#5105)
Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>

- 19 Sep, 2023 1 commit

Hongxin Liu authored
* [misc] update pre-commit
* [misc] run pre-commit
* [misc] remove useless configuration files
* [misc] ignore cuda for clang-format

- 18 Sep, 2023 1 commit

Hongxin Liu authored
* [legacy] remove outdated codes of pipeline (#4692)
* [legacy] remove cli of benchmark and update optim (#4690)
* [legacy] remove cli of benchmark and update optim
* [doc] fix cli doc test
* [legacy] fix engine clip grad norm
* [legacy] remove outdated colo tensor (#4694)
* [legacy] remove outdated colo tensor
* [test] fix test import
* [legacy] move outdated zero to legacy (#4696)
* [legacy] clean up utils (#4700)
* [legacy] clean up utils
* [example] update examples
* [legacy] clean up amp
* [legacy] fix amp module
* [legacy] clean up gpc (#4742)
* [legacy] clean up context
* [legacy] clean core, constants and global vars
* [legacy] refactor initialize
* [example] fix examples ci
* [example] fix examples ci
* [legacy] fix tests
* [example] fix gpt example
* [example] fix examples ci
* [devops] fix ci installation
* [example] fix examples ci

- 11 Sep, 2023 1 commit

Hongxin Liu authored
* [legacy] move communication to legacy (#4640)
* [legacy] refactor logger and clean up legacy codes (#4654)
* [legacy] make logger independent to gpc
* [legacy] make optim independent to registry
* [legacy] move test engine to legacy
* [legacy] move nn to legacy (#4656)
* [legacy] move nn to legacy
* [checkpointio] fix save hf config
* [test] remove useless rpc pp test
* [legacy] fix nn init
* [example] skip tutorial hybrid parallel example
* [devops] test doc check
* [devops] test doc check

- 07 Sep, 2023 1 commit

Baizhou Zhang authored
* set optimizer to optional in execute_pipeline
* arrange device and mixed precision in booster init
* fix execute_pipeline in booster.py

- 05 Sep, 2023 1 commit

Hongxin Liu authored
* [legacy] move trainer to legacy
* [doc] update docs related to trainer
* [test] ignore legacy test

- 01 Sep, 2023 1 commit

Hongxin Liu authored

- 24 Aug, 2023 1 commit

Hongxin Liu authored
* [gemini] remove distributed-related part from colotensor (#4379)
* [gemini] remove process group dependency
* [gemini] remove tp part from colo tensor
* [gemini] patch inplace op
* [gemini] fix param op hook and update tests
* [test] remove useless tests
* [test] remove useless tests
* [misc] fix requirements
* [test] fix model zoo
* [test] fix model zoo
* [test] fix model zoo
* [test] fix model zoo
* [test] fix model zoo
* [misc] update requirements
* [gemini] refactor gemini optimizer and gemini ddp (#4398)
* [gemini] update optimizer interface
* [gemini] renaming gemini optimizer
* [gemini] refactor gemini ddp class
* [example] update gemini related example
* [example] update gemini related example
* [plugin] fix gemini plugin args
* [test] update gemini ckpt tests
* [gemini] fix checkpoint io
* [example] fix opt example requirements
* [example] fix opt example
* [example] fix opt example
* [example] fix opt example
* [gemini] add static placement policy (#4443)
* [gemini] add static placement policy
* [gemini] fix param offload
* [test] update gemini tests
* [plugin] update gemini plugin
* [plugin] update gemini plugin docstr
* [misc] fix flash attn requirement
* [test] fix gemini checkpoint io test
* [example] update resnet example result (#4457)
* [example] update bert example result (#4458)
* [doc] update gemini doc (#4468)
* [example] update gemini related examples (#4473)
* [example] update gpt example
* [example] update dreambooth example
* [example] update vit
* [example] update opt
* [example] update palm
* [example] update vit and opt benchmark
* [hotfix] fix bert in model zoo (#4480)
* [hotfix] fix bert in model zoo
* [test] remove chatglm gemini test
* [test] remove sam gemini test
* [test] remove vit gemini test
* [hotfix] fix opt tutorial example (#4497)
* [hotfix] fix opt tutorial example
* [hotfix] fix opt tutorial example

- 18 Aug, 2023 1 commit

Jianghai authored
* add some base tests and policies
* finish whisper base model
* add conditional generation
* finish basic tests
* whisper
* finish whisper
* finish whisper
* del useless whisper test
* fix
* add argmin to replace
* finish revision

- 16 Aug, 2023 2 commits

LuGY authored
* support interleaved pipeline
* fix unit test
* remove virtual stage test in stage mgr
* add dropped type hint and updated bwd

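In an interleaved pipeline, each rank hosts several model chunks rather than one contiguous block, so chunk `c` on rank `r` behaves as virtual stage `c * pp_size + r`; this shrinks the pipeline bubble at the cost of more p2p traffic. A small sketch of that placement rule (function name is illustrative, not ColossalAI's API):

```python
# Sketch of interleaved pipeline layer placement: each rank owns
# num_chunks model chunks, and chunk c on rank r acts as virtual
# stage c * pp_size + r. Names are illustrative.

def virtual_stages(rank: int, pp_size: int, num_chunks: int):
    """Return the virtual stage indices hosted by this pipeline rank."""
    return [c * pp_size + rank for c in range(num_chunks)]

# With 4 ranks and 2 chunks there are 8 virtual stages; rank 0 hosts 0 and 4.
assert virtual_stages(0, 4, 2) == [0, 4]
assert virtual_stages(3, 4, 2) == [3, 7]
# Every virtual stage is hosted exactly once across the ranks.
all_stages = sorted(s for r in range(4) for s in virtual_stages(r, 4, 2))
assert all_stages == list(range(8))
```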
github-actions[bot] authored
Co-authored-by: github-actions <github-actions@github.com>

- 15 Aug, 2023 15 commits

Jianghai authored
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
* add bloom model and policy, revise the base class of policy
* revise
* revision
* add bert_for_pretraining
* add bert_for_pretraining forward and policy
* fix typos
* cancel warning
* change the immediate output to default dict
* change the default output of get_shared_params
* add chatglm
* add
* chatglm
* chatglm
* finish chatglm
* deletes
* fix rmsnorm
* chatglm
* fix chatglm shard
* init

Jianghai authored
* refactor tests
* refactor bloom model
* finish policy tests
* refactor tests
* fix test pure pipeline
* remove test pipeline and cut down launch process

LuGY authored
* add unit test for 1f1b
* polish code
* polish code and update ut version
* fix

Baizhou Zhang authored
* modify t5 policy & add test
* pipeline stage distribution for t5
* complete t5 base policy
* t5 stack: halfway
* modify gpt2 pipeline test
* complete pipeline forward for T5Stack/T5EncoderModel
* fix docstring
* move t5 util tests to test_pipeline

Jianghai authored
* opt forward and test
* pause
* finish opt model pipeline
* finish opt pipeline
* fix opt
* set transformers version
* refactor the test pipeline

Jianghai authored
* bloom policy
* llama pipeline forward and tests
* fix the output and attention_mask
* fix name
* bind argument to policy
* Revert "bloom policy" (reverts commit 8dee68a0a22568dbeed6d4563372b25e1e825fb0; this policy should be reverted and copied to feature/bloom)
* revert the bloom changes
* cancel unneeded inputs
* gpt
* finish llama
* causal lm and sequence classification
* revision
* add pure pipeline test
* finish some bert models
* finish all bert models
* finish bert tests
* fix bugs
* fix bugs
* fix test pipeline
* fix data gen for qa
* update the set pipeline forward
* shared params
* fix bugs

Jianghai authored
* move bert related pipeline components to shardformer
* fix bugs
* revision
* fix bert model tests
* fix bert_lm_head model tests
* fix tests
* fix tests
* done checks
* skip bloom

Jianghai authored
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
* add bloom model and policy, revise the base class of policy
* revise
* revision
* add bert_for_pretraining
* add bert_for_pretraining forward and policy
* fix typos
* cancel warning
* change the immediate output to default dict
* change the default output of get_shared_params

Jianghai authored
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
* add bloom model and policy, revise the base class of policy
* revise
* revision
* add bert_for_pretraining

Jianghai authored
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict

Hongxin Liu authored
* [pipeline] add stage manager
* [test] add pipeline stage manager test
* [pipeline] add docstring for stage manager

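A pipeline stage manager's core job is bookkeeping: which stage a rank owns and whether it sits at either end of the pipeline, since the first stage has no upstream to receive activations from and the last stage has no downstream to send to. A minimal sketch (field and method names are illustrative, not the actual PipelineStageManager API):

```python
# Sketch of what a pipeline stage manager tracks: each rank's stage index
# and its neighbors. Names are illustrative, not ColossalAI's actual API.

class StageManager:
    def __init__(self, rank: int, num_stages: int):
        assert 0 <= rank < num_stages
        self.stage = rank
        self.num_stages = num_stages

    def is_first_stage(self) -> bool:
        # The first stage feeds raw inputs; it never receives activations.
        return self.stage == 0

    def is_last_stage(self) -> bool:
        # The last stage computes the loss; it never sends activations on.
        return self.stage == self.num_stages - 1

    def prev_rank(self) -> int:
        return (self.stage - 1) % self.num_stages

    def next_rank(self) -> int:
        return (self.stage + 1) % self.num_stages

mgr = StageManager(rank=0, num_stages=4)
assert mgr.is_first_stage() and not mgr.is_last_stage()
assert mgr.next_rank() == 1 and mgr.prev_rank() == 3
```

Schedules and model policies then query these predicates instead of re-deriving rank arithmetic, which is what the `is_xx_stage fn` commits above refer to.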
Jianghai authored
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict

Hongxin Liu authored
* [api] update optimizer wrapper to fit pipeline
* [pipeline] add base schedule
* [pipeline] add 1f1b schedule
* [test] add pipeline schedule utils test
* [pipeline] fix import

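The 1F1B schedule limits activation memory by capping in-flight microbatches: stage `i` of `p` stages runs `p - i - 1` warmup forwards, alternates one forward with one backward in the steady state, then drains the remaining backwards. A sketch of that phase split (this is the standard 1F1B shape, not ColossalAI's actual scheduler code):

```python
# Sketch of the 1F1B phase split per pipeline stage. Standard 1F1B shape,
# not the actual OneForwardOneBackwardSchedule implementation.

def one_f_one_b(stage: int, num_stages: int, num_microbatches: int):
    """Return this stage's op sequence: warmup F's, F/B steady state, cooldown B's."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup
    ops = ["F"] * warmup            # warmup: fill the pipeline
    ops += ["F", "B"] * steady      # steady state: one forward, one backward
    ops += ["B"] * warmup           # cooldown: drain remaining backwards
    return ops

# The last stage has no warmup: it strictly alternates F and B.
assert one_f_one_b(3, 4, 4) == ["F", "B"] * 4
# The first stage front-loads 3 forwards before its backwards begin.
assert one_f_one_b(0, 4, 4) == ["F"] * 4 + ["B"] * 4
```

Because at most `p - i` microbatches are in flight at stage `i`, peak activation memory is bounded by pipeline depth rather than by the microbatch count, which is 1F1B's advantage over the naive all-forward-then-all-backward (GPipe-style) schedule.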
Hongxin Liu authored
* [pipeline] add p2p communication
* [test] add p2p communication test
* [test] add rerun decorator
* [test] rename to avoid conflict

Hongxin Liu authored
* [pipeline] add stage manager
* [test] add pipeline stage manager test
* [pipeline] add docstring for stage manager

- 06 Apr, 2023 1 commit

Frank Lee authored
* [test] added spawn decorator
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code

- 12 Dec, 2022 1 commit

Ziyue Jiang authored
* add bwd and step for PP middleware
* pre-commit
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>

- 08 Dec, 2022 1 commit

Ziyue Jiang authored
* add DAG test case
* fix data race by adjusting the position of lock
* polish code
* fix pytest for middleware
* remove test
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>

- 05 Dec, 2022 1 commit

Ziyue Jiang authored
* adapt scheduler for Topo
* remove comment
* fix set input
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>

- 29 Nov, 2022 1 commit

Ziyue Jiang authored
* add DAG to split_module
* add comment
* add test case for DAG
* remove print
* add DAG middleware in scheduler
* add test case for scheduler
* remove break
* recover old lifecycle
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>

- 18 Oct, 2022 1 commit

Super Daniel authored
[fx/meta/rpc] move _meta_registration.py to fx folder / register fx functions with compatibility checks / remove color debug (#1710)
* [fx] move meta registration
* [fx] fix tests.
* [fx] fix test.
* [fx] fix.
* [meta] refactor meta registration.py.
* [fx] add compatibility descriptions.
* [fx] polish import.
* [fx] add a decorator.
* [fx] fix tests.
* [fx] remove print.
* [fx] edit raise error.
* [fx] edit raise error.
* [fx] add type hint.
* [fx] fix import in experimental.
* [rpc] remove color debug.
* [meta] fix naming.

- 29 Sep, 2022 1 commit

Kirigaya Kazuto authored
[pipeline/pytree] add pytree to process args and kwargs | provide `data_process_func` to process args and kwargs after forward (#1642)
* [pipeline/tuning] improve dispatch performance both time and space cost
* [pipeline/converge] add interface for testing convergence
* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style
* Update PipelineBase.py
* [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera
* [pipeline/chimera] test chimera | fix bug of initializing
* [pipeline/pytree] add pytree to process args and kwargs | provide `data_process_func` to process args and kwargs after forward

- 20 Sep, 2022 1 commit

Kirigaya Kazuto authored
* [pipeline/tuning] improve dispatch performance both time and space cost
* [pipeline/converge] add interface for testing convergence
* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style
* Update PipelineBase.py
* [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera
* [pipeline/chimera] test chimera | fix bug of initializing

- 19 Sep, 2022 1 commit

Kirigaya Kazuto authored
[pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera (#1595)
* [pipeline/tuning] improve dispatch performance both time and space cost
* [pipeline/converge] add interface for testing convergence
* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style
* Update PipelineBase.py
* [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera

- 07 Sep, 2022 1 commit

Kirigaya Kazuto authored