Commits · a39a5c66fe943741af65df5adf18093692045dab · OpenDAS / ColossalAI

24 Aug, 2023 1 commit

[gemini] improve compatibility and add static placement policy (#4479) · 27061426

Hongxin Liu authored Aug 24, 2023

* [gemini] remove distributed-related part from colotensor (#4379)

* [gemini] remove process group dependency

* [gemini] remove tp part from colo tensor

* [gemini] patch inplace op

* [gemini] fix param op hook and update tests

* [test] remove useless tests

* [test] remove useless tests

* [misc] fix requirements

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [misc] update requirements

* [gemini] refactor gemini optimizer and gemini ddp (#4398)

* [gemini] update optimizer interface

* [gemini] renaming gemini optimizer

* [gemini] refactor gemini ddp class

* [example] update gemini related example

* [example] update gemini related example

* [plugin] fix gemini plugin args

* [test] update gemini ckpt tests

* [gemini] fix checkpoint io

* [example] fix opt example requirements

* [example] fix opt example

* [example] fix opt example

* [example] fix opt example

* [gemini] add static placement policy (#4443)

* [gemini] add static placement policy

* [gemini] fix param offload

* [test] update gemini tests

* [plugin] update gemini plugin

* [plugin] update gemini plugin docstr

* [misc] fix flash attn requirement

* [test] fix gemini checkpoint io test

* [example] update resnet example result (#4457)

* [example] update bert example result (#4458)

* [doc] update gemini doc (#4468)

* [example] update gemini related examples (#4473)

* [example] update gpt example

* [example] update dreambooth example

* [example] update vit

* [example] update opt

* [example] update palm

* [example] update vit and opt benchmark

* [hotfix] fix bert in model zoo (#4480)

* [hotfix] fix bert in model zoo

* [test] remove chatglm gemini test

* [test] remove sam gemini test

* [test] remove vit gemini test

* [hotfix] fix opt tutorial example (#4497)

* [hotfix] fix opt tutorial example

* [hotfix] fix opt tutorial example

27061426

08 Jun, 2023 4 commits
- modify shell for check · 9b5e7ce2
  Maruyama_Aya authored Jun 08, 2023
  
  9b5e7ce2
- modify shell for check · 49567d56
  Maruyama_Aya authored Jun 08, 2023
  
  49567d56
- modify shell for check · 039854b3
  Maruyama_Aya authored Jun 08, 2023
  
  039854b3
- modify shell for check · cf4792c9
  Maruyama_Aya authored Jun 08, 2023
  
  cf4792c9
07 Jun, 2023 2 commits
- modify shell for check · c94a3357
  Maruyama_Aya authored Jun 07, 2023
  
  c94a3357
- modify file path · 4fc8bc68
  Maruyama_Aya authored Jun 07, 2023
  
  4fc8bc68
06 Jun, 2023 2 commits
- fixed port · 79c9f776
  Maruyama_Aya authored Jun 06, 2023
  
  79c9f776
- change directory · b29e1f07
  Maruyama_Aya authored Jun 06, 2023
  
  b29e1f07
19 Jan, 2023 1 commit
- add test_ci.sh to dreambooth · 32390cbe
  jiaruifang authored Jan 19, 2023
  
  32390cbe
16 Jan, 2023 1 commit
- [example] stable diffusion add roadmap (#2482) · e4c38ba3
  Jiarui Fang authored Jan 16, 2023
  
  e4c38ba3
06 Jan, 2023 1 commit

[setup] support pre-build and jit-build of cuda kernels (#2374) · 40d376c5

Frank Lee authored Jan 06, 2023

* [setup] support pre-build and jit-build of cuda kernels

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

40d376c5

19 Aug, 2022 1 commit
- [autoparallel] standardize the code structure (#1469) · 3a54e1c9
  Frank Lee authored Aug 19, 2022
  
  3a54e1c9
02 Aug, 2022 1 commit
- [device] add DeviceMesh class to support logical device layout (#1394) · 0442f940
  YuliangLiu0306 authored Aug 02, 2022
```
* [device] add DeviceMesh class to support logical device layout

* polish code

* add doc string
```
  0442f940
26 Apr, 2022 1 commit
- fix import error (#880) · 4df6471f
  ver217 authored Apr 26, 2022
  
  4df6471f
09 Dec, 2021 1 commit

Develop/experiments (#59) · da01c234

Frank Lee authored Dec 09, 2021



* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b76990e8d4e337add483d878c0f61cf5097.

* improved consistency between trainer, engine and schedule (#23)
Co-authored-by: 1SAA <c2h214748@gmail.com>

* Split conv2d, class token, positional embedding in 2d, Fix random number in ddp
Fix convergence in cifar10, Imagenet1000

* Integrate 1d tensor parallel in Colossal-AI (#39)

* fixed 1D and 2D convergence (#38)

* optimized 2D operations

* fixed 1D ViT convergence problem

* Feature/ddp (#49)

* remove redundancy func in setup (#19) (#20)

* use env to control the language of doc (#24) (#25)

* Support TP-compatible Torch AMP and Update trainer API (#27)

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b76990e8d4e337add483d878c0f61cf5097.

* improved consistency between trainer, engine and schedule (#23)
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)

* add explanation for ViT example (#35) (#36)

* support torch ddp

* fix loss accumulation

* add log for ddp

* change seed

* modify timing hook
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* Feature/pipeline (#40)

* remove redundancy func in setup (#19) (#20)

* use env to control the language of doc (#24) (#25)

* Support TP-compatible Torch AMP and Update trainer API (#27)

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b76990e8d4e337add483d878c0f61cf5097.

* improved consistency between trainer, engine and schedule (#23)
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)

* add explanation for ViT example (#35) (#36)

* optimize communication of pipeline parallel

* fix grad clip for pipeline
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* optimized 3d layer to fix slow computation ; tested imagenet performance with 3d; reworked lr_scheduler config definition; fixed launch args; fixed some printing issues; simplified apis of 3d layers (#51)

* Update 2.5d layer code to get a similar accuracy on imagenet-1k dataset

* update api for better usability (#58)

update api for better usability
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: puck_WCR <46049915+WANG-CR@users.noreply.github.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>

da01c234

28 Oct, 2021 1 commit
- Migrated project · 404ecbdc
  zbian authored Oct 28, 2021
  
  404ecbdc