- 09 Jun, 2022 1 commit
Frank Lee authored
* [test] skip tests when not enough GPUs are detected
* polish code
* polish code
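For context, this kind of GPU-count guard can be reproduced with stock pytest; the threshold and test body below are illustrative assumptions, not the repository's actual helper.

```python
import pytest
import torch

REQUIRED_GPUS = 4  # illustrative threshold, not the repo's setting

# Skip (rather than fail) when the runner exposes too few devices.
@pytest.mark.skipif(
    torch.cuda.device_count() < REQUIRED_GPUS,
    reason=f"requires at least {REQUIRED_GPUS} GPUs",
)
def test_distributed_feature():
    assert torch.cuda.device_count() >= REQUIRED_GPUS
```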
- 08 Jun, 2022 1 commit
Frank Lee authored
* [test] ignore 8 gpu test
* polish code
* polish workflow
* polish workflow
- 06 Jun, 2022 1 commit
Jiarui Fang authored
- 24 Apr, 2022 2 commits
HELSON authored
* refactor StatefulTensor, tensor utilities (see sketch below)
* add unit test for GeminiMemoryManager
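As a rough sketch of what a stateful tensor abstraction involves (the class and state names below are assumptions, not the actual Colossal-AI definitions): a payload tensor is paired with a lifecycle state so a memory manager can decide what may be moved or freed.

```python
from enum import Enum

import torch

class TensorState(Enum):
    HOLD = 0     # resident, not in active use
    COMPUTE = 1  # needed by the current forward/backward step
    FREE = 2     # payload may be released or evicted

class StatefulTensor:
    """Illustrative wrapper: a payload tensor plus a lifecycle state."""

    def __init__(self, payload: torch.Tensor):
        self.payload = payload
        self.state = TensorState.HOLD

    def set_state(self, state: TensorState) -> None:
        self.state = state

t = StatefulTensor(torch.zeros(1024))
t.set_state(TensorState.COMPUTE)
```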
YuliangLiu0306 authored
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher". This reverts commit df7e6506d4500af6a9220ef7fe4d3c7b1daebd4c.
* [pipeline] add module lazy init feature to support large model initialization (see sketch below)
* [pipeline] add to_layer_list and partition method to support arbitrary non-pp model
* refactor the module structure
* polish
* [pipelinable] add unit test for pipelinable
* polish
* polish
* Fix CodeFactor issues.
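The lazy-init idea, describing the module structure first and allocating weights only where needed, can be sketched with PyTorch's meta device; the helper below is an assumption-laden illustration, not the feature added in this commit.

```python
import torch.nn as nn

# Build layers on the meta device: the structure exists, but no storage
# is allocated, so even very large models can be described cheaply.
layers = [nn.Linear(4096, 4096, device="meta") for _ in range(24)]

def materialize(stage_layers, device="cpu"):
    # to_empty() gives meta modules real (uninitialized) storage; a full
    # implementation would also re-run parameter initialization here.
    return [layer.to_empty(device=device) for layer in stage_layers]

stage0 = materialize(layers[:12])  # this pipeline stage's share only
```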
- 19 Apr, 2022 1 commit
Jiarui Fang authored
- 14 Apr, 2022 1 commit
Frank Lee authored
* [test] refactored with the new rerun decorator (see sketch below)
* polish test case
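A minimal sketch of what a rerun decorator for flaky distributed tests might look like (names and retry policy are assumptions, not the actual implementation):

```python
import functools
import time

def rerun_on_exception(max_try=3, exception_type=Exception):
    """Retry the wrapped test when it raises, up to max_try attempts."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_try + 1):
                try:
                    return fn(*args, **kwargs)
                except exception_type:
                    if attempt == max_try:
                        raise  # retries exhausted: surface the failure
                    time.sleep(1)  # brief pause before the next attempt
        return wrapper
    return decorator
```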
- 12 Apr, 2022 2 commits
Jiarui Fang authored
FrankLeeeee authored
- 11 Apr, 2022 2 commits
Jiarui Fang authored
Jiarui Fang authored
- 02 Apr, 2022 1 commit
HELSON authored
* remove hybrid adam in test_moe_zero_optim
* fix activation checkpointing and its unit test (see sketch below)
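For reference, activation checkpointing in PyTorch trades compute for memory by recomputing intermediates during backward; a generic example, not the code fixed here:

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    # stand-in for a sub-module whose activations we avoid storing
    return torch.relu(x @ x.T)

x = torch.randn(64, 64, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # block() reruns in backward
y.sum().backward()
```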
- 01 Apr, 2022 3 commits
アマデウス authored
FredHuang99 authored
Jiarui Fang authored
- 28 Mar, 2022 1 commit
Jiarui Fang authored
- 25 Mar, 2022 5 commits
Jiarui Fang authored
Frank Lee authored
Jiarui Fang authored
Jiarui Fang authored
Jiarui Fang authored
- 24 Mar, 2022 2 commits
Jiarui Fang authored
Jiarui Fang authored
- 22 Mar, 2022 1 commit
Jiarui Fang authored
* [zero] polish sharded param name
* polish code
* polish
* polish code
* polish
* polish
* polish
- 18 Mar, 2022 1 commit
Frank Lee authored
- 14 Mar, 2022 3 commits
Jiarui Fang authored
Jiarui Fang authored
LuGY authored
* Added tensor detector (see sketch below)
* Added the - states
* Allowed changing include_cpu when detect()
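The detector presumably walks live objects and reports tensors; a minimal sketch under that assumption (the function name and output format are illustrative):

```python
import gc

import torch

def detect(include_cpu: bool = False) -> None:
    """Scan garbage-collector-tracked objects and report live tensors."""
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                if not include_cpu and not obj.is_cuda:
                    continue  # skip CPU tensors unless asked for
                print(type(obj).__name__, tuple(obj.shape), obj.dtype, obj.device)
        except Exception:
            continue  # some tracked objects misbehave on inspection

detect(include_cpu=True)
```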
- 11 Mar, 2022 7 commits
Frank Lee authored
Frank Lee authored
* refactored test with component func
* fixed bug
LuGY authored
* Added activation offload (see sketch below)
* Fixed the import bug, used pytest
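Activation offload can be approximated with PyTorch's saved-tensor hooks, assuming a CUDA device is available; this is a generic sketch, not the commit's implementation:

```python
import torch

def pack(t):
    return t.to("cpu", non_blocking=True)   # park activation in host RAM

def unpack(t):
    return t.to("cuda", non_blocking=True)  # restore it for backward

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = torch.relu(x @ x)  # saved activations are offloaded on the fly
y.sum().backward()
```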
Jiarui Fang authored
jiaruifang authored
This reverts commit bef05489b642385c80e59fe757d598efd1752ecf.
Jiarui Fang authored
Jiarui Fang authored
* add zero1 (#209)
* add zero1
* add test zero1
* update zero stage 1 develop (#212)
* Implement naive zero3 (#240)
* naive zero3 works well
* add zero3 param manager
* add TODOs in comments
* add gather full param ctx
* fix sub module streams
* add offload
* fix bugs of hook and add unit tests
* fix bugs of hook and add unit tests (#252)
* add gather full param ctx
* fix sub module streams
* add offload
* fix bugs of hook and add unit tests
* polish code and add state dict hook
* fix bug
* update unit test
* refactor reconstructed zero code
* clip_grad support zero3 and add unit test
* add unit test for Zero3ParameterManager
* [WIP] initialize the shard param class
* [WIP] Yet another sharded model implementation (#274)
* [WIP] initialize the shard param class
* [WIP] Yet another implementation of shardModel, using a better hook method
* torch.concat -> torch.cat
* fix test_zero_level_1.py::test_zero_level_1 unit test
* remove deepspeed implementation and refactor for the reconstructed zero module
* polish zero dp unit tests

Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
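At its core, the shard param class splits each parameter across ranks so every process holds only a slice; a simplified sketch (the padding scheme and names are assumptions, not the actual class):

```python
import torch

def shard_param(param: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Flatten, pad to a multiple of world_size, and keep this rank's chunk."""
    flat = param.detach().reshape(-1)
    pad = (-flat.numel()) % world_size
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    return flat.chunk(world_size)[rank].clone()

# Gathering reverses this: all-gather the chunks, concatenate, drop the
# padding, and reshape back to the parameter's original shape.
shard = shard_param(torch.randn(10, 7), rank=1, world_size=4)
```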
- 29 Dec, 2021 1 commit
アマデウス authored
* optimized 1d layer apis; reorganized nn.layer modules; fixed tests
* fixed 2.5d runtime issue
* reworked split batch, now called in trainer.schedule.load_batch (see sketch below)

Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
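The reworked split-batch step amounts to each rank taking its slice of the loaded batch; an illustrative sketch that assumes the batch size divides evenly:

```python
import torch

def split_batch(batch: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Give each parallel rank an equal slice along the batch dimension."""
    return batch.chunk(world_size, dim=0)[rank]

local = split_batch(torch.randn(32, 3, 224, 224), rank=0, world_size=4)
```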
- 16 Dec, 2021 1 commit
Frank Lee authored
- 09 Dec, 2021 1 commit
Frank Lee authored
* Add gradient accumulation, fix lr scheduler (see sketch after this entry)
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer". This reverts commit 2e0b0b76990e8d4e337add483d878c0f61cf5097.
* improved consistency between trainer, engine and schedule (#23). Co-authored-by: 1SAA <c2h214748@gmail.com>
* Split conv2d, class token, positional embedding in 2d; fix random number in ddp; fix convergence in cifar10, Imagenet1000
* Integrate 1d tensor parallel in Colossal-AI (#39)
* fixed 1D and 2D convergence (#38)
* optimized 2D operations
* fixed 1D ViT convergence problem
* Feature/ddp (#49)
  * remove redundancy func in setup (#19) (#20)
  * use env to control the language of doc (#24) (#25)
  * Support TP-compatible Torch AMP and Update trainer API (#27)
  * Add gradient accumulation, fix lr scheduler
  * fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
  * fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
  * fixed trainer
  * Revert "fixed trainer". This reverts commit 2e0b0b76990e8d4e337add483d878c0f61cf5097.
  * improved consistency between trainer, engine and schedule (#23). Co-authored-by: 1SAA <c2h214748@gmail.com>, 1SAA <c2h214748@gmail.com>, ver217 <lhx0217@gmail.com>
  * add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)
  * add explanation for ViT example (#35) (#36)
  * support torch ddp
  * fix loss accumulation
  * add log for ddp
  * change seed
  * modify timing hook
  Co-authored-by: Frank Lee <somerlee.9@gmail.com>, 1SAA <c2h214748@gmail.com>, binmakeswell <binmakeswell@gmail.com>
* Feature/pipeline (#40)
  * remove redundancy func in setup (#19) (#20)
  * use env to control the language of doc (#24) (#25)
  * Support TP-compatible Torch AMP and Update trainer API (#27)
  * Add gradient accumulation, fix lr scheduler
  * fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
  * fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
  * fixed trainer
  * Revert "fixed trainer". This reverts commit 2e0b0b76990e8d4e337add483d878c0f61cf5097.
  * improved consistency between trainer, engine and schedule (#23). Co-authored-by: 1SAA <c2h214748@gmail.com>, 1SAA <c2h214748@gmail.com>, ver217 <lhx0217@gmail.com>
  * add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)
  * add explanation for ViT example (#35) (#36)
  * optimize communication of pipeline parallel
  * fix grad clip for pipeline
  Co-authored-by: Frank Lee <somerlee.9@gmail.com>, 1SAA <c2h214748@gmail.com>, binmakeswell <binmakeswell@gmail.com>
* optimized 3d layer to fix slow computation; tested imagenet performance with 3d; reworked lr_scheduler config definition; fixed launch args; fixed some printing issues; simplified apis of 3d layers (#51)
* Update 2.5d layer code to get a similar accuracy on imagenet-1k dataset
* update api for better usability (#58)

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: puck_WCR <46049915+WANG-CR@users.noreply.github.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
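The gradient accumulation mentioned at the top of this entry follows the standard pattern: scale each micro-batch loss, let gradients accumulate, and step the optimizer every few iterations. A generic sketch, not the trainer API itself:

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # illustrative accumulation window

for step in range(16):
    x = torch.randn(8, 10)
    loss = model(x).pow(2).mean() / accum_steps  # scale so grads average
    loss.backward()                              # grads add up in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # apply the accumulated update
        optimizer.zero_grad()  # reset for the next window
```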
- 28 Oct, 2021 1 commit
zbian authored