Commits · e396bb71f2d557a44566fd7ec958475f5d406b8e · OpenDAS / ColossalAI

13 Apr, 2022 1 commit

- [zero] add tensor placement policies (#743) · e396bb71
  ver217 authored Apr 13, 2022
  
  * add tensor placement policies
  * polish comments
  * polish comments
  * update moe unit tests
  
  e396bb71

12 Apr, 2022 2 commits
- [bug] fixed DDP compatibility with torch 1.8 (#739) · f4f42d4c
  Frank Lee authored Apr 13, 2022
  
  f4f42d4c
- [utils] correct cpu memory used and capacity in the context of multi-process (#726) · 53cb5848
  Jiarui Fang authored Apr 12, 2022
  
  53cb5848
08 Apr, 2022 1 commit
- [zero] adapt zero hooks for unsharded module (#699) · ee112fe1
  HELSON authored Apr 08, 2022
  
  ee112fe1
07 Apr, 2022 1 commit
- [zero] fix init bugs in zero context (#686) · d7ecaf36
  HELSON authored Apr 07, 2022
```
* adapt model weight initialization for methods in PyTorch nn.init
```
  d7ecaf36
29 Mar, 2022 1 commit
- [zero] polish ZeroInitContext (#540) · 1f90a3b1
  ver217 authored Mar 29, 2022
  
  1f90a3b1
28 Mar, 2022 1 commit

- [zero] adapt for no-leaf module in zero (#535) · a30e2b4c
  HELSON authored Mar 28, 2022
  
  * only process module's own parameters in Zero context
  * add zero hooks for all modules that contain parameters
  * gather parameters only belonging to module itself
  
  a30e2b4c

25 Mar, 2022 2 commits
- [test] fixed rerun_on_exception and adapted test cases (#487) · 3601b2ba
  Frank Lee authored Mar 25, 2022
  
  3601b2ba
- [refactor] remove old zero code (#517) · 4d322b79
  Jiarui Fang authored Mar 25, 2022
  
  4d322b79
18 Mar, 2022 6 commits
- [test] fixed amp convergence comparison test (#454) · af185b55
  Frank Lee authored Mar 18, 2022
  
  af185b55
- update sharded optim and fix zero init ctx (#457) · 642846d6
  ver217 authored Mar 18, 2022
  
  642846d6
- Revert "[zero] update sharded optim and fix zero init ctx" (#456) · e2e9f825
  Jiarui Fang authored Mar 18, 2022
```
* Revert "polish code"

This reverts commit 8cf7ff08.

* Revert "rename variables"

This reverts commit e99af94a.

* Revert "remove surplus imports"

This reverts commit 46add4a5.

* Revert "update sharded optim and fix zero init ctx"

This reverts commit 57567ee7.
```
  e2e9f825
- polish code · 8cf7ff08
  ver217 authored Mar 18, 2022
  
  8cf7ff08
- update sharded optim and fix zero init ctx · 57567ee7
  ver217 authored Mar 18, 2022
  
  57567ee7
- [test] optimized zero data parallel test (#452) · f27d801a
  Frank Lee authored Mar 18, 2022
  
  f27d801a
16 Mar, 2022 1 commit
- sync before creating empty grad · fce9432f
  ver217 authored Mar 16, 2022
  
  fce9432f
14 Mar, 2022 2 commits
- [zero] memtracer to record cuda memory usage of model data and overall system (#395) · 21dc54e0
  Jiarui Fang authored Mar 14, 2022
  
  21dc54e0
- polish unit test · 54fd37f0
  ver217 authored Mar 14, 2022
  
  54fd37f0
11 Mar, 2022 12 commits

- [zero] able to place params on cpu after zero init context (#365) · 44e4891f
  Jiarui Fang authored Mar 10, 2022
```
* place params on cpu after zero init context

* polish code
```
  44e4891f
- [test] polish zero related unitest (#351) · cb34cd38
  Jiarui Fang authored Mar 10, 2022
  
  cb34cd38
- [zero] update sharded optim v2 (#334) · d0ae0f22
  ver217 authored Mar 09, 2022
  
  d0ae0f22
- fix bert unit test · f5f0ad26
  ver217 authored Mar 09, 2022
  
  f5f0ad26
- polish engine unitest · d271f259
  jiaruifang authored Mar 09, 2022
  
  d271f259
- polish code · 354c0f90
  jiaruifang authored Mar 09, 2022
  
  354c0f90
- adapting bert unitest interface · 4d94cd51
  jiaruifang authored Mar 09, 2022
  
  4d94cd51
- add bert for unitest and sharded model is not able to pass the bert case · 7977422a
  jiaruifang authored Mar 09, 2022
  
  7977422a
- [zero] Update sharded model v2 using sharded param v2 (#323) · 13886716
  ver217 authored Mar 08, 2022
  
  13886716
- using pytest parametrize · 799d105b
  jiaruifang authored Mar 08, 2022
  
  799d105b
- [zero] add sharded grad and refactor grad hooks for ShardedModel (#287) · 7aef75ca
  ver217 authored Mar 02, 2022
  
  7aef75ca

- Feature/zero (#279) · 5a560a06
  Jiarui Fang authored Mar 01, 2022
  
  * add zero1 (#209)
  * add zero1
  * add test zero1
  * update zero stage 1 develop (#212)
  * Implement naive zero3 (#240)
  * naive zero3 works well
  * add zero3 param manager
  * add TODOs in comments
  * add gather full param ctx
  * fix sub module streams
  * add offload
  * fix bugs of hook and add unit tests
  * fix bugs of hook and add unit tests (#252)
  * add gather full param ctx
  * fix sub module streams
  * add offload
  * fix bugs of hook and add unit tests
  * polish code and add state dict hook
  * fix bug
  * update unit test
  * refactor reconstructed zero code
  * clip_grad support zero3 and add unit test
  * add unit test for Zero3ParameterManager
  * [WIP] initialize the shard param class
  * [WIP] Yet another sharded model implementation (#274)
  * [WIP] initialize the shard param class
  * [WIP] Yet another implementation of shardModel. Using a better hook method.
  * torch.concat -> torch.cat
  * fix test_zero_level_1.py::test_zero_level_1 unitest
  * remove deepspeed implementation and refactor for the reconstructed zero module
  * polish zero dp unittests
  
  Co-authored-by: ver217 <lhx0217@gmail.com>
  Co-authored-by: Frank Lee <somerlee.9@gmail.com>
  
  5a560a06