- 07 Apr, 2022 1 commit
HELSON authored
* adapt model weight initialization for methods in PyTorch nn.init
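Since the message points at the torch.nn.init family, here is a minimal sketch of dispatching initialization by nn.init method name; the helper `init_weights_` and its signature are illustrative assumptions, not the API this commit adds.

```python
import torch.nn as nn


def init_weights_(module: nn.Module, init_method: str = "kaiming_uniform_", **kwargs) -> None:
    """Initialize weights with any torch.nn.init method, looked up by name."""
    initializer = getattr(nn.init, init_method)   # e.g. nn.init.xavier_normal_
    for param in module.parameters():
        if param.dim() > 1:        # weight matrices and conv kernels
            initializer(param, **kwargs)
        else:                      # biases and other 1-D parameters
            nn.init.zeros_(param)


# usage: Xavier-normal init for a small MLP
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
init_weights_(model, "xavier_normal_")
```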
- 03 Apr, 2022 2 commits
Jiarui Fang authored
YuliangLiu0306 authored
- 02 Apr, 2022 2 commits
- 01 Apr, 2022 4 commits
HELSON authored
アマデウス authored
FredHuang99 authored
Jiarui Fang authored
- 31 Mar, 2022 3 commits
HELSON authored
* support existing sharded and unsharded parameters in ZeRO
* add unit test for MoE ZeRO model init
* polish MoE gradient handler
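A hedged sketch of how sharded and unsharded parameters can coexist: during registration, parameters that already carry a sharded marker are skipped rather than sharded twice. The `colo_is_sharded` flag and the `shard_param` stand-in are assumed names for illustration, not this commit's code.

```python
import torch.nn as nn


def shard_param(param: nn.Parameter) -> None:
    """Stand-in for real sharding: split the tensor and keep the local slice."""
    param.colo_is_sharded = True


def register_params(module: nn.Module) -> None:
    for param in module.parameters():
        if getattr(param, "colo_is_sharded", False):
            continue              # already sharded (e.g. a MoE expert): skip
        shard_param(param)        # plain parameter: shard it now


model = nn.Linear(8, 8)
model.weight.colo_is_sharded = True   # pretend this one was sharded elsewhere
register_params(model)                # bias gets sharded, weight is left alone
```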
ver217 authored
Jiarui Fang authored
- 30 Mar, 2022 3 commits
ver217 authored
* hijack p.grad in sharded model
* polish comments
* polish comments
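One way to read "hijack p.grad": a per-parameter backward hook diverts each incoming gradient into a preallocated fp16 buffer the sharded model owns, so `.grad` itself never carries the real values. The `saved_grad` attribute is an assumed name; this sketches the mechanism, not the commit's code.

```python
import torch
import torch.nn as nn


def hijack_grads(model: nn.Module) -> None:
    for param in model.parameters():
        # preallocated fp16 buffer that will hold the "real" gradient
        param.saved_grad = torch.zeros_like(param, dtype=torch.float16)

        def hook(grad: torch.Tensor, p: nn.Parameter = param) -> torch.Tensor:
            p.saved_grad.add_(grad.half())   # divert the gradient into the buffer
            return torch.zeros_like(grad)    # .grad then only accumulates zeros

        param.register_hook(hook)


model = nn.Linear(4, 4)
hijack_grads(model)
model(torch.randn(2, 4)).sum().backward()
print(model.weight.saved_grad.float().abs().sum())  # non-zero: grads landed in the buffer
```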
Jiarui Fang authored
Jiarui Fang authored
- 29 Mar, 2022 5 commits
HELSON authored
Liang Bowen authored
Jiarui Fang authored
ver217 authored
Jiarui Fang authored
- 28 Mar, 2022 4 commits
HELSON authored
* only process the module's own parameters in the ZeRO context
* add ZeRO hooks for all modules that contain parameters
* gather only the parameters belonging to the module itself
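The first and third bullets map naturally onto PyTorch's `named_parameters(recurse=False)`, which yields only a module's direct parameters. A sketch under that assumption follows; `gather` and `release` are placeholders for the real shard logic.

```python
import torch.nn as nn


def gather(params):
    """Placeholder: all-gather the local shards into full tensors."""
    pass


def release(params):
    """Placeholder: drop the gathered full tensors, keep the shards."""
    pass


def register_zero_hooks(model: nn.Module) -> None:
    for module in model.modules():
        # recurse=False: only the parameters this module itself owns,
        # not those of its children
        own_params = [p for _, p in module.named_parameters(recurse=False)]
        if not own_params:
            continue  # hook only modules that directly contain parameters
        module.register_forward_pre_hook(lambda m, inp, ps=own_params: gather(ps))
        module.register_forward_hook(lambda m, inp, out, ps=own_params: release(ps))


register_zero_hooks(nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2)))
```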
Jiarui Fang authored
Jiarui Fang authored
Jiarui Fang authored
- 25 Mar, 2022 8 commits
LuGY authored
* [zero] added hybrid Adam, removed the loss scale of Adam
* remove useless code
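"Hybrid Adam" presumably means one optimizer that updates CPU-offloaded and GPU-resident parameters alike. A toy sketch of that routing idea, built from two stock torch.optim.Adam instances rather than the fused kernels a real implementation would use:

```python
import torch
from torch.optim import Adam


class HybridAdam:
    """Route each parameter to an Adam instance living on its own device."""

    def __init__(self, params, lr: float = 1e-3):
        params = list(params)
        cpu_params = [p for p in params if p.device.type == "cpu"]
        gpu_params = [p for p in params if p.device.type != "cpu"]
        self.opts = []
        if cpu_params:
            self.opts.append(Adam(cpu_params, lr=lr))  # states stay on CPU
        if gpu_params:
            self.opts.append(Adam(gpu_params, lr=lr))  # states stay on GPU

    def step(self):
        for opt in self.opts:
            opt.step()

    def zero_grad(self):
        for opt in self.opts:
            opt.zero_grad()
```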
Jiarui Fang authored
Frank Lee authored
Jiarui Fang authored
LuGY authored
Jiarui Fang authored
Jiarui Fang authored
Jiarui Fang authored
- 24 Mar, 2022 2 commits
Jiarui Fang authored
Jiarui Fang authored
- 23 Mar, 2022 2 commits
Jiarui Fang authored
ver217 authored
* sharded model supports reusing the fp16 shard
* rename variable
* polish code
* polish code
* polish code
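A small sketch of what reusing the fp16 shard can mean: keep one fp16 buffer alive across steps and refresh it in place from the fp32 master copy instead of re-allocating it every iteration. The names `update_fp16_shard` and `fp32_master` are illustrative assumptions.

```python
import torch


def update_fp16_shard(fp16_shard: torch.Tensor, fp32_master: torch.Tensor) -> None:
    with torch.no_grad():
        fp16_shard.copy_(fp32_master)  # in-place cast + copy, no new allocation


fp32_master = torch.randn(1024, dtype=torch.float32)
fp16_shard = fp32_master.half()        # allocated once, then reused
fp32_master.add_(0.01)                 # pretend the optimizer stepped
update_fp16_shard(fp16_shard, fp32_master)
```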
- 22 Mar, 2022 2 commits
ver217 authored
* sharded optim supports hybrid CPU Adam
* update unit test
* polish docstring
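One plausible wiring for a sharded optimizer driving a hybrid CPU Adam, sketched under assumptions: before the inner step, each gradient shard is moved to whatever device holds the matching master weight and its Adam state. The helper name is hypothetical.

```python
from typing import List

import torch


def move_grads_to_state_device(master_params: List[torch.Tensor]) -> None:
    """Ensure each grad shard sits on the same device as its master weight."""
    for master in master_params:
        if master.grad is not None and master.grad.device != master.device:
            # e.g. a GPU-computed grad moved next to a CPU-offloaded Adam state
            master.grad = master.grad.to(master.device)


w = torch.zeros(10, requires_grad=True)   # CPU master weight
w.grad = torch.ones(10)                   # pretend a grad shard arrived
move_grads_to_state_device([w])           # no-op here: devices already match
```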
Jiarui Fang authored
* [zero] polish sharded param name
* polish code
* polish
* polish code
* polish
* polish
* polish
- 21 Mar, 2022 2 commits
Jiarui Fang authored
HELSON authored