- 31 Jul, 2023 5 commits
-
LuGY authored
* support shard optimizer of zero
* polish code
* support sync grad manually
-
LuGY authored
* add state dict for zero
* fix unit test
* polish
-
LuGY authored
* allow passing process group to zero12
* unify tp-zero and normal-zero
* polish code
-
LuGY authored
* support no sync for zero1 plugin
* polish
* polish
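For context, the gradient-accumulation pattern that such a no-sync hook enables looks like the sketch below; it uses plain PyTorch DDP's `no_sync()` as a stand-in, not the zero1 plugin's own API.

```python
# Illustrative only: skip gradient all-reduce on intermediate micro-batches,
# then let the final backward trigger one deferred synchronization.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(model: DDP, optimizer, micro_batches, loss_fn):
    optimizer.zero_grad()
    for inputs, targets in micro_batches[:-1]:
        with model.no_sync():                      # no communication here
            loss_fn(model(inputs), targets).backward()
    inputs, targets = micro_batches[-1]
    loss_fn(model(inputs), targets).backward()     # all-reduce happens here
    optimizer.step()
```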
-
LuGY authored
* refactor low level zero
* fix zero2 and support cpu offload
* avg gradient and modify unit test
* refactor grad store, support layer drop
* refactor bucket store, support grad accumulation
* fix and update unit test of zero and ddp
* compatible with tp, ga and unit test
* fix memory leak and polish
* add zero layer drop unittest
* polish code
* fix import err in unit test
* support different comm dtype, modify docstring style
* polish code
* test padding and fix
* fix unit test of low level zero
* fix pad recording in bucket store
* support some models
* polish
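The bucket store mentioned above groups gradients into fixed-size buckets before averaging them across ranks. Below is a conceptual sketch of that pattern with a configurable communication dtype; the bucket size and helper names are hypothetical, not the actual grad/bucket store interfaces.

```python
# Conceptual sketch: pack queued gradients into a flat buffer in the chosen
# communication dtype, average across ranks, then scatter values back.
import torch
import torch.distributed as dist

BUCKET_BYTES = 32 * 1024 * 1024  # flush once ~32 MB of grads are queued

def allreduce_gradients(params, comm_dtype=torch.float16):
    bucket, nbytes = [], 0
    for p in params:
        if p.grad is None:
            continue
        bucket.append(p.grad)
        nbytes += p.grad.numel() * p.grad.element_size()
        if nbytes >= BUCKET_BYTES:
            _flush(bucket, comm_dtype)
            bucket, nbytes = [], 0
    if bucket:
        _flush(bucket, comm_dtype)

def _flush(grads, comm_dtype):
    flat = torch.cat([g.reshape(-1).to(comm_dtype) for g in grads])
    dist.all_reduce(flat)
    flat /= dist.get_world_size()                  # average, not sum
    offset = 0
    for g in grads:
        n = g.numel()
        g.copy_(flat[offset:offset + n].reshape_as(g).to(g.dtype))
        offset += n
```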
-
- 21 Jul, 2023 1 commit
-
Baizhou Zhang authored
* sharded optimizer checkpoint for gemini plugin
* modify test to reduce testing time
* update doc
* fix bug when keep_gatherd is true under GeminiPlugin
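Conceptually, a sharded optimizer checkpoint has each rank persist only the optimizer states it owns, so no rank ever materializes the full state. A minimal sketch, assuming a flat per-rank file layout (the naming scheme is hypothetical):

```python
import os
import torch
import torch.distributed as dist

def save_sharded_optimizer(optimizer, checkpoint_dir: str):
    os.makedirs(checkpoint_dir, exist_ok=True)
    # Each rank writes its local shard; peak memory stays flat.
    shard = optimizer.state_dict()
    path = os.path.join(checkpoint_dir, f"optim.rank{dist.get_rank()}.pt")
    torch.save(shard, path)
    dist.barrier()  # ensure every shard is on disk before returning
```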
-
- 19 Jul, 2023 1 commit
-
Hongxin Liu authored
* [lazy] support init on cuda
* [test] update lazy init test
* [test] fix transformer version
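The idea behind init on cuda can be approximated with stock PyTorch's meta device: build the module graph without real allocations, then materialize parameters directly on the GPU. This sketch mirrors the concept only; it is not the LazyInitContext API itself.

```python
import torch
import torch.nn as nn

# Construct on the meta device: shapes are recorded, no memory is allocated.
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 10))

# Allocate uninitialized storage on the GPU, then (re)initialize in place.
model = model.to_empty(device="cuda")
for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)
```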
-
- 18 Jul, 2023 1 commit
-
Cuiqing Li authored
* added softmax kernel
* added qkv_kernel
* added ops
* adding tests
* upload tests
* fix tests
* debugging
* debugging tests
* debugging
* added
* fixed errors
* added softmax kernel
* clean codes
* added tests
* update tests
* update tests
* added attention
* add
* fixed pytest checking
* add cuda check
* fix cuda version
* fix typo
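A fused softmax kernel like the one above is typically validated against a plain, numerically stable reference. A minimal sketch of such a reference and the comparison a unit test would make:

```python
import torch

def softmax_reference(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    x_max = x.amax(dim=dim, keepdim=True)   # subtract row max: avoids overflow
    exp = torch.exp(x - x_max)
    return exp / exp.sum(dim=dim, keepdim=True)

# Typical check: the custom kernel's output should match the reference.
scores = torch.randn(8, 128, 128)
assert torch.allclose(softmax_reference(scores),
                      torch.softmax(scores, dim=-1), atol=1e-6)
```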
-
- 07 Jul, 2023 1 commit
-
Baizhou Zhang authored
* [checkpointio] unsharded optimizer checkpoint for Gemini plugin
* [checkpointio] unsharded optimizer checkpoint for Gemini using all_gather
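The all_gather route trades memory for simplicity: every rank contributes its shard, and rank 0 stitches them into one unsharded file. A sketch under the assumption that each rank's shard is a single flat tensor:

```python
import torch
import torch.distributed as dist

def save_unsharded(local_shard: torch.Tensor, path: str):
    world = dist.get_world_size()
    gathered = [torch.empty_like(local_shard) for _ in range(world)]
    dist.all_gather(gathered, local_shard)   # every rank receives all shards
    if dist.get_rank() == 0:
        torch.save(torch.cat(gathered), path)
```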
-
- 04 Jul, 2023 31 commits
-
github-actions[bot] authored
Co-authored-by: github-actions <github-actions@github.com>
-
Frank Lee authored
* [shardformer] made tensor parallelism configurable
* polish code
-
Frank Lee authored
* [shardformer] refactored some doc and api
* polish code
-
Frank Lee authored
-
Frank Lee authored
-
Frank Lee authored
-
Frank Lee authored
-
Kun Lin authored
* first version of vit shardformer
* keep vit
* update
* add vit attention and vit layer to vit shard
* update num head shard param
* finish test for vit
* add new_model_class & postprocess
* add vit readme
* delete old files & fix the conflict
* fix sth
-
jiangmingyan authored
* [shardformer] shardformer support opt models
* [shardformer] shardformer support opt models, fix
* [shardformer] shardformer support opt models, fix
* [shardformer] shardformer support opt models, fix
-
Frank Lee authored
-
Frank Lee authored
* [test] fixed tests that failed due to dtensor change
* polish code
-
FoolPlayer authored
* add layernorm to bert
* add layernorm test
* add layernorm test with load state dict
* add use_mixedfusedLN in shard config
* refactor policy to support fused_layernorm
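Swapping nn.LayerNorm for a fused implementation is essentially what such a policy does. A hedged sketch using NVIDIA Apex's FusedLayerNorm; the replace_layernorm helper is hypothetical, not the policy API.

```python
import torch.nn as nn
from apex.normalization import FusedLayerNorm  # requires NVIDIA Apex

def replace_layernorm(model: nn.Module) -> nn.Module:
    for name, child in model.named_children():
        if isinstance(child, nn.LayerNorm):
            fused = FusedLayerNorm(child.normalized_shape, eps=child.eps)
            fused.load_state_dict(child.state_dict())  # keep trained params
            setattr(model, name, fused)
        else:
            replace_layernorm(child)  # recurse into submodules
    return model
```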
-
Frank Lee authored
-
FoolPlayer authored
* add linearconv1d test
-
Frank Lee authored
* [shardformer] support module saving and loading
* polish code
-
FoolPlayer authored
* support kit use for bert test
* support kit test for gpt2
-
Frank Lee authored
-
Frank Lee authored
* [shardformer] adapted T5 and LLaMa test to use kit
* polish code
-
FoolPlayer authored
* add gpt2 test and layer class refactor
* add dropout in gpt2 policy
-
Frank Lee authored
-
Frank Lee authored
-
FoolPlayer authored
* fix bert downstream with new api
* remove comment line
-
FoolPlayer authored
-
Frank Lee authored
* [shardformer] refactored embedding and dropout to parallel module
* polish code
-
FoolPlayer authored
-
Frank Lee authored
* [shardformer] integrated linear 1D with dtensor
* polish code
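A 1D (column-parallel) linear splits the weight along the output dimension, lets each rank compute its slice, and gathers the partial activations. A simplified, forward-only sketch of the idea; the integrated version routes through dtensor and an autograd-aware gather.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0
        # Each rank owns out_features / world output columns.
        self.weight = nn.Parameter(torch.empty(out_features // world, in_features))
        nn.init.kaiming_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = x @ self.weight.t()        # (..., out_features / world)
        parts = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(parts, local_out)      # note: not autograd-aware
        return torch.cat(parts, dim=-1)        # (..., out_features)
```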
-
Frank Lee authored
-
FoolPlayer authored
* add dist dropout in model
* update docstring and bert policy with dropout
* refactor basepolicy and sharded, update bert
* update format
* update gpt2 policy
* update bert policy
* remove unused code
* update readme for new policy usage
* add downstream model of bert
* remove unused code
-
wukong1992 authored
test t5
-
wukong1992 authored
adjust layer attr
-
FoolPlayer authored
* fix bug in slicer, add slicer unit test
* add dropout test
* use pid as dropout seed
* update dropout test with local pattern
* add todo
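Seeding the RNG per process is what makes dropout masks distinct across ranks by construction. A minimal sketch of the pid-based seeding idea; note that torch.manual_seed seeds the global RNG, not just dropout.

```python
import os
import torch

def seed_dropout_rng(base_seed: int = 0) -> None:
    # Distinct seed per process -> independent dropout masks on each rank.
    torch.manual_seed(base_seed + os.getpid())
```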
-