1. 04 Sep, 2023 2 commits
    • [shardformer] update bert finetune example with HybridParallelPlugin (#4584) · 0a94fcd3
      flybird11111 authored
      
      
      * [shardformer] fix opt test hanging
      
      * fix
      
      * test
      
      * test
      
      * test
      
      * fix test
      
      * fix test
      
      * remove print
      
      * add fix
      
      * [shardformer] add bert finetune example
      
      * [shardformer] add bert finetune example
      
      * [shardformer] add bert finetune example
      
      * [shardformer] add bert finetune example
      
      * [shardformer] add bert finetune example
      
      * [shardformer] add bert finetune example
      
      * [shardformer] fix epoch change
      
      * [shardformer] broadcast add pp group
      
      * [shardformer] fix opt test hanging
      
      * fix
      
      * test
      
      * test
      
      * [shardformer] zero1+pp and the corresponding tests (#4517)
      
      * pause
      
      * finish pp+zero1
      
      * Update test_shard_vit.py
      
      * [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516)
      
      * fix overlap bug and support bert, add overlap as an option in shardconfig
      
      * support overlap for chatglm and bloom
      
      * [shardformer] fix emerged bugs after updating transformers (#4526)
      
      * test
      
      * fix test
      
      * fix test
      
      * remove print
      
      * add fix
      
      * [shardformer] add bert finetune example
      
      * [shardformer] add bert finetune example
      
      * [shardformer] Add overlap support for gpt2 (#4535)
      
      * add overlap support for gpt2
      
      * remove unused code
      
      * remove unused code
      
      * [shardformer] support pp+tp+zero1 tests (#4531)
      
      * [shardformer] fix opt test hanging
      
      * fix
      
      * test
      
      * test
      
      * test
      
      * fix test
      
      * fix test
      
      * remove print
      
      * add fix
      
      * [shardformer] pp+tp+zero1
      
      [shardformer] pp+tp+zero1
      
      [shardformer] pp+tp+zero1
      
      [shardformer] pp+tp+zero1
      
      [shardformer] pp+tp+zero1
      
      [shardformer] pp+tp+zero1
      
      * [shardformer] pp+tp+zero1
      
      * [shardformer] pp+tp+zero1
      
      * [shardformer] pp+tp+zero1
      
      * [shardformer] pp+tp+zero1
      
      * [shardformer] fix submodule replacement bug when enabling pp (#4544)
      
      * [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540)
      
      * implement sharded optimizer saving
      
      * add more param info
      
      * finish implementation of sharded optimizer saving
      
      * fix bugs in optimizer sharded saving
      
      * add pp+zero test
      
      * param group loading
      
      * greedy loading of optimizer
      
      * fix bug when loading
      
      * implement optimizer sharded saving
      
      * add optimizer test & arrange checkpointIO utils
      
      * fix gemini sharding state_dict
      
      * add verbose option
      
      * add loading of master params
      
      * fix typehint
      
      * fix master/working mapping in fp16 amp
      
      * [shardformer] add bert finetune example
      
      * [shardformer] add bert finetune example
      
      * [shardformer] add bert finetune example
      
      * [shardformer] add bert finetune example
      
      * [shardformer] fix epoch change
      
      * [shardformer] broadcast add pp group
      
      * rebase feature/shardformer
      
      * update pipeline
      
      * [shardformer] fix
      
      * [shardformer] fix
      
      * [shardformer] bert finetune fix
      
      * [shardformer] add all_reduce operation to loss
      
      add all_reduce operation to loss
      
      * [shardformer] make compatible with pytree.
      
      make compatible with pytree.
      
      * [shardformer] disable tp
      
      disable tp
      
      * [shardformer] add 3d plugin to ci test
      
      * [shardformer] update num_microbatches to None
      
      * [shardformer] update microbatchsize
      
      * [shardformer] update assert
      
      * update scheduler
      
      * update scheduler
      
      ---------
      Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
      Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
      Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
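The last few commits in this entry adjust how the pipeline schedule derives its microbatch count ("update num_microbatches to None", "update microbatchsize"). The relationship between the two knobs can be sketched in plain Python; this is an illustrative helper with a hypothetical name, not Colossal-AI's actual implementation:

```python
def split_into_microbatches(batch, num_microbatches=None, microbatch_size=None):
    """Split a batch (a list of samples) into microbatches for pipeline scheduling.

    Exactly one of num_microbatches / microbatch_size should be given; when
    num_microbatches is None it is derived from microbatch_size.
    """
    if (num_microbatches is None) == (microbatch_size is None):
        raise ValueError("specify exactly one of num_microbatches or microbatch_size")
    if num_microbatches is None:
        if len(batch) % microbatch_size != 0:
            raise ValueError("batch size must be divisible by microbatch_size")
        num_microbatches = len(batch) // microbatch_size
    if len(batch) % num_microbatches != 0:
        raise ValueError("batch size must be divisible by num_microbatches")
    size = len(batch) // num_microbatches
    return [batch[i * size:(i + 1) * size] for i in range(num_microbatches)]
```

Either parameter fully determines the other once the batch size is fixed, which is why the example could switch from passing one to passing the other.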
    • [shardformer] Pytree fix (#4533) · 24c07687
      Jianghai authored
      * pytree test
      
      * test bert
      
      * test bert
      
      * test bert
      
      * revise
      
      * add register
      
      * add register
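The "add register" steps above concern registering model output classes with pytree utilities so the pipeline runtime can flatten outputs into leaves and rebuild them. The mechanism can be sketched in plain Python — the registry and function names here are illustrative, not torch's actual internals:

```python
# Minimal pytree-style registry: map a type to (flatten, unflatten) functions.
_REGISTRY = {}

def register_pytree_node(cls, flatten_fn, unflatten_fn):
    _REGISTRY[cls] = (flatten_fn, unflatten_fn)

def tree_flatten(obj):
    """Return (leaves, spec) where spec is enough to rebuild obj."""
    if type(obj) in _REGISTRY:
        flatten_fn, _ = _REGISTRY[type(obj)]
        children, context = flatten_fn(obj)
        leaves, specs = [], []
        for child in children:
            sub_leaves, sub_spec = tree_flatten(child)
            leaves.extend(sub_leaves)
            specs.append(sub_spec)
        return leaves, (type(obj), context, specs)
    return [obj], None  # unregistered types are leaves

def _count_leaves(spec):
    if spec is None:
        return 1
    _, _, specs = spec
    return sum(_count_leaves(s) for s in specs)

def tree_unflatten(leaves, spec):
    """Rebuild the original container from flat leaves plus the spec."""
    if spec is None:
        return leaves[0]
    cls, context, specs = spec
    _, unflatten_fn = _REGISTRY[cls]
    children, i = [], 0
    for sub_spec in specs:
        n = _count_leaves(sub_spec)
        children.append(tree_unflatten(leaves[i:i + n], sub_spec))
        i += n
    return unflatten_fn(children, context)
```

Without such a registration, a custom output class is treated as a single opaque leaf, which is the kind of incompatibility the commit title points at.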
  2. 01 Sep, 2023 2 commits
  3. 31 Aug, 2023 2 commits
  4. 30 Aug, 2023 2 commits
    • [shardformer] support pp+tp+zero1 tests (#4531) · ec18fc73
      flybird11111 authored
      * [shardformer] fix opt test hanging
      
      * fix
      
      * test
      
      * test
      
      * test
      
      * fix test
      
      * fix test
      
      * remove print
      
      * add fix
      
      * [shardformer] pp+tp+zero1
      
      [shardformer] pp+tp+zero1
      
      [shardformer] pp+tp+zero1
      
      [shardformer] pp+tp+zero1
      
      [shardformer] pp+tp+zero1
      
      [shardformer] pp+tp+zero1
      
      * [shardformer] pp+tp+zero1
      
      * [shardformer] pp+tp+zero1
      
      * [shardformer] pp+tp+zero1
      
      * [shardformer] pp+tp+zero1
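pp+tp+zero1 composes pipeline, tensor, and ZeRO-1 data parallelism, so the process count must factor across the three groups. The bookkeeping the tests exercise can be sketched with a hypothetical helper (not the plugin's real code):

```python
def derive_dp_size(world_size, pp_size, tp_size):
    """Data-parallel group size implied by the pipeline and tensor sizes.

    ZeRO stage 1 shards optimizer states across this data-parallel group,
    while pp and tp partition the model itself.
    """
    if world_size % (pp_size * tp_size) != 0:
        raise ValueError(
            f"world_size {world_size} not divisible by pp*tp = {pp_size * tp_size}"
        )
    return world_size // (pp_size * tp_size)
```

For example, 8 processes with pp=2 and tp=2 leave a data-parallel group of 2 for ZeRO-1 to shard over.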
    • [shardformer] fix opt test hanging (#4521) · d367b887
      flybird11111 authored
      * [shardformer] fix opt test hanging
      
      * fix
      
      * test
      
      * test
      
      * test
      
      * fix test
      
      * fix test
      
      * remove print
      
      * add fix
  5. 29 Aug, 2023 2 commits
  6. 28 Aug, 2023 2 commits
  7. 25 Aug, 2023 2 commits
    • [shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506) · 44eab2b2
      Baizhou Zhang authored
      * add APIs
      
      * implement save_sharded_model
      
      * add test for hybrid checkpointio
      
      * implement naive loading for sharded model
      
      * implement efficient sharded model loading
      
      * open a new file for hybrid checkpoint_io
      
      * small fix
      
      * fix circular importing
      
      * fix docstring
      
      * arrange arguments and apis
      
      * small fix
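Sharded checkpointing as described above splits a state dict across multiple shards and keeps an index mapping each parameter name to the shard holding it; "efficient sharded model loading" then touches only the shards it needs. A minimal in-memory sketch with hypothetical names (a real implementation writes shard files plus an index file):

```python
def save_sharded(state_dict, max_shard_entries=2):
    """Split state_dict into shards; build a param-name -> shard-number index."""
    shards, index, current = [], {}, {}
    for name, tensor in state_dict.items():
        if len(current) >= max_shard_entries:
            shards.append(current)
            current = {}
        current[name] = tensor
        index[name] = len(shards)  # shard this param will land in
    if current:
        shards.append(current)
    return shards, index

def load_param(shards, index, name):
    """Efficient loading: consult the index and open only the owning shard."""
    return shards[index[name]][name]
```

The same index idea extends to optimizer states, which is what the later sharded-optimizer checkpoint work (#4540) builds on.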
    • [shardformer] opt fix. (#4514) · de8a65ba
      flybird11111 authored
      * [shardformer] chatglm support sequence parallel
      
      [shardformer] chatglm support sequence parallel
      
      [shardformer] chatglm support sequence parallel
      
      [shardformer] chatglm support sequence parallel
      
      [shardformer] chatglm support sequence parallel
      
      [shardformer] chatglm support sequence parallel
      
      * fix
      
      fix
      
      fix
      
      fix
      
      * [shardformer] jit fused fix
      
      * [shardformer] jit fused fix
      
      * [shardformer] jit fused fix
      
      * [shardformer] jit fused fix
      
      * [shardformer] jit fused fix
      
      * [shardformer] jit fused fix
      
      * [shardformer] jit fused fix
      
      * activate checks
      
      * [Test] test ci
      
      * test ci
      
      * test ci
      
      * test ci
      
      * test ci
      
      * test ci
      
      * test ci
      
      * fix
  8. 24 Aug, 2023 1 commit
    • [shardformer] vit/llama/t5 ignore the sequence parallelism flag and some fix. (#4498) · 3353e55c
      flybird11111 authored
      * [shardformer] chatglm support sequence parallel
      
      [shardformer] chatglm support sequence parallel
      
      [shardformer] chatglm support sequence parallel
      
      [shardformer] chatglm support sequence parallel
      
      [shardformer] chatglm support sequence parallel
      
      [shardformer] chatglm support sequence parallel
      
      * fix
      
      fix
      
      fix
      
      fix
      
      * [shardformer] jit fused fix
      
      * [shardformer] jit fused fix
      
      * [shardformer] jit fused fix
      
      * [shardformer] jit fused fix
      
      * [shardformer] jit fused fix
      
      * [shardformer] jit fused fix
      
      * [shardformer] jit fused fix
      
      * activate checks
  9. 23 Aug, 2023 1 commit
  10. 22 Aug, 2023 3 commits
  11. 21 Aug, 2023 1 commit
  12. 18 Aug, 2023 4 commits
    • [shardformer] Pipeline/whisper (#4456) · 8739aa7f
      Jianghai authored
      * add some base tests and policies
      
      * finish whisper base model
      
      * add conditional generation
      
      * finish basic tests
      
      * whisper
      
      * finish whisper
      
      * finish whisper
      
      * del useless whisper test
      
      * fix
      
      * add argmin to replace
      
      * finish revision
    • [shardformer] bert support sequence parallel. (#4455) · a27e0bb4
      flybird11111 authored
      * [shardformer] bert support sequence parallel
      
      [shardformer] bert support sequence parallel
      
      [shardformer] bert support sequence parallel
      
      [shardformer] bert support sequence parallel
      
      [shardformer] bert support sequence parallel
      
      [shardformer] bert support sequence parallel
      
      [shardformer] bert support sequence parallel
      
      [shardformer] bert support sequence parallel
      
      [shardformer] bert support sequence parallel
      
      * [shardformer] bert support sequence parallel
      
      [shardformer] bert support sequence parallel
      
      [shardformer] bert support sequence parallel
      
      * [shardformer] bert support sequence parallel
    • [shardformer] bloom support sequence parallel (#4465) · 0ecd71e0
      flybird11111 authored
      [shardformer] bloom support sequence parallel
    • [shardformer/sequence parallel] support gpt2 seq parallel with pp/dp/tp (#4460) · 7c8be770
      Bin Jia authored
      * support gpt2 seq parallel with pp/dp/tp
      
      * fix a bug when waiting for stream done
      
      * delete unused gpt2_seq file
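The sequence-parallel commits in this group (bert, bloom, gpt2) partition activations along the sequence dimension across tensor-parallel ranks, gathering the full sequence back wherever an operation needs every position. The data movement can be sketched with plain lists — purely illustrative, not the actual communication code:

```python
def split_sequence(tokens, tp_size):
    """Give each tensor-parallel rank a contiguous slice of the sequence."""
    if len(tokens) % tp_size != 0:
        raise ValueError("sequence length must be divisible by tp_size")
    chunk = len(tokens) // tp_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(tp_size)]

def all_gather_sequence(chunks):
    """Reassemble the full sequence before ops that need global context."""
    return [tok for chunk in chunks for tok in chunk]
```

Per-token work (layer norms, MLPs) can run on each rank's slice independently; attention needs the gather, which is also where the overlap-with-compute optimizations mentioned in #4516 and #4535 apply.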
  13. 16 Aug, 2023 5 commits
  14. 15 Aug, 2023 11 commits