Commits · d921ce83915f5b5f2f01a31b0d591c38e02d90b4 · OpenDAS / ColossalAI

15 Aug, 2023 24 commits

[shardformer] support inplace sharding (#4251) · d921ce83

Hongxin Liu authored Jul 20, 2023

* [shardformer] embedding support inplace sharding

* [shardformer] linear support inplace sharding

* [shardformer] layernorm support inplace sharding

* [shardformer] qkv support inplace sharding

* [test] update shardformer layer test

* [shardformer] fix shared param sharding

* [shardformer] fix bert policy

* [shardformer] fix bloom policy

* [shardformer] fix llama policy

* [shardformer] fix opt policy

* [shardformer] fix t5 policy

* [shardformer] fix fused qkv linear

* [shardformer] fix bugs

* force sync

* [test] fix bugs

* [test] fix transformer version

d921ce83

[pipeline] support shardformer for GPT2ForQuestionAnswering & complete... · 2a2eacfa

Baizhou Zhang authored Jul 19, 2023

[pipeline] support shardformer for GPT2ForQuestionAnswering & complete pipeline support for GPT2 (#4245)

* change for transformers loggers

* add forward for GPT2ForQuestionAnswering

* fix assert

* fix torchrec test

2a2eacfa

[bugs] hot fix some testing bugs for new models (#4268) · d9be0472

Jianghai authored Jul 18, 2023

* hot fix

* hot fx tracer

d9be0472

[pipeline] finish bloom models pipeline and tests (#4223) · 34f0e34a

Jianghai authored Jul 17, 2023

* bloom policy

* llama pipeline forward and tests

* fix the output and attention_mask

* fix name

* bind argument to policy

* finish bloom model

* test shard gpt2

* clear cache

* support all bloom models

* add bloom models policies

* finish bloom pipeline and tests

* add set pipeline

* finish bloom

34f0e34a

[pipeline] All bert models (#4233) · e7cc62d7

Jianghai authored Jul 17, 2023

* bloom policy

* llama pipeline forward and tests

* fix the output and attention_mask

* fix name

* bind argument to policy

* Revert "bloom policy"

This reverts commit 8dee68a0a22568dbeed6d4563372b25e1e825fb0.

This policy should be reverted and copied to feature/bloom

* revert the bloom changes

* cancel unneeded inputs

* gpt

* finish llama

* causal lm and sequence classification

* revision

* add pure pipeline test

* finish some bert models

* finish all bert models

* finish bert tests

* fix bugs

* fix bugs

* fix test pipeline

* fix data gen for qa

* update the set pipeline forward

* shared params

* fix bugs

e7cc62d7

[pipeline] add pipeline forward for variants of gpt2 (#4238) · a14d3520

Baizhou Zhang authored Jul 17, 2023

* add forward for GPTLMHeadModel

* add test for gpt_lm

* arranging get_held_layers method

* arrange forward replacement

* add forward for GPT2ForTokenClassification

* add forward for GPT2ForSequenceClassification

* fix test_shard_gpt2.py

* add GPT2DoubleHeadsmodel & fix bugs

* add id checking in get_shared_params

a14d3520

[pipeline] Add Pipeline Forward for GPT2Model Shardformer (#4224) · 208ac8f2

Baizhou Zhang authored Jul 13, 2023

* fix typehint & docstring in sharder.py

* update pipeline forward for GPT2Model

* add test for pipeline forward of GPT2Model

* add cache cleaning in gpt2 test

* change assert to raise command

208ac8f2

[pipeline] add bloom model pipeline (#4210) · 37d22f68

Jianghai authored Jul 13, 2023

* bloom policy

* llama pipeline forward and tests

* fix the output and attention_mask

* fix name

* bind argument to policy

* finish bloom model

* test shard gpt2

* clear cache

37d22f68

[pipeline] Llama causal lm and llama for sequence classification pipeline (#4208) · 31bcf867

Jianghai authored Jul 11, 2023

* bloom policy

* llama pipeline forward and tests

* fix the output and attention_mask

* fix name

* bind argument to policy

* Revert "bloom policy"

This reverts commit 8dee68a0a22568dbeed6d4563372b25e1e825fb0.

This policy should be reverted and copied to feature/bloom

* revert the bloom changes

* cancel unneeded inputs

* gpt

* finish llama

* causal lm and sequence classification

* revision

31bcf867

[pipeline] Llama pipeline (#4205) · 16220310

Jianghai authored Jul 11, 2023

* bloom policy

* llama pipeline forward and tests

* fix the output and attention_mask

* fix name

* bind argument to policy

* Revert "bloom policy"

This reverts commit 8dee68a0a22568dbeed6d4563372b25e1e825fb0.

This policy should be reverted and copied to feature/bloom

* revert the bloom changes

* cancel unneeded inputs

* gpt

16220310

[pipeline] Bert pipeline for shardformer and its tests (#4197) · 1094e0f0

Jianghai authored Jul 10, 2023

* add pipeline forward

* complete pipeline forward check

* fix bert forward without pipeline

* fix comments

* discard useless line

* add todo

* clean prints

* fix distribute layers

1094e0f0

[shardformer] support lazy init (#4202) · 890774b2

Hongxin Liu authored Jul 10, 2023

* [shardformer] support lazy init

* [shardformer] linear support lazy init

* [shardformer] embedding support lazy init

* [shardformer] norm support lazy init

* [shardformer] fused linear support lazy init

* [test] update shardformer test layer

* [test] shardformer with lazy init fit ddp

* [lazy] hotfix deepcopy of param

* [shardformer] fix bert policy and update test

* [shardformer] fix bloom policy and update test

* [shardformer] fix opt policy and update test

* [shardformer] fix t5 policy and update test

* [shardformer] fix gpt2 policy and update test

* [shardformer] fix llama policy and update test

890774b2

[pipeline] move bert related pipeline components to shardformer (#4187) · f3bcc292

Jianghai authored Jul 07, 2023

* move bert related pipeline components to shardformer

* fix bugs

* revision

* fix bert model tests

* fix bert_lm_head model tests

* fix tests

* fix tests

* done checks

* skip bloom

f3bcc292

[pipeline] add bert_for_pretraining bert_lmhead forward and policy (#4172) · c5ea7280

Jianghai authored Jul 06, 2023

* add pipeline policy and bert forward to be done

* add bertmodel pipeline forward and make tests

* add Bert_Policy and test for policy

* update formatting

* update formatting

* update the code

* fix bugs

* fix name conflict

* add bloom model and policy, revise the base class of policy

* revise

* revision

* add bert_for_pretraining

* add bert_for_pretraining forward and policy

* fix typos

* cancel warning

* change the immediate output to default dict

* change the default output of get_shared_params

c5ea7280

[test] add shard util tests · 5fc60a3a

ver217 authored Jul 05, 2023

5fc60a3a

[test] update shardformer tests · 2d6cc07f

ver217 authored Jul 05, 2023

2d6cc07f

[pipeline] build bloom model and policy, revise the base class of policy (#4161) · 90a65ea6

Jianghai authored Jul 05, 2023

* add pipeline policy and bert forward to be done

* add bertmodel pipeline forward and make tests

* add Bert_Policy and test for policy

* update formatting

* update formatting

* update the code

* fix bugs

* fix name conflict

* add bloom model and policy, revise the base class of policy

* revise

* revision

* add bert_for_pretraining

90a65ea6

[pipeline] add pipeline policy and bert forward (#4130) · c552cefa

Jianghai authored Jul 04, 2023

* add pipeline policy and bert forward to be done

* add bertmodel pipeline forward and make tests

* add Bert_Policy and test for policy

* update formatting

* update formatting

* update the code

* fix bugs

* fix name conflict

c552cefa

[pipeline] add stage manager (#4093) · 5c897ddb

Hongxin Liu authored Jun 27, 2023

* [pipeline] add stage manager

* [test] add pipeline stage manager test

* [pipeline] add docstring for stage manager

5c897ddb

[pipeline] add pipeline policy and bert forward (#4130) · e8e7e492

Jianghai authored Jul 04, 2023

* add pipeline policy and bert forward to be done

* add bertmodel pipeline forward and make tests

* add Bert_Policy and test for policy

* update formatting

* update formatting

* update the code

* fix bugs

* fix name conflict

e8e7e492

[pipeline] refactor 1f1b schedule (#4115) · f51ce1bc

Hongxin Liu authored Jun 29, 2023

* [api] update optimizer wrapper to fit pipeline

* [pipeline] add base schedule

* [pipeline] add 1f1b schedule

* [test] add pipeline schedule utils test

* [pipeline] fix import

f51ce1bc

[pipeline] implement p2p communication (#4100) · 45fdc9b4

Hongxin Liu authored Jun 28, 2023

* [pipeline] add p2p communication

* [test] add p2p communication test

* [test] add rerun decorator

* [test] rename to avoid conflict

45fdc9b4

[pipeline] add stage manager (#4093) · 42254422

Hongxin Liu authored Jun 27, 2023

* [pipeline] add stage manager

* [test] add pipeline stage manager test

* [pipeline] add docstring for stage manager

42254422

[cluster] add process group mesh (#4039) · 5e1a9d48

Hongxin Liu authored Jun 20, 2023

* [cluster] add process group mesh

* [test] add process group mesh test

* force sync

5e1a9d48

11 Aug, 2023 1 commit

[hotfix] fix unsafe async comm in zero (#4404) · d86ddd9b

LuGY authored Aug 11, 2023

* improve stability of zero

* fix wrong index

* add record stream

d86ddd9b

09 Aug, 2023 1 commit

[kernel] updated unittests for coloattention (#4389) · 458ae331

flybird1111 authored Aug 09, 2023

Updated coloattention tests of checking outputs and gradients

458ae331

04 Aug, 2023 2 commits

[coloattention] fix import error (#4380) · 38b792aa

flybird1111 authored Aug 04, 2023

fixed an import error

38b792aa

[fix] coloattention support flash attention 2 (#4347) · 25c57b9f

flybird1111 authored Aug 04, 2023

Improved ColoAttention interface to support flash attention 2. Solved #4322

25c57b9f

01 Aug, 2023 1 commit

[test] remove useless tests (#4359) · 16bf4c02

Hongxin Liu authored Aug 01, 2023

* [test] remove legacy zero test

* [test] remove lazy distribute test

* [test] remove outdated checkpoint io

16bf4c02

31 Jul, 2023 5 commits

[zero] support shard optimizer state dict of zero (#4194) · 1a49a5ea

LuGY authored Jul 11, 2023

* support shard optimizer of zero

* polish code

* support sync grad manually

1a49a5ea

[zero] add state dict for low level zero (#4179) · dd7cc582

LuGY authored Jul 06, 2023

* add state dict for zero

* fix unit test

* polish

dd7cc582

[zero] allow passing process group to zero12 (#4153) · c668801d

LuGY authored Jul 04, 2023

* allow passing process group to zero12

* union tp-zero and normal-zero

* polish code

c668801d

[zero] support no_sync method for zero1 plugin (#4138) · 79cf1b5f

LuGY authored Jul 04, 2023

* support no sync for zero1 plugin

* polish

* polish

79cf1b5f

[zero] refactor low level zero for shard evenly (#4030) · c6ab9698

LuGY authored Jun 30, 2023

* refactor low level zero

* fix zero2 and support cpu offload

* avg gradient and modify unit test

* refactor grad store, support layer drop

* refactor bucket store, support grad accumulation

* fix and update unit test of zero and ddp

* compatible with tp, ga and unit test

* fix memory leak and polish

* add zero layer drop unittest

* polish code

* fix import err in unit test

* support different comm dtype, modify docstring style

* polish code

* test padding and fix

* fix unit test of low level zero

* fix pad recording in bucket store

* support some models

* polish

c6ab9698

21 Jul, 2023 1 commit

[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302) · c6f60059

Baizhou Zhang authored Jul 21, 2023

* sharded optimizer checkpoint for gemini plugin

* modify test to reduce testing time

* update doc

* fix bug when keep_gathered is true under GeminiPlugin

c6f60059

19 Jul, 2023 1 commit

[lazy] support init on cuda (#4269) · fc5cef2c

Hongxin Liu authored Jul 19, 2023

* [lazy] support init on cuda

* [test] update lazy init test

* [test] fix transformer version

fc5cef2c

18 Jul, 2023 1 commit

[Kernels] added triton-implemented of self attention for colossal-ai (#4241) · 4b977541

Cuiqing Li authored Jul 18, 2023

* added softmax kernel

* added qkv_kernel

* added ops

* adding tests

* upload tests

* fix tests

* debugging

* debugging tests

* debugging

* added

* fixed errors

* added softmax kernel

* clean codes

* added tests

* update tests

* update tests

* added attention

* add

* fixed pytest checking

* add cuda check

* fix cuda version

* fix typo

4b977541

07 Jul, 2023 1 commit

[checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin (#4141) · 58913441

Baizhou Zhang authored Jul 07, 2023

* [checkpointio] unsharded optimizer checkpoint for Gemini plugin

* [checkpointio] unsharded optimizer checkpoint for Gemini using all_gather

58913441

04 Jul, 2023 2 commits

[format] applied code formatting on changed files in pull request 4152 (#4157) · c77b3b19

github-actions[bot] authored Jul 04, 2023

Co-authored-by: github-actions <github-actions@github.com>

c77b3b19

[shardformer] made tensor parallelism configurable (#4144) · 1fb0d95d

Frank Lee authored Jul 04, 2023

* [shardformer] made tensor parallelism configurable

* polish code

1fb0d95d