Commits · e7cc62d73568795b7ae54a6c13e7056f2048a98a · OpenDAS / ColossalAI

15 Aug, 2023 25 commits

[pipeline] All bert models (#4233) · e7cc62d7

Jianghai authored Jul 17, 2023

* bloom policy

* llama pipeline forward and tests

* fix the output and attention_mask

* fix name

* bind argument to policy

* Revert "bloom policy"

This reverts commit 8dee68a0a22568dbeed6d4563372b25e1e825fb0.

This policy should be reverted and copied to feature/bloom

* revert the bloom changes

* cancel unneeded inputs

* gpt

* finish llama

* causal lm and sequence classification

* revision

* add pure pipeline test

* finish some bert models

* finish all bert models

* finish bert tests

* fix bugs

* fix bugs

* fix test pipeline

* fix data gen for qa

* update the set pipeline forward

* shared params

* fix bugs


[pipeline] add pipeline forward for variants of gpt2 (#4238) · a14d3520

Baizhou Zhang authored Jul 17, 2023

* add forward for GPT2LMHeadModel

* add test for gpt_lm

* arranging get_held_layers method

* arrange forward replacement

* add forward for GPT2ForTokenClassification

* add forward for GPT2ForSequenceClassification

* fix test_shard_gpt2.py

* add GPT2DoubleHeadsModel & fix bugs

* add id checking in get_shared_params


[shardformer] fix base policy (#4229) · 7e4de520
Hongxin Liu authored Jul 14, 2023


[pipeline] Add Pipeline Forward for GPT2Model Shardformer (#4224) · 208ac8f2

Baizhou Zhang authored Jul 13, 2023

* fix typehint & docstring in sharder.py

* update pipeline forward for GPT2Model

* add test for pipeline forward of GPT2Model

* add cache cleaning in gpt2 test

* change assert to raise command


[pipeline] add bloom model pipeline (#4210) · 37d22f68

Jianghai authored Jul 13, 2023

* bloom policy

* llama pipeline forward and tests

* fix the output and attention_mask

* fix name

* bind argument to policy

* finish bloom model

* test shard gpt2

* clear cache


[pipeline] Llama causal lm and llama for sequence classification pipeline (#4208) · 31bcf867

Jianghai authored Jul 11, 2023

* bloom policy

* llama pipeline forward and tests

* fix the output and attention_mask

* fix name

* bind argument to policy

* Revert "bloom policy"

This reverts commit 8dee68a0a22568dbeed6d4563372b25e1e825fb0.

This policy should be reverted and copied to feature/bloom

* revert the bloom changes

* cancel unneeded inputs

* gpt

* finish llama

* causal lm and sequence classification

* revision


[pipeline] Llama pipeline (#4205) · 16220310

Jianghai authored Jul 11, 2023

* bloom policy

* llama pipeline forward and tests

* fix the output and attention_mask

* fix name

* bind argument to policy

* Revert "bloom policy"

This reverts commit 8dee68a0a22568dbeed6d4563372b25e1e825fb0.

This policy should be reverted and copied to feature/bloom

* revert the bloom changes

* cancel unneeded inputs

* gpt


[pipeline] Bert pipeline for shardformer and its tests (#4197) · 1094e0f0

Jianghai authored Jul 10, 2023

* add pipeline forward

* complete pipeline forward check

* fix bert forward without pipeline

* fix comments

* discard useless line

* add todo

* clean prints

* fix distribute layers


[shardformer] support lazy init (#4202) · 890774b2

Hongxin Liu authored Jul 10, 2023

* [shardformer] support lazy init

* [shardformer] linear support lazy init

* [shardformer] embedding support lazy init

* [shardformer] norm support lazy init

* [shardformer] fused linear support lazy init

* [test] update shardformer test layer

* [test] shardformer with lazy init fit ddp

* [lazy] hotfix deepcopy of param

* [shardformer] fix bert policy and update test

* [shardformer] fix bloom policy and update test

* [shardformer] fix opt policy and update test

* [shardformer] fix t5 policy and update test

* [shardformer] fix gpt2 policy and update test

* [shardformer] fix llama policy and update test


[pipeline] move bert related pipeline components to shardformer (#4187) · f3bcc292

Jianghai authored Jul 07, 2023

* move bert related pipeline components to shardformer

* fix bugs

* revision

* fix bert model tests

* fix bert_lm_head model tests

* fix tests

* fix tests

* done checks

* skip bloom


[pipeline] add bert_for_pretraining bert_lmhead forward and policy (#4172) · c5ea7280

Jianghai authored Jul 06, 2023

* add pipeline policy and bert forward to be done

* add bertmodel pipeline forward and make tests

* add Bert_Policy and test for policy

* update formatting

* update formatting

* update the code

* fix bugs

* fix name conflict

* add bloom model and policy, revise the base class of policy

* revise

* revision

* add bert_for_pretraining

* add bert_for_pretraining forward and policy

* fix typos

* cancel warning

* change the imediate output to default dict

* change the default output of get_shared_params


[shardformer] fix type hint · d35bd7d0
ver217 authored Jul 05, 2023

[shardformer] rename policy file name · 1ed3f8a2
ver217 authored Jul 05, 2023

[test] add shard util tests · 5fc60a3a
ver217 authored Jul 05, 2023

[test] update shardformer tests · 2d6cc07f
ver217 authored Jul 05, 2023

[pipeline] update shardformer docstring · b0b8ad28
ver217 authored Jul 05, 2023

[pipeline] update shardformer policy · 59f6f573
ver217 authored Jul 05, 2023


[pipeline] build bloom model and policy, revise the base class of policy (#4161) · 90a65ea6

Jianghai authored Jul 05, 2023

* add pipeline policy and bert forward to be done

* add bertmodel pipeline forward and make tests

* add Bert_Policy and test for policy

* update formatting

* update formatting

* update the code

* fix bugs

* fix name conflict

* add bloom model and policy, revise the base class of policy

* revise

* revision

* add bert_for_pretraining


[pipeline] add pipeline policy and bert forward (#4130) · c552cefa

Jianghai authored Jul 04, 2023

* add pipeline policy and bert forward to be done

* add bertmodel pipeline forward and make tests

* add Bert_Policy and test for policy

* update formatting

* update formatting

* update the code

* fix bugs

* fix name conflict


[pipeline] add stage manager (#4093) · 5c897ddb

Hongxin Liu authored Jun 27, 2023

* [pipeline] add stage manager

* [test] add pipeline stage manager test

* [pipeline] add docstring for stage manager


[pipeline] add pipeline policy and bert forward (#4130) · e8e7e492

Jianghai authored Jul 04, 2023

* add pipeline policy and bert forward to be done

* add bertmodel pipeline forward and make tests

* add Bert_Policy and test for policy

* update formatting

* update formatting

* update the code

* fix bugs

* fix name conflict


[pipeline] refactor 1f1b schedule (#4115) · f51ce1bc

Hongxin Liu authored Jun 29, 2023

* [api] update optimizer wrapper to fit pipeline

* [pipeline] add base schedule

* [pipeline] add 1f1b schedule

* [test] add pipeline schedule utils test

* [pipeline] fix import


[pipeline] implement p2p communication (#4100) · 45fdc9b4

Hongxin Liu authored Jun 28, 2023

* [pipeline] add p2p communication

* [test] add p2p communication test

* [test] add rerun decorator

* [test] rename to avoid conflict


[pipeline] add stage manager (#4093) · 42254422

Hongxin Liu authored Jun 27, 2023

* [pipeline] add stage manager

* [test] add pipeline stage manager test

* [pipeline] add docstring for stage manager


[cluster] add process group mesh (#4039) · 5e1a9d48
Hongxin Liu authored Jun 20, 2023
```
* [cluster] add process group mesh

* [test] add process group mesh test

* force sync
```

14 Aug, 2023 2 commits

[doc] fix a typo in examples/tutorial/auto_parallel/README.md (#4430) · ff836790
Tian Siyuan authored Aug 15, 2023
```
Co-authored-by: Siyuan Tian <siyuant@vmware.com>
```

[doc] update Coati README (#4405) · 6d41c3f2

Wenhao Chen authored Aug 14, 2023

* style: apply formatter

* fix: add outdated warnings

* docs: add dataset format and polish

* docs: polish README

* fix: fix json format

* fix: fix typos

* revert: revert 7b example


11 Aug, 2023 1 commit
- [hotfix] fix unsafe async comm in zero (#4404) · d86ddd9b
  LuGY authored Aug 11, 2023
```
* improve stability of zero

* fix wrong index

* add record stream
```
10 Aug, 2023 1 commit
- [gemini] fix tensor storage cleaning in state dict collection (#4396) · 6ccecc0c
  Baizhou Zhang authored Aug 10, 2023
  
09 Aug, 2023 1 commit
- [kernel] updated unittests for coloattention (#4389) · 458ae331
  flybird1111 authored Aug 09, 2023
```
Updated coloattention tests of checking outputs and gradients
```
04 Aug, 2023 4 commits
- [doc] add Series A Funding and NeurIPS news (#4377) · 089c365f
  binmakeswell authored Aug 04, 2023
```
* [doc] add Series A Funding and NeurIPS news

* [kernel] fix mha kernel

* [CI] skip moe

* [CI] fix requirements
```
- [doc] Fix gradient accumulation doc. (#4349) · f40b7189
  flybird1111 authored Aug 04, 2023
```
* [doc] fix gradient accumulation doc

* [doc] fix gradient accumulation doc
```
- [coloattention] fix import error (#4380) · 38b792aa
  flybird1111 authored Aug 04, 2023
```
fixed an import error
```
- [fix] coloattention support flash attention 2 (#4347) · 25c57b9f
  flybird1111 authored Aug 04, 2023
```
Improved ColoAttention interface to support flash attention 2. Solved #4322 
```
02 Aug, 2023 1 commit

[chat] fix bugs and add unit tests (#4213) · da4f7b85

Wenhao Chen authored Aug 02, 2023

* style: rename replay buffer

Experience replay is typically used for off-policy algorithms.
Using this name in PPO may be misleading.

* fix: fix wrong zero2 default arg

* test: update experience tests

* style: rename zero_pad fn

* fix: defer init in CycledDataLoader

* test: add benchmark test

* style: rename internal fn of generation

* style: rename internal fn of lora

* fix: remove unused loss fn

* fix: remove unused utils fn

* refactor: remove generate_with_actor fn

* fix: fix type annotation

* test: add models tests

* fix: skip llama due to long execution time

* style: modify dataset

* style: apply formatter

* perf: update reward dataset

* fix: fix wrong IGNORE_INDEX in sft dataset

* fix: remove DataCollatorForSupervisedDataset

* test: add dataset tests

* style: apply formatter

* style: rename test_ci to test_train

* feat: add llama in inference

* test: add inference tests

* test: change test scripts directory

* fix: update ci

* fix: fix typo

* fix: skip llama due to oom

* fix: fix file mod

* style: apply formatter

* refactor: remove duplicated llama_gptq

* style: apply formatter

* to: update rm test

* feat: add tokenizer arg

* feat: add download model script

* test: update train tests

* fix: modify gemini load and save pretrained

* test: update checkpoint io test

* to: modify nproc_per_node

* fix: do not remove existing dir

* fix: modify save path

* test: add random choice

* fix: fix sft path

* fix: enlarge nproc_per_node to avoid oom

* fix: add num_retry

* fix: make lora config of rm and critic consistent

* fix: add warning about lora weights

* fix: skip some gpt2 tests

* fix: remove grad ckpt in rm and critic due to errors

* refactor: directly use Actor in train_sft

* test: add more arguments

* fix: disable grad ckpt when using lora

* fix: fix save_pretrained and related tests

* test: enable zero2 tests

* revert: remove useless fn

* style: polish code

* test: modify test args


01 Aug, 2023 5 commits
- [test] remove useless tests (#4359) · 16bf4c02
  Hongxin Liu authored Aug 01, 2023
```
* [test] remove legacy zero test

* [test] remove lazy distribute test

* [test] remove outdated checkpoint io
```
- [hotfix] update gradio 3.11 to 3.34.0 (#4329) · 16c0acc0
  caption authored Aug 01, 2023
  
- [release] update version (#4332) · 80647712
  Hongxin Liu authored Aug 01, 2023
```
* [release] update version

* [devops] hotfix cuda extension building

* [devops] pytest ignore useless folders
```
- [chat] fix compute_approx_kl (#4338) · 75c53890
  Wenhao Chen authored Aug 01, 2023
  
- fix localhost measurement (#4320) · 03654c0c
  LuGY authored Aug 01, 2023
  