1. 18 Apr, 2024 1 commit
  2. 25 Mar, 2024 1 commit
  3. 04 Mar, 2024 1 commit
    • [example] add gpt2 benchmark example script. (#5295) · 29695cf7
      flybird11111 authored
      
      
      * benchmark gpt2
      
      * fix
      
      * [doc] fix typo in Colossal-LLaMA-2/README.md (#5247)
      
      * [workflow] fixed build CI (#5240)
      
      * [workflow] fixed build CI
      
      * polish
      
      * polish
      
      * polish
      
      * polish
      
      * polish
      
      * [ci] fixed booster test (#5251)
      
      * [ci] fixed booster test
      
      * [ci] fixed booster test
      
      * [ci] fixed booster test
      
      * [ci] fixed ddp test (#5254)
      
      * [ci] fixed ddp test
      
      * polish
      
      * fix typo in applications/ColossalEval/README.md (#5250)
      
      * [ci] fix shardformer tests. (#5255)
      
      * fix ci
      
      * revert: revert p2p
      
      * feat: add enable_metadata_cache option
      
      * revert: enable t5 tests
      
      ---------
      Co-authored-by: Wenhao Chen <cwher@outlook.com>
      
      * [doc] fix doc typo (#5256)
      
      * [doc] fix annotation display
      
      * [doc] fix llama2 doc
      
      * [hotfix]: add pp sanity check and fix mbs arg (#5268)
      
      * fix: fix misleading mbs arg
      
      * feat: add pp sanity check
      
      * fix: fix 1f1b sanity check
      
      * [workflow] fixed incomplete bash command (#5272)
      
      * [workflow] fixed oom tests (#5275)
      
      * [workflow] fixed oom tests
      
      * polish
      
      * polish
      
      * polish
      
      * [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)
      
      * fix ci
      
      * fix test
      
      * revert: revert p2p
      
      * feat: add enable_metadata_cache option
      
      * revert: enable t5 tests
      
      * fix
      
      ---------
      Co-authored-by: Wenhao Chen <cwher@outlook.com>
      
      * [shardformer] hybridparallelplugin support gradients accumulation. (#5246)
      
      * support gradients acc
      
      * fix
      
      * fix
      
      * [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230)
      
      * fix auto loading gpt2 tokenizer (#5279)
      
      * [doc] add llama2-13B display (#5285)
      
      * Update README.md
      
      * fix 13b typo
      
      ---------
      Co-authored-by: binmakeswell <binmakeswell@gmail.com>
      
      * fix llama pretrain (#5287)
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * benchmark gpt2
      
      * fix
      
      * [workflow] fixed build CI (#5240)
      
      * [workflow] fixed build CI
      
      * polish
      
      * polish
      
      * polish
      
      * polish
      
      * polish
      
      * [ci] fixed booster test (#5251)
      
      * [ci] fixed booster test
      
      * [ci] fixed booster test
      
      * [ci] fixed booster test
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * Update shardformer.py
      
      ---------
      Co-authored-by: digger yu <digger-yu@outlook.com>
      Co-authored-by: Frank Lee <somerlee.9@gmail.com>
      Co-authored-by: Wenhao Chen <cwher@outlook.com>
      Co-authored-by: binmakeswell <binmakeswell@gmail.com>
      Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
      Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com>
      Co-authored-by: Desperado-Jia <502205863@qq.com>
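The gradient-accumulation support referenced in #5246 follows the standard pattern: scale each micro-batch's gradient contribution so the accumulated gradient equals the full-batch average, and apply a single optimizer step after all micro-batches. A minimal framework-free sketch of that idea (all names here are illustrative, not ColossalAI's HybridParallelPlugin API):

```python
# Gradient accumulation sketch for a scalar model y = w * x with
# loss = (y - t)^2. Illustrative only; not the ColossalAI API.

def grad(w, x, t):
    # d/dw (w*x - t)^2 = 2 * (w*x - t) * x
    return 2.0 * (w * x - t) * x

def train_step(w, batch, lr, accum_steps):
    """Split `batch` into `accum_steps` micro-batches, accumulate the
    scaled gradients, and apply one optimizer update at the end."""
    micro = len(batch) // accum_steps
    g = 0.0
    for i in range(accum_steps):
        for x, t in batch[i * micro:(i + 1) * micro]:
            # scale by 1/len(batch) so the sum equals the full-batch mean
            g += grad(w, x, t) / len(batch)
    return w - lr * g  # single update after all micro-batches

batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
w_full = train_step(0.5, batch, lr=0.01, accum_steps=1)
w_accum = train_step(0.5, batch, lr=0.01, accum_steps=4)
```

With correct loss scaling, the accumulated update is numerically identical to the full-batch update, which is what lets a pipeline/tensor-parallel plugin trade memory for effective batch size.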
  4. 09 Jan, 2024 1 commit
  5. 21 Sep, 2023 1 commit
  6. 20 Sep, 2023 1 commit
    • [chat]: update rm, add wandb and fix bugs (#4471) · 7b9b8644
      Wenhao Chen authored
      
      
      * feat: modify forward fn of critic and reward model
      
      * feat: modify calc_action_log_probs
      
      * to: add wandb in sft and rm trainer
      
      * feat: update train_sft
      
      * feat: update train_rm
      
      * style: modify type annotation and add warning
      
      * feat: pass tokenizer to ppo trainer
      
      * to: modify trainer base and maker base
      
      * feat: add wandb in ppo trainer
      
      * feat: pass tokenizer to generate
      
      * test: update generate fn tests
      
      * test: update train tests
      
      * fix: remove action_mask
      
      * feat: remove unused code
      
      * fix: fix wrong ignore_index
      
      * fix: fix mock tokenizer
      
      * chore: update requirements
      
      * revert: modify make_experience
      
      * fix: fix inference
      
      * fix: add padding side
      
      * style: modify _on_learn_batch_end
      
      * test: use mock tokenizer
      
      * fix: use bf16 to avoid overflow
      
      * fix: fix workflow
      
      * [chat] fix gemini strategy
      
      * [chat] fix
      
      * sync: update colossalai strategy
      
      * fix: fix args and model dtype
      
      * fix: fix checkpoint test
      
      * fix: fix requirements
      
      * fix: fix missing import and wrong arg
      
      * fix: temporarily skip gemini test in stage 3
      
      * style: apply pre-commit
      
      * fix: temporarily skip gemini test in stage 1&2
      
      ---------
      Co-authored-by: Mingyan Jiang <1829166702@qq.com>
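The wandb integration in #4471 is wired into the trainers through batch-end hooks such as `_on_learn_batch_end`. A hypothetical sketch of that callback pattern, with a stand-in tracker object (the `WandbLike` class below mimics an experiment tracker and is not the real wandb API):

```python
# Hypothetical sketch of the "log metrics at batch end" pattern.
# `WandbLike` is a stand-in for wandb.log, used here so the example
# stays self-contained.

class WandbLike:
    """Collects logged metric dicts, mimicking an experiment tracker."""
    def __init__(self):
        self.history = []

    def log(self, metrics, step=None):
        self.history.append((step, dict(metrics)))

class RewardModelTrainer:
    """Toy trainer that fires a logging hook after each learn batch."""
    def __init__(self, tracker):
        self.tracker = tracker
        self.step = 0

    def _on_learn_batch_end(self, loss, acc):
        # mirrors the hook name mentioned in the commit message
        self.tracker.log({"train/loss": loss, "train/acc": acc},
                         step=self.step)

    def fit(self, batches):
        for loss, acc in batches:
            self.step += 1
            self._on_learn_batch_end(loss, acc)

tracker = WandbLike()
RewardModelTrainer(tracker).fit([(0.9, 0.55), (0.7, 0.62)])
```

Keeping the tracker behind a hook like this is what allows the same trainer to run with or without wandb installed.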
  7. 19 Sep, 2023 1 commit
  8. 18 Sep, 2023 2 commits
    • github-actions[bot]
    • [legacy] clean up legacy code (#4743) · b5f9e37c
      Hongxin Liu authored
      * [legacy] remove outdated codes of pipeline (#4692)
      
      * [legacy] remove cli of benchmark and update optim (#4690)
      
      * [legacy] remove cli of benchmark and update optim
      
      * [doc] fix cli doc test
      
      * [legacy] fix engine clip grad norm
      
      * [legacy] remove outdated colo tensor (#4694)
      
      * [legacy] remove outdated colo tensor
      
      * [test] fix test import
      
      * [legacy] move outdated zero to legacy (#4696)
      
      * [legacy] clean up utils (#4700)
      
      * [legacy] clean up utils
      
      * [example] update examples
      
      * [legacy] clean up amp
      
      * [legacy] fix amp module
      
      * [legacy] clean up gpc (#4742)
      
      * [legacy] clean up context
      
      * [legacy] clean core, constants and global vars
      
      * [legacy] refactor initialize
      
      * [example] fix examples ci
      
      * [example] fix examples ci
      
      * [legacy] fix tests
      
      * [example] fix gpt example
      
      * [example] fix examples ci
      
      * [devops] fix ci installation
      
      * [example] fix examples ci
  9. 15 Sep, 2023 1 commit
  10. 11 Sep, 2023 1 commit
    • [legacy] move communication and nn to legacy and refactor logger (#4671) · 554aa959
      Hongxin Liu authored
      * [legacy] move communication to legacy (#4640)
      
      * [legacy] refactor logger and clean up legacy codes (#4654)
      
      * [legacy] make logger independent of gpc
      
      * [legacy] make optim independent of registry
      
      * [legacy] move test engine to legacy
      
      * [legacy] move nn to legacy (#4656)
      
      * [legacy] move nn to legacy
      
      * [checkpointio] fix save hf config
      
      * [test] remove useless rpc pp test
      
      * [legacy] fix nn init
      
      * [example] skip tutorial hybrid parallel example
      
      * [devops] test doc check
      
      * [devops] test doc check
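Making the logger "independent of gpc" (#4654) means it no longer needs the global parallel context to exist before it can be constructed. One way to get that decoupling is to read the rank from the environment instead of a global context object; the sketch below is illustrative, not ColossalAI's actual `DistributedLogger`:

```python
import logging
import os

def get_dist_logger(name="dist-logger-sketch", rank=None):
    """Illustrative rank-aware logger: the rank comes from an argument
    or the RANK environment variable, not from a global parallel
    context, so it works before (or without) distributed init."""
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter(f"[rank {rank}] %(levelname)s: %(message)s"))
        logger.addHandler(handler)
    # rank 0 emits INFO; other ranks only emit warnings and errors,
    # so multi-process runs don't print duplicate log lines
    logger.setLevel(logging.INFO if rank == 0 else logging.WARNING)
    return logger

log = get_dist_logger()
log.info("initialized")  # emitted only on rank 0
```

Because nothing here touches a global context, the same logger can be used by legacy and non-legacy code paths alike.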
  11. 05 Sep, 2023 2 commits
  12. 24 Aug, 2023 1 commit
    • [gemini] improve compatibility and add static placement policy (#4479) · 27061426
      Hongxin Liu authored
      * [gemini] remove distributed-related part from colotensor (#4379)
      
      * [gemini] remove process group dependency
      
      * [gemini] remove tp part from colo tensor
      
      * [gemini] patch inplace op
      
      * [gemini] fix param op hook and update tests
      
      * [test] remove useless tests
      
      * [test] remove useless tests
      
      * [misc] fix requirements
      
      * [test] fix model zoo
      
      * [test] fix model zoo
      
      * [test] fix model zoo
      
      * [test] fix model zoo
      
      * [test] fix model zoo
      
      * [misc] update requirements
      
      * [gemini] refactor gemini optimizer and gemini ddp (#4398)
      
      * [gemini] update optimizer interface
      
      * [gemini] renaming gemini optimizer
      
      * [gemini] refactor gemini ddp class
      
      * [example] update gemini related example
      
      * [example] update gemini related example
      
      * [plugin] fix gemini plugin args
      
      * [test] update gemini ckpt tests
      
      * [gemini] fix checkpoint io
      
      * [example] fix opt example requirements
      
      * [example] fix opt example
      
      * [example] fix opt example
      
      * [example] fix opt example
      
      * [gemini] add static placement policy (#4443)
      
      * [gemini] add static placement policy
      
      * [gemini] fix param offload
      
      * [test] update gemini tests
      
      * [plugin] update gemini plugin
      
      * [plugin] update gemini plugin docstr
      
      * [misc] fix flash attn requirement
      
      * [test] fix gemini checkpoint io test
      
      * [example] update resnet example result (#4457)
      
      * [example] update bert example result (#4458)
      
      * [doc] update gemini doc (#4468)
      
      * [example] update gemini related examples (#4473)
      
      * [example] update gpt example
      
      * [example] update dreambooth example
      
      * [example] update vit
      
      * [example] update opt
      
      * [example] update palm
      
      * [example] update vit and opt benchmark
      
      * [hotfix] fix bert in model zoo (#4480)
      
      * [hotfix] fix bert in model zoo
      
      * [test] remove chatglm gemini test
      
      * [test] remove sam gemini test
      
      * [test] remove vit gemini test
      
      * [hotfix] fix opt tutorial example (#4497)
      
      * [hotfix] fix opt tutorial example
      
      * [hotfix] fix opt tutorial example
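The static placement policy added in #4443 decides once, up front, which share of parameter memory stays on the accelerator and which is offloaded to CPU, instead of migrating chunks adaptively at runtime. A minimal sketch of that decision, assuming a fraction-based split (the `offload_param_frac` name is hypothetical here, not a guaranteed GeminiPlugin argument):

```python
# Illustrative "static" placement decision: keep a fixed fraction of
# total parameter bytes on the device, offload the rest to CPU. The
# split is computed once, before training starts.

def static_placement(param_sizes, offload_param_frac):
    """Return (device_params, cpu_params) index lists so that roughly
    `offload_param_frac` of the total bytes ends up offloaded."""
    total = sum(param_sizes)
    budget = total * (1.0 - offload_param_frac)  # bytes allowed on device
    device, cpu, used = [], [], 0
    for i, size in enumerate(param_sizes):
        if used + size <= budget:
            device.append(i)   # fits in the on-device budget
            used += size
        else:
            cpu.append(i)      # over budget: offload to CPU
    return device, cpu

# Four equal-size parameter chunks, offload half the bytes.
dev, cpu = static_placement([4, 4, 4, 4], offload_param_frac=0.5)
```

A static split trades some flexibility for predictability: memory usage is known before the first step, which simplifies checkpoint I/O and compatibility with the plugins touched in this commit series.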
  13. 28 Jun, 2023 1 commit
  14. 26 Jun, 2023 1 commit
  15. 19 Jun, 2023 1 commit
  16. 08 Jun, 2023 1 commit
  17. 30 May, 2023 1 commit
  18. 18 May, 2023 1 commit
  19. 26 Apr, 2023 1 commit
    • [doc] Fix typo under colossalai and doc (#3618) · b9a8dff7
      digger-yu authored
      * Fixed several spelling errors under colossalai
      
      * Fix the spelling error in colossalai and docs directory
      
      * Carefully changed the spelling errors under the example folder
      
      * Update runtime_preparation_pass.py
      
      revert autograft to autograd
      
      * Update search_chunk.py
      
      utile to until
      
      * Update check_installation.py
      
      change misteach to mismatch in line 91
      
      * Update 1D_tensor_parallel.md
      
      revert to perceptron
      
      * Update 2D_tensor_parallel.md
      
      revert to perceptron in line 73
      
      * Update 2p5D_tensor_parallel.md
      
      revert to perceptron in line 71
      
      * Update 3D_tensor_parallel.md
      
      revert to perceptron in line 80
      
      * Update README.md
      
      revert to resnet in line 42
      
      * Update reorder_graph.py
      
      revert to indice in line 7
      
      * Update p2p.py
      
      revert to megatron in line 94
      
      * Update initialize.py
      
      revert to torchrun in line 198
      
      * Update routers.py
      
      change to detailed in line 63
      
      * Update routers.py
      
      change to detailed in line 146
      
      * Update README.md
      
      revert random number in line 402
  20. 06 Apr, 2023 1 commit
  21. 04 Apr, 2023 1 commit
    • [zero] reorganize zero/gemini folder structure (#3424) · 26b7aac0
      ver217 authored
      * [zero] refactor low-level zero folder structure
      
      * [zero] fix legacy zero import path
      
      * [zero] fix legacy zero import path
      
      * [zero] remove useless import
      
      * [zero] refactor gemini folder structure
      
      * [zero] refactor gemini folder structure
      
      * [zero] refactor legacy zero import path
      
      * [zero] refactor gemini folder structure
      
      * [zero] refactor gemini folder structure
      
      * [zero] refactor gemini folder structure
      
      * [zero] refactor legacy zero import path
      
      * [zero] fix test import path
      
      * [zero] fix test
      
      * [zero] fix circular import
      
      * [zero] update import
  22. 23 Mar, 2023 1 commit
  23. 21 Mar, 2023 1 commit
  24. 07 Mar, 2023 1 commit
  25. 27 Feb, 2023 2 commits
  26. 22 Feb, 2023 1 commit
  27. 15 Feb, 2023 1 commit
  28. 09 Feb, 2023 1 commit
  29. 31 Jan, 2023 1 commit
  30. 30 Jan, 2023 1 commit
  31. 28 Jan, 2023 1 commit
  32. 20 Jan, 2023 1 commit
  33. 18 Jan, 2023 3 commits
  34. 17 Jan, 2023 1 commit
  35. 16 Jan, 2023 1 commit