Commits · 68ec99e946129298b2e6d8e6463886fe6b22a5df · OpenDAS / ColossalAI

26 Apr, 2024 1 commit
- [hotfix] add soft link to support required files (#5661) · 68ec99e9
  Tong Li authored Apr 26, 2024
  
  68ec99e9
25 Apr, 2024 1 commit

[shardformer] refactor pipeline grad ckpt config (#5646) · 1b387ca9

Hongxin Liu authored Apr 25, 2024

* [shardformer] refactor pipeline grad ckpt config

* [shardformer] refactor pipeline grad ckpt config

* [pipeline] fix stage manager

1b387ca9

23 Apr, 2024 2 commits

[example] llama3 (#5631) · f4c5aafe

binmakeswell authored Apr 23, 2024

* release llama3

* [release] llama3

* [release] llama3

* [release] llama3

* [release] llama3

f4c5aafe

[exampe] update llama example (#5626) · 4de4e318

Hongxin Liu authored Apr 23, 2024

* [plugin] support dp inside for hybriad parallel

* [example] update llama benchmark

* [example] update llama benchmark

* [example] update llama readme

* [example] update llama readme

4de4e318

18 Apr, 2024 1 commit

[hotfix] Fix examples no pad token & auto parallel codegen bug; (#5606) · d83c633c

Edenzzzz authored Apr 18, 2024



* fix no pad token bug

* fixed some auto parallel codegen bug, but might not run on torch 2.1

---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>

d83c633c

08 Apr, 2024 1 commit

[devops] remove post commit ci (#5566) · 641b1ee7

Hongxin Liu authored Apr 08, 2024

* [devops] remove post commit ci

* [misc] run pre-commit on all files

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

641b1ee7

07 Apr, 2024 3 commits
- [hotfix] fix typo s/get_defualt_parser /get_default_parser (#5548) · 341263df
  digger yu authored Apr 07, 2024
  
  341263df
- [fix] fix typo s/muiti-node /multi-node etc. (#5448) · a799ca34
  digger yu authored Apr 07, 2024
  
  a799ca34
- [hotfix] quick fixes to make legacy tutorials runnable (#5559) · 15055f9a
  Edenzzzz authored Apr 07, 2024
```
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
```
  15055f9a
01 Apr, 2024 1 commit

[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous... · e614aa34

Wenhao Chen authored Apr 01, 2024

[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508)

* feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig`

* feat: apply `GradientCheckpointConfig` to policy and llama_forward

* feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager

* fix: add optional args for `distribute_layer` and `get_stage_index`

* fix: fix changed API calls

* test: update llama tests

* style: polish `GradientCheckpointConfig`

* fix: fix pipeline utils tests

e614aa34

28 Mar, 2024 1 commit
- [Fix] Grok-1 use tokenizer from the same pretrained path (#5532) · 36c4bb28
  Yuanheng Zhao authored Mar 28, 2024
```
* [fix] use tokenizer from the same pretrained path

* trust remote code
```
  36c4bb28
27 Mar, 2024 1 commit

[shardformer] fix pipeline forward error if custom layer distribution is used (#5189) · 00525f77

Insu Jang authored Mar 27, 2024



* Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution

* Change static methods for t5 layer distribution to member functions

* Change static methods for whisper layer distribution to member functions

* Replace whisper policy usage with self one

* Fix test case to use non-static layer distribution methods

* fix: fix typo

---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>

00525f77

26 Mar, 2024 1 commit
- [fix] fix grok-1 example typo (#5506) · 131f32a0
  Yuanheng Zhao authored Mar 26, 2024
  
  131f32a0
25 Mar, 2024 2 commits

[release] grok-1 inference benchmark (#5500) · 34e90925

binmakeswell authored Mar 25, 2024

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

34e90925

[hotfix] set return_outputs=False in examples and polish code (#5404) · bb0a668f

Wenhao Chen authored Mar 25, 2024

* fix: simplify merge_batch

* fix: use return_outputs=False to eliminate extra memory consumption

* feat: add return_outputs warning

* style: remove `return_outputs=False` as it is the default value

bb0a668f

24 Mar, 2024 1 commit

[example] update Grok-1 inference (#5495) · 5fcd7795

Yuanheng Zhao authored Mar 24, 2024

* revise grok-1 example

* remove unused arg in scripts

* prevent re-installing torch

* update readme

* revert modifying colossalai requirements

* add perf

* trivial

* add tokenizer url

5fcd7795

22 Mar, 2024 1 commit
- [release] grok-1 314b inference (#5490) · 6df844b8
  binmakeswell authored Mar 22, 2024
```
* [release] grok-1 inference

* [release] grok-1 inference

* [release] grok-1 inference
```
  6df844b8
21 Mar, 2024 1 commit

[example] add grok-1 inference (#5485) · 848a574c

Hongxin Liu authored Mar 21, 2024

* [misc] add submodule

* remove submodule

* [example] support grok-1 tp inference

* [example] add grok-1 inference script

* [example] refactor code

* [example] add grok-1 readme

* [exmaple] add test ci

* [exmaple] update readme

848a574c

12 Mar, 2024 1 commit
- [hotfix] fix typo s/keywrods/keywords etc. (#5429) · 385e85af
  digger yu authored Mar 12, 2024
  
  385e85af
05 Mar, 2024 4 commits
- [hotfix] fix stable diffusion inference bug. (#5289) · 68f55a70
  Youngon authored Mar 05, 2024
```
* Update train_ddp.yaml

delete  "strategy" to fix DDP config loading bug in "main.py"

* Update train_ddp.yaml

fix inference with scripts/txt2img.py config file load bug.

* Update README.md

add pretrain model test code.
```
  68f55a70
- [hotfix] fix typo of openmoe model source (#5403) · e239cf90
  Luo Yihang authored Mar 05, 2024
  
  e239cf90
- [hotfix] fix sd vit import error (#5420) · e304e4db
  MickeyCHAN authored Mar 05, 2024
```
* fix import error

* Update dpt_depth.py

---------
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
```
  e304e4db
- [devops] fix extention building (#5427) · 070df689
  Hongxin Liu authored Mar 05, 2024
  
  070df689
04 Mar, 2024 1 commit

[example]add gpt2 benchmark example script. (#5295) · 29695cf7

flybird11111 authored Mar 04, 2024



* benchmark gpt2

* fix

fix

fix

fix

* [doc] fix typo in Colossal-LLaMA-2/README.md (#5247)

* [workflow] fixed build CI (#5240)

* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish

* [ci] fixed booster test (#5251)

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed ddp test (#5254)

* [ci] fixed ddp test

* polish

* fix typo in  applications/ColossalEval/README.md (#5250)

* [ci] fix shardformer tests. (#5255)

* fix ci

fix

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [doc] fix doc typo (#5256)

* [doc] fix annotation display

* [doc] fix llama2 doc

* [hotfix]: add pp sanity check and fix mbs arg (#5268)

* fix: fix misleading mbs arg

* feat: add pp sanity check

* fix: fix 1f1b sanity check

* [workflow] fixed incomplete bash command (#5272)

* [workflow] fixed oom tests (#5275)

* [workflow] fixed oom tests

* polish

* polish

* polish

* [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)

* fix ci

fix

* fix test

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

* fix

---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [shardformer] hybridparallelplugin support gradients accumulation. (#5246)

* support gradients acc

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

* fix

fix

* fix

fix

fix

* [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230)

* fix auto loading gpt2 tokenizer (#5279)

* [doc] add llama2-13B disyplay (#5285)

* Update README.md

* fix 13b typo

---------
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* fix llama pretrain (#5287)

* fix

* fix

* fix

fix

* fix

fix

fix

* fix

fix

* benchmark gpt2

* fix

fix

fix

fix

* [workflow] fixed build CI (#5240)

* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish

* [ci] fixed booster test (#5251)

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test

* fix

fix

* fix

fix

fix

* fix

* fix

fix

fix

fix

fix

* fix

* Update shardformer.py

---------
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com>
Co-authored-by: Desperado-Jia <502205863@qq.com>

29695cf7

27 Feb, 2024 1 commit
- [example] reuse flash attn patch (#5400) · d882d18c
  Hongxin Liu authored Feb 27, 2024
  
  d882d18c
30 Jan, 2024 1 commit
- fix typo change dosen't to doesn't (#5308) · 71321a07
  digger yu authored Jan 30, 2024
  
  71321a07
25 Jan, 2024 2 commits
- [feat] refactored extension module (#5298) · 7cfed5f0
  Frank Lee authored Jan 25, 2024
```
* [feat] refactored extension module

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish
```
  7cfed5f0
- fix some typo (#5307) · bce9499e
  digger yu authored Jan 25, 2024
  
  bce9499e
19 Jan, 2024 1 commit
- fix llama pretrain (#5287) · f7e3f82a
  flybird11111 authored Jan 19, 2024
  
  f7e3f82a
15 Jan, 2024 1 commit
- [hotfix]: add pp sanity check and fix mbs arg (#5268) · ef4f0ee8
  Wenhao Chen authored Jan 15, 2024
```
* fix: fix misleading mbs arg

* feat: add pp sanity check

* fix: fix 1f1b sanity check
```
  ef4f0ee8
11 Jan, 2024 1 commit
- [doc] fix doc typo (#5256) · c174c4fc
  binmakeswell authored Jan 11, 2024
```
* [doc] fix annotation display

* [doc] fix llama2 doc
```
  c174c4fc
09 Jan, 2024 1 commit

[npu] change device to accelerator api (#5239) · d202cc28

Hongxin Liu authored Jan 09, 2024



* update accelerator

* fix timer

* fix amp

* update

* fix

* update bug

* add error raise

* fix autocast

* fix set device

* remove doc accelerator

* update doc

* update doc

* update doc

* use nullcontext

* update cpu

* update null context

* change time limit for example

* udpate

* update

* update

* update

* [npu] polish accelerator code

---------
Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>

d202cc28

08 Jan, 2024 1 commit

[npu] use extension for op builder (#5172) · dd2c28a3

Xuanlei Zhao authored Jan 08, 2024

* update extension

* update cpu adam

* update is

* add doc for cpu adam

* update kernel

* update commit

* update flash

* update memory efficient

* update flash attn

* update flash attention loader

* update api

* fix

* update doc

* update example time limit

* reverse change

* fix doc

* remove useless kernel

* fix

* not use warning

* update

* update

dd2c28a3

02 Jan, 2024 1 commit

[pipeline]: support arbitrary batch size in forward_only mode (#5201) · 3c0d82b1

Wenhao Chen authored Jan 02, 2024

* fix: remove drop last in val & test dataloader

* feat: add run_forward_only, support arbitrary bs

* chore: modify ci script

3c0d82b1

22 Dec, 2023 1 commit

[pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp (#5134) · 4fa689fc

Wenhao Chen authored Dec 22, 2023

* test: add more p2p tests

* fix: remove send_forward_recv_forward as p2p op list need to use the same group

* fix: make send and receive atomic

* feat: update P2PComm fn

* feat: add metadata cache in 1f1b

* feat: add metadata cache in interleaved pp

* feat: modify is_xx_stage fn

* revert: add _broadcast_object_list

* feat: add interleaved pp in llama policy

* feat: set NCCL_BUFFSIZE in HybridParallelPlugin

4fa689fc

08 Dec, 2023 1 commit
- [gemini] hotfix NaN loss while using Gemini + tensor_parallel (#5150) · 21aa5de0
  flybird11111 authored Dec 08, 2023
```
* fix

aaa

fix

fix

fix

* fix

* fix

* test ci

* fix ci

fix
```
  21aa5de0
28 Nov, 2023 2 commits

[doc] add moe news (#5128) · 177c79f2
binmakeswell authored Nov 28, 2023
```
* [doc] add moe news

* [doc] add moe news

* [doc] add moe news
```
177c79f2

[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088) · 7172459e

Wenhao Chen authored Nov 28, 2023



* [shardformer] implement policy for all GPT-J models and test

* [shardformer] support interleaved pipeline parallel for bert finetune

* [shardformer] shardformer support falcon (#4883)

* [shardformer]: fix interleaved pipeline for bert model (#5048)

* [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093)

* Add Mistral support for Shardformer (#5103)

* [shardformer] add tests to mistral (#5105)

---------
Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>

7172459e

27 Nov, 2023 1 commit
- [nfc] fix typo change directoty to directory (#5111) · d5661f0f
  digger yu authored Nov 27, 2023
  
  d5661f0f
22 Nov, 2023 1 commit
- [npu] add npu support for hybrid plugin and llama (#5090) · 3acbf6d4
  Xuanlei Zhao authored Nov 22, 2023
```
* llama 3d

* update

* fix autocast
```
  3acbf6d4