1. 18 Apr, 2024 1 commit
  2. 08 Apr, 2024 1 commit
  3. 07 Apr, 2024 2 commits
  4. 01 Apr, 2024 1 commit
    • [shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogeneous... · e614aa34
      Wenhao Chen authored
      [shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogeneous shard policy for llama (#5508)
      
      * feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig`
      
      * feat: apply `GradientCheckpointConfig` to policy and llama_forward
      
      * feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager
      
      * fix: add optional args for `distribute_layer` and `get_stage_index`
      
      * fix: fix changed API calls
      
      * test: update llama tests
      
      * style: polish `GradientCheckpointConfig`
      
      * fix: fix pipeline utils tests
      e614aa34
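
      The sketch below shows how the `PipelineGradientCheckpointConfig` named in this commit might be wired into a hybrid-parallel run. The import path, the `gradient_checkpointing_ratio` keyword, and the `gradient_checkpoint_config` plugin argument follow the commit's naming and are assumptions, not a verified API.

      ```python
      # Hedged sketch: names follow the commit message; verify against the shipped
      # colossalai.shardformer API before relying on them.
      import colossalai
      from colossalai.booster import Booster
      from colossalai.booster.plugin import HybridParallelPlugin
      from colossalai.shardformer import PipelineGradientCheckpointConfig  # assumed import path

      colossalai.launch_from_torch(config={})   # run under torchrun; API as of early 2024

      # Recompute roughly half of the layers, letting the heterogeneous policy decide
      # per pipeline stage how many layers to checkpoint.
      gc_config = PipelineGradientCheckpointConfig(gradient_checkpointing_ratio=0.5)

      plugin = HybridParallelPlugin(
          tp_size=1,
          pp_size=2,
          num_microbatches=4,
          gradient_checkpoint_config=gc_config,  # assumed keyword
      )
      booster = Booster(plugin=plugin)
      ```
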
  5. 28 Mar, 2024 1 commit
  6. 27 Mar, 2024 1 commit
  7. 26 Mar, 2024 1 commit
  8. 25 Mar, 2024 2 commits
  9. 24 Mar, 2024 1 commit
    • [example] update Grok-1 inference (#5495) · 5fcd7795
      Yuanheng Zhao authored
      * revise grok-1 example
      
      * remove unused arg in scripts
      
      * prevent re-installing torch
      
      * update readme
      
      * revert modifying colossalai requirements
      
      * add perf
      
      * trivial
      
      * add tokenizer url
      5fcd7795
  10. 22 Mar, 2024 1 commit
  11. 21 Mar, 2024 1 commit
    • [example] add grok-1 inference (#5485) · 848a574c
      Hongxin Liu authored
      * [misc] add submodule
      
      * remove submodule
      
      * [example] support grok-1 tp inference
      
      * [example] add grok-1 inference script
      
      * [example] refactor code
      
      * [example] add grok-1 readme
      
      * [example] add test ci

      * [example] update readme
      848a574c
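
      For orientation only, the generic Hugging Face loading pattern such an inference example builds on looks roughly like this; it is not the repository's script (which layers ColossalAI tensor parallelism on top), and the checkpoint id is an assumption.

      ```python
      # Illustrative only: plain transformers loading; the repo example adds
      # ColossalAI tensor-parallel sharding on top. Model id is an assumption.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_id = "hpcai-tech/grok-1"  # assumed checkpoint id
      tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
      model = AutoModelForCausalLM.from_pretrained(
          model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
      )

      prompt = "The capital of France is"
      inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
      output = model.generate(**inputs, max_new_tokens=32)
      print(tokenizer.decode(output[0], skip_special_tokens=True))
      ```
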
  12. 05 Mar, 2024 2 commits
  13. 04 Mar, 2024 1 commit
    • [example] add gpt2 benchmark example script. (#5295) · 29695cf7
      flybird11111 authored
      * benchmark gpt2
      
      * fix
      
      * [doc] fix typo in Colossal-LLaMA-2/README.md (#5247)
      
      * [workflow] fixed build CI (#5240)
      
      * [workflow] fixed build CI
      
      * polish
      
      * polish
      
      * polish
      
      * polish
      
      * polish
      
      * [ci] fixed booster test (#5251)
      
      * [ci] fixed booster test
      
      * [ci] fixed booster test
      
      * [ci] fixed booster test
      
      * [ci] fixed ddp test (#5254)
      
      * [ci] fixed ddp test
      
      * polish
      
      * fix typo in applications/ColossalEval/README.md (#5250)
      
      * [ci] fix shardformer tests. (#5255)
      
      * fix ci
      
      fix
      
      * revert: revert p2p
      
      * feat: add enable_metadata_cache option
      
      * revert: enable t5 tests
      
      ---------
      Co-authored-by: Wenhao Chen <cwher@outlook.com>
      
      * [doc] fix doc typo (#5256)
      
      * [doc] fix annotation display
      
      * [doc] fix llama2 doc
      
      * [hotfix]: add pp sanity check and fix mbs arg (#5268)
      
      * fix: fix misleading mbs arg
      
      * feat: add pp sanity check
      
      * fix: fix 1f1b sanity check
      
      * [workflow] fixed incomplete bash command (#5272)
      
      * [workflow] fixed oom tests (#5275)
      
      * [workflow] fixed oom tests
      
      * polish
      
      * polish
      
      * polish
      
      * [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)
      
      * fix ci
      
      fix
      
      * fix test
      
      * revert: revert p2p
      
      * feat: add enable_metadata_cache option
      
      * revert: enable t5 tests
      
      * fix
      
      ---------
      Co-authored-by: Wenhao Chen <cwher@outlook.com>
      
      * [shardformer] hybridparallelplugin support gradients accumulation. (#5246)
      
      * support gradients acc

      * fix
      
      * [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230)
      
      * fix auto loading gpt2 tokenizer (#5279)
      
      * [doc] add llama2-13B display (#5285)
      
      * Update README.md
      
      * fix 13b typo
      
      ---------
      Co-authored-by: binmakeswell <binmakeswell@gmail.com>
      
      * fix llama pretrain (#5287)
      
      * fix

      * Update shardformer.py
      
      ---------
      Co-authored-by: digger yu <digger-yu@outlook.com>
      Co-authored-by: Frank Lee <somerlee.9@gmail.com>
      Co-authored-by: Wenhao Chen <cwher@outlook.com>
      Co-authored-by: binmakeswell <binmakeswell@gmail.com>
      Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
      Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com>
      Co-authored-by: Desperado-Jia <502205863@qq.com>
      29695cf7
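
      Two features folded into this merge, the `enable_metadata_cache` flag and gradient accumulation with `HybridParallelPlugin`, might be exercised roughly as below; the keyword names and the accumulation loop are inferred from the commit messages, not taken from the released code.

      ```python
      # Hedged sketch: keyword names mirror the commit messages and may differ from
      # the released API; the accumulation loop is the generic booster pattern.
      import colossalai
      import torch
      from colossalai.booster import Booster
      from colossalai.booster.plugin import HybridParallelPlugin

      colossalai.launch_from_torch(config={})        # run under torchrun; API as of early 2024

      # Toy model so the sketch is self-contained; real runs pass a transformer here.
      model = torch.nn.Linear(32, 2)
      optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
      criterion = torch.nn.CrossEntropyLoss()

      plugin = HybridParallelPlugin(
          tp_size=1,
          pp_size=1,
          precision="bf16",
          enable_metadata_cache=True,                # assumed flag name from the commit message
      )
      booster = Booster(plugin=plugin)
      model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)

      accum_steps = 4
      for step in range(16):
          x = torch.randn(8, 32, device="cuda", dtype=torch.bfloat16)
          y = torch.randint(0, 2, (8,), device="cuda")
          loss = criterion(model(x), y) / accum_steps
          booster.backward(loss, optimizer)          # gradients accumulate across micro-steps
          if (step + 1) % accum_steps == 0:
              optimizer.step()
              optimizer.zero_grad()
      ```
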
  14. 27 Feb, 2024 1 commit
  15. 30 Jan, 2024 1 commit
  16. 19 Jan, 2024 1 commit
  17. 15 Jan, 2024 1 commit
  18. 11 Jan, 2024 1 commit
  19. 09 Jan, 2024 1 commit
  20. 08 Jan, 2024 1 commit
    • [npu] use extension for op builder (#5172) · dd2c28a3
      Xuanlei Zhao authored
      * update extension
      
      * update cpu adam
      
      * update is
      
      * add doc for cpu adam
      
      * update kernel
      
      * update commit
      
      * update flash
      
      * update memory efficient
      
      * update flash attn
      
      * update flash attention loader
      
      * update api
      
      * fix
      
      * update doc
      
      * update example time limit
      
      * reverse change
      
      * fix doc
      
      * remove useless kernel
      
      * fix
      
      * not use warning
      
      * update
      
      * update
      dd2c28a3
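
      The CPU Adam kernel touched here is one of the ops now built through the extension-based op builder; a minimal usage sketch follows, with the import path assumed rather than guaranteed.

      ```python
      # Hedged sketch: CPUAdam keeps optimizer states on the CPU and builds its fused
      # kernel lazily through the extension/op-builder machinery on first use.
      import torch
      from colossalai.nn.optimizer import CPUAdam  # assumed public import path

      model = torch.nn.Linear(1024, 1024)           # parameters stay on the CPU
      optimizer = CPUAdam(model.parameters(), lr=1e-3)

      loss = model(torch.randn(16, 1024)).sum()
      loss.backward()
      optimizer.step()                              # first call triggers the kernel build
      optimizer.zero_grad()
      ```
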
  21. 02 Jan, 2024 1 commit
  22. 22 Dec, 2023 1 commit
    • [pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp (#5134) · 4fa689fc
      Wenhao Chen authored
      * test: add more p2p tests
      
      * fix: remove send_forward_recv_forward as p2p op list need to use the same group
      
      * fix: make send and receive atomic
      
      * feat: update P2PComm fn
      
      * feat: add metadata cache in 1f1b
      
      * feat: add metadata cache in interleaved pp
      
      * feat: modify is_xx_stage fn
      
      * revert: add _broadcast_object_list
      
      * feat: add interleaved pp in llama policy
      
      * feat: set NCCL_BUFFSIZE in HybridParallelPlugin
      4fa689fc
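
      The last bullet refers to NCCL_BUFFSIZE, the standard NCCL environment variable capping per-connection buffer memory, which the plugin now sets automatically per the commit; below is a hedged sketch of setting it by hand and requesting interleaved pipeline scheduling. The `pp_style` and `num_model_chunks` keywords are assumptions based on the commit wording.

      ```python
      # Hedged sketch: NCCL_BUFFSIZE is a standard NCCL env var (bytes per buffer) and
      # must be set before NCCL initializes; the interleaved-pp keywords are assumptions.
      import os
      os.environ.setdefault("NCCL_BUFFSIZE", str(128 * 1024 * 1024))  # e.g. 128 MiB

      import colossalai
      from colossalai.booster.plugin import HybridParallelPlugin

      colossalai.launch_from_torch(config={})   # run under torchrun; API as of early 2024
      plugin = HybridParallelPlugin(
          tp_size=1,
          pp_size=2,
          num_microbatches=8,
          pp_style="interleaved",   # assumed keyword for the interleaved schedule
          num_model_chunks=2,       # assumed keyword: model chunks per stage
      )
      ```
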
  23. 08 Dec, 2023 1 commit
  24. 28 Nov, 2023 2 commits
  25. 27 Nov, 2023 1 commit
  26. 22 Nov, 2023 2 commits
  27. 20 Nov, 2023 2 commits
  28. 18 Nov, 2023 1 commit
  29. 16 Nov, 2023 1 commit
    • [pipeline,shardformer] Fix p2p efficiency in pipeline, allow skipping loading... · b2ad0d9e
      Elsa Granger authored
      [pipeline,shardformer] Fix p2p efficiency in pipeline, allow skipping loading weight not in weight_map when `strict=False`, fix llama flash attention forward, add flop estimation by megatron in llama benchmark (#5017)
      
      * Use p2p
      
      * Cannot do bidirectional p2p send
      
      * Refactor tensor creation and serialization in P2P
      communication
      
      * Fix llama forward args in flash attention
      
      * Add flop estimate from megatron
      
      * Support loading weight not in weight_map when strict=False in hybrid_parallel
      
      * Use send_forward_recv_backward, etc in 1f1b
      
      * Use dataclass for metadata
      Remove torch.cuda.synchronize() as suggested
      
      * Add comment about the torch.cuda.synchronize for potential error
      
      * Typo
      
      * Update hybrid_parallel_checkpoint_io.py
      
      * Update p2p.py
      
      * Update one_f_one_b.py
      
      * Update p2p.py
      
      ---------
      Co-authored-by: flybird11111 <1829166702@qq.com>
      b2ad0d9e
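
      The "flop estimate from megatron" bullet refers to the Megatron-LM approximation of transformer training FLOPs; a sketch of that formula is below. It is the vanilla GPT-style estimate, so treat it as the published approximation rather than the llama benchmark's exact code (which must account for SwiGLU and grouped-query attention).

      ```python
      def megatron_flops_per_iteration(batch_size: int, seq_len: int, num_layers: int,
                                       hidden_size: int, vocab_size: int,
                                       activation_recomputation: bool = False) -> float:
          """Approximate fwd+bwd FLOPs per iteration for a GPT-style transformer,
          following the Megatron-LM estimate (vanilla attention/MLP only)."""
          if activation_recomputation:
              # 96 * B * s * l * h^2 * (1 + s/(6h) + V/(16*l*h))
              return 96 * batch_size * seq_len * num_layers * hidden_size ** 2 * (
                  1 + seq_len / (6 * hidden_size) + vocab_size / (16 * num_layers * hidden_size))
          # 72 * B * s * l * h^2 * (1 + s/(6h) + V/(12*l*h))
          return 72 * batch_size * seq_len * num_layers * hidden_size ** 2 * (
              1 + seq_len / (6 * hidden_size) + vocab_size / (12 * num_layers * hidden_size))

      # Example: a 7B-class model, global batch 8, sequence length 4096.
      tflops = megatron_flops_per_iteration(8, 4096, 32, 4096, 32000) / 1e12
      ```
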
  30. 09 Nov, 2023 1 commit
    • [moe]: fix ep/tp tests, add hierarchical all2all (#4982) · 72444127
      Wenhao Chen authored
      * fix: add warning for EP different behavior
      
      * fix: use shard_data in ep & tp model
      
      * to: add used_capacity
      
      * fix: fix router test
      
      * feat: add create_ep_node_group
      
      * feat: add create_ep_hierarchical_group fn
      
      * feat: add HierarchicalAllToAll
      
      * test: add hierarchical all2all test
      
      * fix: fix test errors
      
      * fix: simplify create_ep_hierarchical_group
      
      * fix: add hierarchical_alltoall arg
      
      * fix: fix environ typo
      
      * revert: revert process mesh order
      
      * to: add todo mark
      
      * fix: skip hierarchical_comm if torch < 1.13.1
      72444127
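
      Hierarchical all-to-all splits the global token exchange into an intra-node stage followed by an inter-node stage so each slow cross-node link carries one aggregated message; the sketch below is a simplified two-stage decomposition in raw torch.distributed, with the rank layout and process groups assumed for illustration rather than ColossalAI's actual implementation.

      ```python
      import torch
      import torch.distributed as dist

      def hierarchical_all_to_all(x: torch.Tensor,
                                  intra_group: dist.ProcessGroup,
                                  inter_group: dist.ProcessGroup) -> torch.Tensor:
          """Two-stage all-to-all: intra-node exchange first, then inter-node.

          Assumes global rank = node_id * local_size + local_rank, `x` has shape
          (world_size, chunk) with row d destined for global rank d, `intra_group`
          holds this node's ranks ordered by local rank, and `inter_group` holds
          the ranks sharing this local rank ordered by node id.
          """
          L = dist.get_world_size(intra_group)   # ranks per node
          N = dist.get_world_size(inter_group)   # number of nodes
          chunk = x.shape[-1]

          # Stage 1: regroup chunks by destination *local rank*, exchange inside the node.
          x = x.view(N, L, chunk).transpose(0, 1).contiguous().view(N * L, chunk)
          stage1 = torch.empty_like(x)
          dist.all_to_all_single(stage1, x, group=intra_group)

          # Stage 2: regroup by destination *node*, exchange across nodes; each
          # cross-node link now carries one aggregated message instead of L of them.
          stage1 = stage1.view(L, N, chunk).transpose(0, 1).contiguous().view(N * L, chunk)
          out = torch.empty_like(stage1)
          dist.all_to_all_single(out, stage1, group=inter_group)
          return out   # row i is the chunk sent to this rank by global rank i
      ```
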
  31. 08 Nov, 2023 1 commit
    • [moe] support optimizer checkpoint (#5015) · f71e63b0
      Xuanlei Zhao authored
      * Refactor MoE Manager setup method
      
      * unshard optim ckpt
      
      * optim io
      
      * update transformer version
      
      * update requirements
      
      * update ckpt
      
      * update ckpt
      
      * update ckpt
      
      * fix engine
      
      * fix engine
      f71e63b0
  32. 02 Nov, 2023 1 commit
  33. 07 Oct, 2023 1 commit
  34. 21 Sep, 2023 1 commit