Commits · 7e0ec5a85c73fcc5666b9d218e43865141587dde · OpenDAS / ColossalAI

02 Apr, 2024 1 commit
- fix incorrect sharding without zero (#5545) · 7e0ec5a8
  Edenzzzz authored Apr 02, 2024
```
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
```
  7e0ec5a8
01 Apr, 2024 1 commit

[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous... · e614aa34

Wenhao Chen authored Apr 01, 2024

[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508)

* feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig`

* feat: apply `GradientCheckpointConfig` to policy and llama_forward

* feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager

* fix: add optional args for `distribute_layer` and `get_stage_index`

* fix: fix changed API calls

* test: update llama tests

* style: polish `GradientCheckpointConfig`

* fix: fix pipeline utils tests

e614aa34

29 Mar, 2024 1 commit

[ColossalChat] Update RLHF V2 (#5286) · df5e9c53

YeAnbang authored Mar 29, 2024



* Add dpo. Fix sft, ppo, lora. Refactor all

* fix and tested ppo

* 2 nd round refactor

* add ci tests

* fix ci

* fix ci

* fix readme, style

* fix readme style

* fix style, fix benchmark

* reproduce benchmark result, remove useless files

* rename to ColossalChat

* use new image

* fix ci workflow

* fix ci

* use local model/tokenizer for ci tests

* fix ci

* fix ci

* fix ci

* fix ci timeout

* fix rm progress bar. fix ci timeout

* fix ci

* fix ci typo

* remove 3d plugin from ci temporary

* test environment

* cannot save optimizer

* support chat template

* fix readme

* fix path

* test ci locally

* restore build_or_pr

* fix ci data path

* fix benchmark

* fix ci, move ci tests to 3080, disable fast tokenizer

* move ci to 85

* support flash attention 2

* add all-in-one data preparation script. Fix colossal-llama2-chat chat template

* add hardware requirements

* move ci test data

* fix save_model, add unwrap

* fix missing bos

* fix missing bos; support grad accumulation with gemini

* fix ci

* fix ci

* fix ci

* fix llama2 chat template config

* debug sft

* debug sft

* fix colossalai version requirement

* fix ci

* add sanity check to prevent NaN loss

* fix requirements

* add dummy data generation script

* add dummy data generation script

* add dummy data generation script

* add dummy data generation script

* update readme

* update readme

* update readme and ignore

* fix logger bug

* support parallel_output

* modify data preparation logic

* fix tokenization

* update lr

* fix inference

* run pre-commit

---------
Co-authored-by: Tong Li <tong.li352711588@gmail.com>

df5e9c53

28 Mar, 2024 1 commit
- [Fix] Grok-1 use tokenizer from the same pretrained path (#5532) · 36c4bb28
  Yuanheng Zhao authored Mar 28, 2024
```
* [fix] use tokenizer from the same pretrained path

* trust remote code
```
  36c4bb28
27 Mar, 2024 3 commits

[shardformer] fix pipeline forward error if custom layer distribution is used (#5189) · 00525f77

Insu Jang authored Mar 27, 2024



* Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution

* Change static methods for t5 layer distribution to member functions

* Change static methods for whisper layer distribution to member functions

* Replace whisper policy usage with self one

* Fix test case to use non-static layer distribution methods

* fix: fix typo

---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>

00525f77

[format] applied code formatting on changed files in pull request 5510 (#5517) · e6707a6e
github-actions[bot] authored Mar 27, 2024
```
Co-authored-by: github-actions <github-actions@github.com>
```
e6707a6e

[shardformer] update colo attention to support custom mask (#5510) · 19e1a5cf

Hongxin Liu authored Mar 27, 2024

* [feature] refactor colo attention (#5462)

* [extension] update api

* [feature] add colo attention

* [feature] update sdpa

* [feature] update npu attention

* [feature] update flash-attn

* [test] add flash attn test

* [test] update flash attn test

* [shardformer] update modeling to fit colo attention (#5465)

* [misc] refactor folder structure

* [shardformer] update llama flash-attn

* [shardformer] fix llama policy

* [devops] update tensornvme install

* [test] update llama test

* [shardformer] update colo attn kernel dispatch

* [shardformer] update blip2

* [shardformer] update chatglm

* [shardformer] update gpt2

* [shardformer] update gptj

* [shardformer] update opt

* [shardformer] update vit

* [shardformer] update colo attention mask prep

* [shardformer] update whisper

* [test] fix shardformer tests (#5514)

* [test] fix shardformer tests

* [test] fix shardformer tests

19e1a5cf

26 Mar, 2024 6 commits
- Merge pull request #5515 from Edenzzzz/fix_layout_convert · 9a3321e9
  Edenzzzz authored Mar 26, 2024
```
Fix layout convertor caching
```
  9a3321e9
- Empty-Commit · 18edcd53
  Edenzzzz authored Mar 26, 2024
  
  18edcd53
- fixed layout converter caching and updated tester · 61da3fbc
  Edenzzzz authored Mar 26, 2024
  
  61da3fbc
- Fix ColoTensorSpec for py11 (#5440) · cbe34c55
  Rocky Duan authored Mar 26, 2024
  
  cbe34c55
- [devops] fix example test ci (#5504) · a7790a92
  Hongxin Liu authored Mar 26, 2024
  
  a7790a92
- [fix] fix grok-1 example typo (#5506) · 131f32a0
  Yuanheng Zhao authored Mar 26, 2024
  
  131f32a0
25 Mar, 2024 3 commits

[shardformer]Fix lm parallel. (#5480) · 0688d92e

flybird11111 authored Mar 25, 2024

* fix

* padding vocab_size when using pipeline parallellism

padding vocab_size when using pipeline parallellism

fix

fix

* fix

* fix

fix

fix

* fix gather output

* fix

* fix

* fix

fix resize embedding

fix resize embedding

* fix resize embedding

fix

* revert

* revert

* revert

* fix lm forward distribution

* fix

* test ci

* fix

0688d92e

[release] grok-1 inference benchmark (#5500) · 34e90925

binmakeswell authored Mar 25, 2024

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

34e90925

[hotfix] set return_outputs=False in examples and polish code (#5404) · bb0a668f

Wenhao Chen authored Mar 25, 2024

* fix: simplify merge_batch

* fix: use return_outputs=False to eliminate extra memory consumption

* feat: add return_outputs warning

* style: remove `return_outputs=False` as it is the default value

bb0a668f

24 Mar, 2024 1 commit

[example] update Grok-1 inference (#5495) · 5fcd7795

Yuanheng Zhao authored Mar 24, 2024

* revise grok-1 example

* remove unused arg in scripts

* prevent re-installing torch

* update readme

* revert modifying colossalai requirements

* add perf

* trivial

* add tokenizer url

5fcd7795

22 Mar, 2024 1 commit
- [release] grok-1 314b inference (#5490) · 6df844b8
  binmakeswell authored Mar 22, 2024
```
* [release] grok-1 inference

* [release] grok-1 inference

* [release] grok-1 inference
```
  6df844b8
21 Mar, 2024 1 commit

[example] add grok-1 inference (#5485) · 848a574c

Hongxin Liu authored Mar 21, 2024

* [misc] add submodule

* remove submodule

* [example] support grok-1 tp inference

* [example] add grok-1 inference script

* [example] refactor code

* [example] add grok-1 readme

* [exmaple] add test ci

* [exmaple] update readme

848a574c

20 Mar, 2024 1 commit

[doc] update open-sora demo (#5479) · d158fc0e

binmakeswell authored Mar 20, 2024

* [doc] update open-sora demo

* [doc] update open-sora demo

* [doc] update open-sora demo

d158fc0e

18 Mar, 2024 2 commits

[doc] release Open-Sora 1.0 with model weights (#5468) · bd998ced

binmakeswell authored Mar 18, 2024

* [doc] release Open-Sora 1.0 with model weights

* [doc] release Open-Sora 1.0 with model weights

* [doc] release Open-Sora 1.0 with model weights

bd998ced

[shardformer] fix gathering output when using tensor parallelism (#5431) · 5e16bf79

flybird11111 authored Mar 18, 2024

* fix

* padding vocab_size when using pipeline parallellism

padding vocab_size when using pipeline parallellism

fix

fix

* fix

* fix

fix

fix

* fix gather output

* fix

* fix

* fix

fix resize embedding

fix resize embedding

* fix resize embedding

fix

* revert

* revert

* revert

5e16bf79

13 Mar, 2024 1 commit

[devops] fix compatibility (#5444) · f2e8b9ef

Hongxin Liu authored Mar 13, 2024

* [devops] fix compatibility

* [hotfix] update compatibility test on pr

* [devops] fix compatibility

* [devops] record duration during comp test

* [test] decrease test duration

* fix falcon

f2e8b9ef

12 Mar, 2024 1 commit
- [hotfix] fix typo s/keywrods/keywords etc. (#5429) · 385e85af
  digger yu authored Mar 12, 2024
  
  385e85af
11 Mar, 2024 1 commit
- fix tensor data update for gemini loss caluculation (#5442) · da885ed5
  Camille Zhong authored Mar 11, 2024
  
  da885ed5
07 Mar, 2024 2 commits
- [release] update version (#5411) · 8020f426
  Hongxin Liu authored Mar 07, 2024
  
  8020f426
- [colossal-llama2] add stream chat examlple for chat version model (#5428) · 743e7fad
  Camille Zhong authored Mar 07, 2024
```
* add stream chat for chat version

* remove os.system clear

* modify function name
```
  743e7fad
05 Mar, 2024 11 commits
- [hotfix] fix stable diffusion inference bug. (#5289) · 68f55a70
  Youngon authored Mar 05, 2024
```
* Update train_ddp.yaml

delete  "strategy" to fix DDP config loading bug in "main.py"

* Update train_ddp.yaml

fix inference with scripts/txt2img.py config file load bug.

* Update README.md

add pretrain model test code.
```
  68f55a70
- [doc] Fix typo s/infered/inferred/ (#5288) · c8003d46
  hugo-syn authored Mar 05, 2024
```
Signed-off-by: hugo-syn <hugo.vincent@synacktiv.com>
```
  c8003d46
- [hotfix] fix typo change MoECheckpintIO to MoECheckpointIO (#5335) · 5e1c93d7
  digger yu authored Mar 05, 2024
```
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
```
  5e1c93d7
- [eval-hotfix] set few_shot_data to None when few shot is disabled (#5422) · a7ae2b5b
  Dongruixuan Li authored Mar 05, 2024
  
  a7ae2b5b
- [hotfix] fix typo change enabel to enable under colossalai/shardformer/ (#5317) · 049121d1
  digger yu authored Mar 05, 2024
  
  049121d1
- [hotfix] fix typo change _descrption to _description (#5331) · 16c96d4d
  digger yu authored Mar 05, 2024
  
  16c96d4d
- [doc] update some translations with README-zh-Hans.md (#5382) · 70cce5cb
  digger yu authored Mar 05, 2024
  
  70cce5cb
- [hotfix] fix typo of openmoe model source (#5403) · e239cf90
  Luo Yihang authored Mar 05, 2024
  
  e239cf90
- [hotfix] fix sd vit import error (#5420) · e304e4db
  MickeyCHAN authored Mar 05, 2024
```
* fix import error

* Update dpt_depth.py

---------
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
```
  e304e4db
- [devops] fix extention building (#5427) · 070df689
  Hongxin Liu authored Mar 05, 2024
  
  070df689
- [doc] sora release (#5425) · 822241a9
  binmakeswell authored Mar 05, 2024
```
* [doc] sora release

* [doc] sora release

* [doc] sora release

* [doc] sora release
```
  822241a9
04 Mar, 2024 1 commit

[example]add gpt2 benchmark example script. (#5295) · 29695cf7

flybird11111 authored Mar 04, 2024



* benchmark gpt2

* fix

fix

fix

fix

* [doc] fix typo in Colossal-LLaMA-2/README.md (#5247)

* [workflow] fixed build CI (#5240)

* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish

* [ci] fixed booster test (#5251)

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed ddp test (#5254)

* [ci] fixed ddp test

* polish

* fix typo in  applications/ColossalEval/README.md (#5250)

* [ci] fix shardformer tests. (#5255)

* fix ci

fix

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [doc] fix doc typo (#5256)

* [doc] fix annotation display

* [doc] fix llama2 doc

* [hotfix]: add pp sanity check and fix mbs arg (#5268)

* fix: fix misleading mbs arg

* feat: add pp sanity check

* fix: fix 1f1b sanity check

* [workflow] fixed incomplete bash command (#5272)

* [workflow] fixed oom tests (#5275)

* [workflow] fixed oom tests

* polish

* polish

* polish

* [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)

* fix ci

fix

* fix test

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

* fix

---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [shardformer] hybridparallelplugin support gradients accumulation. (#5246)

* support gradients acc

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

* fix

fix

* fix

fix

fix

* [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230)

* fix auto loading gpt2 tokenizer (#5279)

* [doc] add llama2-13B disyplay (#5285)

* Update README.md

* fix 13b typo

---------
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* fix llama pretrain (#5287)

* fix

* fix

* fix

fix

* fix

fix

fix

* fix

fix

* benchmark gpt2

* fix

fix

fix

fix

* [workflow] fixed build CI (#5240)

* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish

* [ci] fixed booster test (#5251)

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test

* fix

fix

* fix

fix

fix

* fix

* fix

fix

fix

fix

fix

* fix

* Update shardformer.py

---------
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com>
Co-authored-by: Desperado-Jia <502205863@qq.com>

29695cf7

01 Mar, 2024 1 commit
- fix sft single turn inference example (#5416) · 4b8312c0
  Camille Zhong authored Mar 01, 2024
  
  4b8312c0