Commits · efba0f44b99f595731598d29792f5f229c7c0696 · OpenDAS / ColossalAI

05 Sep, 2023 10 commits
- Merge pull request #4612 from hpcaitech/feature/shardformer · efba0f44
  Hongxin Liu authored Sep 05, 2023
```
[shardformer] update hybrid parallel plugin and fix bugs
```
  efba0f44
- Merge branch 'main' into feature/shardformer · fae6c92e
  Hongxin Liu authored Sep 05, 2023
  
  fae6c92e
- [legacy] move builder and registry to legacy (#4603) · ac178ca5
  Hongxin Liu authored Sep 04, 2023
  
  ac178ca5
- [legacy] move engine to legacy (#4560) · 8accecd5
  Hongxin Liu authored Sep 04, 2023
```
* [legacy] move engine to legacy

* [example] fix seq parallel example

* [example] fix seq parallel example

* [test] test gemini plugin hang

* [test] test gemini plugin hang

* [test] test gemini plugin hang

* [test] test gemini plugin hang

* [test] test gemini plugin hang

* [example] update seq parallel requirements
```
  8accecd5
- [legacy] move trainer to legacy (#4545) · 89fe0277
  Hongxin Liu authored Aug 31, 2023
```
* [legacy] move trainer to legacy

* [doc] update docs related to trainer

* [test] ignore legacy test
```
  89fe0277
- [test] fix gemini checkpoint and gpt test (#4620) · bd186784
  Hongxin Liu authored Sep 05, 2023
  
  bd186784
- [zero] hotfix master param sync (#4618) · 807e01a4
  Hongxin Liu authored Sep 05, 2023
```
* [zero] add method to update master params

* [zero] update zero plugin

* [plugin] update low level zero plugin
```
  807e01a4
- [test] ignore gpt2 shardformer test (#4619) · e71d2452
  Hongxin Liu authored Sep 05, 2023
  
  e71d2452
- [shardformer] update shardformer readme (#4617) · ec086680
  flybird11111 authored Sep 05, 2023
```
[shardformer] update shardformer readme

[shardformer] update shardformer readme
```
  ec086680
- [shardformer] Add overlap optional for HybridParallelPlugin (#4615) · 86d22581
  Bin Jia authored Sep 05, 2023
```
* add optional overlap for plugin

* remove fixed todo
```
  86d22581
04 Sep, 2023 8 commits
- Merge branch 'main' into feature/shardformer · a39a5c66
  Hongxin Liu authored Sep 04, 2023
  
  a39a5c66
- [checkpointio] support huggingface from_pretrained for all plugins (#4606) · e79b1e80
  Baizhou Zhang authored Sep 04, 2023
  
  e79b1e80

- [shardformer] update bert finetune example with HybridParallelPlugin (#4584) · 0a94fcd3
  flybird11111 authored Sep 04, 2023

* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] fix epoch change

* [shardformer] broadcast add pp group

* [shardformer] fix opt test hanging

* fix

* test

* test

* [shardformer] zero1+pp and the corresponding tests (#4517)

* pause

* finish pp+zero1

* Update test_shard_vit.py

* [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516)

* fix overlap bug and support bert, add overlap as an option in shardconfig

* support overlap for chatglm and bloom

* [shardformer] fix emerged bugs after updating transformers (#4526)

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] Add overlap support for gpt2 (#4535)

* add overlap support for gpt2

* remove unused code

* remove unused code

* [shardformer] support pp+tp+zero1 tests (#4531)

* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] fix submodule replacement bug when enabling pp (#4544)

* [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540)

* implement sharded optimizer saving

* add more param info

* finish implementation of sharded optimizer saving

* fix bugs in optimizer sharded saving

* add pp+zero test

* param group loading

* greedy loading of optimizer

* fix bug when loading

* implement optimizer sharded saving

* add optimizer test & arrange checkpointIO utils

* fix gemini sharding state_dict

* add verbose option

* add loading of master params

* fix typehint

* fix master/working mapping in fp16 amp

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] fix epoch change

* [shardformer] broadcast add pp group

* rebase feature/shardformer

* update pipeline

* [shardformer] fix

* [shardformer] fix

* [shardformer] bert finetune fix

* [shardformer] add all_reduce operation to loss

add all_reduce operation to loss

* [shardformer] make compatible with pytree.

make compatible with pytree.

* [shardformer] disable tp

disable tp

* [shardformer] add 3d plugin to ci test

* [shardformer] update num_microbatches to None

* [shardformer] update microbatchsize

* [shardformer] update assert

* update scheduler

* update scheduler

---------
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>

0a94fcd3

- [shardformer] Pytree fix (#4533) · 24c07687
  Jianghai authored Sep 04, 2023

* pytree test

* test bert

* test bert

* test bert

* revise

* add register

* add register

24c07687

- Merge pull request #4542 from hpcaitech/chatglm · aaeb520c
  yingliu-hpc authored Sep 04, 2023
```
[coati] Add chatglm in coati
```
  aaeb520c
- [doc] add llama2 benchmark (#4604) · 8d7b0229
  binmakeswell authored Sep 04, 2023
```
* [doc] add llama2 benchmark

* [doc] add llama2 benchmark
```
  8d7b0229
- [DOC] hotfix/llama2news (#4595) · 7a978eb3
  binmakeswell authored Sep 04, 2023
```
* [doc] add llama2 news

* [doc] add llama2 news

* [doc] add llama2 news
```
  7a978eb3
- [checkpointio] optimize zero optim checkpoint io (#4591) · 63ecafb1
  Hongxin Liu authored Sep 04, 2023
```
* [zero] update checkpoint io to save memory

* [checkpointio] add device map to save memory
```
  63ecafb1

01 Sep, 2023 5 commits
- [pipeline] 1f1b schedule receive microbatch size (#4589) · 508ca36f
  Hongxin Liu authored Sep 01, 2023
  
  508ca36f
- [Fix] Fix compile error (#4357) · cfa60708
  Mashiro authored Sep 01, 2023
  
  cfa60708
- Update Dockerfile (#4499) · eb952ea8
  栾鹏 authored Sep 01, 2023
```
fix dockerfile build
```
  eb952ea8
- [zero]fix zero ckptIO with offload (#4529) · cbac7822
  LuGY authored Sep 01, 2023
```
* fix zero ckptio with offload

* fix load device

* saved tensors in ckpt should be on CPU

* fix unit test

* fix unit test

* add clear cache

* save memory for CI
```
  cbac7822
- [shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575) · 38ccb8b1
  Baizhou Zhang authored Sep 01, 2023
```
* hybrid plugin support huggingface from_pretrained

* add huggingface compatibility tests

* add folder cleaning

* fix bugs
```
  38ccb8b1
31 Aug, 2023 2 commits

- [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540) · c9625dbb
  Baizhou Zhang authored Aug 31, 2023

* implement sharded optimizer saving

* add more param info

* finish implementation of sharded optimizer saving

* fix bugs in optimizer sharded saving

* add pp+zero test

* param group loading

* greedy loading of optimizer

* fix bug when loading

* implement optimizer sharded saving

* add optimizer test & arrange checkpointIO utils

* fix gemini sharding state_dict

* add verbose option

* add loading of master params

* fix typehint

* fix master/working mapping in fp16 amp

c9625dbb

- [shardformer] fix submodule replacement bug when enabling pp (#4544) · 2c787d7f
  Baizhou Zhang authored Aug 31, 2023
  
  2c787d7f

30 Aug, 2023 10 commits
- [devops] cancel previous runs in the PR (#4546) · c7b60f75
  Hongxin Liu authored Aug 30, 2023
  
  c7b60f75
- [example] change accelerate version (#4431) · f1ae8c91
  Tian Siyuan authored Aug 30, 2023
```
Co-authored-by: Siyuan Tian <siyuant@vmware.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
```
  f1ae8c91
- [example] update streamlit 0.73.1 to 1.11.1 (#4386) · 8e2e1992
  ChengDaqi2023 authored Aug 30, 2023
  
  8e2e1992
- [shardformer] support pp+tp+zero1 tests (#4531) · ec18fc73
  flybird11111 authored Aug 30, 2023
```
* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1
```
  ec18fc73
- fix runtime prepare pass (#4502) · 12c95a9f
  Lufang Chen authored Aug 30, 2023
```
Co-authored-by: lufang.chen <lufang.chen@nio.com>
```
  12c95a9f
- keep requirements same with main branch · 9f852f24
  Ying Liu authored Aug 30, 2023
  
  9f852f24
- [shardformer] fix opt test hanging (#4521) · d367b887
  flybird11111 authored Aug 30, 2023
```
* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix
```
  d367b887
- fix colossalai version in coati examples · c648dc09
  Ying Liu authored Aug 30, 2023
  
  c648dc09
- Merge pull request #4541 from ver217/coati/chatglm · 661a1ef7
  yingliu-hpc authored Aug 30, 2023
```
[coati] update ci
```
  661a1ef7
- [coati] update ci · 1c43bfd5
  ver217 authored Aug 30, 2023
  
  1c43bfd5
29 Aug, 2023 3 commits

- [shardformer] Add overlap support for gpt2 (#4535) · e241b74f
  Bin Jia authored Aug 29, 2023
```
* add overlap support for gpt2

* remove unused code

* remove unused code
```
  e241b74f

- [coati] add chatglm model (#4539) · 1467e3b4
  yingliu-hpc authored Aug 29, 2023

* update configuration of chatglm and add support in coati

* add unit test & update chatglm default config & fix bos index issue

* remove chatglm due to oom

* add dataset pkg in requirement-text

* fix parameter issue in test_models

* add ref in tokenize & rm unnecessary parts

* separate source & target tokenization in chatglm

* add unit test to chatglm

* fix test dataset issue

* update truncation of chatglm

* fix Colossalai version

* fix colossal ai version in test

1467e3b4

- [shardformer] fix emerged bugs after updating transformers (#4526) · 0387a47e
  Baizhou Zhang authored Aug 29, 2023
  
  0387a47e

28 Aug, 2023 2 commits

- [example] add llama2 example (#4527) · 0b00def8
  Hongxin Liu authored Aug 28, 2023

* [example] transfer llama-1 example

* [example] fit llama-2

* [example] refactor scripts folder

* [example] fit new gemini plugin

* [cli] fix multinode runner

* [example] fit gemini optim checkpoint

* [example] refactor scripts

* [example] update requirements

* [example] update requirements

* [example] rename llama to llama2

* [example] update readme and pretrain script

* [example] refactor scripts

0b00def8

- [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516) · c554b7f5
  Bin Jia authored Aug 28, 2023
```
* fix overlap bug and support bert, add overlap as an option in shardconfig

* support overlap for chatglm and bloom
```
  c554b7f5