Commits · 3c6b831c264d0657a97034b5cf036c913a762083 · OpenDAS / ColossalAI

18 Sep, 2023 1 commit

[legacy] clean up legacy code (#4743) · b5f9e37c

Hongxin Liu authored Sep 18, 2023

* [legacy] remove outdated codes of pipeline (#4692)

* [legacy] remove cli of benchmark and update optim (#4690)

* [legacy] remove cli of benchmark and update optim

* [doc] fix cli doc test

* [legacy] fix engine clip grad norm

* [legacy] remove outdated colo tensor (#4694)

* [legacy] remove outdated colo tensor

* [test] fix test import

* [legacy] move outdated zero to legacy (#4696)

* [legacy] clean up utils (#4700)

* [legacy] clean up utils

* [example] update examples

* [legacy] clean up amp

* [legacy] fix amp module

* [legacy] clean up gpc (#4742)

* [legacy] clean up context

* [legacy] clean core, constants and global vars

* [legacy] refactor initialize

* [example] fix examples ci

* [example] fix examples ci

* [legacy] fix tests

* [example] fix gpt example

* [example] fix examples ci

* [devops] fix ci installation

* [example] fix examples ci

b5f9e37c

15 Sep, 2023 1 commit
- [legacy] remove deterministic data loader test · cd4e61d1
  Pengtai Xu authored Sep 15, 2023
  
  cd4e61d1
12 Sep, 2023 1 commit
- fix some typo with colossalai/device colossalai/tensor/ etc. (#4171) · 9c2feb2f
  digger yu authored Sep 12, 2023
```
Co-authored-by: flybird11111 <1829166702@qq.com>
```
  9c2feb2f
11 Sep, 2023 3 commits

[Feature] The first PR to Add TP inference engine, kv-cache manager and... · bce0f167

Cuiqing Li authored Sep 12, 2023


[Feature] The first PR to Add TP inference engine, kv-cache manager and related kernels for our inference system (#4577)

* [infer] Infer/llama demo (#4503)

* add

* add infer example

* finish

* finish

* stash

* fix

* [Kernels]  add inference token attention kernel (#4505)

* add token forward

* fix tests

* fix comments

* add try import triton

* add adapted license

* add tests check

* [Kernels] add necessary kernels (llama & bloom) for attention forward and kv-cache manager  (#4485)

* added _vllm_rms_norm

* change place

* added tests

* added tests

* modify

* adding kernels

* added tests:

* adding kernels

* modify

* added

* updating kernels

* adding tests

* added tests

* kernel change

* submit

* modify

* added

* edit comments

* change name

* change commnets and fix import

* add

* added

* combine codes (#4509)

* [feature] add KV cache manager for llama & bloom inference (#4495)

* add kv cache memory manager

* add stateinfo during inference

* format

* format

* rename file

* add kv cache test

* revise on BatchInferState

* file dir change

* [Bug FIx] import llama context ops fix (#4524)

* added _vllm_rms_norm

* change place

* added tests

* added tests

* modify

* adding kernels

* added tests:

* adding kernels

* modify

* added

* updating kernels

* adding tests

* added tests

* kernel change

* submit

* modify

* added

* edit comments

* change name

* change commnets and fix import

* add

* added

* fix

* add ops into init.py

* add

* [Infer] Add TPInferEngine and fix file path (#4532)

* add engine for TP inference

* move file path

* update path

* fix TPInferEngine

* remove unused file

* add engine test demo

* revise TPInferEngine

* fix TPInferEngine, add test

* fix

* Add Inference test for llama (#4508)

* add kv cache memory manager

* add stateinfo during inference

* add

* add infer example

* finish

* finish

* format

* format

* rename file

* add kv cache test

* revise on BatchInferState

* add inference test for llama

* fix conflict

* feature: add some new features for llama engine

* adapt colossalai triton interface

* Change the parent class of llama  policy

* add nvtx

* move llama inference code to tensor_parallel

* fix __init__.py

* rm tensor_parallel

* fix: fix bugs in auto_policy.py

* fix:rm some unused codes

* mv colossalai/tpinference to colossalai/inference/tensor_parallel

* change __init__.py

* save change

* fix engine

* Bug fix: Fix hang

* remove llama_infer_engine.py

---------
Co-authored-by: yuanheng-zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [infer] Add Bloom inference policy and replaced methods (#4512)

* add bloom inference methods and policy

* enable pass BatchInferState from model forward

* revise bloom infer layers/policies

* add engine for inference (draft)

* add test for bloom infer

* fix bloom infer policy and flow

* revise bloom test

* fix bloom file path

* remove unused codes

* fix bloom modeling

* fix dir typo

* fix trivial

* fix policy

* clean pr

* trivial fix

* Revert "[infer] Add Bloom inference policy and replaced methods (#4512)" (#4552)

This reverts commit 17cfa5714083a81a505c097f1c411cd28162d922.

* [Doc] Add colossal inference doc (#4549)

* create readme

* add readme.md

* fix typos

* [infer] Add Bloom inference policy and replaced methods (#4553)

* add bloom inference methods and policy

* enable pass BatchInferState from model forward

* revise bloom infer layers/policies

* add engine for inference (draft)

* add test for bloom infer

* fix bloom infer policy and flow

* revise bloom test

* fix bloom file path

* remove unused codes

* fix bloom modeling

* fix dir typo

* fix trivial

* fix policy

* clean pr

* trivial fix

* trivial

* Fix Bugs In Llama Model Forward (#4550)

* add kv cache memory manager

* add stateinfo during inference

* add

* add infer example

* finish

* finish

* format

* format

* rename file

* add kv cache test

* revise on BatchInferState

* add inference test for llama

* fix conflict

* feature: add some new features for llama engine

* adapt colossalai triton interface

* Change the parent class of llama  policy

* add nvtx

* move llama inference code to tensor_parallel

* fix __init__.py

* rm tensor_parallel

* fix: fix bugs in auto_policy.py

* fix:rm some unused codes

* mv colossalai/tpinference to colossalai/inference/tensor_parallel

* change __init__.py

* save change

* fix engine

* Bug fix: Fix hang

* remove llama_infer_engine.py

* bug fix: fix bugs about infer_state.is_context_stage

* remove pollcies

* fix: delete unused code

* fix: delete unused code

* remove unused coda

* fix conflict

---------
Co-authored-by: yuanheng-zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [doc] add colossal inference fig (#4554)

* create readme

* add readme.md

* fix typos

* upload fig

* [NFC] fix docstring for colossal inference (#4555)

Fix docstring and comments in kv cache manager and bloom modeling

* fix docstring in llama modeling (#4557)

* [Infer] check import vllm (#4559)

* change import vllm

* import apply_rotary_pos_emb

* change import location

* [DOC] add installation req (#4561)

* add installation req

* fix

* slight change

* remove empty

* [Feature] rms-norm transfer into inference llama.py  (#4563)

* add installation req

* fix

* slight change

* remove empty

* add rmsnorm polciy

* add

* clean codes

* [infer] Fix tp inference engine (#4564)

* fix engine prepare data

* add engine test

* use bloom for testing

* revise on test

* revise on test

* reset shardformer llama (#4569)

* [infer] Fix engine - tensors on different devices (#4570)


* fix diff device in engine

* [codefactor] Feature/colossal inference (#4579)

* code factors

* remove

* change coding (#4581)

* [doc] complete README of colossal inference (#4585)

* complete fig

* Update README.md

* [doc]update readme (#4586)

* update readme

* Update README.md

* bug fix: fix bus in llama and bloom (#4588)

* [BUG FIX]Fix test engine in CI and non-vllm kernels llama forward  (#4592)

* fix tests

* clean

* clean

* fix bugs

* add

* fix llama non-vllm kernels bug

* modify

* clean codes

* [Kernel]Rmsnorm fix (#4598)

* fix tests

* clean

* clean

* fix bugs

* add

* fix llama non-vllm kernels bug

* modify

* clean codes

* add triton rmsnorm

* delete vllm kernel flag

* [Bug Fix]Fix bugs in llama (#4601)

* fix tests

* clean

* clean

* fix bugs

* add

* fix llama non-vllm kernels bug

* modify

* clean codes

* bug fix: remove rotary_positions_ids

---------
Co-authored-by: cuiqing.li <lixx3527@gmail.com>

* [kernel] Add triton layer norm & replace norm for bloom (#4609)

* add layernorm for inference

* add test for layernorm kernel

* add bloom layernorm replacement policy

* trivial: path

* [Infer] Bug fix rotary embedding in llama (#4608)

* fix rotary embedding

* delete print

* fix init seq len bug

* rename pytest

* add benchmark for llama

* refactor codes

* delete useless code

* [bench] Add bloom inference benchmark (#4621)

* add bloom benchmark

* readme - update benchmark res

* trivial - uncomment for testing (#4622)

* [Infer] add check triton and cuda version for tests (#4627)

* fix rotary embedding

* delete print

* fix init seq len bug

* rename pytest

* add benchmark for llama

* refactor codes

* delete useless code

* add check triton and cuda

* Update sharder.py (#4629)

* [Inference] Hot fix some bugs and typos (#4632)

* fix

* fix test

* fix conflicts

* [typo]Comments fix (#4633)

* fallback

* fix commnets

* bug fix: fix some bugs in test_llama and test_bloom (#4635)

* [Infer] delete benchmark in tests and fix bug for llama and bloom (#4636)

* fix rotary embedding

* delete print

* fix init seq len bug

* rename pytest

* add benchmark for llama

* refactor codes

* delete useless code

* add check triton and cuda

* delete benchmark and fix infer bugs

* delete benchmark for tests

* delete useless code

* delete bechmark function in utils

* [Fix] Revise TPInferEngine, inference tests and benchmarks (#4642)

* [Fix] revise TPInferEngine methods and inference tests

* fix llama/bloom infer benchmarks

* fix infer tests

* trivial fix: benchmakrs

* trivial

* trivial: rm print

* modify utils filename for infer ops test (#4657)

* [Infer] Fix TPInferEngine init & inference tests, benchmarks (#4670)

* fix engine funcs

* TPInferEngine: receive shard config in init

* benchmarks: revise TPInferEngine init

* benchmarks: remove pytest decorator

* trivial fix

* use small model for tests

* [NFC] use args for infer benchmarks (#4674)

* revise infer default (#4683)

* [Fix] optimize/shard model in TPInferEngine init (#4684)

* remove using orig model in engine

* revise inference tests

* trivial: rename

---------
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Xu Kai <xukai16@foxmail.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: yuanheng-zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

bce0f167

[shardformer]fix gpt2 double head (#4663) · eedaa3e1

flybird11111 authored Sep 11, 2023

* [shardformer]fix gpt2 test

[shardformer]fix gpt2 test

[shardformer]fix gpt2 test

* fix

* [shardformer] add todo

* [shardformer] add todo

eedaa3e1

[legacy] move communication and nn to legacy and refactor logger (#4671) · 554aa959

Hongxin Liu authored Sep 11, 2023

* [legacy] move communication to legacy (#4640)

* [legacy] refactor logger and clean up legacy codes (#4654)

* [legacy] make logger independent to gpc

* [legacy] make optim independent to registry

* [legacy] move test engine to legacy

* [legacy] move nn to legacy (#4656)

* [legacy] move nn to legacy

* [checkpointio] fix save hf config

* [test] remove useledd rpc pp test

* [legacy] fix nn init

* [example] skip tutorial hybriad parallel example

* [devops] test doc check

* [devops] test doc check

554aa959

09 Sep, 2023 1 commit

[shardformer] update llama2/opt finetune example and fix llama2 policy (#4645) · 7486ed7d

flybird11111 authored Sep 09, 2023

* [shardformer] update shardformer readme

[shardformer] update shardformer readme

[shardformer] update shardformer readme

* [shardformer] update llama2/opt finetune example and shardformer update to llama2

* [shardformer] update llama2/opt finetune example and shardformer update to llama2

* [shardformer] update llama2/opt finetune example and shardformer update to llama2

* [shardformer] change dataset

* [shardformer] change dataset

* [shardformer] fix CI

* [shardformer] fix

* [shardformer] fix

* [shardformer] fix

* [shardformer] fix

* [shardformer] fix

[example] update opt example

[example] resolve comments

fix

fix

7486ed7d

07 Sep, 2023 1 commit

[pipeline] set optimizer to optional in execute_pipeline (#4630) · 660eed91

Baizhou Zhang authored Sep 07, 2023

* set optimizer to optional in execute_pipeline

* arrange device and mixed precision in booster init

* fix execute_pipeline in booster.py

660eed91

05 Sep, 2023 5 commits

[legacy] move engine to legacy (#4560) · 8accecd5

Hongxin Liu authored Sep 04, 2023

* [legacy] move engine to legacy

* [example] fix seq parallel example

* [example] fix seq parallel example

* [test] test gemini pluging hang

* [test] test gemini pluging hang

* [test] test gemini pluging hang

* [test] test gemini pluging hang

* [test] test gemini pluging hang

* [example] update seq parallel requirements

8accecd5

[legacy] move trainer to legacy (#4545) · 89fe0277

Hongxin Liu authored Aug 31, 2023

* [legacy] move trainer to legacy

* [doc] update docs related to trainer

* [test] ignore legacy test

89fe0277

[test] fix gemini checkpoint and gpt test (#4620) · bd186784
Hongxin Liu authored Sep 05, 2023

bd186784

[zero] hotfix master param sync (#4618) · 807e01a4

Hongxin Liu authored Sep 05, 2023

* [zero] add method to update master params

* [zero] update zero plugin

* [plugin] update low level zero plugin

807e01a4

[test] ignore gpt2 shardformer test (#4619) · e71d2452
Hongxin Liu authored Sep 05, 2023

e71d2452

04 Sep, 2023 2 commits
- [checkpointio] support huggingface from_pretrained for all plugins (#4606) · e79b1e80
  Baizhou Zhang authored Sep 04, 2023
  
  e79b1e80
- [shardformer] Pytree fix (#4533) · 24c07687
  Jianghai authored Sep 04, 2023
```
* pytree test

* test bert

* test bert

* test bert

* revise

* add register

* add register
```
  24c07687
01 Sep, 2023 3 commits
- [pipeline] 1f1b schedule receive microbatch size (#4589) · 508ca36f
  Hongxin Liu authored Sep 01, 2023
  
  508ca36f
- [zero]fix zero ckptIO with offload (#4529) · cbac7822
  LuGY authored Sep 01, 2023
```
* fix zero ckptio with offload

* fix load device

* saved tensors in ckpt should be on CPU

* fix unit test

* fix unit test

* add clear cache

* save memory for CI
```
  cbac7822
- [shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575) · 38ccb8b1
  Baizhou Zhang authored Sep 01, 2023
```
* hybrid plugin support huggingface from_pretrained

* add huggingface compatibility tests

* add folder cleaning

* fix bugs
```
  38ccb8b1
31 Aug, 2023 2 commits

[shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540) · c9625dbb

Baizhou Zhang authored Aug 31, 2023

* implement sharded optimizer saving

* add more param info

* finish implementation of sharded optimizer saving

* fix bugs in optimizer sharded saving

* add pp+zero test

* param group loading

* greedy loading of optimizer

* fix bug when loading

* implement optimizer sharded saving

* add optimizer test & arrange checkpointIO utils

* fix gemini sharding state_dict

* add verbose option

* add loading of master params

* fix typehint

* fix master/working mapping in fp16 amp

c9625dbb

[shardformer] fix submodule replacement bug when enabling pp (#4544) · 2c787d7f
Baizhou Zhang authored Aug 31, 2023

2c787d7f

30 Aug, 2023 2 commits

[shardformer] support pp+tp+zero1 tests (#4531) · ec18fc73

flybird11111 authored Aug 30, 2023

* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

ec18fc73

[shardformer] fix opt test hanging (#4521) · d367b887

flybird11111 authored Aug 30, 2023

* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

d367b887

29 Aug, 2023 2 commits
- [shardformer] Add overlap support for gpt2 (#4535) · e241b74f
  Bin Jia authored Aug 29, 2023
```
* add overlap support for gpt2

* remove unused code

* remove unused code
```
  e241b74f
- [shardformer] fix emerged bugs after updating transformers (#4526) · 0387a47e
  Baizhou Zhang authored Aug 29, 2023
  
  0387a47e
28 Aug, 2023 2 commits
- [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516) · c554b7f5
  Bin Jia authored Aug 28, 2023
```
* fix overlap bug and support bert, add overlap as an option in shardconfig

* support overlap for chatglm and bloom
```
  c554b7f5
- [shardformer] zero1+pp and the corresponding tests (#4517) · 376533a5
  Jianghai authored Aug 28, 2023
```
* pause

* finish pp+zero1

* Update test_shard_vit.py
```
  376533a5
25 Aug, 2023 3 commits

[shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506) · 44eab2b2

Baizhou Zhang authored Aug 25, 2023

* add APIs

* implement save_sharded_model

* add test for hybrid checkpointio

* implement naive loading for sharded model

* implement efficient sharded model loading

* open a new file for hybrid checkpoint_io

* small fix

* fix circular importing

* fix docstring

* arrange arguments and apis

* small fix

44eab2b2

[shardformer] opt fix. (#4514) · de8a65ba

flybird11111 authored Aug 25, 2023

* [shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

* fix

fix

fix

fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* activate checks

* [Test] test ci

* test ci

* test ci

* test ci

* test ci

* test ci

* test ci

* fix

de8a65ba

[zero]support zero2 with gradient accumulation (#4511) · 839847b7
LuGY authored Aug 25, 2023
```
* support gradient accumulation with zero2

* fix type
```
839847b7

24 Aug, 2023 2 commits

[shardformer] vit/llama/t5 ignore the sequence parallelism flag and some fix. (#4498) · 3353e55c

flybird11111 authored Aug 24, 2023

* [shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

* fix

fix

fix

fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* activate checks

3353e55c

[gemini] improve compatibility and add static placement policy (#4479) · 27061426

Hongxin Liu authored Aug 24, 2023

* [gemini] remove distributed-related part from colotensor (#4379)

* [gemini] remove process group dependency

* [gemini] remove tp part from colo tensor

* [gemini] patch inplace op

* [gemini] fix param op hook and update tests

* [test] remove useless tests

* [test] remove useless tests

* [misc] fix requirements

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [misc] update requirements

* [gemini] refactor gemini optimizer and gemini ddp (#4398)

* [gemini] update optimizer interface

* [gemini] renaming gemini optimizer

* [gemini] refactor gemini ddp class

* [example] update gemini related example

* [example] update gemini related example

* [plugin] fix gemini plugin args

* [test] update gemini ckpt tests

* [gemini] fix checkpoint io

* [example] fix opt example requirements

* [example] fix opt example

* [example] fix opt example

* [example] fix opt example

* [gemini] add static placement policy (#4443)

* [gemini] add static placement policy

* [gemini] fix param offload

* [test] update gemini tests

* [plugin] update gemini plugin

* [plugin] update gemini plugin docstr

* [misc] fix flash attn requirement

* [test] fix gemini checkpoint io test

* [example] update resnet example result (#4457)

* [example] update bert example result (#4458)

* [doc] update gemini doc (#4468)

* [example] update gemini related examples (#4473)

* [example] update gpt example

* [example] update dreambooth example

* [example] update vit

* [example] update opt

* [example] update palm

* [example] update vit and opt benchmark

* [hotfix] fix bert in model zoo (#4480)

* [hotfix] fix bert in model zoo

* [test] remove chatglm gemini test

* [test] remove sam gemini test

* [test] remove vit gemini test

* [hotfix] fix opt tutorial example (#4497)

* [hotfix] fix opt tutorial example

* [hotfix] fix opt tutorial example

27061426

23 Aug, 2023 1 commit
- [shardformer] tests for 3d parallel (#4493) · e04436a8
  Jianghai authored Aug 23, 2023
  
  e04436a8
22 Aug, 2023 2 commits

[shardformer] chatglm support sequence parallel (#4482) · 59e252ec

flybird11111 authored Aug 22, 2023

* [shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

* fix

fix

fix

fix

59e252ec

rename chatglm to chatglm2 (#4484) · 5545114f
Jianghai authored Aug 22, 2023

5545114f

21 Aug, 2023 1 commit
- [shardformer] support tp+zero for shardformer (#4472) · 1c7df566
  Baizhou Zhang authored Aug 21, 2023
```
* support tp+zero/input type cast for hybridplugin

* add tp+zero tests

* fix bucket arguments
```
  1c7df566
18 Aug, 2023 2 commits

[shardformer] Pipeline/whisper (#4456) · 8739aa7f

Jianghai authored Aug 18, 2023

* add some base tests and policies

* finish whisper base model

* add conditional generation

* finish basic tests

* whisper

* finish whisper

* finish whisper

* del useless  whisper test

* fix

* add argmin to replace

* finish revision

8739aa7f

[shardformer/sequence parallel] support gpt2 seq parallel with pp/dp/tp (#4460) · 7c8be770
Bin Jia authored Aug 18, 2023
```
* support gpt2 seq parallel with pp/dp/tp

* fix a bug when waiting for stream done

* delete unused gpt2_seq file
```
7c8be770

16 Aug, 2023 3 commits

[shardformer] support interleaved pipeline (#4448) · a78daf61

LuGY authored Aug 16, 2023

* support interleaved pipeline

* fix unit test

* remove virtual stage test in stage mgr

* add droped type hint and updated bwd

a78daf61

[devops] add large-scale distributed test marker (#4452) · 26e29d58

Hongxin Liu authored Aug 16, 2023

* [test] remove cpu marker

* [test] remove gpu marker

* [test] update pytest markers

* [ci] update unit test ci

26e29d58

[shardformer] support DDP in HybridPlugin/add tp+dp tests (#4446) · 6ef33f75
Baizhou Zhang authored Aug 16, 2023
```
* support DDP for HybridPlugin/add tp+dp tests

* add docstring for HybridParallelPlugin
```
6ef33f75