Commits · 73f4dc578e98c00a618260089cc3eb7f7210edfe · OpenDAS / ColossalAI

29 Jan, 2024 1 commit
- [workflow] updated CI image (#5318) · 73f4dc57
  Frank Lee authored Jan 29, 2024
  
25 Jan, 2024 1 commit

[feat] refactored extension module (#5298) · 7cfed5f0

Frank Lee authored Jan 25, 2024

* [feat] refactored extension module

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish


16 Jan, 2024 2 commits
- [workflow] fixed oom tests (#5275) · d69cd2eb
  Frank Lee authored Jan 16, 2024
```
* [workflow] fixed oom tests

* polish

* polish

* polish
```
- [workflow] fixed incomplete bash command (#5272) · 04244aaa
  Frank Lee authored Jan 16, 2024
  
11 Jan, 2024 1 commit

[ci] fixed booster test (#5251) · d5eeeb14

Frank Lee authored Jan 11, 2024

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test


10 Jan, 2024 1 commit
- [workflow] fixed build CI (#5240) · edf94a35
  Frank Lee authored Jan 10, 2024
```
* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish
```
09 Jan, 2024 1 commit

[npu] change device to accelerator api (#5239) · d202cc28

Hongxin Liu authored Jan 09, 2024



* update accelerator

* fix timer

* fix amp

* update

* fix

* update bug

* add error raise

* fix autocast

* fix set device

* remove doc accelerator

* update doc

* update doc

* update doc

* use nullcontext

* update cpu

* update null context

* change time limit for example

* update

* update

* update

* update

* [npu] polish accelerator code

---------
Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>


03 Jan, 2024 1 commit
- [devops] update torch version in ci (#5217) · 7f3400b5
  Hongxin Liu authored Jan 03, 2024
  
28 Nov, 2023 1 commit

[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088) · 7172459e

Wenhao Chen authored Nov 28, 2023



* [shardformer] implement policy for all GPT-J models and test

* [shardformer] support interleaved pipeline parallel for bert finetune

* [shardformer] shardformer support falcon (#4883)

* [shardformer]: fix interleaved pipeline for bert model (#5048)

* [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093)

* Add Mistral support for Shardformer (#5103)

* [shardformer] add tests to mistral (#5105)

---------
Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>


23 Nov, 2023 1 commit

[Feature] Add document retrieval QA (#5020) · e53e729d

YeAnbang authored Nov 23, 2023



* add langchain

* add langchain

* Add files via upload

* add langchain

* fix style

* fix style: remove extra space

* add pytest; modified retriever

* add pytest; modified retriever

* add tests to build_on_pr.yml

* fix build_on_pr.yml

* fix build on pr; fix environ vars

* separate unit tests for colossalqa from build from pr

* fix container setting; fix environ vars

* commented dev code

* add incremental update

* remove stale code

* fix style

* change to sha3 224

* fix retriever; fix style; add unit test for document loader

* fix ci workflow config

* fix ci workflow config

* add set cuda visible device script in ci

* fix doc string

* fix style; update readme; refactored

* add force log info

* change build on pr, ignore colossalqa

* fix docstring, capitalize all initial letters

* fix indexing; fix text-splitter

* remove debug code, update reference

* reset previous commit

* update LICENSE update README add key-value mode, fix bugs

* add files back

* revert force push

* remove junk file

* add test files

* fix retriever bug, add intent classification

* change conversation chain design

* rewrite prompt and conversation chain

* add ui v1

* ui v1

* fix avatar

* add header

* Refactor the RAG Code and support Pangu

* Refactor the ColossalQA chain to Object-Oriented Programming and the UI demo.

* resolved conversation. tested scripts under examples. web demo still buggy

* fix ci tests

* Some modifications to add ChatGPT api

* modify llm.py and remove unnecessary files

* Delete applications/ColossalQA/examples/ui/test_frontend_input.json

* Remove OpenAI api key

* add colossalqa

* move files

* move files

* move files

* move files

* fix style

* Add Readme and fix some bugs.

* Add something to readme and modify some code

* modify a directory name for clarity

* remove redundant directory

* Correct a type in llm.py

* fix AI prefix

* fix test_memory.py

* fix conversation

* fix some errors and typos

* Fix a missing import in RAG_ChatBot.py

* add colossalcloud LLM wrapper, correct issues in code review

---------
Co-authored-by: YeAnbang <anbangy2@outlook.com>
Co-authored-by: Orion-Zheng <zheng_zian@u.nus.edu>
Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com>
Co-authored-by: Orion-Zheng <zhengzian@u.nus.edu>


08 Nov, 2023 1 commit
- [misc] add code owners (#5024) · 67f53317
  Hongxin Liu authored Nov 08, 2023
  
01 Nov, 2023 1 commit
- [release] update version (#4995) · 8993c8a8
  Hongxin Liu authored Nov 01, 2023
```
* [release] update version

* [hotfix] fix ci
```
27 Sep, 2023 1 commit
- [doc] update slack link (#4823) · 822051d8
  binmakeswell authored Sep 27, 2023
  
26 Sep, 2023 1 commit

[lazy] support from_pretrained (#4801) · 4965c0da

Hongxin Liu authored Sep 26, 2023

* [lazy] patch from pretrained

* [lazy] fix from pretrained and add tests

* [devops] update ci


20 Sep, 2023 1 commit

[chat]: update rm, add wandb and fix bugs (#4471) · 7b9b8644

Wenhao Chen authored Sep 20, 2023



* feat: modify forward fn of critic and reward model

* feat: modify calc_action_log_probs

* to: add wandb in sft and rm trainer

* feat: update train_sft

* feat: update train_rm

* style: modify type annotation and add warning

* feat: pass tokenizer to ppo trainer

* to: modify trainer base and maker base

* feat: add wandb in ppo trainer

* feat: pass tokenizer to generate

* test: update generate fn tests

* test: update train tests

* fix: remove action_mask

* feat: remove unused code

* fix: fix wrong ignore_index

* fix: fix mock tokenizer

* chore: update requirements

* revert: modify make_experience

* fix: fix inference

* fix: add padding side

* style: modify _on_learn_batch_end

* test: use mock tokenizer

* fix: use bf16 to avoid overflow

* fix: fix workflow

* [chat] fix gemini strategy

* [chat] fix

* sync: update colossalai strategy

* fix: fix args and model dtype

* fix: fix checkpoint test

* fix: fix requirements

* fix: fix missing import and wrong arg

* fix: temporarily skip gemini test in stage 3

* style: apply pre-commit

* fix: temporarily skip gemini test in stage 1&2

---------
Co-authored-by: Mingyan Jiang <1829166702@qq.com>


19 Sep, 2023 1 commit

[misc] update pre-commit and run all files (#4752) · 079bf3cb

Hongxin Liu authored Sep 19, 2023

* [misc] update pre-commit

* [misc] run pre-commit

* [misc] remove useless configuration files

* [misc] ignore cuda for clang-format


18 Sep, 2023 1 commit

[legacy] clean up legacy code (#4743) · b5f9e37c

Hongxin Liu authored Sep 18, 2023

* [legacy] remove outdated codes of pipeline (#4692)

* [legacy] remove cli of benchmark and update optim (#4690)

* [legacy] remove cli of benchmark and update optim

* [doc] fix cli doc test

* [legacy] fix engine clip grad norm

* [legacy] remove outdated colo tensor (#4694)

* [legacy] remove outdated colo tensor

* [test] fix test import

* [legacy] move outdated zero to legacy (#4696)

* [legacy] clean up utils (#4700)

* [legacy] clean up utils

* [example] update examples

* [legacy] clean up amp

* [legacy] fix amp module

* [legacy] clean up gpc (#4742)

* [legacy] clean up context

* [legacy] clean core, constants and global vars

* [legacy] refactor initialize

* [example] fix examples ci

* [example] fix examples ci

* [legacy] fix tests

* [example] fix gpt example

* [example] fix examples ci

* [devops] fix ci installation

* [example] fix examples ci


11 Sep, 2023 1 commit
- [devops] fix concurrency group (#4667) · 536397cc
  Hongxin Liu authored Sep 11, 2023
  
08 Sep, 2023 1 commit

[devops] fix concurrency group and compatibility test (#4665) · a686f9dd

Hongxin Liu authored Sep 08, 2023

* [devops] fix concurrency group

* [devops] fix compatibility test

* [devops] fix tensornvme install

* [devops] fix tensornvme install

* [devops] fix colossalai install


01 Sep, 2023 1 commit
- [shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575) · 38ccb8b1
  Baizhou Zhang authored Sep 01, 2023
```
* hybrid plugin support huggingface from_pretrained

* add huggingface compatibility tests

* add folder cleaning

* fix bugs
```
30 Aug, 2023 2 commits
- [devops] cancel previous runs in the PR (#4546) · c7b60f75
  Hongxin Liu authored Aug 30, 2023
  
- [coati] update ci · 1c43bfd5
  ver217 authored Aug 30, 2023
  
16 Aug, 2023 1 commit

[devops] add large-scale distributed test marker (#4452) · 26e29d58

Hongxin Liu authored Aug 16, 2023

* [test] remove cpu marker

* [test] remove gpu marker

* [test] update pytest markers

* [ci] update unit test ci


02 Aug, 2023 1 commit

[chat] fix bugs and add unit tests (#4213) · da4f7b85

Wenhao Chen authored Aug 02, 2023

* style: rename replay buffer

Experience replay is typically for off-policy algorithms.
Using this name in PPO may be misleading.

* fix: fix wrong zero2 default arg

* test: update experience tests

* style: rename zero_pad fn

* fix: defer init in CycledDataLoader

* test: add benchmark test

* style: rename internal fn of generation

* style: rename internal fn of lora

* fix: remove unused loss fn

* fix: remove unused utils fn

* refactor: remove generate_with_actor fn

* fix: fix type annotation

* test: add models tests

* fix: skip llama due to long execution time

* style: modify dataset

* style: apply formatter

* perf: update reward dataset

* fix: fix wrong IGNORE_INDEX in sft dataset

* fix: remove DataCollatorForSupervisedDataset

* test: add dataset tests

* style: apply formatter

* style: rename test_ci to test_train

* feat: add llama in inference

* test: add inference tests

* test: change test scripts directory

* fix: update ci

* fix: fix typo

* fix: skip llama due to oom

* fix: fix file mod

* style: apply formatter

* refactor: remove duplicated llama_gptq

* style: apply formatter

* to: update rm test

* feat: add tokenizer arg

* feat: add download model script

* test: update train tests

* fix: modify gemini load and save pretrained

* test: update checkpoint io test

* to: modify nproc_per_node

* fix: do not remove existing dir

* fix: modify save path

* test: add random choice

* fix: fix sft path

* fix: enlarge nproc_per_node to avoid oom

* fix: add num_retry

* fix: make lora config of rm and critic consistent

* fix: add warning about lora weights

* fix: skip some gpt2 tests

* fix: remove grad ckpt in rm and critic due to errors

* refactor: directly use Actor in train_sft

* test: add more arguments

* fix: disable grad ckpt when using lora

* fix: fix save_pretrained and related tests

* test: enable zero2 tests

* revert: remove useless fn

* style: polish code

* test: modify test args


01 Aug, 2023 1 commit

[release] update version (#4332) · 80647712

Hongxin Liu authored Aug 01, 2023

* [release] update version

* [devops] hotfix cuda extension building

* [devops] pytest ignore useless folders


21 Jul, 2023 1 commit
- [ci] support testmon core pkg change detection (#4305) · 02192a63
  Hongxin Liu authored Jul 21, 2023
  
04 Jul, 2023 2 commits

[workflow] show test duration (#4159) · cc3cbe9f
Frank Lee authored Jul 04, 2023


[chat] use official transformers and fix some issues (#4117) · 3d8d5d0d

Wenhao Chen authored Jul 04, 2023

* feat: remove on_learn_epoch fn as not used

* revert: add _on_learn_epoch fn

* feat: remove NaiveStrategy

* test: update train_prompts tests

* fix: remove prepare_llama_tokenizer_and_embedding

* test: add lora arg

* feat: remove roberta support in train_prompts due to runtime errs

* feat: remove deberta & roberta in rm as not used

* test: remove deberta and roberta tests

* feat: remove deberta and roberta models as not used

* fix: remove calls to roberta

* fix: remove prepare_llama_tokenizer_and_embedding

* chore: update transformers version

* docs: update transformers version

* fix: fix actor inference

* fix: fix ci

* feat: change llama pad token to unk

* revert: revert ddp setup_distributed

* fix: change llama pad token to unk

* revert: undo unnecessary changes

* fix: use pip to install transformers


28 Jun, 2023 1 commit
- [workflow] added status check for test coverage workflow (#4106) · 1ee947f6
  Frank Lee authored Jun 28, 2023
  
22 Jun, 2023 1 commit
- [workflow] cover all public repositories in weekly report (#4069) · b463651f
  Frank Lee authored Jun 22, 2023
  
19 Jun, 2023 1 commit
- [devops] fix build on pr ci (#4043) · 4a81faa5
  Hongxin Liu authored Jun 19, 2023
```
* [devops] fix build on pr ci

* [devops] fix build on pr ci
```
13 Jun, 2023 1 commit
- [workflow] fixed the directory check in build (#3980) · 8bcad736
  Frank Lee authored Jun 13, 2023
  
12 Jun, 2023 2 commits
- [workflow] cancel duplicated workflow jobs (#3960) · 6718a2f2
  Frank Lee authored Jun 12, 2023
  
- [workflow] cancel duplicated workflow jobs (#3960) · 4110d1f0
  Frank Lee authored Jun 12, 2023
  
09 Jun, 2023 1 commit
- fix typo .github/workflows/scripts/ (#3946) · 1aadeede
  digger yu authored Jun 09, 2023
  
07 Jun, 2023 3 commits

[workflow] added docker latest tag for release (#3920) · 5e2132dc
Frank Lee authored Jun 07, 2023

[devops] hotfix testmon cache clean logic (#3917) · c25d421f
Hongxin Liu authored Jun 07, 2023


[chat] add distributed PPO trainer (#3740) · b5f05663

Hongxin Liu authored Jun 07, 2023



* Detached ppo (#9)

* run the base

* working on dist ppo

* sync

* detached trainer

* update detached trainer. no maker update function

* facing init problem

* 1 maker 1 trainer detached run. but no model update

* facing cuda problem

* fix save functions

* verified maker update

* nothing

* add ignore

* analyze loss issue

* remove some debug codes

* facing 2m1t stuck issue

* 2m1t verified

* do not use torchrun

* working on 2m2t

* working on 2m2t

* initialize strategy in ray actor env

* facing actor's init order issue

* facing ddp model update issue (need to unwrap ddp)

* unwrap ddp actor

* checking 1m2t stuck problem

* nothing

* set timeout for trainer choosing. It solves the stuck problem!

* delete some debug output

* rename to sync with upstream

* rename to sync with upstream

* coati rename

* nothing

* I am going to detach the replay buffer from trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations

* experience_maker_holder performs target-revolving _send_experience() instead of length comparison.

* move code to ray subfolder

* working on pipeline inference

* apply comments

* working on pipeline strategy. in progress.

* remove pipeline code. clean this branch

* update remote parameters by state_dict. no test

* nothing

* state_dict sharding transfer

* merge debug branch

* gemini _unwrap_model fix

* simplify code

* simplify code & fix LoRALinear AttributeError

* critic unwrapped state_dict

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] add performance evaluator and fix bugs (#10)

* [chat] add performance evaluator for ray

* [chat] refactor debug arg

* [chat] support hf config

* [chat] fix generation

* [chat] add 1mmt dummy example

* [chat] fix gemini ckpt

* split experience to send (#11)
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] refactor trainer and maker (#12)

* [chat] refactor experience maker holder

* [chat] refactor model init

* [chat] refactor trainer args

* [chat] refactor model init

* [chat] refactor trainer

* [chat] refactor experience sending logic and training loop args (#13)

* [chat] refactor experience send logic

* [chat] refactor trainer

* [chat] refactor trainer

* [chat] refactor experience maker

* [chat] refactor pbar

* [chat] refactor example folder (#14)

* [chat] support quant (#15)

* [chat] add quant

* [chat] add quant example

* prompt example (#16)

* prompt example

* prompt load csv data

* remove legacy try

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] add mmmt dummy example and refactor experience sending (#17)

* [chat] add mmmt dummy example

* [chat] refactor naive strategy

* [chat] fix stuck problem

* [chat] fix naive strategy

* [chat] optimize experience maker sending logic

* [chat] refactor sending assignment

* [chat] refactor performance evaluator (#18)

* Prompt Example & requires_grad state_dict & sharding state_dict (#19)

* prompt example

* prompt load csv data

* remove legacy try

* maker models require_grad set to False

* working on zero redundancy update

* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.

* remove legacy examples

* remove legacy examples

* remove replay buffer tp state. bad design

---------
Co-authored-by: csric <richcsr256@gmail.com>

* state_dict sending adapts to new unwrap function (#20)

* prompt example

* prompt load csv data

* remove legacy try

* maker models require_grad set to False

* working on zero redundancy update

* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.

* remove legacy examples

* remove legacy examples

* remove replay buffer tp state. bad design

* opt benchmark

* better script

* nothing

* [chat] strategy refactor unwrap model

* [chat] strategy refactor save model

* [chat] add docstr

* [chat] refactor trainer save model

* [chat] fix strategy typing

* [chat] refactor trainer save model

* [chat] update readme

* [chat] fix unit test

* working on lora reconstruction

* state_dict sending adapts to new unwrap function

* remove comments

---------
Co-authored-by: csric <richcsr256@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* [chat-ray] add readme (#21)

* add readme

* transparent graph

* add note background

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] get images from url (#22)

* Refactor/chat ray (#23)

* [chat] lora add todo

* [chat] remove unused pipeline strategy

* [chat] refactor example structure

* [chat] setup ci for ray

* [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24)

* lora support prototype

* lora support

* 1mmt lora & remove useless code

---------
Co-authored-by: csric <richcsr256@gmail.com>

* [chat] fix test ci for ray

* [chat] fix test ci requirements for ray

* [chat] fix ray runtime env

* [chat] fix ray runtime env

* [chat] fix example ci docker args

* [chat] add debug info in trainer

* [chat] add nccl debug info

* [chat] skip ray test

* [doc] fix typo

---------
Co-authored-by: csric <59389055+CsRic@users.noreply.github.com>
Co-authored-by: csric <richcsr256@gmail.com>


06 Jun, 2023 2 commits

[devops] hotfix CI about testmon cache (#3910) · 41fb7236
Hongxin Liu authored Jun 06, 2023
```
* [devops] hotfix CI about testmon cache

* [devops] fix testmon cache on pr
```

[devops] improving testmon cache (#3902) · ec9bbc00

Hongxin Liu authored Jun 06, 2023

* [devops] improving testmon cache

* [devops] fix branch name with slash

* [devops] fix branch name with slash

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] update readme
