  1. 29 Aug, 2023 1 commit
    • [coati] add chatglm model (#4539) · 1467e3b4
      yingliu-hpc authored
      * update configuration of chatglm and add support in coati
      
      * add unit test & update chatglm default config & fix bos index issue
      
      * remove chatglm due to oom
      
      * add dataset pkg in requirement-text
      
      * fix parameter issue in test_models
      
      * add ref in tokenize & rm unnecessary parts
      
      * separate source & target tokenization in chatglm
      
      * add unit test to chatglm
      
      * fix test dataset issue
      
      * update truncation of chatglm
      
      * fix ColossalAI version
      
      * fix ColossalAI version in test
  2. 14 Aug, 2023 1 commit
    • [doc] update Coati README (#4405) · 6d41c3f2
      Wenhao Chen authored
      * style: apply formatter
      
      * fix: add outdated warnings
      
      * docs: add dataset format and polish
      
      * docs: polish README
      
      * fix: fix json format
      
      * fix: fix typos
      
      * revert: revert 7b example
  3. 02 Aug, 2023 1 commit
    • [chat] fix bugs and add unit tests (#4213) · da4f7b85
      Wenhao Chen authored
      * style: rename replay buffer
      
      Experience replay is typically used for off-policy algorithms.
      Using this name in PPO may be misleading.
      
      * fix: fix wrong zero2 default arg
      
      * test: update experience tests
      
      * style: rename zero_pad fn
      
      * fix: defer init in CycledDataLoader
      
      * test: add benchmark test
      
      * style: rename internal fn of generation
      
      * style: rename internal fn of lora
      
      * fix: remove unused loss fn
      
      * fix: remove unused utils fn
      
      * refactor: remove generate_with_actor fn
      
      * fix: fix type annotation
      
      * test: add models tests
      
      * fix: skip llama due to long execution time
      
      * style: modify dataset
      
      * style: apply formatter
      
      * perf: update reward dataset
      
      * fix: fix wrong IGNORE_INDEX in sft dataset
      
      * fix: remove DataCollatorForSupervisedDataset
      
      * test: add dataset tests
      
      * style: apply formatter
      
      * style: rename test_ci to test_train
      
      * feat: add llama in inference
      
      * test: add inference tests
      
      * test: change test scripts directory
      
      * fix: update ci
      
      * fix: fix typo
      
      * fix: skip llama due to oom
      
      * fix: fix file mod
      
      * style: apply formatter
      
      * refactor: remove duplicated llama_gptq
      
      * style: apply formatter
      
      * to: update rm test
      
      * feat: add tokenizer arg
      
      * feat: add download model script
      
      * test: update train tests
      
      * fix: modify gemini load and save pretrained
      
      * test: update checkpoint io test
      
      * to: modify nproc_per_node
      
      * fix: do not remove existing dir
      
      * fix: modify save path
      
      * test: add random choice
      
      * fix: fix sft path
      
      * fix: enlarge nproc_per_node to avoid oom
      
      * fix: add num_retry
      
      * fix: make lora config of rm and critic consistent
      
      * fix: add warning about lora weights
      
      * fix: skip some gpt2 tests
      
      * fix: remove grad ckpt in rm and critic due to errors
      
      * refactor: directly use Actor in train_sft
      
      * test: add more arguments
      
      * fix: disable grad ckpt when using lora
      
      * fix: fix save_pretrained and related tests
      
      * test: enable zero2 tests
      
      * revert: remove useless fn
      
      * style: polish code
      
      * test: modify test args
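The commit above lists "fix wrong IGNORE_INDEX in sft dataset" without saying what the value was. As background only, here is a minimal sketch of the usual supervised fine-tuning convention: prompt (source) positions in the labels are set to -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so that only response (target) tokens contribute to the loss. The helper name `build_sft_labels` and the token ids are hypothetical and are not taken from Coati.

```python
# -100 is the default ignore_index of torch.nn.CrossEntropyLoss.
IGNORE_INDEX = -100


def build_sft_labels(source_ids, target_ids, max_length):
    """Concatenate prompt and response ids and mask the prompt in the labels.

    Hypothetical helper for illustration; not the Coati implementation.
    """
    input_ids = (source_ids + target_ids)[:max_length]
    # Prompt positions are ignored by the loss; response positions keep their ids.
    labels = ([IGNORE_INDEX] * len(source_ids) + target_ids)[:max_length]
    return input_ids, labels


input_ids, labels = build_sft_labels([11, 12, 13], [21, 22], max_length=8)
print(input_ids)  # [11, 12, 13, 21, 22]
print(labels)     # [-100, -100, -100, 21, 22]
```

Truncating both lists to the same `max_length` keeps `input_ids` and `labels` aligned position by position.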
  4. 01 Aug, 2023 1 commit
  5. 28 Jul, 2023 1 commit
  6. 26 Jul, 2023 7 commits
  7. 04 Jul, 2023 3 commits
    • [chat] removed cache file (#4155) · f447ca18
      Frank Lee authored
    • [shardformer] shardformer support t5 model (#3994) · c1c672d0
      wukong1992 authored
      test t5
    • [chat] use official transformers and fix some issues (#4117) · 3d8d5d0d
      Wenhao Chen authored
      * feat: remove on_learn_epoch fn as not used
      
      * revert: add _on_learn_epoch fn
      
      * feat: remove NaiveStrategy
      
      * test: update train_prompts tests
      
      * fix: remove prepare_llama_tokenizer_and_embedding
      
      * test: add lora arg
      
      * feat: remove roberta support in train_prompts due to runtime errs
      
      * feat: remove deberta & roberta in rm as not used
      
      * test: remove deberta and roberta tests
      
      * feat: remove deberta and roberta models as not used
      
      * fix: remove calls to roberta
      
      * fix: remove prepare_llama_tokenizer_and_embedding
      
      * chore: update transformers version
      
      * docs: update transformers version
      
      * fix: fix actor inference
      
      * fix: fix ci
      
      * feat: change llama pad token to unk
      
      * revert: revert ddp setup_distributed
      
      * fix: change llama pad token to unk
      
      * revert: undo unnecessary changes
      
      * fix: use pip to install transformers
  8. 29 Jun, 2023 2 commits
    • [chat] remove naive strategy and split colossalai strategy (#4094) · edd75a59
      Wenhao Chen authored
      * feat: remove on_learn_epoch fn as not used
      
      * revert: add _on_learn_epoch fn
      
      * to: remove the use of NaiveStrategy
      
      * test: remove NaiveStrategy tests
      
      * feat: remove NaiveStrategy
      
      * style: modify comments and params
      
      * feat: split ColossalAIStrategy into LowLevelZeroStrategy and GeminiStrategy
      
      * fix: remove naive
      
      * fix: align with modified colossal strategy
      
      * fix: fix ddp _try_init_dist arg
    • [chat] refactor trainer class (#4080) · b03d64d0
      Wenhao Chen authored
      * to: add SLTrainer
      
      * refactor: refactor RMTrainer and SFTTrainer
      
      * fix: fix init file
      
      * feat: remove on_learn_epoch fn as not used
      
      * fix: align with modified gemini arguments
      
      * to: add OnPolicyTrainer
      
      * revert: add _on_learn_epoch fn
      
      * refactor: refactor PPOTrainer
      
      * style: rename PPOTrainer argument
      
      * fix: align with modified PPO arguments
      
      * test: align with modified train_prompts arguments
      
      * chore: modify train_prompts
      
      * docs: align with modified arguments
      
      * fix: remove unnecessary output
      
      * fix: move dataloader to fit fn of SLTrainer
      
      * fix: move dataloader to fit fn of OnPolicyTrainer
      
      * fix: modify usage of prompt and pretrain dataloader
  9. 26 Jun, 2023 1 commit
  10. 25 Jun, 2023 1 commit
    • [chat] refactor strategy class with booster api (#3987) · 153b957a
      Wenhao Chen authored
      * refactor: adapt boost API in base and naive strategies
      
      * fix: initialize plugin after setup_distributed
      
      * fix: fix save_pretrained fn
      
      * refactor: adapt boost API in DDPStrategy
      
      * to: add _post_init check
      
      * to: fix ddp backward, modify ddp dataloader and unwrap
      
      * feat: adapt boost API in ColossalAIStrategy
      
      * fix: call setup_distributed before use get_current_device
      
      * fix: fix save_model and save_optimizer
      
      * test: remove save_sharded_optimizer test
      
      * style: apply formatter
      
      * fix: fix stage check and add comments
      
      * feat: allow dict type arg in strategy.prepare
      
      * to: temporarily remove lr_scheduler for testing
      
      * style: simplify init of ColossalAIStrategy
      
      * fix: fix lr_scheduler in sft and rm
      
      * style: modify comments
      
      * test: add train_prompts tests
      
      * fix: fix inference only case and use in train_prompts
      
      * test: skip failed tests in ci
      
      * style: fix CodeFactor check
      
      * fix: do not use model.to('cpu') with GeminiPlugin
      
      * test: enable colossalai_gemini tests
      
      * test: set CUDA_VISIBLE_DEVICES in ci
      
      * docs: add note
  11. 15 Jun, 2023 1 commit
  12. 13 Jun, 2023 1 commit
    • [chat] refactor actor class (#3968) · 9d02590c
      Wenhao Chen authored
      * refactor: separate log_probs fn from Actor forward fn
      
      * refactor: separate generate fn from Actor class
      
      * feat: update unwrap_model and get_base_model
      * unwrap_model returns model not wrapped by Strategy
      * get_base_model returns HF model for Actor, Critic and RewardModel
      
      * feat: simplify Strategy.prepare
      
      * style: remove get_base_model method of Actor
      
      * perf: tokenize text in batches
      
      * refactor: move calc_action_log_probs to utils of model
      
      * test: update test with new forward fn
      
      * style: rename forward fn args
      
      * fix: do not unwrap model in save_model fn of naive strategy
      
      * test: add gemini test for train_prompts
      
      * fix: fix _set_default_generate_kwargs
  13. 07 Jun, 2023 1 commit
    • [chat] add distributed PPO trainer (#3740) · b5f05663
      Hongxin Liu authored
      
      
      * Detached ppo (#9)
      
      * run the base
      
      * working on dist ppo
      
      * sync
      
      * detached trainer
      
      * update detached trainer. no maker update function
      
      * facing init problem
      
      * 1 maker 1 trainer detached run. but no model update
      
      * facing cuda problem
      
      * fix save functions
      
      * verified maker update
      
      * nothing
      
      * add ignore
      
      * analyze loss issue
      
      * remove some debug codes
      
      * facing 2m1t stuck issue
      
      * 2m1t verified
      
      * do not use torchrun
      
      * working on 2m2t
      
      * working on 2m2t
      
      * initialize strategy in ray actor env
      
      * facing actor's init order issue
      
      * facing ddp model update issue (need unwrap ddp)
      
      * unwrap ddp actor
      
      * checking 1m2t stuck problem
      
      * nothing
      
      * set timeout for trainer choosing. It solves the stuck problem!
      
      * delete some debug output
      
      * rename to sync with upstream
      
      * rename to sync with upstream
      
      * coati rename
      
      * nothing
      
      * I am going to detach the replay buffer from the trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations
      
      * experience_maker_holder performs target-revolving _send_experience() instead of length comparison.
      
      * move code to ray subfolder
      
      * working on pipeline inference
      
      * apply comments
      
      * working on pipeline strategy. in progress.
      
      * remove pipeline code. clean this branch
      
      * update remote parameters by state_dict. no test
      
      * nothing
      
      * state_dict sharding transfer
      
      * merge debug branch
      
      * gemini _unwrap_model fix
      
      * simplify code
      
      * simplify code & fix LoRALinear AttributeError
      
      * critic unwrapped state_dict
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] add performance evaluator and fix bugs (#10)
      
      * [chat] add performance evaluator for ray
      
      * [chat] refactor debug arg
      
      * [chat] support hf config
      
      * [chat] fix generation
      
      * [chat] add 1mmt dummy example
      
      * [chat] fix gemini ckpt
      
      * split experience to send (#11)
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] refactor trainer and maker (#12)
      
      * [chat] refactor experience maker holder
      
      * [chat] refactor model init
      
      * [chat] refactor trainer args
      
      * [chat] refactor model init
      
      * [chat] refactor trainer
      
      * [chat] refactor experience sending logic and training loop args (#13)
      
      * [chat] refactor experience send logic
      
      * [chat] refactor trainer
      
      * [chat] refactor trainer
      
      * [chat] refactor experience maker
      
      * [chat] refactor pbar
      
      * [chat] refactor example folder (#14)
      
      * [chat] support quant (#15)
      
      * [chat] add quant
      
      * [chat] add quant example
      
      * prompt example (#16)
      
      * prompt example
      
      * prompt load csv data
      
      * remove legacy try
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] add mmmt dummy example and refactor experience sending (#17)
      
      * [chat] add mmmt dummy example
      
      * [chat] refactor naive strategy
      
      * [chat] fix stuck problem
      
      * [chat] fix naive strategy
      
      * [chat] optimize experience maker sending logic
      
      * [chat] refactor sending assignment
      
      * [chat] refactor performance evaluator (#18)
      
      * Prompt Example & requires_grad state_dict & sharding state_dict (#19)
      
      * prompt example
      
      * prompt load csv data
      
      * remove legacy try
      
      * maker models require_grad set to False
      
      * working on zero redundancy update
      
      * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.
      
      * remove legacy examples
      
      * remove legacy examples
      
      * remove replay buffer tp state. bad design
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * state_dict sending adapts to new unwrap function (#20)
      
      * prompt example
      
      * prompt load csv data
      
      * remove legacy try
      
      * maker models require_grad set to False
      
      * working on zero redundancy update
      
      * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.
      
      * remove legacy examples
      
      * remove legacy examples
      
      * remove replay buffer tp state. bad design
      
      * opt benchmark
      
      * better script
      
      * nothing
      
      * [chat] strategy refactor unwrap model
      
      * [chat] strategy refactor save model
      
      * [chat] add docstr
      
      * [chat] refactor trainer save model
      
      * [chat] fix strategy typing
      
      * [chat] refactor trainer save model
      
      * [chat] update readme
      
      * [chat] fix unit test
      
      * working on lora reconstruction
      
      * state_dict sending adapts to new unwrap function
      
      * remove comments
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      Co-authored-by: ver217 <lhx0217@gmail.com>
      
      * [chat-ray] add readme (#21)
      
      * add readme
      
      * transparent graph
      
      * add note background
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] get images from url (#22)
      
      * Refactor/chat ray (#23)
      
      * [chat] lora add todo
      
      * [chat] remove unused pipeline strategy
      
      * [chat] refactor example structure
      
      * [chat] setup ci for ray
      
      * [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24)
      
      * lora support prototype
      
      * lora support
      
      * 1mmt lora & remove useless code
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] fix test ci for ray
      
      * [chat] fix test ci requirements for ray
      
      * [chat] fix ray runtime env
      
      * [chat] fix ray runtime env
      
      * [chat] fix example ci docker args
      
      * [chat] add debug info in trainer
      
      * [chat] add nccl debug info
      
      * [chat] skip ray test
      
      * [doc] fix typo
      
      ---------
      Co-authored-by: csric <59389055+CsRic@users.noreply.github.com>
      Co-authored-by: csric <richcsr256@gmail.com>
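The commit above mentions "LoRA weights reconstruction" but the log does not describe how it is done in Coati. As a hedged illustration, the standard LoRA merge rebuilds a full weight as W' = W + (alpha / r) * B A, where B has shape (out, r) and A has shape (r, in). The sketch below uses plain Python lists; `matmul` and `merge_lora` are hypothetical names, not Coati functions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * lora_b @ lora_a (standard LoRA merge)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, 2 x 2
lora_a = [[1.0, 2.0]]              # A: rank 1, input dim 2
lora_b = [[0.5], [1.0]]            # B: output dim 2, rank 1
merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [2.0, 5.0]]
```

Sending only the small A and B matrices and merging on the receiving side is one plausible reason to reconstruct weights remotely, though the log does not confirm that this is the motivation here.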
  14. 25 May, 2023 1 commit
    • [nfc] fix typo colossalai/ applications/ (#3831) · e2d81eba
      digger yu authored
      * fix typo colossalai/autochunk auto_parallel amp
      
      * fix typo colossalai/auto_parallel nn utils etc.
      
      * fix typo colossalai/auto_parallel autochunk fx/passes  etc.
      
      * fix typo docs/
      
      * change placememt_policy to placement_policy in docs/ and examples/
      
      * fix typo colossalai/ applications/
  15. 23 May, 2023 1 commit
  16. 17 May, 2023 1 commit
  17. 10 May, 2023 1 commit
  18. 04 May, 2023 1 commit
  19. 28 Apr, 2023 1 commit
  20. 27 Apr, 2023 2 commits
    • [chat] refactor model save/load logic (#3654) · 842768a1
      Hongxin Liu authored
      * [chat] strategy refactor unwrap model
      
      * [chat] strategy refactor save model
      
      * [chat] add docstr
      
      * [chat] refactor trainer save model
      
      * [chat] fix strategy typing
      
      * [chat] refactor trainer save model
      
      * [chat] update readme
      
      * [chat] fix unit test
    • [chat] remove lm model class (#3653) · 6ef70114
      Hongxin Liu authored
      * [chat] refactor lora
      
      * [chat] remove lm class
      
      * [chat] refactor save model
      
      * [chat] refactor train sft
      
      * [chat] fix ci
      
      * [chat] fix ci
  21. 26 Apr, 2023 3 commits
    • [chat] refactor trainer (#3648) · 2a951955
      Hongxin Liu authored
      * [chat] ppo trainer remove useless args
      
      * [chat] update examples
      
      * [chat] update benchmark
      
      * [chat] update examples
      
      * [chat] fix sft training with wandb
      
      * [chat] polish docstr
    • [chat] polish performance evaluator (#3647) · f8288315
      Hongxin Liu authored
    • [gemini] accelerate inference (#3641) · 50793b35
      Hongxin Liu authored
      * [gemini] support don't scatter after inference
      
      * [chat] update colossalai strategy
      
      * [chat] fix opt benchmark
      
      * [chat] update opt benchmark
      
      * [gemini] optimize inference
      
      * [test] add gemini inference test
      
      * [chat] fix unit test ci
      
      * [chat] fix ci
      
      * [chat] fix ci
      
      * [chat] skip checkpoint test
  22. 24 Apr, 2023 1 commit
  23. 20 Apr, 2023 1 commit
  24. 18 Apr, 2023 2 commits
    • 1ec0d386
      Yuanchen authored
    • Update test_ci.sh · 36a519b4
      Camille Zhong authored
      update
      
      Update test_ci.sh
      
      Update test_ci.sh
      
      Update test_ci.sh
      
      Update test_ci.sh
      
      Update test_ci.sh
      
      Update test_ci.sh
      
      Update run_chatgpt_examples.yml
      
      Update run_chatgpt_examples.yml
      
      Update run_chatgpt_examples.yml
      
      Update run_chatgpt_examples.yml
      
      Update run_chatgpt_examples.yml
      
      Update run_chatgpt_examples.yml
      
      Update test_ci.sh
      
      Update test_ci.sh
      
      update
      
      Update run_chatgpt_examples.yml
      
      Update run_chatgpt_examples.yml
      
      update ci
      
      Update test_ci.sh
      
      Update run_chatgpt_examples.yml
      
      Update run_chatgpt_examples.yml
      
      Update run_chatgpt_examples.yml
      
      Update run_chatgpt_examples.yml
      
      Update run_chatgpt_examples.yml
      
      Update run_chatgpt_examples.yml
      
      Update run_chatgpt_examples.yml
      
      Update test_ci.sh
      
      Update test_ci.sh
      
      Update run_chatgpt_examples.yml
      
      Update test_ci.sh
      
      Update test_ci.sh
      
      Update test_ci.sh
      
      update test ci
      
      RoBERTa for RLHF Stage 2 & 3 (still in testing)
      
      Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)"
      
      This reverts commit 06741d894dcbe958acd4e10d771f22275e20e368.
      
      Add RoBERTa for RLHF stage 2 & 3
      
      1. add roberta folder under model folder
      2. add  roberta option in train_reward_model.py
      3. add some test in testci
      
      Update test_ci.sh
      
      Revert "Update test_ci.sh"
      
      This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a.
      
      Add RoBERTa for RLHF Stage 2 & 3 (test)
      
      RoBERTa for RLHF Stage 2 & 3 (still in testing)
      
      Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)"
      
      This reverts commit 06741d894dcbe958acd4e10d771f22275e20e368.
      
      Add RoBERTa for RLHF stage 2 & 3
      
      1. add roberta folder under model folder
      2. add  roberta option in train_reward_model.py
      3. add some test in testci
      
      Update test_ci.sh
      
      Revert "Update test_ci.sh"
      
      This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a.
      
      update roberta with coati
      
      chat ci update
      
      Revert "chat ci update"
      
      This reverts commit 17ae7ae01fa752bd3289fc39069868fde99cf846.
      
      [test]chat_update_ci
      
      Update test_ci.sh
      
      Update test_ci.sh
      
      test
      
      Update gpt_critic.py
      
      Update gpt_critic.py
      
      Update run_chatgpt_unit_tests.yml
      
      update test ci
      
      update
      
      update
      
      update
      
      update
      
      Update test_ci.sh
      
      update
      
      Update test_ci.sh
      
      Update test_ci.sh
      
      Update run_chatgpt_examples.yml
      
      Update run_chatgpt_examples.yml
  25. 17 Apr, 2023 2 commits
    • fix: fix sft (#3568) · 7788e0b0
      tingfeng cao authored
    • [chatgpt] Detached PPO Training (#3195) · e3551443
      csric authored
      
      
      * run the base
      
      * working on dist ppo
      
      * sync
      
      * detached trainer
      
      * update detached trainer. no maker update function
      
      * facing init problem
      
      * 1 maker 1 trainer detached run. but no model update
      
      * facing cuda problem
      
      * fix save functions
      
      * verified maker update
      
      * nothing
      
      * add ignore
      
      * analyze loss issue
      
      * remove some debug codes
      
      * facing 2m1t stuck issue
      
      * 2m1t verified
      
      * do not use torchrun
      
      * working on 2m2t
      
      * working on 2m2t
      
      * initialize strategy in ray actor env
      
      * facing actor's init order issue
      
      * facing ddp model update issue (need unwrap ddp)
      
      * unwrap ddp actor
      
      * checking 1m2t stuck problem
      
      * nothing
      
      * set timeout for trainer choosing. It solves the stuck problem!
      
      * delete some debug output
      
      * rename to sync with upstream
      
      * rename to sync with upstream
      
      * coati rename
      
      * nothing
      
      * I am going to detach the replay buffer from the trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations
      
      * experience_maker_holder performs target-revolving _send_experience() instead of length comparison.
      
      * move code to ray subfolder
      
      * working on pipeline inference
      
      * apply comments
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
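The commit above says the experience maker holder does a "target-revolving `_send_experience()` instead of length comparison". One plausible reading, offered here as an interpretation and not as the actual implementation, is round-robin dispatch: each batch of experience goes to the next trainer in a fixed rotation, so no buffer lengths need to be compared. The class `RoundRobinSender` and the in-memory `inboxes` dictionary are hypothetical stand-ins for the Ray actors used in the real code.

```python
from itertools import cycle


class RoundRobinSender:
    """Dispatch items to targets in a fixed rotation (illustrative only)."""

    def __init__(self, targets):
        # cycle() yields the targets over and over in their original order.
        self._targets = cycle(targets)

    def send(self, experience, inboxes):
        """Append the experience to the next target's inbox; return the target."""
        target = next(self._targets)
        inboxes[target].append(experience)
        return target


inboxes = {"trainer_0": [], "trainer_1": []}
sender = RoundRobinSender(["trainer_0", "trainer_1"])
for experience in ["e0", "e1", "e2"]:
    sender.send(experience, inboxes)
print(inboxes)  # {'trainer_0': ['e0', 'e2'], 'trainer_1': ['e1']}
```

Compared with picking the trainer whose buffer is shortest, a rotation needs no remote length query per send, at the cost of ignoring how quickly each trainer drains its buffer.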
  26. 11 Apr, 2023 1 commit