1. 08 Jun, 2023 4 commits
  2. 07 Jun, 2023 8 commits
    • [nfc] fix typo colossalai/zero (#3923) · de0d7df3
      digger yu authored
    • digger yu · a9d1cadc
    • [example] Modify palm example with the new booster API (#3913) · b306cecf
      Liu Ziming authored
      * Modify torch version requirement to adapt to torch 2.0
      
      * modify palm example using new booster API
      
      * roll back
      
      * fix port
      
      * polish
      
      * polish
    • wukong1992 · a55fb00c
    • Frank Lee · 5e2132dc
    • Hongxin Liu · c25d421f
    • Hongxin Liu · 9c88b6cb
    • [chat] add distributed PPO trainer (#3740) · b5f05663
      Hongxin Liu authored

      * Detached ppo (#9)
      
      * run the base
      
      * working on dist ppo
      
      * sync
      
      * detached trainer
      
      * update detached trainer. no maker update function
      
      * facing init problem
      
      * 1 maker 1 trainer detached run. but no model update
      
      * facing cuda problem
      
      * fix save functions
      
      * verified maker update
      
      * nothing
      
      * add ignore
      
      * analyze loss issue
      
      * remove some debug codes
      
      * facing 2m1t stuck issue
      
      * 2m1t verified
      
      * do not use torchrun
      
      * working on 2m2t
      
      * working on 2m2t
      
      * initialize strategy in ray actor env
      
      * facing actor's init order issue
      
      * facing ddp model update issue (need to unwrap ddp)
      
      * unwrap ddp actor
      
      * checking 1m2t stuck problem
      
      * nothing
      
      * set timeout for trainer choosing. It solves the stuck problem!
      
      * delete some debug output
      
      * rename to sync with upstream
      
      * rename to sync with upstream
      
      * coati rename
      
      * nothing
      
      * I am going to detach the replay buffer from the trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations
      
      * experience_maker_holder performs target-revolving _send_experience() instead of length comparison.
      
      * move code to ray subfolder
      
      * working on pipeline inference
      
      * apply comments
      
      * working on pipeline strategy. in progress.
      
      * remove pipeline code. clean this branch
      
      * update remote parameters by state_dict. no test
      
      * nothing
      
      * state_dict sharding transfer
      
      * merge debug branch
      
      * gemini _unwrap_model fix
      
      * simplify code
      
      * simplify code & fix LoRALinear AttributeError
      
      * critic unwrapped state_dict
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] add performance evaluator and fix bugs (#10)
      
      * [chat] add performance evaluator for ray
      
      * [chat] refactor debug arg
      
      * [chat] support hf config
      
      * [chat] fix generation
      
      * [chat] add 1mmt dummy example
      
      * [chat] fix gemini ckpt
      
      * split experience to send (#11)
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] refactor trainer and maker (#12)
      
      * [chat] refactor experience maker holder
      
      * [chat] refactor model init
      
      * [chat] refactor trainer args
      
      * [chat] refactor model init
      
      * [chat] refactor trainer
      
      * [chat] refactor experience sending logic and training loop args (#13)
      
      * [chat] refactor experience send logic
      
      * [chat] refactor trainer
      
      * [chat] refactor trainer
      
      * [chat] refactor experience maker
      
      * [chat] refactor pbar
      
      * [chat] refactor example folder (#14)
      
      * [chat] support quant (#15)
      
      * [chat] add quant
      
      * [chat] add quant example
      
      * prompt example (#16)
      
      * prompt example
      
      * prompt load csv data
      
      * remove legacy try
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] add mmmt dummy example and refactor experience sending (#17)
      
      * [chat] add mmmt dummy example
      
      * [chat] refactor naive strategy
      
      * [chat] fix stuck problem
      
      * [chat] fix naive strategy
      
      * [chat] optimize experience maker sending logic
      
      * [chat] refactor sending assignment
      
      * [chat] refactor performance evaluator (#18)
      
      * Prompt Example & requires_grad state_dict & sharding state_dict (#19)
      
      * prompt example
      
      * prompt load csv data
      
      * remove legacy try
      
      * maker models require_grad set to False
      
      * working on zero redundancy update
      
      * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.
      
      * remove legacy examples
      
      * remove legacy examples
      
      * remove replay buffer tp state. bad design
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * state_dict sending adapts to new unwrap function (#20)
      
      * prompt example
      
      * prompt load csv data
      
      * remove legacy try
      
      * maker models require_grad set to False
      
      * working on zero redundancy update
      
      * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.
      
      * remove legacy examples
      
      * remove legacy examples
      
      * remove replay buffer tp state. bad design
      
      * opt benchmark
      
      * better script
      
      * nothing
      
      * [chat] strategy refactor unwrap model
      
      * [chat] strategy refactor save model
      
      * [chat] add docstr
      
      * [chat] refactor trainer save model
      
      * [chat] fix strategy typing
      
      * [chat] refactor trainer save model
      
      * [chat] update readme
      
      * [chat] fix unit test
      
      * working on lora reconstruction
      
      * state_dict sending adapts to new unwrap function
      
      * remove comments
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      Co-authored-by: ver217 <lhx0217@gmail.com>
      
      * [chat-ray] add readme (#21)
      
      * add readme
      
      * transparent graph
      
      * add note background
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] get images from url (#22)
      
      * Refactor/chat ray (#23)
      
      * [chat] lora add todo
      
      * [chat] remove unused pipeline strategy
      
      * [chat] refactor example structure
      
      * [chat] setup ci for ray
      
      * [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24)
      
      * lora support prototype
      
      * lora support
      
      * 1mmt lora & remove useless code
      
      ---------
      Co-authored-by: csric <richcsr256@gmail.com>
      
      * [chat] fix test ci for ray
      
      * [chat] fix test ci requirements for ray
      
      * [chat] fix ray runtime env
      
      * [chat] fix ray runtime env
      
      * [chat] fix example ci docker args
      
      * [chat] add debug info in trainer
      
      * [chat] add nccl debug info
      
      * [chat] skip ray test
      
      * [doc] fix typo
      
      ---------
      Co-authored-by: csric <59389055+CsRic@users.noreply.github.com>
      Co-authored-by: csric <richcsr256@gmail.com>
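Several bullets in this PR ("unwrap ddp actor", "update remote parameters by state_dict", "critic unwrapped state_dict") revolve around one detail: a DDP-wrapped actor keeps the real model under a `.module` attribute, so its state_dict keys carry a `module.` prefix that a remote experience maker would not expect. A minimal pure-Python sketch of that unwrapping step — `DDPWrapper` and `Actor` here are hypothetical stand-ins, not ColossalAI's actual classes:

```python
class DDPWrapper:
    """Hypothetical stand-in for DistributedDataParallel: like DDP,
    it holds the wrapped model in a `.module` attribute."""
    def __init__(self, module):
        self.module = module

def unwrap_model(model):
    # Peel off wrapper layers until the innermost model is reached, so
    # the state_dict sent to remote makers has no "module." key prefix.
    while hasattr(model, "module"):
        model = model.module
    return model

class Actor:
    """Hypothetical inner model."""

inner = Actor()
print(unwrap_model(DDPWrapper(DDPWrapper(inner))) is inner)  # True
```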
  3. 06 Jun, 2023 4 commits
    • [devops] hotfix CI about testmon cache (#3910) · 41fb7236
      Hongxin Liu authored
      * [devops] hotfix CI about testmon cache
      
      * [devops] fix testmon cache on pr
    • [nfc] fix typo colossalai/pipeline tensor nn (#3899) · 0e484e62
      digger yu authored
      * fix typo colossalai/autochunk auto_parallel amp
      
      * fix typo colossalai/auto_parallel nn utils etc.
      
      * fix typo colossalai/auto_parallel autochunk fx/passes etc.
      
      * fix typo docs/
      
      * change placememt_policy to placement_policy in docs/ and examples/
      
      * fix typo colossalai/ applications/
      
      * fix typo colossalai/cli fx kernel
      
      * fix typo colossalai/nn
      
      * revert change warmuped
      
      * fix typo colossalai/pipeline tensor nn
    • Baizhou Zhang · c1535ccb
    • [devops] improving testmon cache (#3902) · ec9bbc00
      Hongxin Liu authored
      * [devops] improving testmon cache
      
      * [devops] fix branch name with slash
      
      * [devops] fix branch name with slash
      
      * [devops] fix edit action
      
      * [devops] fix edit action
      
      * [devops] fix edit action
      
      * [devops] fix edit action
      
      * [devops] fix edit action
      
      * [devops] fix edit action
      
      * [devops] update readme
  4. 05 Jun, 2023 6 commits
    • support evaluation for english (#3880) · 57a6d768
      Yuanchen authored
      
      Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
    • [nfc] fix typo colossalai/nn (#3887) · 18787497
      digger yu authored
      * fix typo colossalai/autochunk auto_parallel amp
      
      * fix typo colossalai/auto_parallel nn utils etc.
      
      * fix typo colossalai/auto_parallel autochunk fx/passes etc.
      
      * fix typo docs/
      
      * change placememt_policy to placement_policy in docs/ and examples/
      
      * fix typo colossalai/ applications/
      
      * fix typo colossalai/cli fx kernel
      
      * fix typo colossalai/nn
      
      * revert change warmuped
    • [bf16] add bf16 support (#3882) · ae02d4e4
      Hongxin Liu authored
      * [bf16] add bf16 support for fused adam (#3844)
      
      * [bf16] fused adam kernel support bf16
      
      * [test] update fused adam kernel test
      
      * [test] update fused adam test
      
      * [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860)
      
      * [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869)
      
      * [bf16] add mixed precision mixin
      
      * [bf16] low level zero optim support bf16
      
      * [test] update low level zero test
      
      * [test] fix low level zero grad acc test
      
      * [bf16] add bf16 support for gemini (#3872)
      
      * [bf16] gemini support bf16
      
      * [test] update gemini bf16 test
      
      * [doc] update gemini docstring
      
      * [bf16] add bf16 support for plugins (#3877)
      
      * [bf16] add bf16 support for legacy zero (#3879)
      
      * [zero] init context support bf16
      
      * [zero] legacy zero support bf16
      
      * [test] add zero bf16 test
      
      * [doc] add bf16 related docstring for legacy zero
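For context on why bf16 support spans optimizers, zero, and gemini: bfloat16 keeps fp32's 8-bit exponent (so, unlike fp16, no loss scaling is needed for range) and truncates the mantissa from 23 to 7 bits. A pure-Python illustration of that rounding, assuming round-to-nearest-even on the dropped bits — this is not the PR's code, which does the conversion inside fused CUDA kernels:

```python
import struct

def to_bf16(x: float) -> float:
    """Round a Python float to bfloat16 precision: keep only the top 16
    bits of the fp32 pattern (sign, 8-bit exponent, 7-bit mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # round-to-nearest-even on the 16 low bits being dropped
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_bf16(3.141592653589793))  # 3.140625: only ~3 decimal digits survive
print(to_bf16(1e30))               # still finite: bf16 shares fp32's range
```

The 7-bit mantissa is why mixed-precision setups keep fp32 master weights in the optimizer and cast to bf16 only for forward/backward.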
    • [doc]update moe chinese document. (#3890) · 07cb2114
      jiangmingyan authored
      * [doc]update-moe
      
      * [doc]update-moe
      
      * [doc]update-moe
      
      * [doc]update-moe
      
      * [doc]update-moe
    • Liu Ziming · 8065cc5f
    • [lazy] refactor lazy init (#3891) · dbb32692
      Hongxin Liu authored
      * [lazy] remove old lazy init
      
      * [lazy] refactor lazy init folder structure
      
      * [lazy] fix lazy tensor deepcopy
      
      * [test] update lazy init test
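The idea behind lazy init (and the deepcopy fix listed above) can be sketched without ColossalAI: a lazy tensor records its construction recipe and defers allocation until it is materialized (e.g. after the model has been sharded), and deepcopy must clone the recipe rather than force materialization. An illustrative placeholder, not the PR's actual `LazyTensor`:

```python
import copy

class LazyTensor:
    """Hypothetical sketch: record shape/fill, allocate nothing until
    materialize() is called (a real impl would build a device tensor)."""
    def __init__(self, shape, fill=0.0):
        self.shape, self.fill = tuple(shape), fill
        self._data = None          # no storage allocated yet

    def materialize(self):
        if self._data is None:
            n = 1
            for d in self.shape:
                n *= d
            self._data = [self.fill] * n
        return self._data

    def __deepcopy__(self, memo):
        # Clone the recipe; do NOT materialize just to copy.
        clone = LazyTensor(self.shape, self.fill)
        clone._data = copy.deepcopy(self._data, memo)
        return clone

t = LazyTensor((2, 3))
u = copy.deepcopy(t)
print(t._data is None and u._data is None)  # True: still lazy after deepcopy
print(len(t.materialize()))                 # 6: allocated on first use
```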
  5. 02 Jun, 2023 1 commit
    • [nfc] fix typo colossalai/cli fx kernel (#3847) · 70c8cdec
      digger yu authored
      * fix typo colossalai/autochunk auto_parallel amp
      
      * fix typo colossalai/auto_parallel nn utils etc.
      
      * fix typo colossalai/auto_parallel autochunk fx/passes etc.
      
      * fix typo docs/
      
      * change placememt_policy to placement_policy in docs/ and examples/
      
      * fix typo colossalai/ applications/
      
      * fix typo colossalai/cli fx kernel
  6. 30 May, 2023 3 commits
    • [doc] update document of zero with chunk. (#3855) · 281b33f3
      jiangmingyan authored
      * [doc] fix title of mixed precision
      
      * [doc]update document of zero with chunk
      
      * [doc] update document of zero with chunk, fix
      
      * [doc] update document of zero with chunk, fix
      
      * [doc] update document of zero with chunk, fix
      
      * [doc] update document of zero with chunk, add doc test
      
      * [doc] update document of zero with chunk, add doc test
      
      * [doc] update document of zero with chunk, fix installation
      
      * [doc] update document of zero with chunk, fix zero with chunk doc
      
      * [doc] update document of zero with chunk, fix zero with chunk doc
    • [example] update gemini examples (#3868) · 5f79008c
      jiangmingyan authored
      * [example]update gemini examples
      
      * [example]update gemini examples
    • [evaluation] improvement on evaluation (#3862) · 2506e275
      Yuanchen authored

      * fix a bug when the config file contains one category but the answer file doesn't contain that category
      
      * fix Chinese prompt file
      
      * support gpt-3.5-turbo and gpt-4 evaluation
      
      * polish and update README
      
      * resolve pr comments
      
      ---------
      Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
  7. 25 May, 2023 7 commits
  8. 24 May, 2023 7 commits