1. 10 Nov, 2023 1 commit
    • [gemini] gemini support tensor parallelism. (#4942) · 576a2f7b
      flybird11111 authored
      * [colossalai]fix typo
      
      * [inference] Add smoothquant for llama (#4904)
      
      * [inference] add int8 rotary embedding kernel for smoothquant (#4843)
      
      * [inference] add smoothquant llama attention (#4850)
      
      * add smoothquant llama attention
      
      * remove useless code
      
      * remove useless code
      
      * fix import error
      
      * rename file name
      
      * [inference] add silu linear fusion for smoothquant llama mlp  (#4853)
      
      * add silu linear
      
      * update skip condition
      
      * catch smoothquant cuda lib exception
      
      * process exception for tests
      
      * [inference] add llama mlp for smoothquant (#4854)
      
      * add llama mlp for smoothquant
      
      * fix down out scale
      
      * remove duplicate lines
      
      * add llama mlp check
      
      * delete useless code
      
      * [inference] add smoothquant llama (#4861)
      
      * add smoothquant llama
      
      * fix attention accuracy
      
      * fix accuracy
      
      * add kv cache and save pretrained
      
      * refactor example
      
      * delete smooth
      
      * refactor code
      
      * [inference] add smooth function and delete useless code for smoothquant (#4895)
      
      * add smooth function and delete useless code
      
      * update datasets
      
      * remove duplicate import
      
      * delete useless file
      
      * refactor codes (#4902)
      
      * refactor code
      
      * add license
      
      * add torch-int and smoothquant license
      
      * Update flash_attention_patch.py
      
      To be compatible with the new change in the Transformers library, where a new argument 'padding_mask' was added to the forward function of the attention layer.
      https://github.com/huggingface/transformers/pull/25598
      
      
      
      * [kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921)
      
      * [kernel] support pure fp16 for cpu adam (#4896)
      
      * [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919)
      
      * [kernel] fix cpu adam
      
      * [test] update gemini optim test
      
      * [format] applied code formatting on changed files in pull request 4908 (#4918)
      Co-authored-by: github-actions <github-actions@github.com>
      
      * [gemini] support gradient accumulation (#4869)
      
      * add test
      
      * fix no_sync bug in low level zero plugin
      
      * fix test
      
      * add argument for grad accum
      
      * add grad accum in backward hook for gemini
      
      * finish implementation, rewrite tests
      
      * fix test
      
      * skip stuck model in low level zero test
      
      * update doc
      
      * optimize communication & fix gradient checkpoint
      
      * modify doc
      
      * cleaning codes
      
      * update cpu adam fp16 case
      
      * [hotfix] fix torch 2.0 compatibility (#4936)
      
      * [hotfix] fix launch
      
      * [test] fix test gemini optim
      
      * [shardformer] fix vit
      
      * [test] add no master test for low level zero plugin (#4934)
      
      * [format] applied code formatting on changed files in pull request 4820 (#4886)
      Co-authored-by: github-actions <github-actions@github.com>
      
      * [nfc] fix some typo with colossalai/ docs/ etc. (#4920)
      
      * [Refactor] Integrated some lightllm kernels into token-attention  (#4946)
      
      * add some req for inference
      
      * clean codes
      
      * add codes
      
      * add some lightllm deps
      
      * clean codes
      
      * hello
      
      * delete rms files
      
      * add some comments
      
      * add comments
      
      * add doc
      
      * add lightllm deps
      
      * add lightllm chatglm2 kernels
      
      * add lightllm chatglm2 kernels
      
      * replace rotary embedding with lightllm kernel
      
      * add some comments
      
      * add some comments
      
      * add some comments
      
      * add
      
      * replace fwd kernel att1
      
      * fix an arg
      
      * add
      
      * add
      
      * fix token attention
      
      * add some comments
      
      * clean codes
      
      * modify comments
      
      * fix readme
      
      * fix bug
      
      * fix bug
      
      ---------
      Co-authored-by: cuiqing.li <lixx336@gmail.com>
      Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
      
      * [test] merge old components to test to model zoo (#4945)
      
      * [test] add custom models in model zoo
      
      * [test] update legacy test
      
      * [test] update model zoo
      
      * [test] update gemini test
      
      * [test] remove components to test
      
      * [inference] add reference and fix some bugs (#4937)
      
      * add reference and fix some bugs
      
      * update gptq init
      
      ---------
      Co-authored-by: Xu Kai <xukai16@foxamil.com>
      
      * [Inference]ADD Bench Chatglm2 script (#4963)
      
      * add bench chatglm
      
      * fix bug and make utils
      
      ---------
      
      Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
      
      * [Pipeline inference] Combine kvcache with pipeline inference (#4938)
      
      * merge kvcache with pipeline inference and refactor the code structure
      
      * support ppsize > 2
      
      * refactor pipeline code
      
      * do pre-commit
      
      * modify benchmark
      
      * fix benchmark
      
      * polish code
      
      * add docstring and update readme
      
      * refactor the code
      
      * fix some logic bug of ppinfer
      
      * polish readme
      
      * fix typo
      
      * skip infer test
      
      * updated c++17 compiler flags (#4983)
      
      * [Inference] Dynamic Batching Inference, online and offline (#4953)
      
      * [inference] Dynamic Batching for Single and Multiple GPUs (#4831)
      
      * finish batch manager
      
      * 1
      
      * first
      
      * fix
      
      * fix dynamic batching
      
      * llama infer
      
      * finish test
      
      * support different lengths generating
      
      * del prints
      
      * del prints
      
      * fix
      
      * fix bug
      
      ---------
      
      Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
      
      * [inference] Async dynamic batching  (#4894)
      
      * finish input and output logic
      
      * add generate
      
      * test forward
      
      * 1
      
      * [inference]Re push async dynamic batching (#4901)
      
      * adapt to ray server
      
      * finish async
      
      * finish test
      
      * del test
      
      ---------
      Co-authored-by: yuehuayingxueluo <867460659@qq.com>
      
      * Revert "[inference]Re push async dynamic batching (#4901)" (#4905)
      
      This reverts commit fbf3c09e673794ed18c91d4bab1a7dfea052e95a.
      
      * Revert "[inference] Async dynamic batching  (#4894)"
      
      This reverts commit fced14025043e29ce816b315f440601188f7f79f.
      
      * Revert "[inference] Async dynamic batching  (#4894)" (#4909)
      
      This reverts commit fced14025043e29ce816b315f440601188f7f79f.
      
      * Add Ray Distributed Environment Init Scripts
      
      * support DynamicBatchManager base function
      
      * revert _set_tokenizer version
      
      * add driver async generate
      
      * add async test
      
      * fix bugs in test_ray_dist.py
      
      * add get_tokenizer.py
      
      * fix code style
      
      * fix bugs about No module named 'pydantic' in ci test
      
      * fix bugs in ci test
      
      * fix bugs in ci test
      
      * fix bugs in ci test
      
      * [infer]Add Ray Distributed Environment Init Scripts (#4911)
      
      * Revert "[inference] Async dynamic batching  (#4894)"
      
      This reverts commit fced14025043e29ce816b315f440601188f7f79f.
      
      * Add Ray Distributed Environment Init Scripts
      
      * support DynamicBatchManager base function
      
      * revert _set_tokenizer version
      
      * add driver async generate
      
      * add async test
      
      * fix bugs in test_ray_dist.py
      
      * add get_tokenizer.py
      
      * fix code style
      
      * fix bugs about No module named 'pydantic' in ci test
      
      * fix bugs in ci test
      
      * fix bugs in ci test
      
      * fix bugs in ci test
      
      * support dynamic batch for bloom model and is_running function
      
      * [Inference]Test for new Async engine (#4935)
      
      * infer engine
      
      * infer engine
      
      * test engine
      
      * test engine
      
      * new manager
      
      * change step
      
      * add
      
      * test
      
      * fix
      
      * fix
      
      * finish test
      
      * finish test
      
      * finish test
      
      * finish test
      
      * add license
      
      ---------
      Co-authored-by: yuehuayingxueluo <867460659@qq.com>
      
      * add assertion for config (#4947)
      
      * [Inference] Finish dynamic batching offline test (#4948)
      
      * test
      
      * fix test
      
      * fix quant
      
      * add default
      
      * fix
      
      * fix some bugs
      
      * fix some bugs
      
      * fix
      
      * fix bug
      
      * fix bugs
      
      * reset param
      
      ---------
      Co-authored-by: yuehuayingxueluo <867460659@qq.com>
      Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
      Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
      
      * [Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention  (#4965)
      
      * adding flash-decoding
      
      * clean
      
      * adding kernel
      
      * adding flash-decoding
      
      * add integration
      
      * add
      
      * adding kernel
      
      * adding kernel
      
      * adding triton 2.1.0 features for inference
      
      * update bloom triton kernel
      
      * remove useless vllm kernels
      
      * clean codes
      
      * fix
      
      * adding files
      
      * fix readme
      
      * update llama flash-decoding
      
      ---------
      Co-authored-by: cuiqing.li <lixx336@gmail.com>
      
      * fix ColossalEval (#4992)
      Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
      
      * [doc]Update doc for colossal-inference (#4989)
      
      * update doc
      
      * Update README.md
      
      ---------
      Co-authored-by: cuiqing.li <lixx336@gmail.com>
      
      * [hotfix] Fix the bug where process groups were not being properly released. (#4940)
      
      * Fix the bug where process groups were not being properly released.
      
      * test
      
      * Revert "test"
      
      This reverts commit 479900c1398637310abf92eefa3cd168038ea02f.
      
      * [hotfix] fix the bug of repeatedly storing param group (#4951)
      
      * [doc] add supported feature diagram for hybrid parallel plugin (#4996)
      
      * [Pipeline Inference] Merge pp with tp (#4993)
      
      * refactor pipeline into new CaiInferEngine
      
      * update llama modeling forward
      
      * merge tp with pp
      
      * update docstring
      
      * optimize test workflow and example
      
      * fix typo
      
      * add assert and todo
      
      * [release] update version (#4995)
      
      * [release] update version
      
      * [hotfix] fix ci
      
      * [gemini] gemini support tp
      
      [gemini] gemini support tp
      
      [gemini] gemini support tp
      
      [gemini] gemini support tp
      
      [gemini] gemini support tp
      
      * fix
      
      fix
      
      fix
      
      * update checkpointIO
      
      update checkpointIO
      
      update checkpointIO
      
      update checkpointIO
      
      update checkpointIO
      
      update checkpointIO
      
      update checkpointIO
      
      update checkpointIO
      
      update checkpointIO
      
      * support fused layernorm
      
      support fused layernorm
      
      support fused layernorm
      
      * update fusedlayernorm
      
      update fusedlayernorm
      
      update fusedlayernorm
      
      * add sequence parallel to gemini
      
      add sequence parallel to gemini
      
      * fix
      
      * fix comments
      
      fix comments
      
      fix comments
      
      * fix
      
      * fix t5
      
      * clear cache
      
      * fix
      
      * activate ci
      
      * activate ci
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * revert
      
      * modify tp gather method
      
      modify tp gather method
      
      modify tp gather method
      
      modify tp gather method
      
      * fix test
      
      ---------
      Co-authored-by: Xu Kai <xukai16@foxmail.com>
      Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com>
      Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
      Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
      Co-authored-by: github-actions <github-actions@github.com>
      Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
      Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
      Co-authored-by: digger yu <digger-yu@outlook.com>
      Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
      Co-authored-by: cuiqing.li <lixx336@gmail.com>
      Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
      Co-authored-by: Xu Kai <xukai16@foxamil.com>
      Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
      Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
      Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
      Co-authored-by: yuehuayingxueluo <867460659@qq.com>
      Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com>
      Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
      Co-authored-by: littsk <1214689160@qq.com>
      Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
  2. 09 Nov, 2023 3 commits
  3. 08 Nov, 2023 2 commits
  4. 07 Nov, 2023 1 commit
  5. 06 Nov, 2023 1 commit
  6. 03 Nov, 2023 1 commit
    • [hotfix] Add layer norm gradients all-reduce for sequence parallel (#4926) · 1a3315e3
      littsk authored
      
      
      * [hotfix] Add layer norm gradients all-reduce for sequence parallel. (#4915)
      
      * Add layer norm gradients all-reduce for sequence parallel.
      
      * skip pipeline inference test
      
      * [hotfix] fixing policies of sequence parallel (#4922)
      
      * Add layer norm gradients all-reduce for sequence parallel.
      
      * fix parameter passing when calling get_autopolicy
      
      ---------
      Co-authored-by: littsk <1214689160@qq.com>
      
      * Hotfix/add grad all reduce for sequence parallel (#4927)
      
      * Add layer norm gradients all-reduce for sequence parallel.
      
      
      * fix parameter passing when calling get_autopolicy
      
      * fix bug using wrong variables
      
      ---------
      Co-authored-by: littsk <1214689160@qq.com>
      
      * fix policy initialization
      
      * fix bloom and chatglm policies
      
      * polish code of handling layernorm
      
      * fix moe module
      
      * polish code of class initializing
      
      ---------
      Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
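      A rough illustration of the idea in the entry above, not the ColossalAI implementation: with sequence parallelism each rank holds a different slice of the sequence while LayerNorm parameters stay replicated, so their gradients must be summed across the sequence-parallel group before the optimizer step. Minimal sketch, assuming a hypothetical `sp_group` process group:

          import torch
          import torch.distributed as dist

          def allreduce_layernorm_grads(model: torch.nn.Module, sp_group) -> None:
              # Sum the replicated LayerNorm gradients over the sequence-parallel ranks;
              # intended to run after backward() and before optimizer.step().
              for module in model.modules():
                  if isinstance(module, torch.nn.LayerNorm):
                      for param in module.parameters():
                          if param.grad is not None:
                              dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=sp_group)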
  7. 02 Nov, 2023 2 commits
  8. 01 Nov, 2023 2 commits
  9. 31 Oct, 2023 5 commits
  10. 30 Oct, 2023 2 commits
    • [Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama... · 459a88c8
      Cuiqing Li authored
      
      [Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention  (#4965)
      
      * adding flash-decoding
      
      * clean
      
      * adding kernel
      
      * adding flash-decoding
      
      * add integration
      
      * add
      
      * adding kernel
      
      * adding kernel
      
      * adding triton 2.1.0 features for inference
      
      * update bloom triton kernel
      
      * remove useless vllm kernels
      
      * clean codes
      
      * fix
      
      * adding files
      
      * fix readme
      
      * update llama flash-decoding
      
      ---------
      Co-authored-by: cuiqing.li <lixx336@gmail.com>
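      The flash-decoding approach referenced above can be sketched outside of Triton: the key/value cache is split into chunks, each chunk yields a partial attention result, and the partials are merged with a running max and softmax denominator. A toy NumPy sketch for a single query (illustrative only, not the kernel):

          import numpy as np

          def flash_decoding_single_query(q, k, v, num_chunks=4):
              # q: (d,), k: (n, d), v: (n, d); returns softmax(k @ q / sqrt(d)) @ v
              scale = 1.0 / np.sqrt(q.shape[0])
              m, denom, acc = -np.inf, 0.0, np.zeros_like(q)
              for kc, vc in zip(np.array_split(k, num_chunks), np.array_split(v, num_chunks)):
                  logits = kc @ q * scale
                  m_new = max(m, logits.max())
                  correction = np.exp(m - m_new)        # rescale previously accumulated partials
                  p = np.exp(logits - m_new)
                  denom = denom * correction + p.sum()
                  acc = acc * correction + p @ vc
                  m = m_new
              return acc / denom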
    • [Inference] Dynamic Batching Inference, online and offline (#4953) · cf579ff4
      Jianghai authored
      
      
      * [inference] Dynamic Batching for Single and Multiple GPUs (#4831)
      
      * finish batch manager
      
      * 1
      
      * first
      
      * fix
      
      * fix dynamic batching
      
      * llama infer
      
      * finish test
      
      * support different lengths generating
      
      * del prints
      
      * del prints
      
      * fix
      
      * fix bug
      
      ---------
      
      Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
      
      * [inference] Async dynamic batching  (#4894)
      
      * finish input and output logic
      
      * add generate
      
      * test forward
      
      * 1
      
      * [inference]Re push async dynamic batching (#4901)
      
      * adapt to ray server
      
      * finish async
      
      * finish test
      
      * del test
      
      ---------
      Co-authored-by: yuehuayingxueluo <867460659@qq.com>
      
      * Revert "[inference]Re push async dynamic batching (#4901)" (#4905)
      
      This reverts commit fbf3c09e673794ed18c91d4bab1a7dfea052e95a.
      
      * Revert "[inference] Async dynamic batching  (#4894)"
      
      This reverts commit fced14025043e29ce816b315f440601188f7f79f.
      
      * Revert "[inference] Async dynamic batching  (#4894)" (#4909)
      
      This reverts commit fced14025043e29ce816b315f440601188f7f79f.
      
      * Add Ray Distributed Environment Init Scripts
      
      * support DynamicBatchManager base function
      
      * revert _set_tokenizer version
      
      * add driver async generate
      
      * add async test
      
      * fix bugs in test_ray_dist.py
      
      * add get_tokenizer.py
      
      * fix code style
      
      * fix bugs about No module named 'pydantic' in ci test
      
      * fix bugs in ci test
      
      * fix bugs in ci test
      
      * fix bugs in ci test
      
      * [infer]Add Ray Distributed Environment Init Scripts (#4911)
      
      * Revert "[inference] Async dynamic batching  (#4894)"
      
      This reverts commit fced14025043e29ce816b315f440601188f7f79f.
      
      * Add Ray Distributed Environment Init Scripts
      
      * support DynamicBatchManager base function
      
      * revert _set_tokenizer version
      
      * add driver async generate
      
      * add async test
      
      * fix bugs in test_ray_dist.py
      
      * add get_tokenizer.py
      
      * fix code style
      
      * fix bugs about No module named 'pydantic' in ci test
      
      * fix bugs in ci test
      
      * fix bugs in ci test
      
      * fix bugs in ci test
      
      * support dynamic batch for bloom model and is_running function
      
      * [Inference]Test for new Async engine (#4935)
      
      * infer engine
      
      * infer engine
      
      * test engine
      
      * test engine
      
      * new manager
      
      * change step
      
      * add
      
      * test
      
      * fix
      
      * fix
      
      * finish test
      
      * finish test
      
      * finish test
      
      * finish test
      
      * add license
      
      ---------
      Co-authored-by: yuehuayingxueluo <867460659@qq.com>
      
      * add assertion for config (#4947)
      
      * [Inference] Finish dynamic batching offline test (#4948)
      
      * test
      
      * fix test
      
      * fix quant
      
      * add default
      
      * fix
      
      * fix some bugs
      
      * fix some bugs
      
      * fix
      
      * fix bug
      
      * fix bugs
      
      * reset param
      
      ---------
      Co-authored-by: yuehuayingxueluo <867460659@qq.com>
      Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
      Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
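      A toy sketch of the dynamic (continuous) batching idea behind the entry above; the `Request` objects and `step_model` callable are hypothetical stand-ins, not the actual DynamicBatchManager API. New requests join the running batch whenever slots free up, and finished sequences are retired after every decoding step:

          from collections import deque

          class ToyBatchManager:
              def __init__(self, max_batch_size=8):
                  self.waiting = deque()   # requests not yet scheduled
                  self.running = []        # requests currently decoding
                  self.max_batch_size = max_batch_size

              def add_request(self, req):
                  self.waiting.append(req)

              def step(self, step_model):
                  # Top up the running batch with waiting requests.
                  while self.waiting and len(self.running) < self.max_batch_size:
                      self.running.append(self.waiting.popleft())
                  if not self.running:
                      return []
                  step_model(self.running)  # one decoding step for every running request
                  # Retire finished sequences so their slots can be reused next step.
                  finished = [r for r in self.running if r.is_finished()]
                  self.running = [r for r in self.running if not r.is_finished()]
                  return finished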
  11. 27 Oct, 2023 2 commits
  12. 24 Oct, 2023 1 commit
  13. 20 Oct, 2023 2 commits
  14. 19 Oct, 2023 1 commit
    • [Refactor] Integrated some lightllm kernels into token-attention (#4946) · 3a41e830
      Cuiqing Li authored
      
      
      * add some req for inference
      
      * clean codes
      
      * add codes
      
      * add some lightllm deps
      
      * clean codes
      
      * hello
      
      * delete rms files
      
      * add some comments
      
      * add comments
      
      * add doc
      
      * add lightllm deps
      
      * add lightllm chatglm2 kernels
      
      * add lightllm chatglm2 kernels
      
      * replace rotary embedding with lightllm kernel
      
      * add some comments
      
      * add some comments
      
      * add some comments
      
      * add
      
      * replace fwd kernel att1
      
      * fix an arg
      
      * add
      
      * add
      
      * fix token attention
      
      * add some comments
      
      * clean codes
      
      * modify comments
      
      * fix readme
      
      * fix bug
      
      * fix bug
      
      ---------
      Co-authored-by: cuiqing.li <lixx336@gmail.com>
      Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
  15. 18 Oct, 2023 4 commits
  16. 17 Oct, 2023 2 commits
  17. 16 Oct, 2023 3 commits
    • [kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921) · 4f68b3f1
      Hongxin Liu authored
      * [kernel] support pure fp16 for cpu adam (#4896)
      
      * [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919)
      
      * [kernel] fix cpu adam
      
      * [test] update gemini optim test
    • Update flash_attention_patch.py · 7768afba
      Zian(Andy) Zheng authored
      To be compatible with the new change in the Transformers library, where a new argument 'padding_mask' was added to the forward function of the attention layer.
      https://github.com/huggingface/transformers/pull/25598
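      A minimal sketch of the compatibility point above (hypothetical names, not the actual patch): newer Transformers versions pass an extra `padding_mask` keyword into the attention layer's forward, so a patched forward has to at least accept it to work with both old and new releases.

          from typing import Optional
          import torch

          def patched_attention_forward(
              self,
              hidden_states: torch.Tensor,
              attention_mask: Optional[torch.Tensor] = None,
              position_ids: Optional[torch.LongTensor] = None,
              past_key_value=None,
              output_attentions: bool = False,
              use_cache: bool = False,
              padding_mask: Optional[torch.Tensor] = None,  # new kwarg from huggingface/transformers#25598
              **kwargs,
          ):
              # `padding_mask` is accepted purely for API compatibility; the original
              # attention computation (omitted in this sketch) proceeds unchanged.
              ...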
    • [inference] Add smoothquant for llama (#4904) · 611a5a80
      Xu Kai authored
      * [inference] add int8 rotary embedding kernel for smoothquant (#4843)
      
      * [inference] add smoothquant llama attention (#4850)
      
      * add smoothquant llama attention
      
      * remove useless code
      
      * remove useless code
      
      * fix import error
      
      * rename file name
      
      * [inference] add silu linear fusion for smoothquant llama mlp  (#4853)
      
      * add silu linear
      
      * update skip condition
      
      * catch smoothquant cuda lib exception
      
      * process exception for tests
      
      * [inference] add llama mlp for smoothquant (#4854)
      
      * add llama mlp for smoothquant
      
      * fix down out scale
      
      * remove duplicate lines
      
      * add llama mlp check
      
      * delete useless code
      
      * [inference] add smoothquant llama (#4861)
      
      * add smoothquant llama
      
      * fix attention accuracy
      
      * fix accuracy
      
      * add kv cache and save pretrained
      
      * refactor example
      
      * delete smooth
      
      * refactor code
      
      * [inference] add smooth function and delete useless code for smoothquant (#4895)
      
      * add smooth function and delete useless code
      
      * update datasets
      
      * remove duplicate import
      
      * delete useless file
      
      * refactor codes (#4902)
      
      * refactor code
      
      * add license
      
      * add torch-int and smoothquant license
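      The "smooth function" mentioned above follows the SmoothQuant recipe: per-input-channel scales migrate activation outliers into the weights so both sides quantize well to int8. A toy sketch (not the repository's implementation), where `act_abs_max` is a calibration statistic gathered beforehand:

          import torch

          @torch.no_grad()
          def smooth_linear(act_abs_max: torch.Tensor, linear: torch.nn.Linear, alpha: float = 0.5):
              # act_abs_max: per-input-channel max |activation|, shape (in_features,)
              weight_abs_max = linear.weight.abs().max(dim=0).values          # (in_features,)
              scales = act_abs_max.clamp(min=1e-5).pow(alpha) / weight_abs_max.clamp(min=1e-5).pow(1 - alpha)
              linear.weight.mul_(scales)   # W' = W * diag(s); activations are later divided by s
              return scales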
  18. 13 Oct, 2023 2 commits
    • [feature] support no master weights option for low level zero plugin (#4816) · a0684e7b
      Zhongkai Zhao authored
      * [feature] support no master weights for low level zero plugin
      
      * [feature] support no master weights for low level zero plugin, remove data copy when no master weights
      
      * remove data copy and typecasting when no master weights
      
      * not load weights to cpu when using no master weights
      
      * fix grad: use fp16 grad when no master weights
      
      * only do not update working param when no master weights
      
      * fix: only do not update working param when no master weights
      
      * fix: passing params in dict format in hybrid plugin
      
      * fix: remove extra params (tp_process_group) in hybrid_parallel_plugin
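      A toy sketch of the trade-off behind the entry above (not the plugin's code): with master weights the optimizer updates an fp32 copy and casts it back into the fp16 working parameter; with the no-master-weights option it updates the fp16 working parameter in place from its fp16 gradient, skipping the extra copy and type casts at the cost of update precision.

          import torch

          @torch.no_grad()
          def sgd_step_with_master(working: torch.Tensor, master: torch.Tensor, lr: float = 1e-3):
              master -= lr * working.grad.float()   # update the fp32 master copy
              working.copy_(master)                 # cast the result back into the fp16 param

          @torch.no_grad()
          def sgd_step_no_master(working: torch.Tensor, lr: float = 1e-3):
              working -= lr * working.grad          # update the fp16 working param directly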
    • [inference] add llama2 support (#4898) · 77a93283
      Xu Kai authored
      * add llama2 support
      
      * fix multi group bug
  19. 12 Oct, 2023 3 commits