1. 24 Oct, 2023 1 commit
  2. 20 Oct, 2023 2 commits
  3. 19 Oct, 2023 1 commit
    • [Refactor] Integrated some lightllm kernels into token-attention (#4946) · 3a41e830
      Cuiqing Li authored
      * add some req for inference
      
      * clean codes
      
      * add codes
      
      * add some lightllm deps
      
      * clean codes
      
      * hello
      
      * delete rms files
      
      * add some comments
      
      * add comments
      
      * add doc
      
      * add lightllm deps
      
      * add lightllm chatglm2 kernels
      
      * add lightllm chatglm2 kernels
      
      * replace rotary embedding with lightllm kernel
      
      * add some comments
      
      * add some comments
      
      * add some comments
      
      * add
      
      * replace fwd kernel att1
      
      * fix an arg
      
      * add
      
      * add
      
      * fix token attention
      
      * add some comments
      
      * clean codes
      
      * modify comments
      
      * fix readme
      
      * fix bug
      
      * fix bug
      
      ---------
      Co-authored-by: cuiqing.li <lixx336@gmail.com>
      Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
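The commits above swap the stock rotary embedding for a lightllm kernel. As a minimal pure-Python sketch of what rotary position embedding computes (the function name and scalar loop are illustrative; the real lightllm kernel is a fused CUDA implementation over fp16 tensors):

```python
import math

def apply_rotary(q, position, theta=10000.0):
    """Rotary position embedding for one query/key vector.

    Each pair (q[2i], q[2i+1]) is rotated by position * theta**(-2i/d),
    so relative positions become rotation offsets between vectors.
    """
    d = len(q)
    out = []
    for i in range(0, d, 2):
        freq = theta ** (-i / d)                     # frequency for this pair
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = q[i], q[i + 1]
        out.extend([x * c - y * s, x * s + y * c])   # plain 2-D rotation
    return out
```

Position 0 rotates by angle 0 and leaves the vector unchanged, and any rotation preserves the vector's norm, which makes the sketch easy to sanity-check.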
  4. 18 Oct, 2023 4 commits
  5. 17 Oct, 2023 2 commits
  6. 16 Oct, 2023 3 commits
    • [kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921) · 4f68b3f1
      Hongxin Liu authored
      * [kernel] support pure fp16 for cpu adam (#4896)
      
      * [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919)
      
      * [kernel] fix cpu adam
      
      * [test] update gemini optim test
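The update that the pure-fp16 CPU Adam kernel fuses is the standard Adam step, run element-wise directly on low-precision buffers. A scalar stand-in (plain Python floats; the real kernel vectorizes this over fp16 param/grad/moment arrays):

```python
def adam_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One element-wise Adam update on scalar state."""
    m = beta1 * m + (1 - beta1) * g        # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g    # second-moment EMA
    m_hat = m / (1 - beta1 ** step)        # bias correction
    v_hat = v / (1 - beta2 ** step)
    p = p - lr * m_hat / (v_hat ** 0.5 + eps)
    return p, m, v
```

A positive gradient moves the parameter down by roughly the learning rate once the moments are bias-corrected on step 1.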
    • Update flash_attention_patch.py · 7768afba
      Zian(Andy) Zheng authored
      To be compatible with a recent change in the Transformers library, where a new argument 'padding_mask' was added to the forward function of the attention layer.
      https://github.com/huggingface/transformers/pull/25598
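The usual shape of such a compatibility patch is to widen the forward signature so the new keyword is accepted without breaking older call sites. A toy sketch (the function body is illustrative, not the actual flash-attention patch):

```python
def attention_forward(hidden_states, attention_mask=None, padding_mask=None, **kwargs):
    """Illustrative attention forward that tolerates the `padding_mask`
    argument newer Transformers versions pass to attention layers.
    Here it is accepted purely for signature compatibility and ignored;
    a real flash-attention path would derive padding from the mask."""
    del padding_mask, kwargs  # accepted so the call doesn't raise TypeError
    return hidden_states
```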
    • [inference] Add smoothquant for llama (#4904) · 611a5a80
      Xu Kai authored
      * [inference] add int8 rotary embedding kernel for smoothquant (#4843)
      
      * [inference] add smoothquant llama attention (#4850)
      
      * add smoothquant llama attention
      
      * remove useless code
      
      * remove useless code
      
      * fix import error
      
      * rename file name
      
      * [inference] add silu linear fusion for smoothquant llama mlp  (#4853)
      
      * add silu linear
      
      * update skip condition
      
      * catch smoothquant cuda lib exception
      
      * process exception for tests
      
      * [inference] add llama mlp for smoothquant (#4854)
      
      * add llama mlp for smoothquant
      
      * fix down out scale
      
      * remove duplicate lines
      
      * add llama mlp check
      
      * delete useless code
      
      * [inference] add smoothquant llama (#4861)
      
      * add smoothquant llama
      
      * fix attention accuracy
      
      * fix accuracy
      
      * add kv cache and save pretrained
      
      * refactor example
      
      * delete smooth
      
      * refactor code
      
      * [inference] add smooth function and delete useless code for smoothquant (#4895)
      
      * add smooth function and delete useless code
      
      * update datasets
      
      * remove duplicate import
      
      * delete useless file
      
      * refactor codes (#4902)
      
      * refactor code
      
      * add license
      
      * add torch-int and smoothquant license
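The "smooth function" these commits add follows the SmoothQuant idea: per input channel j, pick a scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha), divide activations by s and multiply the matching weights by s. The product is mathematically unchanged, but activation outliers shrink, which is what makes int8 quantization viable. A minimal sketch with plain lists (not the repository's actual implementation):

```python
def smooth_scales(act_max, w_max, alpha=0.5):
    # Per input channel j: s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_max, w_max)]

def apply_smooth(x, w, s):
    # x / s paired with w * s keeps the dot product x . w identical,
    # while the scaled activations span a smaller dynamic range.
    return ([xi / si for xi, si in zip(x, s)],
            [wi * si for wi, si in zip(w, s)])
```

The invariant to check is that the dot product before and after smoothing agrees, while the activation maximum drops.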
  7. 13 Oct, 2023 2 commits
    • [feature] support no master weights option for low level zero plugin (#4816) · a0684e7b
      Zhongkai Zhao authored
      * [feature] support no master weights for low level zero plugin
      
      * [feature] support no master weights for low level zero plugin, remove data copy when no master weights
      
      * remove data copy and typecasting when no master weights
      
      * not load weights to cpu when using no master weights
      
      * fix grad: use fp16 grad when no master weights
      
      * only do not update working param when no master weights
      
      * fix: only do not update working param when no master weights
      
      * fix: passing params in dict format in hybrid plugin
      
      * fix: remove extra params (tp_process_group) in hybrid_parallel_plugin
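The option these commits add skips the fp32 master copy entirely: with master weights, the step runs in fp32 and the working param is refreshed from the master; without, the working (low-precision) param is updated in place with the fp16 grad, saving the copy and typecast. A toy scalar sketch (plain floats stand in for fp16/fp32 tensors; names are illustrative):

```python
def zero_step(working, grad, lr, master=None):
    """One toy low-level-zero optimizer step, with or without a master copy."""
    if master is not None:
        master = master - lr * grad      # fp32 math on the master copy
        return master, master            # working param re-cast from master
    return working - lr * grad, None     # direct update: no copy, no typecast
```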
    • [inference] add llama2 support (#4898) · 77a93283
      Xu Kai authored
      * add llama2 support
      
      * fix multi group bug
  8. 12 Oct, 2023 4 commits
  9. 11 Oct, 2023 4 commits
    • littsk · ffd9a3cb
    • ppt0011 · 1dcaf249
    • fix test llama (#4884) · fdec650b
      Xu Kai authored
    • [Pipeline Inference] Sync pipeline inference branch to main (#4820) · 08a9f76b
      Bin Jia authored
      * [pipeline inference] pipeline inference (#4492)
      
      * add pp stage manager as circle stage
      
      * fix a bug when create process group
      
      * add ppinfer basic framework
      
      * add micro batch manager and support kvcache-pp gpt2 fwd
      
      * add generate schedule
      
      * use mb size to control mb number
      
      * support generate with kv cache
      
      * add output, remove unused code
      
      * add test
      
      * reuse shardformer to build model
      
      * refactor some code and use the same attribute name of hf
      
      * fix review and add test for generation
      
      * remove unused file
      
      * fix CI
      
      * add cache clear
      
      * fix code error
      
      * fix typo
      
      * [Pipeline inference] Modify to tieweight (#4599)
      
      * add pp stage manager as circle stage
      
      * fix a bug when create process group
      
      * add ppinfer basic framework
      
      * add micro batch manager and support kvcache-pp gpt2 fwd
      
      * add generate schedule
      
      * use mb size to control mb number
      
      * support generate with kv cache
      
      * add output, remove unused code
      
      * add test
      
      * reuse shardformer to build model
      
      * refactor some code and use the same attribute name of hf
      
      * fix review and add test for generation
      
      * remove unused file
      
      * modify the way of saving newtokens
      
      * modify to tieweight
      
      * modify test
      
      * remove unused file
      
      * solve review
      
      * add docstring
      
      * [Pipeline inference] support llama pipeline inference (#4647)
      
      * support llama pipeline inference
      
      * remove tie weight operation
      
      * [pipeline inference] Fix the blocking of communication when ppsize is 2 (#4708)
      
      * add benchmark verbose
      
      * fix export tokens
      
      * fix benchmark verbose
      
      * add P2POp style to do p2p communication
      
      * modify schedule as p2p type when ppsize is 2
      
      * remove unused code and add docstring
      
      * [Pipeline inference] Refactor code, add docsting, fix bug (#4790)
      
      * add benchmark script
      
      * update argparse
      
      * fix fp16 load
      
      * refactor code style
      
      * add docstring
      
      * polish code
      
      * fix test bug
      
      * [Pipeline inference] Add pipeline inference docs (#4817)
      
      * add readme doc
      
      * add an icon
      
      * Add performance
      
      * update table of contents
      
      * refactor code (#4873)
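Several commits above ("add micro batch manager", "use mb size to control mb number") revolve around one scheduling fact: the number of micro batches in flight through the pipeline is determined by batch size and the configured micro batch size. A minimal sketch of that split (function name is illustrative, not the repository's micro batch manager API):

```python
def make_micro_batches(batch, mb_size):
    """Split a batch into micro batches of size mb_size; the resulting
    count is what drives the pipeline generate schedule."""
    return [batch[i:i + mb_size] for i in range(0, len(batch), mb_size)]
```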
  10. 10 Oct, 2023 5 commits
  11. 07 Oct, 2023 5 commits
  12. 06 Oct, 2023 2 commits
  13. 05 Oct, 2023 2 commits
  14. 04 Oct, 2023 2 commits
  15. 02 Oct, 2023 1 commit
    • [Infer] Serving example w/ ray-serve (multiple GPU case) (#4841) · 573f2705
      Yuanheng Zhao authored
      * fix imports
      
      * add ray-serve with Colossal-Infer tp
      
      * trivial: send requests script
      
      * add README
      
      * fix worker port
      
      * fix readme
      
      * use app builder and autoscaling
      
      * trivial: input args
      
      * clean code; revise readme
      
      * testci (skip example test)
      
      * use auto model/tokenizer
      
      * revert imports fix (fixed in other PRs)
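The "send requests script" mentioned above would POST generation requests to the Ray Serve HTTP endpoint. A hedged client sketch using only the standard library; the payload field names and endpoint URL are assumptions, not the example's actual schema:

```python
import json
import urllib.request

def build_request(prompt, max_new_tokens=64):
    """Build the JSON body for a generation request (illustrative fields)."""
    return {"text": prompt, "max_new_tokens": max_new_tokens}

def send(url, payload):
    """POST the payload to a running Ray Serve endpoint and return the body."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Usage would look like `send("http://127.0.0.1:8000/", build_request("hello"))` against a deployed app.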