1. 16 Oct, 2023 1 commit
    • [inference] Add smoothquant for llama (#4904) · 611a5a80
      Xu Kai authored
      * [inference] add int8 rotary embedding kernel for smoothquant (#4843)
      
      * [inference] add smoothquant llama attention (#4850)
      
      * add smoothquant llama attention
      
      * remove useless code
      
      * remove useless code
      
      * fix import error
      
      * rename file name
      
      * [inference] add silu linear fusion for smoothquant llama mlp (#4853)
      
      * add silu linear
      
      * update skip condition
      
      * catch smoothquant cuda lib exception
      
      * process exception for tests
      
      * [inference] add llama mlp for smoothquant (#4854)
      
      * add llama mlp for smoothquant
      
      * fix down out scale
      
      * remove duplicate lines
      
      * add llama mlp check
      
      * delete useless code
      
      * [inference] add smoothquant llama (#4861)
      
      * add smoothquant llama
      
      * fix attention accuracy
      
      * fix accuracy
      
      * add kv cache and save pretrained
      
      * refactor example
      
      * delete smooth
      
      * refactor code
      
      * [inference] add smooth function and delete useless code for smoothquant (#4895)
      
      * add smooth function and delete useless code
      
      * update datasets
      
      * remove duplicate import
      
      * delete useless file
      
      * refactor codes (#4902)
      
      * refactor code
      
      * add license
      
      * add torch-int and smoothquant license
  2. 27 Sep, 2023 1 commit
  3. 22 Sep, 2023 1 commit
    • [feature] add gptq for inference (#4754) · 946ab56c
      Xu Kai authored
      * [gptq] add gptq kernel (#4416)
      
      * add gptq
      
      * refactor code
      
      * fix tests
      
      * replace auto-gptq
      
      * rename inference/quant
      
      * refactor test
      
      * add auto-gptq as an option
      
      * reset requirements
      
      * change assert and check auto-gptq
      
      * add import warnings
      
      * change test flash attn version
      
      * remove example
      
      * change requirements of flash_attn
      
      * modify tests
      
      * [skip ci] change requirements-test
      
      * [gptq] faster gptq cuda kernel (#4494)
      
      * [skip ci] add cuda kernels
      
      * add license
      
      * [skip ci] fix max_input_len
      
      * format files & change test size
      
      * [skip ci]
      
      * [gptq] add gptq tensor parallel (#4538)
      
      * add gptq tensor parallel
      
      * add gptq tp
      
      * delete print
      
      * add test gptq check
      
      * add test auto gptq check
      
      * [gptq] combine gptq and kv cache manager (#4706)
      
      * combine gptq and kv cache manager
      
      * add init bits
      
      * delete useless code
      
      * add model path
      
      * delete useless print and update test
      
      * delete useless import
      
      * move option gptq to shard config
      
      * change replace linear to shardformer
      
      * update bloom policy
      
      * delete useless code
      
      * fix import bug and delete useless code
      
      * change colossalai/gptq to colossalai/quant/gptq
      
      * update import linear for tests
      
      * delete useless code and mv gptq_kernel to kernel directory
      
      * fix triton kernel
      
      * add triton import