Commits · 62b60db72a8e968dd74720d203a964a5ecb1df8d · OpenDAS / Lmdeploy

17 Aug, 2023 2 commits

[Fix] Implement movmatrix using warp shuffling for CUDA < 11.8 (#267) · f8ed456e
Li Zhang authored Aug 17, 2023


Support windows platform (#209) · 4c9959f6

Chen Xin authored Aug 17, 2023

* __PRETTY_FUNCTION__

* CASE_K

* uint

* remove not

* HALF_FLT_MAX

* struct init

* port utils

* better build pthread-win32

* port kernels

* port utils/gemm_test

* hide windows header

* port models

* port examples && triton_backend && unittests

* update build readme

* fix lint

* fix lint

* fix lint

* fix lint

* fix lint

* fix build

* fix build

* cmake version

* fix typos

* update ci

* port kernels/gemm_s_f16

* update ci

* fix ci

* use cudaStreamSynchronize instead of volatile check

* remove gettimeofday

* remove pthread-win32

* remove dirent.h

* update pre-commit

* update

* remove todo

* fix include

* fix build

* fix build

* fix build ci

* fix github action trigger

* update README

* fix linux-build ci

* remove windows folder

* fix lint

* update readme


15 Aug, 2023 1 commit
- Fix wrong RPATH using the absolute path instead of relative one (#239) · 271a19fe
  Chen Xin authored Aug 15, 2023
14 Aug, 2023 2 commits

feat(quantization): kv cache use asymmetric (#218) · 902a3e16
tpoisonooo authored Aug 14, 2023
```
* feat(quantization): kv cache use asymmetric
```

[Feature] Blazing fast W4A16 inference (#202) · c3290cad

Li Zhang authored Aug 14, 2023

* add w4a16

* fix `deploy.py`

* add doc

* add w4a16 kernels

* fuse w1/w3 & bugfixes

* fix typo

* python

* guard sm75/80 features

* add missing header

* refactor

* qkvo bias

* update cost model

* fix lint

* update `deploy.py`


03 Aug, 2023 1 commit
- Fix build test error and move turbomind csrc test cases to `tests/csrc` (#188) · 44a85546
  lvhan028 authored Aug 03, 2023
```
* fix build tests failure

* move src test cases to tests/csrc
```
31 Jul, 2023 2 commits
- Support Runtime tensor parallelism (#158) · 4767b04d
  q.yao authored Jul 31, 2023
```
* works on internlm and vicuna

* support GQA

* remove comment

* update readme, add logger, default tp=1

* remove log
```
- [Fix] Remove unused code to reduce binary size (#181) · 981a4610
  Li Zhang authored Jul 31, 2023
```
* clean-up

* fix lint

* fix lint
```
27 Jul, 2023 1 commit

Add manylinux builder (#164) · b9004712

Chen Xin authored Jul 27, 2023

* update builder

* remove root permission

* update readme

* update setup.py

* add install cuda 12.1 script

* use generate.sh

* add nccl to install_requires

* update README.md

* fix lint

* update setup.py

---------
Co-authored-by: chenxin <chenxin@pjlab.org.cn>


25 Jul, 2023 1 commit
- support fmha gqa (#160) · 5ed6bb59
  q.yao authored Jul 25, 2023
```
Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
```
24 Jul, 2023 1 commit
- [Feature] decode-only forward pass (#153) · 0cc9d095
  Li Zhang authored Jul 24, 2023
```
* decode only forward pass

* fix lint

* batch embedding
```
21 Jul, 2023 1 commit

[Feature] Support Llama-2 with GQA (#147) · f07b697b

Li Zhang authored Jul 21, 2023

* add GQA for llama2

* fix model conversion

* fix lint & remove dev log

* update news

* minor

* fix allocation size

* fix split_dim for w_qkv.bias


18 Jul, 2023 1 commit

Tensor Parallel python api (#82) · 7cbfe2ea

q.yao authored Jul 18, 2023

* wip

* profile disable tp

* fix profile

* lint

* fix dlpack

* remove comment

* add tp flag

* add session len check

* add eos

* remove tp and session len inputs

* warp tokenizer

* multithread load weight

* update profile

* refactor tokenizer

* remove pre/post process

* remove mpi4py requirement

* remove

* remove bind

* remove mpi requirement

* check backend_tokenizer


17 Jul, 2023 1 commit
- update log info (#131) · 1f88baa5
  q.yao authored Jul 17, 2023
```
* update log info

* format cuda utils
```
06 Jul, 2023 2 commits

Streaming output (#71) · 74a4f3c9

q.yao authored Jul 06, 2023

* streaming-output

* fix end

* fix profile

* support chinese streaming

* lint

* update chat

* lint

* fix benchmark

---------
Co-authored-by: grimoire <yaoqian@pjlab.org.cn>


fix clang-format (#68) · 208b6841
AllentDan authored Jul 06, 2023


05 Jul, 2023 2 commits

[Feature] Stats Quantization Parameters for KV Cache (#45) · 3fff964d

pppppM authored Jul 05, 2023

* add cal qparams

* support offload inference

* add collect functions (mod,weight)

* stats kv scales

* update init

* add user guide

* fix hints

* fix comments & support turbomind format

* update user guide

* fix slice kv cache error & support pileval dataset (used in llm-awq)

* fix wrong num heads slice

* update default dataset

* fix conflict

* fix hints

* fix hints

* add gitignore


Python ffi (#34) · 4fd6e710

q.yao authored Jul 05, 2023

* wip

* wip

* example finish

* fix include and namespace

* wtf

* install lib

* batchize

* update cmake install

* multithread

* fix comment

* fix

* add mmengine

* bind llamamodel

---------
Co-authored-by: grimoire <yaoqian@pjlab.org.cn>


04 Jul, 2023 1 commit
- use format-11.1 (#38) · 5ea40abf
  AllentDan authored Jul 04, 2023
```
* format-11.1

* md-link-config
```
03 Jul, 2023 2 commits
- install triton_example and TransformerTritonBackend to runtime and lib respectively (#39) · bb6f8060
  lvhan028 authored Jul 03, 2023
- fix(kernel): speed degrade (#41) · 6e58fced
  tpoisonooo authored Jul 03, 2023
```
* feat(template): remote diff

* feat(cmake): use c++17
```
01 Jul, 2023 4 commits
- Change target tritonfastertransformerbackend to tritonturbomindbackend (#36) · 70e6ab26
  lvhan028 authored Jul 01, 2023
```
* change target tritonfastertransformerbackend to tritonturbomindbackend

* install targets to backends/turbomind

* changge model_dir
```
- build turbomind (#35) · 35d64462
  lvhan028 authored Jul 01, 2023
```
* build turbomind

* change namespace fastertransformer to turbomind

* change logger name
```
- rename src/fastertransformer to src/turbomind (#33) · 53d2e42c
  lvhan028 authored Jul 01, 2023
- Add lint action (#32) · fe46dac2
  AllentDan authored Jul 01, 2023
```
* temp

* fix lint

* csrc->src

* remove clang-format

* skip .rst

* skip doc

* clang-format

version

version

* mat_B
```
28 Jun, 2023 2 commits

feat(src): add kv cache int8 quantization (#22) · cc93136e

tpoisonooo authored Jun 28, 2023

* feat(src): add int8 and compile passed

* feat(kernels): fix

* feat(llama): update kernel

* feat(src): add debug

* fix(kernel): k_cache use int8_t pointer

* style(llama): clean code

* feat(deploy.py): revert to enable fmha

* style(LlamaV2): clean code

* feat(deploy.py): add default quant policy


fix-gemm-tuning (#24) · 4d42a781
Li Zhang authored Jun 28, 2023


26 Jun, 2023 1 commit
- add gemm tuning (#18) · e357c71f
  Li Zhang authored Jun 26, 2023
24 Jun, 2023 1 commit
- Support attention bias (#14) · 2700abb3
  Li Zhang authored Jun 24, 2023
```
* support attention bias

* fix conflict
```
22 Jun, 2023 1 commit

Fix fmha on sm 70 (#12) · 4b121180

q.yao authored Jun 22, 2023

* update arch

* clang-format

* remove comment

---------
Co-authored-by: yaoqian <yaoqian@localhost.localdomain>


21 Jun, 2023 1 commit

support fmha (#9) · 6c7d9992

q.yao authored Jun 21, 2023

* support fmha

* update sm by cudaarch

* update ldscript path

* clang-format

* clang-format

---------


20 Jun, 2023 1 commit
- check-in fastertransformer (#7) · 9efcac38
  Li Zhang authored Jun 20, 2023
```
* add ft code

* gitignore

* fix lint

* revert fmha
```