- 27 Nov, 2023 1 commit
  - zhouxiang authored
- 18 Nov, 2023 2 commits
- 15 Nov, 2023 1 commit
  - xiabo authored
- 14 Nov, 2023 1 commit
  - xiabo authored
- 11 Oct, 2023 1 commit
  - akhoroshev authored
- 09 Oct, 2023 1 commit
  - Lyu Han authored
    * change shared_instances_ from weakptr to sharedptr (see the sketch below)
    * update
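The shared_instances_ change above swaps a registry of std::weak_ptr for std::shared_ptr, so cached instances stay alive as long as the registry itself does instead of expiring once external references drop. A minimal C++ sketch of that ownership change, with a hypothetical ModelInstance type standing in for the real class:

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

struct ModelInstance {  // hypothetical stand-in for the real instance type
    explicit ModelInstance(std::string n): name(std::move(n)) {}
    std::string name;
};

class InstanceRegistry {
public:
    // Holding shared_ptr keeps an instance alive even after every external
    // user releases it; a weak_ptr map would let the entry expire instead.
    std::shared_ptr<ModelInstance> get_or_create(const std::string& name) {
        auto it = shared_instances_.find(name);
        if (it != shared_instances_.end()) {
            return it->second;
        }
        auto inst = std::make_shared<ModelInstance>(name);
        shared_instances_.emplace(name, inst);
        return inst;
    }

private:
    std::map<std::string, std::shared_ptr<ModelInstance>> shared_instances_;
};
```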
- 26 Sep, 2023 3 commits
  - Lyu Han authored
    * Fix memory leak
    * modern c++
  - akhoroshev authored
  - akhoroshev authored
    * cuda allocator fix
    * graceful termination
    * lint and compilation fix
- 18 Sep, 2023 2 commits
- 14 Sep, 2023 1 commit
  - Chen Xin authored
- 11 Sep, 2023 1 commit
  - Lyu Han authored
    * tmp
    * add demo for codellama inference
    * update
    * update
    * update
    * update codellama.md
    * export rope_theta (see the sketch below)
    * update
    * update doc
    * fix client.py
    * define SamplingParam
    * rollback 'end'
    * rotary_emb_base to rotary_embedding_base
    * change to baichuan2-7b
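The 'export rope_theta' and 'rotary_emb_base to rotary_embedding_base' items above concern the rotary embedding base: CodeLlama ships a larger base (1e6) than Llama 2 (1e4), so the converter has to carry it through. For reference, the base enters RoPE through the standard per-dimension frequency table; a small sketch of that formula (function name is illustrative, not the repository's API):

```cpp
#include <cmath>
#include <vector>

// inv_freq[i] = rope_theta^(-2i / head_dim): the standard RoPE frequency table.
std::vector<float> rope_inv_freq(int head_dim, float rope_theta /* e.g. 10000.f or 1e6f */) {
    std::vector<float> inv_freq(head_dim / 2);
    for (int i = 0; i < head_dim / 2; ++i) {
        inv_freq[i] = 1.0f / std::pow(rope_theta, 2.0f * i / head_dim);
    }
    return inv_freq;
}
```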
- 07 Sep, 2023 1 commit
  - Lyu Han authored
    * Fix crash when context window size is large by setting max dynamic smem size (see the sketch below)
    * fix linting
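The max dynamic smem fix above reflects a CUDA rule: a kernel may request more than the default 48 KB of dynamic shared memory only after the limit is raised with cudaFuncSetAttribute, so a long context window can otherwise make the launch fail. A hedged sketch of the pattern, with my_kernel as a placeholder rather than the actual attention kernel:

```cpp
#include <cuda_runtime.h>

__global__ void my_kernel(const float* in, float* out) {
    extern __shared__ float smem[];   // dynamic shared memory buffer
    (void)in; (void)out; (void)smem;  // placeholder body
}

cudaError_t launch_with_large_smem(const float* in, float* out,
                                   size_t smem_bytes, cudaStream_t stream) {
    // Opt in to > 48 KB of dynamic shared memory (sm_70 and newer).
    cudaError_t err = cudaFuncSetAttribute(
        my_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, (int)smem_bytes);
    if (err != cudaSuccess) {
        return err;
    }
    my_kernel<<<1, 256, smem_bytes, stream>>>(in, out);
    return cudaGetLastError();
}
```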
- 01 Sep, 2023 1 commit
  - Chen Xin authored
    * pack llama_gemm
    * update CMakeLists.txt
    * remove candidate
    * update MANIFEST.in
- 29 Aug, 2023 1 commit
  - q.yao authored
    * first
    * fix causal mask
    * disable flash attention2 on sm70 (see the sketch below)
    * fix 2
    * update readme
    * clang-format
    * disable ft2 on windows
    * fix lint
    * fix build
    * fix build
    * fix long kv seq
    * fix lint
    * sync copy output
    Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
    Co-authored-by: irexyc <irexyc@gmail.com>
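The 'disable flash attention2 on sm70' item above is an architecture gate: the FlashAttention-2 kernels are not usable on Volta (sm_70), so those devices need a fallback attention path. A minimal sketch of a runtime capability check (the cut-off at sm_80 is an assumption of this sketch, and the function name is made up):

```cpp
#include <cuda_runtime.h>

bool can_use_flash_attention2(int device_id) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess) {
        return false;
    }
    // Assumption for this sketch: enable the FA2 path only on sm_80 and newer.
    return prop.major >= 8;
}
```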
- 24 Aug, 2023 1 commit
  - Lyu Han authored
    * Pad tok_embedding and output weights to make their shape divisible by TP (see the sketch below)
    * update
    * update
    * update
    * update
    * update llamaBatch
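The padding item above is about tensor parallelism: the token-embedding and output (lm_head) weights are split across TP ranks along the vocabulary dimension, so the vocabulary size has to be rounded up to a multiple of the TP size before slicing. A small sketch of the rounding (names are illustrative):

```cpp
#include <cstddef>

// Round vocab_size up to the nearest multiple of tp so every rank gets an
// equally sized slice of the embedding / lm_head weight; extra rows are zero-padded.
size_t pad_vocab_for_tp(size_t vocab_size, size_t tp) {
    return (vocab_size + tp - 1) / tp * tp;
}
// e.g. pad_vocab_for_tp(32003, 2) == 32004
```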
- 22 Aug, 2023 1 commit
  - Li Zhang authored
    * disable cache hint for CUDA < 11.4 (see the sketch below)
    * fix lint
    * fix lint
    * fix cuda-11.3 build
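The 'disable cache hint for CUDA < 11.4' item above is a toolkit-version gate. A common way to express it is a compile-time check on CUDART_VERSION so the cache-hint path is only built with a new enough toolkit; a hedged sketch of that guard (the TM_ENABLE_CACHE_HINT macro is hypothetical and the hint intrinsics themselves are not shown):

```cpp
#include <cuda_runtime_api.h>  // defines CUDART_VERSION, e.g. 11040 for CUDA 11.4

// Compile the cache-hinted load/store path only with CUDA 11.4 or newer;
// older toolkits such as 11.3 fall back to plain loads and stores.
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11040)
#define TM_ENABLE_CACHE_HINT 1
#else
#define TM_ENABLE_CACHE_HINT 0
#endif
```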
- 18 Aug, 2023 1 commit
  - Li Zhang authored
    * qwen support
    * dynamic ntk & logn attn (see the sketch below)
    * fix ntk & add chat template
    * fix ntk scaling & stop words
    * fix lint
    * add tiktoken to requirements.txt
    * fix tokenizer, set model format automatically
    * update model.py
    * update readme
    * fix lint
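For the 'dynamic ntk & logn attn' item above, the formulas usually quoted for Qwen are: dynamic NTK enlarges the RoPE base by alpha^(d/(d-2)) once the sequence outgrows the trained context, and log-n attention scales the query by log(position)/log(trained_len) beyond that length. A sketch of those formulas as commonly described, not necessarily TurboMind's exact implementation (the simple alpha = seq_len / trained_len ratio is an assumption of this sketch):

```cpp
#include <cmath>

// Dynamic NTK: grow the RoPE base when the sequence exceeds the trained context.
float ntk_scaled_base(float base, int head_dim, int seq_len, int trained_len) {
    if (seq_len <= trained_len) {
        return base;
    }
    float alpha = static_cast<float>(seq_len) / trained_len;  // scaling ratio (sketch)
    return base * std::pow(alpha, head_dim / (head_dim - 2.0f));
}

// log-n attention: scale the query for positions beyond the trained length.
float logn_scale(int position, int trained_len) {
    if (position <= trained_len) {
        return 1.0f;
    }
    return std::log(static_cast<float>(position)) / std::log(static_cast<float>(trained_len));
}
```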
- 17 Aug, 2023 2 commits
  - Li Zhang authored
  - Chen Xin authored
    * __PRETTY_FUNCTION__
    * CASE_K
    * uint
    * remove not
    * HALF_FLT_MAX
    * struct init
    * port utils
    * better build pthread-win32
    * port kernels
    * port utils/gemm_test
    * hide windows header
    * port models
    * port examples && triton_backend && unittests
    * update build readme
    * fix lint
    * fix lint
    * fix lint
    * fix lint
    * fix lint
    * fix build
    * fix build
    * cmake version
    * fix typos
    * update ci
    * port kernels/gemm_s_f16
    * update ci
    * fix ci
    * use cudaStreamSynchronize instead of volatile check
    * remove gettimeofday
    * remove pthread-win32
    * remove dirent.h
    * update pre-commit
    * update
    * remove todo
    * fix include
    * fix build
    * fix build
    * fix build ci
    * fix github action trigger
    * update README
    * fix linux-build ci
    * remove windows folder
    * fix lint
    * update readme
- 15 Aug, 2023 1 commit
  - Chen Xin authored
- 14 Aug, 2023 2 commits
  - tpoisonooo authored
    * feat(quantization): kv cache use asymmetric (see the sketch below)
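Asymmetric KV-cache quantization (the item above) stores a zero point alongside the scale so the 8-bit range covers the tensor's actual [min, max] rather than a range symmetric around zero. A minimal sketch of asymmetric uint8 quantization; the parameter layout is illustrative, not the repository's kernel:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct QParams {
    float   scale;  // (max - min) / 255
    uint8_t zero;   // quantized value that maps back to min
};

QParams calc_asym_qparams(float min_val, float max_val) {
    float scale = (max_val - min_val) / 255.0f;
    if (scale == 0.0f) {
        scale = 1.0f;  // degenerate range: avoid division by zero
    }
    float zero = std::round(-min_val / scale);
    return {scale, static_cast<uint8_t>(std::clamp(zero, 0.0f, 255.0f))};
}

uint8_t quantize(float x, const QParams& q) {
    float v = std::round(x / q.scale) + q.zero;
    return static_cast<uint8_t>(std::clamp(v, 0.0f, 255.0f));
}

float dequantize(uint8_t v, const QParams& q) {
    return (static_cast<float>(v) - q.zero) * q.scale;
}
```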
  - Li Zhang authored
    * add w4a16 (see the sketch below)
    * fix `deploy.py`
    * add doc
    * add w4a16 kernels
    * fuse w1/w3 & bugfixes
    * fix typo
    * python
    * guard sm75/80 features
    * add missing header
    * refactor
    * qkvo bias
    * update cost model
    * fix lint
    * update `deploy.py`
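w4a16 in the entry above means 4-bit weights with 16-bit activations: the weights are packed two per byte and dequantized on the fly, typically with a per-group scale and zero point. A simplified sketch of the unpacking step; group layout and zero handling are illustrative:

```cpp
#include <cstdint>

// Unpack two 4-bit weights from one byte and dequantize them with a
// per-group scale and zero point (w4a16: 4-bit weights, 16-bit activations).
inline void dequant_w4_pair(uint8_t packed, float scale, float zero, float out[2]) {
    int lo = packed & 0x0F;         // first 4-bit weight
    int hi = (packed >> 4) & 0x0F;  // second 4-bit weight
    out[0] = (lo - zero) * scale;
    out[1] = (hi - zero) * scale;
}
```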
- 03 Aug, 2023 1 commit
  - lvhan028 authored
    * fix build tests failure
    * move src test cases to tests/csrc
- 31 Jul, 2023 2 commits
- 27 Jul, 2023 1 commit
  - Chen Xin authored
    * update builder
    * remove root permission
    * update readme
    * update setup.py
    * add install cuda 12.1 script
    * use generate.sh
    * add nccl to install_requires
    * update README.md
    * fix lint
    * update setup.py
    Co-authored-by: chenxin <chenxin@pjlab.org.cn>
- 25 Jul, 2023 1 commit
  - q.yao authored
    Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
- 24 Jul, 2023 1 commit
  - Li Zhang authored
    * decode only forward pass
    * fix lint
    * batch embedding
- 21 Jul, 2023 1 commit
  - Li Zhang authored
    * add GQA for llama2 (see the sketch below)
    * fix model conversion
    * fix lint & remove dev log
    * update news
    * minor
    * fix allocation size
    * fix split_dim for w_qkv.bias
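Grouped-query attention ('add GQA for llama2' above) lets a group of query heads share a single key/value head, as in Llama 2 70B (64 query heads, 8 KV heads). The mapping is plain integer division; a small sketch (function name is illustrative):

```cpp
// Map a query head to the key/value head it shares under grouped-query attention.
// num_q_heads == num_kv_heads gives ordinary multi-head attention;
// num_kv_heads == 1 degenerates to multi-query attention.
int kv_head_for_query_head(int q_head, int num_q_heads, int num_kv_heads) {
    int group_size = num_q_heads / num_kv_heads;  // e.g. 64 / 8 = 8 for Llama 2 70B
    return q_head / group_size;
}
```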
- 18 Jul, 2023 1 commit
  - q.yao authored
    * wip
    * profile disable tp
    * fix profile
    * lint
    * fix dlpack
    * remove comment
    * add tp flag
    * add session len check
    * add eos
    * remove tp and session len inputs
    * warp tokenizer
    * multithread load weight
    * update profile
    * refactor tokenizer
    * remove pre/post process
    * remove mpi4py requirement
    * remove
    * remove bind
    * remove mpi requirement
    * check backend_tokenizer
- 17 Jul, 2023 1 commit
  - q.yao authored
    * update log info
    * format cuda utils
- 06 Jul, 2023 2 commits
  - q.yao authored
    * streaming-output
    * fix end
    * fix profile
    * support chinese streaming (see the sketch below)
    * lint
    * update chat
    * lint
    * fix benchmark
    Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
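The 'support chinese streaming' item above touches a common streaming pitfall: an incrementally detokenized chunk can end in the middle of a multi-byte UTF-8 character, so emitting it as-is garbles CJK text. One way to handle it is to hold back any incomplete trailing sequence until the next chunk arrives; a hedged sketch of that check, not the repository's actual implementation:

```cpp
#include <string>

// Return how many bytes at the end of `s` form an incomplete UTF-8 sequence.
// The caller can emit s.substr(0, s.size() - n) and carry the tail into the next chunk.
size_t incomplete_utf8_tail(const std::string& s) {
    size_t n = s.size();
    for (size_t back = 1; back <= 3 && back <= n; ++back) {
        unsigned char c = static_cast<unsigned char>(s[n - back]);
        if ((c & 0xC0) != 0x80) {  // found a lead byte (or plain ASCII)
            size_t expect = (c & 0x80) == 0x00 ? 1
                          : (c & 0xE0) == 0xC0 ? 2
                          : (c & 0xF0) == 0xE0 ? 3
                          : (c & 0xF8) == 0xF0 ? 4
                          : 1;  // invalid lead byte: treat as complete
            return expect > back ? back : 0;
        }
    }
    return 0;  // only continuation bytes seen; nothing to hold back
}
```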
  - AllentDan authored
- 05 Jul, 2023 2 commits
  - pppppM authored
    * add cal qparams
    * support offload inference
    * add collect funtions (mod,weight)
    * stats kv scales
    * update init
    * add user guide
    * fix hints
    * fix comments & support turbomind format
    * update user guide
    * fix slice kv cache error & support pileval dataset (used in llm-awq)
    * fix wrong num heads slice
    * update default dataset
    * fix conflict
    * fix hints
    * fix hints
    * add gitignore
  - q.yao authored
    * wip
    * wip
    * example finish
    * fix include and namespace
    * wtf
    * install lib
    * batchize
    * update cmake install
    * multithread
    * fix comment
    * fix
    * add mmengine
    * bind llamamodel
    Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
- 04 Jul, 2023 1 commit
  - AllentDan authored
    * format-11.1
    * md-link-config
- 03 Jul, 2023 1 commit
  - lvhan028 authored