- 10 Nov, 2023 1 commit
Li Zhang authored
* refresh decoder attention kernel
* block-level kv cache
* `BlockManager` & `SequenceManager`
* update
* update
* update
* update
* rename
* GQA support
* fix context length
* GQA dispatch
* kv8
* tune
* async stream cb
* nvtx
* config parsing
* debug
* optimize output cost
* split-k decoding
* minor
* truncate `session_len` by available blocks
* minor
* license
* fix
* dispatch `cp.async`
* fix linking
* fix
* fix deadlock
* guard input length
* correct start offset
* fix prefill chunking
* fix `cache_block_seq_len` param passing
* fix `block_size` fmtstr
* fix output tokens
* fix batch resizing
* fix masking of finished sequences
* add debug util
* free unused block early
* add ntk scaling and logn scaling
* cmake flags
* fix typo
* w4a16 for sm75
* fix msvc build
* fix msvc build
* fix block verification
* fix msvc build
* use `std::shuffle`
* fix lint
* fix lint
* fix lint
* clear incoming buffer
* clear finished requests
* fix batch initialization
* fix typo
* fix typo
* fix comparison
- 18 Sep, 2023 1 commit
q.yao authored
* support actual seqlen
* fix lint
* update variable types
* lint
* update type
* fix lint
- 29 Aug, 2023 1 commit
q.yao authored
* first
* fix causal mask
* disable flash attention2 on sm70
* fix 2
* update readme
* clang-format
* disable ft2 on windows
* fix lint
* fix build
* fix build
* fix long kv seq
* fix lint
* sync copy output

Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
Co-authored-by: irexyc <irexyc@gmail.com>
- 25 Jul, 2023 1 commit
q.yao authored
Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
- 21 Jul, 2023 1 commit
Li Zhang authored
* add GQA for llama2
* fix model conversion
* fix lint & remove dev log
* update news
* minor
* fix allocation size
* fix split_dim for w_qkv.bias
- 01 Jul, 2023 3 commits
- 28 Jun, 2023 1 commit
tpoisonooo authored
* feat(src): add int8 and compile passed
* feat(kernels): fix
* feat(llama): update kernel
* feat(src): add debug
* fix(kernel): k_cache use int8_t pointer
* style(llama): clean code
* feat(deploy.py): revert to enable fmha
* style(LlamaV2): clean code
* feat(deploy.py): add default quant policy
- 20 Jun, 2023 1 commit
Li Zhang authored
* add ft code
* gitignore
* fix lint
* revert fmha