- 15 Dec, 2023 2 commits
q.yao authored
* Add bf16 template sp
* prepare merge
* add enable bf
* add bf16 decode attention support
* fix python lint
* fix yapf
* fix c format
* c format11
* fix cast
* fix on sm<80
* fix linux bf162 cast
* fix type cast
* fix lint
* support from hf pretrained
* fix pybind
* fix converter
* add trust remote code
* fix comment
* fix convert qwen
* fix lint
* fix baichuan
* update weight map
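The bf16 items above depend on hardware support: bf16 tensor-core math only exists on sm_80 (Ampere) and newer GPUs, which is what the "fix on sm<80" entry alludes to. Below is a minimal sketch of that kind of capability check on the Python side; `pick_kv_dtype` is a hypothetical helper, not part of the actual converter or runtime API.

```python
import torch

def pick_kv_dtype() -> torch.dtype:
    # Hypothetical helper: bf16 needs sm_80 (Ampere) or newer, so older GPUs
    # fall back to fp16, the situation behind the "fix on sm<80" item.
    if torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return torch.bfloat16
    return torch.float16
```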
Chen Xin authored
* support image_embs input
* add some checks
* update interactive/config.pbtxt && TurbomindModelConfig
* update docstring
* refactor
* support convert embeddings to bf16
* update interactive/config.pbtxt
* embeddings -> input_embeddings
* use input_embedding_ranges
* remove embedding_begins/ends
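The `input_embeddings` / `input_embedding_ranges` items describe feeding precomputed embeddings (for example image features) alongside token ids and splicing them into the prompt at given positions, with an optional bf16 conversion. The sketch below only illustrates that splice; the function name and tensor layout are assumptions, not the actual TurboMind interface.

```python
import torch

def merge_input_embeddings(token_embeds, input_embeddings, input_embedding_ranges):
    # token_embeds: [seq_len, hidden] embeddings looked up from the vocab.
    # input_embeddings: list of [n_i, hidden] tensors (e.g. image features).
    # input_embedding_ranges: list of (begin, end) positions to overwrite,
    # where end - begin must equal n_i for each entry.
    out = token_embeds.clone()
    for emb, (begin, end) in zip(input_embeddings, input_embedding_ranges):
        out[begin:end] = emb.to(out.dtype)  # covers the bf16 conversion case
    return out
```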
- 06 Dec, 2023 1 commit
Lyu Han authored
- 04 Dec, 2023 1 commit
Li Zhang authored
* Unify prefill and decode passes
* dynamic split-fuse
* refactor
* correct input count calculation
* remove unused
* lint
* lint
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
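Dynamic split-fuse (the unified prefill/decode pass above) chunks long prompts and batches the chunks together with single-token decode steps, so every iteration processes a bounded number of tokens. The toy scheduler below only illustrates that batching idea; the names and token-budget policy are assumptions, not the actual TurboMind scheduling logic.

```python
def build_splitfuse_batch(decode_seqs, prefill_seqs, token_budget):
    """Illustrative split-fuse step: each running decode sequence contributes
    one token, and pending prefills are chunked to fill the remaining budget
    so both kinds of work share a single forward pass."""
    batch = [(seq_id, 1) for seq_id in decode_seqs]       # one new token each
    budget = token_budget - len(decode_seqs)
    for seq_id, remaining_prompt_len in prefill_seqs:
        if budget <= 0:
            break
        chunk = min(remaining_prompt_len, budget)
        batch.append((seq_id, chunk))                     # a prefill chunk
        budget -= chunk
    return batch
```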
- 28 Nov, 2023 1 commit
q.yao authored
- 20 Nov, 2023 1 commit
Li Zhang authored
* tmp
* update
* update
* optimize for throughput
* update
* fix eos
* clean up
* fix serving
* fix indexed copy
* minor
* minor

Co-authored-by: lvhan028 <lvhan_028@163.com>
- 10 Nov, 2023 1 commit
Li Zhang authored
* refresh decoder attention kernel
* block-level kv cache
* `BlockManager` & `SequenceManager`
* update
* update
* update
* update
* rename
* GQA support
* fix context length
* GQA dispatch
* kv8
* tune
* async stream cb
* nvtx
* config parsing
* debug
* optimize output cost
* split-k decoding
* minor
* truncate `session_len` by available blocks
* minor
* license
* fix
* dispatch `cp.async`
* fix linking
* fix
* fix deadlock
* guard input length
* correct start offset
* fix prefill chunking
* fix `cache_block_seq_len` param passing
* fix `block_size` fmtstr
* fix output tokens
* fix batch resizing
* fix masking of finished sequences
* add debug util
* free unused block early
* add ntk scaling and logn scaling
* cmake flags
* fix typo
* w4a16 for sm75
* fix msvc build
* fix msvc build
* fix block verification
* fix msvc build
* use `std::shuffle`
* fix lint
* fix lint
* fix lint
* clear incoming buffer
* clear finished requests
* fix batch initialization
* fix typo
* fix typo
* fix comparison
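The `BlockManager` / `SequenceManager` items replace the per-sequence contiguous KV cache with a pool of fixed-size blocks (`cache_block_seq_len` tokens each) that sequences acquire as they grow and give back when they finish, which is what makes "truncate `session_len` by available blocks" and "free unused block early" possible. The class below is only a toy Python illustration of that bookkeeping; the real managers are C++ and their interfaces differ.

```python
class ToyBlockManager:
    """Toy block-level KV cache bookkeeping (illustrative only)."""

    def __init__(self, num_blocks, block_seq_len):
        self.block_seq_len = block_seq_len
        self.free = list(range(num_blocks))
        self.owned = {}                        # seq_id -> list of block ids

    def blocks_needed(self, seq_len):
        return -(-seq_len // self.block_seq_len)          # ceil division

    def allocate(self, seq_id, seq_len):
        have = self.owned.setdefault(seq_id, [])
        need = self.blocks_needed(seq_len) - len(have)
        if need > len(self.free):
            return False                       # caller must evict or wait
        have.extend(self.free.pop() for _ in range(max(need, 0)))
        return True

    def release(self, seq_id):
        # Return a finished sequence's blocks to the free pool.
        self.free.extend(self.owned.pop(seq_id, []))
```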
- 11 Oct, 2023 1 commit
akhoroshev authored
- 26 Sep, 2023 1 commit
akhoroshev authored
* cuda allocator fix
* graceful termination
* lint and compilation fix
- 18 Sep, 2023 1 commit
q.yao authored
* support actual seqlen
* fix lint
* update variable types
* lint
* update type
* fix lint
- 24 Aug, 2023 1 commit
Lyu Han authored
* Pad tok_embedding and output weights to make their shape divisible by TP
* update
* update
* update
* update
* update llamaBatch
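Padding the token-embedding and output weights so the vocab dimension divides evenly by the tensor-parallel world size lets every rank hold an equal slice. The helper below sketches that padding with zero rows, assuming PyTorch tensors; it is not the actual converter code.

```python
import torch

def pad_vocab_for_tp(weight: torch.Tensor, tp: int) -> torch.Tensor:
    # weight: [vocab_size, hidden]; append zero rows until vocab_size % tp == 0.
    vocab_size, hidden = weight.shape
    pad = (-vocab_size) % tp
    if pad == 0:
        return weight
    return torch.cat([weight, weight.new_zeros(pad, hidden)], dim=0)
```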
- 18 Aug, 2023 1 commit
Li Zhang authored
* qwen support
* dynamic ntk & logn attn
* fix ntk & add chat template
* fix ntk scaling & stop words
* fix lint
* add tiktoken to requirements.txt
* fix tokenizer, set model format automatically
* update model.py
* update readme
* fix lint
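Dynamic NTK scaling enlarges the RoPE base once the sequence outgrows the trained context window, and LogN attention scaling rescales queries logarithmically past that window; these are the Qwen-style long-context tricks named above. The functions below show one common formulation of each; the exact formulas used in the implementation may differ.

```python
import math

def ntk_scaled_rope_base(base: float, head_dim: int, seq_len: int, trained_len: int) -> float:
    # One common dynamic-NTK rule: stretch the RoPE base in proportion to how
    # far the sequence exceeds the trained window.
    if seq_len <= trained_len:
        return base
    return base * (seq_len / trained_len) ** (head_dim / (head_dim - 2))

def logn_attn_scale(position: int, trained_len: int) -> float:
    # LogN scaling: multiply queries by log_{trained_len}(position) beyond the
    # trained window so attention magnitudes stay roughly stable.
    if position <= trained_len:
        return 1.0
    return math.log(position) / math.log(trained_len)
```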
- 17 Aug, 2023 1 commit
Chen Xin authored
* __PRETTY_FUNCTION__
* CASE_K
* uint
* remove not
* HALF_FLT_MAX
* struct init
* port utils
* better build pthread-win32
* port kernels
* port utils/gemm_test
* hide windows header
* port models
* port examples && triton_backend && unittests
* update build readme
* fix lint
* fix lint
* fix lint
* fix lint
* fix lint
* fix build
* fix build
* cmake version
* fix typos
* update ci
* port kernels/gemm_s_f16
* update ci
* fix ci
* use cudaStreamSynchronize instead of volatile check
* remove gettimeofday
* remove pthread-win32
* remove dirent.h
* update pre-commit
* update
* remove todo
* fix include
* fix build
* fix build
* fix build ci
* fix github action trigger
* update README
* fix linux-build ci
* remove windows folder
* fix lint
* update readme
- 21 Jul, 2023 1 commit
Li Zhang authored
* add GQA for llama2
* fix model conversion
* fix lint & remove dev log
* update news
* minor
* fix allocation size
* fix split_dim for w_qkv.bias
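Grouped-query attention (GQA), used by Llama 2 70B, shares each KV head across a group of query heads, which shrinks the KV cache by the group factor. One simple way to picture it is repeating the KV heads so an ordinary multi-head attention kernel still applies, as sketched below; a dedicated kernel would instead index the shared heads directly. The tensor layout here is an assumption.

```python
import torch

def expand_kv_for_gqa(k: torch.Tensor, v: torch.Tensor, num_q_heads: int):
    # k, v: [batch, num_kv_heads, seq_len, head_dim]; each kv head serves
    # num_q_heads // num_kv_heads query heads.
    num_kv_heads = k.shape[1]
    group = num_q_heads // num_kv_heads
    return (k.repeat_interleave(group, dim=1),
            v.repeat_interleave(group, dim=1))
```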
- 01 Jul, 2023 3 commits
- 28 Jun, 2023 1 commit
tpoisonooo authored
* feat(src): add int8 and compile passed
* feat(kernels): fix
* feat(llama): update kernel
* feat(src): add debug
* fix(kernel): k_cache use int8_t pointer
* style(llama): clean code
* feat(deploy.py): revert to enable fmha
* style(LlamaV2): clean code
* feat(deploy.py): add default quant policy
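The int8 items above quantize the KV cache to `int8_t` and dequantize inside the attention kernel, with the behaviour selected by the quant policy added to `deploy.py`. The round-trip below is only a generic sketch of symmetric per-tensor int8 quantization, not the kernel's actual scheme.

```python
import torch

def quantize_kv_int8(kv: torch.Tensor):
    # Symmetric per-tensor quantization: one fp scale plus int8 values.
    scale = kv.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((kv / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_kv_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Inverse step, as would happen on the fly inside the attention kernel.
    return q.to(torch.float16) * scale
```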
- 20 Jun, 2023 1 commit
Li Zhang authored
* add ft code
* gitignore
* fix lint
* revert fmha