Commits · 2ba9082289e953d39450d97f802b490a9126890a · OpenDAS / Lmdeploy

29 Nov, 2023 2 commits

improvement(build): enable ninja and gold linker (#767) · 8add942d

tpoisonooo authored Nov 29, 2023

* feat(build): enable ninja and lld

* fix(.github): add ninja installation

* fix(CI): remove dimsize=256

* fix(CI): add option for generate.sh

* fix(docs): update

8add942d

fix turbomind build on sm<80 (#754) · 8c672a7b
q.yao authored Nov 29, 2023
```
* fix

* fix lint
```
8c672a7b

28 Nov, 2023 1 commit
- fix typo (#769) · 2f80c556
  q.yao authored Nov 28, 2023
  
  2f80c556
23 Nov, 2023 2 commits
- [Fix] Skip empty batch (#747) · a7c5007c
  Li Zhang authored Nov 23, 2023
  
  a7c5007c
- Fix cache/output length calculation (#738) · 434961c6
  Li Zhang authored Nov 23, 2023
  
  434961c6
22 Nov, 2023 1 commit

Support loading hf model directly (#685) · 6b00f623

Chen Xin authored Nov 22, 2023

* turbomind support export model params

* fix overflow

* support turbomind.from_pretrained

* fix tp

* support AutoModel

* support load kv qparams

* update auto_awq

* update docstring

* export lmdeploy version

* update doc

* remove download_hf_repo

* LmdeployForCausalLM -> LmdeployForCausalLM

* refactor turbomind.py

* update comment

* add bfloat16 convert back

* support gradio run_local load hf

* support restful api server load hf

* add docs

* support loading previous quantized model

* adapt pr 690

* update docs

* not export turbomind config when quantize a model

* check model_name when can not get it from config.json

* update readme

* remove model_name in auto_awq

* update

* update

* update

* fix build

* absolute import

6b00f623

20 Nov, 2023 1 commit

Optimize for throughput (#701) · 911c0a85

Li Zhang authored Nov 20, 2023



* tmp

* update

* update

* optimize for throughput

* update

* fix eos

* clean up

* fix serving

* fix indexed copy

* minor

* minor

---------
Co-authored-by: lvhan028 <lvhan_028@163.com>

911c0a85

14 Nov, 2023 1 commit
- Fix init of batch state (#682) · 4eb8dd83
  Li Zhang authored Nov 14, 2023
```
* fix init of finished buf

* fix `finished_count`
```
  4eb8dd83
10 Nov, 2023 1 commit

TurboMind 2 (#590) · ab1767cf

Li Zhang authored Nov 10, 2023

* refresh decoder attention kernel

* block-level kv cache

* `BlockManager` & `SequenceManager`

* update

* update

* update

* update

* rename

* GQA support

* fix context length

* GQA dispatch

* kv8

* tune

* async stream cb

* nvtx

* config parsing

* debug

* optimize output cost

* split-k decoding

* minor

* truncate `session_len` by available blocks

* minor

* license

* fix

* dispatch `cp.async`

* fix linking

* fix

* fix deadlock

* guard input length

* correct start offset

* fix prefill chunking

* fix `cache_block_seq_len` param passing

* fix `block_size` fmtstr

* fix output tokens

* fix batch resizing

* fix masking of finished sequences

* add debug util

* free unused block early

* add ntk scaling and logn scaling

* cmake flags

* fix typo

* w4a16 for sm75

* fix msvc build

* fix msvc build

* fix block verification

* fix msvc build

* use `std::shuffle`

* fix lint

* fix lint

* fix lint

* clear incoming buffer

* clear finished requests

* fix batch initialization

* fix typo

* fix typo

* fix comparison

ab1767cf

11 Oct, 2023 1 commit
- [bug] fix mismatched shape for decoder output tensor (#517) · 0d2a151e
  akhoroshev authored Oct 11, 2023
  
  0d2a151e
26 Sep, 2023 3 commits
- Fix memory leak (#488) · 5d87c20f
  Lyu Han authored Sep 26, 2023
```
* Fix memory leak

* modern c++
```
  5d87c20f
- fix race condition (#460) · a54e3e09
  akhoroshev authored Sep 26, 2023
  
  a54e3e09
- [feature] Graceful termination of background threads in LlamaV2 (#458) · 0cc667e1
  akhoroshev authored Sep 26, 2023
```
* cuda allocator fix

* graceful termination

* lint and compilation fix
```
  0cc667e1
18 Sep, 2023 2 commits

[Fix] Support actual seqlen in flash-attention2 (#418) · abe9f7bd

q.yao authored Sep 18, 2023

* support actual seqlen

* fix lint

* update variable types

* lint

* update type

* fix lint

abe9f7bd

Reduce gil switching (#407) · d44a8bfe

Chen Xin authored Sep 18, 2023

* reduce gil switching

* ffi lock func

* remove unused

* remove unused

* remove unused

d44a8bfe

11 Sep, 2023 1 commit

Support codellama (#359) · 65c662f9

Lyu Han authored Sep 11, 2023

* tmp

* add demo for codellama inference

* update

* update

* update

* update codellama.md

* export rope_theta

* update

* update doc

* fix client.py

* define SamplingParam

* rollback 'end'

* rotary_emb_base to rotary_embedding_base

* change to baichuan2-7b

65c662f9

01 Sep, 2023 1 commit
- Package 'bin/llama_gemm' to wheel (#320) · 22e8b2ca
  Chen Xin authored Sep 01, 2023
```
* pack llama_gemm

* update CMakeLists.txt

* remove candidate

* update MANIFEST.in
```
  22e8b2ca
29 Aug, 2023 1 commit

Add flashattention2 (#196) · 452822a4

q.yao authored Aug 29, 2023



* first

* fix causal mask

* disable flash attention2 on sm70

* fix 2

* update readme

* clang-format

* disable ft2 on windows

* fix lint

* fix build

* fix build

* fix long kv seq

* fix lint

* sync copy output

---------
Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
Co-authored-by: irexyc <irexyc@gmail.com>

452822a4

24 Aug, 2023 1 commit

Pad tok_embedding and output weights to make their shape divisible by TP (#285) · 4903d3cc

Lyu Han authored Aug 24, 2023

* Pad tok_embedding and output weights to make their shape divisible by TP

* update

* update

* update

* update

* update llamaBatch

4903d3cc

22 Aug, 2023 1 commit
- [Fix] Fix building with CUDA 11.3 (#280) · 9e366482
  Li Zhang authored Aug 22, 2023
```
* disable cache hint for CUDA < 11.4

* fix lint

* fix lint

* fix cuda-11.3 build
```
  9e366482
18 Aug, 2023 1 commit

[Feature] Support Qwen-7B, dynamic NTK scaling and logN scaling in turbomind (#230) · 4a60b45d

Li Zhang authored Aug 18, 2023

* qwen support

* dynamic ntk & logn attn

* fix ntk & add chat template

* fix ntk scaling & stop words

* fix lint

* add tiktoken to requirements.txt

* fix tokenizer, set model format automatically

* update model.py

* update readme

* fix lint

4a60b45d

17 Aug, 2023 1 commit

Support windows platform (#209) · 4c9959f6

Chen Xin authored Aug 17, 2023

* __PRETTY_FUNCTION__

* CASE_K

* uint

* remove not

* HALF_FLT_MAX

* struct init

* port utils

* better build pthread-win32

* port kernels

* port utils/gemm_test

* hide windows header

* port models

* port examples && triton_backend && unittests

* update build readme

* fix lint

* fix lint

* fix lint

* fix lint

* fix lint

* fix build

* fix build

* cmake version

* fix typos

* update ci

* port kernels/gemm_s_f16

* update ci

* fix ci

* use cudaStreamSynchronize instead of volatile check

* remove gettimeofday

* remove pthread-win32

* remove dirent.h

* update pre-commit

* update

* remove todo

* fix include

* fix build

* fix build

* fix build ci

* fix github action trigger

* update README

* fix linux-build ci

* remove windows folder

* fix lint

* update readme

4c9959f6

14 Aug, 2023 2 commits

feat(quantization): kv cache use asymmetric (#218) · 902a3e16
tpoisonooo authored Aug 14, 2023
```
* feat(quantization): kv cache use asymmetric
```
902a3e16

[Feature] Blazing fast W4A16 inference (#202) · c3290cad

Li Zhang authored Aug 14, 2023

* add w4a16

* fix `deploy.py`

* add doc

* add w4a16 kernels

* fuse w1/w3 & bugfixes

* fix typo

* python

* guard sm75/80 features

* add missing header

* refactor

* qkvo bias

* update cost model

* fix lint

* update `deploy.py`

c3290cad

31 Jul, 2023 2 commits
- Support Runtime tensor parallelism (#158) · 4767b04d
  q.yao authored Jul 31, 2023
```
* works on internlm and vicuna

* support GQA

* remove comment

* update readme, add logger, default tp=1

* remove log
```
  4767b04d
- [Fix] Remove unused code to reduce binary size (#181) · 981a4610
  Li Zhang authored Jul 31, 2023
```
* clean-up

* fix lint

* fix lint
```
  981a4610
25 Jul, 2023 1 commit
- support fmha gqa (#160) · 5ed6bb59
  q.yao authored Jul 25, 2023
```
Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
```
  5ed6bb59
24 Jul, 2023 1 commit
- [Feature] decode-only forward pass (#153) · 0cc9d095
  Li Zhang authored Jul 24, 2023
```
* decode only forward pass

* fix lint

* batch embedding
```
  0cc9d095
21 Jul, 2023 1 commit

[Feature] Support Llama-2 with GQA (#147) · f07b697b

Li Zhang authored Jul 21, 2023

* add GQA for llama2

* fix model conversion

* fix lint & remove dev log

* update news

* minor

* fix allocation size

* fix split_dim for w_qkv.bias

f07b697b

04 Jul, 2023 1 commit
- use format-11.1 (#38) · 5ea40abf
  AllentDan authored Jul 04, 2023
```
* format-11.1

* md-link-config
```
  5ea40abf
01 Jul, 2023 3 commits
- Change target tritonfastertransformerbackend to tritonturbomindbackend (#36) · 70e6ab26
  lvhan028 authored Jul 01, 2023
```
* change target tritonfastertransformerbackend to tritonturbomindbackend

* install targets to backends/turbomind

* change model_dir
```
  70e6ab26
- build turbomind (#35) · 35d64462
  lvhan028 authored Jul 01, 2023
```
* build turbomind

* change namespace fastertransformer to turbomind

* change logger name
```
  35d64462
- rename src/fastertransformer to src/turbomind (#33) · 53d2e42c
  lvhan028 authored Jul 01, 2023
  
  53d2e42c