Commits · cad705121e616767c98f466ec76e424fa9828e69 · OpenDAS / Lmdeploy

17 May, 2024 1 commit
- 1. Remove dependencies not supported on DCU; 2. Support gcc7 · cad70512
  zhouxiang authored May 17, 2024
  
  cad70512
29 Apr, 2024 1 commit
- Support switching between mixed precision and half precision · 669dc816
  zhouxiang authored Apr 29, 2024
  
  669dc816
21 Mar, 2024 1 commit
- Remove the blaslt dependency · 5f83e392
  zhouxiang authored Mar 21, 2024
  
  5f83e392
04 Mar, 2024 1 commit
- Fix the original framework's unfused_attention not supporting inputs longer than 4k · 2e528580
  zhouxiang authored Mar 04, 2024
  
  2e528580
12 Jan, 2024 1 commit
- Align with the latest official release 0.1.0, which mainly adds paged attention · 3253240a
  xiabo authored Jan 12, 2024
```
Update test cases
```
  3253240a
20 Dec, 2023 1 commit
- Adapt to 0.1.0 · 9484fd1c
  xiabo authored Dec 20, 2023
  
  9484fd1c
18 Dec, 2023 1 commit
- Return the iterator after erasing it from a map (#864) · d29b70ae
  Chen Xin authored Dec 18, 2023
  
  d29b70ae
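
The title of d29b70ae refers to a standard C++ idiom: `std::map::erase(iterator)` invalidates the erased iterator and, since C++11, returns the iterator to the next element. The actual diff is not shown on this page, so the following is only a generic sketch of that idiom; the helper name `EraseEmpty` and the `sessions` map are hypothetical and not taken from the repository.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not from the repository): drop every entry whose value is empty.
// erase() returns the iterator following the erased element, so the loop continues
// from that return value instead of advancing the now-invalid iterator.
std::size_t EraseEmpty(std::map<int, std::string>& sessions)
{
    std::size_t erased = 0;
    for (auto it = sessions.begin(); it != sessions.end();) {
        if (it->second.empty()) {
            it = sessions.erase(it);  // safe: use the returned iterator
            ++erased;
        }
        else {
            ++it;  // advance only when nothing was erased
        }
    }
    return erased;
}
```

Calling `EraseEmpty` on a map with two empty values removes both and leaves the other entries untouched; advancing `it` after `erase(it)` without using the return value would be undefined behavior.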
15 Dec, 2023 2 commits

Support turbomind bf16 (#803) · 3295eac3

q.yao authored Dec 15, 2023

* Add bf16 template sp

* prepare merge

* add enable bf

* add bf16 decode attention support

* fix python lint

* fix yapf

* fix c format

* c format11

* fix cast

* fix on sm<80

* fix linux bf162 cast

* fix type cast

* fix lint

* support from hf pretrained

* fix pybind

* fix converter

* add trust remote code

* fix comment

* fix convert qwen

* fix lint

* fix baichuan

* update weight map

3295eac3

support image_embs input (#799) · b190521b

Chen Xin authored Dec 15, 2023

* support image_embs input

* add some checks

* update interactive/config.pbtxt && TurbomindModelConfig

* update docstring

* refactor

* support convert embeddings to bf16

* update interactive/config.pbtxt

* embeddings -> input_embeddings

* use input_embedding_ranges

* remove embedding_begins/ends

b190521b

12 Dec, 2023 1 commit
- fix cache verification (#821) · 72869ef8
  Li Zhang authored Dec 12, 2023
  
  72869ef8
11 Dec, 2023 3 commits
- Disable attention mask when it is not needed (#813) · b8354dae
  Li Zhang authored Dec 11, 2023
```
* disable attention mask when not needed

* fix for sm<80 and float data type
```
  b8354dae
- set smem size for repetition penalty kernel (#818) · d5a89465
  Li Zhang authored Dec 11, 2023
  
  d5a89465
- Simplify block manager (#812) · a54b16a2
  Li Zhang authored Dec 11, 2023
```
* simplify block manager

* fix lint
```
  a54b16a2
07 Dec, 2023 1 commit
- fix out of bounds access (#809) · 2d5f5b30
  Li Zhang authored Dec 07, 2023
  
  2d5f5b30
06 Dec, 2023 1 commit
- fix local kv head num (#806) · 5b9e454a
  Lyu Han authored Dec 06, 2023
  
  5b9e454a
04 Dec, 2023 1 commit

Unify prefill & decode passes (#775) · 7f943a26

Li Zhang authored Dec 04, 2023

* Unify prefill and decode passes

* dynamic split-fuse

* refactor

* correct input count calculation

* remove unused

* lint

* lint

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

7f943a26

02 Dec, 2023 1 commit
- Fix early exit condition in attention kernel (#788) · 816022e4
  Li Zhang authored Dec 02, 2023
  
  816022e4
29 Nov, 2023 2 commits

improvement(build): enable ninja and gold linker (#767) · 8add942d

tpoisonooo authored Nov 29, 2023

* feat(build): enable ninja and lld

* fix(.github): add ninja installation

* fix(CI): remove dimsize=256

* fix(CI): add option for generate.sh

* fix(docs): update

8add942d

fix turbomind build on sm<80 (#754) · 8c672a7b
q.yao authored Nov 29, 2023
```
* fix

* fix lint
```
8c672a7b

28 Nov, 2023 1 commit
- fix typo (#769) · 2f80c556
  q.yao authored Nov 28, 2023
  
  2f80c556
23 Nov, 2023 2 commits
- [Fix] Skip empty batch (#747) · a7c5007c
  Li Zhang authored Nov 23, 2023
  
  a7c5007c
- Fix cache/output length calculation (#738) · 434961c6
  Li Zhang authored Nov 23, 2023
  
  434961c6
22 Nov, 2023 1 commit

Support loading hf model directly (#685) · 6b00f623

Chen Xin authored Nov 22, 2023

* turbomind support export model params

* fix overflow

* support turbomind.from_pretrained

* fix tp

* support AutoModel

* support load kv qparams

* update auto_awq

* update docstring

* export lmdeploy version

* update doc

* remove download_hf_repo

* LmdeployForCausalLM -> LmdeployForCausalLM

* refactor turbomind.py

* update comment

* add bfloat16 convert back

* support gradio run_local load hf

* support restful api server load hf

* add docs

* support loading previous quantized model

* adapt pr 690

* update docs

* not export turbomind config when quantize a model

* check model_name when can not get it from config.json

* update readme

* remove model_name in auto_awq

* update

* update

* update

* fix build

* absolute import

6b00f623

20 Nov, 2023 1 commit

Optimize for throughput (#701) · 911c0a85

Li Zhang authored Nov 20, 2023



* tmp

* update

* update

* optimize for throughput

* update

* fix eos

* clean up

* fix serving

* fix indexed copy

* minor

* minor

---------
Co-authored-by: lvhan028 <lvhan_028@163.com>

911c0a85

14 Nov, 2023 1 commit
- Fix init of batch state (#682) · 4eb8dd83
  Li Zhang authored Nov 14, 2023
```
* fix init of finished buf

* fix `finished_count`
```
  4eb8dd83
10 Nov, 2023 1 commit

TurboMind 2 (#590) · ab1767cf

Li Zhang authored Nov 10, 2023

* refresh decoder attention kernel

* block-level kv cache

* `BlockManager` & `SequenceManager`

* update

* update

* update

* update

* rename

* GQA support

* fix context length

* GQA dispatch

* kv8

* tune

* async stream cb

* nvtx

* config parsing

* debug

* optimize output cost

* split-k decoding

* minor

* truncate `session_len` by available blocks

* minor

* license

* fix

* dispatch `cp.async`

* fix linking

* fix

* fix deadlock

* guard input length

* correct start offset

* fix prefill chunking

* fix `cache_block_seq_len` param passing

* fix `block_size` fmtstr

* fix output tokens

* fix batch resizing

* fix masking of finished sequences

* add debug util

* free unused block early

* add ntk scaling and logn scaling

* cmake flags

* fix typo

* w4a16 for sm75

* fix msvc build

* fix msvc build

* fix block verification

* fix msvc build

* use `std::shuffle`

* fix lint

* fix lint

* fix lint

* clear incoming buffer

* clear finished requests

* fix batch initialization

* fix typo

* fix typo

* fix comparison

ab1767cf

11 Oct, 2023 1 commit
- [bug] fix mismatched shape for decoder output tensor (#517) · 0d2a151e
  akhoroshev authored Oct 11, 2023
  
  0d2a151e
09 Oct, 2023 1 commit
- Change `shared_instance` type from `weakptr` to `shared_ptr` (#507) · 19fea86c
  Lyu Han authored Oct 09, 2023
```
* change shared_instances_ from weakptr to sharedptr

* update
```
  19fea86c
26 Sep, 2023 3 commits
- Fix memory leak (#488) · 5d87c20f
  Lyu Han authored Sep 26, 2023
```
* Fix memory leak

* modern c++
```
  5d87c20f
- fix race condition (#460) · a54e3e09
  akhoroshev authored Sep 26, 2023
  
  a54e3e09
- [feature] Graceful termination of background threads in LlamaV2 (#458) · 0cc667e1
  akhoroshev authored Sep 26, 2023
```
* cuda allocator fix

* graceful termination

* lint and compilation fix
```
  0cc667e1
18 Sep, 2023 2 commits

[Fix] Support actual seqlen in flash-attention2 (#418) · abe9f7bd

q.yao authored Sep 18, 2023

* support actual seqlen

* fix lint

* update variable types

* lint

* update type

* fix lint

abe9f7bd

Reduce gil switching (#407) · d44a8bfe

Chen Xin authored Sep 18, 2023

* reduce gil switching

* ffi lock func

* remove unused

* remove unused

* remove unused

d44a8bfe

14 Sep, 2023 1 commit
- Fix memory leak (#415) · 2dec28ae
  Chen Xin authored Sep 14, 2023
  
  2dec28ae
11 Sep, 2023 1 commit

Support codellama (#359) · 65c662f9

Lyu Han authored Sep 11, 2023

* tmp

* add demo for codellama inference

* update

* update

* update

* update codellama.md

* export rope_theta

* update

* update doc

* fix client.py

* define SamplingParam

* rollback 'end'

* rotary_emb_base to rotary_embedding_base

* change to baichuan2-7b

65c662f9

07 Sep, 2023 1 commit
- [Fix] Set max dynamic smem size for decoder MHA to support context length > 8k (#377) · 71ade772
  Lyu Han authored Sep 07, 2023
```
* Fix crash when context window size is large by setting max dynamic smem size

* fix linting
```
  71ade772
01 Sep, 2023 1 commit
- Package 'bin/llama_gemm' to wheel (#320) · 22e8b2ca
  Chen Xin authored Sep 01, 2023
```
* pack llama_gemm

* update CMakeLists.txt

* remove candidate

* update MANIFEST.in
```
  22e8b2ca
29 Aug, 2023 1 commit

Add flashattention2 (#196) · 452822a4

q.yao authored Aug 29, 2023



* first

* fix causal mask

* disable flash attention2 on sm70

* fix 2

* update readme

* clang-format

* disable ft2 on windows

* fix lint

* fix build

* fix build

* fix long kv seq

* fix lint

* sync copy output

---------
Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
Co-authored-by: irexyc <irexyc@gmail.com>

452822a4

24 Aug, 2023 1 commit

Pad tok_embedding and output weights to make their shape divisible by TP (#285) · 4903d3cc

Lyu Han authored Aug 24, 2023

* Pad tok_embedding and output weights to make their shape divisible by TP

* update

* update

* update

* update

* update llamaBatch

4903d3cc
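
The title of 4903d3cc says the token-embedding and output weights are padded so that their shape is divisible by TP (the tensor-parallel size). The round-up arithmetic this implies can be sketched as below. The function name `PadToMultiple` is hypothetical, and which dimension is padded and how the padding is filled are not shown on this page.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the repository): round `size` up to the next
// multiple of `tp`, so that every tensor-parallel rank gets an equal slice.
std::size_t PadToMultiple(std::size_t size, std::size_t tp)
{
    assert(tp > 0);                    // a zero TP size would divide by zero
    return (size + tp - 1) / tp * tp;  // integer ceiling division, then scale back
}
```

For example, a dimension of 32001 with a TP size of 2 is padded to 32002, while a dimension that is already divisible, such as 32000 with a TP size of 4, is left unchanged.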

22 Aug, 2023 1 commit
- [Fix] Fix building with CUDA 11.3 (#280) · 9e366482
  Li Zhang authored Aug 22, 2023
```
* disable cache hint for CUDA < 11.4

* fix lint

* fix lint

* fix cuda-11.3 build
```
  9e366482