- 27 May, 2024 1 commit

gaoqiong authored

- 20 Dec, 2023 1 commit

xiabo authored

- 15 Dec, 2023 1 commit

q.yao authored
* Add bf16 template sp
* prepare merge
* add enable bf
* add bf16 decode attention support
* fix python lint
* fix yapf
* fix c format
* c format11
* fix cast
* fix on sm<80
* fix linux bf162 cast
* fix type cast
* fix lint
* support from hf pretrained
* fix pybind
* fix converter
* add trust remote code
* fix comment
* fix convert qwen
* fix lint
* fix baichuan
* update weight map
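The commit above adds bf16 paths to the decode attention kernels and teaches the converter to load Hugging Face checkpoints directly, including models that need `trust_remote_code`. A minimal sketch of the loading side using the standard transformers API (the repo id below is a placeholder, and lmdeploy's own loader differs); note the sm<80 guard the commit also fixes, since bf16 tensor cores require Ampere or newer:

```python
# Minimal sketch (not lmdeploy's actual loader): pick bf16 only where the
# hardware supports it, and load HF checkpoints with trust_remote_code.
import torch
from transformers import AutoModelForCausalLM

def pick_dtype() -> torch.dtype:
    # bf16 support starts at compute capability 8.0 (Ampere);
    # fall back to fp16 on older GPUs such as sm75.
    major, _ = torch.cuda.get_device_capability()
    return torch.bfloat16 if major >= 8 else torch.float16

model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm-chat-7b",   # placeholder HF repo id
    torch_dtype=pick_dtype(),
    trust_remote_code=True,        # needed for models shipping custom code
)
```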
- 22 Nov, 2023 1 commit

Chen Xin authored
* turbomind support export model params
* fix overflow
* support turbomind.from_pretrained
* fix tp
* support AutoModel
* support load kv qparams
* update auto_awq
* update docstring
* export lmdeploy version
* update doc
* remove download_hf_repo
* LmdeployForCausalLM -> LmdeployForCausalLM
* refactor turbomind.py
* update comment
* add bfloat16 convert back
* support gradio run_local load hf
* support restful api server load hf
* add docs
* support loading previously quantized model
* adapt pr 690
* update docs
* do not export turbomind config when quantizing a model
* check model_name when it cannot be read from config.json
* update readme
* remove model_name in auto_awq
* update
* update
* update
* fix build
* absolute import
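This commit introduces `turbomind.from_pretrained`, which can consume either a plain Hugging Face checkpoint or a previously exported (possibly quantized) TurboMind workspace. A hypothetical sketch of that dispatch; the class and helper names and the `triton_models` marker directory are assumptions rather than lmdeploy's exact API:

```python
# Hypothetical from_pretrained flow: reuse an already-exported workspace
# when one exists, otherwise convert the HF checkpoint first.
import os

class TurboMindModel:
    def __init__(self, workspace: str):
        self.workspace = workspace  # directory holding exported params

    @classmethod
    def from_pretrained(cls, model_path: str) -> "TurboMindModel":
        if os.path.isdir(os.path.join(model_path, "triton_models")):
            return cls(model_path)  # previously exported or quantized model
        return cls(cls._convert_hf(model_path))

    @staticmethod
    def _convert_hf(model_path: str) -> str:
        # placeholder for the HF -> TurboMind weight export step
        workspace = model_path + "-turbomind"
        os.makedirs(workspace, exist_ok=True)
        return workspace
```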
- 10 Nov, 2023 1 commit

Li Zhang authored
* refresh decoder attention kernel
* block-level kv cache
* `BlockManager` & `SequenceManager`
* update
* update
* update
* update
* rename
* GQA support
* fix context length
* GQA dispatch
* kv8
* tune
* async stream cb
* nvtx
* config parsing
* debug
* optimize output cost
* split-k decoding
* minor
* truncate `session_len` by available blocks
* minor
* license
* fix
* dispatch `cp.async`
* fix linking
* fix
* fix deadlock
* guard input length
* correct start offset
* fix prefill chunking
* fix `cache_block_seq_len` param passing
* fix `block_size` fmtstr
* fix output tokens
* fix batch resizing
* fix masking of finished sequences
* add debug util
* free unused block early
* add ntk scaling and logn scaling
* cmake flags
* fix typo
* w4a16 for sm75
* fix msvc build
* fix msvc build
* fix block verification
* fix msvc build
* use `std::shuffle`
* fix lint
* fix lint
* fix lint
* clear incoming buffer
* clear finished requests
* fix batch initialization
* fix typo
* fix typo
* fix comparison
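The centerpiece of this commit is the block-level KV cache managed by `BlockManager`/`SequenceManager`: cache memory is carved into fixed-size blocks, `session_len` is truncated to what the free blocks can hold, and blocks of finished sequences are recycled early. An illustrative sketch of that bookkeeping (names and the per-block token count are assumptions, not the TurboMind implementation):

```python
# Toy block-level KV-cache bookkeeping: a pool of fixed-size blocks,
# sequences hold lists of block ids, finished sequences free them early.
class BlockManager:
    def __init__(self, num_blocks: int, block_seq_len: int):
        self.block_seq_len = block_seq_len       # tokens per block
        self.free = list(range(num_blocks))      # ids of unused blocks

    def alloc(self, num_tokens: int) -> list[int]:
        need = -(-num_tokens // self.block_seq_len)  # ceil division
        if need > len(self.free):
            raise RuntimeError("out of KV-cache blocks")
        blocks, self.free = self.free[:need], self.free[need:]
        return blocks

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)                 # reusable immediately

mgr = BlockManager(num_blocks=64, block_seq_len=128)
seq_blocks = mgr.alloc(num_tokens=300)   # -> 3 blocks for 300 tokens
mgr.release(seq_blocks)                  # freed as soon as the sequence ends
```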
- 17 Aug, 2023 1 commit

Chen Xin authored
* __PRETTY_FUNCTION__
* CASE_K
* uint
* remove not
* HALF_FLT_MAX
* struct init
* port utils
* better build pthread-win32
* port kernels
* port utils/gemm_test
* hide windows header
* port models
* port examples && triton_backend && unittests
* update build readme
* fix lint
* fix lint
* fix lint
* fix lint
* fix lint
* fix build
* fix build
* cmake version
* fix typos
* update ci
* port kernels/gemm_s_f16
* update ci
* fix ci
* use cudaStreamSynchronize instead of volatile check
* remove gettimeofday
* remove pthread-win32
* remove dirent.h
* update pre-commit
* update
* remove todo
* fix include
* fix build
* fix build
* fix build ci
* fix github action trigger
* update README
* fix linux-build ci
* remove windows folder
* fix lint
* update readme
- 14 Aug, 2023 2 commits

tpoisonooo authored
* feat(quantization): kv cache use asymmetric
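This commit switches KV-cache int8 quantization from the symmetric scheme (see the 28 Jun entry below) to an asymmetric one: a zero point shifts the observed min..max range onto the full int8 span, so no quantization levels are wasted when keys or values are skewed. A sketch with per-tensor granularity assumed for brevity:

```python
# Asymmetric int8 quantization round trip for a KV-cache tensor.
# Per-tensor scale/zero-point is an illustrative assumption.
import numpy as np

def quant_asym(x: np.ndarray):
    lo, hi = float(x.min()), float(x.max())
    scale = max((hi - lo) / 255.0, 1e-8)      # avoid div-by-zero on constants
    zero_point = round(-lo / scale)           # shifts lo to quant level 0
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequant_asym(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

k = np.random.randn(8, 64).astype(np.float32) + 0.7   # skewed distribution
q, s, z = quant_asym(k)
err = np.abs(dequant_asym(q, s, z) - k).max()          # small reconstruction error
```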
Li Zhang authored
* add w4a16
* fix `deploy.py`
* add doc
* add w4a16 kernels
* fuse w1/w3 & bugfixes
* fix typo
* python
* guard sm75/80 features
* add missing header
* refactor
* qkvo bias
* update cost model
* fix lint
* update `deploy.py`
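w4a16 stores weights as 4-bit integers with a fp16 scale per group and dequantizes them to fp16 inside the GEMM while activations stay fp16 throughout. A reference illustration in numpy (the group size of 128 is an assumed value; the real work happens in the CUDA kernels this commit adds):

```python
# Group-wise 4-bit weight quantization: two nibbles packed per byte,
# one fp16 scale per group, dequantized to floating point at matmul time.
import numpy as np

GROUP = 128  # assumed quantization group size

def quant_w4(w: np.ndarray):
    g = w.reshape(-1, GROUP)
    scale = np.maximum(np.abs(g).max(axis=1, keepdims=True) / 7.0, 1e-8)
    q = (np.clip(np.round(g / scale), -8, 7) + 8).astype(np.uint8)  # 0..15
    packed = q[:, ::2] | (q[:, 1::2] << 4)       # two 4-bit values per byte
    return packed, scale.astype(np.float16)

def dequant_w4(packed: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    q = np.empty((packed.shape[0], GROUP), dtype=np.uint8)
    q[:, ::2] = packed & 0x0F                    # low nibble
    q[:, 1::2] = packed >> 4                     # high nibble
    return ((q.astype(np.float32) - 8.0) * scale.astype(np.float32)).reshape(shape)

w = np.random.randn(256, 256).astype(np.float32)
packed, scale = quant_w4(w)                  # ~4.5 bits/weight incl. fp16 scales
w_hat = dequant_w4(packed, scale, w.shape)   # fp16 on device; fp32 here
```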
- 31 Jul, 2023 1 commit

q.yao authored
* works on internlm and vicuna
* support GQA
* remove comment
* update readme, add logger, default tp=1
* remove log
- 21 Jul, 2023 1 commit

Li Zhang authored
* add GQA for llama2
* fix model conversion
* fix lint & remove dev log
* update news
* minor
* fix allocation size
* fix split_dim for w_qkv.bias
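In grouped-query attention (GQA), several query heads share one KV head, shrinking the KV cache by the group factor; Llama 2 70B uses 8 KV heads for 64 query heads. A sketch of the usual expand-KV-before-attention trick (head counts below are illustrative):

```python
# Expand K/V along the head axis so each group of query heads sees
# its shared KV head, then run standard multi-head attention.
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    # kv: [batch, num_kv_heads, seq_len, head_dim]
    b, h_kv, s, d = kv.shape
    return kv[:, :, None].expand(b, h_kv, n_rep, s, d).reshape(b, h_kv * n_rep, s, d)

q = torch.randn(1, 64, 16, 128)        # 64 query heads
k = torch.randn(1, 8, 16, 128)         # only 8 KV heads (8x smaller cache)
k_full = repeat_kv(k, n_rep=64 // 8)   # -> [1, 64, 16, 128] for attention
```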
- 01 Jul, 2023 3 commits

- 28 Jun, 2023 1 commit

tpoisonooo authored
* feat(src): add int8 and compile passed
* feat(kernels): fix
* feat(llama): update kernel
* feat(src): add debug
* fix(kernel): k_cache use int8_t pointer
* style(llama): clean code
* feat(deploy.py): revert to enable fmha
* style(LlamaV2): clean code
* feat(deploy.py): add default quant policy
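This commit introduces the original symmetric int8 KV cache: values are scaled into -127..127 and the cache buffer is addressed through an int8_t pointer on the kernel side, halving (vs fp16) the per-token cache footprint. A numpy round-trip sketch, with per-tensor scaling assumed for illustration (the 14 Aug entry above later replaces this with an asymmetric scheme):

```python
# Symmetric int8 quantization round trip: a single scale, no zero point.
import numpy as np

def quant_sym(x: np.ndarray):
    scale = max(float(np.abs(x).max()) / 127.0, 1e-8)
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def dequant_sym(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

k = np.random.randn(8, 64).astype(np.float32)
q, s = quant_sym(k)          # cache stores q: one byte per value
k_hat = dequant_sym(q, s)    # recovered at attention time
```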
- 24 Jun, 2023 1 commit

Li Zhang authored
* support attention bias
* fix conflict
- 20 Jun, 2023 1 commit

Li Zhang authored
* add ft code
* gitignore
* fix lint
* revert fmha