- 10 Nov, 2023 1 commit
Li Zhang authored
* refresh decoder attention kernel
* block-level kv cache
* `BlockManager` & `SequenceManager`
* update
* update
* update
* update
* rename
* GQA support
* fix context length
* GQA dispatch
* kv8
* tune
* async stream cb
* nvtx
* config parsing
* debug
* optimize output cost
* split-k decoding
* minor
* truncate `session_len` by available blocks
* minor
* license
* fix
* dispatch `cp.async`
* fix linking
* fix
* fix deadlock
* guard input length
* correct start offset
* fix prefill chunking
* fix `cache_block_seq_len` param passing
* fix `block_size` fmtstr
* fix output tokens
* fix batch resizing
* fix masking of finished sequences
* add debug util
* free unused block early
* add ntk scaling and logn scaling
* cmake flags
* fix typo
* w4a16 for sm75
* fix msvc build
* fix msvc build
* fix block verification
* fix msvc build
* use `std::shuffle`
* fix lint
* fix lint
* fix lint
* clear incoming buffer
* clear finished requests
* fix batch initialization
* fix typo
* fix typo
* fix comparison
- 18 Sep, 2023 1 commit
q.yao authored
* support actual seqlen
* fix lint
* update variable types
* lint
* update type
* fix lint
- 29 Aug, 2023 1 commit
q.yao authored
* first
* fix causal mask
* disable flash attention2 on sm70
* fix 2
* update readme
* clang-format
* disable ft2 on windows
* fix lint
* fix build
* fix build
* fix long kv seq
* fix lint
* sync copy output

Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
Co-authored-by: irexyc <irexyc@gmail.com>
- 25 Jul, 2023 1 commit
q.yao authored
Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
- 21 Jul, 2023 1 commit
Li Zhang authored
* add GQA for llama2
* fix model conversion
* fix lint & remove dev log
* update news
* minor
* fix allocation size
* fix split_dim for w_qkv.bias
- 01 Jul, 2023 3 commits
- 28 Jun, 2023 1 commit
tpoisonooo authored
* feat(src): add int8 and compile passed
* feat(kernels): fix
* feat(llama): update kernel
* feat(src): add debug
* fix(kernel): k_cache use int8_t pointer
* style(llama): clean code
* feat(deploy.py): revert to enable fmha
* style(LlamaV2): clean code
* feat(deploy.py): add default quant policy
- 20 Jun, 2023 1 commit
Li Zhang authored
* add ft code
* gitignore
* fix lint
* revert fmha