Unverified Commit ab1767cf authored by Li Zhang, committed by GitHub

TurboMind 2 (#590)

* refresh decoder attention kernel

* block-level kv cache

* `BlockManager` & `SequenceManager`

* update

* update

* update

* update

* rename

* GQA support

* fix context length

* GQA dispatch

* kv8

* tune

* async stream cb

* nvtx

* config parsing

* debug

* optimize output cost

* split-k decoding

* minor

* truncate `session_len` by available blocks

* minor

* license

* fix

* dispatch `cp.async`

* fix linking

* fix

* fix deadlock

* guard input length

* correct start offset

* fix prefill chunking

* fix `cache_block_seq_len` param passing

* fix `block_size` fmtstr

* fix output tokens

* fix batch resizing

* fix masking of finished sequences

* add debug util

* free unused block early

* add ntk scaling and logn scaling

* cmake flags

* fix typo

* w4a16 for sm75

* fix msvc build

* fix msvc build

* fix block verification

* fix msvc build

* use `std::shuffle`

* fix lint

* fix lint

* fix lint

* clear incoming buffer

* clear finished requests

* fix batch initialization

* fix typo

* fix typo

* fix comparison
parent 06125966
@@ -5,11 +5,12 @@
 namespace turbomind {
 struct LlamaAttentionParams {
-    int   rotray_embedding_dim;
+    int   rotary_embedding_dim;
     float rotary_embedding_base;
     int   max_position_embeddings;
-    bool  use_dynamic_ntk;
+    float rope_scaling_factor;
+    // bool use_dynamic_ntk;
     bool  use_logn_attn;
 };
 }  // namespace turbomind
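The replacement of the `use_dynamic_ntk` flag with a `rope_scaling_factor` suggests the scaling strength is now configurable rather than a plain on/off switch, matching the commit item "add ntk scaling and logn scaling". Below is a minimal C++ sketch of the commonly used dynamic NTK base adjustment and logn attention scaling, written against the struct from the diff above; the exact formulas and the helper names (`adjusted_rope_base`, `logn_scale`) are assumptions for illustration, not taken from this commit.

#include <cmath>

// Mirrors the new side of the struct in the diff above.
struct LlamaAttentionParams {
    int   rotary_embedding_dim;
    float rotary_embedding_base;
    int   max_position_embeddings;
    float rope_scaling_factor;
    bool  use_logn_attn;
};

// Dynamic NTK (assumed form): enlarge the RoPE base once the sequence exceeds
// the trained context, using the common alpha = s * L / L0 - (s - 1) rule.
inline float adjusted_rope_base(const LlamaAttentionParams& p, int seq_len)
{
    if (p.rope_scaling_factor <= 0.f || seq_len <= p.max_position_embeddings) {
        return p.rotary_embedding_base;  // within trained context, no scaling
    }
    const float d     = static_cast<float>(p.rotary_embedding_dim);
    const float alpha = p.rope_scaling_factor * seq_len / p.max_position_embeddings
                        - (p.rope_scaling_factor - 1.f);
    return p.rotary_embedding_base * std::pow(alpha, d / (d - 2.f));
}

// Logn scaling (assumed form): grow the attention scale logarithmically with
// position past the trained context so long-range logits keep their magnitude.
inline float logn_scale(const LlamaAttentionParams& p, int position)
{
    if (!p.use_logn_attn || position <= p.max_position_embeddings) {
        return 1.f;
    }
    return std::log(static_cast<float>(position))
           / std::log(static_cast<float>(p.max_position_embeddings));
}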
@@ -93,7 +93,8 @@ private:
     int    step_length_;
     int    start_id_;
     int    end_id_;
-    int    cache_max_entry_count_;
+    float  cache_max_block_count_;
+    int    cache_block_seq_len_;
     int    cache_chunk_size_;
     int    use_context_fmha_;
     size_t tensor_para_size_;
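The switch from an integer `cache_max_entry_count_` to a float `cache_max_block_count_`, together with the new `cache_block_seq_len_`, points at a block-level KV cache whose size can be expressed relative to what fits in memory. One possible reading is sketched below; the fraction-versus-absolute interpretation, the helper name `max_kv_blocks`, and the memory accounting are assumptions, not taken from this commit.

#include <cstddef>

// Hypothetical sizing helper for a block-level KV cache.
//   cache_max_block_count : >= 1     -> explicit number of cache blocks
//                           in (0,1) -> fraction of the blocks that fit in free memory
//   cache_block_seq_len   : tokens held by one cache block
//   bytes_per_token       : K + V bytes per token summed over all layers
inline size_t max_kv_blocks(float  cache_max_block_count,
                            int    cache_block_seq_len,
                            size_t bytes_per_token,
                            size_t free_mem_bytes)
{
    const size_t block_bytes = static_cast<size_t>(cache_block_seq_len) * bytes_per_token;
    const size_t fit         = block_bytes ? free_mem_bytes / block_bytes : 0;
    if (cache_max_block_count >= 1.f) {
        return static_cast<size_t>(cache_max_block_count);  // explicit block count
    }
    return static_cast<size_t>(static_cast<float>(fit) * cache_max_block_count);
}

Under this reading, the commit item "truncate `session_len` by available blocks" would clamp a sequence's maximum length to roughly max_kv_blocks(...) * cache_block_seq_len, though that mapping is likewise an assumption.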