- 14 Nov, 2023 1 commit
-
-
Li Zhang authored
* fix init of finished buf * fix `finished_count`
-
- 13 Nov, 2023 2 commits
- 10 Nov, 2023 2 commits
-
-
Li Zhang authored
* refresh decoder attention kernel * block-level kv cache * `BlockManager` & `SequenceManager` * update * update * update * update * rename * GQA support * fix context length * GQA dispatch * kv8 * tune * async stream cb * nvtx * config parsing * debug * optimize output cost * split-k decoding * minor * truncate `session_len` by available blocks * minor * license * fix * dispatch `cp.async` * fix linking * fix * fix deadlock * guard input length * correct start offset * fix prefill chunking * fix `cache_block_seq_len` param passing * fix `block_size` fmtstr * fix output tokens * fix batch resizing * fix masking of finished sequences * add debug util * free unused block early * add ntk scaling and logn scaling * cmake flags * fix typo * w4a16 for sm75 * fix msvc build * fix msvc build * fix block verification * fix msvc build * use `std::shuffle` * fix lint * fix lint * fix lint * clear incoming buffer * clear finished requests * fix batch initialization * fix typo * fix typo * fix comparison
-
RunningLeon authored
* update reqs * update docs * resolve comments * upgrade pydantic * fix rebase * update doc * update * update * update readme * update * add flash-attn
-
- 09 Nov, 2023 3 commits
- 08 Nov, 2023 3 commits
-
-
RunningLeon authored
* add check env * update issue template * remove some reqs from check env * resolve comment
-
Chen Xin authored
-
AllentDan authored
* fix benchmark serving computation mistake * fix timestamps computations * remove speed up * no mp * mp seems faster? * remove * update * remove * fix * update * update print log * typo * print first token latency only when stream==True * remove renew_session * update AsyncEngine
-
- 06 Nov, 2023 2 commits
-
-
aisensiy authored
* Use session id from gradio state * use a new session id after reset * rename session id like a state * update comments * reformat files * init session id on block loaded * use auto increased session id * remove session id textbox * apply to api_server and tritonserver * update docstring * add lock for safety --------- Co-authored-by: AllentDan <AllentDan@yeah.net>
-
yunzhongyan0 authored
* FIX: fix stop_session func bug * keep sequence_end = False --------- Co-authored-by: honglei.yan <honglei.yan@nio.com> Co-authored-by: AllentDan <AllentDan@yeah.net>
-
- 03 Nov, 2023 6 commits
-
-
pppppM authored
* fix awq * adapt new qwen code * adapt qwen 14b and baichuan2 7b * add docstring * add runtime error for qwen
-
AllentDan authored
-
liukuikun authored
-
Chen Xin authored
* split deploy.py * fix get_cuda_tensor * deploy qwen_awq * fix lint * add docstring * fix * support baichuan/baichuan-awq * parameterizing size_per_head * remove try/except * limit input model_format * add quant_path param * remove old deploy.py * fix path * fix transformer layer range when load bins * fix qwen init * split & save log * relative import * update get_config * WeightFileMgr -> Reader * rename * update * fix init_layer_id * rename llama.py -> meta_llama.py, hf.py -> llama.py * reduce code * update arg description * fix meta llama * manually cleanup meta model params
-
RunningLeon authored
* update * resolve comment
-
Yam(长琴) authored
-
- 01 Nov, 2023 1 commit
-
-
AllentDan authored
* make IPv6 compatible, safe run for coroutine interrupting * instance_id -> session_id and fix api_client.py * update doc * remove useless faq * safe ip mapping * update app.py * WIP completion * completion * update doc * disable interactive mode for /v1/chat/completions * docstring * docstring * refactor gradio * update gradio * update * update doc * rename * session_id default -1 * missed two files * add an APIClient * add chat func for APIClient * refine * add concurrent function * sequence_start, sequence_end --> interactive_mode * update doc * comments * doc * better text completion * remove /v1/embeddings * comments * deprecate generate and use /v1/interactive/completions * /v1/interactive/completion -> /v1/chat/interactive * embeddings * rename * remove wrong arg description * docstring * fix * update cli * update doc * strict session_len limit condition * pass model args to api_server
-
- 30 Oct, 2023 1 commit
-
-
Lyu Han authored
-
- 25 Oct, 2023 3 commits
-
-
AllentDan authored
* support inference a batch of prompts * docstring and assert
-
RunningLeon authored
* add * import fire in main * wrap to speed up fire cli * update * update docs * update docs * fix * resolve comments * resolve conflict and add test for cli
-
Lyu Han authored
* add build from docker section * update * install python package * update * update * update
-
- 24 Oct, 2023 2 commits
- 23 Oct, 2023 2 commits
-
-
AllentDan authored
- 19 Oct, 2023 2 commits
- 18 Oct, 2023 2 commits
- 17 Oct, 2023 1 commit
-
-
Lyu Han authored
-
- 16 Oct, 2023 2 commits
- 13 Oct, 2023 3 commits
-
-
del-zhenwu authored
* [doc] Update benchmark command in w4a16.md * Update w4a16.md * Update w4a16.md add pip install nvidia-ml-py * [doc] Update w4a16.md * fix lint error Signed-off-by: del-zhenwu <dele.zhenwu@gmail.com> * [doc] update model_path & prompt_tokens Signed-off-by: del-zhenwu <dele.zhenwu@gmail.com> --------- Signed-off-by: del-zhenwu <dele.zhenwu@gmail.com>
-
Chen Xin authored
* add tp hint for deploy * fix lint * assert tp in turbomind * fix lint
-
YiiSh authored
-
- 12 Oct, 2023 2 commits