src/turbomind/triton_backend/llama/LlamaTritonModel.cc · 62b60db72a8e968dd74720d203a964a5ecb1df8d · OpenDAS / Lmdeploy

[Feature] Blazing fast W4A16 inference (#202) · c3290cad

Li Zhang authored Aug 14, 2023

* add w4a16

* fix `deploy.py`

* add doc

* add w4a16 kernels

* fuse w1/w3 & bugfixes

* fix typo

* python

* guard sm75/80 features

* add missing header

* refactor

* qkvo bias

* update cost model

* fix lint

* update `deploy.py`

LlamaTritonModel.cc 17.4 KB
