"src/runtime/cuda/cuda_hashtable.hip" did not exist on "889798fec5323070bef950c23f7f1d36a22588b7"
Add support for exl2 quantization
Mostly straightforward, changes to existing code:

* Wrap quantizer parameters in a small wrapper to avoid passing around untyped tuples and needing to repack them as a dict.
* Move scratch space computation to warmup, because we need the maximum input sequence length to avoid allocating huge scratch buffers that OOM.
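
A minimal sketch of the two changes described above, not the code from this commit. All names (`QuantParams`, `QuantLinear`, `ScratchSpace`) and the scratch-size formula are assumptions made for illustration, and a `bytearray` stands in for a device buffer so the example runs without a GPU.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QuantParams:
    """Hypothetical typed wrapper replacing an untyped tuple of quantizer settings."""
    bits: int
    groupsize: int
    desc_act: bool = False


@dataclass
class QuantLinear:
    """Hypothetical quantized layer; it only reports how much scratch it needs."""
    params: QuantParams
    out_features: int

    def scratch_bytes(self, max_input_len: int) -> int:
        # Assumed formula: one 2-byte (fp16) value per output feature per token.
        return max_input_len * self.out_features * 2


class ScratchSpace:
    """Allocates one shared scratch buffer, sized at warmup rather than at load time."""

    def __init__(self) -> None:
        self.buffer: Optional[bytearray] = None

    def warmup(self, layers: List[QuantLinear], max_input_len: int) -> int:
        # The buffer is shared, so it only has to fit the most demanding layer
        # at the largest input length the server will accept.
        needed = max((layer.scratch_bytes(max_input_len) for layer in layers), default=0)
        self.buffer = bytearray(needed)
        return needed


params = QuantParams(bits=4, groupsize=128)
layers = [QuantLinear(params, out_features=1024), QuantLinear(params, out_features=4096)]
scratch = ScratchSpace()
needed = scratch.warmup(layers, max_input_len=512)
```

Because the size depends on `max_input_len`, deferring allocation to warmup lets the buffer match the real limit instead of a pessimistic upper bound chosen when the weights are loaded.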