[CUDA] Branchless NF4/FP4 kDequantizeBlockwise kernel for faster dequantization (#1746)
* Added branchless LUT-based dequantization for FP4 and NF4
* Added extra command line options to control reproducibility
* Restore FP4 quantization/dequantization order
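
For readers unfamiliar with the technique, the following is a minimal host-side C++ sketch of what a branchless, LUT-based NF4 dequantization can look like. It is not the CUDA kernel from this change. The names `kNF4Table`, `dequantize_nf4`, and `dequantize_blockwise_nf4` are hypothetical, the table holds the commonly cited NF4 code values, and the packing order (first element in the high nibble) and an even `blocksize` are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed NF4 code values (16 entries, normalized to [-1, 1]).
static const float kNF4Table[16] = {
    -1.0f,                 -0.6961928009986877f, -0.5250730514526367f,
    -0.39491748809814453f, -0.28444138169288635f, -0.18477343022823334f,
    -0.09105003625154495f,  0.0f,                  0.07958029955625534f,
     0.16093020141124725f,  0.24611230194568634f,  0.33791524171829224f,
     0.44070982933044434f,  0.5626170039176941f,   0.7229568362236023f,
     1.0f};

// Branchless decode: the 4-bit code indexes the table directly, so there is
// no chain of if/else tests on individual bits.
inline float dequantize_nf4(std::uint8_t code) {
    return kNF4Table[code & 0x0F];
}

// Blockwise dequantization sketch: each byte packs two 4-bit codes, and every
// `blocksize` output values share one absmax scale.
// Assumptions: high nibble holds the first value; blocksize is even.
std::vector<float> dequantize_blockwise_nf4(
    const std::vector<std::uint8_t>& packed,
    const std::vector<float>& absmax,
    std::size_t blocksize) {
    std::vector<float> out(packed.size() * 2);
    for (std::size_t i = 0; i < packed.size(); ++i) {
        const float scale = absmax[(2 * i) / blocksize];
        out[2 * i]     = dequantize_nf4(packed[i] >> 4) * scale;
        out[2 * i + 1] = dequantize_nf4(packed[i] & 0x0F) * scale;
    }
    return out;
}
```

The point of the table lookup is that every thread executes the same instruction sequence regardless of the code value, which avoids divergent branches on a GPU. An FP4 variant would follow the same pattern with a different 16-entry table.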