In theory this might have lower numerical error, since the scaling is done in fp32 instead of fp16 (not sure; I haven't thought about it too carefully). In practice, however, the numerical error seems about the same.
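To make the fp32 vs. fp16 point concrete, here is a rough sketch, not the actual kernel code. It assumes two per-block scales that are each stored in fp16, and compares multiplying them in fp16 against multiplying them in fp32. The names `to_fp16`, `to_fp32` and `max_scale_errors` are made up for illustration, and the value range for the scales is an arbitrary assumption.

```python
import random
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def max_scale_errors(n: int = 10000, seed: int = 0) -> tuple[float, float]:
    """Return the max relative error of the scale product in (fp16, fp32).

    Hypothetical setup: d_a and d_b stand in for two block scales that are
    already stored in fp16; only the multiplication precision differs.
    """
    rng = random.Random(seed)
    max_err16 = 0.0
    max_err32 = 0.0
    for _ in range(n):
        d_a = to_fp16(rng.uniform(0.05, 1.0))  # assumed scale range
        d_b = to_fp16(rng.uniform(0.05, 1.0))
        exact = d_a * d_b  # exact in double: 11-bit x 11-bit significands
        err16 = abs(to_fp16(exact) - exact) / exact  # product rounded to fp16
        err32 = abs(to_fp32(exact) - exact) / exact  # product rounded to fp32
        max_err16 = max(max_err16, err16)
        max_err32 = max(max_err32, err32)
    return max_err16, max_err32


if __name__ == "__main__":
    err16, err32 = max_scale_errors()
    print(f"max relative error, fp16 scaling: {err16:.3e}")
    print(f"max relative error, fp32 scaling: {err32:.3e}")
```

Under these assumptions the fp32 product is exact (the product of two fp16 significands needs at most 22 bits, and fp32 has 24), while the fp16 product has a relative rounding error of at most 2^-11. That is small next to the error from 4-bit quantization itself, which may be why the two variants look about the same in practice.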