perf: prefer batched matmuls for attention; add fast path to Decoder when num_heads=1
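
For reviewers, a rough NumPy sketch of the idea, not the code in this change. The function names (`attention_looped`, `attention_batched`), the `(batch, seq, dim)` layout, and the unmasked self-attention setup are all assumptions made for illustration. The actual `Decoder` may differ, for example by applying a causal mask. The sketch contrasts a per-head loop with a single batched matmul over all heads, and shows a `num_heads == 1` branch that skips the head split/merge reshapes.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_looped(q, k, v, num_heads):
    # Baseline for comparison: one pair of matmuls per head.
    # q, k, v: (batch, seq, dim); dim must be divisible by num_heads.
    b, s, d = q.shape
    hd = d // num_heads
    outs = []
    for h in range(num_heads):
        sl = slice(h * hd, (h + 1) * hd)
        scores = q[:, :, sl] @ k[:, :, sl].transpose(0, 2, 1) / np.sqrt(hd)
        outs.append(softmax(scores) @ v[:, :, sl])
    return np.concatenate(outs, axis=-1)


def attention_batched(q, k, v, num_heads):
    # Hypothetical batched version: all heads in one matmul each
    # for the scores and the weighted sum.
    b, s, d = q.shape
    if num_heads == 1:
        # Fast path: with a single head there is nothing to split or merge,
        # so the reshape/transpose steps are skipped entirely.
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
        return softmax(scores) @ v

    hd = d // num_heads

    def split(x):
        # (batch, seq, dim) -> (batch, heads, seq, head_dim)
        return x.reshape(b, s, num_heads, hd).transpose(0, 2, 1, 3)

    qh, kh, vh = split(q), split(k), split(v)
    scores = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(hd)  # (b, h, s, s)
    out = softmax(scores) @ vh                            # (b, h, s, hd)
    # Merge heads back: (batch, heads, seq, head_dim) -> (batch, seq, dim)
    return out.transpose(0, 2, 1, 3).reshape(b, s, d)
```

Both functions should agree up to floating-point tolerance. The batched form replaces `num_heads` small matmuls with one larger one.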