Re-order attention head outputs for better perf
Significant performance boost over the original ordering: on an already somewhat optimised branch, this gave me >2x end-to-end throughput on a SQuAD XLNet fine-tuning task (batch size 8, sequence length 612, fp16).
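A minimal sketch of the kind of re-ordering this refers to, using hypothetical shapes and NumPy rather than the branch's actual code: combining per-head attention outputs with the output projection can either materialise a flattened `(heads * d_head)` intermediate and matmul, or contract both axes directly in one `einsum`, letting the backend choose the contraction order. Both produce the same result; the shape names (`attn_vec`, `proj`) are illustrative assumptions, not identifiers from this repo.

```python
import numpy as np

# Hypothetical shapes for illustration (not this branch's actual code):
# attn_vec: (seq, batch, n_head, d_head) per-head attention outputs
# proj:     (d_model, n_head, d_head)    output projection weights
seq, batch, n_head, d_head, d_model = 4, 2, 3, 5, 15
rng = np.random.default_rng(0)
attn_vec = rng.standard_normal((seq, batch, n_head, d_head))
proj = rng.standard_normal((d_model, n_head, d_head))

# Ordering A: flatten the head axes into one contiguous dimension,
# then do a plain matmul against the flattened projection.
flat = attn_vec.reshape(seq, batch, n_head * d_head)
out_a = flat @ proj.reshape(d_model, n_head * d_head).T

# Ordering B: contract head and head-dim axes in a single einsum,
# avoiding the explicit intermediate reshape.
out_b = np.einsum('ibnd,hnd->ibh', attn_vec, proj)

# Both orderings are numerically equivalent.
assert np.allclose(out_a, out_b)
```

Which ordering is faster depends on memory layout and the kernel the backend dispatches to, which is why a re-ordering like this can change throughput without changing the model's output.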