Refactor vectorization and preloading for pointwise fusions (#1184)
Improves performance for add_gelu. In BERT it is 4x faster, and mul_add is 50% faster than what we currently have.