    Improve layernorm performance (#613) · 56b3bf58
    Paul Fultz II authored
    * Use increment instead of division to compute register offset
    
    * Formatting
    
    * Limit layernorm to 1024 elements
    
    * Formatting
    
    * Add verification to driver
    
    * Formatting
    
    * Remove early return
    
    * Use block_size 256
    
    * Vectorize the kernel
    
    * Formatting
    
    * Convert to vector type
    
    * Add layernorm tests
    
    * Formatting
    
    * Formatting
    
    * Refactor layernorm to run both algos
    
    * Formatting
    
    * Fix compile error
    
    * Fix tidy warnings
    
    * Formatting
    
    * Add layernorm function
    
    * Formatting
fuse_ops.cpp 23.6 KB