• FP16 data in-register transpose (#41) · b491ebf3
    Chao Liu authored
    * start fixing 16bit data packing
    
    * adding StaticTensor
    
    * adding StaticTensor
    
    * adding StaticTensor
    
    * add missing constexpr
    
    * adding static tensor
    
    * adding static tensor
    
    * adding transpose
    
    * add inline asm for transpose 2x2 of half_t
    
    * add general transpose_vectors(), but it has unnecessary register initialization using v_mov
    
    * fix unnecessary register initialization in transpose_vector by using more pass-by-reference
    
    * add hardcoded logic for NHWC wrw
    
    * improve asm for v_pack
    
    * make ThreadwiseTensorSliceTransfer_v3r2 support any tensor
    
    * tweak
    
    * reorganize file
device_gemm_xdl.hpp 18.7 KB