  1. 26 Apr, 2022 1 commit
    • Implement MI200 FP16 Denorm fix inside threadwise copy (#191) · b39f07f1
      ltqin authored
      
      
      * start convert
      
      * using buffer load
      
      * add kernel transfer fun
      
      * using asm for transfer
      
      * add transpose_half_to_bhalf_2x2
      
      * add TypeMap struct
      
      * add LDSDataType to v2r3 and v2r4r2
      
      * change convert fun name
      
      * remove asm in half transfer to bhalf
      
      * fix bug for type_convert
      
      * cshuffle_v1 add LDSDataType
      
      * add ldstype for gridwise gemm v2r4
      
      * add lds data type to v3r1 2 3
      
      * init complete
      
      * fix function name
      
      * remove comments
      
      * format
      
      * fix for merge develop
      Co-authored-by: ltqin <letaoqin@amd.com>
  2. 22 Apr, 2022 1 commit
  3. 09 Mar, 2022 1 commit
    • Reorganize files, Part 1 (#119) · 5d37d7bf
      Chao Liu authored
      * delete obsolete files
      
      * move files
      
      * build
      
      * update cmake
      
      * update cmake
      
      * fix build
      
      * reorg examples
      
      * update cmake for example and test
  4. 15 Nov, 2021 1 commit
    • FP16 data in-register transpose (#41) · b491ebf3
      Chao Liu authored
      * start fixing 16bit data packing
      
      * adding StaticTensor
      
      * adding StaticTensor
      
      * adding StaticTensor
      
      * add missing constexpr
      
      * adding static tensor
      
      * adding static tensor
      
      * adding transpose
      
      * add inline asm for transpose 2x2 of half_t
      
      * add general transpose_vectors(), but have unnecessary register initialization using v_mov
      
      * fix unnecessary register initialization in transpose_vector by using more pass-by-reference
      
      * add hardcoded logic for NHWC wrw
      
      * improve asm for v_pack
      
      * make ThreadwiseTensorSliceTransfer_v3r2 support any tensor
      
      * tweak
      
      * reorganize file