    xdlops_v4r4_fwd fp32/fp16 (#34) · 3835318c
    zjing14 authored
    
    
    * create files for xdlops
    
    * working on blockwise_gemm_xdlops
    
    * add KReduction
    
    * add m/n repeats
    
    * add 2x2 pipeline
    
    * added 128x128 wavegemm
    
    * use StaticBuffer of vector_type
    
    * break vector type to blk_size
    
    * add kpack into xdlops_gemm and blockwise_gemm
    
    * abroadcast only
    
    * add fp32 mfma instructions
    
    * adding fp16 mfma
    
    * pack half4_t
    
    * rename kperwave to kpack
    
    * add 32x32x8fp16
    
    * add fp16 mfma
    
    * clean code
    
    * clean code
    
    * V4r4 xdlops kpack (#35)
    
    * add kpack with incorrect results
    
    * bug fix for make_dynamic_naive_tensor_descriptor_aligned_v2
    
    * add 1x1 kernel
    
    * add gridwise_gemm_v2 - single_buffer
    
    * enabled dwordx4 for fp16
    Co-authored-by: Chao Liu <chao.liu2@amd.com>
    
    * refactor fwd-v4r4-xdlops
    
    * add v4r4-nhwc-xdlops
    
    * improve nhwc and nchw performance by tuning parameters, and change scheduling in the gridwise-gemm loop
    
    * tweak scheduling in gridwise gemm
    
    * add v4r3 with a single output copy
    
    * init commit: output with slice win
    
    * adding sliceWin
    
    * add multiple repeats pattern
    
    * start adding bwd-v4r1-xdlops
    
    * use tuple as SrcBuffer
    
    * adding bwd-data v4r1 nhwc xdlops
    
    * fix bug in make_dynamic_naive_tensor_descriptor_aligned_v2()
    
    * fix bug in host bwd-data conv
    
    * initial implementation of bwd-data v4r1 nhwc xdlops
    
    * add launch bound flags
    
    * enable launch bound
    
    * add m/nrepeat=4
    
    * tweak bwd-data v4r1 nhwc xdlops
    
    * added bwd-data v4r1 nhwc xdlops with output A and weight B
    
    * add fwd-v4r4 nhwc xdlops, A input, B weight, C output
    Co-authored-by: Chao Liu <chao.liu2@amd.com>
host_tensor_generator.hpp 1.27 KB