1. 04 Mar, 2022 3 commits
    • [Bf16 & int8] [example & ckprofiler] (#100) · 7e9a9d32
      rocking5566 authored
      
      
      * Add int8 of mk_nk_mn to the ckProfiler
      
      * Add example of int8 gemm
      
      * Fix typo, use ushort instead of half_t for bfloat16
      
      * replace ushortXXX_t with bhalfXXX_t
      
      * rename ushort to bhalf_t
      
      * Add bf16 example
      
      * Add bf16 gemm to ckProfiler
      
      * Fix alignment
      
      * Fix typo
      
      * Add unit test for gemm_xdl int8
      
      * Add gemm_xdl fp32 unit test
      
      * Add gemm_xdl bf16 unit test
      
      * fix build
      
      * fix build issue due to merge conflict
      
      * Fix build
      
      * Fix build error
      Co-authored-by: rocking <chunylai@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Refactor threadwise copy using sfcurve (#101) · 0619ebf7
      Jianfeng Yan authored
      
      
      * add space_filling_curve
      
      * cleanup and move space_filling_curve into test
      
      * WIP: start refactoring threadwise_transfer_v1r3
      
      * threadwise_copy works but needs further refactoring
      
      * add some comments
      
      * add SpaceFillingCurve::GetIndices()
      
      * minor changes
      
      * removed GetIndices; refactored GetDstCoordinateResetStep
      
      * add DynamicBuffer::Transfer, but Add is not tested
      
      * rebased against develop
      
      * threadwise_copy_v6r1/v6r2/v6r3 using space-filling curve start to work
      
      * minor changes
      
      * refactored threadcopy v3r1, v2; removed old implementations
      
      * clang-format
      
      * cleanup
      
      * fix a typo in v6r3
      
      * format
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • NHWC conv 2d: bwd fp32/fp16/bfp16/int8, Device level tuning and host API (#92) · c254e5ab
      ltqin authored
      
      
      * start conv2d bwd api
      
      * kernel running
      
      * add bwd reference
      
      * change to no shuffle
      
      * fix bwd reference
      
      * pass verification
      
      * add Filter1x1Stride1Pad0 and start testing
      
      * change some tuning parameter
      
      * fix test error
      
      * add fp16 tuning parameter
      
      * add bf16 tuning parameter
      
      * add int8 tuning parameters
      
      * change fp32 tuning parameter
      
      * add bwd to profiler
      
      * fix bug for bwd profiler
      
      * fix ckProfiler bug
      
      * change conv2d_bwd_xdl to fp16
      
      * fix bug in comments
      
      * fix precompile id
      
      * fix enum conv name
      
      * change _bwd_ to _bwd_data_
      
      * change conv2d_bwd example id
      
      * bwd to bwd data
      
      * fix prehead
      
      * fix MakeDefaultBlock2CTileMap, import from merge develop
      
      * format bwd instance
      
      * bwd to bwd data
      
      * change name bwd to bwd data
      
      * change name bwd to bwd data in example
      
      * format code
      
      * change conv2d bwd data id in example
      
      * rewrite readme for example
      
      * fix division by zero in CalculateMagicNumbers
      
      * add workaround CK_WORKAROUND_SWDEV_325164
      
      * change test_conf2d_bwd_data show info
      
      * format
      
      * fix bug for workaround:CK_WORKAROUND_SWDEV_325164
      
      * format tuning parameters
      
      * format tuning parameters again
      
      * format tuning parameters 3
      
      * format tuning parameters 4
      
      * remove add function template
      
      * format
      
      * update comment
      Co-authored-by: ltqin <letaoqin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  2. 28 Feb, 2022 1 commit
    • Allow distinct K0/K1 values for A/B block descriptor (#98) · 6d4450ef
      Anthony Chang authored
      
      
      * add gitignore
      
      * host tensor: allow generating sequentially increasing value in a given dimension
      
      * gridwise gemm v3r1: allow distinct K0/K1 values for A/B block descriptor
      
      - remove dangling header include
      - modify example gemm_xdl accordingly
      - infer KPack value from M/NPerXdl
      - device conv2d fwd: update parameters accordingly for the underlying gridwise gemm v3r1
      (API for conv2d fwd stays the same for now until we decide to expose individual K0s for activation and weight)
      
      * add LDS data dump utility
      
      * profiler: reflect API change for distinct K0/K1 for A/B matrices
      
      * profiler: add conflict-free LDS write FP16 kernel instances
      
      * fix accidental perf regression
      
      * address feedback; cosmetic changes
      
      * clang-format for new files
      
      * format
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  3. 23 Feb, 2022 3 commits
    • Add gridwise GEMM pipeline (#89) · 22d438ae
      Chao Liu authored
      * clean up
      
      * add multiple thread scratch to ThreadwiseTensorSliceTransfer_v3r1
      
      * add 2 stage prefetch
      
      * add more sanity check into transform_tensor_descriptor
      
      * tweak
      
      * enabling 2 stage prefetch to existing gridwise gemm; tweak
      
      * enabling 2 stage prefetch to existing gridwise gemm
      
      * move gridwise gemm pipeline in class; clean up
      
      * add some irregular tile size
      
      * update CalculateHasMainK0BlockLoop for multi-stage-prefetch
      
      * refactor gridwise gemm pipeline class
    • Unify Convolution FWD XDL 1D/2D implementation. (#93) · 756a7617
      Adam Osewski authored
      
      
      * Convolution ND
      
      * Code unification across dimensions for generating tensor descriptors.
      * Example
      * Instances
      
      * Move convnd f32 instance file to comply with repo structure.
      
      * Conv 1D tensor layouts.
      
      * Formatting and use ReferenceConv
      
      * Reference ConvFwd supporting 1D and 2D convolution.
      
      * Debug printing TensorLayout name.
      
      * Conv fwd 1D instance f32
      
      * Refactor conv ND example.
      
      Needed to support various conv dimensions
      
      * Rename conv nd example directory to prevent conflicts.
      
      * Refactor some common utility to single file.
      
      Plus some tests.
      
      * Refactor GetHostTensorDescriptor + UT.
      
      * Add 1D test case.
      
      * Test reference convolution 1d/2d
      
      * Remove some leftovers.
      
      * Fix convolution example error for 1D
      
      * Refactor test check errors utility function.
      
      * Test Conv2D Fwd XDL
      
      * More UT for 1D case.
      
      * Parameterize input & weight initializers.
      
      * Rename example to prevent conflicts.
      
      * Split convnd instance into separate files for 1d/2d
      
      * Address review comments.
      
      * Fix data type for flops/gbytes calculations.
      
      * Assign example number 11.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Conv3d new (#94) · 6dfb92bb
      Jianfeng Yan authored
      
      
      * conv3d compiles but has memory error
      
      * conv3d works
      
      * fix performance issue by using __builtin_amdgcn_readfirstlane
      
      * change MakeBlock2CTileMap to MakeDefaultBlock2CTileMap; change c_blockid_to* to cblockid_to*
      
      * clang-format
      
      * remove CK_EXPERIMENTAL_PASS_TENSOR_DECRIPTOR_BY_*; moved wrapper into DeviceConv3d
      
      * format
      
      * remove useless macro
      
      * add comment
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  4. 21 Feb, 2022 1 commit
  5. 12 Feb, 2022 1 commit
    • NHWC conv 2d: fwd bfp16/int8, Device level tuning and host API (#73) · 880fbee9
      ltqin authored
      
      
      * add fwd bf16 conv
      
      * change tuning parameter
      
      * add int8 for conv fwd
      
      * remove comments
      
      * change tuning parameter for int8
      
      * change init int8 example
      
      * add test for conv2d fwd
      
      * change device operation file pos because merge develop
      
      * fwd int8 use reference
      
      * test_conv_fwd use reference
      
      * add bracket for if statement
      
      * rename fwd example name
      
      * remove StaticBufferOfVectorTypeV2
      
      * tweak example
      Co-authored-by: ltqin <letaoqin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  6. 11 Feb, 2022 1 commit
  7. 07 Feb, 2022 1 commit
    • GEMM+Bias+ReLU+Add (#76) · 823657ed
      Chao Liu authored
      * tweak conv for odd C
      
      * update script
      
      * clean up elementwise op
      
      * fix build
      
      * clean up
      
      * added example for gemm+bias+relu+add
      
      * added example for gemm+bias+relu
      
      * add profiler for gemm_s_shuffle; re-org files
      
      * add profiler
      
      * fix build
      
      * clean up
      
      * clean up
      
      * clean up
      
      * fix build
  8. 04 Feb, 2022 1 commit
  9. 21 Jan, 2022 1 commit
    • Add gemm_shuffle host api (#71) · 4d40b197
      rocking5566 authored
      * [What]
      1. Add DeviceGemmXdl_C_Shuffle
      2. Revise example of gemm_xdl
      [Why] Prepare to add shuffle version of D = alpha * (A * B) + beta * C
      [How] Imitate DeviceGemmXdl and device_conv2d_fwd_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp
  10. 18 Jan, 2022 1 commit
  11. 26 Dec, 2021 1 commit
    • Fusion Conv+Bias+ReLU(+Add) (#62) · acbd7bd7
      Chao Liu authored
      * fix relu
      
      * clean up
      
      * clean up
      
      * adding 1x1 conv
      
      * adding 1x1 conv
      
      * added 1x1 conv
      
      * refactor
      
      * refactor
      
      * refactor
      
      * added profiler for conv+bias+relu+add
      
      * clean up
      
      * adding conv+bias+relu
      
      * adding conv+bias+relu
      
      * added conv+bias+relu
      
      * Update README.md
      
      * update cpu verification
      
      * adding c shuffle
      
      * update static_tensor for dealing with invalid element
      
      * adding c shuffle
      
      * debugging
      
      * fix bug
      
      * convert to fp16 before shuffle
      
      * shuffle more than one M/NRepeat
      
      * clean up
      
      * remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1
      
      * clean up
      
      * remove coordinate step hack from all gridwise gemm xdl
      
      * clean up coordinate step hack
      
      * clean up coordinate step hack
      
      * ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst
      
      * adding output shuffle in conv+bias+relu+add
      
      * update
      
      * added conv+bias+relu+add with c shuffle
      
      * added conv+bias+relu+add with c shuffle
      
      * fix forward_sweep bugs in threadwise copy
      
      * clean up
      
      * refactor
      
      * clean up
      
      * clean up
      
      * added conv_c_shuffle+bias_relu
      
      * clean up
      
      * added conv+bias+relu+atomic_add
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * misc fixes; add 1x1 specialization
      
      * clean up
      
      * delete unused device op
      
      * clean up
      
      * add support for odd C value
  12. 04 Dec, 2021 1 commit
  13. 03 Dec, 2021 1 commit
    • GEMM/Conv+BiasAdd+ReLU+Add (#55) · 41cdd380
      Chao Liu authored
      * gemm+activation
      
      * move C pointwise operation into threadwise copy
      
      * add pointwise operation to A/B matrix
      
      * update ckProfiler
      
      * adding bias add
      
      * adding bias add
      
      * adding bias add
      
      * added bias add; worked around compiler issues
      
      * clean up
      
      * clean up
      
      * Update README.md
      
      * Update README.md
      
      * Update README.md
      
      * clean up
      
      * add conv_xdl example
      
      * adding conv_xdl_bias_relu_add example
      
      * add conv+bias+relu+add, but has register spill issue
      
      * tweak
      
      * tweak
      
      * refactor
      
      * Update README.md
      
      update readme for example/2_gemm_xdl_bias_relu_add
      
      * clean up
      
      * Update README.md
      
      update readme for example/3_conv_xdl
      
      * Update README.md
  14. 18 Nov, 2021 1 commit
  15. 14 Nov, 2021 1 commit
    • ckProfiler and device-level XDL GEMM operator (#48) · e823d518
      Chao Liu authored
      * add DeviceGemmXdl
      
      * update script
      
      * fix naming issue
      
      * fix comment
      
      * output HostTensorDescriptor
      
      * rename
      
      * padded GEMM for fwd v4r4r4 nhwc
      
      * refactor
      
      * refactor
      
      * refactor
      
      * adding ckProfiler
      
      * adding ckProfiler
      
      * refactor
      
      * fix tuning parameter bug
      
      * add more gemm instances
      
      * add more fp16 GEMM instances
      
      * fix profiler driver
      
      * fix bug in tuning parameter
      
      * add fp32 gemm instances
      
      * small fix
      
      * refactor
      
      * rename
      
      * refactor gemm profiler; adding DeviceConv and conv profiler
      
      * refactor
      
      * fix
      
      * add conv profiler
      
      * refactor
      
      * adding more GEMM and Conv instance
      
      * Create README.md
      
      Add build instruction for ckProfiler
      
      * Create README.md
      
      Add Readme for gemm_xdl example
      
      * Update README.md
      
      Remove build instruction from top most folder
      
      * Update README.md
      
      * clean up