1. 19 Jun, 2023 1 commit
    • rocking's avatar
      Maxpool bwd (#750) · 341ad956
      rocking authored
      * Add maxpool f32 kernel and example
      
      * Revise copyright
      
      * Add device pool bwd device op
      
      * Support f16 and bf16
      
      * Add compute datatype for reference code.
      Prevent error in bf16
      
      * Fix type error
      
      * Remove layout
      
      * Fix bf16 error
      
      * Add f16 and bf16 example
      
      * Add more operations
      
      * Implement IsSupportedArgument
      
      * Add changelog
      
      * Add comment
      
      * Add comment
      
      * Remove useless header
      
      * Move initialize of workspace to the run
      
      * Move set din zero to the device operator
      
      * Save din_length_raw
      
      * Remove useless header
      
      * Calculate gridsize according to the number of CU
      
      * Calculate gridSize according to the number of CU.
      Remove useless header
      
      * Add put example
      
      * Remove useless header
      
      * Fix CI fail
      341ad956
  2. 15 Jun, 2023 1 commit
    • Illia Silin's avatar
      Enable gfx941 and gfx942 architectures. (#752) · 027e46ee
      Illia Silin authored
      * enable gfx941/942 targets
      
      * fix clang format
      
      * fix the cmake logic for multiple targets
      
      * fix cmake syntax for looping over targets
      
      * add gfx941/942 support for gemm_xdl instances
      027e46ee
  3. 12 Jun, 2023 1 commit
  4. 01 Jun, 2023 1 commit
    • Po Yen Chen's avatar
      Simplify kernel argument of device operator Device(Batched)GemmXdl<> (#723) · 9eae73df
      Po Yen Chen authored
      
      
      * Remove M/N/KPad local variables
      
      * Use M/N/KPad to name padded lengths
      
      * Replace duplicated local variable by parameters
      
      * Rename variables M/N/KRaw to M/N/K
      
      * Move AK0/BK0 compute logic into GridwiseGemm
      
      * Use macro to shorten code
      
      * Move CalculateGridSize() logic into GridwiseGemm
      
      * Add comment to credit the implementation source
      
      * Reuse the existing implementation
      
      * Remove no-longer used data members
      
      * Remove elementwise-op objects from interfaces
      
      * Reserve kernel arg as whole object in interfaces
      
      * Remove redundant data member
      
      * Make 3rd type parameter optional
      
      * Remove unnesscary type parameters
      
      * Remove no-longer used descriptor-creation methods
      
      * Move kernel arg type definition into GridwiseGemm
      
      * Add macro to switch between code sections
      
      * Move argument field computing logic into device op side
      
      * Make utility method 'static'
      
      * Declare special methods
      
      * Unify MakeArgument() usage
      
      * Adapt the new GridwiseGemm interface
      
      * Push-down class 'GridwiseGemm::Argument' fields
      
      * Remove no-longer used methods
      
      * Add unused parameters
      
      * Force copying parameters in 'Embed' ctor
      
      * Remove no-longer used descriptors
      
      * Fallback change on BaseArgument
      
      * Remove macro 'INTEGER_DIVIDE_CEIL'
      
      * Make variable naming more consistent
      
      * Make sure methods are only invoked on right place
      
      * Remove tailing underscore in public attribute name
      
      * Remove necessary methods
      
      * Hide computing logic of derived attributes
      
      * Make new 'Embed' ctor only available for device code
      
      * Make sure 'Embed' type args are not references
      
      * Move check for karg.K into CheckValidity()
      
      * Remove more integer division logic form device code
      
      * Undo changes on Embed
      
      * Separate 'Problem' concept out from 'Argument'
      
      * Add overloaded version of __builtin_amdgcn_readfirstlane()
      
      * Remove 'static' specifiers
      
      * Remove more 'static' specifier
      
      * Replace unsigne char by std::byte
      
      * Add 'const' specifier to never changing variable
      
      * Add 'inline' specifier to funcion definition
      
      * Share same name for kernel interfaces
      
      * Fix wrong boundar calculation logic
      
      * Leave the third template arg for compatibility
      
      * Remove unnecessary parameters
      
      * Fix wrong error message (for type name)
      
      * Create descriptor on device side
      
      * Fix wrong debug message
      
      * Remove no-longer used data members
      
      * Rename type trait
      
      * Remove std:: qualifier from standard types
      
      * Replace 'size_t' by 'unsigned'
      
      * Use type alias to hint usage
      
      * Replace static_for<> by ordinary 'for' loop
      
      * Reject unsupported argument
      
      * Rename readfirstlane() to amd_wave_read_first_lane()
      
      * Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp
      
      * Update function calls
      
      * Reorder statements
      
      * Re-format files
      
      ---------
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      9eae73df
  5. 31 May, 2023 1 commit
  6. 30 May, 2023 1 commit
  7. 24 May, 2023 1 commit
    • rocking's avatar
      Pool3d fwd (#697) · 76ec0089
      rocking authored
      * Expand the base class of pool2d, prepare to share base class with pool3d
      
      * Add pool3d device op
      
      * Add pool3d f16 example
      
      * Refactor the base class. implement generic pooling in the future
      
      * clang format
      
      * get original index in max pooling
      
      * Add outputindex to base class
      
      * Fix dimension
      
      * Add pooling instance
      
      * Use indexType instead
      
      * Remove useless header
      
      * Extract IndexDataType to template
      
      * Extract pooling reference code
      
      * clang format
      
      * clang format
      
      * Fix typo
      
      * Add tensor stride
      
      * Add missing header
      
      * Add index stride and output stride
      
      * Refine naming
      
      * Add type to base class
      
      * Rename file
      
      * Use proper size
      
      * Fix typo
      
      * Refine naming
      
      * Modify the argument into vector.
      
      * Add max pool profiler
      
      * Refine naming
      
      * Support f32 pool
      
      * Fix typo
      
      * Add avg pool2d fwd in profiler
      
      * clang format
      
      * Rename AccDatatype to ComputeDatatype
      
      * Fix init
      
      * test pool
      
      * Extract variable
      
      * Add client example
      
      * Check the pooling dim
      
      * clang format
      
      * Connect argv and arg_parser
      
      * Add found check
      
      * Remove useless header
      
      * Refine naming
      
      * Adjust the order of device_pool_fwd
      76ec0089
  8. 23 May, 2023 1 commit
    • Illia Silin's avatar
      Enable gemm_dl and other kernels on Navi3x. (#714) · d821d1e5
      Illia Silin authored
      * enable dl kernels on navi3
      
      * do not build xdl tests and examples on Navi
      
      * run tests before building everything on jenkins
      
      * disable gemm_bilinear on gfx1030
      
      * add gpu targets to installer on Navi
      
      * put tests in the same order as before
      
      * reduce the number of navi targets in CI
      
      * build CI installed for gfx940 as well
      
      * only build for MI300 during QA runs
      d821d1e5
  9. 15 May, 2023 1 commit
    • Bartłomiej Kocot's avatar
      Add contraction profiler and tests (#701) · 642d5e91
      Bartłomiej Kocot authored
      * Add contraction profiler and tests
      
      * Build and style fixes
      
      * Allow to use any elementwise operator for ref_contraction
      
      * Introduce profile_contraction_scale and profile_contraction_bilinear
      
      * Make ref_contraction generic and extend interface tests
      
      * Stylistic minor fixes
      
      * Extend test_contraction_interface
      642d5e91
  10. 11 May, 2023 1 commit
  11. 03 May, 2023 1 commit
  12. 28 Apr, 2023 1 commit
  13. 24 Apr, 2023 2 commits
    • Adam Osewski's avatar
      Grouped Gemm + SplitK + simplified Kernel Args (#669) · 8bb2bb4a
      Adam Osewski authored
      
      
      * simplify karg in device/grid split-k op
      
      * fix mk_kn_mn instances
      
      * add more instances
      
      * B2C with 3D grid for KSplit
      
      * Remove unused code.
      
      * Use default B2C (3D grid) in grid gemm v2r4r2.
      
      * Device gemm splitk use B2C map.
      
      * Device GroupedGemmXdlSplitKCShuffle
      
      * Example for GroupedGemm Xdl SplitK
      
      * Introduce Device GroupedGemmSplitK
      
      * Fix updating kbatch size.
      
      * Add instance mk-nk-mn
      
      * Enable set kbatch in profiler.
      
      * Add GGemmSplitK mk-kn-mn instances
      
      * Add more instances & split into multiple files.
      
      * minor fix
      
      * tuning
      
      * clean
      
      * disabled failed instances
      
      * use pipe v2
      
      * Ignore arg on not supported arch.
      
      * fix warning
      
      ---------
      Co-authored-by: default avatarcarlushuang <carlus.huang@amd.com>
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      Co-authored-by: default avatarJing Zhang <jizhan@amd.com>
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      8bb2bb4a
    • rocking's avatar
      Revise layout of group convolution (#675) · 3eecbfb6
      rocking authored
      * [What] Remove pure conv int8 instance
      [Why] We will never use pure int8 conv in AI, use int8 quantization instead
      
      * Change layout
      
      * Share the kernel parameter
      
      * Support more type of NHWGC for group conv
      
      * Revise client example of conv 2d, use NHWGC layout
      
      * Add instance to cmake
      
      * Revise layout of group conv quantization instance
      
      * Revise layout of external api of group conv quantization
      
      * Revise layout of group conv quantization client example
      
      * Fix clang format
      
      * Add comment to describe meaning of each parameter
      3eecbfb6
  14. 10 Apr, 2023 1 commit
    • rocking5566's avatar
      Groupnorm + swish external api (#668) · ed3a2e52
      rocking5566 authored
      * Rename to proper naming
      
      * Add example of groupnorm + swish
      
      * Extract duplicate code in example
      
      * Add groupnorm + swish instances
      
      * Ractor instance generation, split into multiple cpp file
      
      * Add external api and client example
      
      * Refine profiler message
      
      * Use ck math version of exp
      
      * Refine problem size in example
      
      * Add host version of exp
      ed3a2e52
  15. 29 Mar, 2023 2 commits
  16. 23 Mar, 2023 1 commit
  17. 15 Mar, 2023 2 commits
    • rocking5566's avatar
      gemm/Conv xdlops + dlops quantization (#625) · 16dc18e0
      rocking5566 authored
      
      
      * Add conv perlayer quantization
      
      * Add gemm_dlops quantization
      
      * Support int8 for innerproduct
      
      * Refine gemm dlops int8 kernel parameter
      
      * Support gfx908(MI100) and gfx90a(MI200)
      
      * clang-format
      
      * Rename example number
      
      * Support different layout for d tensor
      
      * Add conv dlops perchannel quantization example
      
      * Move to example 40
      
      * Extract the common code for different platform (dlops and xdlops)
      
      * Move ot subfolder. Prepare to add other op of quantization
      
      * Refine the quantization instance library
      
      * Add conv dl instances and client example
      
      * Remove unnecessary type
      
      * Add gemm quantization instance
      
      * Add external api and client example
      
      * Refine num_bytes
      
      * Separete different layout to different cpp
      
      * Add more xdl instances
      
      * Revert "Remove unnecessary type"
      
      This reverts commit 820869182f6a8f62b2c9004101ba6bf76b96be14.
      
      * Remove CShuffleDataType in dlops
      Let acc and CShuffleDataType be the same in xdlops
      
      ---------
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      16dc18e0
    • Adam Osewski's avatar
      Device Op GroupedGemmMultipleD + example fp16 (#633) · a2d5ca8e
      Adam Osewski authored
      
      
      * Pass shared mem pointer as pointer to void.
      
      * Device Op GroupedGEMM Multiple D
      
      * Example for grouped gemm multiple d.
      
      * Add MI200 to supported archs.
      
      ---------
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      a2d5ca8e
  18. 09 Mar, 2023 1 commit
  19. 01 Mar, 2023 1 commit
  20. 27 Feb, 2023 1 commit
  21. 22 Feb, 2023 1 commit
    • Rostyslav Geyyer's avatar
      Add Grouped Conv Backward Weight on Navi21 for ResNet50. (#505) · 246ceee4
      Rostyslav Geyyer authored
      
      
      * Add DeviceOp and examples
      
      * Format DeviceOp template arguments
      
      * Remove bf16 example
      
      * Format
      
      * Format
      
      * Update MakeABCGridDescriptor_A_K0_M_K1_B_K0_N_K1_C_M_N
      
      * Refactor argument preparation
      
      * Update conv_bwd_weight_dl to grouped_conv_bwd_weight_dl
      
      * Rename device op file
      
      * Update include directive in the example file
      
      * Update descriptor preparation for grouped op
      
      * Update the argument
      
      * Update batch handling
      
      * Add gridwise gemm supporting batched input
      
      * Update blockwise indexing, working version
      
      * Update copyright year
      
      * Update check if argument is supported
      
      * Refactor and make consistent with xdl examples
      
      * Update check if argument is supported
      
      * Add changelog entry
      
      * Added comments on Dl op split_k>1 support
      
      ---------
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      246ceee4
  22. 15 Feb, 2023 4 commits
    • zjing14's avatar
      Add contraction_fp64 example (#570) · 24c9ee1d
      zjing14 authored
      
      
      * add contraction_bilinear
      
      * add contraction_scale_xdl_fp64
      
      * reduce tile size to avoid register spill
      
      ---------
      Co-authored-by: default avatarroot <root@ctr-ubbsmc16.amd.com>
      24c9ee1d
    • rocking5566's avatar
      Improve normalization (#580) · 6a6163a3
      rocking5566 authored
      * Sync the order of type string with template parameter
      
      * Add more instances
      
      * Check the vector size and remove redundant var
      
      * Extract var to static, prepare to separate sweep once kernel
      
      * Separate sweeponce flow and optimize the flow
      
      * 1. Rename AccDatatype in normalization to computeData
      2. Rename AccElementwiseOperation to YElementwiseOperation in normalization
      
      * Remove useless code
      
      * Update naive variance kernel
      
      * Refine string
      
      * Fix typo
      
      * Support naive variance for device_normalization
      
      * Check the blocksize
      
      * Share the VGPR of x and y
      
      * Share the VGPR of gamma and beta
      
      * Add more instances
      
      * Support fp16 sqrt for experiment
      
      * Add CHANGELOG
      
      * Fix typo
      
      * clang-format
      6a6163a3
    • Haocong WANG's avatar
      [Navi3x] Add Device Operations (#567) · 0cfda84d
      Haocong WANG authored
      * wmma_op + unit test
      
      * add arch limitation to wmma test
      
      * change arch limitation
      
      * Refactor + Add all type unit test(int4 compile failed)
      
      * Add f32_16x16x16_bf16 unit test
      
      * tempsave
      
      * tempsave
      
      * tempsave
      
      * runtime bug, cannot find symbol
      
      * workaround for incorrect HIP warpSize return value
      
      * debugging
      
      * tempsave
      
      * Correctness OK, waiting for optimization
      
      * Tidy up + format
      
      * temp save
      
      * temp save, reproduce the v_bfi_b32 issue
      
      * add inline asm for wmmaop test
      
      * tidy up
      
      * clean some debug purpose code
      
      * discard some codes
      
      * clang format
      
      * clang format
      
      * compiler issue fixed + increase tile size
      
      * navi3x_multipleD+example
      
      * temp save
      
      * workable
      
      * batchedgemm[OK], groupconv[debug]
      
      * groupconv: Sanity check[OK], Performance[Bad]
      
      * navi3x_groupconv_need_optimization
      
      * format
      
      * Add arch limitation to all wmma examples
      
      * fix bug: example30 input conv args
      0cfda84d
    • Adam Osewski's avatar
      Conv3D FWD BWD WRW fp16 fp32 client examples (#559) · e9fd1228
      Adam Osewski authored
      
      
      * Conv3d bwd weight client example.
      
      * Update year in license
      
      * Convolution bwd data 3D fp16/fp32 client example.
      
      * Client example for convnd fwd fp16 fp32
      
      * clang-format
      
      * Review remarks.
      
      * Fix compiler err.
      
      * Update data layout to standard one.
      
      * Add conv 3d fwd NDHWGC instances
      
      * clang-format
      
      * Conv3d fwd NDHWGC instances.
      
      ---------
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      e9fd1228
  23. 09 Feb, 2023 1 commit
    • rocking5566's avatar
      Gemm+layernorm instance, ckProfiler, client example (#568) · f7d28f3e
      rocking5566 authored
      * Add gemm + layernorm instance
      
      * Add ckProfiler
      
      * Add test
      
      * Add client example
      
      * Detect if user forger to set the workrspace
      
      * Use literal in the example
      
      * [What] use builtin function for sqrt
      [Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()
      
      * check gemm vaildity in IsSupportedArgument
      
      * Add more testcases
      
      * Merge duplicated folder in client example
      
      * Print more infomation
      
      * Use better kernel parameter for MS problem size
      
      * clang format
      
      * Add constexpr for if condition and remove redundant include
      
      * Remove cstdlib and add constexpr
      f7d28f3e
  24. 08 Feb, 2023 1 commit
    • ltqin's avatar
      Add GemmAddSoftmaxGemm support for MSFT ORT (instances and client API) (#576) · 332ccc33
      ltqin authored
      * add instance for gemm bias softmax gemm
      
      * add client example
      
      * change CGridDesc_G_M_N to CGridDesc_G_M_O
      
      * add gridwise
      
      * change c grid name
      
      * device add d0s data
      
      * fix 08 client_example
      
      * add example 47_fused_attention
      
      * example output correct
      
      * add d0 to example
      
      * add d0 element op
      
      * rechange instance code
      
      * change Acc0ElementwiseOperation to C0DEElementwiseOperation
      
      * change example name
      
      * update instance for cdeelementwiseop
      
      * add bhalf_t ScaleAdd
      
      * add test
      
      * not surport geem1 bias
      
      * remove some ignore
      
      * fix test bug
      332ccc33
  25. 25 Jan, 2023 1 commit
    • Qianfeng's avatar
      Batchnorm inference instances, external API, client examples and gtests (#531) · a1b2441f
      Qianfeng authored
      * File renaming and class renaming for device element-wise operation
      
      * Add batchnorm-infer instances, external API and client example
      
      * Add batchnorm-infer profiler module and gtests
      
      * Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp
      
      * Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer
      
      * Rename class and file due to conflict from device_elementwise_2d.hpp
      
      * Fix namespace in batcnnorm_infer_nhwc client example
      a1b2441f
  26. 18 Jan, 2023 4 commits
    • Qianfeng's avatar
      Use double for all scaling values and float-point constant values at the Device Op API (#557) · 52abc2f3
      Qianfeng authored
      * Use double as alpha/beta values type in reduce device op api
      
      * Use double as alpha/beta values type in softmax device op api
      
      * Use double as alpha/beta values type in multiple-reduce device op api
      
      * Use double as epsilon value type in normalization/elementwise-normalization device op api
      52abc2f3
    • Raman R jana's avatar
      Wavelet (inter-wave consumer-producer) GEMM (#310) · 1cfa8760
      Raman R jana authored
      
      
      * wavelet gemm programming model support for CK
      
      * GEMM pipeline update for wavelet progrmmaing model
      
      * Updated wavelet programming pipeline
      
      * fixes for global-write for math-wave
      
      * fixed bug in global writes
      
      * Updated comments for better readability
      
      * fixed clang format errors
      
      * added block_lds without barrier sync
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * refactor
      
      * prototype
      
      4 layouts
      
      fix default stride
      
      all problem sizes
      
      tidy
      
      move file
      
      update build script
      
      restore old file
      
      fix build
      
      * refactor standalone test to use gemm test harness
      
      * simplify gemm test
      
      * update build script
      
      * remove redundant
      
      * early return when cmd arg doesn't match
      
      * tidy
      
      * report failure when result not validated
      
      * tidy
      
      * Add comment depicting B2C mapping pattern.
      
      * Formatting & comments.
      
      * Comparison with custom B2C mapping pattern.
      
      * Example for wavelet gemm.
      
      * Add wavelet to Gemm standalone test.
      
      * Remove debug code.
      
      * Remove dangling #endif directive.
      
      Co-authored-by: root <Raman Jana>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarAnthony Chang <ac.chang@outlook.com>
      Co-authored-by: default avatarAdam Osewski <19374865+aosewski@users.noreply.github.com>
      1cfa8760
    • ltqin's avatar
      Add multiD Gemm client APIs (#534) · d66421fe
      ltqin authored
      
      
      * start add example
      
      * fix config
      
      * fix showinfo bug
      
      * add an elementop
      
      * change to padding
      
      * add xdl example
      
      * change elementwiseop
      
      * add instance
      
      * add instance to profiler
      
      * change file name
      
      * fix deive not support issue
      
      * add client example
      
      * fix client gemm_add_multiply name
      
      * change AddMultiply elementwiseop
      
      * fix elementwiseop
      
      * fix client example
      
      * fix addmultiply op
      
      * fix comments and fun name
      Co-authored-by: default avatarletaoqin <letaoqin@amd.com>
      d66421fe
    • who who who's avatar
      add multi embeddings support (#542) · 147b7db5
      who who who authored
      * add multi embeddings support
      
      * fix format
      
      * optimize sqrt
      
      * add reduce operation
      
      * change to elementwise op
      
      * fix name
      
      * rename
      
      * run ci cd
      
      * format example
      
      * format code
      
      * format code
      147b7db5
  27. 17 Jan, 2023 3 commits
    • Qianfeng's avatar
      Reduction external API and client examples (#493) · 80e05267
      Qianfeng authored
      
      
      * Change to the DeviceReduce base class template to include all problem description information
      
      * Add external api for reduction
      
      * Add client example to test the reduction external api
      
      * Spelling correction
      
      * Re-implement the host_reduction to follow the DeviceReduce base API format
      
      * Change the reduce profiler to call the external API for collecting device instances
      
      * Rename reduce client example directory from 08_reduce to 12_reduce
      
      * Remove (void) before the functional call
      
      * Tiny update in reduce client example
      
      * Tiny update in profile_reduce_impl.hpp
      
      * Rename the reduce client example directory
      Co-authored-by: default avatarPo Yen Chen <PoYen.Chen@amd.com>
      80e05267
    • rocking5566's avatar
      Gemm layernorm welford (#413) · 7829d729
      rocking5566 authored
      
      
      * Add device op of gemm layernorm
      
      * [What] Rename F to H
      [Why] F and G prepare for welford tensor
      
      * Add gridwise gemm + welford
      
      * Extract template parameter
      
      * Rename kernel. Prepare to add second half kernel
      
      * Extract var
      
      * Add second kernel for gemm+layernorm
      
      * Move to the gemm_layernorm folder
      
      * Rename F and G to mean and var
      
      * Do not use snakeCurved, it makes determination of padding  for welford difficult
      
      * Rewrite the device interface and rename some var
      
      * Add welford count
      
      * Update interface
      
      * Sync code, prepare to test on MI200
      
      * Clean the code
      
      * Implement layernorm
      
      * Add comment to mension hipFree
      
      * Wrtie out the e for debug.
      This could be remove and use h for instead
      
      * 1. Allocate mean, var and count into by SetWorkSpacePointer.
      2. Add GetWorkSpaceSize to calculate the space size
      
      * Add gemm layernorm host code
      
      * use reference layernorm
      
      * Fix bug of blockwise welford for first kernel
      
      * Fix bug of mean var padding for layernorm
      
      * Use sgpr for shuffleM_index
      
      * padding for GemmMeanVarCountGridDescriptor_M_NBlock
      
      * Add layout parameter
      
      * Check argument for gemm
      
      * calculate max count for tail block
      
      * Share E and H memory in device op
      
      * Hard code the vector dim
      
      * Refine the MakeDescriptor
      
      * 1. Remove E parameter, because E is inside of device op
      2. Check vector size
      
      * [What] Rename MakeMeanVarDescriptor_M_N
      [Why] Prepare to add count version of make descriptor
      
      * Use 1D global memory for count
      
      * Prevent redundant IO
      
      * Update parameter
      
      * Add pipeline v1/v2 selector
      
      * Rename the example name
      
      * Add base class for gemm layernorm
      
      * Refine naming to distinguish naive and welford
      
      * Add comment to explan in detail
      
      * We don't need to pad in N dimension in gemm for mean/var/count. Set NPerTile 1
      
      * Rewrite the 2st kernel, use multiple block along N dimension in layernorm kernel
      
      * Share the vector size
      
      * Refine var name
      
      * [What] Force LayernormThreadSliceSize_N = vector size.
      [Why] Memory coalesce
      
      * Add comment
      
      * Extract divisor out of the loop in reference layernorm
      
      * Pad different size for E and H in layernorm kernel according to different block tile
      
      * Refine naming
      
      * Refine naming
      
      * Prevent implicit cast
      
      * [What] use ck::math::sqrt instead of __builtin_amdgcn_sqrtf
      [Why] __builtin_amdgcn_sqrtf is only support float, double will cause casting
      
      * Cast only constant
      
      * Change of post shuffle thread descriptor
      
      * Add EMeanVarDataType parameter.
      
      * Merge the mean and var threadwise copy
      
      * Add missing index
      
      * Fix Typo
      
      * Sync the variable with previous if
      
      * 1. Declare e inside the host_gemm_layernorm()
      2. Prevent implicit cast in reference code
      Co-authored-by: default avatarPo Yen Chen <PoYen.Chen@amd.com>
      7829d729
    • Haocong WANG's avatar
      [Navi3x-LWPCK-545] Block-wise GEMM + Real GEMM_WMMA_FP16 (#541) · 919aeb1f
      Haocong WANG authored
      * wmma_op + unit test
      
      * add arch limitation to wmma test
      
      * change arch limitation
      
      * Refactor + Add all type unit test(int4 compile failed)
      
      * Add f32_16x16x16_bf16 unit test
      
      * tempsave
      
      * tempsave
      
      * tempsave
      
      * runtime bug, cannot find symbol
      
      * workaround for incorrect HIP warpSize return value
      
      * debugging
      
      * tempsave
      
      * Correctness OK, waiting for optimization
      
      * Tidy up + format
      
      * temp save
      
      * temp save, reproduce the v_bfi_b32 issue
      
      * add inline asm for wmmaop test
      
      * tidy up
      
      * clean some debug purpose code
      
      * discard some codes
      
      * clang format
      
      * clang format
      
      * compiler issue fixed + increase tile size
      919aeb1f
  28. 12 Dec, 2022 1 commit
    • arai713's avatar
      Gridwise elementwise 2d (#466) · 0e5c264c
      arai713 authored
      
      
      * added 2d gridwise elementwise
      
      * added 2d version of device elementwise
      
      * added example file with updated device elementwise call
      
      * added Cmake file
      
      * changed NumDim into 2D
      
      * fixed compiler issues
      
      * fixed indexing for loop step
      
      * fixed NumDim dimension error
      
      * changed blockID to 2D
      
      * updated Grid Desc
      
      * updated kernel call
      
      * fixed 2d thread indexing
      
      * added dimensions for example file
      
      * commented out unused code
      
      * changed vector load
      
      * removed extra code
      
      * temporarily removing vector load on 2nd dim
      
      * changed vector load back, still causing errors
      
      * altered indexing
      
      * changed isSupportedArgument for 2D
      
      * changed indexing + do/while
      
      * fixed isSupportedArgument
      
      * changed dimension for debugging
      
      * fixed
      
      * added testing printouts
      
      * testing change
      
      * added variables to distribute threads through both dimensions
      
      * testing changes
      
      * integrated variable for thread distribution into device elementwise and added as parameter for gridwise elementwise
      
      * removed most of the extraneous code, testing with different dimensions
      
      * testing
      
      * removed debugging print statements
      
      * moved 2d elementwise permute into elementwise permute directory
      
      * fixed formatting
      
      * removed debugging comments from threadwise transfer
      Co-authored-by: default avatarJing Zhang <jizhan@amd.com>
      Co-authored-by: default avatarPo Yen Chen <PoYen.Chen@amd.com>
      0e5c264c
  29. 02 Dec, 2022 1 commit
    • ltqin's avatar
      Add multiple d gridwise gemm on Navi21 for ResNet50 (#517) · 23ecf0fa
      ltqin authored
      
      
      * start add example
      
      * add multiple d fp16 example
      
      * device transfer elementwiseop to gridwise
      
      * gridwise add multiple d
      
      * change example for multiple d
      
      * fix spill registers
      
      * fix for passthrough element op
      
      * fix int8 overflow
      
      * change example file name
      
      * add instance for dl multiple d
      
      * example add DsDataType
      
      * remove grouped_convolution_forward_dl.hpp
      
      * add head file(was deleted before)
      
      * fix not support device issue
      
      * format
      
      * remove passthrough check
      Co-authored-by: default avatarletaoqin <letaoqin@amd.com>
      23ecf0fa