1. 06 Sep, 2023 1 commit
    • Bartlomiej Wroblewski's avatar
      Redesign the DPP8 GEMM kernel to use warp-wise component (#863) · 37a8c1f7
      Bartlomiej Wroblewski authored
      * Redesign the DPP8 GEMM kernel to use warp-wise component
      
      * Review: Improve error messages
      
      * Review: Remove unnecessary empty lines
      
      * Review: Fix M, N per thread names
      
      * Review: Rename mfma_input_type to dpp_input_type
      
      * Review: Fix tensor adaptor; remove unnecessary element
      
      * Review: Remove calls to dpp_gemm's MakeCDescriptor
      
      * Review: Add blockwise doc, change function names to include dimension names
      
      * Review: Remove duplicated code; Move Block2CtileMap alias to the top of the file
      
      * Review: Add __restrict__ keywords
      
      * Review: Use MatrixPadder for padding A, B, C matrices
      
      * Review: Remove hardcoded datatypes
      
      * Review: Change names from FloatX to XDataType
      
      * Review: Introduce AK0 and BK0 instead of a single K0
      
      * Review: Remove construction of dpp_datatypes object
      
      * Review: Rename DppInstrRunner to DppLanegroupGemm
      37a8c1f7
  2. 31 Aug, 2023 1 commit
    • rocking's avatar
      MaxPool & AvgPool bwd instances, test, ckProfiler, client example (#861) · 866377de
      rocking authored
      * Add maxpool instances
      
      * Rename index pool to max pool.
      
      * Add maxpool bwd bf16 instances
      
      * Add avg pool bwd instances
      
      * Rename avgpool and maxpool to avg_pool3d and max_pool
      
      * Add bf16 pool fwd instances
      
      * Add max pool bwd to ckProfiler
      
      * Add avg pool3d bwd to ckProfiler
      
      * Add avg pool bwd test
      
      * Fix bug of reference pool fwd (dilation)
      
      * Fix bug of max pool bwd  (dilation and initZero)
      
      * Support bf16 compute data type
      
      * Force compute type be f32. Because atomicAdd only support f32
      
      * Add max pool bwd test
      
      * Rename folder
      
      * Rename pool
      
      * Add max pool bwd client example
      
      * Add avg pool bwd client example
      
      * Add missing workspace
      
      * clang format
      
      * Rename macro
      
      * remove useless header
      
      * remove useless layout
      866377de
  3. 17 Aug, 2023 1 commit
  4. 14 Aug, 2023 1 commit
  5. 27 Jul, 2023 1 commit
  6. 26 Jul, 2023 1 commit
    • carlushuang's avatar
      initial stream-k implementation with example (#699) · e7dca79d
      carlushuang authored
      
      
      * initial stream-k implementation with example
      
      * fix unexpected change in err
      
      * improve a little bit performance by reorganize pipeline.
      
      * improve perf a little bit by swizzle block idx
      
      * add profiler
      
      * update example
      
      * fix spelling
      
      * shrink karg for streamk
      
      * support dynamic buffer using memory coherence glc_slc bit from template
      
      * control memory coherence while construct dynamic buffer
      
      * update reduction for streamk(not ready yet)
      
      * Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting
      
      * fix build issue
      
      * fix several bug
      
      * now result is correct, everything works (but has scratch)
      
      * remove scratch by manually reset coordinate
      
      * update device code
      
      * fix a bug in final reduce
      
      * fix something in example
      
      * update async memset
      
      * fix enum as camel case
      
      * modify coherence enum name
      
      * clean code and use atomic streamk by default
      
      * remove unused var
      
      * throw exception if have empty pointer
      
      * fix format
      
      * fix CI warning
      
      * fix type in init
      
      * modify CI error
      
      * filter out on gfx10+
      
      * restore changed example code
      
      ---------
      Co-authored-by: default avatarQianfeng Zhang <Qianfeng.Zhang@amd.com>
      e7dca79d
  7. 18 Jul, 2023 1 commit
  8. 06 Jul, 2023 1 commit
    • Qianfeng's avatar
      Batchnorm splitk single kernel (#771) · 8f5cafaf
      Qianfeng authored
      * Use dim 0 as faster dim for writing mean/var/count workspace in batchnorm multiblock method [performance]
      
      * Add CountDataType as template parameter in blockwise_welford
      
      * Add utility/get_shift.hpp
      
      * Add BatchNorm multiblock single-kernel implementation
      
      * Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a
      
      * Renaming in device_batchnorm_forward_impl.hpp
      
      * Tiny fix in the batchnorm_fwd profiler
      
      * Revert "Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a"
      
      This reverts commit d16d00919c43f10759e7b4e4d112125221ed9064.
      
      * Use the old two-kernel batchnorm multiblock method for gfx1030
      
      * Use the old two-kernel batchnorm multiblock method for gfx908
      
      * use the single-kernel batchnorm multiblock method only for gfx90a
      
      * Remove get_wave_id() from utility/get_id.hpp since it is not used
      
      * Set true for testing running mean/variance and saving mean/invvariance in the examples
      
      * Fix to copy-right words
      
      * Remove un-needed including in utility/get_id.hpp
      
      * Add comments to workgroup_synchronization.hpp
      
      * Remove un-used codes in gridwise_multiblock_batchnorm_forward.hpp
      
      * Renaming in the kernels
      
      * Remove un-used kernel file
      8f5cafaf
  9. 05 Jul, 2023 2 commits
  10. 19 Jun, 2023 1 commit
    • Rostyslav Geyyer's avatar
      FP8 enablement - add a pseudorandom number generator, add conversion methods (#708) · f0c620c4
      Rostyslav Geyyer authored
      * Add basic fp8 definitions and prn-generator
      
      * Format
      
      * Add fp8<->fp32 type_convert
      
      * Format
      
      * Split type_convert and cast_to/from_f8
      
      * Format
      
      * Minor fix
      
      * Minor fix
      
      * Move fp8 utils to a separate header
      
      * Add elementwise ops
      
      * Add fp8_convert_sr
      
      * Format
      
      * Add element op
      
      * Eliminate magic numbers
      
      * Split f8_convert_sr in host and device
      
      * Format
      
      * Add some constexpr
      
      * Add a datatype test
      
      * Format
      
      * Another format
      
      * Add fp8<->fp16 tests
      
      * Update type_converts
      
      * Format
      
      * Add fp16 casting functions
      
      * Format
      
      * Use seed as a runtime arg
      
      * Use element location for PRNG
      
      * Format
      
      * Add fp8<->fp16 to PassThrough element op
      
      * Clean up
      
      * Merge host and device implementations
      
      * Add comments on rounding modes
      
      * Remove leftover code
      
      * Put type_converts into a separate header
      
      * Put random number gen to a separate header
      
      * Rearrange f8_utils' namespaces
      
      * Refactor type_convert.hpp
      
      * Move f8_t definition
      f0c620c4
  11. 15 Jun, 2023 1 commit
    • Illia Silin's avatar
      Enable gfx941 and gfx942 architectures. (#752) · 027e46ee
      Illia Silin authored
      * enable gfx941/942 targets
      
      * fix clang format
      
      * fix the cmake logic for multiple targets
      
      * fix cmake syntax for looping over targets
      
      * add gfx941/942 support for gemm_xdl instances
      027e46ee
  12. 12 Jun, 2023 1 commit
    • Po Yen Chen's avatar
      Fix incomplete object size (=4n + 3) support of amd_wave_read_first_lane() (#738) · 7c24654c
      Po Yen Chen authored
      * Fix wrong pointer type
      
      * Rename type trait get_unsigned_int<> to get_carrier<>
      
      * Add 3-bytes carrier type
      
      * Add missing __device__ specifier
      
      * Rename template non-type parameter
      
      * Leave the rest byte uninitialized
      
      * Avoid invoking (host) STL algorithms
      
      * Remove unnecessary 'inline' specifier
      
      * Extract common logic out as helper method
      
      * Hide dummy member function
      
      * Add missing __device__ specifier
      7c24654c
  13. 08 Jun, 2023 1 commit
  14. 31 May, 2023 2 commits
    • Illia Silin's avatar
      update copyright headers (#726) · b94fd0b2
      Illia Silin authored
      b94fd0b2
    • Po Yen Chen's avatar
      Add class type support for __builtin_amdgcn_readfirstlane() (#711) · 582e31e8
      Po Yen Chen authored
      * Add overloaded version of __builtin_amdgcn_readfirstlane()
      
      * Remove 'static' specifiers
      
      * Remove more 'static' specifier
      
      * Replace unsigne char by std::byte
      
      * Add 'const' specifier to never changing variable
      
      * Add 'inline' specifier to funcion definition
      
      * Fix wrong boundar calculation logic
      
      * Rename type trait
      
      * Remove std:: qualifier from standard types
      
      * Replace 'size_t' by 'unsigned'
      
      * Use type alias to hint usage
      
      * Replace static_for<> by ordinary 'for' loop
      
      * Rename readfirstlane() to amd_wave_read_first_lane()
      
      * Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp
      
      * Reorder statements
      582e31e8
  15. 24 May, 2023 1 commit
  16. 04 May, 2023 1 commit
    • Rostyslav Geyyer's avatar
      Optimize bf16 conversion (#664) · b076a02a
      Rostyslav Geyyer authored
      * Add TypeConvert class and start refactoring
      
      * Refactor TypeConvert as a struct
      
      * Get back to template functions type_convert
      
      * Add a type_convert_bf16_rtn, set rtz as default
      
      * Clean up
      
      * Add UnaryConvertPrecision struct for high-precision workloads
      
      * Format
      
      * Update type_convert to UnaryConvert on threadwise level
      
      * Update UnaryConvertPrecision
      
      * Format
      
      * Fix chmod
      
      * Add a flag to pick converion method
      
      * Format
      
      * Remove the added flag
      
      * Merge elementwise op with type conversion
      
      * Move type_convert to elemwise op, update the op
      
      * Update type_convert_precision -> bf16_convert_rtn
      
      * Clean up
      
      * Update comments
      
      * Update the CK_WORKAROUND_DENORM_FIX flag handling
      
      * Update the unneeded op to work but warn user
      
      * Remove the message
      
      * Use a PassThrough instead of ConvertBF16RTN to calcaulate reference
      
      * Format
      
      * Add missing include
      b076a02a
  17. 28 Apr, 2023 1 commit
  18. 10 Apr, 2023 1 commit
    • rocking5566's avatar
      Groupnorm + swish external api (#668) · ed3a2e52
      rocking5566 authored
      * Rename to proper naming
      
      * Add example of groupnorm + swish
      
      * Extract duplicate code in example
      
      * Add groupnorm + swish instances
      
      * Ractor instance generation, split into multiple cpp file
      
      * Add external api and client example
      
      * Refine profiler message
      
      * Use ck math version of exp
      
      * Refine problem size in example
      
      * Add host version of exp
      ed3a2e52
  19. 29 Mar, 2023 2 commits
  20. 20 Mar, 2023 1 commit
  21. 15 Mar, 2023 2 commits
    • rocking5566's avatar
      gemm/Conv xdlops + dlops quantization (#625) · 16dc18e0
      rocking5566 authored
      
      
      * Add conv perlayer quantization
      
      * Add gemm_dlops quantization
      
      * Support int8 for innerproduct
      
      * Refine gemm dlops int8 kernel parameter
      
      * Support gfx908(MI100) and gfx90a(MI200)
      
      * clang-format
      
      * Rename example number
      
      * Support different layout for d tensor
      
      * Add conv dlops perchannel quantization example
      
      * Move to example 40
      
      * Extract the common code for different platform (dlops and xdlops)
      
      * Move ot subfolder. Prepare to add other op of quantization
      
      * Refine the quantization instance library
      
      * Add conv dl instances and client example
      
      * Remove unnecessary type
      
      * Add gemm quantization instance
      
      * Add external api and client example
      
      * Refine num_bytes
      
      * Separete different layout to different cpp
      
      * Add more xdl instances
      
      * Revert "Remove unnecessary type"
      
      This reverts commit 820869182f6a8f62b2c9004101ba6bf76b96be14.
      
      * Remove CShuffleDataType in dlops
      Let acc and CShuffleDataType be the same in xdlops
      
      ---------
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      16dc18e0
    • Haocong WANG's avatar
      Fix arch limitation bug (#639) · ea028ac6
      Haocong WANG authored
      ea028ac6
  22. 10 Mar, 2023 1 commit
  23. 09 Mar, 2023 1 commit
  24. 15 Feb, 2023 1 commit
    • rocking5566's avatar
      Improve normalization (#580) · 6a6163a3
      rocking5566 authored
      * Sync the order of type string with template parameter
      
      * Add more instances
      
      * Check the vector size and remove redundant var
      
      * Extract var to static, prepare to separate sweep once kernel
      
      * Separate sweeponce flow and optimize the flow
      
      * 1. Rename AccDatatype in normalization to computeData
      2. Rename AccElementwiseOperation to YElementwiseOperation in normalization
      
      * Remove useless code
      
      * Update naive variance kernel
      
      * Refine string
      
      * Fix typo
      
      * Support naive variance for device_normalization
      
      * Check the blocksize
      
      * Share the VGPR of x and y
      
      * Share the VGPR of gamma and beta
      
      * Add more instances
      
      * Support fp16 sqrt for experiment
      
      * Add CHANGELOG
      
      * Fix typo
      
      * clang-format
      6a6163a3
  25. 09 Feb, 2023 1 commit
    • rocking5566's avatar
      Gemm+layernorm instance, ckProfiler, client example (#568) · f7d28f3e
      rocking5566 authored
      * Add gemm + layernorm instance
      
      * Add ckProfiler
      
      * Add test
      
      * Add client example
      
      * Detect if user forger to set the workrspace
      
      * Use literal in the example
      
      * [What] use builtin function for sqrt
      [Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()
      
      * check gemm vaildity in IsSupportedArgument
      
      * Add more testcases
      
      * Merge duplicated folder in client example
      
      * Print more infomation
      
      * Use better kernel parameter for MS problem size
      
      * clang format
      
      * Add constexpr for if condition and remove redundant include
      
      * Remove cstdlib and add constexpr
      f7d28f3e
  26. 18 Jan, 2023 1 commit
    • Raman R jana's avatar
      Wavelet (inter-wave consumer-producer) GEMM (#310) · 1cfa8760
      Raman R jana authored
      
      
      * wavelet gemm programming model support for CK
      
      * GEMM pipeline update for wavelet progrmmaing model
      
      * Updated wavelet programming pipeline
      
      * fixes for global-write for math-wave
      
      * fixed bug in global writes
      
      * Updated comments for better readability
      
      * fixed clang format errors
      
      * added block_lds without barrier sync
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * refactor
      
      * prototype
      
      4 layouts
      
      fix default stride
      
      all problem sizes
      
      tidy
      
      move file
      
      update build script
      
      restore old file
      
      fix build
      
      * refactor standalone test to use gemm test harness
      
      * simplify gemm test
      
      * update build script
      
      * remove redundant
      
      * early return when cmd arg doesn't match
      
      * tidy
      
      * report failure when result not validated
      
      * tidy
      
      * Add comment depicting B2C mapping pattern.
      
      * Formatting & comments.
      
      * Comparison with custom B2C mapping pattern.
      
      * Example for wavelet gemm.
      
      * Add wavelet to Gemm standalone test.
      
      * Remove debug code.
      
      * Remove dangling #endif directive.
      
      Co-authored-by: root <Raman Jana>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarAnthony Chang <ac.chang@outlook.com>
      Co-authored-by: default avatarAdam Osewski <19374865+aosewski@users.noreply.github.com>
      1cfa8760
  27. 17 Jan, 2023 2 commits
    • Qianfeng's avatar
      Reduction external API and client examples (#493) · 80e05267
      Qianfeng authored
      
      
      * Change to the DeviceReduce base class template to include all problem description information
      
      * Add external api for reduction
      
      * Add client example to test the reduction external api
      
      * Spelling correction
      
      * Re-implement the host_reduction to follow the DeviceReduce base API format
      
      * Change the reduce profiler to call the external API for collecting device instances
      
      * Rename reduce client example directory from 08_reduce to 12_reduce
      
      * Remove (void) before the functional call
      
      * Tiny update in reduce client example
      
      * Tiny update in profile_reduce_impl.hpp
      
      * Rename the reduce client example directory
      Co-authored-by: default avatarPo Yen Chen <PoYen.Chen@amd.com>
      80e05267
    • Haocong WANG's avatar
      [Navi3x-LWPCK-545] Block-wise GEMM + Real GEMM_WMMA_FP16 (#541) · 919aeb1f
      Haocong WANG authored
      * wmma_op + unit test
      
      * add arch limitation to wmma test
      
      * change arch limitation
      
      * Refactor + Add all type unit test(int4 compile failed)
      
      * Add f32_16x16x16_bf16 unit test
      
      * tempsave
      
      * tempsave
      
      * tempsave
      
      * runtime bug, cannot find symbol
      
      * workaround for incorrect HIP warpSize return value
      
      * debugging
      
      * tempsave
      
      * Correctness OK, waiting for optimization
      
      * Tidy up + format
      
      * temp save
      
      * temp save, reproduce the v_bfi_b32 issue
      
      * add inline asm for wmmaop test
      
      * tidy up
      
      * clean some debug purpose code
      
      * discard some codes
      
      * clang format
      
      * clang format
      
      * compiler issue fixed + increase tile size
      919aeb1f
  28. 12 Jan, 2023 1 commit
    • Qianfeng's avatar
      Remove including of cmath (#551) · a17b0414
      Qianfeng authored
      * Let cmath included when compiling host codes in math_v2.hpp
      
      * Remove including of cmath in device_base.hpp and device_permute.hpp
      a17b0414
  29. 07 Dec, 2022 1 commit
  30. 02 Dec, 2022 1 commit
  31. 15 Nov, 2022 1 commit
    • guangzlu's avatar
      Add BF16 tests for batched_gemm_softmax_gemm_permute (#504) · 4c4c7328
      guangzlu authored
      
      
      * fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm
      
      * added bf16 tests for batched_gemm_softmax_gemm_permute
      
      * changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
      
      * changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
      
      * aligned annotations
      
      * modified CMakeLists for examples
      
      * add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl
      
      * use macro to control the instances
      
      * added macro control into instances
      
      * clang-format some files
      
      * changed error tolerance for bf16
      
      * changed index for 10_elementwise_normalization
      
      * fixed xdlops code bug in amd_xdlops.hpp
      Co-authored-by: default avatarPo Yen Chen <PoYen.Chen@amd.com>
      4c4c7328
  32. 20 Sep, 2022 1 commit
    • Po Yen Chen's avatar
      Add 'Permute' device op & example (#408) · f584ab0c
      Po Yen Chen authored
      * Add example folder for 'DeviceElementwise'
      
      * Re-structure example files
      
      * Move common parts into common.hpp
      
      * Use more strict input
      
      * Add more helper methods in 'DeviceElementwise'
      
      * Use more specific method to write example
      
      * Allow specify problem through command line argument
      
      * Allow specify problem 'axes' through command line argument
      
      * Add check to template type argument
      
      * Add transpose_shape() to generalize shape permute
      
      * Generalize transpose utility functions
      
      * Use better name for tensor indices
      
      * Add checks in helper functions
      
      * Remove debug messages
      
      * Refine error message for check_err()
      
      * Generalize variable naming in example code
      
      * Add device op 'DevicePermute'
      
      This device op is clone of 'DeviceElementwise'
      
      * Use 'DevicePermute' device op in example
      
      * Remove 'elementwise' from identifiers
      
      * Remove 'elementwise' from file paths
      
      * Remove base class of 'DevicePermute'
      
      * Let 'DevicePermute' inherit from 'BaseOperator'
      
      * Add simple type traits to validate device op type
      
      * Add static_assert() to check type constraints
      
      * Create 'DevicePermuteBase' to generate methods
      
      * Use indirect base type to generate methods
      
      * Remove 'is_device_op<>' type traits
      
      * Only accept single-input-single-output for 'DervicePermute'
      
      * Simplify 'DevicePermute' interface
      
      * Re-format 'DeviceElementwise'
      
      * Use CRTP to generate overridden virtual method
      
      * Remove unnecessary include directives
      
      * Distinguish input & output shape in 'DevicePermute'
      
      * Passing 'axes' to 'DevicePermute'
      
      * Use more reasonable return value for Invoker::Run()
      
      * Add 'GridwisePermute' kernel
      
      This kernel is a clone of 'GridwiseElementwise_1D'
      
      * Remove no-longer used type argument
      
      * Check if input/output shape meet the requirement
      
      * Remove no-longer used method
      
      * Remove never-entered-if-clause
      
      * Change problem description for 'DevicePermute'
      
      * Transform descriptor into 3 dimensions
      
      * Add debug code the verify result
      
      * Add comment to indicate template argument location
      
      * Add N/H/WPerBlock template parameter to 'DevicePermute'
      
      * Rename 'GridwisePermute' to 'GridwiseCopy'
      
      * Check tensor descriptor dimensions in 'GridwiseElementwise_1D'
      
      * Add missing include directive
      
      * Add 'BlockSize' parameter to 'DevicePermute'
      
      * Remove no-longer used method
      
      * Add 'BlockToTileMap' for 'GridwiseCopy'
      
      * Use the normal Block2TileMap convention
      
      * Rename 'BlockToTileMap' as 'Block2TileMap'
      
      * Fix most of compilation errors
      
      * Let 'Block2TileMap' map block to 2d coordinate
      
      * Allow data transfer in 'GridwiseCopy'
      
      * Fix wrong output descriptor for 2nd blockwise copy
      
      * Rename 'GridwiseCopy' as 'GridwisePermute'
      
      * Remove '1d' in identifiers
      
      * Remove commented-out codes
      
      * Remove 'MPerThread' template parameter
      
      * Seperate template parameters
      
      * Unify variable namming convention
      
      * Use more verbose way to create expressions
      
      * Add template parameter 'InBlockLdsExtraW'
      
      * Release the constraint on In/OutGridDesc
      
      * Use date type directly as template argument
      
      * Re-arrange template arguments for blockwise copy
      
      * Remove no-longer used template parameters
      
      * Embed layout in the variable names
      
      * Add GridwisePermute::CheckValidity()
      
      * Extract local types as template parameters
      
      * Rename local type alias
      
      * Add more template parameters (vector width related)
      
      * Calculate new SrcVectorDim/DstVectorDim after merge descriptor dimensions
      
      * Fill tensor values start from 1
      
      * Re-formate example code
      
      * Avoid too-large block id
      
      * Add comment
      
      * Make sure 'SrcVectorDim' is not same as 'DstVectorDim'
      
      * Add check for the 'VectorDim' & 'ScalarPerVector' template params
      
      * Let 'DstVectorDim' equals 'SrcVectorDim' after transpose out grid desc
      
      * Remove no-longer used template parameter 'NPerBlock'
      
      * Fix wrong descriptor creation logics
      
      * Specify problem in each examples
      
      * Use better example name
      
      * Add new example 'example_permute_NxHxW_fp32'
      
      * Add example for demonstrating bundle multiple elems in tensor
      
      * Add support to permute multiple elements together
      
      * Change the default problem size
      
      * Add span<> class template
      
      * Use span<> to generalize check_err() interface
      
      * Fix ambiguous ctor call
      
      * Avoid create necessary objects
      
      * Use helper functions to simplify example code
      
      * Add example for 4xfp16 permute
      
      * Disable failed-to-compile example
      
      * Add check for the NUM_ELEMS_IN_BUNDLE
      
      * Remove redundant parameter in helper lambda function
      
      * Add check for the input tensor type's byte-size
      
      * Check scalar-per-vector with padded length
      
      * Use more verbose name to avoid name collision
      
      * Use fixed 'VectorDim' & 'ScalarPerVector' for LDS
      
      * Embed shape info in name of descriptor constructor
      
      * Rename example folder '36_permute' into '37_permute'
      
      * Avoid using too-large LDS in kernel code
      
      * Remove redundant example
      
      * Usw switch() to group similar codes
      
      * Add const to the span<> type arguement
      
      * Simply initialize tensor with floating point values
      
      * Use fp16 as data type in all examples
      
      * Enlarge tensor size in example
      
      * Enalrge N-dim in example
      
      * Add check for the bundled type in example
      
      * Use more stricter error threshold
      
      * Remove global load/store loop in kernel code
      
      * Measure execution time by default
      
      * Use faster device op config for example 'NxHxW_fp16'
      
      * Use faster device op config for example '1xHxW_fp16'
      
      * Use faster device op config for example 'HxWx4_fp16'
      
      * Remove cmd arg parsing logics
      
      * Rename functions
      
      * Extract bundle permutation logic out
      
      * Simplify permute bundle example
      
      * Add Tensor<>::GetElementSpaceSizeInBytes()
      
      * Add Tensor<>::data()
      
      * Use new methods to simplify code
      
      * Use type alias to replace duplicated code
      
      * Use existing method to shorten code
      
      * Allow FillUniformDistribution accept range arugment
      
      * Intialize random values in range
      
      * Add Tensor<>::size()
      
      * Use more meaningful names in permute bundle example
      
      * Use more meaningful names in permute element examples
      
      * Use rangified copy() to copy elements
      
      * Use function return value directly to eliminate variables
      
      * Add to_array() conversion tool to eliminate more variables
      
      * Add Tensor<>::AsSpan<>() to create view of tensor values
      
      * Use AsSpan() to shorten check_err() calls
      
      * Remove no-longer-used 'using' directives
      
      * Move 'using' directive to proper code position
      
      * Remove redudant variables
      
      * Remove useless static_assert()
      
      * Add check for range types
      
      * Declare variable right before first use
      
      * Move long return type as tailing return type
      
      * Add BaseInvokerCRTP<> class template to generate method
      
      * Create new base type for 'DervicePermute' implementations
      
      * Move 'NumDim' template param to the first
      
      * Rename 'DevicePermute' to 'DevicePermuteImpl'
      
      * Add 'noexcept' specifier to CRTP generated method
      
      * Move 'Block2TileMap' definition into 'GridwisePermute'
      
      * Use type alias to reduce code
      
      * Unify naming style in 'DevicePermute'
      
      * Add comments in 'GridwisePermute'
      
      * Rename permute example folder
      
      * Use std::cerr to report error
      
      * Use larger shape in examples
      
      * Rename '38_permute' to '39_permute'
      
      * Make sure we use unsigned type for shape & indices
      
      * Remove opt-ed out assertion
      
      * Remove template BaseInvokerCRTP<>
      f584ab0c
  33. 19 Sep, 2022 2 commits
    • Anthony Chang's avatar
    • Shaojie WANG's avatar
      Conv bwd data multiple d (#404) · 27858374
      Shaojie WANG authored
      
      
      * init commit of convnd bwd data
      
      * begin compiling example
      
      * have a first version that produce a right result
      
      * refine device level launch kernel code
      
      * add more instances in example and get right results
      
      * clang-format
      
      * format example file
      
      * add more instances
      
      * fix instances
      
      * adding conv_bwd_data multile_d
      
      * adding conv_bwd_data multile_d
      
      * adding conv_bwd multiple d
      
      * adding conv_bwd multiple d
      
      * adding conv_bwd multiple d
      
      * refactor
      
      * refactor
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * refactor
      
      * update conv fwd's bias impl
      
      * refactor
      
      * reorg file
      
      * clean up cmake
      
      * clean
      
      * clean
      
      * clean
      Co-authored-by: default avatarChao Liu <lc.roy86@gmail.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      27858374
  34. 09 Sep, 2022 1 commit