1. 30 Jun, 2022 1 commit
    • Anthony Chang's avatar
      Standalone sweep once softmax kernel w/ ckProfiler (#295) · 93c99f3d
      Anthony Chang authored
      * use 'sweep once' softmax kernel where applicable
      
      * threadwise copy's dst buffer can specify invalid element value
      
      * add int8 in/out float compute softmax support
      
      give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error
      
      * format
      
      * softmax inherits DeviceNormalization
      
      * softmax profiler stub
      
      * tighten up reference softmax interface
      
      * example prints tensor dimension
      
      * add fp32 to softmax profiler
      
      * rename header
      
      * hook with ckProfiler
      
      * format
      
      * resolve merge conflict
      
      * resolve merge conflicts
      
      * update normalization profiler help string
      
      * resolve conflict
      
      * typo
      
      * remove residual
      
      * softmax profiler: address feedback
      
      * test for mixed precision input/output
      
      * fully qualify ck::math::isnan
      
      * add comment for device normalization interface
      
      * revise wording
      
      * constness for alpha/beta scaler pointer
      93c99f3d
  2. 25 Jun, 2022 2 commits
    • Chao Liu's avatar
      add license in file (#303) · d3051d75
      Chao Liu authored
      d3051d75
    • Chao Liu's avatar
      Absolute include path (#281) · d1db6a0c
      Chao Liu authored
      * ad gelu and fast_gelu
      
      * added GeLU and fast GeLU
      
      * clean up
      
      * add gemm+fastgelu example
      
      * add gemm+gelu instances
      
      * update profiler
      
      * clean up
      
      * clean up
      
      * adding gemm+bias+activation
      
      * clean
      
      * adding bias
      
      * clean
      
      * adding gemm multiple d
      
      * debugging
      
      * add gemm bias add fastgelu
      
      * rename, clean
      
      * refactoring; add readme
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * fix
      
      * fix
      
      * update example
      
      * update example
      
      * rename
      
      * update example
      
      * add ckProfiler
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * add client app example
      
      * update readme
      
      * delete obselete files
      
      * remove old client app
      
      * delete old file
      
      * cleaning
      
      * clean
      
      * remove half
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path for all examples
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * revert client app example
      
      * clean build
      
      * fix build
      
      * temporary disable client test on Jenkins
      
      * clean
      
      * clean
      
      * clean
      d1db6a0c
  3. 21 Jun, 2022 1 commit
    • Anthony Chang's avatar
      Standalone softmax kernel (#284) · 15c89e81
      Anthony Chang authored
      * initial stub for standalone softmax
      
      * start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m
      
      * host softmax validates
      
      * compiles; to implement beta scaling
      
      * use NaN trick to efficiently ignore OOB values during sum of exponentials
      
      * freeload device_reduce's utility functions
      
      * clean up interface
      
      * adding prior value (beta scaling)
      
      * remove restriction related to perf considerations
      
      * apply clang-format
      
      * clean; disable diagnostics
      
      * resolve conflicts
      
      * add exp wrapper
      
      * honor HostTensorDesc interface; allow implicit cast from different vector<T> type
      
      * test softmax for fp16/fp32
      
      * update readme
      
      * amend commit NaN trick
      
      * remove redundant param added during development
      
      * format
      
      * replace ScalarDataType with AccDataType
      
      * separate out test programs by precision type
      
      * move softmax sample code to its own folder
      
      * format
      
      * keep up with recent changes in reduction API
      
      * remove extra header
      15c89e81
  4. 02 Jun, 2022 1 commit
    • Qianfeng's avatar
      Unify the naming of the math functions used by the host and kernel (#262) · 86185bd7
      Qianfeng authored
      * Use the unified naming for math functions on host and HIP kernel
      
      * Corresponding change/simplification in reduction host/profiler/examples due to unified math functions renaming
      
      * Renaming GetReductionZeroVal() to GetIdentityValue()
      
      * Tiny renaming in profile_reduce_impl.hpp
      
      * More renaming in profile_reduce_impl.hpp
      
      * Replace zeroVal by identiyVal
      
      * Remove ck_ prefix in the naming of ck::math provided functions
      86185bd7
  5. 09 Mar, 2022 1 commit
    • Chao Liu's avatar
      Reorganize files, Part 1 (#119) · 5d37d7bf
      Chao Liu authored
      * delete obselete files
      
      * move files
      
      * build
      
      * update cmake
      
      * update cmake
      
      * fix build
      
      * reorg examples
      
      * update cmake for example and test
      5d37d7bf
  6. 05 Mar, 2022 1 commit
    • Qianfeng's avatar
      Reduction in Composable Kernel (#82) · e17c0d80
      Qianfeng authored
      
      
      * Initial adding of generic reduction
      
      * Initial adding of generic reduction ...
      
      * Updates to make compiling done
      
      * clang-format all files
      
      * clang-format some files again
      
      * Renaming in profiler/include/profile_reduce.hpp
      
      * Updates and make BlockWise cases passed
      
      * Updates and make ThreadWise and MultiBlockTwoCall cases passed
      
      * Remove the support for MUL and NORM1 reduceOp from the profiler and the device instances
      
      * Change to replace the dim0_max_vector_size/dim1_max_vector_size template argument in the device reduce classes
      
      * format
      
      * adding pooling
      
      * added max and average pooling
      
      * comment out cout and kernel timing
      
      * Tiny simplification in profiler/reduce_profiler.cpp
      
      * Add example for reduce_blockwise
      
      * Tiny updates
      
      * Change to pass the ElementWiseOp from device layer to kernel
      
      * Fix the vectorDim and vectorSize in Device layer
      
      * Enable vector load on both dim0 and dim1 for Threadwise method
      
      * Tiny updates
      
      * Change to let the user to pass the preUnaryOp and posUnaryOp
      
      * Make pooling example work
      
      * split device_reduce_instance into two libraries
      
      * Tiny update
      
      * Replace nanPropaOpt enum by boolean propagate_nan
      
      * Simplification in DeviceReduce layer codes
      
      * update build
      
      * Change to clarify the difference between ck::half_t and half_float::half
      
      * Renaming in all the reduction codes
      
      * Add VectorSize as template parameter for device layer
      
      * Add BetaIsZero as kernel template and as AccDataType for alpha
      
      * print
      
      * Small updates for pooling
      
      * Updates for host_generic_reduction for reference
      
      * Update to make AVG pooling pass
      
      * Update to make MAX pooling with indices output pass
      
      * fix
      
      * add OutDst vector store to threadwise reduction and pooling
      
      * tweak
      
      * turn off check_indices that caused build issue
      
      * refactor pooling
      
      * clean up
      
      * turn off check_indices for building issue for php-compiler
      
      * add more tile size for odd C
      
      * tweak conv for odd C
      
      * update script
      
      * clean up elementwise op
      
      * add hack in reduction_operator.hpp to avoid compile error. To fix it, need to use element_wise_op in reduction op
      
      * Add OutVectorSize as device and kernel tunable, also update to Elementwise Operations
      
      * Move reduce operator mapping to host layer file reduction_operator_mapping.hpp from reduction_operator.hpp
      
      * Change to the unary operators
      
      * Move the definitions of unary operations to element_wise_operation.hpp
      
      * re-org files
      
      * Refine in device interfaces and multiblock kernels
      
      * Split the reduction configurations into instances for specific methods
      
      * Update in getTypeString() of device pool2d
      
      * Renaming in host and kernel
      
      * Tiny update in profiler/src/profiler.cpp
      
      * Uncomment in device_operation/CMakeLists.txt to enable the building of all operations
      
      * Make check_indices a templated function to remove some linking issue
      
      * Renaming in the profiler reduce module
      
      * Add support for double Reduction (but disable MultiblockAtomicAdd for double)
      
      * Tiny correction of literal string
      
      * Rename DevicePoolFwd to DevicePool2dFwd
      
      * Split device_reduce_instance_xxx.cpp files according to the data types to speed up compiling
      
      * Add comments for lists of configurations, lists of instances and references of add_reduce_instances_xxx
      
      * Remove un-used header file gridwise_generic_reduction_wrapper_common.hpp
      
      * Renaming and refining in the Reduction codes
      
      * Tiny change in the unary operators
      
      * Renaming symbols and files
      
      * Renaming symbols in the kernels
      
      * Move kernel kernel_set_buffer_value to separate file
      
      * Add IndexDataType template parameter for kernels and use int32_t as index data type in device layer
      
      * Tiny update in the kernels
      
      * Remove definition of sqrtf()/isnan()/abs() for half_t due to some ADL issue
      
      * Simplify a helper function in device layer
      
      * Tiny adjustment in testing data initialization
      
      * Renaming in kernel/device/host
      
      * Add two testing scripts for reduction
      
      * Refine the Unary operators in element_wise_operation.hpp
      
      * Update in the reduce profiler module
      
      * Update to the reduction testing scripts
      
      * reduce compile parallelism
      
      * change CI docker to rocm5.0
      
      * remove unused variables
      
      * fix build
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      e17c0d80
  7. 27 Aug, 2021 1 commit
    • Qianfeng's avatar
      [SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction (#1108) · 9e80cdce
      Qianfeng authored
      
      
      * add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
      
      * make inner product compatible on gfx900
      
      * Update src/include/miopen/solver/ck_utility_common.hpp
      
      * compiler parameter use stream
      
      * use int instead of index_t in kernel wrapper
      
      * DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
      
      * Add dynamic generic reduction kernel layer (kernel wrappers, kernel implementations and utilities)
      
      * Some updates to dynamic composable kernel facility for the need of dynamic generic reduction
      
      * Update to generic reduction C++ host interface layer to support dynamic generic reduction
      
      * Update to remove tidy complaints in host interface layer
      
      * Change the unary operator form from void op(T &x) to T op(T x)
      
      * Update to pass single workspace pointer for all kernels (fix for OpenCL backend)
      
      * Use cppcheck-suppress to prevent some strange warnings
      
      * Re-use operator [] and () for DynamicBuffer and update to depending codes
      
      * Remove useless codes in first call threadwise/warpwise/blockwise kernel wrappers
      
      * [performance] Remove un-needed local buffer initialization
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarJD <Jehandad.Khan@amd.com>
      9e80cdce