"include/ck/utility/amd_inline_asm.hpp" did not exist on "21f7e9f103231b01889ff30ed9f016fc89d3a669"
- 10 Mar, 2023 1 commit
-
-
Haocong WANG authored
* Change gridwise gemm mD blockwise gemm to naive * RRR Gemm fix * Fix RCR gemm bug * Isolate wmma instructions * Update amd_inline_asm.hpp * Update amd_wmma.hpp * Update amd_wmma.hpp * fix syntax and update Jenkinsfile
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
-
- 09 Mar, 2023 1 commit
-
-
carlushuang authored
Co-authored-by: zjing14 <zhangjing14@gmail.com>
-
- 15 Feb, 2023 1 commit
-
-
rocking5566 authored
* Sync the order of type string with template parameter * Add more instances * Check the vector size and remove redundant var * Extract var to static, prepare to separate sweep once kernel * Separate sweeponce flow and optimize the flow * 1. Rename AccDatatype in normalization to computeData 2. Rename AccElementwiseOperation to YElementwiseOperation in normalization * Remove useless code * Update naive variance kernel * Refine string * Fix typo * Support naive variance for device_normalization * Check the blocksize * Share the VGPR of x and y * Share the VGPR of gamma and beta * Add more instances * Support fp16 sqrt for experiment * Add CHANGELOG * Fix typo * clang-format
-
- 09 Feb, 2023 1 commit
-
-
rocking5566 authored
* Add gemm + layernorm instance * Add ckProfiler * Add test * Add client example * Detect if user forgot to set the workspace * Use literal in the example * [What] use builtin function for sqrt [Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt() * check gemm validity in IsSupportedArgument * Add more testcases * Merge duplicated folder in client example * Print more information * Use better kernel parameter for MS problem size * clang format * Add constexpr for if condition and remove redundant include * Remove cstdlib and add constexpr
-
- 18 Jan, 2023 1 commit
-
-
Raman R jana authored
* wavelet gemm programming model support for CK * GEMM pipeline update for wavelet programming model * Updated wavelet programming pipeline * fixes for global-write for math-wave * fixed bug in global writes * Updated comments for better readability * fixed clang format errors * added block_lds without barrier sync * clean * clean * clean * clean * refactor * prototype 4 layouts fix default stride all problem sizes tidy move file update build script restore old file fix build * refactor standalone test to use gemm test harness * simplify gemm test * update build script * remove redundant * early return when cmd arg doesn't match * tidy * report failure when result not validated * tidy * Add comment depicting B2C mapping pattern. * Formatting & comments. * Comparison with custom B2C mapping pattern. * Example for wavelet gemm. * Add wavelet to Gemm standalone test. * Remove debug code. * Remove dangling #endif directive.
Co-authored-by: root <Raman Jana>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
-
- 17 Jan, 2023 2 commits
-
-
Qianfeng authored
* Change to the DeviceReduce base class template to include all problem description information * Add external api for reduction * Add client example to test the reduction external api * Spelling correction * Re-implement the host_reduction to follow the DeviceReduce base API format * Change the reduce profiler to call the external API for collecting device instances * Rename reduce client example directory from 08_reduce to 12_reduce * Remove (void) before the functional call * Tiny update in reduce client example * Tiny update in profile_reduce_impl.hpp * Rename the reduce client example directory
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
-
Haocong WANG authored
* wmma_op + unit test * add arch limitation to wmma test * change arch limitation * Refactor + Add all type unit test(int4 compile failed) * Add f32_16x16x16_bf16 unit test * tempsave * tempsave * tempsave * runtime bug, cannot find symbol * workaround for incorrect HIP warpSize return value * debugging * tempsave * Correctness OK, waiting for optimization * Tidy up + format * temp save * temp save, reproduce the v_bfi_b32 issue * add inline asm for wmmaop test * tidy up * clean some debug purpose code * discard some codes * clang format * clang format * compiler issue fixed + increase tile size
-
- 12 Jan, 2023 1 commit
-
-
Qianfeng authored
* Include cmath when compiling host code in math_v2.hpp * Remove the cmath include from device_base.hpp and device_permute.hpp
-
- 07 Dec, 2022 1 commit
-
-
guangzlu authored
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
- 02 Dec, 2022 1 commit
-
-
Haocong WANG authored
* wmma_op + unit test * add arch limitation to wmma test * change arch limitation * Refactor + Add all type unit test(int4 compile failed) * Add f32_16x16x16_bf16 unit test * Remove int4 related * delete deprecated test
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
- 15 Nov, 2022 1 commit
-
-
guangzlu authored
* fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm * added bf16 tests for batched_gemm_softmax_gemm_permute * changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp * changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp * aligned annotations * modified CMakeLists for examples * add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl * use macro to control the instances * added macro control into instances * clang-format some files * changed error tolerance for bf16 * changed index for 10_elementwise_normalization * fixed xdlops code bug in amd_xdlops.hpp
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
-
- 20 Sep, 2022 1 commit
-
-
Po Yen Chen authored
* Add example folder for 'DeviceElementwise' * Re-structure example files * Move common parts into common.hpp * Use more strict input * Add more helper methods in 'DeviceElementwise' * Use more specific method to write example * Allow specify problem through command line argument * Allow specify problem 'axes' through command line argument * Add check to template type argument * Add transpose_shape() to generalize shape permute * Generalize transpose utility functions * Use better name for tensor indices * Add checks in helper functions * Remove debug messages * Refine error message for check_err() * Generalize variable naming in example code * Add device op 'DevicePermute' This device op is clone of 'DeviceElementwise' * Use 'DevicePermute' device op in example * Remove 'elementwise' from identifiers * Remove 'elementwise' from file paths * Remove base class of 'DevicePermute' * Let 'DevicePermute' inherit from 'BaseOperator' * Add simple type traits to validate device op type * Add static_assert() to check type constraints * Create 'DevicePermuteBase' to generate methods * Use indirect base type to generate methods * Remove 'is_device_op<>' type traits * Only accept single-input-single-output for 'DevicePermute' * Simplify 'DevicePermute' interface * Re-format 'DeviceElementwise' * Use CRTP to generate overridden virtual method * Remove unnecessary include directives * Distinguish input & output shape in 'DevicePermute' * Passing 'axes' to 'DevicePermute' * Use more reasonable return value for Invoker::Run() * Add 'GridwisePermute' kernel This kernel is a clone of 'GridwiseElementwise_1D' * Remove no-longer used type argument * Check if input/output shape meet the requirement * Remove no-longer used method * Remove never-entered-if-clause * Change problem description for 'DevicePermute' * Transform descriptor into 3 dimensions * Add debug code to verify result * Add comment to indicate template argument location * Add N/H/WPerBlock template parameter to 'DevicePermute' * Rename 'GridwisePermute' to 'GridwiseCopy' * Check tensor descriptor dimensions in 'GridwiseElementwise_1D' * Add missing include directive * Add 'BlockSize' parameter to 'DevicePermute' * Remove no-longer used method * Add 'BlockToTileMap' for 'GridwiseCopy' * Use the normal Block2TileMap convention * Rename 'BlockToTileMap' as 'Block2TileMap' * Fix most of compilation errors * Let 'Block2TileMap' map block to 2d coordinate * Allow data transfer in 'GridwiseCopy' * Fix wrong output descriptor for 2nd blockwise copy * Rename 'GridwiseCopy' as 'GridwisePermute' * Remove '1d' in identifiers * Remove commented-out codes * Remove 'MPerThread' template parameter * Separate template parameters * Unify variable naming convention * Use more verbose way to create expressions * Add template parameter 'InBlockLdsExtraW' * Release the constraint on In/OutGridDesc * Use data type directly as template argument * Re-arrange template arguments for blockwise copy * Remove no-longer used template parameters * Embed layout in the variable names * Add GridwisePermute::CheckValidity() * Extract local types as template parameters * Rename local type alias * Add more template parameters (vector width related) * Calculate new SrcVectorDim/DstVectorDim after merge descriptor dimensions * Fill tensor values start from 1 * Re-format example code * Avoid too-large block id * Add comment * Make sure 'SrcVectorDim' is not same as 'DstVectorDim' * Add check for the 'VectorDim' & 'ScalarPerVector' template params * Let 'DstVectorDim' equals 'SrcVectorDim' after transpose out grid desc * Remove no-longer used template parameter 'NPerBlock' * Fix wrong descriptor creation logic * Specify problem in each examples * Use better example name * Add new example 'example_permute_NxHxW_fp32' * Add example for demonstrating bundle multiple elems in tensor * Add support to permute multiple elements together * Change the default problem size * Add span<> class template * Use span<> to generalize check_err() interface * Fix ambiguous ctor call * Avoid creating unnecessary objects * Use helper functions to simplify example code * Add example for 4xfp16 permute * Disable failed-to-compile example * Add check for the NUM_ELEMS_IN_BUNDLE * Remove redundant parameter in helper lambda function * Add check for the input tensor type's byte-size * Check scalar-per-vector with padded length * Use more verbose name to avoid name collision * Use fixed 'VectorDim' & 'ScalarPerVector' for LDS * Embed shape info in name of descriptor constructor * Rename example folder '36_permute' into '37_permute' * Avoid using too-large LDS in kernel code * Remove redundant example * Use switch() to group similar codes * Add const to the span<> type argument * Simply initialize tensor with floating point values * Use fp16 as data type in all examples * Enlarge tensor size in example * Enlarge N-dim in example * Add check for the bundled type in example * Use stricter error threshold * Remove global load/store loop in kernel code * Measure execution time by default * Use faster device op config for example 'NxHxW_fp16' * Use faster device op config for example '1xHxW_fp16' * Use faster device op config for example 'HxWx4_fp16' * Remove cmd arg parsing logic * Rename functions * Extract bundle permutation logic out * Simplify permute bundle example * Add Tensor<>::GetElementSpaceSizeInBytes() * Add Tensor<>::data() * Use new methods to simplify code * Use type alias to replace duplicated code * Use existing method to shorten code * Allow FillUniformDistribution to accept range argument * Initialize random values in range * Add Tensor<>::size() * Use more meaningful names in permute bundle example * Use more meaningful names in permute element examples * Use rangified copy() to copy elements * Use function return value directly to eliminate variables * Add to_array() conversion tool to eliminate more variables * Add Tensor<>::AsSpan<>() to create view of tensor values * Use AsSpan() to shorten check_err() calls * Remove no-longer-used 'using' directives * Move 'using' directive to proper code position * Remove redundant variables * Remove useless static_assert() * Add check for range types * Declare variable right before first use * Move long return type as trailing return type * Add BaseInvokerCRTP<> class template to generate method * Create new base type for 'DevicePermute' implementations * Move 'NumDim' template param to the first * Rename 'DevicePermute' to 'DevicePermuteImpl' * Add 'noexcept' specifier to CRTP generated method * Move 'Block2TileMap' definition into 'GridwisePermute' * Use type alias to reduce code * Unify naming style in 'DevicePermute' * Add comments in 'GridwisePermute' * Rename permute example folder * Use std::cerr to report error * Use larger shape in examples * Rename '38_permute' to '39_permute' * Make sure we use unsigned type for shape & indices * Remove opt-ed out assertion * Remove template BaseInvokerCRTP<>
-
- 19 Sep, 2022 2 commits
-
-
Anthony Chang authored
-
Shaojie WANG authored
* init commit of convnd bwd data * begin compiling example * have a first version that produce a right result * refine device level launch kernel code * add more instances in example and get right results * clang-format * format example file * add more instances * fix instances * adding conv_bwd_data multiple_d * adding conv_bwd_data multiple_d * adding conv_bwd multiple d * adding conv_bwd multiple d * adding conv_bwd multiple d * refactor * refactor * adding conv bwd data multiple d * adding conv bwd data multiple d * adding conv bwd data multiple d * adding conv bwd data multiple d * adding conv bwd data multiple d * adding conv bwd data multiple d * adding conv bwd data multiple d * refactor * update conv fwd's bias impl * refactor * reorg file * clean up cmake * clean * clean * clean
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
- 09 Sep, 2022 1 commit
-
-
carlushuang authored
* add gridwise/device sparse embedding * update code * update code * remove useless makefile * code fix * workable * work properly * emb add * add more instance * format * remove useless code * fix format * fix clang-tidy * clean * fix a compile error
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
-
- 30 Aug, 2022 2 commits
-
-
Adam Osewski authored
* GEMM + Reduce max fp16+fp32 * GEMM + Max bf16 + int8 * Refactor common definitions. * Refactor common func of mean meansquare example. * More examples for mean meansquare. * Update int8 examples and skip them because of random errors. * Int4 examples. * Fix examples for max int4/8 * Tensor conversion for int4 input data for mean meansquare example. * Remove int4 mean_meansquare example * Fix int8 mean_meansquare example. -All ReductionAccData and R<N>DataType have to be F32. The INT32 data type is giving wrong results. * Guard int4 with ifdef * Change int8 example to add_addsquare due to div rounding err. * Clang format * Change the return type of common function. * Get back int8 example with division. * Remove int8 mean meansquare. * Use proper cast for BF16 data type. * Use ck::literals. * Use proper data type for host tensors & reference. - Use ReduceAccDataType for reference gemm output data type. - Cast host reference output tensor to EDataType - Fix ifdefs for int4.
Co-authored-by: Adam Osewski <aosewski@amd.com>
-
Shaojie WANG authored
* add padding algo for bmm+scale+softmax+bmm. Version for verification * remove verification code * remove comments * add padded bmm scale softmax bmm example * format * refactor * add comments for usages of padding bmm+scale+softmax+bmm
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
-
- 25 Aug, 2022 1 commit
-
-
Adam Osewski authored
* More int4 UT. * Disable BitwiseRepresentation UT. * Add UT with static_cast * Surround cout statements with #if
Co-authored-by: Adam Osewski <aosewski@amd.com>
-
- 24 Aug, 2022 1 commit
-
-
Po Yen Chen authored
-
- 23 Aug, 2022 1 commit
-
-
Anthony Chang authored
* GemmPadder and GemmGemmPadder * proper padding using GemmGemmPadder * test gemm_gemm padding * properly check size K in IsSupportedArgument() * properly check size requirement given SrcScalarPerVector in IsSupportedArgument() * comment * format
-
- 18 Aug, 2022 1 commit
-
-
Adam Osewski authored
* Introduce int4 data type. * Add unit-tests for int4 * Compile int4 UT only when int4 enabled. * clang-format
Co-authored-by: Adam Osewski <aosewski@amd.com>
-
- 13 Aug, 2022 4 commits
-
-
rocking5566 authored
* Add threadwise and blockwise welford * Rename gridwise op, prepare to add welford version * implement welford and integrate welford into layernorm * Take care of tail loop * Fix bug when ThreadSliceK > 1 * Fix bug of merging of two empty sets * Rename clip to clamp * 1. Fix type of count 2. Remove useless static_assert * Do not inherit Reduction::Argument * [What] replace __syncthreads() with block_sync_lds() [Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0) * Add y stride * Rename. DeviceLayernorm -> DeviceLayernormImpl DeviceNormalization2 -> DeviceLayernorm * Move literal ""_uz & ""_zu into namespace 'literals' * Move namespace 'literals' as 'ck::literals'
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
Anthony Chang authored
* initial stub for gemm_gemm_xdl_cshuffle * set up example code * compiles * prevent integer overflow * harmonize interface between ref_gemm and ref_batched_gemm * batched_gemm_gemm * fix example * host tensor gen: diagonal pattern in lowest two-dimensions only * make c descriptors containing only integral constants * clean up * add BlockwiseGemmXdlops_v2 while exploring a unified approach * implement proper interface * tidy up example * fix compilation warnings * coarsely controlled 2nd gemm padding * remove rocm-cmake's hard requirement for certain revision * clang-format * resolve merge conflict * fix compilation error on gfx10 * adds acc0 elementwise op to interface * add gemm_gemm instances and tests * avoid LDS data hazard * fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
ltqin authored
* start * read for gridwise gemm * add MakeBGridDescriptor_K0_N0_N1_N2_N3_K1 * add thread copy desc and register buffer * add K0PerBlock dim * add read global data * finish gridwise gemm * finish blockwise gemm * add print data * add smallest config * add compare code for gridwise gemm * fix NXdlPerWave * fix k0perthread and gridwise gemm main loop * remove b matrix lds alloc * fix name * add test code * create b_grid_desc_k0_k1_k2_n0_n1_n2_n3_k3 from parameter * add double register * modify b_thread_desc_ * add float * fp16 tag * add tail for pipeline * finish main loop * optimize main loop * start clear gridwise gemm * clear code * clear redundant code * change file name * change file name * fix bug after merge develop * fix input parameters * using MultiK0 control b load data loop * fix some config * 4 buffer * fix bug * one can use * change read order * change buffer array to tuple * change to 8 buffer * interleave buffer load * change to 16 * read 8 buffer * add data buffer to template * fix after merge develop(header file) * format * change to 4 buffer * remove unnecessary lambda fun
-
Anthony Chang authored
* initial stub for gemm_gemm_xdl_cshuffle * set up example code * compiles * prevent integer overflow * harmonize interface between ref_gemm and ref_batched_gemm * batched_gemm_gemm * fix example * host tensor gen: diagonal pattern in lowest two-dimensions only * make c descriptors containing only integral constants * clean up * add BlockwiseGemmXdlops_v2 while exploring a unified approach * implement proper interface * tidy up example * fix compilation warnings * coarsely controlled 2nd gemm padding * remove rocm-cmake's hard requirement for certain revision * clang-format * resolve merge conflict * fix compilation error on gfx10 * adds acc0 elementwise op to interface * attention host validation * add blockwise softmax v1 * iteratively update softmax+gemm * transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum * add init method for easier debugging * do away with manual thread cluster calculation * generalize blockwise softmax interface * row-wise softmax sum & max * format * rename to DeviceBatchedGemmSoftmaxGemm * add gemm_softmax_gemm instances and tests * comment
Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
- 11 Aug, 2022 1 commit
-
-
Po Yen Chen authored
* Add always_false<> util to delay symbol resolution * Use always_false<> to prevent trying to instantiate unwanted method * Add new specializations of AddAddFastGelu::operator() method * Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32 * Use floating point literal to simplify code * Remove unnecessary capture in lambda expressions * Extract fast GeLU calculation as standalone method * Mark methods as 'constexpr' * Add constraint for HostTensorDescriptor templated ctors * Simplify HostTensorDescriptor ctor calls * Add C++23 std::size_t literal suffix * Use _uz suffix to shorten example code * Remove unnecessary conversion to std::array<> * Re-order include directives * Remove C-style casting by literal suffix * Remove unnecessary statements in main() * Remove unused type parameter of always_false<> * Remove unused include directive * Exit main() by returning meaningful value * Use 'if constexpr' to switch example flow * Use std::is_same_v<> to shorten example code * Add 'inline' specifier to literal functions * Unify output methods in example * Move common codes into .inc file * Add type check in type_convert<>() * Add type_convert<float>() before computation * Merge AddAddFastGelu method specializations * Remove always_false<> * Add constraint to AddAddFastGelu::operator() parameter types
-
- 02 Aug, 2022 1 commit
-
-
Adam Osewski authored
* Add int8 specialization for elementwise Add and Subtract. * CGEMM examples bf16, fp32, int8 * Add convert reference output to CDataType. * Skip BF16 data type during testing. * Lower K value to get rid of accumulation error. * Fix merge artifact. * Fix changed function name: GetElementSpaceSize() * Fix merge artifact.
Co-authored-by: Adam Osewski <aosewski@amd.com>
-
- 29 Jul, 2022 1 commit
-
-
Chao Liu authored
* convnd_fwd fp16 example * update example * update example * update instance * updating reference conv * update reference conv * update conv fwd profiler * update conv 1d and 3d instance * update include path * clean * update profiler for conv bwd data and weight * update conv bwd weight * clean * update conv example * update profiler for conv bwd weight * update ckprofiler for conv bwd data * fix reference conv bwd data bug; update conv bwd data test * update examples * fix initialization issue * update test for conv fwd * clean * clean * remove test case too sensitive to error threshold * fix test * clean * fix build * adding conv multiple d * adding conv multiple D * add matrix padder * add gemm padding to convnd * adding group conv * update gemm multi-d * refactor * refactor * refactor * clean * clean * refactor * refactor * reorg * add ds * add bias * clean * add G * adding group * adding group * adding group * update Tensor * clean * update example * update DeviceGemmMultipleD_Xdl_CShuffle * update conv bwd-data and bwd-weight * update contraction example * update gemm and batch gemm with e permute * fix example build * instance for grouped conv1d * update example * adding group conv instance * update gemm bilinear instance * update gemm+add+add+fastgelu instance * update profiler * update profiler * update test * update test and client example * clean * add grouped conv into profiler * update profiler * clean * add test grouped conv, update all conv test to gtest * update test
-
- 07 Jul, 2022 1 commit
-
-
Chao Liu authored
* adding contraction * add contraction example * update example * update example * format * update readme * clean header * clean header * contraction with multiple D * rename * fix naming issue; add instances for contraction+bilinear * change assumed virtual layout of contraction; add client example * update example * update * contraction+scale * use type_convert * rename
-
- 02 Jul, 2022 1 commit
-
-
Chao Liu authored
* refactor * update example * update example * gemm bilinear * clean * update
-
- 01 Jul, 2022 1 commit
-
-
Anthony Chang authored
* dump lds content in appropriate precision type * add squared add reduction op; allows sq sum * initial stub from regular gemm impl * layernorm example code & host verification * initial layernorm implementation * tidy up * make C0 precision type consistent with C * clang-tidy and additional comments * tighten up example code * account for extra flops/bytes from normalization * clang-format * c0 bias/beta/gamma now have its own precision type * AccElemOp for gemm outputs prior to feeding to layernorm * update workgroup mapping * rename kernel template param to reflect its dual use * use LDS mem pool for reduction workspace * change cshuffle precision type to f16; clean up * clang-format * correct naming * explicit cast * fully implemented gemm + bias + activation + add + norm * activation in correct order * reflect reduction API's recent change * amend * clean up; add comment * keep up with recent changes in reduction API * format * resolve merge conflicts
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
- 30 Jun, 2022 1 commit
-
-
Anthony Chang authored
* use 'sweep once' softmax kernel where applicable * threadwise copy's dst buffer can specify invalid element value * add int8 in/out float compute softmax support give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error * format * softmax inherits DeviceNormalization * softmax profiler stub * tighten up reference softmax interface * example prints tensor dimension * add fp32 to softmax profiler * rename header * hook with ckProfiler * format * resolve merge conflict * resolve merge conflicts * update normalization profiler help string * resolve conflict * typo * remove residual * softmax profiler: address feedback * test for mixed precision input/output * fully qualify ck::math::isnan * add comment for device normalization interface * revise wording * constness for alpha/beta scalar pointer
-
- 25 Jun, 2022 2 commits
-
-
Chao Liu authored
-
Chao Liu authored
* add gelu and fast_gelu * added GeLU and fast GeLU * clean up * add gemm+fastgelu example * add gemm+gelu instances * update profiler * clean up * clean up * adding gemm+bias+activation * clean * adding bias * clean * adding gemm multiple d * debugging * add gemm bias add fastgelu * rename, clean * refactoring; add readme * refactor * refactor * refactor * refactor * refactor * refactor * fix * fix * update example * update example * rename * update example * add ckProfiler * clean * clean * clean * clean * add client app example * update readme * delete obsolete files * remove old client app * delete old file * cleaning * clean * remove half * fix header path * fix header path * fix header path * fix header path * fix header path * fix header path for all examples * fix header path * fix header path * fix header path * fix header path * fix header path * fix header path * fix header path * fix header path * fix header path * revert client app example * clean build * fix build * temporary disable client test on Jenkins * clean * clean * clean
-
- 21 Jun, 2022 1 commit
-
-
Anthony Chang authored
* initial stub for standalone softmax * start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m * host softmax validates * compiles; to implement beta scaling * use NaN trick to efficiently ignore OOB values during sum of exponentials * freeload device_reduce's utility functions * clean up interface * adding prior value (beta scaling) * remove restriction related to perf considerations * apply clang-format * clean; disable diagnostics * resolve conflicts * add exp wrapper * honor HostTensorDesc interface; allow implicit cast from different vector<T> type * test softmax for fp16/fp32 * update readme * amend commit NaN trick * remove redundant param added during development * format * replace ScalarDataType with AccDataType * separate out test programs by precision type * move softmax sample code to its own folder * format * keep up with recent changes in reduction API * remove extra header
-
- 19 Jun, 2022 1 commit
-
-
Chao Liu authored
* add gelu and fast_gelu * added GeLU and fast GeLU * clean up * add gemm+fastgelu example * add gemm+gelu instances * update profiler * clean up * clean up * adding gemm+bias+activation * clean * adding bias * clean * adding gemm multiple d * debugging * add gemm bias add fastgelu * rename, clean * refactoring; add readme * refactor * refactor * refactor * refactor * refactor * refactor * fix * fix * update example * update example * rename * update example * add ckProfiler * clean * clean * clean * clean * add comment * use type_convert * clean * clean element wise op
-
- 17 Jun, 2022 1 commit
-
-
Qianfeng authored
* Remove template from Reduction operation classes and add template to their operator() and GetIdentityValue() interfaces * Change to unary elementwise operators and the reduce_unary_operator (class for mapping) and dependent variations in all host layers * Remove the data type template parameter from reduce_binary_operator (class for mapping) and dependent variations in host layers * Add InMemoryDataOperatonSupportedOnDataType to check the matching between data type and InMemoryDataOperation * Use struct-scope operator template instantiation for binary and unary element-wise operations * Change a few more elementwise operations to use template for operator() * Tiny correction in Normalize operator * Add static_assert to check the data type applicability for some reduction accumulator and element-wise operations * Correction in some examples with regard to using ReduceAccDataType * Use static_assert for UnaryDivide * Update to merged codes to use Element-wise operations and Reduction Accumulator operations correctly * Tiny fix with regard to SetWorkSpacePointer()
-
- 02 Jun, 2022 1 commit
-
-
Qianfeng authored
* Use the unified naming for math functions on host and HIP kernel * Corresponding change/simplification in reduction host/profiler/examples due to unified math functions renaming * Renaming GetReductionZeroVal() to GetIdentityValue() * Tiny renaming in profile_reduce_impl.hpp * More renaming in profile_reduce_impl.hpp * Replace zeroVal by identityVal * Remove ck_ prefix in the naming of ck::math provided functions
-
- 26 May, 2022 1 commit
-
-
ltqin authored
* add intrin_mfma_f64_16x16x4f64 * add example * gemm reference add double data type * change init data * fix M N PerXdlops * fix ifdef * add comparison config * add conv fwd example * format log out * change rc matrix register layout * reorganize example * reorganize example 2 * format,because merge develop * fix call impl adding acc data type * lost ; * add compiler warning * change example tuning parameters * add test for fp64 * add instance * add test/gemm/gemm_fp64.cpp * fix get name issue * remove some tuning parameter * fix conflict * format * use integer value for GEMM test * add acc data type * remove typeid because fp16 * fix streamconfig etc bug from merging develop * format * remove test_gemm_xdl_fp64 * add AccDataType * AccDataType problem
Co-authored-by: qinletao <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
- 25 May, 2022 1 commit
-
-
Chao Liu authored
* minor fix * clean
-