- 06 Aug, 2024 3 commits
-
-
Juan Manuel Martinez Caamaño authored
-
jakpiase authored
* fix for beta!=0 in reduce
* add reviewers suggestions
-
Bartłomiej Kocot authored
* Support 64 bit indexing
* Add new grouped conv fwd kernel for large tensors
* Add instances large tensor
* Fixes for transform conv to gemm
* Fixes
* fixes
* Remove unneeded instances
* examples fixes
* Remove unneeded ds arrays
* Fix tests
* Add 2GB check in gridwise dl
* Fixes
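The 64-bit indexing and 2GB-check items above concern tensors whose byte size no longer fits in a signed 32-bit offset. As a rough illustration only (this is not the check used in the library; the helper name `fits_in_int32_offset` and the exact threshold of 2^31 - 1 bytes are assumptions), such a guard could be sketched as:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper, not part of the library's API.
// Returns true when a tensor of `num_elements` elements, each `element_size`
// bytes wide, can be addressed with a signed 32-bit byte offset.
inline bool fits_in_int32_offset(std::uint64_t num_elements, std::uint64_t element_size)
{
    assert(element_size > 0); // precondition of this sketch

    // Assumed limit: largest value of a signed 32-bit integer (2^31 - 1).
    const std::uint64_t limit =
        static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());

    // Divide instead of multiplying so the comparison itself cannot overflow.
    return num_elements <= limit / element_size;
}
```

A caller would fall back to a large-tensor (64-bit indexed) kernel, or reject the argument, whenever this returns false.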
-
- 31 Jul, 2024 1 commit
-
-
arai713 authored
* added isSupportedArgument check into codegen device op
* adding function call
* remove commented code
-
- 30 Jul, 2024 1 commit
-
-
Bartłomiej Kocot authored
-
- 25 Jul, 2024 1 commit
-
-
zjing14 authored
* add rotating_buff for gemm_multi_d
* format
* Update flush_cache.hpp
* Update gtest.cmake
---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
-
- 24 Jul, 2024 2 commits
-
-
Andriy Roshchenko authored
Adding more instances of grouped convolution 3d forward for FP8 with ConvScale+Bias element-wise operation. (#1412)
* Add CMakePresets configurations.
* Add binary elementwise ConvScaleAdd and an example.
* Numerical verification of results. Observed significant irregularities in F8 to F32 type conversions:
```log
ConvScaleAdd: float=145.000000 f8_t=160.000000 e=144.000000
ConvScaleAdd: float=97.000000 f8_t=96.000000 e=104.000000
ConvScaleAdd: float=65.000000 f8_t=64.000000 e=72.000000
```
* Implemented ConvScaleAdd + Example.
* Add ConvScale+Bias Instances
* Add Client Example for ConvScale+Bias
* Fix number of bytes in an example.
* Cleanup.
-
Bartłomiej Kocot authored
* Add support for half_t and bfloat to reduction operations
* Fix bhalf convert
* Next fix bf16
-
- 22 Jul, 2024 1 commit
-
-
Bartłomiej Kocot authored
-
- 19 Jul, 2024 3 commits
-
-
Haocong WANG authored
* add ab_scale init support
* enabled interwave
* add scale type; update isSupport
* adjust example
* clean
* enable f8 pure gemm rcr ckprofiler
* Add gemm_multiply_multiply instances
* clang format
* Optimize for ScaleBlockMNK=128
* enable abscale f8 gemm ck profiler
* Add pure f8 gemm test suite
* Reverting to the state of project at f60fd77
* update copyright
* clang format
* update copyright
---------
Co-authored-by: root <jizhan@amd.com>
-
ltqin authored
* init for reduce_threadwise multi_d
* add reduce_threadwise_multi_d
* add reduce_multi_d
* clean
* start add another splitk device op
* add reduce template parameter to SplitKBatchOffset
* add reduce c matrix
* clean up code
* change example data type to bf16
* add bf16Ai8B example
* remove reduce template parameter
* add splitk atomic status to v4
* example add multi d parameters
* device op add multi-d parameters
* add multi-d to reduce
* fix kbatch=1 bug
* change B layout to col in bf16Ai8B example
* remove float adding struct
* change multi-d interface
* change file and class name
* remove multi-d of bf16Ai8B example
* change IsReduce function to IsReduceAdd
* change example layout to RRR from RCR
* according layout to set ds stride
* reset parameter layout
* add gemm universal reduce instance
* add reduce factory
* add profile_gemm_universal_reduce
* add reduce to profiler
* fix reduce instance
* fix profiler reduce compiling bug
* format
* format library instance code
* add mem instance for reduce library
* fix call instance names
* add workspace for reduce in ckProfiler
* format
* add mnpadding to reduce library instance
* add fp16 instance to reduce of profiler
* change copyright time
* restore profiler cmake file
* add reduce text to instances
* add DsLayout and DsDataType to instances template parameter
* fixed gemm_reduce_multi_d
* add an example without multi_d
* Update common.hpp
* Update gtest.cmake
* Update gemm_xdl_splitk_reduce_bf16.cpp
* clean
* Update gtest.cmake
* format
* fix api
* format
* default parameter change to RRR
* add vector_len for multi_d
* format
* Update gtest.cmake
* fix bf16A iBB elementwiseop
* add ReduceDataType
* move ReduceDataType to end position
* format
* remove googletest git method address
* fix copyright time
* update init data
---------
Co-authored-by: root <jizhan@amd.com>
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
-
Bartłomiej Kocot authored
* Refactor transform conv to gemm fwd
* fixes codegen
* wmma fixes
* fix wmma
* Fix copyright
-
- 17 Jul, 2024 1 commit
-
-
Qianfeng authored
-
- 16 Jul, 2024 1 commit
-
-
Andriy Roshchenko authored
Adding more instances of grouped convolution 3d forward for FP8 with ConvScale element-wise operation and ReLU activation. (#1386)
* Add CMakePresets configurations.
* Add ConvScale+ReLU Functor and an Example
* Account for ReLU FLOPs.
* Add instances of 3D convolutions with ConvscaleRelu operation.
* Implement Client Example
* Cleanup
-
- 12 Jul, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Support access per groups and filter3x3 in grouped conv fwd
* Fixes for large cases
* Fixes for large tensors
-
- 06 Jul, 2024 1 commit
-
-
Harisankar Sadasivan authored
* universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile).
* Update README.md
* fixing clang-format issues
* removed conflicts in struct members between streamk and universal streamk
* corrected arg parsing for streamk and universal streamk
* added stream-k policies for 3 tile and 4 tile
* fixed argument type issue with parsing cmd args
* changes suggested in PR review are made- removing comments and correcting copyright
* file permissions updated
* added default value support for grid_size and streamk-policy selection set to -1
* print messages for arguments
* print messages for arguments
* print messages for arguments1
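The defaulting rule described in this commit message (a `grid_size` of -1 selects maximum occupancy times the number of CUs, and a strategy of -1 selects the 4-tile policy) can be sketched as below. This is only an illustration of the stated rule: the names `StreamKPolicy`, `resolve_grid_size` and `resolve_policy` are hypothetical, and mapping the strategy values 1 to 4 onto the 1-tile to 4-tile policies is an assumption, not something the message confirms.

```cpp
#include <cassert>

// Hypothetical enum; the message only lists the policies by name.
enum class StreamKPolicy
{
    OneTile   = 1,
    TwoTile   = 2,
    ThreeTile = 3,
    FourTile  = 4
};

// grid_size == -1 means "derive it": #WGs = maximum occupancy x num_CUs.
// Any other value is taken as given.
inline int resolve_grid_size(int grid_size, int max_occupancy, int num_cus)
{
    assert(max_occupancy > 0 && num_cus > 0);
    return grid_size == -1 ? max_occupancy * num_cus : grid_size;
}

// strategy == -1 selects the default policy (4-tile), as the message states.
// Assumption of this sketch: values 1..3 select the matching N-tile policy,
// and anything else also falls back to 4-tile.
inline StreamKPolicy resolve_policy(int strategy)
{
    switch(strategy)
    {
    case 1: return StreamKPolicy::OneTile;
    case 2: return StreamKPolicy::TwoTile;
    case 3: return StreamKPolicy::ThreeTile;
    default: return StreamKPolicy::FourTile;
    }
}
```

For example, with an assumed occupancy of 2 and 104 CUs, `resolve_grid_size(-1, 2, 104)` yields 208 workgroups, while an explicit `grid_size` is passed through unchanged.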
-
- 04 Jul, 2024 2 commits
- 27 Jun, 2024 2 commits
-
-
jakpiase authored
* first version of smfmac test
* add reviewer comments
* add reviewer suggestions
-
Illia Silin authored
-
- 25 Jun, 2024 1 commit
-
-
arai713 authored
* Format
* Format
* Format
* Remove const
* Use the right template
* Format
* Format
* add row/col instances
* Add missing file
* fixed
* fixing block to etile error
* Format
* Updates
* Format
* fixed rrr layout
* generating a sample JSON file: currently contains includes, prologue/epilogue and instances
* version where the json is passed into the instances to generate a key
* updated run function to just launch kernel
* updated run function: only contains kernel object, json file is updated but still needs to be cleaned up, added front-end API to parse JSON into character buffer
* adding in testing files
* cleaned up comments, still need to work on including header files
* removed unneeded files
* removed/commented out JSON implementation
* added fusion(prologue/epilogue) into instance generation
* working on instance selection
* added instance selection, need to fix instance validation
* removed block2etile map validity check for testing purposes
* test running: failing due to incorrect files/input
* all grid descs/ptrs completed, but device file not found
* Update test and embed modules
* Restore older version
* added convolution operation, written test, debugging generated code for compilation
* attempting to include CK in host directory: _Float16 error
* CK header file issues
* slight fix
* don't crash when hip can't report total memory
* dump generated code to a file
* changing sizes
* creating tensor descriptors using CK methods: set up grid desc manually, also trying to set up an argument pointer - this needs to be fixed
* some fixes to call the device code
* separating test files for conv and gemm
* completed arg ptr, now have linking errors
* clang format fix
* resolved linker issues in conv test
* remove dependency on libutility from ck
* resolved num dim error
* properly passing arg ptr, errors with passing typenames: redefinition/redeclaration
* undo the commenting of device function
* hand created kernel code to find rtc issues
* dump the full src to file
* resolved redeclaration errors, cleaned up errors for Amber's kernel code
* debugging purposes: redeclaration error
* config files
* resolved errors for NumTensor and redeclaration, formatted version.h
* resolved most errors in manually added kernel and my own. error with calling kernel object: overloaded function type
* WIP: close to getting kernel compiled
* WIP: fixing rtc errors
* fixed sequence errors, formatting, still one error with run fcn
* yay: kernel compiles and runs
* updated templated/generated version to run and compile
* minor fixes
* working generated example, resolved memory access error due to padding
* adding in reference kernel, validation failing against reference
* debugging: printing kernel args
* reduced error in results
* debugged reference kernel and output errors, added to generated version, currently debugging prologue function issues
* working validation (using reference convolution) with prologue function for both hard-coded and generated version
* WIP: create an alt version that creates Argument on the device
* wip: added new duplicate files, fixed fusion templating errors from working example, setting up kernel arguments
* wip: making necessary methods device code
* added grid descs, working on grid pointers, errors with stl numerics
* wip: updating kernel args - issue, replacing some std functions
* replaced std::accumulate call with temp hardcoded version
* wip: args causing memory issue
* Construct Argument object inside the kernel and use it to call convolution device function. Code runs and verification passes
* adding object file dump
* temporary hardcoding of grid size, can remove device op inst + arg ptr
* minor fix for grid size
* added modified example where arg ptr is created on the device for generated version as well
* removed device op instance and arg ptr from modified examples
* moving device op file for testing purposes and to properly build CK
* commenting out print-outs
* adjust compiler args to produce a valid ELF file
* temporary removal of validation
* reverting compiler args back for working example
* retrieve necessary arguments from generated template parameters in correct format
* calculating grid size on host-side, still need to clean up process, pass parameters to host functions properly
* scaled up factory functions/wrapper structs to implement host-side launch parameter calculations using CK host side functions - in hard-coded example
* temporary change to generate ELF format binary object file
* removed unnecessary code, added comments
* formatting fix
* cleaned up code, added new tests, restructured library: move helper into CK
* refactored launch parameter calculation to be more concise
* renamed files and variables for more clarity/uniformity
* more code cleaning, removed debug statements
* moved majority of my files into codegen directory, running properly
* updated Embed.cmake(string_view) in codegen directory
* updated host directory to match Embed.cmake as well
* added old tests in
* updated instance generation methods to be more concise
* removed layout from launch parameter calculation
* working test
* fixed issue with verification, all instances working
* updated verification in other tests
* removed duplicate matrix padder file, removed code dumps
* removed old hard-coded tests
* removed old host directory, all files in codegen directory now
* fixed copyright in files
* commenting out validation
* renamed files
* made changes for review: fixed copyright, renamed files for clarity, removed comments, refactored code
* updated headers
* removing duplicate file for fwd conv to gemm, merging with original file
* fix building codegen with clang++ directly
* resolving build error from conv_fwd_to_gemm
* fix for previous error
* renaming tests
* created common test file
* cleaned up code, added comments
* renamed device op
* fixed typos in comments
* removed extra space
* code cleanup: resolving Amber's comments
* removed wrapper struct for matrix padder, fixed template
* cleaned up if statements for better readability
---------
Co-authored-by: Paul <pfultz2@yahoo.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: M. Amber Hassaan <amber_474@yahoo.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
-
- 21 Jun, 2024 2 commits
-
-
carlushuang authored
* WA for rocm-6.2+ s constraint for buffer resource
* add missing memory clobber
-
Bartłomiej Kocot authored
* Cmake add -Wno-nvcc-compt
* Remove template without initialization list
* dpp remove template without init list
* Fixes
-
- 20 Jun, 2024 1 commit
-
-
ThruptiRajLakshmanaGowda authored
* Initial Push
* First Push
* Fixed Clang format
* Resolve merge conflict
* Addressed review comments
* Addressed review comments
* Addressed review comments
-
- 18 Jun, 2024 3 commits
-
-
Bartłomiej Kocot authored
-
jakpiase authored
* switch to universal gemm in grouped gemm tile loop
* minor fixes
* add reviewers comments
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
-
Bartłomiej Kocot authored
* Fix continuous dim selection in contraction
* Fixes
-
- 17 Jun, 2024 1 commit
-
-
zjing14 authored
-
- 14 Jun, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Support large tensors in grouped conv fwd
* Multi ABD fixes
* Fix calculate element space size
-
- 10 Jun, 2024 1 commit
-
-
Rostyslav Geyyer authored
* Update the element op
* Add an example
* Add instances
* Add a client example
* make sure new instances only build on gfx9
* Update element op and its handling
* Format
* Update instances to take element op as an argument
* Update examples to use random scale values
* Format
* Update client example with random scales
* Format
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
-
- 05 Jun, 2024 2 commits
-
-
Bartłomiej Kocot authored
* Integrate universal gemm with conv fwd
* Fix conv fwd wmma test
* Fix instances
* Remove direct load check
-
Rostyslav Geyyer authored
* Add a scale op
* Update the element op
* Add instances
* Add an example
* Add a client example
* Add a flag check
* Revert flag check addition
* Fix flag check
* Update d strides in example
* Update d strides in client example
* Apply suggestions from code review
  Update copyright header
  Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Move the example
* Move the client example
* Update element op
* Update example with the new element op
* Add scalar layout
* Update example
* Update kernel for scalar Ds
* Revert kernel changes
* Update element op
* Update example to use scales' pointers
* Format
* Update instances
* Update client example
* Move element op to unary elements
* Update element op to work with values instead of pointers
* Update instances to take element op as an argument
* Update examples to use random scale values
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
-
- 01 Jun, 2024 1 commit
-
-
zjing14 authored
* add f8 gemm with multiD for both row/col wise
* change compute_type to fp8
* changed tuning parameters in the example
* add rcr example
* post-merge fix
* fix
* reduce init range
-
- 28 May, 2024 1 commit
-
-
zjing14 authored
* add f8 gemm with multiD for both row/col wise
* change compute_type to fp8
* changed tuning parameters in the example
* add rcr example
-
- 22 May, 2024 1 commit
-
-
Bartłomiej Kocot authored
* Optimize grouped conv bwd weight for small M and N
* Fixes
-
- 20 May, 2024 1 commit
-
-
Illia Silin authored
-
- 17 May, 2024 1 commit
-
-
Illia Silin authored
-
- 15 May, 2024 2 commits
-
-
Illia Silin authored
-
jakpiase authored
* add unit tests for grouped gemm two stage
* add reviewers suggestions
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
-
- 10 May, 2024 1 commit
-
-
Illia Silin authored
* code clean-up
* remove the profiling output samples
-