1. 25 Jun, 2024 1 commit
    • CK Instance Gen (#1145) · 3e9711f0
      arai713 authored
      
      
      * Format
      
      * Format
      
      * Format
      
      * Remove const
      
      * Use the right template
      
      * Format
      
      * Format
      
      * add row/col instances
      
      * Add missing file
      
      * fixed
      
      * fixing block to etile error
      
      * Format
      
      * Updates
      
      * Format
      
      * fixed rrr layout
      
      * generating a sample JSON file: currently contains includes, prologue/epilogue and instances
      
      * version where the json is passed into the instances to generate a key
      
      * updated run function to just launch kernel
      
      * updated run function: only contains kernel object, json file is updated but still needs to be cleaned up, added front-end API to parse JSON into character buffer
      
      * adding in testing files
      
      * cleaned up comments, still need to work on including header files
      
      * removed unneeded files
      
      * removed/commented out JSON implementation
      
      * added fusion(prologue/epilogue) into instance generation
      
      * working on instance selection
      
      * added instance selection, need to fix instance validation
      
      * removed block2etile map validity check for testing purposes
      
      * test running: failing due to incorrect files/input
      
      * all grid descs/ptrs completed, but device file not found
      
      * Update test and embed modules
      
      * Restore older version
      
      * added convolution operation, written test, debugging generated code for compilation
      
      * attempting to include CK in host directory: _Float16 error
      
      * CK header file issues
      
      * slight fix
      
      * don't crash when hip can't report total memory
      
      * dump generated code to a file
      
      * changing sizes
      
      * creating tensor descriptors using CK methods: set up grid desc manually, also trying to set up an argument pointer - this needs to be fixed
      
      * some fixes to call the device code
      
      * separating test files for conv and gemm
      
      * completed arg ptr, now have linking errors
      
      * clang format fix
      
      * resolved linker issues in conv test
      
      * remove dependency on libutility from ck
      
      * resolved num dim error
      
      * properly passing arg ptr, errors with passing typenames: redefinition/redeclaration
      
      * undo the commenting of device function
      
      * hand created kernel code to find rtc issues
      
      * dump the full src to file
      
      * resolved redeclaration errors, cleaned up errors for Amber's kernel code
      
      * debugging purposes: redeclaration error
      
      * config files
      
      * resolved errors for NumTensor and redeclaration, formatted version.h
      
      * resolved most errors in manually added kernel and my own. error with calling kernel object: overloaded function type
      
      * WIP: close to getting kernel compiled
      
      * WIP: fixing rtc errors
      
      * fixed sequence errors, formatting, still one error with run fcn
      
      * yay: kernel compiles and runs
      
      * updated templated/generated version to run and compile
      
      * minor fixes
      
      * working generated example, resolved memory access error due to padding
      
      * adding in reference kernel, validation failing against reference
      
      * debugging: printing kernel argsz
      
      * reduced error in results
      
      * debugged reference kernel and output errors, added to generated version, currently debugging prologue function issues
      
      * working validation (using reference convolution) with prologue function for both hard-coded and generated version
      
      * WIP: create an alt version that creates Argument on the device
      
      * wip: added new duplicate files, fixed fusion templating errors from working example, setting up kernel arguments
      
      * wip: making necessary methods device code
      
      * added grid descs, working on grid pointers, errors with stl numerics
      
      * wip: updating kernel args - issue, replacing some std functions
      
      * replaced std::accumulate call with temp hardcoded version
      
      * wip: args causing memory issue
      
      * Construct Argument object inside the kernel and use it to call convolution device function. Code runs and verification passes
      
      * adding object file dump
      
      * temporary hardcoding of grid size, can remove device op inst + arg ptr
      
      * minor fix for grid size
      
      * added modified example where arg ptr is created on the device for generated version as well
      
      * removed device op instance and arg ptr from modified examples
      
      * moving device op file for testing purposes and to properly build CK
      
      * commenting out print-outs
      
      * adjust compiler args to produce a valid ELF file
      
      * temporary removal of validation
      
      * reverting compiler args back for working example
      
      * retrieve necessary arguments from generated template parameters in correct format
      
      * calculating grid size on host-side, still need to clean up process, pass parameters to host functions properly
      
      * scaled up factory functions/wrapper structs to implement host-side launch parameter calculations using CK host side functions - in hard-coded example
      
      * temporary change to generate ELF format binary object file
      
      * removed unnecessary code, added comments
      
      * formatting fix
      
      * cleaned up code, added new tests, restructured library: move helper into CK
      
      * refactored launch parameter calculation to be more concise
      
      * renamed files and variables for more clarity/uniformity
      
      * more code cleaning, removed debug statements
      
      * moved majority of my files into codegen directory, running properly
      
      * updated Embed.cmake(string_view) in codegen directory
      
      * updated host directory to match Embed.cmake as well
      
      * added old tests in
      
      * updated instance generation methods to be more concise
      
      * removed layout from launch parameter calculation
      
      * working test
      
      * fixed issue with verification, all instances working
      
      * updated verification in other tests
      
      * removed duplicate matrix padder file, removed code dumps
      
      * removed old hard-coded tests
      
      * removed old host directory, all files in codegen directory now
      
      * fixed copyright in files
      
      * commenting out validation
      
      * renamed files
      
      * made changes for review: fixed copyright, renamed files for clarity, removed comments, refactored code
      
      * updated headers
      
      * removing duplicate file for fwd conv to gemm, merging with original file
      
      * fix building codegen with clang++ directly
      
      * resolving build error from conv_fwd_to_gemm
      
      * fix for previous error
      
      * renaming tests
      
      * created common test file
      
      * cleaned up code, added comments
      
      * renamed device op
      
      * fixed typos in comments
      
      * removed extra space
      
      * code cleanup: resolving Amber's comments
      
      * removed wrapper struct for matrix padder, fixed template
      
      * cleaned up if statements for better readability
      
      ---------
      Co-authored-by: Paul <pfultz2@yahoo.com>
      Co-authored-by: Jing Zhang <jizha@amd.com>
      Co-authored-by: M. Amber Hassaan <amber_474@yahoo.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
  2. 21 Jun, 2024 2 commits
  3. 20 Jun, 2024 1 commit
  4. 18 Jun, 2024 3 commits
  5. 17 Jun, 2024 1 commit
  6. 14 Jun, 2024 1 commit
  7. 10 Jun, 2024 1 commit
  8. 05 Jun, 2024 2 commits
    • Integrate universal gemm with conv forward (#1320) · ac58cc5d
      Bartłomiej Kocot authored
      * Integrate universal gemm with conv fwd
      
      * Fix conv fwd wmma test
      
      * Fix instances
      
      * Remove direct load check
    • Add a scale op, related instances and examples (#1242) · cb0645be
      Rostyslav Geyyer authored
      
      
      * Add a scale op
      
      * Update the element op
      
      * Add instances
      
      * Add an example
      
      * Add a client example
      
      * Add a flag check
      
      * Revert flag check addition
      
      * Fix flag check
      
      * Update d strides in example
      
      * Update d strides in client example
      
      * Apply suggestions from code review
      
      Update copyright header
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Move the example
      
      * Move the client example
      
      * Update element op
      
      * Update example with the new element op
      
      * Add scalar layout
      
      * Update example
      
      * Update kernel for scalar Ds
      
      * Revert kernel changes
      
      * Update element op
      
      * Update example to use scales' pointers
      
      * Format
      
      * Update instances
      
      * Update client example
      
      * Move element op to unary elements
      
      * Update element op to work with values instead of pointers
      
      * Update instances to take element op as an argument
      
      * Update examples to use random scale values
      
      ---------
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
  9. 01 Jun, 2024 1 commit
    • Post-merge fix of PR 1300 (#1313) · 6fb1f4e0
      zjing14 authored
      * add f8 gemm with multiD for both row/col wise
      
      * change compute_type to fp8
      
      * changed tuning parameters in the example
      
      * add rcr example
      
      * post-merge fix
      
      * fix
      
      * reduce init range
  10. 28 May, 2024 1 commit
  11. 22 May, 2024 1 commit
  12. 20 May, 2024 1 commit
  13. 17 May, 2024 1 commit
  14. 15 May, 2024 2 commits
  15. 10 May, 2024 2 commits
  16. 09 May, 2024 2 commits
  17. 08 May, 2024 2 commits
  18. 07 May, 2024 1 commit
  19. 02 May, 2024 1 commit
  20. 29 Apr, 2024 1 commit
  21. 26 Apr, 2024 3 commits
    • [GEMM] UniversalGemm update (#1262) · 764164b4
      Haocong WANG authored
      
      
      * Add bf16 instances
      
      * Add bf16 gemm universal example
      
      * tempsave
      
      * Add guard to navi compilation
      
      * workaround for a specific mixed gemm instance (bring it back when the compiler fix is uploaded)
      
      * fix formatting condition statement issue
      
      * solve conflict
      
      ---------
      Co-authored-by: Jun Liu <Liu.Jun@amd.com>
    • Add element op (#1259) · f044ff71
      Rostyslav Geyyer authored
    • bf16A_Int8B with fastgelu/bias (#1264) · 0d0150db
      zjing14 authored
      * changed the copy function to v7r2
      
      * adding multi_abd
      
      * in-progress
      
      * add post-load oob check
      
      * debugging
      
      * adjust instances
      
      * add run_lds
      
      * add elementwise_op
      
      * replace multi_abd_device with v3
      
      * clean up
      
      * clean
      
      * clean
      
      * Added LDSType
      
      * profiling
      
      * adjust oobcheck
      
      * add missing file
      
      * refactor
      
      * clean
      
      * add examples
  22. 25 Apr, 2024 2 commits
    • Grouped GEMM Multiple D tile loop. (#1247) · b4032629
      Adam Osewski authored
      * Overload output stream operator for LoopScheduler and PipelineVersion
      
      * Add Run overload accepting grid descriptors MK.
      
      * Add __device__ keyword for CalculateGridSize
      
      * Create device op GroupedGemmMultipleD
      
      * Add GroupedGemm MultipleD Tile Loop implementation.
      
      * Add an example for GroupedGemm MultipleD tile loop.
      
      * Device Op GroupedGEMMTileLoop.
      
      * Bunch of small changes in example.
      
      * CkProfiler
      
      * Remove unused tparam.
      
      * Fix include statement.
      
      * Fix output stream overloads.
      
      * Do not make descriptors and check validity until we find group.
      
      * Fix gemm desc initialization.
      
      * Revert device op
      
      * Fix compilation for DTYPES=FP16
      
      * Validate tensor transfers parameters.
      
      * Validate on host only NK dims if M is not known.
      
      * Fix bug.
      
      * A convenient debug func for selecting threads.
      
      * Fix has main k block loop bug.
      
      * Make sure that b2c has up to date tile offset.
      
      * Output stream operator for Sequence type.
      
      * Cmake file formatting.
    • Universal gemm flush cache (#1251) · f448d179
      ltqin authored
      
      
      * add flush cache to device op
      
      * add flush cache parameter to ckProfiler
      
      * change calculate size a and b method
      
      * change evaluation time method from AVERAGE to MEDIAN
      
      * format code
      
      * adjust some code
      
      * fix core dumped
      
      * remove loop call flush icache in kernel
      
      * remove loop(outer) call flush icache
      
      ---------
      Co-authored-by: letaoqin <letaoqin@amd.com>
  23. 23 Apr, 2024 1 commit
  24. 19 Apr, 2024 1 commit
    • Refactor elementwise kernels (#1222) · ad1597c4
      Bartłomiej Kocot authored
      * Refactor elementwise kernels
      
      * Instances fixes
      
      * Fix cmake
      
      * Fix max pool bwd test
      
      * Update two stage gemm split k
      
      * Restore elementwise scale for hiptensor backward compatibility
      
      * Fix Acc data type check in conv fwd multiple abd
      
      * Disable conv fp64 fwd example
      
      * Update grouped conv weight multi d
  25. 18 Apr, 2024 1 commit
  26. 16 Apr, 2024 1 commit
    • Added Multi_ABD support into Gemm and GroupedGemmFixedNK (#978) · 12865fbf
      zjing14 authored
      
      
      * added an example grouped_gemm_multi_abd
      
      * fixed ci
      
      * add setElementwiseOp
      
      * changed API
      
      * clean code: add multiA into example
      
      * fixed v7r2 copy
      
      * add transpose
      
      * clean
      
      * fixed vector_load check
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * add reduce
      
      * testing
      
      * add example_b16_i8
      
      * refactor example
      
      * clean
      
      * add mpadding
      
      * disable reduce for kbatch = 1
      
      * separate reduce device op
      
      * add reduce op
      
      * add guard for workspace_size
      
      * add instances
      
      * format
      
      * fixed
      
      * add client example
      
      * add a colmajor
      
      * add instances
      
      * Update cmake-ck-dev.sh
      
      * Update profile_gemm_splitk.cpp
      
      * Update gridwise_gemm_xdlops_v2r4r2.hpp
      
      * format
      
      * Update profile_gemm_splitk.cpp
      
      * fixed
      
      * fixed
      
      * adjust test
      
      * adjust precision loss
      
      * adjust test
      
      * fixed
      
      * add bf16_i8 scale bias
      
      * fixed scale
      
      * fixed scale elementwise_op
      
      * revert contraction deviceop changes
      
      * fixed
      
      * Add AddFastGelu
      
      * Revert "Merge branch 'jizhan/gemm_splitk_reduce' into grouped_gemm_multi_abd_fixed_nk_example"
      
      This reverts commit 3b5d001efd74335b38dcb7d8c8877580b49d23a4, reversing
      changes made to 943199a99191661c5597c51ca8371a90bf57837e.
      
      * add Scales into elementwise
      
      * add gemm_multi_abd client example
      
      * add client examples
      
      * add rcr and crr
      
      * add grouped gemm client example
      
      * add grouped gemm client example
      
      * add instance for rcr crr
      
      * format
      
      * fixed
      
      * fixed cmake
      
      * fixed
      
      * fixed client_example
      
      * format
      
      * fixed contraction isSupport
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update device_reduce_threadwise.hpp
      
      * clean
      
      * Fixes
      
      * Fix example
      
      ---------
      Co-authored-by: Jing Zhang <jizha@amd.com>
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
  27. 14 Apr, 2024 1 commit
    • [GEMM] Gemm universal device operation (#1154) · f83e9701
      Haocong WANG authored
      
      
      * Optimize GEMM on MI200/300:
      1. Add new blockwise gemm pipeline
      2. Add irregular splitk instances
      
      * clang format + typo fix
      
      * Fix a bug
      
      * initial commit
      
      * Add more instances to irregular splitk
      
      * blkgemm pipeline v1~4 prototype
      
      * Sanity Checked. Known issue:
      1. Poor performance of splitk
      2. Register spill on blkgemmpipeline v3
      
      * Sanity and Performance fix:
      1. fix a bug related to sanity in grouped b2c mapping
      2. fix a bug related to sanity and performance in splitk offset
      
      * Sanity and API update:
      1. Remove prefetch stage
      2. Fix valid check bug
      3. Add first gemm_universal instance into ckProfiler
      
      * Add NN instances for gemm universal
      
      * 1. Add NT instances for gemm_universal
      2. Fix a bug about Kpadding in gemm_universal
      
      * Fix a bug regarding padding Odd K number
      
      * remove kernel print
      
      * Fix KPadding bug...
      
      * Update safety check
      
      * another try to fix kpadding..
      
      * Sanity checked
      
      * new instances..
      
      * clang format+typo fix
      
      * remove clang format script's change
      
      * Add non-hotloop compile option
      
      * 1. Add fp16xfp8 example
      2. pull packed convert f8 from pr1150
      
      * Some miscs.. opt and fix
      
      * Add pipeline description docs
      
      * Split universal gemm instance library to cut profiler compiling time
      
      * uncomment cmakefile
      
      * Fix a bug caused by blockwise_gemm_pipe_v2
      
      * reduce default splitk to 1
      
      * Add 224x256x64 tile size
      
      * update, including:
      1. Experiment pipeline 5~7
      2. Optimization for pipeline 4
      3. Organized instance library
      
      * temp save
      
      * temp save
      
      * Permuted lds layout, sanity and function checked
      
      * clang format
      
      * Move OOB check from RunRead to RunWrite, for better software pipeline.
      TODO: agpr spill when NN layout
      
      * clangformat
      
      * A/B splitpipe scheduler for v3
      
      * Fix two bugs
      
      * bug fix
      
      * fix a bug in oob check
      
      * Example for mixed fp16_fp8 gemm
      
      * Clean experimental code blocks
      
      * Add mixed precision gemm into profiler
      
      * tempsave
      
      * optimize m/n major lds layout
      
      * Add RRR GEMM  mixed precision instances
      
      * Optimize f8 matrix transpose
      
      * Add test_gemm_universal
      
      * A/B split schedule for blkpip v5
      
      * Take ds_read2 into iglp scheduling scheme
      
      * format
      
      * fixed cmake
      
      * Add llvm-option into CI cmake flag
      
      ---------
      Co-authored-by: Jing Zhang <jizhan@amd.com>
  28. 11 Apr, 2024 1 commit
  29. 04 Apr, 2024 1 commit