1. 25 Jul, 2024 1 commit
  2. 24 Jul, 2024 2 commits
    • Adding more instances of grouped convolution 3d forward for FP8 with... · 4a8a1bef
      Andriy Roshchenko authored
      Adding more instances of grouped convolution 3d forward for FP8 with ConvScale+Bias element-wise operation. (#1412)
      
      * Add CMakePresets configurations.
      
      * Add binary elementwise ConvScaleAdd and an example.
      
      * Numerical verification of results.
      
      Observed significant irregularities in F8 to F32 type conversions:
      ```log
      ConvScaleAdd: float=145.000000   f8_t=160.000000    e=144.000000
      ConvScaleAdd: float=97.000000   f8_t=96.000000    e=104.000000
      ConvScaleAdd: float=65.000000   f8_t=64.000000    e=72.000000
      ```
      
      * Implemented ConvScaleAdd + Example.
      
      * Add ConvScale+Bias Instances
      
      * Add Client Example for ConvScale+Bias
      
      * Fix number of bytes in an example.
      
      * Cleanup.
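The conversion log above can be compared against a plain round-to-nearest reference. The following is a minimal sketch, assuming an E4M3-style FP8 format with 3 explicit mantissa bits; the commit does not state which FP8 variant or rounding mode produced the logged values. `quantize_3bit_mantissa` is a hypothetical helper, not a CK function, and it ignores the format's range limits and subnormals.

```python
import math


def quantize_3bit_mantissa(x: float) -> float:
    """Round x to the nearest value with 1 implicit + 3 explicit mantissa bits.

    Hypothetical reference helper: no clamping to the FP8 maximum and no
    subnormal handling. Ties go to the even neighbour (Python's round()).
    """
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(abs(x))  # abs(x) = mantissa * 2**exponent, 0.5 <= mantissa < 1
    steps = round(mantissa * 16)             # 4 significant bits -> integer in 8..16
    return math.copysign(math.ldexp(steps, exponent - 4), x)


if __name__ == "__main__":
    for value in (145.0, 97.0, 65.0):
        print(f"float={value:.6f}  nearest 3-bit-mantissa value={quantize_3bit_mantissa(value):.6f}")
```

Under these assumptions the representable values are 16 apart in [128, 256) and 8 apart in [64, 128), so round-to-nearest maps 145 to 144, 97 to 96, and 65 to 64.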
    • Add support for half_t and bfloat to reduction operations (#1395) · ffabd70a
      Bartłomiej Kocot authored
      * Add support for half_t and bfloat to reduction operations
      
      * Fix bhalf convert
      
      * Next fix bf16
  3. 22 Jul, 2024 1 commit
  4. 19 Jul, 2024 3 commits
    • [GEMM] F8 GEMM, performance optimized. (#1384) · 8c90f25b
      Haocong WANG authored
      
      
      * add ab_scale init support
      
      * enabled interwave
      
      * add scale type; update isSupport
      
      * adjust example
      
      * clean
      
      * enable f8 pure gemm rcr ckprofiler
      
      * Add gemm_multiply_multiply instances
      
      * clang format
      
      * Optimize for ScaleBlockMNK=128
      
      * enable abscale f8 gemm ck profiler
      
      * Add pure f8 gemm test suite
      
      * Reverting to the state of project at f60fd77
      
      * update copyright
      
      * clang format
      
      * update copyright
      
      ---------
      Co-authored-by: root <jizhan@amd.com>
    • Universal gemm splitk using reduce (with multi-d) (#1341) · c544eb4d
      ltqin authored
      
      
      * init for reduce_threadwise multi_d
      
      * add reduce_threadwise_multi_d
      
      * add reduce_multi_d
      
      * clean
      
      * start adding another splitk device op
      
      * add reduce template parameter to SplitKBatchOffset
      
      * add reduce c matrix
      
      * clean up code
      
      * change example data type to bf16
      
      * add bf16Ai8B example
      
      * remove reduce template parameter
      
      * add splitk atomic status to v4
      
      * example add multi d parameters
      
      * device op add multi-d parameters
      
      * add multi-d to reduce
      
      * fix kbatch=1 bug
      
      * change B layout to col in bf16Ai8B example
      
      * remove float adding struct
      
      * change multi-d interface
      
      * change file and class name
      
      * remove multi-d of bf16Ai8B example
      
      * change IsReduce function to IsReduceAdd
      
      * change example layout to RRR from RCR
      
      * set ds stride according to layout
      
      * reset parameter layout
      
      * add gemm universal reduce instance
      
      * add reduce factory
      
      * add profile_gemm_universal_reduce
      
      * add reduce to profiler
      
      * fix reduce instance
      
      * fix profiler reduce compiling bug
      
      * format
      
      * format library instance code
      
      * add mem instance for reduce library
      
      * fix call instance names
      
      * add workspace for reduce in ckProfiler
      
      * format
      
      * add mnpadding to reduce library instance
      
      * add fp16 instance to reduce of profiler
      
      * change copyright time
      
      * restore profiler cmake file
      
      * add reduce text to instances
      
      * add DsLayout and DsDataType to instances template parameter
      
      * fixed gemm_reduce_multi_d
      
      * add an example without multi_d
      
      * Update common.hpp
      
      * Update gtest.cmake
      
      * Update gemm_xdl_splitk_reduce_bf16.cpp
      
      * clean
      
      * Update gtest.cmake
      
      * format
      
      * fix api
      
      * format
      
      * default parameter change to RRR
      
      * add vector_len for multi_d
      
      * format
      
      * Update gtest.cmake
      
      * fix bf16Ai8B elementwiseop
      
      * add ReduceDataType
      
      * move ReduceDataType to end position
      
      * format
      
      * remove googletest git method address
      
      * fix copyright time
      
      * update init data
      
      ---------
      Co-authored-by: root <jizhan@amd.com>
      Co-authored-by: letaoqin <letaoqin@amd.com>
      Co-authored-by: Jing Zhang <jizhan@meta.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
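The idea behind split-K with a reduction step can be sketched in a few lines. This is a minimal illustration, not the CK implementation: it assumes the K dimension is cut into `k_batch` contiguous chunks, that each chunk writes a partial C matrix into a workspace, and that the extra D tensors are combined by plain addition during the reduction. All names here (`matmul_partial`, `gemm_splitk_reduce`, `k_batch`, `ds`) are hypothetical.

```python
def matmul_partial(a, b, k_range):
    """Multiply a (M x K) by b (K x N), summing only over the given K indices."""
    m, n = len(a), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in k_range) for j in range(n)]
            for i in range(m)]


def gemm_splitk_reduce(a, b, k_batch, ds=()):
    """Split-K GEMM: compute per-chunk partial results, then reduce them.

    ds is an optional sequence of M x N matrices added during the reduction
    (an assumed stand-in for the multi-d element-wise step).
    """
    if k_batch < 1:
        raise ValueError("k_batch must be at least 1")
    k = len(b)
    chunk = -(-k // k_batch)  # ceiling division: K indices per chunk
    # Workspace: one partial M x N result per K chunk.
    workspace = [matmul_partial(a, b, range(start, min(start + chunk, k)))
                 for start in range(0, k, chunk)]
    m, n = len(a), len(b[0])
    # Reduction: sum the partial results and fold in the D tensors.
    return [[sum(p[i][j] for p in workspace) + sum(d[i][j] for d in ds)
             for j in range(n)]
            for i in range(m)]


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8]]
    b = [[1, 0], [0, 1], [2, 3], [4, 5]]
    print(gemm_splitk_reduce(a, b, k_batch=2))
```

With `k_batch=1` the workspace holds a single partial result and the reduction degenerates to a copy plus the D additions, which is the case a "kbatch=1" fix has to handle correctly.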
    • Refactor transform conv to gemm fwd (#1391) · 70a814f1
      Bartłomiej Kocot authored
      * Refactor transform conv to gemm fwd
      
      * fixes codegen
      
      * wmma fixes
      
      * fix wmma
      
      * Fix copyright
  5. 17 Jul, 2024 1 commit
  6. 16 Jul, 2024 1 commit
  7. 12 Jul, 2024 1 commit
  8. 06 Jul, 2024 1 commit
    • Universal streamk with atomics (#1360) · 75e622f0
      Harisankar Sadasivan authored
      * Universal streamk with atomics, with ckProfiler support. grid_size and the streamk strategy are tunable. A grid_size of -1 gives #WGs = maximum occupancy × num_CUs. The implementation supports several streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. A streamk strategy of -1 selects the default streamk policy (4-tile).
      
      * Update README.md
      
      * fixing clang-format issues
      
      * removed conflicts in struct members between streamk and universal streamk
      
      * corrected arg parsing for streamk and universal streamk
      
      * added stream-k policies for 3 tile and 4 tile
      
      * fixed argument type issue with parsing cmd args
      
      * changes suggested in PR review are made: removed comments and corrected copyright
      
      * file permissions updated
      
      * added default value support for grid_size and streamk-policy selection set to -1
      
      * print messages for arguments
      
      * print messages for arguments
      
      * print messages for arguments1
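The default handling described in this commit (a value of -1 for either argument selects a default) can be sketched as follows. This is only an illustration of the stated rule, not the CK argument parser: the function name, the parameter names, and the assumption that a policy is encoded by its tile count (1 to 4) are all hypothetical.

```python
DEFAULT_STREAMK_POLICY = 4  # the commit message names 4-tile as the default policy


def resolve_streamk_args(grid_size, streamk_policy, max_occupancy, num_cus):
    """Replace -1 placeholders with the defaults described in the commit message.

    grid_size == -1      -> number of workgroups = max_occupancy * num_cus
    streamk_policy == -1 -> default policy (4-tile)
    """
    if grid_size == -1:
        grid_size = max_occupancy * num_cus
    if streamk_policy == -1:
        streamk_policy = DEFAULT_STREAMK_POLICY
    if streamk_policy not in (1, 2, 3, 4):  # assumed encoding: tile count
        raise ValueError(f"unsupported stream-k policy: {streamk_policy}")
    return grid_size, streamk_policy


if __name__ == "__main__":
    # Example numbers are made up: occupancy 2 on a device with 304 compute units.
    print(resolve_streamk_args(-1, -1, max_occupancy=2, num_cus=304))
```

Explicit values pass through unchanged, so a tuned `grid_size` or policy always overrides the defaults.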
  9. 04 Jul, 2024 2 commits
  10. 27 Jun, 2024 2 commits
  11. 25 Jun, 2024 1 commit
    • CK Instance Gen (#1145) · 3e9711f0
      arai713 authored
      
      
      * Format
      
      * Format
      
      * Format
      
      * Remove const
      
      * Use the right template
      
      * Format
      
      * Format
      
      * add row/col instances
      
      * Add missing file
      
      * fixed
      
      * fixing block to etile error
      
      * Format
      
      * Updates
      
      * Format
      
      * fixed rrr layout
      
      * generating a sample JSON file: currently contains includes, prologue/epilogue and instances
      
      * version where the json is passed into the instances to generate a key
      
      * updated run function to just launch kernel
      
      * updated run function: only contains kernel object, json file is updated but still needs to be cleaned up, added front-end API to parse JSON into character buffer
      
      * adding in testing files
      
      * cleaned up comments, still need to work on including header files
      
      * removed unneeded files
      
      * removed/commented out JSON implementation
      
      * added fusion(prologue/epilogue) into instance generation
      
      * working on instance selection
      
      * added instance selection, need to fix instance validation
      
      * removed block2etile map validity check for testing purposes
      
      * test running: failing due to incorrect files/input
      
      * all grid descs/ptrs completed, but device file not found
      
      * Update test and embed modules
      
      * Restore older version
      
      * added convolution operation, written test, debugging generated code for compilation
      
      * attempting to include CK in host directory: _Float16 error
      
      * CK header file issues
      
      * slight fix
      
      * don't crash when hip can't report total memory
      
      * dump generated code to a file
      
      * changing sizes
      
      * creating tensor descriptors using CK methods: set up grid desc manually, also trying to set up an argument pointer - this needs to be fixed
      
      * some fixes to call the device code
      
      * separating test files for conv and gemm
      
      * completed arg ptr, now have linking errors
      
      * clang format fix
      
      * resolved linker issues in conv test
      
      * remove dependency on libutility from ck
      
      * resolved num dim error
      
      * properly passing arg ptr, errors with passing typenames: redefinition/redeclaration
      
      * undo the commenting of device function
      
      * hand created kernel code to find rtc issues
      
      * dump the full src to file
      
      * resolved redeclaration errors, cleaned up errors for Amber's kernel code
      
      * debugging purposes: redeclaration error
      
      * config files
      
      * resolved errors for NumTensor and redeclaration, formatted version.h
      
      * resolved most errors in manually added kernel and my own. error with calling kernel object: overloaded function type
      
      * WIP: close to getting kernel compiled
      
      * WIP: fixing rtc errors
      
      * fixed sequence errors, formatting, still one error with run fcn
      
      * yay: kernel compiles and runs
      
      * updated templated/generated version to run and compile
      
      * minor fixes
      
      * working generated example, resolved memory access error due to padding
      
      * adding in reference kernel, validation failing against reference
      
      * debugging: printing kernel args
      
      * reduced error in results
      
      * debugged reference kernel and output errors, added to generated version, currently debugging prologue function issues
      
      * working validation (using reference convolution) with prologue function for both hard-coded and generated version
      
      * WIP: create an alt version that creates Argument on the device
      
      * wip: added new duplicate files, fixed fusion templating errors from working example, setting up kernel arguments
      
      * wip: making necessary methods device code
      
      * added grid descs, working on grid pointers, errors with stl numerics
      
      * wip: updating kernel args - issue, replacing some std functions
      
      * replaced std::accumulate call with temp hardcoded version
      
      * wip: args causing memory issue
      
      * Construct Argument object inside the kernel and use it to call convolution device function. Code runs and verification passes
      
      * adding object file dump
      
      * temporary hardcoding of grid size, can remove device op inst + arg ptr
      
      * minor fix for grid size
      
      * added modified example where arg ptr is created on the device for generated version as well
      
      * removed device op instance and arg ptr from modified examples
      
      * moving device op file for testing purposes and to properly build CK
      
      * commenting out print-outs
      
      * adjust compiler args to produce a valid ELF file
      
      * temporary removal of validation
      
      * reverting compiler args back for working example
      
      * retrieve necessary arguments from generated template parameters in correct format
      
      * calculating grid size on host-side, still need to clean up process, pass parameters to host functions properly
      
      * scaled up factory functions/wrapper structs to implement host-side launch parameter calculations using CK host side functions - in hard-coded example
      
      * temporary change to generate ELF format binary object file
      
      * removed unnecessary code, added comments
      
      * formatting fix
      
      * cleaned up code, added new tests, restructured library: move helper into CK
      
      * refactored launch parameter calculation to be more concise
      
      * renamed files and variables for more clarity/uniformity
      
      * more code cleaning, removed debug statements
      
      * moved majority of my files into codegen directory, running properly
      
      * updated Embed.cmake(string_view) in codegen directory
      
      * updated host directory to match Embed.cmake as well
      
      * added old tests in
      
      * updated instance generation methods to be more concise
      
      * removed layout from launch parameter calculation
      
      * working test
      
      * fixed issue with verification, all instances working
      
      * updated verification in other tests
      
      * removed duplicate matrix padder file, removed code dumps
      
      * removed old hard-coded tests
      
      * removed old host directory, all files in codegen directory now
      
      * fixed copyright in files
      
      * commenting out validation
      
      * renamed files
      
      * made changes for review: fixed copyright, renamed files for clarity, removed comments, refactored code
      
      * updated headers
      
      * removing duplicate file for fwd conv to gemm, merging with original file
      
      * fix building codegen with clang++ directly
      
      * resolving build error from conv_fwd_to_gemm
      
      * fix for previous error
      
      * renaming tests
      
      * created common test file
      
      * cleaned up code, added comments
      
      * renamed device op
      
      * fixed typos in comments
      
      * removed extra space
      
      * code cleanup: resolving Amber's comments
      
      * removed wrapper struct for matrix padder, fixed template
      
      * cleaned up if statements for better readability
      
      ---------
      Co-authored-by: Paul <pfultz2@yahoo.com>
      Co-authored-by: Jing Zhang <jizha@amd.com>
      Co-authored-by: M. Amber Hassaan <amber_474@yahoo.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
  12. 21 Jun, 2024 2 commits
  13. 20 Jun, 2024 1 commit
  14. 18 Jun, 2024 3 commits
  15. 17 Jun, 2024 1 commit
  16. 14 Jun, 2024 1 commit
  17. 10 Jun, 2024 1 commit
  18. 05 Jun, 2024 2 commits
    • Integrate universal gemm with conv forward (#1320) · ac58cc5d
      Bartłomiej Kocot authored
      * Integrate universal gemm with conv fwd
      
      * Fix conv fwd wmma test
      
      * Fix instances
      
      * Remove direct load check
    • Add a scale op, related instances and examples (#1242) · cb0645be
      Rostyslav Geyyer authored
      
      
      * Add a scale op
      
      * Update the element op
      
      * Add instances
      
      * Add an example
      
      * Add a client example
      
      * Add a flag check
      
      * Revert flag check addition
      
      * Fix flag check
      
      * Update d strides in example
      
      * Update d strides in client example
      
      * Apply suggestions from code review
      
      Update copyright header
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Move the example
      
      * Move the client example
      
      * Update element op
      
      * Update example with the new element op
      
      * Add scalar layout
      
      * Update example
      
      * Update kernel for scalar Ds
      
      * Revert kernel changes
      
      * Update element op
      
      * Update example to use scales' pointers
      
      * Format
      
      * Update instances
      
      * Update client example
      
      * Move element op to unary elements
      
      * Update element op to work with values instead of pointers
      
      * Update instances to take element op as an argument
      
      * Update examples to use random scale values
      
      ---------
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
  19. 01 Jun, 2024 1 commit
    • Post-merge fix of PR 1300 (#1313) · 6fb1f4e0
      zjing14 authored
      * add f8 gemm with multiD for both row/col wise
      
      * change compute_type to fp8
      
      * changed tuning parameters in the example
      
      * add rcr example
      
      * post-merge fix
      
      * fix
      
      * reduce init range
  20. 28 May, 2024 1 commit
  21. 22 May, 2024 1 commit
  22. 20 May, 2024 1 commit
  23. 17 May, 2024 1 commit
  24. 15 May, 2024 2 commits
  25. 10 May, 2024 2 commits
  26. 09 May, 2024 2 commits
  27. 08 May, 2024 2 commits