1. 29 Oct, 2024 1 commit
  2. 18 Oct, 2024 1 commit
  3. 20 Sep, 2024 1 commit
  4. 12 Sep, 2024 1 commit
  5. 11 Sep, 2024 1 commit
  6. 14 Aug, 2024 1 commit
    • Haocong WANG's avatar
      [GEMM] gemm_universal related optimization (#1453) · 3049b546
      Haocong WANG authored
      
      
      * replace buffer_atomic with global_atomic
      
      * fixed global_atomic_add
      
      * added bf16 atomic_add
      
      * format
      
      * clang-format-12
      
      * clean
      
      * clean
      
      * add guards
      
      * Update gtest.cmake
      
      * enabled splitk_gemm_multi_d
      
      * format
      
      * add ckProfiler
      
      * format
      
      * fixed naming
      
      * format
      
      * clean
      
      * clean
      
      * add guards
      
      * fix clang format
      
      * format
      
      * add kbatch printout
      
      * clean
      
      * Add rocm6.2 related gemm optimization
      
      * Limit bf16 atomic usage
      
      * remove redundant RCR gemm_universal instance
      
      * Add RRR fp8 gemm universal instance
      
      * Bug fix
      
      * Add GPU_TARGET guard to FP8/BF8 target
      
      * bug fix
      
      * update cmake
      
      * remove all fp8/bf8 example if arch not support
      
      * Enable fp8 RRR support in ckProfiler
      
      * limit greedy-reverse flag to gemm_universal in ckProfiler
      
      ---------
      Co-authored-by: default avatarJing Zhang <jizhan@fb.com>
      Co-authored-by: default avatarJing Zhang <jizhan@meta.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      Co-authored-by: default avatarIllia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: default avatarillsilin <Illia.Silin@amd.com>
      3049b546
  7. 07 Aug, 2024 1 commit
  8. 06 Aug, 2024 2 commits
  9. 24 Jul, 2024 1 commit
  10. 17 Jul, 2024 1 commit
  11. 04 Jul, 2024 1 commit
  12. 27 Jun, 2024 2 commits
  13. 25 Jun, 2024 1 commit
    • arai713's avatar
      CK Instance Gen (#1145) · 3e9711f0
      arai713 authored
      
      
      * Format
      
      * Format
      
      * Format
      
      * Remove const
      
      * Use the right template
      
      * Format
      
      * Format
      
      * add row/col instances
      
      * Add missing file
      
      * fixed
      
      * fixing block to etile error
      
      * Format
      
      * Updates
      
      * Format
      
      * fixed rrr layout
      
      * generating a sample JSON file: currently contains includes, prologue/epilogue and instances
      
      * version where the json is passed into the instances to generate a key
      
      * updated run function to just launch kernel
      
      * updated run function: only contains kernel object, json file is updated but still needs to be cleaned up, added front-end API to parse JSON into character buffer
      
      * adding in testing files
      
      * cleaned up comments, still need to work on including header files
      
      * removed unneeded files
      
      * removed/commented out JSON implementation
      
      * added fusion(prologue/epilogue) into instance generation
      
      * working on instance selection
      
      * added instance selection, need to fix instance validation
      
      * removed block2etile map validity check for testing purposes
      
      * test running: failing due to incorrect files/input
      
      * all grid descs/ptrs completed, but device file not found
      
      * Update test and embed modules
      
      * Restore older version
      
      * added convolution operation, written test, debugging generated code for compilation
      
      * attempting to include CK in host directory: _Float16 error
      
      * CK header file issues
      
      * slight fix
      
      * don't crash when hip can't report total memory
      
      * dump generated code to a file
      
      * changing sizes
      
      * creating tensor descriptors using CK methods: set up grid desc manually, also trying to set up an argument pointer - this needs to be fixed
      
      * some fixes to call the device code
      
      * separating test files for conv and gemm
      
      * completed arg ptr, now have linking errors
      
      * clang format fix
      
      * resolved linker issues in conv test
      
      * remove dependency on libutility from ck
      
      * resolved num dim error
      
      * properly passing arg ptr, errors with passing typenames: redefinition/redeclaration
      
      * undo the commenting of device function
      
      * hand created kernel code to find rtc issues
      
      * dump the full src to file
      
      * resolved redeclaration errors, cleaned up errors for Amber's kernel code
      
      * debugging purposes: redeclaration error
      
      * config files
      
      * resolved errors for NumTensor and redeclaration, formatted version.h
      
      * resolved most errors in manually added kernel and my own. error with calling kernel object: overloaded function type
      
      * WIP: close to getting kernel compiled
      
      * WIP: fixing rtc errors
      
      * fixed sequence errors, formatting, still one error with run fcn
      
      * yay: kernel compiles and runs
      
      * updated templated/generated version to run and compile
      
      * minor fixes
      
      * working generated example, resolved memory access error due to padding
      
      * adding in reference kernel, validation failing against reference
      
      * debugging: printing kernel argsz
      
      * reduced error in results
      
      * debugged reference kernel and output errors, added to generated version, currently debugging prologue function issues
      
      * working validation (using reference convolution) with prologue function for both hard-coded and generated version
      
      * WIP: create an alt version that creates Argument on the device
      
      * wip: added new duplicate files, fixed fusion templating errors from working example, setting up kernel arguments
      
      * wip: making necessary methods device code
      
      * added grid descs, working on grid pointers, errors with stl numerics
      
      * wip: updating kernel args - issue, replacing some std functions
      
      * replaced std::accumulate call with temp hardcoded version
      
      * wip: args causing memory issue
      
      * Construct Argument object inside the kernel and use it to call convolution device function. Code runs and verification passes
      
      * adding object file dump
      
      * temporary hardcoding of grid size, can remove device op inst + arg ptr
      
      * minor fix for grid size
      
      * added modified example where arg ptr is created on the device for generated version as well
      
      * removed device op instance and arg ptr from modified examples
      
      * moving device op file for testing purposes and to properly build CK
      
      * commenting out print-outs
      
      * adjust compiler args to produce a valid ELF file
      
      * temporary removal of validation
      
      * reverting compiler args back for working example
      
      * retrieve necessary arguments from generated template parameters in correct format
      
      * calculating grid size on host-side, still need to clean up process, pass parameters to host functions properly
      
      * scaled up factory functions/wrapper structs to implement host-side launch parameter calculations using CK host side functions - in hard-coded example
      
      * temporary change to generate ELF format binary object file
      
      * removed unecessary code, added comments
      
      * formatting fix
      
      * cleaned up code, added new tests, restructured library: move helper into CK
      
      * refactored launch parameter calculation to be more concise
      
      * renamed files and variables for more clarity/uniformity
      
      * more code cleaning, removed debug statements
      
      * moved majority of my files into codegen directory, running properly
      
      * updated Embed.cmake(string_view) in codegen directory
      
      * updated host directory to match Embed.cmake as well
      
      * added old tests in
      
      * updated instance generation methods to be more concise
      
      * removed layout from launch parameter calculation
      
      * working test
      
      * fixed issue with verification, all instances working
      
      * updated verification in other tests
      
      * removed duplicate matrix padder file, removed code dumps
      
      * removed old hard-coded tests
      
      * removed old host directory, all files in codegen directory now
      
      * fixed copyright in files
      
      * commenting out validation
      
      * renamed files
      
      * made changes for review: fixed copyright, renamed files for clarity, removed comments, refactored code
      
      * updated headers
      
      * removing duplicate file for fwd conv to gemm, merging with original file
      
      * fix building codegen with clang++ directly
      
      * resolving build error from conv_fwd_to_gemm
      
      * fix for previous error
      
      * renaming tests
      
      * created common test file
      
      * cleaned up code, added comments
      
      * renamed device op
      
      * fixed typos in comments
      
      * removed extra space
      
      * code cleanup: resolving Amber's comments
      
      * removed wrapper struct for matrix padder, fixed template
      
      * cleaned up if statements for better readability
      
      ---------
      Co-authored-by: default avatarPaul <pfultz2@yahoo.com>
      Co-authored-by: default avatarJing Zhang <jizha@amd.com>
      Co-authored-by: default avatarM. Amber Hassaan <amber_474@yahoo.com>
      Co-authored-by: default avatarillsilin <Illia.Silin@amd.com>
      Co-authored-by: default avatarIllia Silin <98187287+illsilin@users.noreply.github.com>
      3e9711f0
  14. 21 Jun, 2024 1 commit
  15. 18 Jun, 2024 1 commit
  16. 17 May, 2024 1 commit
  17. 10 May, 2024 1 commit
  18. 09 May, 2024 1 commit
  19. 07 May, 2024 1 commit
  20. 26 Apr, 2024 1 commit
    • zjing14's avatar
      bf16A_Int8B with fastgelu/bias (#1264) · 0d0150db
      zjing14 authored
      * changed the copy function to v7r2
      
      * adding multi_abd
      
      * in-progress
      
      * add post-load oob check
      
      * debugging
      
      * adjust instances
      
      * add run_lds
      
      * add elemntwise_op
      
      * replace multi_abd_device with v3
      
      * clean up
      
      * clean
      
      * clean
      
      * Added LDSType
      
      * profiling
      
      * adjust oobcheck
      
      * add missing file
      
      * refactor
      
      * clean
      
      * add examples
      0d0150db
  21. 25 Apr, 2024 2 commits
    • Adam Osewski's avatar
      Grouped GEMM Multiple D tile loop. (#1247) · b4032629
      Adam Osewski authored
      * Overload output stream operator for LoopScheduler and PiplineVersion
      
      * Add Run overload accepting grid descriptors MK.
      
      * Add __device__ keyword for CalculateGridSize
      
      * Create device op GroupedGemmMultipleD
      
      * Add GroupedGemm MultipleD Tile Loop implementation.
      
      * Add an example for GroupedGemm MultipleD tile loop.
      
      * Device Op GroupedGEMMTileLoop.
      
      * Bunch of small changes in exmaple.
      
      * CkProfiler
      
      * Remove unused tparam.
      
      * Fix include statement.
      
      * Fix output stream overloads.
      
      * Do not make descriptors and check validity untill we find group.
      
      * Fix gemm desc initialization.
      
      * Revert device op
      
      * Fix compilation for DTYPES=FP16
      
      * Validate tensor transfers paramters.
      
      * Validate on host only NK dims if M is not known.
      
      * Fix bug.
      
      * A convenient debug func for selecting threads.
      
      * Fix has main k block loop bug.
      
      * Make sure that b2c has up to date tile offset.
      
      * Output stream operator for Sequence type.
      
      * Cmake file formatting.
      b4032629
    • ltqin's avatar
      Universal gemm flush cache (#1251) · f448d179
      ltqin authored
      
      
      * add flush cache to device op
      
      * add flush cache parameter to ckProfiler
      
      * change calculate size a and b method
      
      * chang evaluation time method foro AVERAGE to MEDIAN
      
      * format code
      
      * adjust some code
      
      * fix core dumped
      
      * remove loop call flush icache in kernel
      
      * remove loop(outer) call flush icache
      
      ---------
      Co-authored-by: default avatarletaoqin <letaoqin@amd.com>
      f448d179
  22. 16 Apr, 2024 1 commit
  23. 14 Apr, 2024 1 commit
    • Haocong WANG's avatar
      [GEMM] Gemm universal device operation (#1154) · f83e9701
      Haocong WANG authored
      
      
      * Optimize GEMM on MI200/300:
      1. Add new blockwise gemm pipeline
      2. Add irregular splitk intances
      
      * clang format + typo fix
      
      * Fix a bug
      
      * initial commit
      
      * Add more instances to irregular splitk
      
      * blkgemm pipeline v1~4 prototype
      
      * Sanity Checked. Known issue:
      1. Poor performance of splitk
      2. Register spill on blkgemmpipeline v3
      
      * Sanity and Performance fix:
      1. fix a bug related to sanity in grouped b2c mapping
      2. fix a bug related to sanity and performance in splitk offset
      
      * Sanity and API update:
      1. Remove prefetch stage
      2. Fix valid check bug
      3, Add first gemm_universal instance into ckProfiler
      
      * Add NN instances for gemm universal
      
      * 1. Add NT instances for gemm_universal
      2. Fix a bug about Kpadding in gemm_universal
      
      * Fix a bug regarding padding Odd K number
      
      * remove kernel print
      
      * Fix KPadding bug...
      
      * Update safety check
      
      * another try to fix kpadding..
      
      * Sanity checked
      
      * new instances..
      
      * clang format+typo fix
      
      * remove clang format script's change
      
      * Add non-hotloop compile option
      
      * 1. Add fp16xfp8 example
      2. pull packed convert f8 from pr1150
      
      * Some miscs.. opt and fix
      
      * Add pipeline description docs
      
      * Split universal gemm instance library to cut profiler compiling time
      
      * uncomment cmakefile
      
      * Fix a bug caused by blockwise_gemm_pipe_v2
      
      * reduce default splitk to 1
      
      * Add 224x256x64 tile size
      
      * update, including:
      1. Experiment pipeline 5~7
      2. Optimization for pipeline 4
      3. Organized instance library
      
      * temp save
      
      * temp save
      
      * Permuted lds layout, sanity and function checked
      
      * clang format
      
      * Move OOB check from RunRead to RunWrite, for better software pipeline.
      TODO: agpr spill when NN layout
      
      * clangformat
      
      * A/B splitpipe scheduler for v3
      
      * Fix two bugs
      
      * bug fix
      
      * fix a bug in oob check
      
      * Example for mixed fp16_fp8 gemm
      
      * Clean experimental code blocks
      
      * Add mixed precision gemm into profiler
      
      * tempsave
      
      * optimize m/n major lds layout
      
      * Add RRR GEMM  mixed precision instances
      
      * Optimize f8 matrix transpose
      
      * Add test_gemm_universal
      
      * A/B spilt schedule for blkpip v5
      
      * Take ds_read2 into iglp scheduling scheme
      
      * format
      
      * fixed cmake
      
      * Add llvm-option into CI cmake flag
      
      ---------
      Co-authored-by: default avatarJing Zhang <jizhan@amd.com>
      f83e9701
  24. 04 Apr, 2024 1 commit
  25. 02 Apr, 2024 1 commit
  26. 10 Mar, 2024 1 commit
  27. 09 Mar, 2024 1 commit
    • zjing14's avatar
      Navi3 rel (#1176) · 1837040a
      zjing14 authored
      
      
      * wmma_op + unit test
      
      * add arch limitation to wmma test
      
      * change arch limitation
      
      * Refactor + Add all type unit test(int4 compile failed)
      
      * Add f32_16x16x16_bf16 unit test
      
      * tempsave
      
      * tempsave
      
      * tempsave
      
      * runtime bug, cannot find symbol
      
      * workaround for incorrect HIP warpSize return value
      
      * debugging
      
      * tempsave
      
      * Correctness OK, waiting for optimization
      
      * Tidy up + format
      
      * temp save
      
      * temp save, reproduce the v_bfi_b32 issue
      
      * add inline asm for wmmaop test
      
      * tidy up
      
      * clean some debug purpose code
      
      * discard some codes
      
      * clang format
      
      * clang format
      
      * compiler issue fixed + increase tile size
      
      * navi3x_multipleD+example
      
      * temp save
      
      * workable
      
      * batchedgemm[OK], groupconv[debug]
      
      * groupconv: Sanity check[OK], Performance[Bad]
      
      * navi3x_groupconv_need_optimization
      
      * create necessary files
      
      * save progress
      
      * Add Inter-Row thread transfer
      
      * save progress
      
      * save debugging progress
      
      * sanity check pass
      
      * fix a host tensor bug and clean up flash-attn code
      
      * format
      
      * cancel unnecessary change
      
      * cancel unnecessary change
      
      * cancel unnecessary change
      
      * temp save, add asm backend flag to amd_wmma
      
      * Mat-A LDS Bypass sanity pass
      
      * temp save
      
      * gemm sanity fix
      
      * Porting new blockwise gemm to flash attention
      
      * Example branch provide to compiler team
      
      * tempsave
      
      * Fix a bug
      
      * batched gemm ported
      
      * conv A-skip lds ported
      
      * Skip B-Lds real gemm
      
      * Skip B Lds Gemm + MulD
      
      * batched gemm, conv, skip b lds
      
      * format
      
      * Attn, skip b lds
      
      * Change GridwiseOp nam
      
      * fix a typo caused bug
      
      * Skip A_Lds sanity pass, Skip B_Lds scratch occured
      
      * Bug found, intra-row permute off caused
      
      * bug found
      
      * a fix
      
      * disable buffer load due to incorrect 3rd dword
      
      * update fmha config, no scratch generated
      
      * update 3rd dword
      
      * fmha config update
      
      * FMHA, add support to gfx1101/gfx1102
      
      * Merge origin dev (#2)
      
      * [Navi3x] Fix Gridwise_multiple_d operation (#649)
      
      * Add CMake Option "USE_OPT_NAVI3X"
      
      * fix bug
      
      * standardize docs (#655)
      
      * Separate bibtex requirement from rocm-docs-core (#656)
      
      * separate bibtex requirement from rocm-docs-core
      
      * point requirements to source rocm-docs-core repo
      
      * Add CMake Option "USE_OPT_NAVI3X" (#647)
      
      * Add CMake Option "USE_OPT_NAVI3X"
      
      * remove navi3x opt compile option from cmake script
      
      * Conv + quantization + tanh  (#645)
      
      * Rename file. Prepare to support another activation
      
      * Add comment for quantization
      
      * Extract out_elementop
      
      * Add tanh example
      
      * Add conv + bias + tanh quantization instance
      
      * Add missing parameter
      
      * Refine cmake
      
      * Add external api and client example
      
      * Extract variable in example
      
      * Fix the comment
      
      ---------
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      
      * Add a denorm test fix (#603)
      
      * Add type_convert implementations for bf16
      
      * Add the fix for conv_fwd
      
      * Add the fix for conv_bwd_data
      
      * Add the fix for conv_bwd_weight
      
      * Format
      
      * Format
      
      * Another format
      
      * Add a macro to use workaround on MI200 only
      
      * Format
      
      ---------
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      
      * simplify karg in device/grid of split-k op (#644)
      
      * simplify karg in device/grid split-k op
      
      * fix mk_kn_mn instances
      
      * add more instances
      
      * use name from tensor layout
      
      * fix 3rd dword of buffer source descriptor (#659)
      
      * add fp64 instances (#658)
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      
      * Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)
      
      This reverts commit bb5530af
      
      .
      
      * Groupnorm + swish external api (#668)
      
      * Rename to proper naming
      
      * Add example of groupnorm + swish
      
      * Extract duplicate code in example
      
      * Add groupnorm + swish instances
      
      * Ractor instance generation, split into multiple cpp file
      
      * Add external api and client example
      
      * Refine profiler message
      
      * Use ck math version of exp
      
      * Refine problem size in example
      
      * Add host version of exp
      
      * add a marco to turn on/off denorm fix (off by default) (#673)
      
      * add a marco to turn off denorm fix by default
      
      * expose the marco
      
      ---------
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      
      * fixed quant example (#672)
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      
      * Add dependabot config and pin rocm-docs-core (#663)
      
      * [gtest] suppress unsafe buffer warn (#670)
      
      ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912
      
      
      
      * Add memory index guard in wmma device ops (#667)
      
      * Add more macros to turn on/off denorm fix (#678)
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      
      * Fix a typo (#676)
      
      * Add (#677)
      
      * Allow using ROCm release candidate compilers. (#679)
      
      * enable use of rocm5.5 release candidate 4
      
      * upgrade to ROCM5.5 RC5
      
      * try fix the PUB_KEY error, remove the cmake-data package
      
      * upgrade to latest cmake version
      
      * use private dockerhub repo for rocm5.5 rc5
      
      * add missing bracket
      
      * add vector load check
      
      * solve conflicts
      
      ---------
      Co-authored-by: default avatarSam Wu <sjwu@ualberta.ca>
      Co-authored-by: default avatarSam Wu <sam.wu2@amd.com>
      Co-authored-by: default avatarrocking5566 <ChunYu.Lai@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      Co-authored-by: default avatarRostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: default avatarcarlushuang <carlus.huang@amd.com>
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      Co-authored-by: default avatarJun Liu <Liu.Jun@amd.com>
      Co-authored-by: default avatarIllia Silin <98187287+illsilin@users.noreply.github.com>
      
      * Disable SkipLDS & Align AIT api (#3)
      
      * fix layernorm, reduction Ops (#4)
      
      * [Navi3x] Fix Gridwise_multiple_d operation (#649)
      
      * Add CMake Option "USE_OPT_NAVI3X"
      
      * fix bug
      
      * standardize docs (#655)
      
      * Separate bibtex requirement from rocm-docs-core (#656)
      
      * separate bibtex requirement from rocm-docs-core
      
      * point requirements to source rocm-docs-core repo
      
      * Add CMake Option "USE_OPT_NAVI3X" (#647)
      
      * Add CMake Option "USE_OPT_NAVI3X"
      
      * remove navi3x opt compile option from cmake script
      
      * Conv + quantization + tanh  (#645)
      
      * Rename file. Prepare to support another activation
      
      * Add comment for quantization
      
      * Extract out_elementop
      
      * Add tanh example
      
      * Add conv + bias + tanh quantization instance
      
      * Add missing parameter
      
      * Refine cmake
      
      * Add external api and client example
      
      * Extract variable in example
      
      * Fix the comment
      
      ---------
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      
      * Add a denorm test fix (#603)
      
      * Add type_convert implementations for bf16
      
      * Add the fix for conv_fwd
      
      * Add the fix for conv_bwd_data
      
      * Add the fix for conv_bwd_weight
      
      * Format
      
      * Format
      
      * Another format
      
      * Add a macro to use workaround on MI200 only
      
      * Format
      
      ---------
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      
      * simplify karg in device/grid of split-k op (#644)
      
      * simplify karg in device/grid split-k op
      
      * fix mk_kn_mn instances
      
      * add more instances
      
      * use name from tensor layout
      
      * fix 3rd dword of buffer source descriptor (#659)
      
      * add fp64 instances (#658)
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      
      * Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)
      
      This reverts commit bb5530af
      
      .
      
      * Groupnorm + swish external api (#668)
      
      * Rename to proper naming
      
      * Add example of groupnorm + swish
      
      * Extract duplicate code in example
      
      * Add groupnorm + swish instances
      
      * Ractor instance generation, split into multiple cpp file
      
      * Add external api and client example
      
      * Refine profiler message
      
      * Use ck math version of exp
      
      * Refine problem size in example
      
      * Add host version of exp
      
      * add a marco to turn on/off denorm fix (off by default) (#673)
      
      * add a marco to turn off denorm fix by default
      
      * expose the marco
      
      ---------
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      
      * fixed quant example (#672)
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      
      * Add dependabot config and pin rocm-docs-core (#663)
      
      * [gtest] suppress unsafe buffer warn (#670)
      
      ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912
      
      
      
      * Add memory index guard in wmma device ops (#667)
      
      * Add more macros to turn on/off denorm fix (#678)
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      
      * Fix a typo (#676)
      
      * Add (#677)
      
      * Allow using ROCm release candidate compilers. (#679)
      
      * enable use of rocm5.5 release candidate 4
      
      * upgrade to ROCM5.5 RC5
      
      * try fix the PUB_KEY error, remove the cmake-data package
      
      * upgrade to latest cmake version
      
      * use private dockerhub repo for rocm5.5 rc5
      
      * add missing bracket
      
      * Disable SkipLDS & Align AIT api
      
      * Update dependabot config (#682)
      Co-authored-by: default avatarsamjwu <samjwu@users.noreply.github.com>
      
      * update attn api
      
      * solve type_convert bug + enable
      
      ---------
      Co-authored-by: default avatarSam Wu <sjwu@ualberta.ca>
      Co-authored-by: default avatarSam Wu <sam.wu2@amd.com>
      Co-authored-by: default avatarrocking5566 <ChunYu.Lai@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      Co-authored-by: default avatarRostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: default avatarcarlushuang <carlus.huang@amd.com>
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      Co-authored-by: default avatarJun Liu <Liu.Jun@amd.com>
      Co-authored-by: default avatarIllia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: default avatarsamjwu <samjwu@users.noreply.github.com>
      Co-authored-by: default avatarhaocwang <Haocong.WANG@amd.com>
      
      * fix typo
      
      * Fix attention with causal mask
      
      * multiple fix, try ait compile
      
      * Add A/B not use LDS pipeline
      
      * Clang format, Add gfx1101, gfx1102 support of FMHA example
      
      * cancel change of format script
      
      * 1. Enable 2-stage global Prefetch ( May cause VGPR spilling)
      2. Enable FP16 accumulator blockwise_gemm
      
      * clang-format
      
      * 1. change blockwise gemm loopover direction from kmn to mnk ( ~1% improvement)
      2. change kernel timing mode to 50 warmup + 50 timed repeat
      
      * Update low level abstration of blockwise gemm wmma
      
      * (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds
      
      * (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds
      
      * (4/5) grouped conv pass
      
      * (5/5) attention pass, todo: debug lds perf bug
      
      * AIT Attention API refactor (#8)
      
      * sanity pass
      
      * sanity pass 2
      
      * confirm significant performance regression.
      
      * turn on all instances
      
      * turn off instance format
      
      * Fix bug & tunning & format
      
      * DML meta, self_attn+cross_attn
      
      * sanity pass
      
      * remove useless flag
      
      * update tile and problem size used in AIT attention
      
      * bug fix in grouped conv supporting check
      
      * deprecate inline asm wmma
      
      * Bug fix: double lds skip
      
      * clang-format
      
      * Fix errors in
      1. example, fmha
      2. gridwise pipeline
      3. deviceop, fmha, change some containers from vector to array
      
      * part2 of previous commit
      
      * clang format
      
      * API fix of gridwisegemmpipeline
      
      * separate array base and vector base attention tensor transformation
      
      * fix gemm
      
      * clang format
      
      * add gemm fp16 instances
      
      * Temp save
      
      * fpAintB kernel compile pass
      
      * Sanity pass.
      
      * Temp save
      
      * debug code enabled
      
      * Fp16AInt8B_GEMM sanity
      
      * MQA implementation
      
      * GQA-4 example
      
      * tempsave
      
      * Compile pass
      
      * New implementation of fp16Aint8B Gemm, Acheieve similar math throughput with native fp16 Gemm
      
      * format
      
      * Todo: fix gemm_bilinear_wmma instances compilation bug
      
      * Solve a bug when K1=16
      
      * remove unnecessary changes
      
      * Remove tensor layout limitation to LDS usage in tesnor contraction
      
      * update self-attention and cross-attention
      
      * fix a typo of name
      
      * Add arch limiter for fp8 gemm
      
      * enable fp8 gemm_xdl for all gfx9 targets
      
      * temporarily disable gemm_xdl_fp16_fp8 on MI100/200
      
      * fix the cmake logic for gemm_xdl_fp16_fp8
      
      * re-enable the gemm_xdl_fp16_fp8 on MI100/200
      
      ---------
      Co-authored-by: default avataraska-0096 <haocwang@amd.com>
      Co-authored-by: default avatarSam Wu <sjwu@ualberta.ca>
      Co-authored-by: default avatarSam Wu <sam.wu2@amd.com>
      Co-authored-by: default avatarrocking5566 <ChunYu.Lai@amd.com>
      Co-authored-by: default avatarRostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
      Co-authored-by: default avatarRosty Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: default avatarcarlushuang <carlus.huang@amd.com>
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      Co-authored-by: default avatarJun Liu <Liu.Jun@amd.com>
      Co-authored-by: default avatarIllia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: default avatarsamjwu <samjwu@users.noreply.github.com>
      Co-authored-by: default avatarhaocwang <Haocong.WANG@amd.com>
      Co-authored-by: default avatarillsilin <Illia.Silin@amd.com>
      1837040a
  28. 01 Mar, 2024 1 commit
  29. 27 Feb, 2024 1 commit
    • Illia Silin's avatar
      Clip fp8 to +/-240 on all targets. (#1172) · d0c7b451
      Illia Silin authored
      * clip fp8 to +/-240 on all targets
      
      * if inputs to fp8 conversion are +/-inf, they remain unaltered
      
      * increase tolerance for test_elementwise_layernorm to prevent false errors
      
      * change the input values for gemm examples to floats
      
      * reduce gemm example float input values to prevent errors
      
      * increase the tolerance for gemm examples
      d0c7b451
  30. 26 Feb, 2024 1 commit
  31. 24 Feb, 2024 1 commit
  32. 21 Feb, 2024 1 commit
  33. 17 Feb, 2024 1 commit
  34. 16 Feb, 2024 2 commits
  35. 15 Feb, 2024 1 commit
  36. 14 Feb, 2024 1 commit