    [GEMM] Gemm universal device operation (#1154) · f83e9701
    Haocong WANG authored
    
    
    * Optimize GEMM on MI200/300:
    1. Add new blockwise gemm pipeline
    2. Add irregular splitk instances
    
    * clang format + typo fix
    
    * Fix a bug
    
    * initial commit
    
    * Add more instances to irregular splitk
    
    * blkgemm pipeline v1~4 prototype
    
    * Sanity checked. Known issues:
    1. Poor performance of splitk
    2. Register spill on blkgemmpipeline v3
    
    * Sanity and Performance fix:
    1. fix a bug related to sanity in grouped b2c mapping
    2. fix a bug related to sanity and performance in splitk offset
    
    * Sanity and API update:
    1. Remove prefetch stage
    2. Fix valid check bug
    3. Add first gemm_universal instance into ckProfiler
    
    * Add NN instances for gemm universal
    
    * 1. Add NT instances for gemm_universal
    2. Fix a bug about Kpadding in gemm_universal
    
    * Fix a bug regarding padding Odd K number
    
    * remove kernel print
    
    * Fix KPadding bug...
    
    * Update safety check
    
    * another try to fix kpadding..
    
    * Sanity checked
    
    * new instances..
    
    * clang format+typo fix
    
    * remove clang format script's change
    
    * Add non-hotloop compile option
    
    * 1. Add fp16xfp8 example
    2. pull packed convert f8 from pr1150
    
    * Miscellaneous optimizations and fixes
    
    * Add pipeline description docs
    
    * Split universal gemm instance library to cut profiler compiling time
    
    * uncomment cmakefile
    
    * Fix a bug caused by blockwise_gemm_pipe_v2
    
    * reduce default splitk to 1
    
    * Add 224x256x64 tile size
    
    * update, including:
    1. Experiment pipeline 5~7
    2. Optimization for pipeline 4
    3. Organized instance library
    
    * temp save
    
    * temp save
    
    * Permuted lds layout, sanity and function checked
    
    * clang format
    
    * Move OOB check from RunRead to RunWrite, for better software pipeline.
    TODO: AGPR spill with NN layout
    
    * clangformat
    
    * A/B splitpipe scheduler for v3
    
    * Fix two bugs
    
    * bug fix
    
    * fix a bug in oob check
    
    * Example for mixed fp16_fp8 gemm
    
    * Clean experimental code blocks
    
    * Add mixed precision gemm into profiler
    
    * tempsave
    
    * optimize m/n major lds layout
    
    * Add RRR GEMM mixed precision instances
    
    * Optimize f8 matrix transpose
    
    * Add test_gemm_universal
    
    * A/B split schedule for blkpip v5
    
    * Take ds_read2 into iglp scheduling scheme
    
    * format
    
    * fixed cmake
    
    * Add llvm-option into CI cmake flag
    
    ---------
    Co-authored-by: Jing Zhang <jizhan@amd.com>
profile_gemm_universal.cpp 5.67 KB