".github/git@developer.sourcefind.cn:change/sglang.git" did not exist on "e9a6203dee21cda91a8f5a113ea4171f3b221571"
  1. 01 Sep, 2022 1 commit
  2. 31 Aug, 2022 2 commits
      Add examples of Conv + reduction (data type: int4, int8, bf16, fp16, fp32) (#380) · 46a675aa
      Po Yen Chen authored
      
      
      * Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle
      
      * Add 'DeviceGroupedConvFwdMultipleDMultipleR' interface
      
      * Add DeviceGroupedConvFwdMultipleDMultipleR_Xdl_CShuffle
      
      * Remove 'GridwiseConvFwdMultipleDMultipleR_xdl_cshuffle'
      
      * Add 'TransformConvFwdToGemm<>' utility class (from Chao)
      
      * Use 'TransformConvFwdToGemm<>' to shorten code
      
      * Fix ill-formed method declaration
      
      * Re-implement MakeRGridDescriptor_M() function
      
      * Change problem description
      
      * Use macro to define layout types
      
      * Define K-reduced output tensor layout types
      
      * Let user decide R output tensor layout
      
      * Rename variables
      
      * Add padding to the reduced output tensor if necessary
      
      * Extract common code as helper method
      
      * Remove debug message
      
      * Add missing include directive
      
      * Add partial fp16 Conv + Reduction example
      
      * Add example verification code for 2D Conv problem
      
      * Use type alias to simplify code
      
      * Share code across different-dimension Conv problems
      
      * Rename file/functions from run_conv_fwd* to run_convnd_fwd*
      
      * Make example code more verbose
      
      * Add code to support 1D & 3D Conv + Reduction on host
      
      * Add more examples for data type: bf16, fp32
      
      * Add example for int8
      
      * Add custom target to group examples
      
      * Use more general custom target name
      
      * Change the description in error message
      
      * Disable testing for examples other than fp32
      
      * Add example for int4 (just copy from int8)
      
      * Fix wrong data type
      
      * Use larger data type for intermediate tensors
      
      * Finish int4 example
      
      * Undefine macro PP_DEFINE_LAYOUT_TYPE() after use
      
      * Use named variables to replace magic numbers
      
      * Remove debug messages
      
      * Use same A/B data type for host Conv in int4 example
      
      * Add check for the 'RLayout' type argument
      
      * Group same-dim-layouts together in 'LayoutSetting<>'
      
      * Add 'final' specifier to utility classes
      
      * Use different initialization method for examples
      
      * Remove macro PP_DEFINE_LAYOUT_TYPE()
      
      * Fix code-comment mismatch
      
      * Use more reasonable initialization value for all data types
      
      * Default use init_method=1 for all examples
      
      * Remove never-used code
      
      * Remove confusing out-of-date comments
      
      * clean
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
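The 'TransformConvFwdToGemm<>' utility mentioned above lowers a forward convolution into an implicit GEMM. As a rough host-side illustration of that mapping (an explicit im2col, assuming NHWC input, KYXC weights, stride 1, no padding; this is a sketch, not CK's descriptor-based API):

```python
def conv2d_as_gemm(inp, wei):
    """inp: [N][Hi][Wi][C] nested lists, wei: [K][Y][X][C]; returns out[N][Ho][Wo][K]."""
    N, Hi, Wi, C = len(inp), len(inp[0]), len(inp[0][0]), len(inp[0][0][0])
    K, Y, X = len(wei), len(wei[0]), len(wei[0][0])
    Ho, Wo = Hi - Y + 1, Wi - X + 1
    # GEMM A: one row per output pixel (GemmM = N*Ho*Wo, GemmK = Y*X*C)
    a = [[inp[n][ho + y][wo + x][c]
          for y in range(Y) for x in range(X) for c in range(C)]
         for n in range(N) for ho in range(Ho) for wo in range(Wo)]
    # GEMM B: weights flattened to [GemmK][GemmN = K]
    b = [[wei[k][y][x][c] for k in range(K)]
         for y in range(Y) for x in range(X) for c in range(C)]
    # plain GEMM: C[i][j] = sum_p A[i][p] * B[p][j]
    m = [[sum(row[p] * b[p][j] for p in range(len(b))) for j in range(K)]
         for row in a]
    # fold GemmM back into (N, Ho, Wo)
    return [[[m[(n * Ho + ho) * Wo + wo] for wo in range(Wo)]
             for ho in range(Ho)] for n in range(N)]
```

The K-reduction examples then reduce over part of this GEMM output, which is why the R output layout choice in the commits above matters.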
      conv+conv (1x1 only) example using gemm+gemm (#393) · 4df6d93f
      Chao Liu authored
      * refactor conv
      
      * add conv+conv example, 1x1 only
  3. 30 Aug, 2022 2 commits
      Gemm reduce examples int4/int8/fp32/bf16 (#368) · d00e6115
      Adam Osewski authored
      
      
      * GEMM + Reduce max fp16+fp32
      
      * GEmm + Max bf16 + int8
      
      * Refactor common definitions.
      
      * Refactor common func of mean meansquare example.
      
      * More examples for mean meansquare.
      
      * Update int8 examples and skip them because of random errors.
      
      * Int4 examples.
      
      * Fix examples for max int4/8
      
      * Tensor conversion for int4 input data for mean meansquare example.
      
      * Remove int4 mean_meansquare example
      
      * Fix int8 mean_meansquare example.
      
      - All ReductionAccData and R<N>DataType have to be F32. The INT32 data type gives wrong results.
      
      * Guard int4 with ifdef
      
      * Change int8 example to add_addsquare due to div rounding err.
      
      * Clang format
      
      * Change the return type of common function.
      
      * Get back int8 example with division.
      
      * Remove int8 mean meansquare.
      
      * Use proper cast for BF16 data type.
      
      * Use ck::literals.
      
      * Use proper data type for host tensors & reference.
      
      - Use ReduceAccDataType for reference gemm output data type.
      - Cast host reference output tensor to EDataType
      - Fix ifdefs for int4.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
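The note above about forcing F32 accumulators can be illustrated on the host: averaging reduction outputs with an integer accumulator truncates at the division step, while a float accumulator keeps the fractional part. A hedged stand-in, not CK code:

```python
def mean_int_acc(row):
    acc = 0                  # INT32-style accumulator
    for v in row:
        acc += v
    return acc // len(row)   # integer division truncates the mean

def mean_f32_acc(row):
    acc = 0.0                # F32-style accumulator
    for v in row:
        acc += float(v)
    return acc / len(row)
```

This is the rounding behavior behind "Change int8 example to add_addsquare due to div rounding err."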
      Padding for attention: bmm+scale+softmax+bmm kernel (#385) · 45adb736
      Shaojie WANG authored
      
      
      * add padding algo for bmm+scale+softmax+bmm. Version for verification
      
      * remove verification code
      
      * remove comments
      
      * add padded bmm scale softmax bmm example
      
      * format
      
      * refactor
      
      * add comments for usages of padding bmm+scale+softmax+bmm
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
  4. 25 Aug, 2022 3 commits
  5. 23 Aug, 2022 4 commits
      Add examples of Gemm (data type: int4) (#367) · fa2d894b
      Po Yen Chen authored
      * Add GEMM examples for int4
      
      Currently the source files are just copied from int8 examples
      
      * Re-use pre-defined alias in int4 examples
      
      * Distinguish user-side type from kernel-side type
      
      * Add int4_t support for check_err()
      
      * Allow conversion between Tensor<> specializations
      
      * Re-format source files
      
      * Use different type for host tensors
      
      * Re-use CopyAsType<>() to implement copy ctor
      
      * Re-use element-wise operation type alias
      
      * Fix typo in alias names
      
      * Complete the int4 examples
      
      * Add constraint to Tensor<> templated methods
      
      * Add type traits 'is_signed_integral<>'
      
      * Add type constraints for integer version check_err<>()
      
      * Allow comparing different-sized integral types in check_err()
      
      * Check converted Tensor<int4_t> with golden Tensor<int8_t>
      
      * Remove constraint of Tensor<>::CopyAsType()
      
      * Avoid compilation error while disabling ck::int4_t support
      
      * Remove debug messages
      
      * Add #error directive to prevent compiling sources with wrong settings
      
      * Simplify tensor usages in examples
      
      * Add constraint to check_err() input reference type
      
      * Align design with other PR
      
      * Use ""_uz to simplify example code
      
      * Avoid too much generalizing check_err()
      
      * Re-format GEMM instance template arguments
      
      * Extract int4 example common codes
      
      * Sort include directives
      
      * Move #include directives into new header
      
      * Move common codes together
      
      * Re-format template argument in example code
      
      * Reuse same implementation code for most of GEMM examples
      
      * Re-format common.hpp
      
      * Unify structured comment in examples
      
      * Use reinterpret_cast<>() for cross-type pointer conversion
      
      * Revert "Add type traits 'is_signed_integral<>'"
      
      This reverts commit f2c148efaedf42c8ee66032dac6d13a1003b0f3a.
      
      * Allow unsigned integer arguments for check_err()
      
      * Fix compilation error in check_err()
      
      * Remove unnecessary copy ctor for Tensor<>
      
      * Mark Tensor<> special member functions as 'default'
      
      * Use more strict condition to add code in examples
      
      * Fix wrong program return value of GEMM examples
      
      * Handle the case where the user specifies all the strides
      
      * Fix never-ran examples
      
      * Exit successfully if GEMM instance does not support given problem
      
      * Add missing 'else' keyword
      
      * Re-format CMakeLists.txt
      
      * Add wrapper function to hide value conversion while copying memory
      
      * Add new DeviceMem API to copy memory
      
      * Use new DeviceMem API to implement examples
      
      * Revert "Add new DeviceMem API to copy memory"
      
      This reverts commit 3f190b0779ceedf7aaf0b380712fda0518de72c1.
      
      * Add conversion ctor for Tensor<>
      
      * Write Tensor<> conversion logics explicitly in example code
      
      * Convert Tensor<> values after transferring data to host
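The check_err() work above boils down to comparing element-wise with a tolerance after widening both sides, so different-sized integer types (e.g. int4 result vs. int8 golden) compare without overflow. A minimal sketch with illustrative names, not CK's actual signatures:

```python
def check_err(result, reference, atol=0):
    """Compare two integer sequences element-wise with an absolute tolerance."""
    if len(result) != len(reference):
        return False
    for r, g in zip(result, reference):
        # widen to Python int (stand-in for a common wider integral type)
        if abs(int(r) - int(g)) > atol:
            return False
    return True
```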
      Attention with output permutation (#370) · e0d8806c
      Anthony Chang authored
      * comment on specialization for TensorSpecialization::Packed
      
      * gemm_softmax_gemm with output permutation
      
      * scaling
      
      * refactor MatrixPadder; rename to GemmPadder
      
      * remove old sanity check
      
      * restore original gemm_softmax_gemm
      
      * revise comment in gemm_softmax_gemm example
      
      * use GetElementSpaceSize()
      
      * remove extra header
      
      * typo
      
      * remove archaic DeviceOpPtr
      Add examples of batched/grouped/SplitK Gemm for int8/bfp16/fp16/fp32 (#361) · 60914583
      zjing14 authored
      
      
      * add examples into grouped/batched_gemm
      
      * adding splitK examples
      
      * fixed splitK
      
      * add bfp16 int8 example into splitK
      
      * formatting
      
      * use static_cast
      
      * added common for batched_gemm
      
      * add commons for examples of splitK/batched/grouped_gemm
      
      * return true
      
      * adjust splitK check tol
      
      * update example
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
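The split-K scheme these examples exercise can be sketched on the host: the K dimension is split into `k_batch` chunks, each chunk produces a partial C, and the partials are accumulated (on GPU this is per-workgroup with an atomic or reduction step). Because the summation order changes, the examples adjust the check tolerance. Plain Python stand-in; names are illustrative, not CK's API:

```python
def gemm(a, b):
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def splitk_gemm(a, b, k_batch):
    m, k, n = len(a), len(b), len(b[0])
    chunk = (k + k_batch - 1) // k_batch
    c = [[0] * n for _ in range(m)]
    for kb in range(k_batch):
        lo, hi = kb * chunk, min((kb + 1) * chunk, k)
        if lo >= hi:
            continue                       # nothing left for this chunk
        partial = gemm([row[lo:hi] for row in a], b[lo:hi])
        for i in range(m):
            for j in range(n):
                c[i][j] += partial[i][j]   # accumulate the partial product
    return c
```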
      Add example of Gemm + AddAddFastGelu (data type: int4) (#369) · 2327f1a6
      Po Yen Chen authored
      * Add custom target to bundle examples together
      
      * Add int4 example conditionally (just copy from int8 example)
      
      * Extract common code into common.hpp
      
      * Move ref gemm type alias into data-type-specific sources
      
      * Add #error directive to prevent compiling with wrong settings
      
      * Let AddAddFastGelu support int4 parameter type
      
      * Let check_err() support int4 parameter type
      
      * Add wrapper function to hide value conversion while copying memory
      
      * Finish int4 example for GEMM + AddAddFastGelu
      
      * Add new DeviceMem API to copy memory
      
      * Use new DeviceMem API to implement examples
      
      * Fix wrong use of macro 'CK_EXPERIMENTAL_BIT_INT_EXTENSION_INT4'
      
      * Revert "Add new DeviceMem API to copy memory"
      
      This reverts commit e26e7af71e1f982a4ca7406401e2fc9b1f086b32.
      
      * Add conversion ctor for Tensor<>
      
      * Add 'const' specifier to Tensor<>::CopyAsType()
      
      * Convert Tensor<> values before/after transfer between host & device
  6. 22 Aug, 2022 1 commit
  7. 17 Aug, 2022 1 commit
  8. 15 Aug, 2022 1 commit
      Batchnorm-forward and Batchnorm-infer Implemented using generic kernels (#320) · 53ea4713
      Qianfeng authored
      * Implement multiple-reduction in one kernel (kernels, device ops, examples)
      
      * Add generic elementwise kernel and device interface
      
      * Add generator for normal-distributed data initialization
      
      * Add host refer implementation of batchnorm-forward and batchnorm-infer
      
      * Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels
      
      * Remove unneeded include in batchnorm example
      
      * Rename generic_elementwise to elementwise in kernel and device classes/functions
      
      * Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise
      
      * Change in example 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise
      
      * Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise
      
      * Add DeviceElementwiseBase and use it in device_normalize_instance.cpp
      
      * Removing and renaming files
      
      * Update to synchronize gemm_layernorm client example to the generic element-wise device op API
      
      * Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming
      
      * Merge two static member functions in device_elementwise.hpp
      
      * Remove unary_elementwise_1d kernel and device
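The batchnorm-infer path these generic kernels implement applies saved per-channel statistics as y = gamma * (x - mean) / sqrt(var + eps) + beta. A hedged host-side sketch matching the reference the example validates against (layout simplified to a flat per-channel list; not the CK device-op interface):

```python
import math

def batchnorm_infer(x, mean, var, gamma, beta, eps=1e-5):
    """Apply saved mean/variance of one channel to that channel's values."""
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]
```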
  9. 13 Aug, 2022 7 commits
      fix build issue (#357) · 5ee30459
      Chao Liu authored
      * fix build
      
      * exclude example_gemm_max_xdl_fp16 from testing due to random failure on gfx908
      Layernorm welford (#346) · 0bd6b842
      rocking5566 authored
      
      
      * Add threadwise and blockwise welford
      
      * Rename gridwise op, prepare to add welford version
      
      * implement welford and integrate welford into layernorm
      
      * Take care of tail loop
      
      * Fix bug when ThreadSliceK > 1
      
      * Fix bug of merging of two empty set
      
      * Rename clip to clamp
      
      * 1. Fix type of count
      2. Remove useless static_assert
      
      * Do not inherit Reduction::Argument
      
      * [What] replace __syncthreads() with block_sync_lds()
      [Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0)
      
      * Add y stride
      
      * Rename.
      DeviceLayernorm -> DeviceLayernormImpl
      DeviceNormalization2 -> DeviceLayernorm
      
      * Move literal ""_uz & ""_zu into namespace 'literals'
      
      * Move namespace 'literals' as 'ck::literals'
      Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
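The threadwise/blockwise Welford scheme above keeps a (count, mean, M2) triple per thread and tree-merges the triples across the block; variance is M2/count at the end. A hedged sketch of the update and merge steps (the empty-set guard mirrors the "merging of two empty sets" fix):

```python
def welford_update(count, mean, m2, x):
    """Fold one value into a running (count, mean, M2) triple."""
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return count, mean, m2

def welford_merge(a, b):
    """Merge two partial (count, mean, M2) triples (Chan et al. pairwise form)."""
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    if n == 0:            # merging two empty sets must stay well-defined
        return 0, 0.0, 0.0
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2
```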
      Fused GEMM+GEMM (#351) · c20a75b0
      Anthony Chang authored
      
      
      * initial stub for gemm_gemm_xdl_cshuffle
      
      * set up example code
      
      * compiles
      
      * prevent integer overflow
      
      * harmonize interface between ref_gemm and ref_batched_gemm
      
      * batched_gemm_gemm
      
      * fix example
      
      * host tensor gen: diagonal pattern in lowest two dimensions only
      
      * make C descriptors contain only integral constants
      
      * clean up
      
      * add BlockwiseGemmXdlops_v2 while exploring a unified approach
      
      * implement proper interface
      
      * tidy up example
      
      * fix compilation warnings
      
      * coarsely controlled 2nd gemm padding
      
      * remove rocm-cmake's hard requirement for certain revision
      
      * clang-format
      
      * resolve merge conflict
      
      * fix compilation error on gfx10
      
      * adds acc0 elementwise op to interface
      
      * add gemm_gemm instances and tests
      
      * avoid LDS data hazard
      
      * fix build
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Skip lds of b matrix (#326) · 10b3278b
      ltqin authored
      * start
      
      * read for gridwise gemm
      
      * add MakeBGridDescriptor_K0_N0_N1_N2_N3_K1
      
      * add thread  copy desc and register buffer
      
      * add K0PerBlock dim
      
      * add read global data
      
      * finish gridwise gemm
      
      * finish blockwise gemm
      
      * add print data
      
      * add smallest config
      
      * add compare code for gridwise gemm
      
      * fix NXdlPerWave
      
      * fix k0perthread and gridwise gemm main loop
      
      * remove b matrix lds alloc
      
      * fix name
      
      * add test code
      
      * create b_grid_desc_k0_k1_k2_n0_n1_n2_n3_k3 from parameter
      
      * add double register
      
      * modify b_thread_desc_
      
      * add float
      
      * fp16 tag
      
      * add tail for pipeline
      
      * finish main loop
      
      * optimize main loop
      
      * start clear gridwise gemm
      
      * clear code
      
      * clear redundant code
      
      * change file name
      
      * change file name
      
      * fix bug after merge develop
      
      * fix input parameters
      
      * use MultiK0 to control the B load data loop
      
      * fix some config
      
      * 4 buffer
      
      * fix bug
      
      * one can use
      
      * change read order
      
      * change buffer array to tuple
      
      * change to 8 buffer
      
      * interleave buffer load
      
      * change to 16
      
      * read 8 buffer
      
      * add data buffer to template
      
      * fix after merge develop(head file)
      
      * format
      
      * change to 4 buffer
      
      * remove unnecessary lambda fun
      Add examples for reduction fp16/fp32/bp16/int8/fp64 for 3d/4d/5d (#342) · 14932e8d
      Qianfeng authored
      * Update the reduce_blockwise example to support user specified data type and input+reducing dimensions
      
      * Add examples for using reduce_multiblock_atomic_add
      
      * Add more running examples to the default command-line
      
      * Remove unnecessary header includes
      
      * Update to the example README.md
      Gemm multiple d multiple r (#335) · 6c3c06bf
      rocking5566 authored
      * Imitate XXX_gemm_multiple_d, add XXX_gemm_multiple_d_multiple_r for gemm + reduction
      
      * Implement run of kernel
      
      * Add example
      
      * Fix typo in parameter
      
      * Rewrite the reduceMax example
      
      * Rewrite the reduceMean + reduceMeanSquare example
      
      * Refine naming
      
      * Refine folder name
      
      * refine naming
      
      * Rewrite the gemm + bias + relu + add + layernorm example
      
      * Rewrite the gemm + layernorm example
      
      * clang-format
      
      * Fix bug in LDS sync
      
      * Fix compile error
      Fused attention (#345) · cac014f1
      Anthony Chang authored
      
      
      * initial stub for gemm_gemm_xdl_cshuffle
      
      * set up example code
      
      * compiles
      
      * prevent integer overflow
      
      * harmonize interface between ref_gemm and ref_batched_gemm
      
      * batched_gemm_gemm
      
      * fix example
      
      * host tensor gen: diagonal pattern in lowest two dimensions only
      
      * make C descriptors contain only integral constants
      
      * clean up
      
      * add BlockwiseGemmXdlops_v2 while exploring a unified approach
      
      * implement proper interface
      
      * tidy up example
      
      * fix compilation warnings
      
      * coarsely controlled 2nd gemm padding
      
      * remove rocm-cmake's hard requirement for certain revision
      
      * clang-format
      
      * resolve merge conflict
      
      * fix compilation error on gfx10
      
      * adds acc0 elementwise op to interface
      
      * attention host validation
      
      * add blockwise softmax v1
      
      * iteratively update softmax+gemm
      
      * transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum
      
      * add init method for easier debugging
      
      * do away with manual thread cluster calculation
      
      * generalize blockwise softmax interface
      
      * row-wise softmax sum & max
      
      * format
      
      * rename to DeviceBatchedGemmSoftmaxGemm
      
      * add gemm_softmax_gemm instances and tests
      
      * comment
      Co-authored-by: ltqin <letao.qin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
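The "iteratively update softmax+gemm" and row-wise max/sum commits above describe a blockwise softmax where tiles of a row arrive one at a time and the running max `m` and running sum `s` are rescaled whenever a new tile raises the max. A hedged host-side sketch of that bookkeeping (not the device code; `tile` is an illustrative parameter):

```python
import math

def online_softmax(row, tile=4):
    """Numerically stable softmax computed tile-by-tile with running max/sum."""
    m, s = float("-inf"), 0.0
    for lo in range(0, len(row), tile):
        t = row[lo:lo + tile]
        m_new = max(m, max(t))
        # rescale the old sum to the new max, then add this tile's contribution
        s = s * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in t)
        m = m_new
    return [math.exp(x - m) / s for x in row]
```

Transposing both xdl outputs (as the commit notes) keeps these per-row statistics in registers without a broadcast.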
  10. 12 Aug, 2022 3 commits
  11. 11 Aug, 2022 2 commits
      Add examples for GEMM + AddAddFastGelu (data type: int8, bf16, fp32) (#340) · 68b61504
      Po Yen Chen authored
      * Add always_false<> util to delay symbol resolution
      
      * Use always_false<> to prevent trying instantiate unwanted method
      
      * Add new specializations of AddAddFastGelu::operator() method
      
      * Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32
      
      * Use floating point literal to simplify code
      
      * Remove unnecessary capture in lambda expressions
      
      * Extract fast GeLU calculation as standalone method
      
      * Mark methods as 'constexpr'
      
      * Add constraint for HostTensorDescriptor templated ctors
      
      * Simplify HostTensorDescriptor ctor calls
      
      * Add C++23 std::size_t literal suffix
      
      * Use _uz suffix to shorten example code
      
      * Remove unnecessary conversion to std::array<>
      
      * Re-order include directives
      
      * Remove C-style casting by literal suffix
      
      * Remove unnecessary statements in main()
      
      * Remove unused type parameter of always_false<>
      
      * Remove unused include directive
      
      * Exit main() by returning meaningful value
      
      * Use 'if constexpr' to switch example flow
      
      * Use std::is_same_v<> to shorten example code
      
      * Add 'inline' specifier to literal functions
      
      * Unify output methods in example
      
      * Move common codes into .inc file
      
      * Add type check in type_convert<>()
      
      * Add type_convert<float>() before computation
      
      * Merge AddAddFastGelu method specializations
      
      * Remove always_false<>
      
      * Add constraint to AddAddFastGelu::operator() parameter types
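The fused epilogue these examples exercise computes e = FastGelu(c + d0 + d1). As a hedged sketch, the GELU below is the common tanh approximation; the exact polynomial CK's AddAddFastGelu uses may differ, so treat the formula as illustrative:

```python
import math

def fast_gelu(x):
    """Tanh-form GELU approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def add_add_fast_gelu(c, d0, d1):
    """The fused epilogue: two adds, then the fast GELU."""
    return fast_gelu(c + d0 + d1)
```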
      ckProfiler for layernorm (#330) · fdfd7eb5
      rocking5566 authored
      * Refine parameter
      
      * Add base class for layernorm
      
      * Add layernorm instance
      
      * Add layernorm to ckProfiler
      
      * Remove redundant
      
      * Add verification
      
      * Fix compile error due to merge
  12. 10 Aug, 2022 1 commit
      Add batched/grouped_gemm contraction deviceOps (#349) · e08d68d2
      zjing14 authored
      
      
      * convnd_fwd fp16 example
      
      * update example
      
      * update example
      
      * update instance
      
      * updating reference conv
      
      * update reference conv
      
      * update conv fwd profiler
      
      * update conv 1d and 3d instance
      
      * update include path
      
      * clean
      
      * update profiler for conv bwd data and weight
      
      * update conv bwd weight
      
      * clean
      
      * update conv example
      
      * update profiler for conv bwd weight
      
      * update ckprofiler for conv bwd data
      
      * fix reference conv bwd data bug; update conv bwd data test
      
      * update examples
      
      * fix initialization issue
      
      * update test for conv fwd
      
      * clean
      
      * clean
      
      * remove test case too sensitive to error threshold
      
      * fix test
      
      * clean
      
      * fix build
      
      * adding conv multiple d
      
      * adding conv multiple D
      
      * add matrix padder
      
      * add gemm padding to convnd
      
      * adding group conv
      
      * update gemm multi-d
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * clean
      
      * refactor
      
      * refactor
      
      * reorg
      
      * add ds
      
      * add bias
      
      * clean
      
      * add G
      
      * adding group
      
      * adding group
      
      * adding group
      
      * update Tensor
      
      * clean
      
      * update example
      
      * update DeviceGemmMultipleD_Xdl_CShuffle
      
      * update conv bwd-data and bwd-weight
      
      * update contraction example
      
      * update gemm and batch gemm with e permute
      
      * fix example build
      
      * instance for grouped conv1d
      
      * update example
      
      * adding group conv instance
      
      * update gemm bilinear instance
      
      * update gemm+add+add+fastgelu instance
      
      * update profiler
      
      * update profiler
      
      * update test
      
      * update test and client example
      
      * clean
      
      * add grouped conv into profiler
      
      * update profiler
      
      * clean
      
      * add test grouped conv, update all conv test to gtest
      
      * update test
      
      * change gemm_c_permute with contraction
      
      * add grouped_contraction
      
      * add contraction in group_gemm
      
      * add example of grouped_gemm with contraction
      
      * add example of grouped_contraction_bias_e_permute
      
      * clean
      
      * fixed ds
      
      * add m3n2 m2n3 examples into gemm_bias_e_permute
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  13. 07 Aug, 2022 1 commit
  14. 03 Aug, 2022 1 commit
  15. 02 Aug, 2022 1 commit
      CGEMM examples bf16, fp32, int8 (#332) · fb0dc358
      Adam Osewski authored
      
      
      * Add int8 specialization for elementwise Add and Subtract.
      
      * CGEMM examples bf16, fp32, int8
      
      * Add convert reference output to CDataType.
      
      * Skip BF16 data type during testing.
      
      * Lower K value to get rid of accumulation error.
      
      * Fix merge artifact.
      
      * Fix changed function name: GetElementSpaceSize()
      
      * Fix merge artifact.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
  16. 29 Jul, 2022 1 commit
      Clean up conv example, Instances, profiler and test (#324) · 500fa995
      Chao Liu authored
      * convnd_fwd fp16 example
      
      * update example
      
      * update example
      
      * update instance
      
      * updating reference conv
      
      * update reference conv
      
      * update conv fwd profiler
      
      * update conv 1d and 3d instance
      
      * update include path
      
      * clean
      
      * update profiler for conv bwd data and weight
      
      * update conv bwd weight
      
      * clean
      
      * update conv example
      
      * update profiler for conv bwd weight
      
      * update ckprofiler for conv bwd data
      
      * fix reference conv bwd data bug; update conv bwd data test
      
      * update examples
      
      * fix initialization issue
      
      * update test for conv fwd
      
      * clean
      
      * clean
      
      * remove test case too sensitive to error threshold
      
      * fix test
      
      * clean
      
      * fix build
      
      * adding conv multiple d
      
      * adding conv multiple D
      
      * add matrix padder
      
      * add gemm padding to convnd
      
      * adding group conv
      
      * update gemm multi-d
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * clean
      
      * refactor
      
      * refactor
      
      * reorg
      
      * add ds
      
      * add bias
      
      * clean
      
      * add G
      
      * adding group
      
      * adding group
      
      * adding group
      
      * update Tensor
      
      * clean
      
      * update example
      
      * update DeviceGemmMultipleD_Xdl_CShuffle
      
      * update conv bwd-data and bwd-weight
      
      * update contraction example
      
      * update gemm and batch gemm with e permute
      
      * fix example build
      
      * instance for grouped conv1d
      
      * update example
      
      * adding group conv instance
      
      * update gemm bilinear instance
      
      * update gemm+add+add+fastgelu instance
      
      * update profiler
      
      * update profiler
      
      * update test
      
      * update test and client example
      
      * clean
      
      * add grouped conv into profiler
      
      * update profiler
      
      * clean
      
      * add test grouped conv, update all conv test to gtest
      
      * update test
  17. 22 Jul, 2022 1 commit
  18. 21 Jul, 2022 1 commit
      Grouped Gemm device with multiD grid (#319) · 7959dad5
      zjing14 authored
      
      
      * replace gridwise_v2r3 with multiD
      
      * adjust parameters
      
      * add instances
      
      * fixed test_grouped_gemm
      
      * fix standalone softmax race condition around blockwise reduction
      
      * fixed ci
      
      * fixed comment: remove redundant workspace
      
      * use instanceFactory
      
      * add test layout
      
      * add empty Ds
      
      * add bias example
      
      * use array
      
      * separate examples
      Co-authored-by: Anthony Chang <ac.chang@outlook.com>
  19. 13 Jul, 2022 1 commit
      Standalone layernorm (#315) · 7f216620
      rocking5566 authored
      
      
      * Implement layernorm kernel and deviceOp
      
      * verify gpu kernel with host code
      
      * 1. Separate gamma and beta from affine
      2. Check if argument is valid
      
      * clean
      
      * Sync the naming
      
      * Support sweep once mode if we can put k dimension data inside one block
      
      * [What] Get length from upper length.
      [Why] if we get length directly, we may get length after padding.
      
      * We only use one block in K dimension.
      Hence, we can simplify the indexing of global R/W.
      
      * Use 1d descriptor for gamma and beta
      
      * Add accElementwiseOp
      
      * Extract layernorm host code
      
      * Support different YVectorDim in GridwiseLayernorm
      
      * Rename XSrcVectorDim to XYSrcVectorDim. Because we use same parameter in deviceOp
      
      * Gamma and beta can share the VGPR.
      
      * Add test for fp32 and fp16
      
      * Fix concurrency bug and add test case which may fail originally
      
      * Propagate NaN for layernorm
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
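The standalone layernorm this deviceOp implements normalizes each row over the K dimension, then applies gamma and beta. A hedged host-reference sketch of what the example validates (flat per-row lists, not the CK interface; a NaN in the input naturally propagates to the output, matching the NaN-propagation commit):

```python
import math

def layernorm_row(x, gamma, beta, eps=1e-5):
    """y = gamma * (x - mean) / sqrt(var + eps) + beta, per row."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    rstd = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * rstd + b for v, g, b in zip(x, gamma, beta)]
```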
  20. 08 Jul, 2022 1 commit
      GEMM pipeline v2 (#317) · 63914743
      Po Yen Chen authored
      
      
      * format
      
      * improving pipeline
      
      * fix typo
      
      * format
      
      * adding thread group
      
      * adding thread group
      
      * adding thread group
      
      * adding gemm pipeline
      
      * tweak
      
      * refactor
      
      * refactor
      
      * add missing type convert
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * fix build
      
      * refactor
      
      * format
      
      * clean up
      
      * use remove_cvref_t
      
      * clean
      
      * use pipeline_v2 for gemm kernel
      
      * Remove inconsistent indent
      
      * Fix compilation errors due to incomplete merge process
      
      * Add missing include directives
      
      * Fix compilation errors in currently unused files
      
      * Add license in newly added files
      
      * Re-format touched files by clang-format-10
      
      * Fix wrong template argument count of DeviceGemm<>
      
      * Use language construct to choose between types
      
      * Use language construct to choose GEMM example instance
      
      * Fix compilation error due to interface change
      
      * Re-use type alias to avoid duplication
      
      * Unify type alias usage in source file
      
      * Only use v2 pipeline in one gridwise GEMM type
      
      * Remove no-longer used include directives
      
      * Add static_assert() to check pipeline type requirements
      
      * Revert "Add static_assert() to check pipeline type requirements"
      
      This reverts commit f0985f0a132671a1caaea92810c9f30dcf062bde.
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: shaojiewang <wsjmessi@163.com>
  21. 07 Jul, 2022 1 commit
      N-D Tensor Contraction example, instance, and client example (#270) · 4fe9c393
      Chao Liu authored
      * adding contraction
      
      * add contraction example
      
      * update example
      
      * update example
      
      * format
      
      * update readme
      
      * clean header
      
      * clean header
      
      * contraction with multiple D
      
      * rename
      
      * fix naming issue; add instances for contraction+bilinear
      
      * change assumed virtual layout of contraction; add client example
      
      * update example
      
      * update
      
      * contraction+scale
      
      * use type_convert
      
      * rename
  22. 06 Jul, 2022 1 commit
  23. 02 Jul, 2022 1 commit
      Gemm+Bilinear (#316) · 9e4429f9
      Chao Liu authored
      * refactor
      
      * update example
      
      * update example
      
      * gemm bilinear
      
      * clean
      
      * update
  24. 01 Jul, 2022 1 commit
      Single-kernel GEMM + layernorm (#263) · 63fd5da6
      Anthony Chang authored
      
      
      * dump lds content in appropriate precision type
      
      * add squared add reduction op; allows sq sum
      
      * initial stub from regular gemm impl
      
      * layernorm example code & host verification
      
      * initial layernorm implementation
      
      * tidy up
      
      * make C0 precision type consistent with C
      
      * clang-tidy and additional comments
      
      * tighten up example code
      
      * account for extra flops/bytes from normalization
      
      * clang-format
      
      * c0 bias/beta/gamma now have their own precision type
      
      * AccElemOp for gemm outputs prior to feeding to layernorm
      
      * update workgroup mapping
      
      * rename kernel template param to reflect its dual use
      
      * use LDS mem pool for reduction workspace
      
      * change cshuffle precision type to f16; clean up
      
      * clang-format
      
      * correct naming
      
      * explicit cast
      
      * fully implemented gemm + bias + activation + add + norm
      
      * activation in correct order
      
      * reflect reduction API's recent change
      
      * amend
      
      * clean up; add comment
      
      * keep up with recent changes in reduction API
      
      * format
      
      * resolve merge conflicts
      Co-authored-by: Chao Liu <chao.liu2@amd.com>