- 03 Dec, 2023 1 commit
-
-
Bartlomiej Wroblewski authored
This PR introduces support for double buffering in LDS into GEMM kernels that use direct load instructions. Direct loads now use inline asm instead of intrinsics: the intrinsics cause the compiler to insert additional waitcnt instructions, which breaks the possible load/compute overlap when double buffering is used. Using inline asm in turn requires a sched_barrier to make sure the compiler cannot incorrectly reschedule instructions, since it does not know the data dependencies between the global->LDS and LDS->registers transfers.
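A minimal sketch of the double-buffered main loop this enables is shown below; `direct_load_to_lds` and `compute_on_tile` are hypothetical stand-ins for the CK machinery (a plain copy is used so the sketch compiles anywhere), and only `__builtin_amdgcn_sched_barrier` is a real compiler builtin (recent ROCm clang):

```cpp
#include <hip/hip_runtime.h>

// Assumes the kernel is launched with kTileElems threads per block.
constexpr int kTileElems = 256;

__device__ void direct_load_to_lds(float* lds_dst, const float* gmem_src)
{
    // Placeholder for the inline-asm global->LDS load described above.
    lds_dst[threadIdx.x] = gmem_src[threadIdx.x];
}

__device__ void compute_on_tile(const float* lds_tile, float& acc)
{
    acc += lds_tile[threadIdx.x]; // placeholder for the blockwise GEMM
}

__global__ void double_buffered_main_loop(const float* a, float* out, int num_tiles)
{
    __shared__ float lds[2][kTileElems];
    float acc = 0.f;

    direct_load_to_lds(lds[0], a); // prologue: prefetch tile 0
    for(int k = 1; k < num_tiles; ++k)
    {
        // Prefetch tile k into the idle LDS buffer while tile k-1 is consumed:
        // this is the load/compute overlap the inline-asm loads make possible.
        direct_load_to_lds(lds[k % 2], a + k * kTileElems);
#if defined(__HIP_DEVICE_COMPILE__)
        // The compiler cannot see the data dependency through the inline asm,
        // so a scheduling barrier keeps it from moving the LDS reads across
        // the asm-based load. Requires a recent ROCm clang.
        __builtin_amdgcn_sched_barrier(0);
#endif
        __syncthreads();
        compute_on_tile(lds[(k - 1) % 2], acc);
        __syncthreads();
    }
    compute_on_tile(lds[(num_tiles - 1) % 2], acc);
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```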
-
- 25 Nov, 2023 1 commit
-
-
Bartlomiej Wroblewski authored
* Add basic support for direct loads from global to LDS
* Clean the code and comments
* Add support for fp16
* Add comments
* Add check for thread cluster lengths
* Align non-direct-load fp16 example
* Small fixes
* Extend IsSupported to check for supported GPU gens
* Build examples only on the supported HW
* Do not throw when instance not supported in 04 example
* Review: Apply review suggestions
* Review: small fix
* Review: small fix
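A hedged sketch of the kind of GPU-generation guard the IsSupported extension refers to; the function name and the exact set of architectures are assumptions for illustration, not the CK API:

```cpp
#include <hip/hip_runtime.h>
#include <string>

// Reject architectures without direct global->LDS loads up front; the
// supported targets listed here are placeholders, not an authoritative list.
bool is_direct_load_supported(int device_id)
{
    hipDeviceProp_t prop;
    if(hipGetDeviceProperties(&prop, device_id) != hipSuccess)
        return false;
    const std::string arch(prop.gcnArchName);
    return arch.find("gfx90a") != std::string::npos ||
           arch.find("gfx94") != std::string::npos;
}
```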
-
- 12 Oct, 2023 1 commit
-
-
zjing14 authored
* simplified buffer_load/store
* add bfp8/fp8
* fixed
* fixed buffer_load
* fixed buffer_store

Co-authored-by: Jing Zhang <jizha@amd.com>
-
- 02 Oct, 2023 1 commit
-
-
Rostyslav Geyyer authored
* Add f8 bf8 gemm example
* Add element-wise ops
* Add intrinsics
* Update reference calculation
* Add an additional type option for xdlops gemm
* Fix build process
* Add bf8 to buffer addressing
* Update blockwise op, split typeA and typeB
* Update for compatibility
* Update naming from f8 to fp8
* Update naming
* Format
-
- 12 Sep, 2023 1 commit
-
-
Rostyslav Geyyer authored
* Refactor f8_t to add bf8_t
* Add check_err impl for f8_t
* Update fp8 test
* Format
* Revert the fix
* Update vector_type implementation
* Add bf8 test
* Add bf8, use BitInt types
* Add bf8 conversion methods
* Update type_convert for fp8/bf8
* Add check_err fp8/bf8 support
* Add subnorm fp8 tests
* Add subnorm bf8 tests
* Fix conversion
* Add bf8 cmake bindings
* Add macros to enable build with disabled fp8/bf8
* Remove is_native method
* Update flag combination for mixed precision instances
* Add more flag checks
* Add another flag to a client example
* Add type traits, decouple f8/bf8 casting
* Clean up
* Decouple fp8 and bf8 flags
* Remove more redundant flags
* Remove leftover comments
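For context, a minimal sketch of the tolerance-based comparison a check_err implementation for 8-bit float types typically needs; the signature and tolerances here are assumptions, not the CK utility:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Compare fp8/bf8 results against a reference with relative and absolute
// tolerances, since plain equality fails for 8-bit floating-point formats.
bool check_err(const std::vector<float>& out,
               const std::vector<float>& ref,
               float rtol = 1e-1f,  // loose placeholder tolerances
               float atol = 1e-1f)  // for 8-bit formats
{
    if(out.size() != ref.size())
        return false;
    for(std::size_t i = 0; i < out.size(); ++i)
    {
        const float err = std::fabs(out[i] - ref[i]);
        if(err > atol + rtol * std::fabs(ref[i]))
            return false;
    }
    return true;
}
```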
-
- 26 Jul, 2023 1 commit
-
-
carlushuang authored
* initial stream-k implementation with example
* fix unexpected change in err
* improve performance a little by reorganizing the pipeline
* improve perf a little by swizzling the block idx
* add profiler
* update example
* fix spelling
* shrink karg for streamk
* support dynamic buffer using memory coherence glc_slc bit from template
* control memory coherence while constructing dynamic buffer
* update reduction for streamk (not ready yet)
* Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting
* fix build issue
* fix several bugs
* now result is correct, everything works (but has scratch)
* remove scratch by manually resetting coordinate
* update device code
* fix a bug in final reduce
* fix something in example
* update async memset
* fix enum as camel case
* modify coherence enum name
* clean code and use atomic streamk by default
* remove unused var
* throw exception if given an empty pointer
* fix format
* fix CI warning
* fix type in init
* modify CI error
* filter out on gfx10+
* restore changed example code

Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
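A hedged sketch of the Stream-K idea behind this change: the flattened (output tile, k-iteration) space is split evenly across workgroups, and partial tile results are combined with atomics. All names, the 1x1 "tiles", and the single-thread-per-workgroup simplification are assumptions for illustration, not the CK kernel:

```cpp
#include <hip/hip_runtime.h>

__device__ long lmin(long a, long b) { return a < b ? a : b; }

// Sketch only: one thread per workgroup and 1x1 output "tiles" keep it short;
// C is assumed to be zero-initialized before launch.
__global__ void streamk_gemm_sketch(const float* A, const float* B, float* C,
                                    int M, int N, int K)
{
    if(threadIdx.x != 0)
        return;

    const long iters_per_tile = K;               // k-iterations per C element
    const long total_iters    = (long)M * N * K; // flattened (tile, k) space
    const long num_wg         = gridDim.x;
    const long per_wg         = (total_iters + num_wg - 1) / num_wg;
    const long iter_beg       = blockIdx.x * per_wg;
    const long iter_end       = lmin(iter_beg + per_wg, total_iters);

    long iter = iter_beg;
    while(iter < iter_end)
    {
        const long tile  = iter / iters_per_tile; // which C element
        const long k_beg = iter % iters_per_tile; // first k-iteration we own
        const long k_end = lmin(iters_per_tile, k_beg + (iter_end - iter));
        const int  m     = (int)(tile / N);
        const int  n     = (int)(tile % N);

        float partial = 0.f;
        for(long k = k_beg; k < k_end; ++k)
            partial += A[m * K + k] * B[k * N + n];

        // Different workgroups may own pieces of the same tile, so partial
        // results are combined atomically (the "atomic Stream-K" default).
        atomicAdd(&C[m * N + n], partial);

        iter += k_end - k_beg;
    }
}
```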
-
- 05 Jul, 2023 1 commit
-
-
Rostyslav Geyyer authored
* Add fp8 xdl gemm
* Add example
* Use int8 intrinsics for buffer load/store
* Format
* Update cmakelists
-
- 08 Jun, 2023 1 commit
-
-
carlushuang authored
-
- 31 May, 2023 1 commit
-
-
Illia Silin authored
-
- 09 Mar, 2023 1 commit
-
-
carlushuang authored
Co-authored-by: zjing14 <zhangjing14@gmail.com>
-
- 09 Sep, 2022 1 commit
-
-
carlushuang authored
* add gridwise/device sparse embedding
* update code
* update code
* remove useless makefile
* code fix
* workable
* work properly
* emb add
* add more instances
* format
* remove useless code
* fix format
* fix clang-tidy
* clean
* fix a compile error

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
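A hedged sketch of the sparse embedding-add idea (gather rows of an embedding table by index and sum them); names and layout are assumptions, not the CK device op:

```cpp
#include <hip/hip_runtime.h>

// out[d] = sum over i of emb[index[i]][d]; one thread per output element.
__global__ void sparse_embedding_add(const float* emb,   // [num_rows, dim]
                                     const int*   index, // [num_index]
                                     float*       out,   // [dim]
                                     int num_index,
                                     int dim)
{
    const int d = blockIdx.x * blockDim.x + threadIdx.x;
    if(d >= dim)
        return;

    float acc = 0.f;
    for(int i = 0; i < num_index; ++i)
        acc += emb[static_cast<long>(index[i]) * dim + d];
    out[d] = acc;
}
```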
-
- 25 Jun, 2022 1 commit
-
-
Chao Liu authored
-
- 19 Jun, 2022 1 commit
-
-
Chao Liu authored
* add gelu and fast_gelu
* added GeLU and fast GeLU
* clean up
* add gemm+fastgelu example
* add gemm+gelu instances
* update profiler
* clean up
* clean up
* adding gemm+bias+activation
* clean
* adding bias
* clean
* adding gemm multiple d
* debugging
* add gemm bias add fastgelu
* rename, clean
* refactoring; add readme
* refactor
* refactor
* refactor
* refactor
* refactor
* refactor
* fix
* fix
* update example
* update example
* rename
* update example
* add ckProfiler
* clean
* clean
* clean
* clean
* add comment
* use type_convert
* clean
* clean element wise op
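For reference, a minimal sketch of a fast-GeLU elementwise functor of the kind these instances fuse into the GEMM epilogue; the tanh-approximation constants are standard, but the functor shape is illustrative, not the CK definition:

```cpp
#include <cmath>

// Elementwise op applied to each GEMM output element: y = FastGelu(x).
struct FastGelu
{
    // tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    float operator()(float x) const
    {
        const float c = 0.7978845608028654f; // sqrt(2/pi)
        return 0.5f * x * (1.f + std::tanh(c * (x + 0.044715f * x * x * x)));
    }
};

// In a fused epilogue this would be applied as e.g. c = FastGelu{}(acc + bias).
```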
-
- 20 May, 2022 1 commit
-
-
rocking5566 authored
* [What] Rename the example [Why] Prepare to add unary reduction
* Add global operation to the parameter
* Add atomicmax
* Fix compile error
* Support atomicMax (hip library)
* Rename the reduction example
* Fix target name
* use p_d1_grid as the indicator directly
* Prevent performance issue. Let passthrough handle it.
* Implement the function template that specializes float2
* No need to separate into two lines
* Remove empty line
* add comment
* Fix compile error due to merge from develop
* make the implementation of atomic_max / atomic_add explicit for each datatype
* Refine typo
* For future CI test
* Fix compiler error in ckProfiler
* Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe'
* simply use remove_pointer
* Rename type and var
* Refine example
* Modify reducemax example
* Fix bug in reduction
* Change initialize range
* Implement F64 version of atomicMax
* Move reduction code together
* Add buffer atomic_max
* Fix coding style by clang-format
* Integrate new api of DeviceGemmReduce_Xdl_CShuffle
* Integrate Batch gemm reduction
* Fix example
* fix example
* clean up
* Fix batch gemm tensor operation
* Fix coding style
* Fix template argument
* Fix clang format
* Keep flexibility of different strides for each D tensor
* Fix compile error for ckProfiler
* Fix typo
* [What] Fix naming [Why] Prepare to add out elementop
* Add DoutElementOp

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: rocking <chunylai@amd.com>
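A common way to implement atomicMax for float when only integer compare-and-swap is available is a CAS loop over the bit pattern; a hedged sketch (not the CK implementation itself):

```cpp
#include <hip/hip_runtime.h>

// atomicMax for float built on atomicCAS over the 32-bit bit pattern.
// Comparison is done as float, so negative values are handled correctly;
// NaN handling is omitted for brevity.
__device__ float atomic_max_f32(float* addr, float value)
{
    unsigned int* addr_as_u32 = reinterpret_cast<unsigned int*>(addr);
    unsigned int  old         = *addr_as_u32;
    unsigned int  assumed;
    do
    {
        assumed = old;
        const float current = __uint_as_float(assumed);
        if(current >= value)
            break; // already at least as large, nothing to do
        old = atomicCAS(addr_as_u32, assumed, __float_as_uint(value));
    } while(old != assumed);
    return __uint_as_float(old);
}
```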
-
- 31 Mar, 2022 1 commit
-
-
Chao Liu authored
* adding compilation for multiple targets
* fix build
* clean
* update Jenkinsfile
* update readme
* update Jenkins
* use ck::half_t instead of ushort for bf16
* rename enum classes
* clean
* rename
* clean
-
- 22 Mar, 2022 1 commit
-
-
Qianfeng authored
* Use thread cluster descriptor and explicit M_K 2d descriptor to simplify Blockwise Reduction
* Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter
* Rename the folder name for the pool2d and reduce examples
* Update to reduction test scripts
* Add Readme for pool2d_fwd and reduce_blockwise examples
* Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX)
* Tiny fix in reduce profiler and tiny update in reduce testing scripts
* Tiny fix in testing script profile_reduce_no_index.sh
* Tiny fix in testing script profile_reduce_no_index.sh
* Add support for bfp16 reduction (using bhalf_t = ushort)
* Tiny fix in amd_buffer_addressing.hpp
* Tiny change in script/profile_reduce_with_index.sh
* Use AccDataType for Beta value and use element_wise::PassThrough
* Use type_convert for type converting in host layer reduction
* Renaming and refining in Reduction profiler/device layer/examples
* Renaming and refining in Reduction profiler/device layer/examples
* Renaming all NumReduceDims to NumReduceDim
* Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2
* Update to testing scripts to add bf16 support
* added more static_assert
* Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp
* Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations
* minor change
* Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass
* Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp
* Tiny fix in script/profile_reduce_no_index.sh
* Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims
* Generic renaming in host reduction and DeviceReduce layer
* Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances
* Use multi-thread and simplification for host Reduction implementation
* Add ctest for reduction
* Update to clarify the using of data init method in produce_reduce/example_reduce/test_reduce
* Update to the reduce CTest executables to enable default testing behavior when no command argument is given
* Renaming

Co-authored-by: Jianfeng Yan <jfyan008@gmail.com>
-
- 09 Mar, 2022 1 commit
-
-
Chao Liu authored
* delete obsolete files
* move files
* build
* update cmake
* update cmake
* fix build
* reorg examples
* update cmake for example and test
-
- 04 Mar, 2022 1 commit
-
-
rocking5566 authored
* Add int8 of mk_nk_mn to the ckProfiler
* Add example of int8 gemm
* Fix typo, use ushort instead of half_t for bfloat16
* replace ushortXXX_t to bhalfXXX_t
* rename ushort to bhalf_t
* Add bf16 example
* Add bf16 gemm to ckProfiler
* Fix alignment
* Fix typo
* Add unit test for gemm_xdl int8
* Add gemm_xdl fp32 unit test
* Add gemm_xdl bf16 unit test
* fix build
* fix build issue due to merge conflict
* Fix build
* Fix build error

Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
- 23 Feb, 2022 1 commit
-
-
Jianfeng Yan authored
* conv3d compiles but has memory error
* conv3d works
* fix performance issue by using __builtin_amdgcn_readfirstlane
* change MakeBlock2CTileMap to MakeDefaultBlock2CTileMap; change c_blockid_to* to cblockid_to*
* clang-format
* remove CK_EXPERIMENTAL_PASS_TENSOR_DECRIPTOR_BY_*; moved wrapper into DeviceConv3d
* format
* remove useless macro
* add comment

Co-authored-by: Chao Liu <chao.liu2@amd.com>
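The readfirstlane hint mentioned above forces a value the compiler cannot prove wave-uniform into a scalar register; a hedged sketch of how such a hint is typically applied (helper and kernel names are assumptions):

```cpp
#include <hip/hip_runtime.h>

// Tell the compiler that `v` is wave-uniform by reading lane 0's copy.
// Keeping block-to-tile indices in scalar registers avoids redundant
// per-lane address arithmetic, which was the performance issue described.
__device__ inline int make_wave_uniform(int v)
{
    return __builtin_amdgcn_readfirstlane(v);
}

__global__ void example_kernel(const int* block_to_tile_map, float* out)
{
    // blockIdx.x is uniform, but after a table lookup the compiler may no
    // longer treat the result as uniform without the hint.
    const int tile = make_wave_uniform(block_to_tile_map[blockIdx.x]);
    out[blockIdx.x * blockDim.x + threadIdx.x] = static_cast<float>(tile);
}
```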
-
- 26 Dec, 2021 1 commit
-
-
Chao Liu authored
* fix relu
* clean up
* clean up
* adding 1x1 conv
* adding 1x1 conv
* added 1x1 conv
* refactor
* refactor
* refactor
* added profiler for conv+bias+relu+add
* clean up
* adding conv+bias+relu
* adding conv+bias+relu
* added conv+bias+relu
* Update README.md
* update cpu verification
* adding c shuffle
* update static_tensor for dealing with invalid element
* adding c shuffle
* debugging
* fix bug
* convert to fp16 before shuffle
* shuffle more than one M/NRepeat
* clean up
* remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1
* clean up
* remove coordinate step hack from all gridwise gemm xdl
* clean up coordinate step hack
* clean up coordinate step hack
* ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst
* adding output shuffle in conv+bias+relu+add
* update
* added conv+bias+relu+add with c shuffle
* added conv+bias+relu+add with c shuffle
* fix forward_sweep bugs in threadwise copy
* clean up
* refactor
* clean up
* clean up
* added conv_c_shuffle+bias_relu
* clean up
* added conv+bias+relu+atomic_add
* clean up
* clean up
* clean up
* clean up
* clean up
* clean up
* misc fixes; add 1x1 specialization
* clean up
* delete unused device op
* clean up
* add support for odd C value
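A minimal sketch of the kind of fused output elementwise op these kernels apply during the C-shuffle stage, e.g. conv + bias + relu + add; the functor shape is illustrative, not the CK operator:

```cpp
// Applied per output element after the convolution accumulator is computed:
// y = relu(acc + bias) + residual, fused into the epilogue instead of
// running separate kernels for bias, activation, and residual add.
struct AddBiasReluAdd
{
    float operator()(float acc, float bias, float residual) const
    {
        const float x = acc + bias;
        return (x > 0.f ? x : 0.f) + residual;
    }
};
```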
-
- 18 Nov, 2021 2 commits
-
-
Chao Liu authored
* reworking vector_type
* use __builtin_memcpy for bit_cast and vector access of scalar pointer
* clean up
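For context, a minimal sketch of a memcpy-based bit_cast, the standard way to reinterpret bits without violating strict aliasing; the CK version may differ in constraints and constexpr support:

```cpp
#include <cstring>
#include <type_traits>

// Reinterpret the object representation of `src` as type Y. __builtin_memcpy
// (or std::memcpy) is fully optimized away, but unlike reinterpret_cast it
// does not break strict-aliasing rules.
template <typename Y, typename X>
Y bit_cast(const X& src)
{
    static_assert(sizeof(X) == sizeof(Y), "bit_cast requires equal sizes");
    static_assert(std::is_trivially_copyable<X>::value &&
                      std::is_trivially_copyable<Y>::value,
                  "bit_cast requires trivially copyable types");
    Y dst;
    __builtin_memcpy(&dst, &src, sizeof(Y));
    return dst;
}
```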
-
zjing14 authored
* init
* refactor for 1x1
* rename e0_e1
* add e1 with bugs
* debug
* fixed
* fixed e1
* add timer
* improve threadwise gemm with dot2
* add e2
* tuning
* separate c2
* add nhwc
* restore nchwc
* clean
* opt
* fixed; tuning
* add BGlobalMoveSliceWindowStepHacks{}
* tuning
* repeat running
* adjust
* merge v5r1 nchwc
* add adaptors
* split k0 k1 in c_thread_grid
* split h and w
* remove v5r1 nhwc
* clean for pr
* remove host_conv_add
* clean code
* clean
* add dynamic support
* static mode
* test static
* add conv+add fusion
* fixed validation
* naming fix
* use activ_enum
* make static
* refactor conv_add for InMem::add
* add bias
* add conv_out
* add configurable makeddesc
* add maxpool fusion
* add maxpool host for validation
* enable static desc
* conv-only use v5r1_add
* test
* test
* for binary dumps
* fixed incorrect results due to typo
* clean
* debugging maxpool
* workaround with offset trick
* clean code
* modularize ops of fusion
* add gridwise_gemm_v3
* create separate fusion fun
* enable dynamic mode of conv and conv+resize_add
* add dynamic mode of maxpool
* add pass by point
* add activ_type as arguments
* merge develop
* clean
* reset config to old default

Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
- 15 Nov, 2021 1 commit
-
-
zjing14 authored
* init StaticBufferV2
* clean
* adopt old output stage for StaticBufferV2
* clean
* remove hack
* clean
* clean
* add parameters
* clean code
* move c_buffer alloc into blockwise gemm
* add adaptors for m/n_thread_data_on_grid
* tweak gemm
* adjust blockwise_gemm_xdlops
* tweak
* update conv
* update script
* adding bwd 1x1
* update script
* adding 1x1 bwd
* debugging bwd 1x1 failure
* update script
* update script
* test
* test v100
* add bf16_1k
* clang-format
* clean
* add bfp16 for gfx908
* add verification
* clean up
* clean code
* restore bfl16
* clean
* add bfp16 support into gemm_driver
* apply new generator to other drivers
* add int8 support
* clean
* clean
* clean
* clean

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: root <root@hayabusa6111.amd.com>
-
- 27 Aug, 2021 2 commits
-
-
Chao Liu authored
* use cast_pointer_to_generic_address_space() in v6r1 kernel wrapper; DynamicBuffer and buffer_load take a customized invalid-element value; add buffer_load/store for fp64
* use remove_cvref_t
-
Qianfeng authored
* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
* make inner product compatible on gfx900
* Update src/include/miopen/solver/ck_utility_common.hpp
* compiler parameter use stream
* use int instead of index_t in kernel wrapper
* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
* Add dynamic generic reduction kernel layer (kernel wrappers, kernel implementations and utilities)
* Some updates to dynamic composable kernel facility for the need of dynamic generic reduction
* Update to generic reduction C++ host interface layer to support dynamic generic reduction
* Update to remove tidy complaints in host interface layer
* Change the unary operator form from void op(T &x) to T op(T x)
* Update to pass single workspace pointer for all kernels (fix for OpenCL backend)
* Use cppcheck-suppress to prevent some strange warnings
* Re-use operator [] and () for DynamicBuffer and update to depending codes
* Remove useless codes in first call threadwise/warpwise/blockwise kernel wrappers
* [performance] Remove un-needed local buffer initialization

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: JD <Jehandad.Khan@amd.com>
-
- 25 Aug, 2021 1 commit
-
-
zjing14 authored
* add f32/i32 atomicAdd support into dynamicBuffer, and enable it in v1r3
* fixed
* fixed
* update comment

Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
- 19 Aug, 2021 1 commit
-
-
Chao Liu authored
* Squashed 'src/composable_kernel/' content from commit f6edda61 (git-subtree-dir: src/composable_kernel, git-subtree-split: f6edda61)
* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
* Squashed 'src/composable_kernel/' changes from f6edda61..5781adf5 (git-subtree-dir: src/composable_kernel, git-subtree-split: 5781adf5):
  5781adf5 Update develop (#5) (#6)
  97e6d514 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
  7b1ec41e refactor
  49c33aae refactor
  54b3e73d rename
* fix
* refactor
* remove online compilation from CK
* refactor
* fix
* add ctest
* add c-style pointer cast
* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
* fix clang warning suppression
* tidy
* suppress cppcheck
* fix enum issue
* revert changes to hip build
* fix kernel filename
* update CK build script
* rename
* rename
* make inner product compatible on gfx900
* Update src/include/miopen/solver/ck_utility_common.hpp (Co-authored-by: JD <Jehandad.Khan@amd.com>)
* compiler parameter use stream
* use int instead of index_t in kernel wrapper
* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
* refactor
* refactor
* change cmakelist
* change ck common utility
* fix

Co-authored-by: JD <Jehandad.Khan@amd.com>
-
- 13 Aug, 2021 1 commit
-
-
Chao Liu authored
-
- 09 Aug, 2021 1 commit
-
-
Chao Liu authored
-
- 27 Jul, 2021 1 commit
-
-
Chao Liu authored
* update online kernel wrapper: bundle all descriptors in a tuple
* change __CONSTANT__ to CONSTANT
* rename
* adding tuning
* added IsValidCompileParameter
* reorganize
* adding tunable for fp16 and int8
* fix kernel compile warning and bug fixes
* suppress warning about cast of CONSTANT (address space 4) pointer
* fix building issue
-
- 01 Jul, 2021 1 commit
-
-
zjing14 authored
* create files for xdlops
* working on blockwise_gemm_xdlops
* add KReduction
* add m/n repeats
* add 2x2 pipeline
* added 128x128 wavegemm
* use StaticBuffer of vector_type
* break vector type to blk_size
* add kpack into xdlops_gemm and blockwise_gemm
* abroadcast only
* add fp32 mfma instructions
* adding fp16 mfma
* pack half4_t
* rename kperwave to kpack
* add 32x32x8fp16
* add fp16 mfma
* clean code
* clean code
* V4r4 xdlops kpack (#35)
* add kpack with incorrect results
* bug fix for make_dynamic_naive_tensor_descriptor_aligned_v2
* add 1x1 kernel
* add gridwise_gemm_v2 - single_buffer
* enabled dwordx4 for fp16 (Co-authored-by: Chao Liu <chao.liu2@amd.com>)
* refactor fwd-v4r4-xdlops
* add v4r4-nhwc-xdlops
* improve some perf of nhwc and nchw by tuning parameters, and change scheduling in gridwise-gemm loop
* tweak scheduling in gridwise gemm
* add v4r3 with a single output copy
* init commit: output with slice win
* adding sliceWin
* add multiple repeats pattern
* starting adding bwd-v4r1-xdlops
* use tuple as SrcBuffer
* adding bwd-data v4r1 nhwc xdlops
* fix bug in make_dynamic_naive_tensor_descriptor_aligned_v2()
* fix bug in host bwd-data conv
* initial implementation of bwd-data v4r1 nhwc xdlops
* add launch bound flags
* enable launch bound
* add m/nrepeat=4
* tweak bwd-data v4r1 nhwc xdlops
* added bwd-data v4r1 nhwc xdlops with output A and weight B
* add fwd-v4r4 nhwc xdlops, A input, B weight, C output

Co-authored-by: Chao Liu <chao.liu2@amd.com>
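The 32x32x8 fp16 MFMA mentioned above is a per-wavefront matrix multiply-accumulate; a hedged sketch of its raw use (the vector typedefs are assumptions, and the builtin is only available on MI100/MI200-class targets):

```cpp
#include <hip/hip_runtime.h>

typedef _Float16 half4_t   __attribute__((ext_vector_type(4)));
typedef float    float16_t __attribute__((ext_vector_type(16)));

#if defined(__gfx908__) || defined(__gfx90a__)
// Each wavefront multiplies a 32x8 fp16 A-fragment by an 8x32 fp16 B-fragment
// and accumulates into a 32x32 fp32 C-tile spread across the wave's VGPRs.
__device__ float16_t wave_gemm_32x32x8_fp16(half4_t a, half4_t b, float16_t c)
{
    return __builtin_amdgcn_mfma_f32_32x32x8f16(a, b, c, 0, 0, 0);
}
#endif
```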
-
- 12 May, 2021 1 commit
-
-
Chao Liu authored
* Use DynamicBuffer to hold raw pointer (to global and LDS memory)
* add workaround for compiler issue (inefficient ISA) of ds_write for int8x4, int8x8, int8x16
-
- 13 Apr, 2021 1 commit
-
-
Chao Liu authored
* overhaul vector_type: make int8x4_t a real vector type instead of an alias of int32_t
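A minimal sketch of the difference; the typedef mirrors the commit wording, the rest is illustrative:

```cpp
#include <cstdint>

// As a real Clang extended vector, int8x4_t supports element access and
// element-wise arithmetic directly, instead of requiring manual masking and
// shifting on an int32_t that merely aliases the four bytes.
typedef int8_t int8x4_t __attribute__((ext_vector_type(4)));

int8x4_t add_int8x4(int8x4_t a, int8x4_t b)
{
    return a + b; // per-element add; no bit tricks needed
}
```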
-
- 07 Apr, 2021 1 commit
-
-
zjing14 authored
* Hybrid direct + implicit GEMM forward convolution NCHWc v5r1. Input tensor bypasses LDS. Supports fp32/fp16/int8.
-
- 25 Mar, 2021 1 commit
-
-
Chao Liu authored
* support dynamic tensor descriptor
* use buffer load OOB feature for padding case
* add navi support
* add int8x4 inference kernel

Co-authored-by: Chao Liu <chao@ixt-rack-81.local.lan>
Co-authored-by: Jing Zhang <jizhan@amd.com>
-