Commits · 78b987fbd6a7897ee9827187a231441794b13490 · yangql / composable_kernel-1

12 May, 2021 1 commit

Use DynamicBuffer instead of raw pointer (#32) · 78b987fb

Chao Liu authored May 12, 2021

* Use DynamicBuffer to hold raw pointer (to global and LDS memory)

* add workaround for compiler issue (inefficient ISA) of ds_write for int8x4, int8x8, int8x16

78b987fb

11 May, 2021 1 commit

No raw index calculation (#31) · 01055d95

Chao Liu authored May 11, 2021



* Replace most raw index calculations with coordinate transformations
* Overhaul blockwise and threadwise GEMM
* Overhaul driver for gridwise GEMM kernel

Co-authored-by: Jing Zhang <jizhan@amd.com>

01055d95

28 Apr, 2021 1 commit
- Use Tuple and vector_type instead of Array for holding tensor data (#30) · d075adf1
  Chao Liu authored Apr 28, 2021
```
* replacing array with tuple and vector for tensor data
```
  d075adf1
13 Apr, 2021 2 commits
- Overhaul vector_type and use real vector for int8x4_t instead of aliasing from int32_t (#29) · e4790c25
  Chao Liu authored Apr 12, 2021
```
* overhaul vector_type, make int8x4_t real vector instead of aliasing from int32_t
```
  e4790c25
- Initial implementation of magic number division and "Merge" transformation that use it (#28) · 3bf52e60
  Chao Liu authored Apr 12, 2021
```
* initial implementation for magic number division and DynamicMerge_v2_magic_division that uses it

* turn off DynamicMerge_v2_magic_division (which uses magic number division) by default
```
  3bf52e60
07 Apr, 2021 1 commit
- Hybrid direct + implicit GEMM forward convolution NCHWc v5r1 (#25) · 792a20fa
  zjing14 authored Apr 07, 2021
```
* Hybrid direct + implicit GEMM forward convolution NCHWc v5r1. Input tensor bypasses LDS. Supports fp32/fp16/int8
```
  792a20fa
06 Apr, 2021 2 commits
- Fix performance issue when passing tensor descriptor from host to kernel by void pointers (#27) · d2217f30
  Chao Liu authored Apr 06, 2021
```
* use address_space(4) in kernel signature to fix performance issue when passing tensor descriptor from host to kernel by (void) pointers

* remove passing by pointer* option (only use pass by value or void*)
```
  d2217f30
- bug fix for buffer resource setting (#26) · 6a5ea493
  zjing14 authored Apr 06, 2021
  
  6a5ea493
25 Mar, 2021 1 commit

Dynamic tensor descriptor (#24) · fcbb9788

Chao Liu authored Mar 25, 2021



* support dynamic tensor descriptor

* use buffer load OOB feature for padding case

* add navi support

* add int8x4 inference kernel

Co-authored-by: Chao Liu <chao@ixt-rack-81.local.lan>
Co-authored-by: Jing Zhang <jizhan@amd.com>

fcbb9788

06 Aug, 2020 1 commit

Bwd Data NHWC (#22) · bbcb67d0

Chao Liu authored Aug 06, 2020

* fix buffer_store bug
* remove obsolete kernels
* add bwd-data-v5r1-nhwc

bbcb67d0

29 Jul, 2020 1 commit

Improve buffer address for out of bound check (#21) · ac62d13e

Chao Liu authored Jul 29, 2020

* Use buffer load built-in OOB check; buffer size is limited to 2 GB.
* buffer APIs use combined wave and thread offset
* use uint32_t for addr shift in buffer addressing

ac62d13e

24 Jun, 2020 1 commit

Code clean up (#20) · 5c7cec11

Chao Liu authored Jun 23, 2020



* tuning parameters

* testing on v100

* add fp16

* remove deprecated tensor descriptor

* sync with MIOpen

* update build script

Co-authored-by: Jing Zhang <jizhan@amd.com>

5c7cec11

17 Feb, 2020 1 commit
- MIOpen integration (#13) · 1a66e35b
  Chao Liu authored Feb 17, 2020
```
* update for MIOpen integration: cosmetic refactor
```
  1a66e35b
27 Jan, 2020 1 commit
- Update for recent MIOpen integration (#11) · 3406a114
  Chao Liu authored Jan 27, 2020
```
* update for MIOpen integration
```
  3406a114
20 Jan, 2020 1 commit

Added bwd data v3r1 v4r1, tweaking v1 (#10) · c5da0377

Chao Liu authored Jan 20, 2020

* Added bwd data v3r1: breaking down compute into a series of load balanced GEMM, and launch in a single kernel
* Added bwd data v4r1: like v3r1, but launch GEMMs in multiple kernels
* Tweaked v1r1 and v1r2 (atomic) on AMD GPU

c5da0377

03 Dec, 2019 1 commit

backward data (#7) · 8f5f6496

Chao Liu authored Dec 03, 2019

* enabled atomic add in tensor copy
* added gridwise GEMM
* added backward data conv using GEMM + atomic
* added backward data conv using GEMM, no atomic

8f5f6496

04 Nov, 2019 2 commits
- remove dead file (#6) · 31ded4ac
  Chao Liu authored Nov 04, 2019
  
  31ded4ac
- MIOpen integration: recent bug fixes from MIOpen (#5) · 562e1e27
  Chao Liu authored Nov 04, 2019
  
  562e1e27
11 Oct, 2019 1 commit
- Refactor for MIOpen integration (#4) · 52c3fe05
  Chao Liu authored Oct 11, 2019
```
Refactor so that multi-index transformation and padding support can be brought into MIOpen
```
  52c3fe05
27 Sep, 2019 4 commits
- tweaking · 012d3a07
  Chao Liu authored Sep 27, 2019
  
  012d3a07
- debugging · ebe38f3d
  Chao Liu authored Sep 27, 2019
  
  ebe38f3d
- remove dead code · 9b280cc5
  Chao Liu authored Sep 27, 2019
  
  9b280cc5
- nvidia build · 98a2cfcc
  Chao Liu authored Sep 27, 2019
  
  98a2cfcc
26 Sep, 2019 3 commits
- removing dependency on old tensor descriptor · 51a9fa1d
  Chao Liu authored Sep 26, 2019
  
  51a9fa1d
- added type conversion in threadwise and blockwise copy · b3d4595f
  Chao Liu authored Sep 25, 2019
  
  b3d4595f
- removing old implementation of tensor descriptor · 39d92e7d
  Chao Liu authored Sep 25, 2019
  
  39d92e7d
25 Sep, 2019 2 commits
- added GetLinearDimensionMask · e1ae8f18
  Chao Liu authored Sep 25, 2019
  
  e1ae8f18
- adding GetLinearDimensionMask() · 4f4aba48
  Chao Liu authored Sep 24, 2019
  
  4f4aba48
24 Sep, 2019 1 commit
- refactor · 545d9305
  Chao Liu authored Sep 24, 2019
  
  545d9305
22 Sep, 2019 3 commits
- nvidia build · 37f4e2b6
  Chao Liu authored Sep 22, 2019
  
  37f4e2b6
- done: explicitly separate offset component into compile-time, block-invariant... · 6c2c50b0
  Chao Liu authored Sep 22, 2019
```
done: explicitly separate offset component into compile-time, block-invariant and per-thread components. Experimenting
```
  6c2c50b0
- WIP: explicitly separate offset component into compile-time, block-invariant... · 51884fc2
  Chao Liu authored Sep 21, 2019
```
WIP: explicitly separate offset component into compile-time, block-invariant and per-thread components
```
  51884fc2
21 Sep, 2019 3 commits
- refactor · 740da00a
  Chao Liu authored Sep 20, 2019
  
  740da00a
- nvidia build · 184c6e7d
  Chao Liu authored Sep 20, 2019
  
  184c6e7d
- adding logic to judge linear dimension · f00c1381
  Chao Liu authored Sep 20, 2019
  
  f00c1381
20 Sep, 2019 1 commit
- refactor · bf7e7d62
  Chao Liu authored Sep 19, 2019
  
  bf7e7d62
19 Sep, 2019 1 commit
- use buffer_load buffer_store intrinsic · b6e1c52a
  Chao Liu authored Sep 19, 2019
  
  b6e1c52a
18 Sep, 2019 3 commits
- add global_load and buffer_load inline asm · 86cc678f
  Chao Liu authored Sep 18, 2019
  
  86cc678f
- experimenting global and buffer load/store · 5b7a18c5
  Chao Liu authored Sep 18, 2019
  
  5b7a18c5
- experimenting global and buffer load/store · c7a6545e
  Chao Liu authored Sep 18, 2019
  
  c7a6545e