Commits · 24ea1c41470d2f3a36d9a5b460cf09aa7b9b9d9d · gaoqiong / MIGraphX

17 Jun, 2022 2 commits
- Format · d97b3111
  Paul authored Jun 17, 2022
  
  d97b3111
- Fix failures when mlir is disabled · 6f768f82
  Paul authored Jun 17, 2022
  
  6f768f82
13 Jun, 2022 4 commits
- Format · 1770a342
  Paul authored Jun 13, 2022
  
  1770a342
- Correctly add module · aeb60bce
  Paul authored Jun 13, 2022
  
  aeb60bce
- Format · f75c5a38
  Paul authored Jun 12, 2022
  
  f75c5a38
- Add source locations · af09c35f
  Paul authored Jun 12, 2022
  
  af09c35f
10 Jun, 2022 1 commit

Add vectorized reduce (#1202) · aa7ff911

Paul Fultz II authored Jun 09, 2022



* Consolidate the vectorize and preload
* Add vectorization to reduction
Co-authored-by: kahmed10 <15948690+kahmed10@users.noreply.github.com>

aa7ff911

09 Jun, 2022 2 commits
- Format · 6b5c64ff
  Paul authored Jun 09, 2022
  
  6b5c64ff
- Move mlir compile to jit pipeline · 02b0095c
  Paul authored Jun 09, 2022
  
  02b0095c
07 Jun, 2022 1 commit

Prioritizing int8 over int8x4 when it is applicable (#1218) · 37c47504

Zhuoran Yin authored Jun 07, 2022



* prioritizing int8 over int8x4 when it is applicable
* Amend return to continue in apply loop
* Adding error handling in case int8x4 compilation failed
Co-authored-by: Paul Fultz II <pfultz2@yahoo.com>

37c47504

03 Jun, 2022 1 commit

Group code objects by kernel name in perf report summary (#1234) · 7271ddbc

Paul Fultz II authored Jun 02, 2022

Break up the gpu::code_object  print to show the actual kernels...

gpu::code_object::add_kernel: 0.646121ms, 5%
gpu::code_object::mul_kernel: 0.623822ms, 5%
gpu::code_object::add_mul_erf_add_mul_mul_kernel: 0.498902ms, 4%
gpu::code_object::mul_add_kernel: 0.478352ms, 4%
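
The grouping shown in the listing above can be sketched roughly as follows. This is only an illustration of the idea, not MIGraphX's implementation: the `summarize` helper, the `(name, milliseconds)` input pairs, and the sample timings are all hypothetical.

```python
from collections import defaultdict


def summarize(timings):
    """Group (kernel_name, milliseconds) pairs by kernel name.

    Returns (name, total_ms, percent_of_total) tuples, slowest first.
    """
    totals = defaultdict(float)
    for name, ms in timings:
        totals[name] += ms
    grand_total = sum(totals.values())
    rows = [
        (name, ms, 100.0 * ms / grand_total)
        for name, ms in totals.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical per-instruction timings. The kernel names follow the
# style of the report above, but the numbers are made up.
sample = [
    ("gpu::code_object::add_kernel", 0.25),
    ("gpu::code_object::mul_kernel", 1.5),
    ("gpu::code_object::add_kernel", 0.25),
]

for name, ms, pct in summarize(sample):
    print(f"{name}: {ms}ms, {pct:.0f}%")
```

Keying the totals on the full kernel name, rather than on the generic `gpu::code_object` operator, is what lets each fused kernel appear as its own line in the summary.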

7271ddbc

02 Jun, 2022 1 commit
- Fix dangling reference with gemm add fusion (#1233) · 1339ba35
  Paul Fultz II authored Jun 01, 2022
  
  1339ba35
26 May, 2022 1 commit
- Upgrade to cppcheck 2.8 and fix new issues found (#1225) · a401e72a
  Paul Fultz II authored May 26, 2022
```
* Upgrade to cppcheck 2.8
```
  a401e72a
25 May, 2022 2 commits
- Format · 79ffac9f
  Paul authored May 24, 2022
  
  79ffac9f
- Cleanup debug output · b7f31df5
  Paul authored May 24, 2022
  
  b7f31df5
24 May, 2022 5 commits
- Format · 9dcbd52b
  Paul authored May 24, 2022
  
  9dcbd52b
- Handle symmetrical padding · 4272fff1
  Paul authored May 24, 2022
  
  4272fff1
- Improve applicable batched gemms (#1214) · bf0a4713
  Paul Fultz II authored May 24, 2022
```
* Improve applicable batched gemms for bert
```
  bf0a4713
- Remove std references in runtime compilation (#1186) · 150d6d20
  Paul Fultz II authored May 24, 2022
```
Remove std references in runtime compilation since these are not available when using hiprtc and the headers may not be available on the system
```
  150d6d20
- Fuse gemm add with pointwise fusions (#1213) · a500620e
  Paul Fultz II authored May 24, 2022
```
* Fuse gemm add with pointwise fusions
```
  a500620e
20 May, 2022 1 commit

Rename pointwise ops (#1145) · 4a312201

kahmed10 authored May 20, 2022

Renames pointwise kernels so that their names are clearer when profiling. The new names follow the order of the ops being compiled. For example, add + relu becomes add_relu_kernel.
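
A minimal sketch of this naming scheme, assuming the name is simply the op names joined with underscores plus a `_kernel` suffix. The `kernel_name` helper is hypothetical and is not taken from the MIGraphX source.

```python
def kernel_name(op_names):
    """Join op names in compile order and append a `_kernel` suffix."""
    return "_".join(op_names) + "_kernel"


print(kernel_name(["add", "relu"]))  # add_relu_kernel
print(kernel_name(["add", "mul", "erf", "add", "mul", "mul"]))
# add_mul_erf_add_mul_mul_kernel
```

The second call reproduces a name of the same shape as the fused kernels that show up in the perf report summary.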

4a312201

18 May, 2022 3 commits
- Use func.return · 56a6b232
  Paul authored May 18, 2022
  
  56a6b232
- Format · 516779cb
  Paul authored May 18, 2022
  
  516779cb
- Use func dialect · a4d40fd0
  Paul authored May 18, 2022
  
  a4d40fd0
17 May, 2022 1 commit
- renamed variables for module from p to m (#1204) · a27dd28c
  shivadbhavsar authored May 17, 2022
```
Updated variable names according to #1193
```
  a27dd28c
11 May, 2022 1 commit

Prefuse layernorm for gpu (#1190) · 671f24be

Paul Fultz II authored May 11, 2022

Fuses layernorm and adds a triadd_layernorm fusion. This is a preparatory performance boost.

671f24be

09 May, 2022 1 commit

Refactor vectorization and preloading for pointwise fusions (#1184) · ddbbe54b

Paul Fultz II authored May 09, 2022

Improves performance for add_gelu. In bert it is 4x faster, and for mul_add it is 50% faster than what we currently have.

ddbbe54b

06 May, 2022 3 commits
- Format · 34a7c072
  Paul authored May 06, 2022
  
  34a7c072
- Update triple · 722d5f5c
  Paul authored May 06, 2022
  
  722d5f5c
- upgrade docker images to ROCm 5.0.2 (#1133) · f55d7c24
  Chris Austen authored May 06, 2022
```
Move CI containers to ROCm 5.0.2
Upgrade to 20.04
Free up some more file space in GitHub Actions environments
```
  f55d7c24
05 May, 2022 7 commits
- Cppcheck fixes (#1195) · d582425b
  Paul Fultz II authored May 05, 2022
```
Fixes the #error when using cppcheck. This no longer suppresses cppcheck errors when including those errors. This also fixes the cppcheck errors that were already there.
```
  d582425b
- Format · fd313588
  Paul authored May 05, 2022
  
  fd313588
- Add namespace · 9ff87ee1
  Paul authored May 05, 2022
  
  9ff87ee1
- Format · bf6cf5b0
  Paul authored May 05, 2022
  
  bf6cf5b0
- Whitelist operators · f7a59edb
  Paul authored May 05, 2022
  
  f7a59edb
- Format · 561456e7
  Paul authored May 05, 2022
  
  561456e7
- Fix compilation errors · 10f4486c
  Paul authored May 05, 2022
  
  10f4486c
29 Apr, 2022 1 commit
- Add GatherND operator (#1089) · 4ec35e5f
  turneram authored Apr 28, 2022
```
Add ref and gpu implementations for ONNX op GatherND

Resolves #1032
```
  4ec35e5f
27 Apr, 2022 1 commit

Add lane reduction (#1180) · 4c72cc95

Paul Fultz II authored Apr 27, 2022

With reductions such as {2048, 2, 1456} on axis 1, this is 23x faster than using our new block_reduce, and it's even over 100x faster than our original reduce_sum:

# lane
gpu::code_object[code_object=13736,symbol_name=kernel,global=2981888,local=1024,]: 0.0672928ms
# block
gpu::code_object[code_object=13800,symbol_name=kernel,global=39321600,local=64,]: 1.46072ms
# original
gpu::reduce_sum[axes={1}]: 6.73456ms

There is some basic logic to pick between lane and block reduce automatically.
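
As a quick check of the arithmetic, the ratios can be recomputed from the three timings listed above. The variable names are only for illustration. For this particular run the block-to-lane ratio comes out a little under the rounded figure quoted in the text. The last line computes the number of output elements for this reduction, which matches the `global` size reported for the lane kernel. Reading that as one work-item per output element is an assumption, not something the commit states.

```python
# Timings in milliseconds, copied from the listing above.
lane_ms = 0.0672928
block_ms = 1.46072
original_ms = 6.73456

block_over_lane = block_ms / lane_ms
original_over_lane = original_ms / lane_ms

print(f"block / lane:    {block_over_lane:.1f}x")
print(f"original / lane: {original_over_lane:.1f}x")

# Output element count for lens {2048, 2, 1456} reduced on axis 1:
# the reduced axis collapses to 1, leaving 2048 * 1456 outputs.
outputs = 2048 * 1456
print(outputs)  # 2981888
```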

4c72cc95

17 Apr, 2022 1 commit

Reduce with runtime compilation (#1150) · f9a5b81e

Paul Fultz II authored Apr 17, 2022

There is a significant improvement on larger tensors; with half, it is almost 50% faster:

lens: [1024, 384, 768]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms
gpu::reduce_sum[axes={2}]: 1.73126ms

Also for non-trivial layouts this can sometimes be over 2x faster:

lens: [64, 1024, 768, 4]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms
gpu::reduce_sum[axes={1}]: 2.63375ms

Of course, if the stride becomes larger, this speed improvement diminishes due to poor memory access patterns. Such kernels need a lane_reduce instead of a block_reduce. I plan to address that in a future PR.
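
To make the stride remark concrete, here is a small sketch that computes element strides for the two shapes above. It assumes standard packed row-major tensors, which the commit does not state, and `packed_strides` is a hypothetical helper rather than a MIGraphX function.

```python
def packed_strides(lens):
    """Row-major (packed) strides, in elements, for the given lengths."""
    strides = [1] * len(lens)
    for i in range(len(lens) - 2, -1, -1):
        strides[i] = strides[i + 1] * lens[i + 1]
    return strides


print(packed_strides([1024, 384, 768]))    # [294912, 768, 1]
print(packed_strides([64, 1024, 768, 4]))  # [3145728, 3072, 4, 1]
```

Under that assumption, reducing axis 2 of the first shape walks contiguous elements (stride 1), while reducing axis 1 of the second shape steps 3072 elements at a time, which is the kind of access pattern the paragraph above is concerned about.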

Finally, this also includes a MIGRAPHX_GPU_DUMP_ASM env variable which will print out the assembly when the kernel compiles.

f9a5b81e