Commits · f3aa2c67c9eadc12f2c2e53189aba40ae4509105 · gaoqiong / MIGraphX

26 Jun, 2022 1 commit
- Add layernorm post fusion · f3aa2c67
  Paul authored Jun 25, 2022
  
  f3aa2c67
25 Jun, 2022 3 commits
- Format · b973563b
  Paul authored Jun 25, 2022
  
  b973563b
- Move pointwise gen to a separate header · 5a140c90
  Paul authored Jun 25, 2022
  
  5a140c90
- Use jit for contiguous operator (#1217) · b75c83d8
  Paul Fultz II authored Jun 24, 2022
```
* Jit contiguous
```
  b75c83d8
22 Jun, 2022 1 commit
- Update license files (#1248) · e44cecbc
  Ted Themistokleous authored Jun 22, 2022
```
Updated each source file in the repo with the existing license.
```
  e44cecbc
10 Jun, 2022 1 commit

- Add vectorized reduce (#1202) · aa7ff911
  Paul Fultz II authored Jun 09, 2022

  * Consolidate the vectorize and preload
  * Add vectorization to reduction

  Co-authored-by: kahmed10 <15948690+kahmed10@users.noreply.github.com>

  aa7ff911

04 Jun, 2022 2 commits
- Fix div by zero issue · 187a4769
  Paul authored Jun 04, 2022
  
  187a4769
- Replace layernorm · a62ef598
  Paul authored Jun 04, 2022
  
  a62ef598
31 May, 2022 1 commit
- Add layernorm kernel · d94c54f0
  Paul authored May 31, 2022
  
  d94c54f0
25 May, 2022 2 commits
- Format · 49e1e618
  Paul authored May 25, 2022
  
  49e1e618
- Set kernel name · 85f22ffd
  Paul authored May 25, 2022
  
  85f22ffd
24 May, 2022 1 commit

- Remove std references in runtime compilation (#1186) · 150d6d20
  Paul Fultz II authored May 24, 2022

  Remove std references in runtime compilation, since these are not available when using hiprtc and the headers may not be available on the system.

  150d6d20

23 May, 2022 2 commits
- Format · 92324d57
  Paul authored May 23, 2022
  
  92324d57
- Vectorize softmax · 4d66f031
  Paul authored May 23, 2022
  
  4d66f031
20 May, 2022 1 commit

- Rename pointwise ops (#1145) · 4a312201
  kahmed10 authored May 20, 2022

  Renamed for clarity on the kernel names shown when profiling. The new names follow the order of the ops being compiled. For example: add + relu = add_relu_kernel.

  4a312201
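
The naming rule in the message above can be sketched in a few lines. This is an illustration only, not MIGraphX code: the function name `fused_kernel_name` is hypothetical, and it assumes the kernel name is simply the op names joined with underscores in compile order plus a `_kernel` suffix, as the `add_relu_kernel` example suggests.

```python
def fused_kernel_name(op_names):
    """Build a kernel name from op names listed in the order they are compiled.

    Hypothetical helper: assumes the name is the ops joined by "_" with a
    "_kernel" suffix, matching the add + relu = add_relu_kernel example.
    """
    if not op_names:
        raise ValueError("at least one op name is required")
    return "_".join(op_names) + "_kernel"


print(fused_kernel_name(["add", "relu"]))
```

With a rule like this, the order of the ops is visible directly in the profiler output, so `add` followed by `relu` and `relu` followed by `add` would show up under different kernel names.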

19 May, 2022 1 commit
- Fix perf regression · c84154b8
  Paul authored May 19, 2022
  
  c84154b8
12 May, 2022 1 commit
- Fix tidy · 8344791c
  Paul authored May 12, 2022
  
  8344791c
11 May, 2022 4 commits
- Format · db2def39
  Paul authored May 10, 2022
  
  db2def39
- Fix vec issues · f1f60be1
  Paul authored May 10, 2022
  
  f1f60be1
- Format · c13780c2
  Paul authored May 10, 2022
  
  c13780c2
- Add vectorization to reduction · 15fd8205
  Paul authored May 10, 2022
  
  15fd8205
10 May, 2022 2 commits
- Format · 8a6ae079
  Paul authored May 10, 2022
  
  8a6ae079
- Consolidate the vectorize and preload · d60364a3
  Paul authored May 10, 2022
  
  d60364a3
09 May, 2022 1 commit

- Refactor vectorization and preloading for pointwise fusions (#1184) · ddbbe54b
  Paul Fultz II authored May 09, 2022

  Improves performance for add_gelu. In BERT it is 4x faster, and mul_add is 50% faster than what we currently have.

  ddbbe54b

06 May, 2022 1 commit

- upgrade docker images to ROCm 5.0.2 (#1133) · f55d7c24
  Chris Austen authored May 06, 2022

  * Move CI containers to ROCm 5.0.2
  * Upgrade to 20.04
  * Free up some more file space in GitHub Actions environments

  f55d7c24

03 May, 2022 4 commits
- Format · bb0fff52
  Paul authored May 03, 2022
  
  bb0fff52
- Slice inner · c9bc461c
  Paul authored May 03, 2022
  
  c9bc461c
- Format · 2851a6e9
  Paul authored May 03, 2022
  
  2851a6e9
- Add softmax kernel · efa5dcce
  Paul authored May 03, 2022
  
  efa5dcce
29 Apr, 2022 1 commit
- Add GatherND operator (#1089) · 4ec35e5f
  turneram authored Apr 28, 2022
```
Add ref and gpu implementations for ONNX op GatherND

Resolves #1032
```
  4ec35e5f
27 Apr, 2022 1 commit

- Add lane reduction (#1180) · 4c72cc95
  Paul Fultz II authored Apr 27, 2022

  With reductions such as {2048, 2, 1456} on axis 1, this is 23x faster than using our new block_reduce, and it's even over 100x faster than our original reduce_sum:

      # lane
      gpu::code_object[code_object=13736,symbol_name=kernel,global=2981888,local=1024,]: 0.0672928ms
      # block
      gpu::code_object[code_object=13800,symbol_name=kernel,global=39321600,local=64,]: 1.46072ms
      # original
      gpu::reduce_sum[axes={1}]: 6.73456ms

  There is some basic logic to pick between lane and block reduce automatically.

  4c72cc95
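
As a quick check of the figures in the message above, the sketch below recomputes the ratios from the three listed timings and the number of output elements for the {2048, 2, 1456} reduction on axis 1. The variable names are illustrative. The reading that the lane kernel's `global=2981888` corresponds to one work-item per output element is an inference from the numbers matching, not something the message states.

```python
import math

# Shape and axis quoted in the commit message.
lens = (2048, 2, 1456)
axis = 1

# Timings in milliseconds, copied from the listing above.
timings_ms = {"lane": 0.0672928, "block": 1.46072, "original": 6.73456}

# Elements remaining after reducing over `axis`.
output_elements = math.prod(n for i, n in enumerate(lens) if i != axis)

speedup_vs_block = timings_ms["block"] / timings_ms["lane"]
speedup_vs_original = timings_ms["original"] / timings_ms["lane"]

print(output_elements)                 # matches global=2981888 in the lane listing
print(round(speedup_vs_block, 1))      # ratio of block to lane time
print(round(speedup_vs_original, 1))   # ratio of original to lane time
```

From these three timings the lane kernel comes out at roughly 21.7x faster than the block kernel, a little below the quoted 23x, and just over 100x faster than the original `reduce_sum`.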

17 Apr, 2022 1 commit

- Reduce with runtime compilation (#1150) · f9a5b81e
  Paul Fultz II authored Apr 17, 2022

  There is a significant improvement on larger tensors; with half, it is almost 50% faster:

      lens: [1024, 384, 768]
      gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms
      gpu::reduce_sum[axes={2}]: 1.73126ms

  Also, for non-trivial layouts this can sometimes be over 2x faster:

      lens: [64, 1024, 768, 4]
      gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms
      gpu::reduce_sum[axes={1}]: 2.63375ms

  Of course, if the stride becomes larger, this speed improvement diminishes due to poor memory access patterns. A lane_reduce instead of a block_reduce is needed for these kinds of kernels. I plan to address that in a future PR.

  Finally, this also includes a MIGRAPHX_GPU_DUMP_ASM environment variable, which will print out the assembly when the kernel compiles.

  f9a5b81e
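
The comparisons quoted above can be reproduced by parsing the timing lines. The sketch below assumes each line ends in `: <number>ms`, as in the listings in this message. `parse_timing` and `speedup` are hypothetical helper names, not part of MIGraphX or its driver.

```python
import re

# Matches "<instruction text>: <milliseconds>ms", e.g. "gpu::reduce_sum[axes={2}]: 1.73126ms".
TIMING_RE = re.compile(r"^(?P<name>.+): (?P<ms>[0-9.]+)ms$")


def parse_timing(line):
    """Return (instruction text, time in ms) for one timing line."""
    match = TIMING_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a timing line: {line!r}")
    return match["name"], float(match["ms"])


def speedup(baseline_line, new_line):
    """How many times faster the new kernel is than the baseline."""
    return parse_timing(baseline_line)[1] / parse_timing(new_line)[1]


jit_2 = "gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms"
old_2 = "gpu::reduce_sum[axes={2}]: 1.73126ms"
jit_1 = "gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms"
old_1 = "gpu::reduce_sum[axes={1}]: 2.63375ms"

print(round(speedup(old_2, jit_2), 2))  # lens [1024, 384, 768], axes={2}
print(round(speedup(old_1, jit_1), 2))  # lens [64, 1024, 768, 4], axes={1}
```

For the listed numbers this gives about 1.48x for the first case and about 2.25x for the second, in line with the "almost 50%" and "over 2x" statements.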

29 Mar, 2022 1 commit

- Refactor runtime compiled kernels to use the same compile_ops pipeline (#1125) · 661046c6
  Paul Fultz II authored Mar 29, 2022

  This adds the infrastructure so we can compile everything in parallel, whereas before only pointwise kernels were compiled in parallel. This will also directly integrate with lowering and the gpu-driver. The kernels for pointwise and roialign use this infrastructure. Scatternd does not, since it requires a standard shape.

  This also makes it easier to add new runtime compiled kernels in the future.

  661046c6