Commits · d4ee79841b49f707bf9d2294a79df5e32ecee006 · gaoqiong / MIGraphX

17 May, 2022 11 commits
- Formatting · 1bc16951
  turneram authored May 17, 2022
  
  1bc16951
- Use parse_layernorm to un-fuse layernorm op · c7096299
  turneram authored May 17, 2022
  
  c7096299
- Format · 835cc1e2
  Paul authored May 17, 2022
  
  835cc1e2
- Fuse contiguous · 77be2528
  Paul authored May 17, 2022
  
  77be2528
- Format · 9426aae5
  Paul authored May 17, 2022
  
  9426aae5
- Don't hinder eliminate_contiguous · e83dc134
  Paul authored May 17, 2022
  
  e83dc134
- Format · 8e49a9f2
  Paul authored May 17, 2022
  
  8e49a9f2
- Jit contiguous · 407acb7d
  Paul authored May 17, 2022
  
  407acb7d
- Formatting · 1974671d
  turneram authored May 17, 2022
  
  1974671d
- Fix transpose kernels · fe9a42f1
  turneram authored May 17, 2022
  
  fe9a42f1
- renamed variables for module from p to m (#1204) · a27dd28c
  shivadbhavsar authored May 17, 2022
```
Updated variable names according to #1193
```
  a27dd28c
12 May, 2022 2 commits
- Formatting · 96663815
  turneram authored May 12, 2022
  
  96663815
- Add transposectx and transposeqkv · 0ccee797
  turneram authored May 12, 2022
  
  0ccee797
11 May, 2022 10 commits
- Prefuse layernorm for gpu (#1190) · 671f24be
  Paul Fultz II authored May 11, 2022
```
Fuses layernorm and adds a triadd_layernorm fusion. This is a preparatory performance boost.
```
  671f24be
- Formatting · 5ded4ac1
  turneram authored May 11, 2022
  
  5ded4ac1
- Formatting · f99a3036
  turneram authored May 10, 2022
  
  f99a3036
- Add ONNX parsers for gelu, fastgelu; using silu approximation for fastgelu · 2237c5de
  turneram authored May 10, 2022
  
  2237c5de
- Formatting · 732a4a73
  turneram authored May 09, 2022
  
  732a4a73
- Fix scale and bias ops · 7afefd9b
  turneram authored May 09, 2022
  
  7afefd9b
- Formatting · ae513aa8
  turneram authored May 02, 2022
  
  ae513aa8
- Update ref to match gpu and handle scale and bias in parser · f89638ec
  turneram authored May 02, 2022
  
  f89638ec
- Formatting · ba7a370a
  turneram authored May 02, 2022
  
  ba7a370a
- Add attention and layernorm ops · eea36256
  turneram authored May 02, 2022
  
  eea36256
10 May, 2022 1 commit
- Expose `add_literal` in C and Python API (#1173) · 5e5ed37a
  Umang Yadav authored May 10, 2022
```
Expose add_literal method in C/C++ api
```
  5e5ed37a
09 May, 2022 1 commit

Refactor vectorization and preloading for pointwise fusions (#1184) · ddbbe54b

Paul Fultz II authored May 09, 2022

Improves performance for add_gelu. In BERT it is 4x faster, and for mul_add it is 50% faster than what we currently have.

ddbbe54b

06 May, 2022 1 commit

upgrade docker images to ROCm 5.0.2 (#1133) · f55d7c24

Chris Austen authored May 06, 2022

Move CI containers to ROCm 5.0.2
Upgrade to 20.04
Free up some more file space in GitHub Actions environments

f55d7c24

05 May, 2022 1 commit

Cppcheck fixes (#1195) · d582425b

Paul Fultz II authored May 05, 2022

Fixes the #error when using cppcheck. This no longer suppresses cppcheck errors when including those errors. This also fixes the cppcheck errors that were already there.

d582425b

03 May, 2022 7 commits
- Format · bb0fff52
  Paul authored May 03, 2022
  
  bb0fff52
- Slice inner · c9bc461c
  Paul authored May 03, 2022
  
  c9bc461c
- Format · b9b761eb
  Paul authored May 03, 2022
  
  b9b761eb
- Compile-time fixes · ffbc0918
  Paul authored May 03, 2022
  
  ffbc0918
- Format · 2851a6e9
  Paul authored May 03, 2022
  
  2851a6e9
- Add softmax kernel · efa5dcce
  Paul authored May 03, 2022
  
  efa5dcce
- Extend lifetimes in C++ API (#1139) · 4a5a23a4
  Paul Fultz II authored May 02, 2022
```
Helps avoid dangling references. This also deprecates the constructors that didn't take a lifetime annotation, since the lifetime is ambiguous.
```
  4a5a23a4
29 Apr, 2022 1 commit
- Add GatherND operator (#1089) · 4ec35e5f
  turneram authored Apr 28, 2022
```
Add ref and gpu implementations for ONNX op GatherND

Resolves #1032
```
  4ec35e5f
27 Apr, 2022 1 commit

Add lane reduction (#1180) · 4c72cc95

Paul Fultz II authored Apr 27, 2022

With reductions such as {2048, 2, 1456} on axis 1, this is 23x faster than using our new block_reduce, and it's even over 100x faster than our original reduce_sum:

# lane
gpu::code_object[code_object=13736,symbol_name=kernel,global=2981888,local=1024,]: 0.0672928ms
# block
gpu::code_object[code_object=13800,symbol_name=kernel,global=39321600,local=64,]: 1.46072ms
# original
gpu::reduce_sum[axes={1}]: 6.73456ms

There is some basic logic to pick between lane and block reduce automatically.
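
To make the difference concrete, here is a minimal sketch in plain Python (a simulation, not the actual GPU kernels). It assumes a lane reduction means each work-item sums its own output element serially, while a block reduction means a group of work-items cooperates on one output element. That reading is consistent with `global=2981888` in the lane timing above, which equals 2048 × 1456, i.e. one work-item per output element. The function names, the simulated block size and the small test shape are all made up for the example.

```python
from itertools import product


def flat_index(idx, lens):
    """Row-major offset of a multi-dimensional index."""
    offset = 0
    for i, n in zip(idx, lens):
        offset = offset * n + i
    return offset


def output_indices(lens, axis):
    """Index tuples with the reduced axis fixed to 0 (one per output element)."""
    ranges = [range(1) if d == axis else range(n) for d, n in enumerate(lens)]
    return list(product(*ranges))


def lane_reduce_sum(data, lens, axis):
    """One simulated work-item per output element; each sums its slice serially."""
    out = []
    for idx in output_indices(lens, axis):
        acc = 0.0
        for k in range(lens[axis]):
            full = idx[:axis] + (k,) + idx[axis + 1:]
            acc += data[flat_index(full, lens)]
        out.append(acc)
    return out


def block_reduce_sum(data, lens, axis, block_size=4):
    """One simulated block per output element; block_size must be a power of two."""
    out = []
    for idx in output_indices(lens, axis):
        # Each work-item accumulates a strided share of the reduction axis.
        partial = [0.0] * block_size
        for k in range(lens[axis]):
            full = idx[:axis] + (k,) + idx[axis + 1:]
            partial[k % block_size] += data[flat_index(full, lens)]
        # Tree-combine the partial sums, halving the active work-items each step.
        stride = block_size // 2
        while stride > 0:
            for t in range(stride):
                partial[t] += partial[t + stride]
            stride //= 2
        out.append(partial[0])
    return out


# Small stand-in for {2048, 2, 1456}: the reduced axis has only 2 elements.
lens = (3, 2, 4)
data = [float(v) for v in range(3 * 2 * 4)]
lane = lane_reduce_sum(data, lens, axis=1)
block = block_reduce_sum(data, lens, axis=1)
```

With only 2 elements along the reduced axis, most of the simulated block sits idle in `block_reduce_sum`, which is the situation where the lane variant is expected to win.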

4c72cc95

26 Apr, 2022 1 commit
- Expose get_queue method for context in API (#1161) · 36656030
  Umang Yadav authored Apr 26, 2022
```
* expose get_queue method
```
  36656030
23 Apr, 2022 1 commit

ReverseSequence op (#1177) · 31906785

Charlie Lin authored Apr 22, 2022

Implements the ReverseSequence ONNX operator as a parser.

This parser can only handle a constant sequence_lens input. This is the same as what is handled for TensorRT as far as I can tell.
We could handle a variable sequence_lens input; that would require ref and GPU implementations of the operator.
The ONNX backend tests are disabled because this does not handle variable sequence_lens.
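
For reference, a minimal sketch of the ReverseSequence semantics in plain Python, assuming a 2-D input with batch_axis=0 and time_axis=1. The helper name is made up for illustration and this is not the MIGraphX parser code. Because sequence_lens is known up front, each batch entry reduces to a fixed slice, reverse and concatenate:

```python
def reverse_sequence(rows, sequence_lens):
    """Reverse the first sequence_lens[i] entries of row i; leave the rest as-is.

    Assumes a 2-D input laid out as [batch][time].
    """
    if len(rows) != len(sequence_lens):
        raise ValueError("need one sequence length per batch entry")
    out = []
    for row, n in zip(rows, sequence_lens):
        head = row[:n][::-1]     # slice and reverse the valid prefix
        tail = row[n:]           # entries past the sequence length stay in place
        out.append(head + tail)  # concatenate the two pieces
    return out


rows = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
result = reverse_sequence(rows, [1, 2, 3, 4])
```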

31906785

19 Apr, 2022 1 commit

Refactor Pooling and implement ONNX LpPool and GlobalLpPool (#1152) · 764273e4

Charlie Lin authored Apr 18, 2022

Refactored the reference implementation of pooling to something like what was done for roialign. Moved the reference implementation of pooling from targets/ref/lowering.cpp to pooling.hpp.
Removed cpu_pooling, instead using reference pooling in pooling.hpp.
Added a reference implementation of Lp norm pooling and its global version.
Added tests for Lp norm pooling.
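
As a reminder of what Lp norm pooling computes, here is a minimal 1-D sketch in plain Python: each output is (sum of |x|^p over the window)^(1/p). The function names are hypothetical, padding is not handled, and this is not the code in pooling.hpp.

```python
def lp_pool_1d(values, kernel, stride, p=2):
    """Lp norm pooling over a 1-D sequence, without padding."""
    out = []
    for start in range(0, len(values) - kernel + 1, stride):
        window = values[start:start + kernel]
        # p-norm of the window: (sum |x|^p)^(1/p)
        out.append(sum(abs(v) ** p for v in window) ** (1.0 / p))
    return out


def global_lp_pool_1d(values, p=2):
    """Global variant: a single window covering the whole sequence."""
    return lp_pool_1d(values, kernel=len(values), stride=1, p=p)


pooled = lp_pool_1d([3.0, 4.0, 6.0, 8.0], kernel=2, stride=2)
pooled_global = global_lp_pool_1d([3.0, -4.0])
```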

764273e4

17 Apr, 2022 1 commit

Reduce with runtime compilation (#1150) · f9a5b81e

Paul Fultz II authored Apr 17, 2022

There is a significant improvement on larger tensors, with half being almost 50% faster:

lens: [1024, 384, 768]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms
gpu::reduce_sum[axes={2}]: 1.73126ms

Also for non-trivial layouts this can sometimes be over 2x faster:

lens: [64, 1024, 768, 4]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms
gpu::reduce_sum[axes={1}]: 2.63375ms

Of course, if the stride becomes larger, this speed improvement diminishes due to poor memory access patterns. A lane_reduce instead of a block_reduce is needed for such kernels. I plan to address that in a future PR.
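
The speedups quoted above can be recomputed from the listed timings. A small sketch in plain Python; the dictionary keys and the helper name are only labels chosen for this example:

```python
# Timings in milliseconds, copied from the measurements above.
timings_ms = {
    "[1024, 384, 768], axes={2}": {"code_object": 1.16685, "reduce_sum": 1.73126},
    "[64, 1024, 768, 4], axes={1}": {"code_object": 1.1706, "reduce_sum": 2.63375},
}


def speedup(old_ms, new_ms):
    """How many times faster the new kernel is than the old one."""
    return old_ms / new_ms


speedups = {
    name: speedup(t["reduce_sum"], t["code_object"])
    for name, t in timings_ms.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```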

Finally, this also includes a MIGRAPHX_GPU_DUMP_ASM environment variable, which prints out the assembly when the kernel compiles.

f9a5b81e