Commits · eb8f205b5e2d5d40b95e24bfc1604d9e6cfc2576 · gaoqiong / MIGraphX

25 May, 2022 2 commits
- Dynamic weight handling · eb8f205b
  charlie authored May 25, 2022
  
  eb8f205b
- Add to conv dynamic batch test · 421769c0
  charlie authored May 24, 2022
  
  421769c0
24 May, 2022 5 commits

Improve applicable batched gemms (#1214) · bf0a4713
Paul Fultz II authored May 24, 2022
```
* Improve applicable batched gemms for bert
```
bf0a4713

Remove std references in runtime compilation (#1186) · 150d6d20

Paul Fultz II authored May 24, 2022

Remove std references in runtime compilation. These are not available when using hiprtc, and the headers may not be available on the system.

150d6d20

Fuse gemm add with pointwise fusions (#1213) · a500620e
Paul Fultz II authored May 24, 2022
```
* Fuse gemm add with pointwise fusions
```
a500620e

Fix onnx mean parsing for integral inputs (#1209) · d895104a

shivadbhavsar authored May 23, 2022

As described in #1196, the ONNX mean parser does not work correctly for integral types. This update fixes the issue by handling integral types separately, where summation is performed before division. Additional test cases have also been added for handling integral types.

d895104a
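To illustrate why the order of operations matters for integral inputs, here is a minimal sketch in plain Python. It is not the MIGraphX parser: the function names are hypothetical, and it assumes that the non-integral path divides each input by the count before summing, as the commit message implies.

```python
# Hypothetical illustration only; not MIGraphX code.
# Inputs are non-negative integers, so Python's floor division (//)
# matches the truncating division of integral tensor types.

def mean_divide_first(inputs):
    """Divide every element by the input count, then sum (drops remainders)."""
    n = len(inputs)
    return [sum(x // n for x in column) for column in zip(*inputs)]


def mean_sum_first(inputs):
    """Sum the elements first, then divide once."""
    n = len(inputs)
    return [sum(column) // n for column in zip(*inputs)]


a = [1, 2, 3]
b = [1, 2, 4]
c = [1, 2, 5]

divide_first = mean_divide_first([a, b, c])  # remainders are lost per input
sum_first = mean_sum_first([a, b, c])        # matches the exact mean here
```

With these inputs the exact element-wise mean is `[1, 2, 4]`; dividing first yields `[0, 0, 3]`, while summing first yields `[1, 2, 4]`.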

Dynamic conv draft progress · a465fc9d
charlie authored May 23, 2022

a465fc9d

20 May, 2022 2 commits
- Rename pointwise ops (#1145) · 4a312201
  kahmed10 authored May 20, 2022
```
For clarity on kernel names found when profiling. The new names are set to the order of the ops being compiled. For example: add + relu = add_relu_kernel.
```
  4a312201
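The naming rule described in this commit can be sketched as follows. `fused_kernel_name` is a hypothetical helper written for illustration; the actual naming code in MIGraphX may differ.

```python
# Hypothetical helper; only the add + relu example comes from the commit message.

def fused_kernel_name(op_names):
    """Join op names in compilation order and append the kernel suffix."""
    return "_".join(op_names) + "_kernel"


name = fused_kernel_name(["add", "relu"])
```

Here `name` is `add_relu_kernel`, matching the example in the commit message.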
- Improve matching with has_value when there are convert operators (#1212) · 27af0170
  Paul Fultz II authored May 19, 2022
  
  27af0170
19 May, 2022 2 commits
- Formatting · 79e27dac
  charlie authored May 19, 2022
  
  79e27dac
- progress · 53c4b899
  charlie authored May 19, 2022
  
  53c4b899
17 May, 2022 1 commit
- renamed variables for module from p to m (#1204) · a27dd28c
  shivadbhavsar authored May 17, 2022
```
Updated variable names according to #1193
```
  a27dd28c
11 May, 2022 7 commits
- formatting · c3861fb1
  charlie authored May 11, 2022
  
  c3861fb1
- Seralize and reflect changes · d9cd32a4
  charlie authored May 11, 2022
  
  d9cd32a4
- Prefuse layernorm for gpu (#1190) · 671f24be
  Paul Fultz II authored May 11, 2022
```
Fuses layernorm and adds a triadd_layernorm fusion. This is a preparatory performance boost.
```
  671f24be
- Comments fix · 54a09384
  charlie authored May 11, 2022
  
  54a09384
- formatting · 8f76125c
  charlie authored May 11, 2022
  
  8f76125c
- element_space, min,max,opt _lens change · c497c12d
  charlie authored May 11, 2022
  
  c497c12d
- avoid typedef · 33e5534c
  charlie authored May 11, 2022
  
  33e5534c
10 May, 2022 3 commits
- Use std::initializer_list in constructor · ac0224a9
  charlie authored May 10, 2022
```
Reverts the dyn_data struct change
Should get around the ambiguous braced initialization list error
```
  ac0224a9
- Tidy fix: use move · 2e27a823
  charlie authored May 10, 2022
  
  2e27a823
- Expose `add_literal` in C and Python API (#1173) · 5e5ed37a
  Umang Yadav authored May 10, 2022
```
Expose add_literal method in C/C++ api
```
  5e5ed37a
09 May, 2022 3 commits
- Tidy fix: emplace_back() over for loop · 77ba9cb1
  charlie authored May 09, 2022
  
  77ba9cb1
- Add dyn_data struct to avoid ambiguous constructor · de4c1b44
  charlie authored May 09, 2022
  
  de4c1b44
- Refactor vectorization and preloading for pointwise fusions (#1184) · ddbbe54b
  Paul Fultz II authored May 09, 2022
```
Improves performance for add_gelu. In bert it is 4x faster, and for mul_add it is 50% faster than what we currently have.
```
  ddbbe54b
06 May, 2022 3 commits
- Fix serialize errors · b31735e8
  charlie authored May 06, 2022
  
  b31735e8
- Dynamic shape tests · 7c63b13b
  charlie authored May 06, 2022
  
  7c63b13b
- upgrade docker images to ROCm 5.0.2 (#1133) · f55d7c24
  Chris Austen authored May 06, 2022
```
Move CI containers to ROCm 5.0.2
upgrade to 20.04
free up some more file space in github action environments
```
  f55d7c24
05 May, 2022 1 commit

Cppcheck fixes (#1195) · d582425b

Paul Fultz II authored May 05, 2022

Fixes the #error when using cppcheck. This no longer suppresses cppcheck errors when including those errors. This also fixes the cppcheck errors that were already there.

d582425b

04 May, 2022 3 commits
- Remove const on dyn_dim copy getters · aa085491
  charlie authored May 04, 2022
  
  aa085491
- Fixing serialization errors · f8822187
  charlie authored May 04, 2022
  
  f8822187
- Shape class changes to handle dynamic · 8bf8e161
  charlie authored May 03, 2022
```
* More throw errors for functions that don't make sense for dynamic shape
* Print output changes
* Serialization changes
```
  8bf8e161
03 May, 2022 2 commits
- Dynamic shape handling in shape object · c0e18e78
  charlie authored May 03, 2022
  
  c0e18e78
- Extend lifetimes in C++ API (#1139) · 4a5a23a4
  Paul Fultz II authored May 02, 2022
```
Helps avoid dangling references. This also deprecates the constructors that didn't take a lifetime annotation, since the lifetime is ambiguous.
```
  4a5a23a4
29 Apr, 2022 1 commit
- Add GatherND operator (#1089) · 4ec35e5f
  turneram authored Apr 28, 2022
```
Add ref and gpu implementations for ONNX op GatherND

Resolves #1032
```
  4ec35e5f
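For readers unfamiliar with the operator, the following is a minimal sketch of GatherND semantics on nested Python lists, assuming `batch_dims=0`. It is not the ref or gpu implementation from this commit, and `gather_nd` is a hypothetical name.

```python
# Hypothetical illustration of ONNX GatherND with batch_dims=0; not MIGraphX code.

def gather_nd(data, indices):
    """Each innermost list in `indices` is an index tuple into `data`."""

    def lookup(index_tuple):
        item = data
        for i in index_tuple:
            item = item[i]  # descend one dimension per index
        return item

    def walk(node):
        # A list of plain integers is an index tuple; otherwise recurse.
        if node and not isinstance(node[0], list):
            return lookup(node)
        return [walk(child) for child in node]

    return walk(indices)


data = [[0, 1], [2, 3]]
full_index = gather_nd(data, [[0, 0], [1, 1]])  # picks single elements
partial_index = gather_nd(data, [[1], [0]])     # picks whole rows
```

A full-length index tuple selects a scalar, while a shorter one selects a slice of `data`.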
27 Apr, 2022 1 commit

Add lane reduction (#1180) · 4c72cc95

Paul Fultz II authored Apr 27, 2022

With reductions such as {2048, 2, 1456} on axis 1, this is 23x faster than using our new block_reduce, and it's even over 100x faster than our original reduce_sum:

# lane
gpu::code_object[code_object=13736,symbol_name=kernel,global=2981888,local=1024,]: 0.0672928ms
# block
gpu::code_object[code_object=13800,symbol_name=kernel,global=39321600,local=64,]: 1.46072ms
# original
gpu::reduce_sum[axes={1}]: 6.73456ms

There is some basic logic to pick between lane and block reduce automatically.

4c72cc95
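The difference between the two strategies can be sketched conceptually in plain Python. This is not the GPU kernel code: the function names and the `block_size` parameter are hypothetical, and the sketch only models which "lane" does which work. Note that the lane kernel's `global=2981888` equals 2048 × 1456, which is consistent with one work-item per output element.

```python
# Conceptual sketch only; hypothetical names, not the MIGraphX kernels.

def lane_reduce_sum(rows):
    """Lane reduce: each lane sums one whole reduction row serially."""
    return [sum(row) for row in rows]


def block_reduce_sum(rows, block_size=4):
    """Block reduce: block_size lanes cooperate on every row.

    block_size is assumed to be a power of two.
    """
    results = []
    for row in rows:
        # Each lane first accumulates a strided slice of the row.
        partial = [sum(row[lane::block_size]) for lane in range(block_size)]
        # The lanes then combine their partial sums in a tree.
        stride = block_size // 2
        while stride > 0:
            for lane in range(stride):
                partial[lane] += partial[lane + stride]
            stride //= 2
        results.append(partial[0])
    return results


# Reduction length 2, as on axis 1 of {2048, 2, 1456}: most lanes of a
# block would have nothing to add, so a lane per output is a better fit.
rows = [[1, 2], [3, 4], [5, 6]]
lane_result = lane_reduce_sum(rows)
block_result = block_reduce_sum(rows)
output_elements = 2048 * 1456
```

Both functions return the same sums; they differ only in how the work is split across lanes.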

26 Apr, 2022 1 commit
- Expose get_queue method for context in API (#1161) · 36656030
  Umang Yadav authored Apr 26, 2022
```
* expose get_queue method
```
  36656030
23 Apr, 2022 1 commit

ReverseSequence op (#1177) · 31906785

Charlie Lin authored Apr 22, 2022

Implements the ReverseSequence ONNX operator as a parser.

This parser can only handle a constant sequence_lens input. This is the same as what is handled for TensorRT as far as I can tell.
We could handle a variable sequence_lens input; that would require ref and GPU implementations of the operator.
The ONNX backend tests are disabled because this does not handle variable sequence_lens.

31906785
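As a reference for the operator's behaviour, here is a minimal sketch of ReverseSequence semantics for a 2-D input, assuming `batch_axis=0` and `time_axis=1`. It is not the parser from this commit, and `reverse_sequence` is a hypothetical name. Because `sequence_lens` is constant, every slice boundary is known up front.

```python
# Hypothetical illustration of ONNX ReverseSequence; not MIGraphX code.
# Assumes a 2-D input with batch_axis=0 and time_axis=1.

def reverse_sequence(rows, sequence_lens):
    """Reverse the first sequence_lens[i] entries of row i; keep the rest."""
    return [row[:n][::-1] + row[n:] for row, n in zip(rows, sequence_lens)]


rows = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
reversed_rows = reverse_sequence(rows, [1, 2, 3, 4])
```

Row `i` has only its first `sequence_lens[i]` entries reversed, so the first row is unchanged and the last row is fully reversed.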

19 Apr, 2022 1 commit

Refactor Pooling and implement ONNX LpPool and GlobalLpPool (#1152) · 764273e4

Charlie Lin authored Apr 18, 2022

* Refactored the reference implementation of pooling to something like what was done for roialign. Moved the reference implementation of pooling from targets/ref/lowering.cpp to pooling.hpp.
* Removed cpu_pooling, instead using reference pooling in pooling.hpp
* Added reference implementation of Lp Norm pooling and the global version
* Added tests for the Lp Norm Pooling

764273e4
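Lp norm pooling takes the p-norm of each window instead of its maximum or average. The following is a minimal 1-D sketch in plain Python; it is not the code in pooling.hpp, the function names are hypothetical, and padding is not handled.

```python
# Hypothetical 1-D illustration of Lp norm pooling; not MIGraphX code.

def lp_pool_1d(values, kernel, stride=1, p=2):
    """Return (sum(|x|^p))^(1/p) for each window of length `kernel`."""
    out = []
    for start in range(0, len(values) - kernel + 1, stride):
        window = values[start:start + kernel]
        out.append(sum(abs(v) ** p for v in window) ** (1.0 / p))
    return out


def global_lp_pool_1d(values, p=2):
    """Global variant: a single window covering the whole input."""
    return lp_pool_1d(values, kernel=len(values), p=p)[0]


pooled = lp_pool_1d([3.0, 4.0, 0.0], kernel=2)  # windows [3, 4] and [4, 0]
pooled_global = global_lp_pool_1d([3.0, 4.0])
```

With `p=2` each output is the Euclidean norm of its window; the global version reduces the entire spatial extent to one value.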

17 Apr, 2022 1 commit

Reduce with runtime compilation (#1150) · f9a5b81e

Paul Fultz II authored Apr 17, 2022

There is a significant improvement on larger tensors; with half it is almost 50% faster:

lens: [1024, 384, 768]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms
gpu::reduce_sum[axes={2}]: 1.73126ms

Also, for non-trivial layouts this can sometimes be over 2x faster:

lens: [64, 1024, 768, 4]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms
gpu::reduce_sum[axes={1}]: 2.63375ms

Of course, if the stride becomes larger, this speed improvement diminishes due to poor memory access patterns. A lane_reduce instead of a block_reduce is needed for this type of kernel. I plan to address that in a future PR.

Finally, this also includes a MIGRAPHX_GPU_DUMP_ASM env variable which will print out the assembly when the kernel compiles.

f9a5b81e
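The speedups quoted in this commit message can be checked directly from the reported timings. The snippet below is only arithmetic on the numbers above; `speedup` is a hypothetical helper.

```python
# Timings in milliseconds, copied from the commit message above.

def speedup(baseline_ms, new_ms):
    """How many times faster the new kernel is than the baseline."""
    return baseline_ms / new_ms


# lens: [1024, 384, 768], reduce over axis 2
half_case = speedup(1.73126, 1.16685)

# lens: [64, 1024, 768, 4], reduce over axis 1
strided_case = speedup(2.63375, 1.1706)
```

These ratios come out at roughly 1.48x and 2.25x, in line with the "almost 50% faster" and "over 2x faster" claims.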