Commits · 811c333d289d170d4dece8d9f438fd4454c5b7ae · gaoqiong / MIGraphX

01 Jun, 2022 1 commit
- Fix epsilon bug · 811c333d
  turneram authored Jun 01, 2022
  
  811c333d
31 May, 2022 1 commit
- Remove layernorm op · 5af79bd7
  turneram authored May 31, 2022
  
  5af79bd7
30 May, 2022 1 commit

Improve eliminate contiguous pass (#1223) · 86061b4d

shivadbhavsar authored May 29, 2022

Following up on issue #1166 and PR #1220. Using the same approach as in #1220 for parallelizing the eval calls, we can significantly reduce the time spent on the eliminate_contiguous pass.

86061b4d

26 May, 2022 4 commits

Parallelize evaluations in propagate_constant (#1220) · bf603a76

shivadbhavsar authored May 26, 2022

Addressing issue #1166 - propagate_constant pass currently uses a recursive approach to find all instructions in a module that can be evaluated to a literal and performs the replacement in the same call.

New approach:

Perform a single pass through the instructions in the module to determine which instructions can be evaluated
Evaluate selected instructions in parallel
Replace the selected instructions with the corresponding literal
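
The three steps above can be sketched on a toy module. This is only an illustration under stated assumptions: the dict-based IR, the `OPS` table, and the function names are hypothetical and are not MIGraphX's API; a thread pool stands in for whatever parallel mechanism the pass actually uses.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy IR: name -> (op, args), stored in topological order.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def evaluate(name, module):
    """Recursively evaluate an instruction whose inputs are all constant."""
    op, args = module[name]
    if op == "literal":
        return args[0]
    return OPS[op](*(evaluate(a, module) for a in args))


def propagate_constant(module):
    # Step 1: single pass to find instructions that depend only on constants.
    const = set()
    selected = []
    for name, (op, args) in module.items():
        if op == "literal":
            const.add(name)
        elif op in OPS and all(a in const for a in args):
            const.add(name)
            selected.append(name)
    # Step 2: evaluate the selected instructions in parallel. The module is
    # only read here, so the evaluations do not interfere with each other.
    with ThreadPoolExecutor() as pool:
        values = list(pool.map(lambda n: evaluate(n, module), selected))
    # Step 3: replace each selected instruction with its literal.
    for name, value in zip(selected, values):
        module[name] = ("literal", (value,))
    return module
```

Separating selection from replacement means nothing is mutated while evaluations are in flight, which is what makes the middle step safe to parallelize in this sketch.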

bf603a76

Upgrade to cppcheck 2.8 and fix new issues found (#1225) · a401e72a
Paul Fultz II authored May 26, 2022
```
* Upgrade to cppcheck 2.8
```
a401e72a
Formatting · 74b947ed
turneram authored May 26, 2022

74b947ed
Use parse_layernorm to un-fuse LayerNormalization op · 6ca16d98
turneram authored May 26, 2022

6ca16d98

24 May, 2022 4 commits

Improve applicable batched gemms (#1214) · bf0a4713
Paul Fultz II authored May 24, 2022
```
* Improve applicable batched gemms for bert
```
bf0a4713

Remove std references in runtime compilation (#1186) · 150d6d20

Paul Fultz II authored May 24, 2022

Remove std references in runtime compilation since these are not available when using hiprtc and the headers may not be available on the system

150d6d20

Fuse gemm add with pointwise fusions (#1213) · a500620e
Paul Fultz II authored May 24, 2022
```
* Fuse gemm add with pointwise fusions
```
a500620e

Fix onnx mean parsing for integral inputs (#1209) · d895104a

shivadbhavsar authored May 23, 2022

As described in #1196, the ONNX mean parser does not work correctly for integral types. This update fixes the issue by handling integral types separately, where summation is performed before division. Additional test cases have also been added for handling integral types.
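
A small sketch of why the order of summation and division matters for integral types. The function names are hypothetical, and the "divide first" variant is only one plausible reading of the earlier behaviour, not a copy of the parser code. Non-negative inputs are assumed so that Python's floor division matches C++ integer division.

```python
def mean_sum_first(inputs):
    """Element-wise mean of equally sized integer lists: sum, then divide once."""
    n = len(inputs)
    return [sum(col) // n for col in zip(*inputs)]


def mean_divide_first(inputs):
    """Naive variant: divide every input by n before summing."""
    n = len(inputs)
    return [sum(v // n for v in col) for col in zip(*inputs)]


inputs = [[1, 2], [2, 3], [3, 4]]
print(mean_sum_first(inputs))     # [2, 3]
print(mean_divide_first(inputs))  # [1, 2], each division truncates
```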

d895104a

20 May, 2022 10 commits
- Add contiguous before reshape · eff3d2d3
  turneram authored May 20, 2022
  
  eff3d2d3
- Formatting · cd96c1c8
  turneram authored May 20, 2022
  
  cd96c1c8
- Remove transpose kernels · 37351ed6
  turneram authored May 20, 2022
  
  37351ed6
- Formatting · b745f416
  turneram authored May 20, 2022
  
  b745f416
- Fix layernorm verify test · a16adb42
  turneram authored May 20, 2022
  
  a16adb42
- Remove non-inference portions of parse_attention · 7757cfd0
  turneram authored May 20, 2022
  
  7757cfd0
- Formatting · 05e8bfde
  turneram authored May 20, 2022
  
  05e8bfde
- Add attention, layernorm op, transposectx, and transposeqkv · 3ea9fe4c
  turneram authored May 20, 2022
  
  3ea9fe4c
- Rename pointwise ops (#1145) · 4a312201
  kahmed10 authored May 20, 2022
```
For clarity on kernel names found when profiling. The new names are set to the order of the ops being compiled. For example: add + relu = add_relu_kernel.
```
  4a312201
- Improve matching with has_value when there are convert operators (#1212) · 27af0170
  Paul Fultz II authored May 19, 2022
  
  27af0170
17 May, 2022 1 commit
- renamed variables for module from p to m (#1204) · a27dd28c
  shivadbhavsar authored May 17, 2022
```
Updated variable names according to #1193
```
  a27dd28c
11 May, 2022 1 commit

Prefuse layernorm for gpu (#1190) · 671f24be

Paul Fultz II authored May 11, 2022

Fuse layernorm and added triadd_layernorm fusion.  This is a prep performance booster

671f24be

10 May, 2022 1 commit
- Expose `add_literal` in C and Python API (#1173) · 5e5ed37a
  Umang Yadav authored May 10, 2022
```
Expose add_literal method in C/C++ api
```
  5e5ed37a
09 May, 2022 1 commit

Refactor vectorization and preloading for pointwise fusions (#1184) · ddbbe54b

Paul Fultz II authored May 09, 2022

Improves performance for add_gelu.  In bert it is 4x faster and for mul_add it is 50% faster than what we currently have.

ddbbe54b

06 May, 2022 1 commit

upgrade docker images to ROCm 5.0.2 (#1133) · f55d7c24

Chris Austen authored May 06, 2022

Move CI containers to ROCm 5.0.2
Upgrade to 20.04
Free up some more file space in GitHub Actions environments

f55d7c24

05 May, 2022 1 commit

Cppcheck fixes (#1195) · d582425b

Paul Fultz II authored May 05, 2022

Fixes the #error when using cppcheck. This no longer suppresses cppcheck errors when including those errors. This fixes the cppcheck errors that were already there.

d582425b

03 May, 2022 1 commit

Extend lifetimes in C++ API (#1139) · 4a5a23a4

Paul Fultz II authored May 02, 2022

Helps avoid dangling references. This also deprecates the constructors that didn't take a lifetime annotation, since the lifetime is ambiguous.

4a5a23a4

29 Apr, 2022 1 commit
- Add GatherND operator (#1089) · 4ec35e5f
  turneram authored Apr 28, 2022
```
Add ref and gpu implementations for ONNX op GatherND

Resolves #1032
```
  4ec35e5f
27 Apr, 2022 1 commit

Add lane reduction (#1180) · 4c72cc95

Paul Fultz II authored Apr 27, 2022

With reductions such as {2048, 2, 1456} on axis 1, this is 23x faster than using our new block_reduce, and it's even over 100x faster than our original reduce_sum:

# lane
gpu::code_object[code_object=13736,symbol_name=kernel,global=2981888,local=1024,]: 0.0672928ms
# block
gpu::code_object[code_object=13800,symbol_name=kernel,global=39321600,local=64,]: 1.46072ms
# original
gpu::reduce_sum[axes={1}]: 6.73456ms

There is some basic logic to pick between lane and block reduce automatically.
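
A rough sketch of what a lane reduction does, assuming a flat row-major buffer: every simulated lane (thread) owns one output element and walks the reduced axis sequentially. In a block reduce, by contrast, the threads of a block typically cooperate on a single output. `lane_reduce_sum` is a hypothetical name, and this is not the generated kernel.

```python
def lane_reduce_sum(data, lens, axis):
    """Sum a flat row-major buffer of shape `lens` over `axis`,
    producing one output element per simulated lane."""
    outer = 1
    for d in lens[:axis]:
        outer *= d
    inner = 1
    for d in lens[axis + 1:]:
        inner *= d
    n = lens[axis]
    out = [0] * (outer * inner)
    # Each iteration of this loop plays the role of one GPU thread.
    for lane in range(outer * inner):
        o, i = divmod(lane, inner)
        acc = 0
        for k in range(n):  # sequential walk over the reduced axis
            acc += data[(o * n + k) * inner + i]
        out[lane] = acc
    return out


print(lane_reduce_sum(list(range(12)), [2, 2, 3], 1))  # [3, 5, 7, 15, 17, 19]
```

For the {2048, 2, 1456} case this layout needs 2048 × 1456 = 2,981,888 lanes, which appears to match the `global=2981888` value in the lane log line.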

4c72cc95

26 Apr, 2022 1 commit
- Expose get_queue method for context in API (#1161) · 36656030
  Umang Yadav authored Apr 26, 2022
```
* expose get_queue method
```
  36656030
23 Apr, 2022 1 commit

ReverseSequence op (#1177) · 31906785

Charlie Lin authored Apr 22, 2022

Implements the ReverseSequence ONNX operator as a parser.

This parser can only handle a constant sequence_lens input. This is the same as what is handled for TensorRT as far as I can tell.
We could handle a variable sequence_lens input; that would require ref and GPU implementations of the operator.
The ONNX backend tests are disabled because this does not handle variable sequence_lens.
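
For reference, the operator's semantics can be sketched as below, assuming a 2-D input with batch_axis=0 and time_axis=1. The function name is hypothetical. Because sequence_lens is constant, every slice boundary is known up front, which is presumably what lets a parser express the op with existing instructions.

```python
def reverse_sequence(x, sequence_lens):
    """Reverse the first sequence_lens[i] elements of row i; keep the rest."""
    out = []
    for row, n in zip(x, sequence_lens):
        out.append(row[:n][::-1] + row[n:])
    return out


x = [[0, 1, 2, 3],
     [4, 5, 6, 7]]
print(reverse_sequence(x, [2, 4]))  # [[1, 0, 2, 3], [7, 6, 5, 4]]
```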

31906785

19 Apr, 2022 1 commit

Refactor Pooling and implement ONNX LpPool and GlobalLpPool (#1152) · 764273e4

Charlie Lin authored Apr 18, 2022

Refactored the reference implementation of pooling to something like what was done for roialign. Moved the reference implementation of pooling from targets/ref/lowering.cpp to pooling.hpp.
Removed cpu_pooling, instead using reference pooling in pooling.hpp
Added reference implementation of Lp Norm pooling and the global version
Added tests for the Lp Norm Pooling
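
A minimal 1-D sketch of Lp norm pooling, where each window produces (sum of |x|^p)^(1/p) and the global version is a single window covering the whole input. The function names and the no-padding behaviour are assumptions for illustration, not the code in pooling.hpp.

```python
def lp_pool_1d(x, kernel, stride, p):
    """Lp norm of each window of length `kernel`, stepping by `stride`."""
    out = []
    for start in range(0, len(x) - kernel + 1, stride):
        window = x[start:start + kernel]
        out.append(sum(abs(v) ** p for v in window) ** (1.0 / p))
    return out


def global_lp_pool_1d(x, p):
    """Global variant: one window spanning the whole input."""
    return lp_pool_1d(x, len(x), 1, p)
```

With p=2 and the input [3.0, 4.0, 0.0, 5.0], a kernel and stride of 2 give two windows whose Lp norms are both 5.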

764273e4

17 Apr, 2022 1 commit

Reduce with runtime compilation (#1150) · f9a5b81e

Paul Fultz II authored Apr 17, 2022

There is a significant improvement on larger tensors, with half being almost 50% faster:

lens: [1024, 384, 768]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms
gpu::reduce_sum[axes={2}]: 1.73126ms

Also for non-trivial layouts this can sometimes be over 2x faster:

lens: [64, 1024, 768, 4]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms
gpu::reduce_sum[axes={1}]: 2.63375ms

Of course, if the stride becomes larger, this speed improvement diminishes due to poor memory access patterns. A lane_reduce instead of a block_reduce is needed for such kernels. I plan to address that in a future PR.

Finally, this also includes a MIGRAPHX_GPU_DUMP_ASM env variable which will print out the assembly when the kernel compiles.

f9a5b81e

14 Apr, 2022 1 commit

Half2 overloads (#1157) · 12007dba

bpickrel authored Apr 14, 2022

Issue 1127: updates the math.hpp header file to provide overloads of various standard functions (ops) for the hip half2 type. The half2 type is two 16-bit floats packed into a 32-bit number, and therefore the overloads act on vectors of sizes that are multiples of 2. They are invoked in runtime compilation any time one of the ops is called on a tensor declared with the data type shape::half_type.

Defined a new template, made instances of the template for those math operations that the hip library contains, and added verify tests for the sqrt operator for three cases:

tensor size not divisible by 2
tensor size divisible by 2 but not by 4
tensor size divisible by 4
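
The packing described above can be illustrated with Python's `struct` module, whose `e` format is IEEE half precision. This is a sketch only: the function names are hypothetical, and handling an odd-sized tail with a scalar path is an assumption rather than a description of the kernels.

```python
import math
import struct


def pack_half2(a, b):
    """Pack two floats as half precision into one 32-bit integer."""
    return struct.unpack("<I", struct.pack("<2e", a, b))[0]


def unpack_half2(word):
    """Split a 32-bit integer back into its two half-precision values."""
    return struct.unpack("<2e", struct.pack("<I", word))


def sqrt_half(values):
    """sqrt over a list, processing pairs as half2 and any odd tail alone."""
    out = []
    for i in range(len(values) // 2):
        word = pack_half2(values[2 * i], values[2 * i + 1])
        lo, hi = unpack_half2(word)
        out.extend((math.sqrt(lo), math.sqrt(hi)))
    if len(values) % 2:
        # Assumed scalar path for a size that is not divisible by 2.
        tail = struct.unpack("<e", struct.pack("<e", values[-1]))[0]
        out.append(math.sqrt(tail))
    return out


print(hex(pack_half2(1.0, 2.0)))    # 0x40003c00
print(sqrt_half([4.0, 9.0, 16.0]))  # [2.0, 3.0, 4.0]
```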

12007dba

13 Apr, 2022 1 commit
- Fix problem with incomplete types with older clang versions (#1174) · a11ef66a
  Paul Fultz II authored Apr 13, 2022
```
also added the PYTHON_DISABLE_VERSIONS cmake variable to disable python versions.
```
  a11ef66a
12 Apr, 2022 2 commits
- Fix out-of-bounds access when generate uses nonpacked tensors (#1160) · 262ba721
  Paul Fultz II authored Apr 12, 2022
```
Fix out-of-bounds access when generate uses nonpacked tensors and add some additional asserts for gpu memory.
```
  262ba721
- parallelize the ref implementation of the gemm operator (#1142) · 88b3dd34
  Shucai Xiao authored Apr 12, 2022
```
The ref implementation of the gemm op is sequential; this PR parallelizes the gemm computation in the ref implementation.
```
  88b3dd34
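
One way to picture the change, as a sketch under assumptions: each output row of the product is independent, so rows can be handed to separate workers. The function names are hypothetical, and how the actual ref implementation splits the work is not stated in the commit message. A Python thread pool shows the structure only and will not give a real speedup for this CPU-bound loop.

```python
from concurrent.futures import ThreadPoolExecutor


def gemm_row(a_row, b):
    """Compute one output row: a_row (length K) times b (K x N)."""
    cols = len(b[0])
    return [sum(a_row[k] * b[k][j] for k in range(len(b))) for j in range(cols)]


def parallel_gemm(a, b):
    """Multiply a (M x K) by b (K x N), one task per output row."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda row: gemm_row(row, b), a))


print(parallel_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```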
11 Apr, 2022 2 commits

scatter operator refactoring to include reduction (#1124) · 701c2014

bpickrel authored Apr 11, 2022

Change the "scatter" struct and op to a base/child set of three: scatter_none, scatter_add, scatter_mul, to mirror ONNX's ScatterElements op and its three reduction options. (The ONNX Scatter op is deprecated and is equivalent to scatter_none.)

Provides both a reference op and an update to ONNX parsing. Tests were updated and a new test case was added.
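
The three reduction modes can be sketched in 1-D as follows. `scatter_elements` is a hypothetical helper for illustration and is not the reference op itself. The modes differ only in how an update combines with the value already at the target index, which matters most when indices repeat.

```python
def scatter_elements(data, indices, updates, reduction="none"):
    """1-D scatter: write, add, or multiply `updates` into a copy of `data`."""
    out = list(data)
    for idx, upd in zip(indices, updates):
        if reduction == "none":
            out[idx] = upd
        elif reduction == "add":
            out[idx] += upd
        elif reduction == "mul":
            out[idx] *= upd
        else:
            raise ValueError("unknown reduction: " + reduction)
    return out


data, indices, updates = [1, 2, 3, 4], [1, 1, 3], [10, 20, 5]
print(scatter_elements(data, indices, updates, "none"))  # [1, 20, 3, 5]
print(scatter_elements(data, indices, updates, "add"))   # [1, 32, 3, 9]
print(scatter_elements(data, indices, updates, "mul"))   # [1, 400, 3, 20]
```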

701c2014

fix a bug in create tensor_view with vec data type (#1155) · 3c301efa

Shucai Xiao authored Apr 11, 2022

When creating a tensor_view with a vector data type, the last dimension of the shape should be divided by the vec_size.
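
As a quick illustration of the shape arithmetic: viewing a tensor as vectors of `vec_size` elements shrinks the last dimension by that factor. The helper name and the divisibility check are assumptions for this sketch, not taken from the fix.

```python
def vectorized_lens(lens, vec_size):
    """Lengths of a view whose elements are vectors of `vec_size` scalars."""
    if lens[-1] % vec_size != 0:
        raise ValueError("last dimension must be divisible by vec_size")
    return lens[:-1] + [lens[-1] // vec_size]


print(vectorized_lens([2, 8], 4))  # [2, 2]
```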

3c301efa