Commits · 1784584e0bedc81d00add1b4c663c06f3c84f77b · gaoqiong / MIGraphX

17 Aug, 2022 1 commit
- Add jit layernorm fusion (#1301) · 1784584e
  Paul Fultz II authored Aug 16, 2022
  
  1784584e
25 Jul, 2022 1 commit

Add onnx mod operator (#1302) · 77e80b8e

Ted Themistokleous authored Jul 25, 2022

* Add in changes for onnx Mod operator

Initial operator for mod implementation and test cases for integer and floating based types.

Need to use fmod from stdlib for floating point types. half_float::half thankfully is specced to the use the existing std::fmod() call when looking at the half.hpp implementation.

fmod_flag should mirror the onnx fmod attribute. Right now using a floating point type without setting that on the user side to true will result in an exception.

Ref ticket #1283

77e80b8e

25 Jun, 2022 1 commit
- Use jit for contiguous operator (#1217) · b75c83d8
  Paul Fultz II authored Jun 24, 2022
```
* Jit contiguous
```
  b75c83d8
22 Jun, 2022 1 commit
- Update license files (#1248) · e44cecbc
  Ted Themistokleous authored Jun 22, 2022
```
Updated each source file in the repo with the existing license.
```
  e44cecbc
10 Jun, 2022 1 commit

Add vectorized reduce (#1202) · aa7ff911

Paul Fultz II authored Jun 09, 2022



Consolidate the vectorize and preload
Add vectorization to reduction
Co-authored-by: kahmed10 <15948690+kahmed10@users.noreply.github.com>

aa7ff911

20 May, 2022 1 commit

Rename pointwise ops (#1145) · 4a312201

kahmed10 authored May 20, 2022

For clarity on kernel names found when profiling. The new names are set to the order of the ops being compiled. For example: add + relu = add_relu_kernel.

4a312201

09 May, 2022 1 commit

Refactor vectorization and preloading for pointwise fusions (#1184) · ddbbe54b

Paul Fultz II authored May 09, 2022

Improves performance for add_gelu.  In bert it is 4x faster and for mul_add it is 50% faster than what we current have.

ddbbe54b

17 Apr, 2022 1 commit

Reduce with runtime compilation (#1150) · f9a5b81e

Paul Fultz II authored Apr 17, 2022

There is significant improvement on larger tensors with half almost 50% faster:

lens: [1024, 384, 768]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms
gpu::reduce_sum[axes={2}]: 1.73126ms
Also for non-trivial layouts this can sometimes be over 2x faster:

lens: [64, 1024, 768, 4]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms
gpu::reduce_sum[axes={1}]: 2.63375ms
Of course if the stride becomes larger this speed improvement diminishes due to poor memory access patterns. A lane_reduce instead of a block_reduce is needed for such type of kernels. I plan to address that in a future PR.

Finally, this also includes a MIGRAPHX_GPU_DUMP_ASM env variable which will print out the assembly when the kernel compiles.

f9a5b81e

29 Mar, 2022 1 commit

Refactor runtime compiled kernels to use the same compile_ops pipeline (#1125) · 661046c6

Paul Fultz II authored Mar 29, 2022

This adds the infrastructure so we can compile everything in parallel, whereas before only pointwise kernels were compiled in parallel. This will also directly integrate with lowering and the gpu-driver. The kernels for pointwise and roialign are using this infrastructure. Scatternd is not since it does require standard shape.

This also makes it easier to add new runtime compiled kernels in the future.

661046c6