- 17 Jun, 2022 2 commits
- 13 Jun, 2022 4 commits
- 10 Jun, 2022 1 commit
Paul Fultz II authored
Consolidate the vectorize and preload steps; add vectorization to reduction.
Co-authored-by: kahmed10 <15948690+kahmed10@users.noreply.github.com>
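To illustrate what vectorizing a reduction means, here is a minimal C++ sketch (not MIGraphX's kernel code) that keeps four partial sums in lockstep so the compiler can emit packed loads and adds; a GPU kernel does the same trick with float4-style vector types:

```cpp
#include <array>
#include <cstddef>
#include <iostream>
#include <vector>

// Illustrative sketch only: four independent accumulators let the adds
// happen in lockstep, which the compiler can turn into vector (packed)
// loads and adds.
float sum_vectorized(const std::vector<float>& x)
{
    std::array<float, 4> partial{};
    std::size_t n = x.size() / 4 * 4;
    for(std::size_t i = 0; i < n; i += 4)
        for(std::size_t j = 0; j < 4; ++j)
            partial[j] += x[i + j];
    float result = partial[0] + partial[1] + partial[2] + partial[3];
    for(std::size_t i = n; i < x.size(); ++i)
        result += x[i]; // scalar tail for the leftover elements
    return result;
}

int main()
{
    std::vector<float> x(1023, 1.0f);
    std::cout << sum_vectorized(x) << "\n"; // prints 1023
}
```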
- 09 Jun, 2022 2 commits
- 07 Jun, 2022 1 commit
Zhuoran Yin authored
Prioritize int8 over int8x4 when applicable; amend a return to a continue in the apply loop; add error handling in case int8x4 compilation fails.
Co-authored-by: Paul Fultz II <pfultz2@yahoo.com>
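A hedged sketch of the fallback pattern described above; compile_int8x4 and compile_int8 are hypothetical stand-ins, not MIGraphX API:

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical compilers, standing in for the int8x4 and int8 code paths.
std::string compile_int8x4() { throw std::runtime_error("int8x4 unsupported"); }
std::string compile_int8() { return "int8 kernel"; }

std::string compile_with_fallback()
{
    try
    {
        return compile_int8x4(); // try the packed format first
    }
    catch(const std::exception& e)
    {
        // Fall back instead of failing the whole compilation pass.
        std::cerr << "int8x4 failed (" << e.what() << "), using int8\n";
        return compile_int8();
    }
}

int main() { std::cout << compile_with_fallback() << "\n"; }
```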
- 03 Jun, 2022 1 commit
Paul Fultz II authored
Break up the gpu::code_object print to show the actual kernels...
gpu::code_object::add_kernel: 0.646121ms, 5%
gpu::code_object::mul_kernel: 0.623822ms, 5%
gpu::code_object::add_mul_erf_add_mul_mul_kernel: 0.498902ms, 4%
gpu::code_object::mul_add_kernel: 0.478352ms, 4%
- 02 Jun, 2022 1 commit
Paul Fultz II authored
- 26 May, 2022 1 commit
Paul Fultz II authored
* Upgrade to cppcheck 2.8
- 25 May, 2022 2 commits
- 24 May, 2022 5 commits
Paul authored
Paul authored
Paul Fultz II authored
* Improve applicable batched gemms for BERT
Paul Fultz II authored
Remove std references in runtime compilation, since these are not available when using hiprtc and the headers may not be available on the system.
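As a hedged illustration of the constraint (hypothetical helpers, not the actual MIGraphX source): code compiled at runtime by hiprtc cannot rely on host standard headers, so small utilities get defined locally instead of pulling in std::.

```cpp
// Hypothetical kernel-side helpers, for illustration only: when a
// translation unit is compiled with hiprtc, headers like <algorithm>
// may not exist on the system, so tiny replacements like these are
// defined instead of using std::min and std::max.
template <class T>
constexpr const T& local_min(const T& a, const T& b)
{
    return b < a ? b : a;
}

template <class T>
constexpr const T& local_max(const T& a, const T& b)
{
    return a < b ? b : a;
}

static_assert(local_min(2, 3) == 2, "works without any std headers");
static_assert(local_max(2, 3) == 3, "works without any std headers");

int main() {}
```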
Paul Fultz II authored
* Fuse gemm add with pointwise fusions
- 20 May, 2022 1 commit
kahmed10 authored
Rename kernels for clarity when profiling. The new names follow the order of the ops being compiled; for example, add + relu = add_relu_kernel.
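An illustrative sketch of the naming scheme described above (not the actual MIGraphX implementation): join the fused ops in compilation order and append "_kernel".

```cpp
#include <iostream>
#include <string>
#include <vector>

// Illustrative only: build a kernel name from the ops being fused,
// in the order they are compiled.
std::string make_kernel_name(const std::vector<std::string>& ops)
{
    std::string name;
    for(const auto& op : ops)
    {
        if(!name.empty())
            name += "_";
        name += op;
    }
    return name + "_kernel";
}

int main()
{
    // add + relu -> add_relu_kernel, matching the example in the commit.
    std::cout << make_kernel_name({"add", "relu"}) << "\n";
}
```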
- 18 May, 2022 3 commits
- 17 May, 2022 1 commit
shivadbhavsar authored
Updated variable names according to #1193
- 11 May, 2022 1 commit
Paul Fultz II authored
Fuse layernorm, and add a triadd_layernorm fusion. This is a preparatory performance boost.
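As a rough reference for what such a fusion computes, here is a scalar sketch (assuming the standard layernorm definition; this is not the fused GPU kernel): layernorm applied to the elementwise sum of three inputs, done in one pass instead of materializing a + b + c.

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Scalar reference for a triadd_layernorm-style fusion: normalize the
// elementwise sum of three inputs to zero mean and unit variance.
std::vector<float> triadd_layernorm(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    const std::vector<float>& c,
                                    float eps = 1e-12f)
{
    std::size_t n = a.size();
    std::vector<float> x(n);
    for(std::size_t i = 0; i < n; ++i)
        x[i] = a[i] + b[i] + c[i]; // the "triadd" part

    float mean = 0;
    for(float v : x)
        mean += v;
    mean /= n;

    float var = 0;
    for(float v : x)
        var += (v - mean) * (v - mean);
    var /= n;

    for(float& v : x)
        v = (v - mean) / std::sqrt(var + eps); // the layernorm part
    return x;
}

int main()
{
    auto y = triadd_layernorm({1, 2}, {0, 1}, {1, 1});
    std::cout << y[0] << " " << y[1] << "\n"; // about -1 and 1
}
```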
- 09 May, 2022 1 commit
Paul Fultz II authored
Improve performance for add_gelu. In BERT it is 4x faster, and for mul_add it is 50% faster than what we currently have.
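For reference, a scalar sketch of what an add_gelu fusion computes, assuming the exact erf-based GELU definition (consistent with the add_mul_erf_add_mul_mul kernel name in the profiling output above):

```cpp
#include <cmath>
#include <iostream>

// Scalar sketch of add + GELU, using the exact erf formulation:
// gelu(s) = 0.5 * s * (1 + erf(s / sqrt(2))), applied to s = x + y.
float add_gelu(float x, float y)
{
    float s = x + y;
    return 0.5f * s * (1.0f + std::erf(s / std::sqrt(2.0f)));
}

int main()
{
    std::cout << add_gelu(1.0f, 1.0f) << "\n"; // gelu(2) ~= 1.9545
}
```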
- 06 May, 2022 3 commits
Paul authored
Paul authored
Chris Austen authored
Move the CI containers to ROCm 5.0.2, upgrade to Ubuntu 20.04, and free up some more file space in GitHub Actions environments.
- 05 May, 2022 7 commits
- 29 Apr, 2022 1 commit
turneram authored
Add ref and gpu implementations for ONNX op GatherND.
Resolves #1032
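A simplified scalar sketch of GatherND semantics, restricted to batch_dims = 0 and index tuples that address single elements (the real ONNX op also supports partial tuples that gather whole slices):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Simplified GatherND reference: data is a flattened row-major tensor
// with the given shape, and each entry of indices is a full coordinate
// tuple selecting one element of data.
std::vector<float> gather_nd(const std::vector<float>& data,
                             const std::vector<std::size_t>& shape,
                             const std::vector<std::vector<std::size_t>>& indices)
{
    // Row-major strides for the data shape.
    std::vector<std::size_t> strides(shape.size(), 1);
    for(std::size_t i = shape.size() - 1; i > 0; --i)
        strides[i - 1] = strides[i] * shape[i];

    std::vector<float> out;
    for(const auto& idx : indices)
    {
        std::size_t offset = 0;
        for(std::size_t i = 0; i < idx.size(); ++i)
            offset += idx[i] * strides[i];
        out.push_back(data[offset]);
    }
    return out;
}

int main()
{
    // 2x2 data = [[0, 1], [2, 3]]; gather the elements at (0,0) and (1,1).
    std::vector<float> data = {0, 1, 2, 3};
    auto out = gather_nd(data, {2, 2}, {{0, 0}, {1, 1}});
    std::cout << out[0] << " " << out[1] << "\n"; // 0 3
}
```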
- 27 Apr, 2022 1 commit
Paul Fultz II authored
With reductions such as {2048, 2, 1456} on axis 1, this is 23x faster than using our new block_reduce, and it's even over 100x faster than our original reduce_sum:
# lane     gpu::code_object[code_object=13736,symbol_name=kernel,global=2981888,local=1024,]: 0.0672928ms
# block    gpu::code_object[code_object=13800,symbol_name=kernel,global=39321600,local=64,]: 1.46072ms
# original gpu::reduce_sum[axes={1}]: 6.73456ms
There is some basic logic to pick between lane and block reduce automatically.
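A hedged sketch of the kind of selection logic described; the names and threshold here are illustrative, not MIGraphX's actual rule. The intuition: when each output only has a few elements to reduce and there are many outputs, giving each lane (thread) its own reduction avoids cross-thread synchronization, while large reductions amortize a cooperative block reduce.

```cpp
#include <cstddef>
#include <iostream>

enum class reduce_strategy { lane, block };

// Illustrative heuristic only: small per-output reductions with many
// independent outputs use one thread per output (lane reduce); large
// reductions dedicate a thread block per output (block reduce).
reduce_strategy pick_reduce(std::size_t reduce_elements, std::size_t outputs)
{
    const std::size_t small_reduction = 256; // hypothetical threshold
    if(reduce_elements <= small_reduction && outputs >= 1024)
        return reduce_strategy::lane;
    return reduce_strategy::block;
}

int main()
{
    // {2048, 2, 1456} reduced on axis 1: 2 elements per output and
    // 2048 * 1456 outputs, so lane reduce wins, as in the timings above.
    auto s = pick_reduce(2, 2048u * 1456u);
    std::cout << (s == reduce_strategy::lane ? "lane" : "block") << "\n";
}
```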
- 17 Apr, 2022 1 commit
Paul Fultz II authored
There is significant improvement on larger tensors, with half almost 50% faster:
lens: [1024, 384, 768]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms
gpu::reduce_sum[axes={2}]: 1.73126ms
Also, for non-trivial layouts this can sometimes be over 2x faster:
lens: [64, 1024, 768, 4]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms
gpu::reduce_sum[axes={1}]: 2.63375ms
Of course, if the stride becomes larger this speed improvement diminishes due to poor memory access patterns; a lane_reduce instead of a block_reduce is needed for that type of kernel. I plan to address that in a future PR.
Finally, this also includes a MIGRAPHX_GPU_DUMP_ASM env variable which will print out the assembly when the kernel compiles.
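To make the memory-access point concrete, here is a minimal CPU sketch (illustrative only) of why reducing along a large-stride axis is slower: consecutive iterations touch addresses far apart, defeating caching, and a GPU kernel has analogous coalescing concerns across threads.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Illustrative only: sum along the last axis of a row-major [rows, cols]
// tensor (stride-1, cache-friendly reads) versus along the first axis
// (stride = cols, each step jumps far in memory).
void reduce_last_axis(const std::vector<float>& x, std::size_t rows,
                      std::size_t cols, std::vector<float>& out)
{
    for(std::size_t r = 0; r < rows; ++r)
    {
        float acc = 0;
        for(std::size_t c = 0; c < cols; ++c)
            acc += x[r * cols + c]; // stride-1 reads
        out[r] = acc;
    }
}

void reduce_first_axis(const std::vector<float>& x, std::size_t rows,
                       std::size_t cols, std::vector<float>& out)
{
    for(std::size_t c = 0; c < cols; ++c)
    {
        float acc = 0;
        for(std::size_t r = 0; r < rows; ++r)
            acc += x[r * cols + c]; // stride-`cols` reads
        out[c] = acc;
    }
}

int main()
{
    std::size_t rows = 4, cols = 3;
    std::vector<float> x(rows * cols, 1.0f);
    std::vector<float> a(rows), b(cols);
    reduce_last_axis(x, rows, cols, a);
    reduce_first_axis(x, rows, cols, b);
    std::cout << a[0] << " " << b[0] << "\n"; // 3 4
}
```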