- 20 May, 2022 1 commit
-
-
kahmed10 authored
For clarity on kernel names found when profiling. The new names are set to the order of the ops being compiled. For example: add + relu = add_relu_kernel.
-
- 17 May, 2022 1 commit
-
-
shivadbhavsar authored
Updated variable names according to #1193
-
- 11 May, 2022 1 commit
-
-
Paul Fultz II authored
Fuse layernorm and added triadd_layernorm fusion. This is a prep performance booster
-
- 09 May, 2022 1 commit
-
-
Paul Fultz II authored
Improves performance for add_gelu. In bert it is 4x faster and for mul_add it is 50% faster than what we current have.
-
- 06 May, 2022 1 commit
-
-
Chris Austen authored
Move to CI containers to rocm 5.0.2 upgrade to 20.04 free up some more file space in github action environments
-
- 05 May, 2022 1 commit
-
-
Paul Fultz II authored
Fixes the #error when using cppcheck. This no longer suppresses cppcheck errors when including those errors. This fixes the cppcheck errors that was there already.
-
- 29 Apr, 2022 1 commit
-
-
turneram authored
Add ref and gpu implementations for ONNX op GatherND Resolves #1032
-
- 27 Apr, 2022 1 commit
-
-
Paul Fultz II authored
With reductions such as {2048, 2, 1456} on axes 1, this is 23x faster than using our new block_reduce, and its even over 100x faster than our original reduce_sum: # lane gpu::code_object[code_object=13736,symbol_name=kernel,global=2981888,local=1024,]: 0.0672928ms # block gpu::code_object[code_object=13800,symbol_name=kernel,global=39321600,local=64,]: 1.46072ms # original gpu::reduce_sum[axes={1}]: 6.73456ms There is some basic logic to pick between lane and block reduce automatically.
-
- 17 Apr, 2022 1 commit
-
-
Paul Fultz II authored
There is significant improvement on larger tensors with half almost 50% faster: lens: [1024, 384, 768] gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms gpu::reduce_sum[axes={2}]: 1.73126ms Also for non-trivial layouts this can sometimes be over 2x faster: lens: [64, 1024, 768, 4] gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms gpu::reduce_sum[axes={1}]: 2.63375ms Of course if the stride becomes larger this speed improvement diminishes due to poor memory access patterns. A lane_reduce instead of a block_reduce is needed for such type of kernels. I plan to address that in a future PR. Finally, this also includes a MIGRAPHX_GPU_DUMP_ASM env variable which will print out the assembly when the kernel compiles.
-
- 14 Apr, 2022 1 commit
-
-
bpickrel authored
Issue 1127 Updates the math.hpp header file to perform overloads of various standard functions (ops) for the hip half2 type. The half2 type is two 16-bit floats packed into a 32-bit number and therefore the overloads act on vectors of sizes that are multiples of 2. They are invoked in runtime compilation any time one of the ops is called on a tensor declared with the data type shape::half_type. Defined new template, made instances of the template for those math operations that the hip library contains, added verify tests for the sqrt operator for three cases: tensor size not divisible by 2 tensor size divisible by 2 but not by 4 tensor size divisible by 4
-
- 12 Apr, 2022 1 commit
-
-
Paul Fultz II authored
out-of-bounds access when generate uses nonpacked tensors and add some additional asserts for gpu memory.
-
- 11 Apr, 2022 2 commits
-
-
bpickrel authored
Change the "scatter" struct and op to a base/child set of three: scatter_none, scatter_add, scatter_mul to mirror Onnx' ScatterElements op. and its three reduction options. (Onnx Scatter op is deprecated and is equivalent to scatter_none.) Provides both a reference op. and update to Onnx parsing. Tests updated and new test case added.
-
Shucai Xiao authored
When create a tensor_view with vector date type, the last dimension of the shape should be divided by the vec_size.
-
- 29 Mar, 2022 1 commit
-
-
Paul Fultz II authored
This adds the infrastructure so we can compile everything in parallel, whereas before only pointwise kernels were compiled in parallel. This will also directly integrate with lowering and the gpu-driver. The kernels for pointwise and roialign are using this infrastructure. Scatternd is not since it does require standard shape. This also makes it easier to add new runtime compiled kernels in the future.
-
- 28 Mar, 2022 1 commit
-
-
Paul Fultz II authored
* Use ccache for runtime compilation
-
- 18 Mar, 2022 1 commit
-
-
turneram authored
Add exclusive and reverse modes to gpu implementation of prefix_scan_sum, which completes support for ONNX op CumSum
-
- 15 Mar, 2022 1 commit
-
-
Paul Fultz II authored
This adds iterators to tensor_view, which can allow kernels to work with non-standard shapes like for roialign. To improve the performance of indexing when using the iterators, the shape class was updated to use integral_constants since the compiler doesn't always fold the const values. An integral_constant will at least enforce that in the AST. Finally, since index calculations with single integers are improved, I also updated pointwise to use single index rather than multi index. There is about 4% improvement in some cases.
-
- 14 Mar, 2022 1 commit
-
-
Shucai Xiao authored
change max number of groups in a kernel to 1B for greater performance
-
- 04 Mar, 2022 1 commit
-
-
bpickrel authored
Changed the pooling values for two structures from strings to specialized enum classes. Many test and operator parsing changes to support this. Introduces one new source file, op_enums.cpp.
-
- 03 Mar, 2022 3 commits
-
-
Paul Fultz II authored
Boost the max number of workgroups for pointwise ops by matching what we are doing in launch.hpp
-
kahmed10 authored
better performance doing it this way
-
turneram authored
Add onnx parser and ref and gpu implementations of ONNX op ScatterND
-
- 02 Mar, 2022 2 commits
-
-
Charlie Lin authored
Implements the IsNaN operator, ref, gpu, and onnx parser.
-
bpickrel authored
Update the base version of clang-format from 5.0 to 10.0
-
- 25 Feb, 2022 1 commit
-
-
Paul Fultz II authored
wrapped in a any_ptr class so the type can be checked at runtime for a mismatch.
-
- 24 Feb, 2022 1 commit
-
-
Paul Fultz II authored
Make doc/CMakeLists.txt standalone Switch to use rocm-cmake modules for document generation Add CONFIGURE_DEPENDS to file(GLOB) so it will update without an explicit cmake run Add STRINGS property for build type to make it easier to switch build types with ccmake Various fixes and improvements
-
- 09 Feb, 2022 1 commit
-
-
Paul Fultz II authored
There is now a MIGRAPHX_DISABLE_POINTWISE_FUSION to disable it
-
- 08 Feb, 2022 2 commits
-
-
Paul Fultz II authored
This causes incorrect memory coloring, which was causing the accuracy failures in the vision model when enabling the pointwise fusions. Resnet50, inceptionv3 and inceptionv4 do verify now in the driver.
-
Paul Fultz II authored
Enforce types to avoid compilation error in pointwise fusions This fixes compile failure: gpt-2, fp16 on Navi
-
- 28 Jan, 2022 1 commit
-
-
Paul Fultz II authored
* Enable auto vectorization * Handle vector types with convert function * Dont vectorize when it will cause problems with preload
-
- 27 Jan, 2022 1 commit
-
-
Umang Yadav authored
allow nonstd shape for the arg ops, non-standard shapes include broadcast, slice and transpose
-
- 21 Jan, 2022 1 commit
-
-
Paul Fultz II authored
* Improve handling of generator expressions when getting the flags for hip
-
- 10 Jan, 2022 1 commit
-
-
Paul Fultz II authored
* Add matcher for conv_bias pointwise * Add fusion op
-
- 09 Dec, 2021 1 commit
-
-
Shucai Xiao authored
Changed the number of threads in a block from 256 to 128 Increased the max number of blocks in the kernel from 256 to 1M. For the case that the axis is the last dimension, we removed the computation of index since it is not required. With these change, we can get about 2x speedup compared to the develop branch for the softmax op used in the BertSquad model.
-
- 08 Dec, 2021 1 commit
-
-
Paul Fultz II authored
-
- 07 Dec, 2021 1 commit
-
-
Paul Fultz II authored
simple variable rename
-
- 02 Dec, 2021 1 commit
-
-
Paul Fultz II authored
Fix pointwise compile error with half sqrt
-
- 30 Nov, 2021 2 commits
-
-
turneram authored
Fix whitespace bug in fusable_conv matcher and add unit test
-
Paul Fultz II authored
-
- 24 Nov, 2021 1 commit
-
-
Paul Fultz II authored
* Check jit kernels files with clang-tidy
-