- 26 May, 2022 1 commit
Paul Fultz II authored
* Upgrade to cppcheck 2.8

- 09 May, 2022 1 commit
Paul Fultz II authored
Improves performance for add_gelu. In BERT, add_gelu is 4x faster, and mul_add is 50% faster than what we currently have.
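The commit does not spell out what add_gelu is. Assuming it denotes an elementwise add fused with a following GELU, the CPU sketch below contrasts the unfused and fused forms. It uses the erf-based GELU, gelu(x) = 0.5·x·(1 + erf(x/√2)). All function names are hypothetical, and this is not MIGraphX's GPU kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Exact (erf-based) GELU; assumed definition for this sketch.
inline float gelu(float x)
{
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// Unfused: the sum goes to a temporary buffer and is read back in a second pass.
std::vector<float> add_then_gelu(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    std::vector<float> tmp(a.size());
    for(std::size_t i = 0; i < a.size(); ++i)
        tmp[i] = a[i] + b[i];
    std::vector<float> out(tmp.size());
    for(std::size_t i = 0; i < tmp.size(); ++i)
        out[i] = gelu(tmp[i]);
    return out;
}

// Fused: one pass over the inputs, no temporary buffer.
std::vector<float> add_gelu(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for(std::size_t i = 0; i < a.size(); ++i)
        out[i] = gelu(a[i] + b[i]);
    return out;
}
```

Both functions return the same values. The fused form only avoids writing and re-reading the intermediate sum, which is the kind of memory traffic a fused GPU kernel saves.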

- 05 May, 2022 1 commit
Paul Fultz II authored
Fixes the #error when using cppcheck. This no longer suppresses cppcheck errors when including those headers. It also fixes the cppcheck errors that were already there.

- 17 Apr, 2022 1 commit
Paul Fultz II authored
There is a significant improvement on larger tensors with half precision, almost 50% faster:

lens: [1024, 384, 768]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms
gpu::reduce_sum[axes={2}]: 1.73126ms

Also, for non-trivial layouts this can sometimes be over 2x faster:

lens: [64, 1024, 768, 4]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms
gpu::reduce_sum[axes={1}]: 2.63375ms

Of course, if the stride becomes larger, this speed improvement diminishes due to poor memory access patterns. A lane_reduce instead of a block_reduce is needed for this type of kernel. I plan to address that in a future PR.

Finally, this also adds a MIGRAPHX_GPU_DUMP_ASM environment variable, which prints out the assembly when the kernel compiles.
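To make the stride remark concrete, the sketch below computes packed row-major strides for the two benchmark shapes, assuming both inputs are standard (packed) tensors. The helpers standard_strides and reduction_stride are hypothetical, not MIGraphX functions. Reducing along axis 2 of [1024, 384, 768] steps through memory with stride 1, while reducing along axis 1 of [64, 1024, 768, 4] steps with stride 768 × 4 = 3072 elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: strides (in elements) of a packed, row-major
// ("standard") tensor with the given lengths. The last axis has stride 1.
std::vector<std::size_t> standard_strides(const std::vector<std::size_t>& lens)
{
    std::vector<std::size_t> strides(lens.size(), 1);
    for(std::size_t i = lens.size(); i > 1; --i)
        strides[i - 2] = strides[i - 1] * lens[i - 1];
    return strides;
}

// Distance in memory between consecutive elements visited while reducing
// along `axis`. When threads of one block cooperate on a single output,
// neighbouring threads read elements this far apart, so a large value
// means poorly coalesced loads.
std::size_t reduction_stride(const std::vector<std::size_t>& lens, std::size_t axis)
{
    assert(axis < lens.size());
    return standard_strides(lens)[axis];
}
```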

- 29 Mar, 2022 1 commit
Paul Fultz II authored
This adds the infrastructure to compile everything in parallel, whereas before only pointwise kernels were compiled in parallel. It also integrates directly with lowering and the gpu-driver. The pointwise and roialign kernels use this infrastructure; scatternd does not, since it requires a standard shape. This also makes it easier to add new runtime-compiled kernels in the future.
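The general idea of compiling kernels in parallel can be sketched with std::async: each compile job runs as a separate task, and results are collected in input order. This is an illustration only; kernel_source and compile_kernel are hypothetical stand-ins, not MIGraphX's compilation API.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <vector>

// Hypothetical description of one runtime-compiled kernel.
struct kernel_source
{
    std::string name;
    std::string code;
};

// Hypothetical stand-in for an expensive compile step (e.g. invoking hiprtc).
// Here it just produces a placeholder "binary" string.
std::string compile_kernel(const kernel_source& k)
{
    return k.name + ":" + std::to_string(k.code.size());
}

// Launch every compile job concurrently, then wait for them in order, so the
// i-th result always corresponds to the i-th input regardless of finish order.
std::vector<std::string> compile_all(const std::vector<kernel_source>& kernels)
{
    std::vector<std::future<std::string>> jobs;
    jobs.reserve(kernels.size());
    for(const auto& k : kernels)
        jobs.push_back(std::async(std::launch::async, [&k] { return compile_kernel(k); }));
    std::vector<std::string> results;
    results.reserve(jobs.size());
    for(auto& job : jobs)
        results.push_back(job.get()); // blocks until that job finishes
    assert(results.size() == kernels.size());
    return results;
}
```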

- 28 Mar, 2022 1 commit
Paul Fultz II authored
* Use ccache for runtime compilation

- 03 Mar, 2022 1 commit
Paul Fultz II authored
Boost the max number of workgroups for pointwise ops by matching what we are doing in launch.hpp

- 28 Oct, 2021 1 commit
Shucai Xiao authored
GPU implementation of the roialign operator, using the jit approach to reduce the lib size.

- 10 Aug, 2021 1 commit
Paul Fultz II authored
* Add hiprtc compile option
* Add cross compile test
* Update error reporting
* Add tests for errors and warnings
* Fix tidy warning
* Add comment to ifdefs
* Skip null character at end of log
* Assert there is null at the end
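The last two bullets concern a compiler log buffer that ends with a null character. A minimal sketch of that handling, assuming the log has already been copied into a char buffer (the hiprtc calls that fill it are omitted, and log_to_string is a hypothetical helper name):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: turn a compiler log buffer that ends with '\0'
// into a std::string without the terminator.
std::string log_to_string(const std::vector<char>& buffer)
{
    if(buffer.empty())
        return {};
    // The buffer is expected to be null terminated.
    assert(buffer.back() == '\0');
    // Copy everything except the final null character.
    return std::string(buffer.begin(), buffer.end() - 1);
}
```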

- 05 Aug, 2021 1 commit
Paul Fultz II authored
* Add method to compile pointwise
* Formatting
* Add lambda
* Add semicolon
* Rename variable
* Add driver to run jit kernels
* Formatting
* Add context
* Formatting
* Make separate driver folder
* Add more general gpu driver
* Formatting
* Print out wall time
* Formatting
* Run multiple times and skip first run
* Formatting
* Separate time_op
* Run an op for comparison
* Formatting
* Add debug asserts
* Formatting
* Change parameter name
* Formatting
* Fix argument order
* Formatting
* Add preloading
* Formatting
* Allow a different data type
* Formatting
* Pipeline transformations
* Formatting
* Add vectorization
* Formatting
* Reduce dims
* Formatting
* Compile with launch params as constant
* Formatting
* Make sure buffer can be vectorized
* Formatting
* Enable vectorization and preloading
* Formatting
* Add print header
* Formatting
* Avoid allocating too much LDS
* Formatting
* Add some vec functions to a separate header
* Formatting
* Add stride loops
* Formatting
* Improve the transform pipeline
* Formatting
* Add const
* Fix shape check
* Formatting
* Just check stride axis is zero
* Remove extra finc_vector_axis overload
* Simplify some more functions
* Formatting
* Remove some more extra functions
* Formatting
* Simplify more decltypes
* Add another const
* Fix test
* Get buffer pointer differently for older compilers

Co-authored-by: Shucai Xiao <shucai@gmail.com>
Co-authored-by: Chris Austen <causten@users.noreply.github.com>
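Two of the bullets above, "Reduce dims" and the check that a buffer can be vectorized, refer to common pointwise-kernel techniques. The sketch below shows one plausible form of each under stated assumptions: adjacent dimensions are merged when they are contiguous with each other, and a buffer is treated as vectorizable by n when its innermost dimension is packed (stride 1) and its length is a multiple of n. The shape struct and both functions are hypothetical, not MIGraphX's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal shape: per-dimension lengths and strides (in elements).
struct shape
{
    std::vector<std::size_t> lens;
    std::vector<std::size_t> strides;
};

// Merge neighbouring dimensions that are laid out contiguously, i.e. where
// stride[i] == lens[i + 1] * stride[i + 1]. Fewer dimensions means cheaper
// index arithmetic in the kernel.
shape reduce_dims(const shape& s)
{
    assert(s.lens.size() == s.strides.size());
    if(s.lens.empty())
        return s;
    shape r{{s.lens.front()}, {s.strides.front()}};
    for(std::size_t i = 1; i < s.lens.size(); ++i)
    {
        if(r.strides.back() == s.lens[i] * s.strides[i])
        {
            // Contiguous with the previous dimension: fold it in.
            r.lens.back() *= s.lens[i];
            r.strides.back() = s.strides[i];
        }
        else
        {
            r.lens.push_back(s.lens[i]);
            r.strides.push_back(s.strides[i]);
        }
    }
    return r;
}

// Assumed rule: vectorizable by n if the innermost dimension is packed
// and its length is divisible by n.
bool can_vectorize(const shape& s, std::size_t n)
{
    if(s.lens.empty() || n == 0)
        return false;
    return s.strides.back() == 1 && s.lens.back() % n == 0;
}
```

Collapsing dimensions first means the vectorization check sees the longest possible packed innermost run, so more shapes qualify.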

- 19 Apr, 2021 1 commit
Paul Fultz II authored
* Add definitions for all pointwise operators
* Formatting
* Add cpp generator class
* Formatting
* Move compilation to core
* Formatting
* Add clock to tmp name
* Add dynamic loader
* Formatting
* Add tests for code gen
* Formatting
* Add test for literals
* Formatting
* Use with_char
* Add missing header
* Fix mismerge
* Ignore tidy warning
* Fix gcc 5 errors
* Apply fixits
* Skip signed bitwise of status
* Remove unused parameters
* Explicitly add c++14 flag
* Fix tidy warning
* Remove .o files

- 26 Mar, 2021 1 commit
Paul Fultz II authored
* Add code object op
* Formatting
* Add more value tests
* Formatting
* Fix from_value conversion from binary
* Formatting
* Don't use offload copy
* Remove iostream header
* Fix compilation errors
* Formatting
* Rename var
* Add missing files
* Formatting
* Remove duplicate variable
* Remove comment
* Template the function so SFINAE will work
* Formatting
* Use template specialization since ADL is broken on hcc
* Formatting
* Annotate the constructor with HD for hcc
* Make variable const

Co-authored-by: mvermeulen <5479696+mvermeulen@users.noreply.github.com>

- 18 Jan, 2021 1 commit
Paul Fultz II authored
Co-authored-by: mvermeulen <5479696+mvermeulen@users.noreply.github.com>

- 09 Nov, 2020 1 commit
Paul Fultz II authored
* Add compiler flags
* Add missing include
* Add filesystem header
* Formatting
* Add tmp_dir to run
* Formatting
* Kernel compilation and launching
* Formatting
* Separate pack_args
* Formatting
* Add alignment tests
* Formatting
* Add compile test
* Formatting
* Complete compile test
* Formatting
* Use is_regular_file free function
* Fix is_regular_file call
* Fix tidy issues
* Fix tidy
* Fix tidy issue
* Print size in read_buffer to debug issue on Jenkins
* Add hip flags before src file
* Fix reading output files
* Fix unused variable warning
* Formatting
* Formatting
* Disable tidy check

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
Co-authored-by: mvermeulen <5479696+mvermeulen@users.noreply.github.com>
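The pack_args and alignment-test bullets concern packing kernel arguments into a single byte buffer. A minimal sketch, assuming each argument is placed at the next offset that satisfies its natural alignment (alignof), as it would be laid out as a struct member. pack_args here is a hypothetical free function, not MIGraphX's actual signature, and the fold expression requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Round offset up to the next multiple of align (which must be a power of two).
inline std::size_t align_up(std::size_t offset, std::size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

// Append one argument at the next offset that satisfies alignof(T).
template <class T>
void pack_arg(std::vector<char>& buffer, const T& value)
{
    static_assert(std::is_trivially_copyable<T>::value, "kernel arguments must be trivially copyable");
    const std::size_t offset = align_up(buffer.size(), alignof(T));
    buffer.resize(offset + sizeof(T)); // new bytes, including padding, are zero
    std::memcpy(buffer.data() + offset, &value, sizeof(T));
}

// Hypothetical pack_args: lay the arguments out in order, each naturally aligned.
template <class... Ts>
std::vector<char> pack_args(const Ts&... values)
{
    std::vector<char> buffer;
    (pack_arg(buffer, values), ...);
    return buffer;
}
```

For example, packing a char followed by a 32-bit integer yields 8 bytes: the char at offset 0, three zero padding bytes, then the integer at offset 4.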