- 05 Jul, 2022 2 commits
-
-
Paul Fultz II authored
* Add softmax kernel
-
Paul Fultz II authored
This reorders the transposes across slice to improve horizontal fusion for contiguous. This also improves eliminate_contiguous to remove contiguous better across splits.
-
- 03 Jul, 2022 1 commit
-
-
Paul Fultz II authored
* Add mlir c api * Formatting * Create a type attribute * Formatting * Parse module * Formatting * Add mlir dump function * Add test case * Formatting * Fix tidy issues * Update mlit version * Update to newer mlir * Format * Move mlir to the gpu and update the test * Formatting * Fix bug when appending module * Format * Remove old cmake flag * Update message * Add return * Format * Add mlir_compile * Format * Register dialect * Handle unsinged integers * Dont provide output for return instruction * Format * Add code to insert memrefs * Format * Add mlir verification * Formatting * Enable pointwise_fusion * Disable eliminate_data_type * Set kernal name * Format * Fix device name * Formatting * Fix output arg * Format * Updates * Upate hash * Add fuse_mlir pass * Format * Add fuse mlir * Format * Update mlir * Sort parameter names * Format * Reenable disabled passes * Remove old mlir conv * Remove asym default padding * Add more verbose tracing * Format * Fix compilation errors * Format * Whitelist operators * Format * Add namespace * Format * Update triple * Format * Use func dialect * Format * Use func.return * Format * Upgrade mlir version * Add comment * Handle symetrical padding * Format * Cleanup debug output * Format * List failed tests * Move mlir compile to jit pipeline * Format * Update version * Add source locations * Format * Correctly add module * Format * Update failed tests * Fix failures when mlir is disabled * Format * Update mlir version * Check type for fp32 * Format * Remove failed test * Update mlir in driver * Tidy fixes * Foramt * Tidy fixes * Format * Fix const * Remove from requirements * Fix cmake version * Fix tidy warning * Use another ifdef * Fix tidy * Other tidy fix * Format * Update hash * Add missing license files * Format * Format * Fix fnction name
-
- 30 Jun, 2022 1 commit
-
-
Paul Fultz II authored
This is an extension to insert_module_instructions, but instead of just inserting from a module, it can insert a range or a vector of instructions.
-
- 29 Jun, 2022 2 commits
-
-
Charlie Lin authored
Allows PyTorch converted version of SSD-resnet34 to work
-
Paul Fultz II authored
Compiles significantly faster than constructing all the objects. It also reduces recompiles as well.
-
- 26 Jun, 2022 1 commit
-
-
Paul Fultz II authored
* Add function to get a module tree * Get parent module in the pass manager
-
- 25 Jun, 2022 2 commits
-
-
Brian Pickrell authored
One-line fix to register the op miopen_fusion. This error was causing loading of compiled model files (*.mxr) to fail.
-
Paul Fultz II authored
* Jit contiguous
-
- 24 Jun, 2022 1 commit
-
-
Umang Yadav authored
Adds compute_method for the experimental custom ops. Adds a test for the same using HIP APIs. Depends on #1183 Solves #1101
-
- 23 Jun, 2022 1 commit
-
-
kahmed10 authored
* remove eliminate workspace * remove sync device and other tags
-
- 22 Jun, 2022 1 commit
-
-
Ted Themistokleous authored
Updated each source file in the repo with the existing license.
-
- 20 Jun, 2022 1 commit
-
-
Zhuoran Yin authored
* Fixing misspelled macro Co-authored-by:Paul Fultz II <pfultz2@yahoo.com>
-
- 17 Jun, 2022 3 commits
-
-
Umang Yadav authored
* remove code for allocation of C param in dot lowering * formatting Co-authored-by:Paul Fultz II <pfultz2@yahoo.com>
-
Ted Themistokleous authored
* [#935] Update tf_parser to have add_common_op() for parse_relu6 Similar to that of the onnx_parser.cpp add a add_common_op template and functionality to support clip based operations. This is done so clip operations can be guarenteed to have the same dimensions. * fixup! [#935] Update tf_parser to have add_common_op() for parse_relu6 * fixup! fixup! [#935] Update tf_parser to have add_common_op() for parse_relu6 * fixup! fixup! fixup! [#935] Update tf_parser to have add_common_op() for parse_relu6 * fixup! fixup! fixup! fixup! [#935] Update tf_parser to have add_common_op() for parse_relu6 * Formatting * fixup! Formatting Co-authored-by:
Umang Yadav <29876643+umangyadav@users.noreply.github.com> Co-authored-by:
Paul Fultz II <pfultz2@yahoo.com>
-
kahmed10 authored
* add allocate op header * formatting * add replace_allocate pass * formatting * move output param to remove_allocate pass * formatting * fix bugs in replace_allocate pass * formatting * fix verify if tests * formatting * move if op logic * formatting * cleanup lowering * cleanup lowering * formatting * fix tidy * formatting * fix tidy * add cpu allocate check * formatting * change cpu allocate in pass * formatting * add some tests for replace_allocate pass * formatting * pass by ref * fix run_pass * formatting * update variable name for module * update dce to use contains() and fix tidy * formatting * update cppcheck * add if test * formatting * add if test * rename var to mod_output_names * formatting * remove conditional * update allocate op and tests * formatting * update replace_allocate tests * update create_output_names() and conditional in replace_allocate * formatting * remove extra variable in replace_allocate * update tools script for allocation_model Co-authored-by:
Umang Yadav <29876643+umangyadav@users.noreply.github.com> Co-authored-by:
Chris Austen <causten@users.noreply.github.com> Co-authored-by:
Paul Fultz II <pfultz2@yahoo.com>
-
- 16 Jun, 2022 1 commit
-
-
Charlie Lin authored
* Use custom distance function * Pass module, skip order check if other module * Change other valid() * Remove unnecessary declaration * test multiple module dependency * Refactor to make more clear * Code cleanup * Simplify fix * Test EXPECT Co-authored-by:Paul Fultz II <pfultz2@yahoo.com>
-
- 10 Jun, 2022 1 commit
-
-
Paul Fultz II authored
Consolidate the vectorize and preload Add vectorization to reduction Co-authored-by:kahmed10 <15948690+kahmed10@users.noreply.github.com>
-
- 07 Jun, 2022 1 commit
-
-
Zhuoran Yin authored
prioritizing int8 over int8x4 when it is applicable Amend return to continue in apply loop Adding error handling in case int8x4 compilation failed Co-authored-by:Paul Fultz II <pfultz2@yahoo.com>
-
- 03 Jun, 2022 1 commit
-
-
Paul Fultz II authored
Break up the gpu::code_object print to show the actual kernels... gpu::code_object::add_kernel: 0.646121ms, 5% gpu::code_object::mul_kernel: 0.623822ms, 5% gpu::code_object::add_mul_erf_add_mul_mul_kernel: 0.498902ms, 4% gpu::code_object::mul_add_kernel: 0.478352ms, 4%
-
- 02 Jun, 2022 1 commit
-
-
Paul Fultz II authored
-
- 30 May, 2022 1 commit
-
-
shivadbhavsar authored
Following up on issue #1166 and PR #1220. Using the same approach as in #1220 for parallelizing the eval calls, we can significantly reduce the time spent on eliminate_contiguous pass.
-
- 26 May, 2022 2 commits
-
-
shivadbhavsar authored
Addressing issue #1166 - propagate_constant pass currently uses a recursive approach to find all instructions in a module that can be evaluated to a literal and performs the replacement in the same call. New approach: Perform single pass though instructions in the module to determine which instructions can be evaluated Evaluate selected instructions in parallel Replace the selected instructions with the corresponding literal
-
Paul Fultz II authored
* Upgrade to cppcheck 2.8
-
- 24 May, 2022 4 commits
-
-
Paul Fultz II authored
* Improve applicable batched gemms for bert
-
Paul Fultz II authored
Remove std references in runtime compilation since these are not available when using hiprtc and the headers may not be available on the system
-
Paul Fultz II authored
* Fuse gemm add with pointwise fusions
-
shivadbhavsar authored
As described in #1196, the ONNX mean parser does not work correctly for integral types. This update fixes the issue by handling integral types separately, where summation is performed before division. Additional test cases have also been added for handling integral types.
-
- 20 May, 2022 2 commits
-
-
kahmed10 authored
For clarity on kernel names found when profiling. The new names are set to the order of the ops being compiled. For example: add + relu = add_relu_kernel.
-
Paul Fultz II authored
-
- 17 May, 2022 1 commit
-
-
shivadbhavsar authored
Updated variable names according to #1193
-
- 11 May, 2022 1 commit
-
-
Paul Fultz II authored
Fuse layernorm and added triadd_layernorm fusion. This is a prep performance booster
-
- 10 May, 2022 1 commit
-
-
Umang Yadav authored
Expose add_literal method in C/C++ api
-
- 09 May, 2022 1 commit
-
-
Paul Fultz II authored
Improves performance for add_gelu. In bert it is 4x faster and for mul_add it is 50% faster than what we current have.
-
- 06 May, 2022 1 commit
-
-
Chris Austen authored
Move to CI containers to rocm 5.0.2 upgrade to 20.04 free up some more file space in github action environments
-
- 05 May, 2022 1 commit
-
-
Paul Fultz II authored
Fixes the #error when using cppcheck. This no longer suppresses cppcheck errors when including those errors. This fixes the cppcheck errors that was there already.
-
- 03 May, 2022 1 commit
-
-
Paul Fultz II authored
Helps avoid dangling references. This also deprecates the constructors that didnt take a lifetime annotation since its ambiguous the lifetime.
-
- 29 Apr, 2022 1 commit
-
-
turneram authored
Add ref and gpu implementations for ONNX op GatherND Resolves #1032
-
- 27 Apr, 2022 1 commit
-
-
Paul Fultz II authored
With reductions such as {2048, 2, 1456} on axes 1, this is 23x faster than using our new block_reduce, and its even over 100x faster than our original reduce_sum: # lane gpu::code_object[code_object=13736,symbol_name=kernel,global=2981888,local=1024,]: 0.0672928ms # block gpu::code_object[code_object=13800,symbol_name=kernel,global=39321600,local=64,]: 1.46072ms # original gpu::reduce_sum[axes={1}]: 6.73456ms There is some basic logic to pick between lane and block reduce automatically.
-
- 26 Apr, 2022 1 commit
-
-
Umang Yadav authored
* expose get_queue method
-