1. 30 Jun, 2022 1 commit
  2. 29 Jun, 2022 2 commits
  3. 26 Jun, 2022 1 commit
  4. 25 Jun, 2022 2 commits
  5. 24 Jun, 2022 1 commit
  6. 23 Jun, 2022 1 commit
  7. 22 Jun, 2022 1 commit
  8. 20 Jun, 2022 1 commit
  9. 17 Jun, 2022 3 commits
    • Update lowering of Dot operator (#1247) · c99be32c
      Umang Yadav authored
      
      
      * remove code for allocation of C param in dot lowering
      
      * formatting
      Co-authored-by: Paul Fultz II <pfultz2@yahoo.com>
      c99be32c
    • Update tf_parser to have add_common_op() for parse_relu6 (#1241) · 421a5621
      Ted Themistokleous authored
      
      
      * [#935] Update tf_parser to have add_common_op() for parse_relu6
      
      Similar to what is done in onnx_parser.cpp, add an add_common_op template and the functionality to support clip-based operations. This is done so the inputs to clip operations are guaranteed to have the same dimensions (see the sketch after this entry).
      
      * fixup! [#935] Update tf_parser to have add_common_op() for parse_relu6
      
      * fixup! fixup! [#935] Update tf_parser to have add_common_op() for parse_relu6
      
      * fixup! fixup! fixup! [#935] Update tf_parser to have add_common_op() for parse_relu6
      
      * fixup! fixup! fixup! fixup! [#935] Update tf_parser to have add_common_op() for parse_relu6
      
      * Formatting
      
      * fixup! Formatting
      Co-authored-by: Umang Yadav <29876643+umangyadav@users.noreply.github.com>
      Co-authored-by: Paul Fultz II <pfultz2@yahoo.com>
      421a5621
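
      A rough standalone sketch of the idea behind the change (not the tf_parser code; broadcast_scalar and relu6 below are hypothetical helpers): relu6(x) is clip(x, 0, 6), and the scalar bounds have to be expanded to the input's dimensions before the element-wise clip so that all operands match.

      #include <algorithm>
      #include <iostream>
      #include <vector>

      // Hypothetical stand-in for the guarantee the parser change provides:
      // expand a scalar bound to the same number of elements as the input so
      // the element-wise clip sees operands of identical dimensions.
      std::vector<float> broadcast_scalar(float value, std::size_t n)
      {
          return std::vector<float>(n, value);
      }

      // relu6(x) expressed as clip(x, 0, 6) over operands of the same shape.
      std::vector<float> relu6(const std::vector<float>& x)
      {
          auto lo = broadcast_scalar(0.0f, x.size());
          auto hi = broadcast_scalar(6.0f, x.size());
          std::vector<float> y(x.size());
          for(std::size_t i = 0; i < x.size(); ++i)
              y[i] = std::clamp(x[i], lo[i], hi[i]);
          return y;
      }

      int main()
      {
          for(float v : relu6({-2.0f, 0.5f, 3.0f, 9.0f}))
              std::cout << v << ' '; // prints: 0 0.5 3 6
          std::cout << '\n';
      }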
    • Create allocate op and replace_allocate pass (#1183) · add6fb3b
      kahmed10 authored
      
      
      * add allocate op header
      
      * formatting
      
      * add replace_allocate pass
      
      * formatting
      
      * move output param to remove_allocate pass
      
      * formatting
      
      * fix bugs in replace_allocate pass
      
      * formatting
      
      * fix verify if tests
      
      * formatting
      
      * move if op logic
      
      * formatting
      
      * cleanup lowering
      
      * cleanup lowering
      
      * formatting
      
      * fix tidy
      
      * formatting
      
      * fix tidy
      
      * add cpu allocate check
      
      * formatting
      
      * change cpu allocate in pass
      
      * formatting
      
      * add some tests for replace_allocate pass
      
      * formatting
      
      * pass by ref
      
      * fix run_pass
      
      * formatting
      
      * update variable name for module
      
      * update dce to use contains() and fix tidy
      
      * formatting
      
      * update cppcheck
      
      * add if test
      
      * formatting
      
      * add if test
      
      * rename var to mod_output_names
      
      * formatting
      
      * remove conditional
      
      * update allocate op and tests
      
      * formatting
      
      * update replace_allocate tests
      
      * update create_output_names() and conditional in replace_allocate
      
      * formatting
      
      * remove extra variable in replace_allocate
      
      * update tools script for allocation_model
      Co-authored-by: Umang Yadav <29876643+umangyadav@users.noreply.github.com>
      Co-authored-by: Chris Austen <causten@users.noreply.github.com>
      Co-authored-by: Paul Fultz II <pfultz2@yahoo.com>
      add6fb3b
  10. 16 Jun, 2022 1 commit
  11. 10 Jun, 2022 1 commit
  12. 07 Jun, 2022 1 commit
  13. 03 Jun, 2022 1 commit
    • Group code objects by kernel name in perf report summary (#1234) · 7271ddbc
      Paul Fultz II authored
      Break up the gpu::code_object print to show the actual kernels (a grouping sketch follows this entry):
      
      gpu::code_object::add_kernel: 0.646121ms, 5%
      gpu::code_object::mul_kernel: 0.623822ms, 5%
      gpu::code_object::add_mul_erf_add_mul_mul_kernel: 0.498902ms, 4%
      gpu::code_object::mul_add_kernel: 0.478352ms, 4%
      7271ddbc
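
      A hypothetical sketch of the grouping idea only (not the MIGraphX perf-report code, and the trace values below are made up): accumulate the reported time per kernel-qualified name and print each group's share of the total.

      #include <iostream>
      #include <map>
      #include <string>
      #include <utility>
      #include <vector>

      int main()
      {
          // (name, milliseconds) pairs as they might appear in a perf trace.
          std::vector<std::pair<std::string, double>> trace = {
              {"gpu::code_object::add_kernel", 0.32},
              {"gpu::code_object::mul_kernel", 0.30},
              {"gpu::code_object::add_kernel", 0.33},
              {"gpu::code_object::mul_add_kernel", 0.48},
          };

          // Group by the kernel-qualified name and sum the time spent in each group.
          std::map<std::string, double> totals;
          double grand_total = 0;
          for(const auto& [name, ms] : trace)
          {
              totals[name] += ms;
              grand_total += ms;
          }

          for(const auto& [name, ms] : totals)
              std::cout << name << ": " << ms << "ms, "
                        << static_cast<int>(100 * ms / grand_total) << "%\n";
      }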
  14. 02 Jun, 2022 1 commit
  15. 30 May, 2022 1 commit
    • Improve eliminate contiguous pass (#1223) · 86061b4d
      shivadbhavsar authored
      Following up on issue #1166 and PR #1220: using the same approach as in #1220 for parallelizing the eval calls, we can significantly reduce the time spent in the eliminate_contiguous pass.
      86061b4d
  16. 26 May, 2022 2 commits
    • Parallelize evaluations in propagate_constant (#1220) · bf603a76
      shivadbhavsar authored
      Addressing issue #1166 - the propagate_constant pass currently uses a recursive approach to find all instructions in a module that can be evaluated to a literal, and it performs the replacement in the same call.

      New approach (see the sketch after this entry):

      * Perform a single pass through the instructions in the module to determine which instructions can be evaluated
      * Evaluate the selected instructions in parallel
      * Replace the selected instructions with the corresponding literals
      bf603a76
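
      A minimal sketch of that three-step pattern, assuming std::async for the parallelism and using stand-in instruction/evaluate types rather than the MIGraphX ones:

      #include <cmath>
      #include <functional>
      #include <future>
      #include <iostream>
      #include <vector>

      // Stand-ins for the real pass: an "instruction" that may or may not be
      // evaluable at compile time, and an evaluation that yields a literal value.
      struct instruction
      {
          bool evaluable;
          double input;
      };

      double evaluate(const instruction& ins) { return std::sqrt(ins.input); }

      int main()
      {
          std::vector<instruction> mod = {{true, 4.0}, {false, 0.0}, {true, 9.0}};

          // 1. Single pass to select the instructions that can be evaluated.
          std::vector<std::size_t> selected;
          for(std::size_t i = 0; i < mod.size(); ++i)
              if(mod[i].evaluable)
                  selected.push_back(i);

          // 2. Evaluate the selected instructions in parallel.
          std::vector<std::future<double>> results;
          for(auto i : selected)
              results.push_back(std::async(std::launch::async, evaluate, std::cref(mod[i])));

          // 3. Replace each selected instruction with the computed literal
          //    (here we just print what would be replaced).
          for(std::size_t k = 0; k < selected.size(); ++k)
              std::cout << "instruction " << selected[k] << " -> literal " << results[k].get() << '\n';
      }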
    • Upgrade to cppcheck 2.8 and fix new issues found (#1225) · a401e72a
      Paul Fultz II authored
      * Upgrade to cppcheck 2.8
      a401e72a
  17. 24 May, 2022 4 commits
  18. 20 May, 2022 2 commits
  19. 17 May, 2022 1 commit
  20. 11 May, 2022 1 commit
  21. 10 May, 2022 1 commit
  22. 09 May, 2022 1 commit
  23. 06 May, 2022 1 commit
  24. 05 May, 2022 1 commit
    • Cppcheck fixes (#1195) · d582425b
      Paul Fultz II authored
      Fixes the #error when using cppcheck. This no longer suppresses cppcheck errors for includes, and it fixes the cppcheck errors that were already there.
      d582425b
  25. 03 May, 2022 1 commit
  26. 29 Apr, 2022 1 commit
  27. 27 Apr, 2022 1 commit
    • Add lane reduction (#1180) · 4c72cc95
      Paul Fultz II authored
      With reductions such as {2048, 2, 1456} on axis 1, this is 23x faster than using our new block_reduce, and it is even over 100x faster than our original reduce_sum (a conceptual sketch follows this entry):
      
      # lane
      gpu::code_object[code_object=13736,symbol_name=kernel,global=2981888,local=1024,]: 0.0672928ms
      # block
      gpu::code_object[code_object=13800,symbol_name=kernel,global=39321600,local=64,]: 1.46072ms
      # original
      gpu::reduce_sum[axes={1}]: 6.73456ms
      There is some basic logic to pick between lane and block reduce automatically.
      4c72cc95
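
      A plain-C++ sketch of the lane-style access pattern only (not the GPU kernel): for a shape like {2048, 2, 1456} reduced over axis 1, each output element can be produced by a single worker walking just the two reduced values, instead of a whole block of workers cooperating on one output.

      #include <iostream>
      #include <vector>

      // Reduce-sum over axis 1 of a row-major {d0, d1, d2} tensor. Each (i, k)
      // output element is handled independently by its own small loop, which is
      // the pattern that maps well onto one GPU lane per output when the reduced
      // axis is tiny (d1 == 2 in the example above).
      std::vector<float> reduce_sum_axis1(const std::vector<float>& in,
                                          std::size_t d0, std::size_t d1, std::size_t d2)
      {
          std::vector<float> out(d0 * d2, 0.0f);
          for(std::size_t i = 0; i < d0; ++i)
              for(std::size_t k = 0; k < d2; ++k)
              {
                  float sum = 0.0f;
                  for(std::size_t j = 0; j < d1; ++j) // the tiny reduced axis
                      sum += in[(i * d1 + j) * d2 + k];
                  out[i * d2 + k] = sum;
              }
          return out;
      }

      int main()
      {
          std::vector<float> in = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}; // shape {2, 2, 3}
          for(float v : reduce_sum_axis1(in, 2, 2, 3))
              std::cout << v << ' '; // prints: 5 7 9 17 19 21
          std::cout << '\n';
      }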
  28. 26 Apr, 2022 1 commit
  29. 23 Apr, 2022 1 commit
    • ReverseSequence op (#1177) · 31906785
      Charlie Lin authored
      Implements the ReverseSequence ONNX operator as a parser (the operator's semantics are sketched after this entry).
      
      This parser can only handle a constant sequence_lens input. This is the same as what is handled for TensorRT as far as I can tell.
      We could handle a variable sequence_lens input; that would require ref and GPU implementations of the operator.
      The ONNX backend tests are disabled because this does not handle variable sequence_lens.
      31906785
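
      For reference, the operator's semantics per the ONNX spec (this is not the parser code): for each slice along batch_axis, the first sequence_lens[b] elements along time_axis are reversed and the remaining elements are left unchanged. A small sketch with batch_axis = 0 and time_axis = 1:

      #include <algorithm>
      #include <iostream>
      #include <vector>

      // ReverseSequence with batch_axis = 0 and time_axis = 1 on a row-major
      // {batch, time} tensor: reverse the first seq_lens[b] entries of row b.
      std::vector<int> reverse_sequence(std::vector<int> x,
                                        std::size_t batch, std::size_t time,
                                        const std::vector<std::size_t>& seq_lens)
      {
          for(std::size_t b = 0; b < batch; ++b)
          {
              auto row = x.begin() + b * time;
              std::reverse(row, row + seq_lens[b]); // elements past seq_lens[b] stay put
          }
          return x;
      }

      int main()
      {
          // Two sequences of length 4; reverse the first 3 of row 0 and all of row 1.
          std::vector<int> x = {1, 2, 3, 4,
                                5, 6, 7, 8};
          for(int v : reverse_sequence(x, 2, 4, {3, 4}))
              std::cout << v << ' '; // prints: 3 2 1 4 8 7 6 5
          std::cout << '\n';
      }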
  30. 19 Apr, 2022 1 commit
    • Refactor Pooling and implement ONNX LpPool and GlobalLpPool (#1152) · 764273e4
      Charlie Lin authored
      * Refactored the reference implementation of pooling to something like what was done for roialign, moving it from targets/ref/lowering.cpp to pooling.hpp
      * Removed cpu_pooling, instead using the reference pooling in pooling.hpp
      * Added a reference implementation of Lp norm pooling and its global version (sketched after this entry)
      * Added tests for the Lp norm pooling
      764273e4
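
      For reference, Lp pooling computes, for each window, the Lp norm of the window's elements (sum of |x|^p, then the p-th root); the global version uses a single window covering the whole spatial extent. A small 1-D sketch (not the pooling.hpp implementation):

      #include <cmath>
      #include <iostream>
      #include <vector>

      // 1-D Lp pooling with stride 1 and no padding: each output is the Lp norm
      // of one window of the input.
      std::vector<double> lp_pool_1d(const std::vector<double>& x, std::size_t window, double p)
      {
          std::vector<double> out;
          for(std::size_t start = 0; start + window <= x.size(); ++start)
          {
              double acc = 0.0;
              for(std::size_t i = 0; i < window; ++i)
                  acc += std::pow(std::fabs(x[start + i]), p);
              out.push_back(std::pow(acc, 1.0 / p));
          }
          return out;
      }

      int main()
      {
          std::vector<double> x = {3, 4, 0, 12, 5};
          // L2 pooling (p = 2) with a window of 2: sqrt(3^2 + 4^2) = 5, sqrt(4^2 + 0^2) = 4, ...
          for(double v : lp_pool_1d(x, 2, 2.0))
              std::cout << v << ' '; // prints: 5 4 12 13
          std::cout << '\n';
      }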
  31. 17 Apr, 2022 1 commit
    • Reduce with runtime compilation (#1150) · f9a5b81e
      Paul Fultz II authored
      There is a significant improvement on larger tensors, with half precision almost 50% faster:
      
      lens: [1024, 384, 768]
      gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms
      gpu::reduce_sum[axes={2}]: 1.73126ms
      Also for non-trivial layouts this can sometimes be over 2x faster:
      
      lens: [64, 1024, 768, 4]
      gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms
      gpu::reduce_sum[axes={1}]: 2.63375ms
      Of course, if the stride becomes larger, this speed improvement diminishes due to poor memory access patterns (see the sketch after this entry). A lane_reduce instead of a block_reduce is needed for such kernels; I plan to address that in a future PR.
      
      Finally, this also includes a MIGRAPHX_GPU_DUMP_ASM env variable which will print out the assembly when the kernel compiles.
      f9a5b81e
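
      A rough illustration of the access-pattern point in plain C++ (not the generated kernel): reducing the fastest axis, as in the [1024, 384, 768] axes={2} case above, reads each group of summed elements contiguously, while reducing a strided axis makes the same loop jump through memory.

      #include <iostream>
      #include <numeric>
      #include <vector>

      // Reduce-sum over the last axis of a row-major {d0, d1, d2} tensor. The
      // elements being summed sit next to each other in memory (stride 1), which
      // is the friendly case; with a large stride the same loop scatters its
      // reads and the speedup shrinks, as noted above.
      std::vector<float> reduce_sum_last_axis(const std::vector<float>& in,
                                              std::size_t d0, std::size_t d1, std::size_t d2)
      {
          std::vector<float> out(d0 * d1);
          for(std::size_t i = 0; i < d0 * d1; ++i)
          {
              auto first = in.begin() + i * d2; // one contiguous run of d2 elements
              out[i] = std::accumulate(first, first + d2, 0.0f);
          }
          return out;
      }

      int main()
      {
          std::vector<float> in = {1, 2, 3, 4, 5, 6}; // shape {1, 2, 3}
          for(float v : reduce_sum_last_axis(in, 1, 2, 3))
              std::cout << v << ' '; // prints: 6 15
          std::cout << '\n';
      }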