1. 12 Oct, 2022 2 commits
  2. 07 Oct, 2022 1 commit
  3. 04 Oct, 2022 1 commit
  4. 27 Sep, 2022 1 commit
  5. 22 Sep, 2022 2 commits
  6. 21 Sep, 2022 1 commit
  7. 19 Sep, 2022 1 commit
    • Improve layernorm and reductions performance (#1348) · 97a1ed2d
      Paul Fultz II authored
      Compute mean and variance in same reduction
      Set block size to numbers divisible by 32 instead of powers of 2
      Global is also set exactly instead of being divisible by block size
      More exact matching of global/local can help get rid of branching/loops
      Reduce vectors first before doing dpp_reduce
      Explicitly vectorize array operators since the compiler doesn't always vectorize them
      Still uses the old for loop when computing at compile time, since neither reinterpret_cast nor all of the vector types are supported there
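
As an illustration of computing mean and variance in the same reduction, here is a minimal host-side sketch. It is not the kernel from this commit: the `moments`, `combine`, and `mean_variance` names are hypothetical, and the formula used (population variance as the mean of squares minus the squared mean) is an assumption about one common way to fuse the two statistics.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper type: carries both partial sums through one reduction.
struct moments
{
    double sum;    // running sum of x
    double sum_sq; // running sum of x * x
};

// Combine two partial results. The operation is associative,
// so it could also be used in a tree or block reduction.
inline moments combine(moments a, moments b)
{
    return {a.sum + b.sum, a.sum_sq + b.sum_sq};
}

struct mean_var
{
    double mean;
    double variance;
};

// One pass over the data yields both statistics:
//   mean = sum / n, variance = sum_sq / n - mean^2 (population variance).
inline mean_var mean_variance(const std::vector<double>& xs)
{
    assert(!xs.empty());
    const moments m =
        std::accumulate(xs.begin(), xs.end(), moments{}, [](moments acc, double x) {
            return combine(acc, moments{x, x * x});
        });
    const double n    = static_cast<double>(xs.size());
    const double mean = m.sum / n;
    return {mean, m.sum_sq / n - mean * mean};
}
```

The point of the pairing is that the data is read once instead of twice. The sum-of-squares form can lose precision when the mean is large relative to the spread, so a production kernel may choose a different formulation.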
  8. 16 Sep, 2022 1 commit
  9. 14 Sep, 2022 2 commits
  10. 13 Sep, 2022 5 commits
  11. 12 Sep, 2022 1 commit
  12. 09 Sep, 2022 1 commit
  13. 08 Sep, 2022 1 commit
  14. 07 Sep, 2022 2 commits
  15. 06 Sep, 2022 1 commit
  16. 31 Aug, 2022 1 commit
  17. 27 Aug, 2022 1 commit
  18. 25 Aug, 2022 2 commits
  19. 17 Aug, 2022 1 commit
  20. 16 Aug, 2022 1 commit
  21. 12 Aug, 2022 2 commits
  22. 25 Jul, 2022 1 commit
    • Add fpga target (#1304) · 8a30d698
      varunsh authored
      * Add is_supported to the target
      * Add get_target_assignments
      * Rename assignment to target_assignments
      * Add ref target header to test
      * Add fpga target
      * Make context const in compute
  23. 06 Jul, 2022 1 commit
    • Verify load and save (#1265) · f2531606
      Paul Fultz II authored
      In the verification tests, check that saving and reloading a program yields the same program. This also fixes serialization to always load instructions in the same order. There are also fixes for deconv and quant_conv, which didn't save the solution id and were broken for serialization.
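
The save-and-reload check described above is a round-trip property. The sketch below shows the idea on a toy type only; `toy_program`, `save_to_string`, `load_from_string`, and `round_trips` are hypothetical names for illustration and are not the library's serialization API.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for a compiled program: an ordered list of instruction names.
struct toy_program
{
    std::vector<std::string> instructions;
};

inline bool operator==(const toy_program& a, const toy_program& b)
{
    return a.instructions == b.instructions;
}

// Write one instruction per line, in program order.
inline std::string save_to_string(const toy_program& p)
{
    std::ostringstream os;
    for(const auto& ins : p.instructions)
        os << ins << '\n';
    return os.str();
}

// Read the instructions back in the order they were written.
inline toy_program load_from_string(const std::string& data)
{
    toy_program p;
    std::istringstream is(data);
    std::string line;
    while(std::getline(is, line))
        p.instructions.push_back(line);
    return p;
}

// The round-trip property: save followed by load reproduces the program.
inline bool round_trips(const toy_program& p)
{
    return load_from_string(save_to_string(p)) == p;
}
```

Because equality here compares instructions in order, a loader that reordered instructions would fail the check, which is why deterministic load order matters for this kind of test.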
  24. 22 Jun, 2022 1 commit
  25. 07 Jun, 2022 1 commit
  26. 02 Jun, 2022 1 commit
  27. 26 May, 2022 1 commit
  28. 29 Apr, 2022 1 commit
  29. 17 Apr, 2022 1 commit
    • Reduce with runtime compilation (#1150) · f9a5b81e
      Paul Fultz II authored
      There is a significant improvement on larger tensors, with half almost 50% faster:
      
      lens: [1024, 384, 768]
      gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms
      gpu::reduce_sum[axes={2}]: 1.73126ms
      Also for non-trivial layouts this can sometimes be over 2x faster:
      
      lens: [64, 1024, 768, 4]
      gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms
      gpu::reduce_sum[axes={1}]: 2.63375ms
      Of course, if the stride becomes larger, this speed improvement diminishes due to poor memory access patterns. A lane_reduce instead of a block_reduce is needed for this type of kernel. I plan to address that in a future PR.
      
      Finally, this also includes a MIGRAPHX_GPU_DUMP_ASM env variable which will print out the assembly when the kernel compiles.
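
To make the quoted speedups concrete, the arithmetic can be reproduced from the timings above. This sketch assumes the `gpu::code_object` line in each pair is the new runtime-compiled kernel and the `gpu::reduce_sum` line is the previous implementation; the `speedup` helper and variable names are hypothetical.

```cpp
#include <cstdio>

// Ratio of the old kernel time to the new one; values above 1 mean the new kernel is faster.
constexpr double speedup(double old_ms, double new_ms) { return old_ms / new_ms; }

// Timings copied from the benchmark output above (milliseconds).
constexpr double contiguous_speedup = speedup(1.73126, 1.16685); // lens [1024, 384, 768], axes={2}
constexpr double strided_speedup    = speedup(2.63375, 1.1706);  // lens [64, 1024, 768, 4], axes={1}

// Print both ratios with two decimal places.
inline void print_speedups()
{
    std::printf("axes={2}: %.2fx\n", contiguous_speedup);
    std::printf("axes={1}: %.2fx\n", strided_speedup);
}
```

The ratios come out at roughly 1.48x and 2.25x, which matches the "almost 50% faster" and "over 2x faster" statements.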
  30. 14 Apr, 2022 1 commit
    • Half2 overloads (#1157) · 12007dba
      bpickrel authored
      Issue 1127: Updates the math.hpp header file to add overloads of various standard functions (ops) for the hip half2 type. The half2 type is two 16-bit floats packed into a 32-bit value, so the overloads act on vectors whose sizes are multiples of 2. They are invoked in runtime compilation whenever one of the ops is called on a tensor declared with the data type shape::half_type.
      
      Defined a new template and instantiated it for those math operations that the hip library provides. Added verify tests for the sqrt operator covering three cases:
      
      * tensor size not divisible by 2
      * tensor size divisible by 2 but not by 4
      * tensor size divisible by 4
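
The following host-side sketch illustrates why the tensor size matters for a two-lane packed type. It is not the code from math.hpp: `pair2` is a hypothetical stand-in for half2 that uses plain floats so it builds without HIP, and the scalar handling of a leftover odd element is an assumption about how a size not divisible by 2 might be covered.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for HIP's half2: two lanes packed in one value.
// Plain floats are used so the sketch builds without a GPU toolchain.
struct pair2
{
    float x;
    float y;
};

// Overload of sqrt for the packed type: apply the scalar op to each lane.
inline pair2 sqrt(pair2 v) { return {std::sqrt(v.x), std::sqrt(v.y)}; }

// Apply sqrt two elements at a time, then finish any odd element with the scalar op.
inline std::vector<float> sqrt_all(const std::vector<float>& in)
{
    std::vector<float> out(in.size());
    const std::size_t pairs = in.size() / 2;
    for(std::size_t i = 0; i < pairs; i++)
    {
        const pair2 r  = sqrt(pair2{in[2 * i], in[2 * i + 1]});
        out[2 * i]     = r.x;
        out[2 * i + 1] = r.y;
    }
    if(in.size() % 2 != 0)
        out.back() = std::sqrt(in.back());
    return out;
}
```

Calling `sqrt_all` with 3, 6, and 8 elements exercises the same three size classes as the verify tests listed above: an odd size takes the scalar tail, while the even sizes are covered entirely by the packed overload.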