Commits · 2e79bb1b5af6761911982ce352de6d6c599174d6 · gaoqiong / MIGraphX

12 Oct, 2022 2 commits
- remove debug prints from fuse_ck · 2e79bb1b
  Alan Turner authored Oct 12, 2022
  
  2e79bb1b
- add ck_gemm_add_add_gelu fusion · f83139de
  Alan Turner authored Oct 12, 2022
  
  f83139de
07 Oct, 2022 1 commit
- Update tuning method · 78a300ff
  Alan Turner authored Oct 07, 2022
  
  78a300ff
04 Oct, 2022 1 commit
- Fast softmax (#1290) · a9a47402
  Paul Fultz II authored Oct 04, 2022
```
optimize the softmax operator
```
  a9a47402
27 Sep, 2022 1 commit
- Add onnx mod operator gpu cpu (#1306) · 40118191
  Ted Themistokleous authored Sep 26, 2022
```
Implement operator for CPU and GPU implementations
```
  40118191
22 Sep, 2022 2 commits
- Formatting · 1dd11890
  turneram authored Sep 22, 2022
  
  1dd11890
- Add xdl fp16 gemm · 07167910
  turneram authored Sep 22, 2022
  
  07167910
21 Sep, 2022 1 commit

Parameterize epsilon for layernorm kernel (#1367) · d9578ba6

kahmed10 authored Sep 21, 2022

This PR allows for other values of epsilon to be matched when finding layernorm. Similarly, the calculation now uses the variable for epsilon.

d9578ba6

19 Sep, 2022 1 commit

Improve layernorm and reductions performance (#1348) · 97a1ed2d

Paul Fultz II authored Sep 19, 2022

Compute mean and variance in same reduction
Set block size to numbers divisible by 32 instead of powers of 2
Global is also set exactly instead of being divisible by block size
More exact matching of global/local can help get rid of branching/loops
Reduce vectors first before doing dpp_reduce
Explicitly vectorize array operators since the compiler doesn't always vectorize them
Still uses the old for loop when computing at compile time, since neither reinterpret_cast nor all of the vector types are supported there
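
As an illustration of "compute mean and variance in same reduction", here is a minimal sketch in plain Python. It assumes the common approach of accumulating the sum and the sum of squares in one pass; the commit message does not say which formula the kernel uses, and the names `mean_var_single_pass` and `layernorm` and the default `eps` are hypothetical.

```python
import math


def mean_var_single_pass(xs):
    """Return (mean, variance) using one pass over xs.

    Assumption: variance is derived as E[x^2] - E[x]^2, so both
    accumulators can be updated in the same loop (one reduction).
    """
    n = len(xs)
    total = 0.0
    total_sq = 0.0
    for x in xs:
        total += x
        total_sq += x * x
    mean = total / n
    # Clamp at zero to guard against tiny negative values from rounding.
    var = max(total_sq / n - mean * mean, 0.0)
    return mean, var


def layernorm(xs, eps=1e-12):
    """Normalize xs to zero mean and unit variance (eps is illustrative)."""
    mean, var = mean_var_single_pass(xs)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * inv_std for x in xs]
```

The single-pass form trades a second read of the data for a formula that is more sensitive to rounding when the mean is large relative to the spread, which is why the sketch clamps the variance at zero.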

97a1ed2d

16 Sep, 2022 1 commit
- Remove ck from cmakelists · 961cf059
  turneram authored Sep 16, 2022
  
  961cf059
14 Sep, 2022 2 commits
- Reduce problem size of unbatched_gemm tests (#1383) · 333860ce
  turneram authored Sep 14, 2022
```
The verify tests from pr #1354 were still causing some codecov timeouts after merge. This PR further reduces the problem sizes to avoid these failures.
```
  333860ce
- Implement concat using jit compilation (#1356) · 7662d9c0
  Paul Fultz II authored Sep 14, 2022
```
* Implement concat using jit compilation
```
  7662d9c0
13 Sep, 2022 5 commits
- Move ck includes to own header file · d1e27426
  turneram authored Sep 13, 2022
  
  d1e27426
- Use rocblas_gemm_ex for batched gemms with broadcasted B (#1354) · a10a8ef1
  turneram authored Sep 13, 2022
```
Improves performance for 4/6 GEMMs used by huggingface BERT models with batch_size>1 by using a non-batched rocBLAS call for GEMMs where the B input has a broadcasted batch dimension.
The four verify tests added reflect the actual configurations used by bert-base-cased, with varied batch sizes.

Also adds a matcher to simplify_reshapes to move multibroadcasts after concats.
```
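
A rough sketch of why a non-batched GEMM can stand in for a batched one when B is broadcast across the batch: stacking the rows of every A batch into one tall matrix and multiplying once by B gives the same numbers. This is plain Python with hypothetical helper names, and it assumes A is stored so that the batches can be stacked row-wise; it does not show the rocBLAS call itself.

```python
def matmul(a, b):
    """Plain row-major matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def batched_matmul_broadcast_b(a_batches, b):
    """Reference: one product per batch, reusing the same B every time."""
    return [matmul(a, b) for a in a_batches]


def flattened_matmul(a_batches, b):
    """Single product: stack all batches of A into (batch * m) x k."""
    m = len(a_batches[0])
    stacked = [row for a in a_batches for row in a]
    out = matmul(stacked, b)
    # Split the tall result back into per-batch blocks of m rows.
    return [out[i * m:(i + 1) * m] for i in range(len(a_batches))]
```

Both functions return the same nested lists, so the choice between them is purely about how the work is dispatched.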
  a10a8ef1
- Add gemm test · 6fb1706a
  turneram authored Sep 13, 2022
  
  6fb1706a
- Formatting · 0e237605
  turneram authored Sep 13, 2022
  
  0e237605
- Add n-dimensional inputs · 985fb0dd
  turneram authored Sep 13, 2022
  
  985fb0dd
12 Sep, 2022 1 commit
- Create half_t test · 9a7bb6d2
  turneram authored Sep 12, 2022
  
  9a7bb6d2
09 Sep, 2022 1 commit
- Call gemm from kernel · 127393f4
  turneram authored Sep 09, 2022
  
  127393f4
08 Sep, 2022 1 commit
- Merge elementwise · cc2535e0
  turneram authored Sep 08, 2022
  
  cc2535e0
07 Sep, 2022 2 commits
- Almost working · e4737e2f
  turneram authored Sep 07, 2022
  
  e4737e2f
- Fix accuracy bug when vectorizing slices (#1364) · 60aa0e48
  Paul Fultz II authored Sep 06, 2022
```
* Fix accuracy bug when vectorizing slices
```
  60aa0e48
06 Sep, 2022 1 commit
- Enable cppcheck rule for 'not', 'or' keywords (#1361) · d37a4df9
  Paul Fultz II authored Sep 06, 2022
```
Using not and or improves readability. The cppcheck rule will help ensure we are doing it consistently.
```
  d37a4df9
31 Aug, 2022 1 commit

Add pass to rewrite gelu as fast gelu (#1299) · 794a4335

turneram authored Aug 31, 2022

Rewrite_gelu pass replaces the gelu formula of x * (1/2) * (1 + erf(x/sqrt(2))) with the sigmoid approximation of x * Sigmoid(x * 1.702)
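
The two formulas named above can be compared numerically with a short sketch. This is plain Python using only the standard library, not the pass itself; the function names are hypothetical.

```python
import math


def gelu_erf(x):
    """Exact form from the commit message: x * (1/2) * (1 + erf(x / sqrt(2)))."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_sigmoid(x):
    """Sigmoid approximation from the commit message: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


def max_abs_error(lo=-6.0, hi=6.0, steps=1201):
    """Largest absolute gap between the two forms on an evenly spaced grid."""
    step = (hi - lo) / (steps - 1)
    return max(
        abs(gelu_erf(lo + i * step) - gelu_sigmoid(lo + i * step))
        for i in range(steps)
    )
```

On this range the gap stays around 0.02 in absolute terms, which is the accuracy cost of replacing `erf` with a single sigmoid.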

794a4335

27 Aug, 2022 1 commit

Improvements to handling and add constant passed to dot operator (#1280) · 8752875a

Paul Fultz II authored Aug 26, 2022

This will rewrite dot operators like X(Y + b) to XY + Xb when b is constant as we can fold the add away.
This improves handling of pointwise with broadcasted operators, which helps improve const propagation.
Improve gemm fusion with a mul_add
Improve support for broadcast shapes in gemm
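
The dot rewrite relies on matrix multiplication distributing over addition. A minimal check in plain Python follows; the helper names are hypothetical, and `b` is taken to have the same shape as `Y` for simplicity.

```python
def matmul(a, b):
    """Plain row-major matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a, b):
    """Elementwise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def dot_before_rewrite(x, y, b):
    """Original form: X(Y + b)."""
    return matmul(x, add(y, b))


def dot_after_rewrite(x, y, b):
    """Rewritten form: XY + Xb."""
    return add(matmul(x, y), matmul(x, b))
```

With integer inputs the two forms agree exactly; with floating point they can differ by rounding, since the additions happen in a different order.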

8752875a

25 Aug, 2022 2 commits
- Formatting · fb573172
  turneram authored Aug 25, 2022
  
  fb573172
- Switch to elementwise · 8d378877
  turneram authored Aug 25, 2022
  
  8d378877
17 Aug, 2022 1 commit
- Add jit layernorm fusion (#1301) · 1784584e
  Paul Fultz II authored Aug 16, 2022
  
  1784584e
16 Aug, 2022 1 commit
- Fix softmax accuracy issues (#1342) · 0e17a724
  Paul Fultz II authored Aug 16, 2022
  
  0e17a724
12 Aug, 2022 2 commits
- Formatting · 6bf3493a
  turneram authored Aug 12, 2022
  
  6bf3493a
- Add ref and jit ops · ef5a5f4e
  turneram authored Aug 12, 2022
  
  ef5a5f4e
25 Jul, 2022 1 commit

Add fpga target (#1304) · 8a30d698

varunsh authored Jul 25, 2022

* Add is_supported to the target
* Add get_target_assignments
* Rename assignment to target_assignments
* Add ref target header to test
* Add fpga target
* Make context const in compute

8a30d698

06 Jul, 2022 1 commit

Verify load and save (#1265) · f2531606

Paul Fultz II authored Jul 05, 2022

In the verification tests, check that saving and reloading the program yields the same program. This also fixes serialization to always load instructions in the same order. There are also fixes for deconv and quant_conv, which didn't save the solution id and were broken for serialization.

f2531606

22 Jun, 2022 1 commit
- Update license files (#1248) · e44cecbc
  Ted Themistokleous authored Jun 22, 2022
```
Updated each source file in the repo with the existing license.
```
  e44cecbc
07 Jun, 2022 1 commit

Prioritizing int8 over int8x4 when it is applicable (#1218) · 37c47504

Zhuoran Yin authored Jun 07, 2022



prioritizing int8 over int8x4 when it is applicable
Amend return to continue in apply loop
Adding error handling in case int8x4 compilation failed
Co-authored-by: Paul Fultz II <pfultz2@yahoo.com>

37c47504

02 Jun, 2022 1 commit
- Fix dangling reference with gemm add fusion (#1233) · 1339ba35
  Paul Fultz II authored Jun 01, 2022
  
  1339ba35
26 May, 2022 1 commit
- Upgrade to cppcheck 2.8 and fix new issues found (#1225) · a401e72a
  Paul Fultz II authored May 26, 2022
```
* Upgrade to cppcheck 2.8
```
  a401e72a
29 Apr, 2022 1 commit
- Add GatherND operator (#1089) · 4ec35e5f
  turneram authored Apr 28, 2022
```
Add ref and gpu implementations for ONNX op GatherND

Resolves #1032
```
  4ec35e5f
17 Apr, 2022 1 commit

Reduce with runtime compilation (#1150) · f9a5b81e

Paul Fultz II authored Apr 17, 2022

There is a significant improvement on larger tensors, with half almost 50% faster:

lens: [1024, 384, 768]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms
gpu::reduce_sum[axes={2}]: 1.73126ms
Also, for non-trivial layouts this can sometimes be over 2x faster:

lens: [64, 1024, 768, 4]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms
gpu::reduce_sum[axes={1}]: 2.63375ms
Of course, if the stride becomes larger, this speed improvement diminishes due to poor memory access patterns. A lane_reduce instead of a block_reduce is needed for this type of kernel. I plan to address that in a future PR.
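
The "almost 50%" and "over 2x" figures can be reproduced from the timings quoted above. The sketch below only does that arithmetic, taking the speedup as old time divided by new time; the function name is hypothetical.

```python
def speedup(old_ms, new_ms):
    """Ratio of the old kernel time to the new one (greater than 1 means faster)."""
    return old_ms / new_ms


# Timings copied from the commit message, in milliseconds.
half_axis2 = speedup(old_ms=1.73126, new_ms=1.16685)   # lens [1024, 384, 768]
strided_axis1 = speedup(old_ms=2.63375, new_ms=1.1706)  # lens [64, 1024, 768, 4]
```

Here `half_axis2` comes out just under 1.5 and `strided_axis1` about 2.25.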

Finally, this also includes a MIGRAPHX_GPU_DUMP_ASM env variable which will print out the assembly when the kernel compiles.

f9a5b81e

14 Apr, 2022 1 commit

Half2 overloads (#1157) · 12007dba

bpickrel authored Apr 14, 2022

Issue 1127: Updates the math.hpp header file to provide overloads of various standard functions (ops) for the hip half2 type. The half2 type is two 16-bit floats packed into a 32-bit number, and therefore the overloads act on vectors whose sizes are multiples of 2. They are invoked in runtime compilation any time one of the ops is called on a tensor declared with the data type shape::half_type.

Defined a new template, made instances of the template for those math operations that the hip library contains, and added verify tests for the sqrt operator for three cases:

tensor size not divisible by 2
tensor size divisible by 2 but not by 4
tensor size divisible by 4
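
To make the packing concrete, here is a small sketch in plain Python that stores two 16-bit floats in one 32-bit word and applies sqrt pairwise. It is not the HIP code: the function names are hypothetical, and handling an odd-sized input with a scalar tail step is an assumption, since the commit message only says that odd sizes are covered by a test.

```python
import math
import struct


def pack_half2(lo, hi):
    """Pack two half-precision floats into one little-endian 32-bit word."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack_half2(word):
    """Split a 32-bit word back into its two half-precision floats."""
    return struct.unpack("<2e", struct.pack("<I", word))


def sqrt_pairs(values):
    """Apply sqrt two elements at a time, with a scalar step for a leftover."""
    out = []
    for i in range(len(values) // 2):
        word = pack_half2(values[2 * i], values[2 * i + 1])
        lo, hi = unpack_half2(word)
        out.extend((math.sqrt(lo), math.sqrt(hi)))
    if len(values) % 2:
        # Assumed fallback: the last element does not fill a pair.
        out.append(math.sqrt(values[-1]))
    return out
```

For example, `pack_half2(1.0, 2.0)` puts the bit pattern of 1.0 in the low 16 bits and that of 2.0 in the high 16 bits.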

12007dba