Commits · 1578c0c7ac1f378702705a8edeba78f678b8a50f · OpenDAS / apex

23 Apr, 2023 1 commit

Grid optimization - Chunk_Size optimization. (#104) · 1578c0c7

aspanday authored Feb 15, 2023

* Updating BLOCK_SIZE to 1024.
tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 with Adam. There seems to be a bug in this test that needs to be resolved.
For now, test_bfloat16 for Adam is skipped in the unittest.
Ran 17 other tests and all of them pass.
More details on the effects of these changes can be found here - https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization.
This commit changes BLOCK_SIZE to 1024 only for the optimizers.
The L2norm kernels (part of the LAMB optimizer algorithm) still use BLOCK_SIZE=512; otherwise the allclose check fails.

* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.

* Updating chunk_size to 256*32 (8K), which was previously 2048*32 (64K).
In addition, updating depth_to_max_blocks to 2560 (8x the previous value of 320).
The observed performance improvement is up to 1.4x for a large number of elements, up to 5.2x for a moderate number of elements, and up to 1.44x for a small number of elements.
This change only affects the optimizers, specifically when multi_tensor_apply is enabled using the --cuda_ext extension when installing apex.
The performance results, along with a comparison with Torch, are captured here:
https://amdcloud.sharepoint.com/:x:/r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/Grid%20Optimization/Elementwise%20Kernel%20-%20Grid%20Optimization%20-%20Benchmark%20sweep.xlsx?d=wa8bacf65a2904002bf3cad4c57769eff&csf=1&web=1&e=JhLVm8
See sheet chunk_opt.

* Updating all files related to L2norm, since test_fuzz (test_multi_tensor_l2norm.TestMultiTensorL2Norm) failed with the previous commits.
Changes in chunk_size seem to affect reduction kernels, so this commit makes it possible to keep the unoptimized settings for L2norm while applying the optimizations to all other kernels associated with the optimizers.
The change introduces multi_tensor_apply_l2norm, which assumes a chunk_size of 64K, as well as multi_tensor_apply_base.cuh, which is meant to be used specifically by the l2norm kernels.

---------
Co-authored-by: aspanday <aspanday@amd.com>
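
The effect of the chunk_size change in the commit message above can be illustrated with simple arithmetic. The sketch below is not taken from apex: the helper name `chunks_needed` is hypothetical, and it only assumes that a tensor is split into ceil(numel / chunk_size) chunks.

```python
# Chunk sizes quoted in the commit message above.
OLD_CHUNK_SIZE = 2048 * 32  # 65536 elements (64K), before the change
NEW_CHUNK_SIZE = 256 * 32   # 8192 elements (8K), after the change


def chunks_needed(numel: int, chunk_size: int) -> int:
    """Hypothetical helper: chunks required to cover `numel` elements."""
    # Integer ceiling division, so a partially filled chunk still counts.
    return (numel + chunk_size - 1) // chunk_size


if __name__ == "__main__":
    for numel in (1_000, 100_000, 10_000_000):
        old = chunks_needed(numel, OLD_CHUNK_SIZE)
        new = chunks_needed(numel, NEW_CHUNK_SIZE)
        print(f"numel={numel}: old chunks={old}, new chunks={new}")
```

The new chunk size is one eighth of the old one, so a large tensor is split into roughly eight times as many chunks. That ratio matches the 8x increase of depth_to_max_blocks (320 to 2560) mentioned in the message.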


15 Feb, 2022 1 commit
- taking channels last 3d into account (#1284) · 39fc7ccf
  Masaki Kozuki authored Feb 15, 2022
  
13 Dec, 2021 1 commit
- Remove deprecated THC/THC.h · 67ded2e2
  Hubert Lu authored Dec 13, 2021
  
04 Oct, 2021 1 commit
- in multi tensor apply, skip empty tensors (#54) · 297ab210
  Jeff Daily authored Oct 04, 2021
  
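
As a rough illustration of what skipping empty tensors means when work is split into chunks, here is a minimal sketch. It is not the code from this commit: the function `plan_chunks` and its return format are hypothetical, chosen only to show the idea on plain element counts.

```python
def plan_chunks(tensor_sizes, chunk_size):
    """Hypothetical sketch: list (tensor_index, chunk_index) work items.

    `tensor_sizes` holds the element count of each tensor in the list.
    Tensors with zero elements are skipped and produce no work items.
    """
    plan = []
    for tensor_idx, numel in enumerate(tensor_sizes):
        if numel == 0:
            continue  # empty tensor: nothing to process
        # Integer ceiling division: a partial last chunk still counts.
        n_chunks = (numel + chunk_size - 1) // chunk_size
        plan.extend((tensor_idx, chunk_idx) for chunk_idx in range(n_chunks))
    return plan


if __name__ == "__main__":
    # Tensors 0 and 2 are empty and do not appear in the plan.
    print(plan_chunks([0, 5, 0, 9], chunk_size=4))
```

The explicit check keeps empty tensors out of the work list entirely, so later stages never see an entry with no elements.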
25 Feb, 2021 1 commit
- Revert "pass all TensorListMetadata as pointer to pinned host memory (#13)" · fbb8cd93
  Jeff Daily authored Feb 25, 2021
```
This reverts commit bdd481d1.
```
21 Jan, 2021 1 commit
- use __launch_bounds__ for multi_tensor_apply (#44) · 5baa68d3
  Jeff Daily authored Jan 21, 2021
```
use __launch_bounds__(1024) for multi_tensor_apply, re-enable skipped tests
```
18 Jan, 2021 1 commit
- missing #include <c10/cuda/CUDAGuard.h> · 4ebf2b90
  Jeff Daily authored Jan 18, 2021
  
05 Aug, 2020 1 commit

set device guard for multi tensor optimizer implementations (#927) · 274cc063

ngimel authored Aug 05, 2020

* add device guards to the optimizers

* add untracked file

* set deviceGuard in multi_tensor_apply

* address review comments; fix lamb

* indent

* typo


21 May, 2020 1 commit
- pass all TensorListMetadata as pointer to pinned host memory (#13) · bdd481d1
  Jeff Daily authored May 21, 2020
  
12 May, 2020 1 commit
- Enable support for sparse tensors for multi_tensor_apply (#6) · 02a5274b
  Chaitanya Sri Krishna Lolla authored May 12, 2020
  
27 Feb, 2020 1 commit
- NHWC support for multi tensor apply (#732) · de6378f5
  mcarilli authored Feb 26, 2020
```
* NHWC support for multi tensor apply

* compilation fix for version<=1.4
```
06 Sep, 2019 1 commit

Fix for #456 (#477) · 325f5a0b

mcarilli authored Sep 05, 2019

* Pushing for build tests

* Contrib files

* Removing deprecated checks


03 Jul, 2019 2 commits
- Pulling in deprecation warning changes · 665b2dd7
  Michael Carilli authored Jul 03, 2019
  
- Changing AT_CHECK to TORCH_CHECK · adee29f6
  Michael Carilli authored Jul 03, 2019
  
31 May, 2019 1 commit

Give multi-tensor L2 norm the ability to compute norms per-tensor as well as globally (#333) · 93338e62

mcarilli authored May 31, 2019

* Existing tests passing, still need to add per-tensor tests

* Test is passing, still need to measure performance

* ILP for l2norm functor
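
The relationship between per-tensor and global L2 norms can be sketched in a few lines. This is a plain-Python illustration under the usual definition of the L2 norm, not the CUDA kernel from this commit; the function names are hypothetical.

```python
import math


def per_tensor_l2_norms(tensors):
    """Hypothetical helper: L2 norm of each tensor (given as a list of floats)."""
    return [math.sqrt(sum(x * x for x in tensor)) for tensor in tensors]


def global_l2_norm(tensors):
    """Hypothetical helper: L2 norm over all elements of all tensors.

    Computed from the per-tensor norms: the squared global norm is the
    sum of the squared per-tensor norms.
    """
    return math.sqrt(sum(n * n for n in per_tensor_l2_norms(tensors)))


if __name__ == "__main__":
    tensors = [[3.0, 4.0], [12.0]]
    print(per_tensor_l2_norms(tensors))
    print(global_l2_norm(tensors))
```

Because the global norm follows from the per-tensor norms, a single pass over the data can in principle serve both outputs.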


12 Mar, 2019 1 commit
- Forward/backward compatibility around pytorch 3aeb78, to fix #191 · 42180bd9
  Michael Carilli authored Mar 11, 2019
  
10 Mar, 2019 1 commit
- fix includes · f34686f1
  Natalia Gimelshein authored Mar 09, 2019
  
28 Feb, 2019 1 commit
- Comprehensive tests for cross product of options · d24c25b9
  Michael Carilli authored Feb 27, 2019
  
24 Feb, 2019 1 commit
- Stashing work · d137b800
  Michael Carilli authored Feb 24, 2019
  
22 Feb, 2019 1 commit
- Allow multi-tensor unscale to handle FP16 output, so it can also be used for... · 80a3f3ca
  Michael Carilli authored Feb 21, 2019
```
Allow multi-tensor unscale to handle FP16 output, so it can also be used for copy-scatter. Rename some options.
```
19 Feb, 2019 1 commit
- Reworked multi tensor apply, added tests · 6763a8be
  Michael Carilli authored Feb 18, 2019
  