- 23 Apr, 2023 1 commit
-
-
aspanday authored
* Updating BLOCK_SIZE to 1024. tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 with Adam; there appears to be a bug in that test that still needs to be resolved, so test_bfloat16 for Adam is skipped in the unittest for now. Ran 17 other tests and ALL of them pass. More details on the effects of these changes can be found here: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization. This commit changes BLOCK_SIZE to 1024 ONLY for the optimizer kernels; the L2-norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512, otherwise the allclose check fails.
* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipIfRocm to skip test_bfloat16 for Adam.
* Updating chunk_size to 256*32 (8K), previously 2048*32 (64K), and updating depth_to_max_blocks to 2560 (8x the previous 320); a sketch of these constants follows below. The observed performance improvement is up to 1.4x for large element counts, up to 5.2x for moderate element counts, and up to 1.44x for small element counts. This change only affects the optimizers, specifically when multi_tensor_apply is enabled via the --cuda_ext extension when installing apex. The performance numbers, along with a comparison against Torch, are captured here: https://amdcloud.sharepoint.com/:x:/r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/Grid%20Optimization/Elementwise%20Kernel%20-%20Grid%20Optimization%20-%20Benchmark%20sweep.xlsx?d=wa8bacf65a2904002bf3cad4c57769eff&csf=1&web=1&e=JhLVm8 (see the chunk_opt sheet).
* Updating all files related to L2-norm, since test_fuzz (test_multi_tensor_l2norm.TestMultiTensorL2Norm) failed with the previous commits. Changes in chunk_size appear to affect the reduction kernels, so this commit keeps the unoptimized configuration for L2-norm while optimizing all other kernels used by the optimizers. The change introduces multi_tensor_apply_l2norm, which assumes a chunk_size of 64K, as well as multi_tensor_apply_base.cuh, to be used specifically by the L2-norm kernels.

Co-authored-by: aspanday <aspanday@amd.com>
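The tuning values named above can be summarized in a short sketch. This is not the apex source; the constant names and the num_chunks helper are illustrative, but the values come from the commit message:

```cpp
// Minimal sketch of the tuning constants described above (illustrative names).
constexpr int BLOCK_SIZE          = 1024;      // optimizer kernels (was 512)
constexpr int L2NORM_BLOCK_SIZE   = 512;       // L2-norm kernels keep 512
constexpr int CHUNK_SIZE          = 256 * 32;  // 8K elements per chunk (was 2048 * 32)
constexpr int DEPTH_TO_MAX_BLOCKS = 2560;      // block cap (was 320)

// Each tensor is split into chunks and each chunk becomes one block of work
// for the multi-tensor kernel, so a smaller chunk size yields more blocks.
inline int num_chunks(long long numel) {
  return static_cast<int>((numel + CHUNK_SIZE - 1) / CHUNK_SIZE);
}
```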
-
- 08 Nov, 2022 1 commit
-
-
flyingdown authored
-
- 09 Dec, 2021 1 commit
-
-
Kevin Stephano authored
* Add fused mixed precision LAMB optimizer.
* Fix device usage in the constructor.
* Fix sending param_group tensor state to the device.
* Remove an unneeded device set.
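The device fixes above live in the Python optimizer; purely as an illustration of the principle, and in libtorch C++ for consistency with the other sketches in this log, state created from the parameter itself inherits the parameter's device rather than whatever device was current when the optimizer was constructed:

```cpp
#include <torch/torch.h>

// Illustrative only: zeros_like(param) matches param's device, dtype, and shape,
// so the state tensor never needs to be moved after construction.
torch::Tensor make_exp_avg(const torch::Tensor& param) {
  return torch::zeros_like(param);
}
```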
-
- 01 Sep, 2021 1 commit
-
-
Burc Eryilmaz authored
* Fuse the norm computation into the scale kernel.
* Add the fused norm to dlamb.

Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
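A hypothetical CUDA sketch of the fusion idea: the scaling pass also accumulates a partial squared norm, so the norm needed by the LAMB-style update does not require a second pass over the gradients. Kernel and parameter names are illustrative, and the block-level reduction is collapsed into a single atomicAdd to keep it short:

```cpp
__global__ void scale_and_partial_sqnorm(float* grad, float scale,
                                         float* sq_norm, int n) {
  float local = 0.f;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    float v = grad[i] * scale;  // the scaling the kernel already performs
    grad[i] = v;
    local += v * v;             // fused: accumulate the squared norm in flight
  }
  atomicAdd(sq_norm, local);    // real code would do a block reduction first
}
```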
-
- 15 Apr, 2021 1 commit
-
-
Sudhakar Singh authored
* Add unit tests for fused NovoGrad.
* Fix: tensors should reside on the same device.
* Fix: the CUDA stream should be obtained for the same device the tensors reside on. Found this while debugging the fused NovoGrad multi-device unit test.
* Fixed the issues mentioned in the review comments.
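A minimal sketch of the stream/device point raised here (the function name is illustrative; the guard and stream calls are standard ATen/c10 APIs): the current stream must belong to the device the tensors live on, which is easiest to guarantee by switching the current device with a guard first.

```cpp
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>

void launch_on_tensor_device(const at::Tensor& t) {
  // Make t's device the current device for the scope of this function.
  const c10::cuda::CUDAGuard guard(t.device());
  // The "current" stream now belongs to t's device, not to device 0 or to
  // whichever device happened to be current in the caller.
  auto stream = at::cuda::getCurrentCUDAStream();
  (void)stream;  // kernels operating on t would be launched on this stream
}
```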
-
- 25 Feb, 2021 1 commit
-
-
Jeff Daily authored
This reverts commit bdd481d1.
-
- 05 Aug, 2020 1 commit
-
-
ngimel authored
* Add device guards to the optimizers.
* Add an untracked file.
* Set a deviceGuard in multi_tensor_apply.
* Address review comments; fix LAMB.
* Fix indentation.
* Fix a typo.
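A sketch of what setting a device guard in a multi_tensor_apply-style entry point looks like (assumed names, not the actual apex signature): the guard is pinned to the device of the first tensor before any kernels are launched.

```cpp
#include <vector>
#include <ATen/ATen.h>
#include <c10/cuda/CUDAGuard.h>

void multi_tensor_op(const std::vector<std::vector<at::Tensor>>& tensor_lists) {
  TORCH_CHECK(!tensor_lists.empty() && !tensor_lists[0].empty(),
              "expected at least one tensor");
  // Guard on the device of the tensors being processed; without it, kernels
  // would launch on whatever device is current, e.g. cuda:0.
  const c10::cuda::OptionalCUDAGuard guard(tensor_lists[0][0].device());
  // ... chunking and kernel launches go here ...
}
```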
-
- 22 Jun, 2020 1 commit
-
-
ashishfarmer authored
-
- 21 May, 2020 1 commit
-
-
Jeff Daily authored
-
- 12 May, 2020 1 commit
-
-
rohithkrn authored
-
- 07 May, 2020 1 commit
-
-
Chaitanya Sri Krishna Lolla authored
* Fix dropout scaling from p to 1/(1-p) (#816); see the sketch after this list. Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
* Improvements to apex.mlp (#804).
* Update the fused bias-ReLU backward kernel.
* Add support for not requiring the first layer's dgrad.
* Fix bug: wrong layer in the requires-grad check.
* Add infrastructure for optional bias and activation; currently only no-bias and no-ReLU are supported.
* Make bias and ReLU optional separately.
* Add a sigmoid activation option.
* Enable wider load/store for the multi_tensor_apply kernels (#763).
* Modify MTA axpby for wider load/store.
* Make the scale/axpby/l2/adam/lamb multi-tensor kernels use wider loads.
* Changes to make xentropy softmax load/store vectorized when possible (#725): increase the default ILP so that each thread handles 16 bytes of data per step; make each thread load/store the longest vector possible; make the unroll case handle adjacent data instead of strided...
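The dropout fix in the first bullet is the standard inverted-dropout scaling; a hypothetical kernel illustrating the corrected factor (names and layout are illustrative, not the apex kernel):

```cpp
// Inverted dropout at train time: kept activations are scaled by 1/(1-p),
// not by p, so no rescaling is needed at inference.
__global__ void apply_dropout_mask(float* x, const float* keep_mask,
                                   float p, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    x[i] = x[i] * keep_mask[i] * (1.0f / (1.0f - p));
  }
}
```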
-
- 30 Apr, 2020 1 commit
-
-
Deyu Fu authored
* Modify MTA axpby for wider load/store.
* Make the scale/axpby/l2/adam/lamb multi-tensor kernels use wider loads.
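A sketch of the wider load/store idea (illustrative kernel, not the apex MTA functor): each thread moves a float4 at a time, quadrupling the bytes per memory instruction, assuming the element count is a multiple of 4 and the pointers are 16-byte aligned.

```cpp
__global__ void axpby_vec4(float* x, const float* y, float a, float b, int n4) {
  // n4 = number of float4 elements (i.e. number of floats / 4)
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n4) {
    float4 xv = reinterpret_cast<float4*>(x)[i];
    const float4 yv = reinterpret_cast<const float4*>(y)[i];
    xv.x = a * xv.x + b * yv.x;
    xv.y = a * xv.y + b * yv.y;
    xv.z = a * xv.z + b * yv.z;
    xv.w = a * xv.w + b * yv.w;
    reinterpret_cast<float4*>(x)[i] = xv;
  }
}
```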
-
- 06 Sep, 2019 1 commit
-
-
mcarilli authored
* Pushing for build tests.
* Contrib files.
* Removing deprecated checks.
-
- 08 Aug, 2019 1 commit
-
-
Deyu Fu authored
-
- 31 May, 2019 1 commit
-
-
mcarilli authored
* Existing tests passing; still need to add per-tensor tests.
* Test is passing; still need to measure performance.
* ILP for the l2norm functor.
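A sketch of the ILP idea for an L2-norm functor (illustrative, with the block-level reduction collapsed into an atomicAdd): each thread keeps several independent partial sums of squares so consecutive loads can be issued back-to-back instead of serializing on one accumulator.

```cpp
constexpr int ILP = 4;

__global__ void l2norm_partial(const float* x, float* sq_norm, int n) {
  float partial[ILP] = {0.f, 0.f, 0.f, 0.f};
  const int stride = blockDim.x * gridDim.x;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += ILP * stride) {
#pragma unroll
    for (int k = 0; k < ILP; ++k) {
      const int j = i + k * stride;
      if (j < n) {
        const float v = x[j];
        partial[k] += v * v;  // independent accumulators -> instruction overlap
      }
    }
  }
  // Real code would do a block reduction; one atomicAdd keeps the sketch short.
  atomicAdd(sq_norm, partial[0] + partial[1] + partial[2] + partial[3]);
}
```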
-
- 26 Apr, 2019 1 commit
-
-
ptrblck authored
* Change .type().scalarType() to .scalar_type(), and at::ScalarType::X to at::kX.
* Revert scalar_type() to type() for AT_DISPATCH_FLOATING_TYPES_AND_HALF.
* Revert scalar_type() to type() in AT_DISPATCH_FLOATING_TYPES.
* Revert scalar_type() to type() for AT_DISPATCH_FLOATING_TYPES_AND_HALF in welford.cu.
* Revert scalar_type() to type() in layer_norm_cuda_kernel.cu.
* Revert at::kType to at::ScalarType::Type.
* Use DISPATCH_FLOAT_AND_HALF to get rid of warnings.
* Add dispatch mechanisms for double+float and double+float+half.
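For context, the dispatch pattern these changes revolve around looks like the following (the macro is ATen's; the kernel and function names are illustrative): the macro instantiates the lambda body once per supported scalar type and binds scalar_t accordingly.

```cpp
#include <ATen/ATen.h>
#include <ATen/Dispatch.h>

template <typename scalar_t>
void scale_cpu(scalar_t* data, int64_t n, double factor) {
  for (int64_t i = 0; i < n; ++i) {
    data[i] = static_cast<scalar_t>(static_cast<double>(data[i]) * factor);
  }
}

void scale_(at::Tensor& t, double factor) {
  // Instantiates the body for float, double, and half, binding scalar_t to each.
  AT_DISPATCH_FLOATING_TYPES_AND_HALF(t.scalar_type(), "scale_", [&] {
    scale_cpu<scalar_t>(t.data_ptr<scalar_t>(), t.numel(), factor);
  });
}
```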
-
- 10 Apr, 2019 2 commits
-
-
Michael Carilli authored
-
Michael Carilli authored
-